smarter_csv 1.15.2 → 1.16.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (48)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +9 -0
  3. data/CHANGELOG.md +68 -1
  4. data/CONTRIBUTORS.md +3 -1
  5. data/Gemfile +1 -0
  6. data/README.md +123 -27
  7. data/docs/_introduction.md +40 -24
  8. data/docs/bad_row_quarantine.md +285 -0
  9. data/docs/basic_read_api.md +151 -9
  10. data/docs/basic_write_api.md +474 -59
  11. data/docs/batch_processing.md +161 -4
  12. data/docs/column_selection.md +183 -0
  13. data/docs/data_transformations.md +162 -29
  14. data/docs/examples.md +339 -46
  15. data/docs/header_transformations.md +93 -12
  16. data/docs/header_validations.md +56 -18
  17. data/docs/history.md +117 -0
  18. data/docs/instrumentation.md +165 -0
  19. data/docs/migrating_from_csv.md +290 -0
  20. data/docs/options.md +150 -87
  21. data/docs/parsing_strategy.md +63 -1
  22. data/docs/real_world_csv.md +262 -0
  23. data/docs/releases/1.16.0/benchmarks.md +223 -0
  24. data/docs/releases/1.16.0/changes.md +272 -0
  25. data/docs/releases/1.16.0/performance_notes.md +114 -0
  26. data/docs/row_col_sep.md +14 -5
  27. data/docs/value_converters.md +193 -57
  28. data/ext/smarter_csv/extconf.rb +3 -0
  29. data/ext/smarter_csv/smarter_csv.c +1007 -71
  30. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
  31. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
  32. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
  33. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
  34. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
  35. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
  36. data/lib/smarter_csv/errors.rb +8 -0
  37. data/lib/smarter_csv/file_io.rb +1 -1
  38. data/lib/smarter_csv/hash_transformations.rb +14 -13
  39. data/lib/smarter_csv/header_transformations.rb +21 -2
  40. data/lib/smarter_csv/headers.rb +2 -1
  41. data/lib/smarter_csv/options.rb +124 -7
  42. data/lib/smarter_csv/parser.rb +362 -75
  43. data/lib/smarter_csv/reader.rb +494 -46
  44. data/lib/smarter_csv/version.rb +1 -1
  45. data/lib/smarter_csv/writer.rb +71 -19
  46. data/lib/smarter_csv.rb +95 -12
  47. data/smarter_csv.gemspec +20 -10
  48. metadata +37 -80
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 41a8d63c5aea4500d77b4268079521194f0d2d34de2b3e5f2264c48181159273
4
- data.tar.gz: 586facc801af166270eebf0ece90949061ccfeaadfa3e7837678cb935e032bcb
3
+ metadata.gz: '082f1c3a20d98a975bc3992ab4dd71d30a3008308f8c8dcbc116f60a2206b824'
4
+ data.tar.gz: 936107710f1588d54b7655e96781a677567d83a302269b1ddf7f2e63d0a31b63
5
5
  SHA512:
6
- metadata.gz: ed4072e64c4e66fb5b982dfaffe49d32370b087aa9a1ff689c2f73bfa6450ae275547bb17818ff227e8843834bcb981a8a906b5e7936bbf999f497e89b2cb91d
7
- data.tar.gz: 31ecb71b2b50e1bb5f2aa037583550eb878f2e1faf66adf0803c8dcdeafbd52b0fa24c3b78bcc9bcdc3a3c759b53667004541257c32799d08b944a4ed53d9b49
6
+ metadata.gz: 212ba64f768895c1250cbae766a6dd91dfe7e83e6ca856b5d94008c8c295bda7a1a1bdaa88e71b25fcacee46fe1400f68fe2574173043e0c1537fa40be90fb9d
7
+ data.tar.gz: 4dfdc5ce2275b3b4a4e82f2abcc45a112a4eb5e90e3324218688317ed72eadde9b59d64ffc9b4bbf108fcd8e36c702172b22600b766762d393831c3404d8a560
data/.rubocop.yml CHANGED
@@ -133,6 +133,9 @@ Style/SoleNestedConditional:
133
133
  Style/SpecialGlobalVars: # DANGER: unsafe rule!!
134
134
  Enabled: false
135
135
 
136
+ Style/StderrPuts:
137
+ Enabled: false # DANGER: unsafe rule!! we DO NOT want warn here
138
+
136
139
  Style/StringConcatenation:
137
140
  Enabled: false
138
141
 
@@ -164,6 +167,12 @@ Style/TrailingUnderscoreVariable:
164
167
  Style/TrivialAccessors:
165
168
  Enabled: false
166
169
 
170
+ Style/WhileUntilModifier:
171
+ Enabled: false
172
+
173
+ Style/WordArray:
174
+ Enabled: false
175
+
167
176
  # Style/UnlessModifier:
168
177
  # Enabled: false
169
178
 
data/CHANGELOG.md CHANGED
@@ -1,9 +1,76 @@
1
1
 
2
2
  # SmarterCSV 1.x Change Log
3
3
 
4
+ ## 1.16.0 (2026-03-12) — Minor Breaking Change
5
+
6
+ [Full details](docs/releases/1.16.0/changes.md) · [Benchmarks](docs/releases/1.16.0/benchmarks.md) · [Performance notes](docs/releases/1.16.0/performance_notes.md)
7
+
8
+ ### Minor Breaking Change
9
+
10
+ New option **`quote_boundary:`**
11
+ * defaults to `:standard`: quotes are now only recognized as field delimiters at field boundaries;
12
+ mid-field quotes are treated as literal characters.
13
+
14
+ This aligns SmarterCSV with RFC 4180 and other CSV libraries. In practice, mid-field quotes
15
+ were already producing silently corrupt output in previous versions, so most users will see
16
+ behavior improve, not regress.
17
+
18
+ * Use `quote_boundary: :legacy` only in exceptional cases to restore previous behavior. See [Parsing Strategy](docs/parsing_strategy.md).
19
+
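To make the `:standard` rule concrete, here is a minimal plain-Ruby sketch of a tokenizer that only treats a quote as a delimiter when it opens a field. It is an illustration of the rule as described above, not SmarterCSV's parser; `split_fields` is a hypothetical helper, and the handling of closing quotes is simplified.

```ruby
# Illustrative sketch only: `split_fields` is a hypothetical helper,
# not part of SmarterCSV. A quote opens a quoted field only when it is
# the first character of that field; elsewhere it is kept as a literal.
def split_fields(line, col_sep: ',', quote: '"')
  fields = []
  field = +''
  in_quotes = false
  at_start = true
  chars = line.chomp.chars
  i = 0
  while i < chars.size
    c = chars[i]
    if in_quotes
      if c == quote
        if chars[i + 1] == quote # doubled quote inside a quoted field
          field << quote
          i += 1
        else
          in_quotes = false      # closing quote
        end
      else
        field << c
      end
    elsif c == quote && at_start
      in_quotes = true           # quote at a field boundary opens the field
    elsif c == col_sep
      fields << field
      field = +''
      at_start = true
      i += 1
      next
    else
      field << c                 # includes mid-field quotes, kept literally
    end
    at_start = false
    i += 1
  end
  fields << field
end

split_fields(%(1,5" floppy,"a,b")) # => ["1", "5\" floppy", "a,b"]
```

The mid-field quote in `5" floppy` survives as data, while the quote that starts the third field still protects the embedded comma.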
20
+ ### Performance
21
+
22
+ * **1.8×–8.6× faster** than Ruby `CSV.read` (raw tokenization only; no post-processing)
23
+ * **7×–129× faster** than Ruby `CSV.table` (nearest equivalent output)
24
+ * **up to 2.4× faster** for accelerated path vs 1.15.2 (15/19 benchmark files faster)
25
+ * **up to 2× faster** for Ruby path vs 1.15.2
26
+ * **9×–65× faster** for accelerated path vs 1.14.4
27
+
28
+ Measured on 19 benchmark files, Apple M1, Ruby 3.4.7. See [benchmarks](docs/releases/1.16.0/benchmarks.md).
29
+
30
+ ### New Read API
31
+
32
+ * **`SmarterCSV.parse(csv_string, options)`**: can now parse a CSV string directly. See [Migrating from Ruby CSV](docs/migrating_from_csv.md).
33
+ * **`SmarterCSV.each` / `Reader#each`**: row-by-row enumerator; `Reader` now includes `Enumerable`.
34
+ * **`SmarterCSV.each_chunk` / `Reader#each_chunk`**: chunked enumerator yielding `(Array<Hash>, chunk_index)`.
35
+
36
+ ### New Options
37
+
38
+ * **`on_bad_row:`** — bad row quarantine: `:skip`, `:collect`, `:raise`, or callable. See [Bad Row Quarantine](docs/bad_row_quarantine.md).
39
+ * **`bad_row_limit: N`** — raises `SmarterCSV::TooManyBadRows` after N bad rows.
40
+ * **`collect_raw_lines:`** (default: `true`) — include raw line in bad-row error records.
41
+ * **`field_size_limit: N`** — cap field size in bytes; prevents DoS from unclosed quotes. Raises `SmarterCSV::FieldSizeLimitExceeded`.
42
+ * **`headers: { only: [...] }` / `headers: { except: [...] }`** — column selection; excluded columns skipped in C hot path. See [Column Selection](docs/column_selection.md).
43
+ * **`nil_values_matching:`** — replaces deprecated `remove_values_matching:`.
44
+ * **`missing_headers:`** (default: `:auto`) — replaces deprecated `strict:`.
45
+ * **`verbose: :quiet/:normal/:debug`** — replaces deprecated `verbose: true/false`.
46
+ * **`on_start:` / `on_chunk:` / `on_complete:`** — instrumentation hooks. See [Instrumentation](docs/instrumentation.md).
47
+
48
+ ### New Write API
49
+
50
+ * **IO/StringIO support**: `SmarterCSV.generate` and `Writer.new` now accept any `IO`-compatible object. See [Write API](docs/basic_write_api.md).
51
+ * **`SmarterCSV.generate` returns a String** when called without a destination argument.
52
+ * **Streaming mode**: when `headers:` or `map_headers:` is provided upfront, Writer skips the temp file and streams directly.
53
+ * **`encoding:` / `write_nil_value:` / `write_empty_value:` / `write_bom:`** — new writer options.
54
+
55
+ ### Deprecations
56
+
57
+ * `remove_values_matching:` → use `nil_values_matching:`
58
+ * `strict:` → use `missing_headers: :raise/:auto`
59
+ * `verbose: true/false` → use `verbose: :debug/:normal`
60
+ * `only_headers:` / `except_headers:` → use `headers: { only: }` / `headers: { except: }`
61
+
62
+ ### Bug Fixes
63
+
64
+ * **Empty headers** ([#324](https://github.com/tilo/smarter_csv/issues/324), [#312](https://github.com/tilo/smarter_csv/issues/312)): empty/whitespace-only header fields now auto-generate names via `missing_header_prefix`.
65
+ * **All library output now goes to `$stderr`** — nothing written to `$stdout`.
66
+ * **`SmarterCSV.generate` raises `ArgumentError`** (not blank `RuntimeError`) when called without a block.
67
+ * **Writer temp file** no longer hardcoded to `/tmp` (fixes Windows); properly cleaned up with `Tempfile#close!`.
68
+ * **Writer `StringIO`**: `finalize` no longer attempts to close a caller-owned `StringIO`.
69
+
70
+
4
71
  ## 1.15.2 (2026-02-20)
5
72
 
6
- * Performance Optimizations
73
+ ### Performance Optimizations
7
74
  - 1.6× to 7.2× faster than CSV.read
8
75
  - 6× to 113× faster than Ruby’s CSV.table
9
76
  - 5.4× to 37.4× faster than SmarterCSV 1.14.4 (with C-acceleration)
data/CONTRIBUTORS.md CHANGED
@@ -1,4 +1,4 @@
1
- # A Big Thank You to all 59 Contributors!!
1
+ # A Big Thank You to all 61 Contributors!!
2
2
 
3
3
 
4
4
  A Big Thank you to everyone who filed issues, sent comments, and who contributed with pull requests:
@@ -62,3 +62,5 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
62
62
  * [Felipe Cabezudo](https://github.com/felipekb)
63
63
  * [Skye Shaw](https://github.com/sshaw)
64
64
  * [Mark Bumiller](https://github.com/makrsmark)
65
+ * [Tophe](https://github.com/tophe)
66
+ * [Dom Lebron](https://github.com/biglebronski)
data/Gemfile CHANGED
@@ -8,6 +8,7 @@ gemspec
8
8
  gem "rake"
9
9
  gem "rake-compiler"
10
10
 
11
+ gem "awesome_print"
11
12
  gem 'pry'
12
13
  gem "rubocop"
13
14
 
data/README.md CHANGED
@@ -3,19 +3,22 @@
3
3
 
4
4
  ![Gem Version](https://img.shields.io/gem/v/smarter_csv) [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [View on RubyGems](https://rubygems.org/gems/smarter_csv) [View on RubyToolbox](https://www.ruby-toolbox.com/search?q=smarter_csv)
5
5
 
6
- SmarterCSV provides a convenient interface for reading and writing CSV files and data.
6
+ SmarterCSV is a high-performance CSV ingestion and generation library for Ruby, focused on fast end-to-end CSV ingestion, not just parsing.
7
7
 
8
- Unlike traditional CSV parsing methods, SmarterCSV focuses on representing the data for each row as a Ruby hash, which lends itself perfectly for direct use with ActiveRecord, Sidekiq, and JSON stores such as S3. For large files it supports processing CSV data in chunks of array-of-hashes, which allows parallel or batch processing of the data.
8
+ If SmarterCSV saved you hours of import time, please star the repo.
9
9
 
10
- Its powerful interface is designed to simplify and optimize the process of handling CSV data, and allows for highly customizable and efficient data processing by enabling the user to easily map CSV headers to Hash keys, skip unwanted rows, and transform data on-the-fly.
10
+ Beyond raw speed, SmarterCSV is designed to provide a significantly more convenient and developer-friendly interface than traditional CSV libraries. Instead of returning raw arrays that require substantial post-processing, SmarterCSV produces Rails-ready hashes for each row, making the data immediately usable with ActiveRecord, Sidekiq pipelines, parallel processing, and JSON-based workflows such as S3.
11
11
 
12
- This results in a more readable, maintainable, and performant codebase. Whether you're dealing with large datasets or complex data transformations, SmarterCSV streamlines CSV operations, making it an invaluable tool for developers seeking to enhance their data processing workflows.
12
+ The library includes intelligent defaults, automatic detection of column and row separators, and flexible header/value transformations. These features eliminate much of the boilerplate typically required when working with CSV data and help keep ingestion code concise and maintainable.
13
13
 
14
- When writing CSV data to file, it similarly takes arrays of hashes, and converts them to a CSV file.
14
+ For large files, SmarterCSV supports both chunked processing (arrays of hashes) and streaming via Enumerable APIs, enabling efficient batch jobs and low-memory pipelines. The C acceleration further optimizes the full ingestion path — including parsing, hash construction, and conversions — so performance gains reflect real-world workloads, not just tokenizer benchmarks.
15
15
 
16
- One user wrote:
16
+ The interface is intentionally designed to robustly handle messy real-world CSV while keeping application code clean. Developers can easily map headers, skip unwanted rows, quarantine problematic data, and transform values on the fly without building custom post-processing pipelines. See [Real-World CSV Files](docs/real_world_csv.md) for a comprehensive guide to production CSV patterns.
17
17
 
18
- > *Best gem for CSV for us yet. [...] taking an import process from 7+ hours to about 3 minutes. [...] Smarter CSV was a big part and helped clean up our code ALOT*
18
+ When exporting data, SmarterCSV converts arrays of hashes back into properly formatted CSV, maintaining the same focus on convenience and correctness.
19
+
20
+ **User Testimonial:**
21
+ > "Best gem for CSV for us yet. […] taking an import process from 7+ hours to about 3 minutes. […] SmarterCSV was a big part and helped clean up our code A LOT."
19
22
 
20
23
  ## Performance
21
24
 
@@ -25,19 +28,45 @@ SmarterCSV is designed for **real-world CSV processing**, returning fully usable
25
28
 
26
29
  For a fair comparison, `CSV.table` is the closest Ruby CSV equivalent to SmarterCSV.
27
30
 
28
- | Comparison | Range |
29
- |------------------------------------------|----------------------|
30
- | vs SmarterCSV 1.14.4 (with acceleration) | 5.4× to 37.4x faster |
31
- | vs SmarterCSV 1.14.4 (pure Ruby) | 1.4× to 9.5× faster |
32
- | vs CSV.read (arrays of arrays) | 1.6x to 7.2x faster |
33
- | vs CSV.table (arrays of hashes) | 6× to 113× faster |
34
- | vs ZSV (arrays of hashes) | 1.4× to 6.3× faster |
31
+ | Comparison (SmarterCSV 1.16.0, C-accelerated) | Range |
32
+ |-------------------------------------------------|-------------------------|
33
+ | vs SmarterCSV 1.15.2 (with C acceleration) | up to 2.4× faster |
34
+ | vs SmarterCSV 1.14.4 (with C acceleration) | 9×–65× faster |
35
+ | vs SmarterCSV 1.14.4 (Ruby path) | 1.7×–10. faster |
36
+ | vs CSV.read (arrays of arrays) | 1.7×–8.6× faster |
37
+ | vs CSV.table (arrays of hashes) | 7×–129× faster |
38
+ | vs ZSV (arrays of hashes, equiv. output) | 1.1×–6.6× faster † |
39
+
40
+ † SmarterCSV faster on 15 of 16 files. ZSV raw arrays (no hashes, no conversions) are 2×–14× faster — but that omits the post-processing work needed to produce usable output.
41
+
42
+ _Benchmarks: 19 CSV files (20k–80k rows), Ruby 3.4.7, Apple M1._
43
+
44
+ ![SmarterCSV 1.16.0 vs Ruby CSV 3.3.5 speedup](images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png)
45
+
46
+ ![SmarterCSV 1.16.0 vs previous versions — C-accelerated path](images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg)
47
+
48
+ See [SmarterCSV 1.15.2: Faster Than Raw CSV Arrays](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e) and [PR #319](https://github.com/tilo/smarter_csv/pull/319) for more details.
49
+
50
+
51
+ ## Switching from Ruby CSV?
52
+
53
+ It's a one-line change:
54
+
55
+ ```ruby
56
+ # Before
57
+ rows = CSV.table('data.csv').map(&:to_h)
58
+
59
+ # After — up to 129× faster, same symbol keys
60
+ rows = SmarterCSV.process('data.csv')
61
+ ```
35
62
 
36
- [More details here](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e) and [here](https://github.com/tilo/smarter_csv/pull/319)
63
+ `SmarterCSV.parse(string)` works like `CSV.parse(string, headers: true, header_converters: :symbol)` with numeric conversion included by default:
37
64
 
38
- SmarterCSV also wins 14 of 16 benchmark files head-to-head against ZSV+wrapper (SIMD-accelerated C parser with Ruby wrapper to produce equivalent hash output).
65
+ ```ruby
66
+ data = SmarterCSV.parse(csv_string)
67
+ ```
39
68
 
40
- _Benchmarks: 16 CSV files (43k–80k rows), Ruby 3.4.7, Apple M1. Memory: 39% less allocated, 43% fewer objects. See [CHANGELOG](./CHANGELOG.md) and [PR #319](https://github.com/tilo/smarter_csv/pull/319) for details._
69
+ See [**Migrating from Ruby CSV**](docs/migrating_from_csv.md) for a full comparison of options, behavior differences, and a quick-reference table.
41
70
 
42
71
  ## Examples
43
72
 
@@ -67,6 +96,29 @@ Notice how SmarterCSV automatically (all defaults):
67
96
  - Removes empty values → `remove_empty_values: true`
68
97
  - Preserves Unicode and emoji characters
69
98
 
99
+ ### Header Transformation Pipeline
100
+
101
+ Once the header line is read, SmarterCSV normalizes it through these steps:
102
+
103
+ ```
104
+ comment_regexp → strip_chars_from_headers → split on col_sep → strip quote_char
105
+ → strip_whitespace → [gsub spaces/dashes→_ → downcase_header]
106
+ → disambiguate_headers → symbolize → key_mapping
107
+ ```
108
+
109
+ `user_provided_headers` bypasses all of the above. Each step is individually configurable. See [Header Transformations](docs/header_transformations.md) for the full step-by-step table and options.
110
+
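The header steps above can be approximated in a few lines of plain Ruby. This is a sketch under stated assumptions, not the library's code: `normalize_headers` is a hypothetical helper, the naive `split` ignores separators inside quoted headers, and the numeric suffix used to disambiguate duplicates is an assumption.

```ruby
# Illustrative sketch only: `normalize_headers` is a hypothetical helper.
def normalize_headers(line, col_sep: ',', quote_char: '"', key_mapping: {})
  headers = line.chomp.split(col_sep).map do |h|
    h = h.delete(quote_char).strip     # strip quote_char, then whitespace
    h.gsub(/[\s-]+/, '_').downcase     # spaces/dashes become "_", then downcase
  end

  # Disambiguate duplicates by appending a counter (assumed format).
  seen = Hash.new(0)
  headers = headers.map do |h|
    seen[h] += 1
    seen[h] > 1 ? "#{h}#{seen[h]}" : h
  end

  # Symbolize, then apply the caller's key mapping last.
  headers.map(&:to_sym).map { |k| key_mapping.fetch(k, k) }
end

normalize_headers(%("First Name", Last-Name ,Email,Email\n),
                  key_mapping: { email: :primary_email })
# => [:first_name, :last_name, :primary_email, :email2]
```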
111
+ ### Value Transformation Pipeline
112
+
113
+ After each row is parsed, SmarterCSV applies a transformation pipeline to field values:
114
+
115
+ ```
116
+ strip_whitespace → nil_values_matching → remove_empty_values → remove_zero_values
117
+ → convert_values_to_numeric → value_converters → remove_empty_hashes
118
+ ```
119
+
120
+ Each step is individually configurable. See [Data Transformations](docs/data_transformations.md) and [Value Converters](docs/value_converters.md) for details.
121
+
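As a rough illustration of the order of these steps, the following plain-Ruby sketch applies them to a single row hash. `transform_row` is a hypothetical helper; passing converters as lambdas is a simplification for brevity and is not meant to describe the library's converter interface. `remove_zero_values` and `remove_empty_hashes` are left out.

```ruby
# Illustrative sketch only: `transform_row` is a hypothetical helper.
def transform_row(row, nil_values_matching: nil, value_converters: {})
  out = {}
  row.each do |key, value|
    value = value.strip if value.is_a?(String)          # strip_whitespace
    if nil_values_matching && value.is_a?(String) && value.match?(nil_values_matching)
      value = nil                                       # nil_values_matching
    end
    next if value.nil? || value == ''                   # remove_empty_values

    if value.is_a?(String)                              # convert_values_to_numeric
      if value.match?(/\A[+-]?\d+\z/)
        value = value.to_i
      elsif value.match?(/\A[+-]?\d+\.\d+\z/)
        value = value.to_f
      end
    end

    converter = value_converters[key]                   # value_converters
    value = converter.call(value) if converter
    out[key] = value
  end
  out
end

row = { name: ' Ada ', age: '36', score: '9.5', note: 'N/A', city: '' }
transform_row(row,
              nil_values_matching: /\AN\/A\z/,
              value_converters: { name: ->(v) { v.upcase } })
# => { name: "ADA", age: 36, score: 9.5 }
```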
70
122
  ### Batch Processing:
71
123
 
72
124
  Processing large CSV files in chunks minimizes memory usage and enables powerful workflows:
@@ -86,11 +138,46 @@ end
86
138
 
87
139
  # Parallel processing with Sidekiq
88
140
  SmarterCSV.process(filename, chunk_size: 100) do |chunk|
89
- MyWorker.perform_async(chunk) # each chunk processed in parallel
141
+ Sidekiq::Client.push_bulk('class' => MyWorker, 'args' => chunk.map { |row| [row] }) # push_bulk expects an array of argument arrays: one job per row
90
142
  end
91
143
  ```
92
144
 
93
- See [Examples](docs/examples.md), [Batch Processing](docs/batch_processing.md), and [Configuration Options](docs/options.md) for more.
145
+ ### Modern Enumerator API:
146
+
147
+ `Reader#each` is the modern, idiomatic way to process rows — `Reader` includes `Enumerable`, so all standard Ruby methods work:
148
+
149
+ ```ruby
150
+ reader = SmarterCSV::Reader.new('data.csv', options)
151
+ reader.each { |hash| MyModel.upsert(hash) }
152
+
153
+ # Enumerable methods
154
+ active = reader.select { |h| h[:status] == 'active' }
155
+ names = reader.map { |h| h[:name] }
156
+
157
+ # Lazy — stop early without reading the whole file
158
+ first_ten = reader.lazy.select { |h| h[:active] }.first(10)
159
+
160
+ # Manual batching without chunk_size
161
+ reader.each_slice(500) { |batch| MyModel.insert_all(batch) }
162
+ ```
163
+
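The reason `select`, `map`, `lazy`, and `each_slice` all work is that any class that defines `each` and includes `Enumerable` gets them for free. The sketch below shows the pattern with a hypothetical `TinyReader` over a `StringIO`; it is a stand-in for illustration and makes no claim about how `SmarterCSV::Reader` is implemented.

```ruby
require 'stringio'

# Illustrative sketch only: `TinyReader` is a hypothetical stand-in.
class TinyReader
  include Enumerable

  def initialize(io)
    @io = io
  end

  # Yields one hash per data row; returns an Enumerator without a block.
  def each
    return enum_for(:each) unless block_given?

    @io.rewind # allow repeated iteration over the same IO
    headers = @io.gets.chomp.split(',').map(&:to_sym)
    @io.each_line do |line|
      yield headers.zip(line.chomp.split(',')).to_h
    end
  end
end

reader = TinyReader.new(StringIO.new("name,status\nada,active\nbob,inactive\ncy,active\n"))

active_names = reader.select { |h| h[:status] == 'active' }.map { |h| h[:name] }
first_active = reader.lazy.select { |h| h[:status] == 'active' }.first(1)
batches      = reader.each_slice(2).to_a
```

With `lazy`, iteration stops as soon as the first matching row is found, so the remaining lines are never read.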
164
+ ### Bad Row Handling:
165
+
166
+ SmarterCSV can quarantine malformed rows instead of crashing the entire import:
167
+
168
+ ```ruby
169
+ reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
170
+ good_rows = reader.process
171
+
172
+ puts "#{good_rows.size} imported, #{reader.errors[:bad_rows].size} bad rows"
173
+ reader.errors[:bad_rows].each do |rec|
174
+ puts "Line #{rec[:file_line_number]}: #{rec[:error_message]}"
175
+ end
176
+ ```
177
+
178
+ See [Bad Row Quarantine](docs/bad_row_quarantine.md) for full details including `bad_row_limit` and `field_size_limit`.
179
+
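The collect-instead-of-raise idea can be sketched in plain Ruby as follows. `import_with_quarantine` is a hypothetical helper, the "wrong field count" rule is an assumed definition of a bad row, and the `:raw_line` key is an assumed name; only `:file_line_number` and `:error_message` appear in the example above.

```ruby
# Illustrative sketch only: `import_with_quarantine` is a hypothetical helper.
def import_with_quarantine(text, expected_fields:)
  good = []
  bad = []
  text.each_line.with_index(1) do |line, line_number|
    fields = line.chomp.split(',')
    if fields.size == expected_fields
      good << fields
    else
      # Record the problem and keep going instead of aborting the import.
      bad << {
        file_line_number: line_number,
        raw_line: line.chomp, # assumed key name
        error_message: "expected #{expected_fields} fields, got #{fields.size}"
      }
    end
  end
  [good, bad]
end

good, bad = import_with_quarantine("a,1\nb\nc,3\n", expected_fields: 2)
warn "#{good.size} imported, #{bad.size} bad rows"
```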
180
+ See [13 Examples](docs/examples.md) for more, including value converters, header validation, writing CSV, encoding handling, and resumable Rails ActiveJob imports.
94
181
 
95
182
  ## Requirements
96
183
 
@@ -99,7 +186,7 @@ See [Examples](docs/examples.md), [Batch Processing](docs/batch_processing.md),
99
186
  **C Extension:** SmarterCSV includes a native C extension for accelerated CSV parsing.
100
187
  The C extension is automatically compiled on MRI Ruby. For JRuby and TruffleRuby, SmarterCSV falls back to a pure Ruby implementation.
101
188
 
102
- # Installation
189
+ ## Installation
103
190
 
104
191
  Add this line to your application's Gemfile:
105
192
  ```ruby
@@ -114,31 +201,40 @@ Or install it yourself as:
114
201
  $ gem install smarter_csv
115
202
  ```
116
203
 
117
- # Documentation
204
+ ## Documentation
118
205
 
119
206
  * [Introduction](docs/_introduction.md)
207
+ * [**Migrating from Ruby CSV**](docs/migrating_from_csv.md)
120
208
  * [Parsing Strategy](docs/parsing_strategy.md)
121
209
  * [The Basic Read API](docs/basic_read_api.md)
122
210
  * [The Basic Write API](docs/basic_write_api.md)
123
- * [Batch Processing](./docs/batch_processing.md)
211
+ * [Batch Processing](docs/batch_processing.md)
124
212
  * [Configuration Options](docs/options.md)
125
213
  * [Row and Column Separators](docs/row_col_sep.md)
126
214
  * [Header Transformations](docs/header_transformations.md)
127
215
  * [Header Validations](docs/header_validations.md)
216
+ * [Column Selection](docs/column_selection.md)
128
217
  * [Data Transformations](docs/data_transformations.md)
129
218
  * [Value Converters](docs/value_converters.md)
130
-
131
- # Articles
219
+ * [Bad Row Quarantine](docs/bad_row_quarantine.md)
220
+ * [Instrumentation Hooks](docs/instrumentation.md)
221
+ * [Examples](docs/examples.md)
222
+ * [Real-World CSV Files](docs/real_world_csv.md)
223
+ * [SmarterCSV over the Years](docs/history.md)
224
+ * [Release Notes](docs/releases/1.16.0/changes.md)
225
+
226
+ ## Articles
132
227
  * [Parsing CSV Files in Ruby with SmarterCSV](https://tilo-sloboda.medium.com/parsing-csv-files-in-ruby-with-smartercsv-6ce66fb6cf38)
133
228
  * [CSV Writing with SmarterCSV](https://tilo-sloboda.medium.com/csv-writing-with-smartercsv-26136d47ad0c)
134
229
  * [Processing 1.4 Million CSV Records in Ruby, fast ](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
135
230
  * [Faster Parsing CSV with Parallel Processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing) by [Jack lin](https://github.com/xjlin0/)
136
231
  * The original [Stackoverflow Question](https://stackoverflow.com/questions/7788618/update-mongodb-with-array-from-csv-join-table/7788746#7788746) that inspired SmarterCSV
137
232
  * [The original post](http://www.unixgods.org/Ruby/process_csv_as_hashes.html) for SmarterCSV
233
+ * [SmarterCSV over the Years](docs/history.md) — version timeline and performance journey (9×–65× faster than v1.14.4)
138
234
 
139
235
  # [ChangeLog](./CHANGELOG.md)
140
236
 
141
- # Reporting Bugs / Feature Requests
237
+ ## Reporting Bugs / Feature Requests
142
238
 
143
239
  Please [open an Issue on GitHub](https://github.com/tilo/smarter_csv/issues) if you have feedback, new feature requests, or want to report a bug. Thank you!
144
240
 
@@ -147,10 +243,10 @@ For reporting issues, please:
147
243
  * open a pull-request adding a test that demonstrates the issue
148
244
  * mention your version of SmarterCSV, Ruby, Rails
149
245
 
150
- # [A Special Thanks to all 59 Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
246
+ # [A Special Thanks to all 61 Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
151
247
 
152
248
 
153
- # Contributing
249
+ ## Contributing
154
250
 
155
251
  1. Fork it
156
252
  2. Create your feature branch (`git checkout -b my-new-feature`)
data/docs/_introduction.md CHANGED
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [**Introduction**](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
5
6
  * [Parsing Strategy](./parsing_strategy.md)
6
7
  * [The Basic Read API](./basic_read_api.md)
7
8
  * [The Basic Write API](./basic_write_api.md)
@@ -10,49 +11,64 @@
10
11
  * [Row and Column Separators](./row_col_sep.md)
11
12
  * [Header Transformations](./header_transformations.md)
12
13
  * [Header Validations](./header_validations.md)
14
+ * [Column Selection](./column_selection.md)
13
15
  * [Data Transformations](./data_transformations.md)
14
16
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
17
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
18
+ * [Instrumentation Hooks](./instrumentation.md)
19
+ * [Examples](./examples.md)
20
+ * [Real-World CSV Files](./real_world_csv.md)
21
+ * [SmarterCSV over the Years](./history.md)
22
+ * [Release Notes](./releases/1.16.0/changes.md)
23
+
24
+ --------------
17
25
 
18
26
  # SmarterCSV Introduction
19
27
 
20
- `smarter_csv` is a Ruby Gem for convenient reading and writing of CSV files. It has intelligent defaults, and auto-discovery of column and row separators. It imports CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, kicking-off batch jobs with Sidekiq, parallel processing, or oploading data to S3. Similarly, writing CSV files takes Hashes, or Arrays of Hashes to create a CSV file.
28
+ `smarter_csv` is a Ruby gem for fast & convenient importing and exporting of CSV files. It has intelligent defaults and auto-discovery of column and row separators. Importing returns Rails-ready hashes suitable for direct use with ActiveRecord, Sidekiq, parallel processing, or S3 workflows. Exporting takes hashes or arrays of hashes and writes properly formatted CSV.
21
29
 
22
30
  ## Why another CSV library?
23
31
 
24
- Ruby's original 'csv' library's API is pretty old, and its processing of CSV-files returning an array-of-array format feels unnecessarily 'close to the metal'. Its output is not easy to use - especially not if you need a data hash to create database records, or JSON from it, or pass it to Sidekiq or S3. Another shortcoming is that Ruby's 'csv' library does not have good support for huge CSV-files, e.g. there is no support for batching and/or parallel processing of the CSV-content (e.g. with Sidekiq jobs).
32
+ Ruby's built-in `csv` library is **slow** (up to 129× slower than SmarterCSV for equivalent work), and its API is inconvenient. It returns arrays of arrays, which means your application code must handle column indexing, header normalization, type conversion, and whitespace stripping manually. It also has no built-in support for chunked or parallel processing of large files.
33
+
34
+ ![SmarterCSV 1.16.0 vs Ruby CSV 3.3.5 speedup](../images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png)
25
35
 
26
- When SmarterCSV was envisioned, I needed to do nightly imports of very large data sets that came in CSV format, that needed to be upserted into a database, and because of the sheer volume of data needed to be processed in parallel.
27
- The CSV processing also needed to be robust against variations in the input data.
36
+ SmarterCSV was created to solve exactly these problems: nightly imports of large datasets that needed to be upserted into a database, processed in parallel, and remain robust against real-world variations in input data.
28
37
 
29
38
  ## Benefits of using SmarterCSV
30
39
 
31
- * Improved Robustness:
32
- Typically you have little control over the data quality of CSV files that need to be imported. Because SmarterCSV has intelligent defaults and auto-detection of typical formats, this improves the robustness of your CSV imports without having to manually tweak options.
40
+ * **Performance:**
41
+ SmarterCSV's C extension accelerates the full ingestion pipeline (parsing, hash construction, and value conversions), not just tokenization. Real-world benchmarks against `CSV.table` (the closest equivalent) show 7×–129× faster end-to-end throughput.
42
+
43
+ * **Rails-ready output:**
44
+ Each CSV row is returned as a Ruby hash with symbol keys, numeric conversion, and whitespace stripping applied automatically. No post-processing boilerplate needed — records can be passed directly to `ActiveRecord`, `insert_all`, Sidekiq, message queues, or JSON serializers.
33
45
 
34
- * Easy-to-use Format:
35
- By using a Ruby hash to represent a CSV row, SmarterCSV allows you to directly use this data and insert it into a database, or use it with Sidekiq, S3, message queues, etc
46
+ * **Intelligent defaults and robustness:**
47
+ SmarterCSV auto-detects row and column separators, handles BOMs, strips extra whitespace, and tolerates common real-world inconsistencies, all without manual configuration. This makes imports robust against data you don't fully control, such as user-uploaded files or third-party exports.
36
48
 
37
- * Normalized Headers:
38
- SmarterCSV automatically transforms CSV headers to Ruby symbols, stripping leading or trailing whitespace.
39
- There are many ways to customize the header transformation to your liking. You can re-map CSV headers to hash keys, and you can ignore CSV columns.
49
+ * **Flexible header and value transformations:**
50
+ Headers are automatically downcased, symbolized, and normalized. You can remap or drop columns with `key_mapping`, override headers entirely with `user_provided_headers`, and apply per-field value converters for custom type coercion (dates, booleans, currency, etc.).
40
51
 
41
- * Normalized Data:
42
- SmarterCSV transforms the data in each CSV row automatically, stripping whitespace, converting numerical data into numbers, ignoring nil or empty fields, and more. There are many ways to customize this. You can even add your own value converters.
52
+ * **Batch and streaming processing:**
53
+ `chunk_size` enables memory-efficient batch processing of arbitrarily large files: each chunk is an array of hashes ready for `insert_all`, Sidekiq, or other data sinks. `Reader` includes `Enumerable`, so `Reader#each` gives you lazy evaluation, `each_slice`, `select`, `map`, and more.
43
54
 
44
- * Batch Processing of large CSV files:
45
- Processing large CSV files in chunks, reduces the memory impact and allows for faster / parallel processing.
46
- By adding the option `chunk_size: numeric_value`, you can switch to batch processing. SmarterCSV will then return arrays-of-hashes. This makes parallel processing easy: you can pass whole chunks of data to Sidekiq, bulk-insert into a DB, or pass it to other data sinks.
55
+ * **Bad row quarantine:**
56
+ Malformed rows can be collected or skipped instead of crashing the entire import. `on_bad_row: :collect` lets you inspect and log bad rows after processing completes.
47
57
 
48
58
  ## Additional Features
49
59
 
50
- * Header Validation:
51
- You can validate that a set of hash keys is present in each record after header transformations are applied.
52
- This can help ensure importing data with consistent quality.
60
+ * **Header validation:**
61
+ Use `required_keys` to raise an error before any data rows are processed if expected columns are missing. Works with post-transformation key names, so it's safe to combine with `key_mapping`. See [Header Validations](./header_validations.md).
53
62
 
54
- * Data Validations
55
- (planned feature)
63
+ * **Instrumentation hooks:**
64
+ `on_start`, `on_chunk`, and `on_complete` callbacks give you visibility into import progress — useful for logging, progress bars, and alerting in long-running jobs. See [Instrumentation Hooks](./instrumentation.md).
65
+
66
+ * **Resumable imports:**
67
+ The `chunk_index` parameter pairs naturally with Rails 8.1's `ActiveJob::Continuable` for jobs that can pause and resume mid-import without reprocessing already-completed chunks. See [Examples](./examples.md#example-12-resumable-csv-import-with-rails-activejob-rails-81).
68
+
69
+ * **CSV writing:**
70
+ `SmarterCSV.generate` writes arrays of hashes to CSV, with support for header renaming and value converters on output. See [The Basic Write API](./basic_write_api.md).
56
71
 
57
72
  ---------------
58
- PREVIOUS [README](../README.md) | NEXT: [Parsing Strategy](./parsing_strategy.md)
73
+
74
+ NEXT: [Migrating from Ruby CSV](./migrating_from_csv.md) | UP: [README](../README.md)