smarter_csv 1.15.2 → 1.16.1

Files changed (50)
  1. checksums.yaml +4 -4
  2. data/.rspec +2 -0
  3. data/.rubocop.yml +9 -0
  4. data/CHANGELOG.md +112 -1
  5. data/CONTRIBUTORS.md +4 -1
  6. data/Gemfile +1 -0
  7. data/README.md +129 -27
  8. data/docs/_introduction.md +45 -24
  9. data/docs/bad_row_quarantine.md +342 -0
  10. data/docs/basic_read_api.md +152 -9
  11. data/docs/basic_write_api.md +475 -59
  12. data/docs/batch_processing.md +162 -4
  13. data/docs/column_selection.md +184 -0
  14. data/docs/data_transformations.md +163 -29
  15. data/docs/examples.md +340 -46
  16. data/docs/header_transformations.md +94 -12
  17. data/docs/header_validations.md +57 -18
  18. data/docs/history.md +119 -0
  19. data/docs/instrumentation.md +166 -0
  20. data/docs/migrating_from_csv.md +565 -0
  21. data/docs/options.md +151 -87
  22. data/docs/parsing_strategy.md +64 -1
  23. data/docs/real_world_csv.md +263 -0
  24. data/docs/releases/1.16.0/benchmarks.md +223 -0
  25. data/docs/releases/1.16.0/changes.md +273 -0
  26. data/docs/releases/1.16.0/performance_notes.md +114 -0
  27. data/docs/row_col_sep.md +15 -5
  28. data/docs/ruby_csv_pitfalls.md +514 -0
  29. data/docs/value_converters.md +194 -57
  30. data/ext/smarter_csv/extconf.rb +3 -0
  31. data/ext/smarter_csv/smarter_csv.c +1017 -82
  32. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
  33. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
  34. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
  35. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
  36. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
  37. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
  38. data/lib/smarter_csv/errors.rb +8 -0
  39. data/lib/smarter_csv/file_io.rb +1 -1
  40. data/lib/smarter_csv/hash_transformations.rb +14 -13
  41. data/lib/smarter_csv/header_transformations.rb +21 -2
  42. data/lib/smarter_csv/headers.rb +2 -1
  43. data/lib/smarter_csv/options.rb +124 -7
  44. data/lib/smarter_csv/parser.rb +358 -74
  45. data/lib/smarter_csv/reader.rb +494 -46
  46. data/lib/smarter_csv/version.rb +1 -1
  47. data/lib/smarter_csv/writer.rb +71 -19
  48. data/lib/smarter_csv.rb +134 -13
  49. data/smarter_csv.gemspec +20 -10
  50. metadata +38 -80
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 41a8d63c5aea4500d77b4268079521194f0d2d34de2b3e5f2264c48181159273
4
- data.tar.gz: 586facc801af166270eebf0ece90949061ccfeaadfa3e7837678cb935e032bcb
3
+ metadata.gz: 043745aedb1c63fd4a044b9ae46bb8e5d98324c14e609214ee3d895acfd5f501
4
+ data.tar.gz: c39a10521b767daf51887278c9020c9ff6d8d93c32c5ec3f95a17ec575ebdab5
5
5
  SHA512:
6
- metadata.gz: ed4072e64c4e66fb5b982dfaffe49d32370b087aa9a1ff689c2f73bfa6450ae275547bb17818ff227e8843834bcb981a8a906b5e7936bbf999f497e89b2cb91d
7
- data.tar.gz: 31ecb71b2b50e1bb5f2aa037583550eb878f2e1faf66adf0803c8dcdeafbd52b0fa24c3b78bcc9bcdc3a3c759b53667004541257c32799d08b944a4ed53d9b49
6
+ metadata.gz: 5f1d125138443f02e0276e964dac9e584b996de6acafe8b3856316852a38220094e69a2c14302f922bcc93b6d23cc594bbf926940ccd70a2bd65ab08c5a18b49
7
+ data.tar.gz: '0929051996781c8643c0239556d123c840e7041d9c12f7d867e3800dfb2c2eb92e6f6fb77b5fc08660a36f34f470cb826f255943fa445d3e29d231e648da51b4'
data/.rspec CHANGED
@@ -1 +1,3 @@
1
1
  --require spec_helper
2
+ --color
3
+ --format documentation
data/.rubocop.yml CHANGED
@@ -133,6 +133,9 @@ Style/SoleNestedConditional:
133
133
  Style/SpecialGlobalVars: # DANGER: unsafe rule!!
134
134
  Enabled: false
135
135
 
136
+ Style/StderrPuts:
137
+ Enabled: false # DANGER: unsafe rule!! we DO NOT want warn here
138
+
136
139
  Style/StringConcatenation:
137
140
  Enabled: false
138
141
 
@@ -164,6 +167,12 @@ Style/TrailingUnderscoreVariable:
164
167
  Style/TrivialAccessors:
165
168
  Enabled: false
166
169
 
170
+ Style/WhileUntilModifier:
171
+ Enabled: false
172
+
173
+ Style/WordArray:
174
+ Enabled: false
175
+
167
176
  # Style/UnlessModifier:
168
177
  # Enabled: false
169
178
 
data/CHANGELOG.md CHANGED
@@ -1,9 +1,120 @@
1
1
 
2
2
  # SmarterCSV 1.x Change Log
3
3
 
4
+ ## 1.16.1 (2026-03-16) — Bug Fixes & New Features
5
+
6
+ RSpec tests: **1,247 → 1,410** (+163 tests)
7
+
8
+ ### New Features
9
+
10
+ * **`SmarterCSV.errors`** — class-level error access after any `process`, `parse`, `each`, or `each_chunk` call.
11
+ Exposes the same `reader.errors` hash without requiring access to the `Reader` instance.
12
+ Errors are cleared at the start of each call and stored per-thread (safe in Puma/Sidekiq).
13
+
14
+ ```ruby
15
+ # Previously — required Reader instance to access errors
16
+ reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :skip)
17
+ reader.process
18
+ puts reader.errors[:bad_row_count]
19
+
20
+ # Now — works with the class-level API too
21
+ SmarterCSV.process('data.csv', on_bad_row: :skip)
22
+ puts SmarterCSV.errors[:bad_row_count]
23
+ ```
24
+
25
+ > **Note:** `SmarterCSV.errors` only surfaces errors from the **most recent run on the
26
+ > current thread**. In a multi-threaded environment (Puma, Sidekiq), each thread maintains
27
+ > its own error state independently. If you call `SmarterCSV.process` twice in the same
28
+ > thread, the second call's errors replace the first's. For long-running or complex
29
+ > pipelines where you need to aggregate errors across multiple files, use the Reader API.
30
+ >
31
+ > ⚠️ **Fibers:** `SmarterCSV.errors` uses `Thread.current` for storage, which is **shared
32
+ > across all fibers running in the same thread**. If you process CSV files concurrently
33
+ > in fibers (e.g. with `Async`, `Falcon`, or manual `Fiber` scheduling), `SmarterCSV.errors`
34
+ > may return stale or wrong results. **Use `SmarterCSV::Reader` directly** — errors are
35
+ > scoped to the reader instance and are always correct regardless of fiber context.
36
+
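The fiber caveat above depends on which storage mechanism is used. The following is a minimal stdlib-only sketch, not SmarterCSV code, and the key names are hypothetical. It shows the two options Ruby offers: `Thread.current[:key]` is fiber-local, while `Thread#thread_variable_set` / `Thread#thread_variable_get` store a value that every fiber on the same thread sees. Which of the two SmarterCSV uses internally is not visible in this part of the diff; the note above describes the second, shared behavior.

```ruby
# Stdlib-only illustration; key names are made up for this example.
Thread.current[:fiber_local_errors] = { bad_row_count: 1 }                      # fiber-local storage
Thread.current.thread_variable_set(:thread_wide_errors, { bad_row_count: 1 })   # shared by all fibers of this thread

# Read both values from inside a new fiber running on the same thread.
seen_in_fiber = Fiber.new do
  {
    fiber_local: Thread.current[:fiber_local_errors],
    thread_wide: Thread.current.thread_variable_get(:thread_wide_errors)
  }
end.resume

puts "fiber-local value seen in new fiber: #{seen_in_fiber[:fiber_local].inspect}"
puts "thread-wide value seen in new fiber: #{seen_in_fiber[:thread_wide].inspect}"
```

State held in thread-wide variables can be overwritten by another fiber between a `process` call and the later read of its errors, which is why a per-instance `reader.errors` is the safer choice under fiber schedulers.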
37
+ ### Bug Fixes
38
+
39
+ * fixed [#325](https://github.com/tilo/smarter_csv/issues/325): `col_sep` in quoted headers was handled incorrectly; Thanks to Paho Lurie-Gregg.
40
+ * fixed issue with quoted numeric fields that were not converted to numeric
41
+
42
+ ### Tests
43
+
44
+ * Added 163 tests covering new features and corner cases
45
+
46
+ ## 1.16.0 (2026-03-12) — Minor Breaking Change
47
+
48
+ [Full details](docs/releases/1.16.0/changes.md) · [Benchmarks](docs/releases/1.16.0/benchmarks.md) · [Performance notes](docs/releases/1.16.0/performance_notes.md)
49
+
50
+ RSpec tests: **714 → 1,247** (+533 tests)
51
+
52
+ ### Minor Breaking Change
53
+
54
+ New option **`quote_boundary:`**
55
+ * defaults to `:standard`: quotes are now only recognized as field delimiters at field boundaries;
56
+ mid-field quotes are treated as literal characters.
57
+
58
+ This aligns SmarterCSV with RFC 4180 and other CSV libraries. In practice, mid-field quotes
59
+ were already producing silently corrupt output in previous versions, so most users will see
60
+ behavior improve, not regress.
61
+
62
+ * Use `quote_boundary: :legacy` only in exceptional cases to restore previous behavior. See [Parsing Strategy](docs/parsing_strategy.md).
63
+
64
+ ### Performance
65
+
66
+ * **1.8×–8.6× faster** than Ruby `CSV.read` (raw tokenization only; no post-processing)
67
+ * **7×–129× faster** than Ruby `CSV.table` (nearest equivalent output)
68
+ * **up to 2.4× faster** for accelerated path vs 1.15.2 (15/19 benchmark files faster)
69
+ * **up to 2× faster** for Ruby path vs 1.15.2
70
+ * **9×–65× faster** for accelerated path vs 1.14.4
71
+
72
+ Measured on 19 benchmark files, Apple M1, Ruby 3.4.7. See [benchmarks](docs/releases/1.16.0/benchmarks.md).
73
+
74
+ ### New Read API
75
+
76
+ * **`SmarterCSV.parse(csv_string, options)`**: can now parse a CSV string directly. See [Migrating from Ruby CSV](docs/migrating_from_csv.md).
77
+ * **`SmarterCSV.each` / `Reader#each`**: row-by-row enumerator; `Reader` now includes `Enumerable`.
78
+ * **`SmarterCSV.each_chunk` / `Reader#each_chunk`**: chunked enumerator yielding `(Array<Hash>, chunk_index)`.
79
+
80
+ ### New Options
81
+
82
+ * **`on_bad_row:`** — bad row quarantine: `:skip`, `:collect`, `:raise`, or callable. See [Bad Row Quarantine](docs/bad_row_quarantine.md).
83
+ * **`bad_row_limit: N`** — raises `SmarterCSV::TooManyBadRows` after N bad rows.
84
+ * **`collect_raw_lines:`** (default: `true`) — include raw line in bad-row error records.
85
+ * **`field_size_limit: N`** — cap field size in bytes; prevents DoS from unclosed quotes. Raises `SmarterCSV::FieldSizeLimitExceeded`.
86
+ * **`headers: { only: [...] }` / `headers: { except: [...] }`** — column selection; excluded columns skipped in C hot path. See [Column Selection](docs/column_selection.md).
87
+ * **`nil_values_matching:`** — replaces deprecated `remove_values_matching:`.
88
+ * **`missing_headers:`** (default: `:auto`) — replaces deprecated `strict:`.
89
+ * **`verbose: :quiet/:normal/:debug`** — replaces deprecated `verbose: true/false`.
90
+ * **`on_start:` / `on_chunk:` / `on_complete:`** — instrumentation hooks. See [Instrumentation](docs/instrumentation.md).
91
+
92
+ ### New Write API
93
+
94
+ * **IO/StringIO support**: `SmarterCSV.generate` and `Writer.new` now accept any `IO`-compatible object. See [Write API](docs/basic_write_api.md).
95
+ * **`SmarterCSV.generate` returns a String** when called without a destination argument.
96
+ * **Streaming mode**: when `headers:` or `map_headers:` is provided upfront, Writer skips the temp file and streams directly.
97
+ * **`encoding:` / `write_nil_value:` / `write_empty_value:` / `write_bom:`** — new writer options.
98
+
99
+ ### Deprecations
100
+
101
+ * `remove_values_matching:` → use `nil_values_matching:`
102
+ * `strict:` → use `missing_headers: :raise/:auto`
103
+ * `verbose: true/false` → use `verbose: :debug/:normal`
104
+ * `only_headers:` / `except_headers:` → use `headers: { only: }` / `headers: { except: }`
105
+
106
+ ### Bug Fixes
107
+
108
+ * **Empty headers** ([#324](https://github.com/tilo/smarter_csv/issues/324), [#312](https://github.com/tilo/smarter_csv/issues/312)): empty/whitespace-only header fields now auto-generate names via `missing_header_prefix`.
109
+ * **All library output now goes to `$stderr`** — nothing written to `$stdout`.
110
+ * **`SmarterCSV.generate` raises `ArgumentError`** (not blank `RuntimeError`) when called without a block.
111
+ * **Writer temp file** no longer hardcoded to `/tmp` (fixes Windows); properly cleaned up with `Tempfile#close!`.
112
+ * **Writer `StringIO`**: `finalize` no longer attempts to close a caller-owned `StringIO`.
113
+
114
+
4
115
  ## 1.15.2 (2026-02-20)
5
116
 
6
- * Performance Optimizations
117
+ ### Performance Optimizations
7
118
  - 1.6× to 7.2× faster than CSV.read
8
119
  - 6× to 113× faster than Ruby’s CSV.table
9
120
  - 5.4× to 37.4× faster than SmarterCSV 1.14.4 (with C-acceleration)
data/CONTRIBUTORS.md CHANGED
@@ -1,4 +1,4 @@
1
- # A Big Thank You to all 59 Contributors!!
1
+ # A Big Thank You to all 61 Contributors!!
2
2
 
3
3
 
4
4
  A Big Thank you to everyone who filed issues, sent comments, and who contributed with pull requests:
@@ -62,3 +62,6 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
62
62
  * [Felipe Cabezudo](https://github.com/felipekb)
63
63
  * [Skye Shaw](https://github.com/sshaw)
64
64
  * [Mark Bumiller](https://github.com/makrsmark)
65
+ * [Tophe](https://github.com/tophe)
66
+ * [Dom Lebron](https://github.com/biglebronski)
67
+ * [Paho Lurie-Gregg](https://github.com/paholg)
data/Gemfile CHANGED
@@ -8,6 +8,7 @@ gemspec
8
8
  gem "rake"
9
9
  gem "rake-compiler"
10
10
 
11
+ gem "awesome_print"
11
12
  gem 'pry'
12
13
  gem "rubocop"
13
14
 
data/README.md CHANGED
@@ -3,19 +3,27 @@
3
3
 
4
4
  ![Gem Version](https://img.shields.io/gem/v/smarter_csv) [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [View on RubyGems](https://rubygems.org/gems/smarter_csv) [View on RubyToolbox](https://www.ruby-toolbox.com/search?q=smarter_csv)
5
5
 
6
- SmarterCSV provides a convenient interface for reading and writing CSV files and data.
6
+ SmarterCSV is a high-performance CSV ingestion and generation library for Ruby, focused on fast end-to-end CSV ingestion of real-world data: no silent failures, no surprises, not just tokenization.
7
7
 
8
- Unlike traditional CSV parsing methods, SmarterCSV focuses on representing the data for each row as a Ruby hash, which lends itself perfectly for direct use with ActiveRecord, Sidekiq, and JSON stores such as S3. For large files it supports processing CSV data in chunks of array-of-hashes, which allows parallel or batch processing of the data.
8
+ If SmarterCSV saved you hours of import time, please star the repo, and consider sponsoring this project.
9
9
 
10
- Its powerful interface is designed to simplify and optimize the process of handling CSV data, and allows for highly customizable and efficient data processing by enabling the user to easily map CSV headers to Hash keys, skip unwanted rows, and transform data on-the-fly.
10
+ Ruby's built-in CSV library has 10 documented failure modes that can silently corrupt or lose data: duplicate headers, blank header cells, extra columns, BOMs, whitespace, encoding issues, and more, all without raising an exception.
11
+ SmarterCSV handles 8 out of 10 by default, and the remaining 2 with a single option each.
11
12
 
12
- This results in a more readable, maintainable, and performant codebase. Whether you're dealing with large datasets or complex data transformations, SmarterCSV streamlines CSV operations, making it an invaluable tool for developers seeking to enhance their data processing workflows.
13
+ > See [**Ruby CSV Pitfalls**](docs/ruby_csv_pitfalls.md) for 10 ways `CSV.read` silently corrupts or loses data, and how SmarterCSV handles them.
13
14
 
14
- When writing CSV data to file, it similarly takes arrays of hashes, and converts them to a CSV file.
15
+ Beyond raw speed, SmarterCSV is designed to provide a significantly more convenient and developer-friendly interface than traditional CSV libraries. Instead of returning raw arrays that require substantial post-processing, SmarterCSV produces Rails-ready hashes for each row, making the data immediately usable with ActiveRecord, Sidekiq pipelines, parallel processing, and JSON-based workflows such as S3.
15
16
 
16
- One user wrote:
17
+ The library includes intelligent defaults, automatic detection of column and row separators, and flexible header/value transformations. These features eliminate much of the boilerplate typically required when working with CSV data and help keep ingestion code concise and maintainable.
17
18
 
18
- > *Best gem for CSV for us yet. [...] taking an import process from 7+ hours to about 3 minutes. [...] Smarter CSV was a big part and helped clean up our code ALOT*
19
+ For large files, SmarterCSV supports both chunked processing (arrays of hashes) and streaming via Enumerable APIs, enabling efficient batch jobs and low-memory pipelines. The C acceleration further optimizes the full ingestion path (parsing, hash construction, and conversions), so performance gains reflect real-world workloads, not just tokenizer benchmarks.
20
+
21
+ The interface is intentionally designed to robustly handle messy real-world CSV while keeping application code clean. Developers can easily map headers, skip unwanted rows, quarantine problematic data, and transform values on the fly without building custom post-processing pipelines. See [Real-World CSV Files](docs/real_world_csv.md) for a comprehensive guide to production CSV patterns.
22
+
23
+ When exporting data, SmarterCSV converts arrays of hashes back into properly formatted CSV, maintaining the same focus on convenience and correctness.
24
+
25
+ **User Testimonial:**
26
+ > "Best gem for CSV for us yet. […] taking an import process from 7+ hours to about 3 minutes. […] SmarterCSV was a big part and helped clean up our code A LOT."
19
27
 
20
28
  ## Performance
21
29
 
@@ -25,19 +33,45 @@ SmarterCSV is designed for **real-world CSV processing**, returning fully usable
25
33
 
26
34
  For a fair comparison, `CSV.table` is the closest Ruby CSV equivalent to SmarterCSV.
27
35
 
28
- | Comparison | Range |
29
- |------------------------------------------|----------------------|
30
- | vs SmarterCSV 1.14.4 (with acceleration) | 5.4× to 37.4x faster |
31
- | vs SmarterCSV 1.14.4 (pure Ruby) | 1.4× to 9.5× faster |
32
- | vs CSV.read (arrays of arrays) | 1.6x to 7.2x faster |
33
- | vs CSV.table (arrays of hashes) | 6× to 113× faster |
34
- | vs ZSV (arrays of hashes) | 1.4× to 6.3× faster |
36
+ | Comparison (SmarterCSV 1.16.0, C-accelerated) | Range |
37
+ |-------------------------------------------------|-------------------------|
38
+ | vs SmarterCSV 1.15.2 (with C acceleration) | up to 2.4× faster |
39
+ | vs SmarterCSV 1.14.4 (with C acceleration) | 9×–65× faster |
40
+ | vs SmarterCSV 1.14.4 (Ruby path) | 1.7×–10. faster |
41
+ | vs CSV.read (arrays of arrays) | 1.7×–8.6× faster |
42
+ | vs CSV.table (arrays of hashes) | 7×–129× faster |
43
+ | vs ZSV (arrays of hashes, equiv. output) | 1.1×–6.6× faster † |
44
+
45
+ † SmarterCSV faster on 15 of 16 files. ZSV raw arrays (no hashes, no conversions) are 2×–14× faster — but that omits the post-processing work needed to produce usable output.
46
+
47
+ _Benchmarks: 19 CSV files (20k–80k rows), Ruby 3.4.7, Apple M1._
48
+
49
+ ![SmarterCSV 1.16.0 vs Ruby CSV 3.3.5 speedup](images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png)
35
50
 
36
- [More details here](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e) and [here](https://github.com/tilo/smarter_csv/pull/319)
51
+ ![SmarterCSV 1.16.0 vs previous versions — C-accelerated path](images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg)
37
52
 
38
- SmarterCSV also wins 14 of 16 benchmark files head-to-head against ZSV+wrapper (SIMD-accelerated C parser with Ruby wrapper to produce equivalent hash output).
53
+ See [SmarterCSV 1.15.2: Faster Than Raw CSV Arrays](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e) and [PR #319](https://github.com/tilo/smarter_csv/pull/319) for more details.
39
54
 
40
- _Benchmarks: 16 CSV files (43k–80k rows), Ruby 3.4.7, Apple M1. Memory: 39% less allocated, 43% fewer objects. See [CHANGELOG](./CHANGELOG.md) and [PR #319](https://github.com/tilo/smarter_csv/pull/319) for details._
55
+
56
+ ## Switching from Ruby CSV?
57
+
58
+ It's a one-line change:
59
+
60
+ ```ruby
61
+ # Before
62
+ rows = CSV.table('data.csv').map(&:to_h)
63
+
64
+ # After — up to 129× faster, same symbol keys
65
+ rows = SmarterCSV.process('data.csv')
66
+ ```
67
+
68
+ `SmarterCSV.parse(string)` works like `CSV.parse(string, headers: true, header_converters: :symbol)` — with numeric conversion included by default:
69
+
70
+ ```ruby
71
+ data = SmarterCSV.parse(csv_string)
72
+ ```
73
+
74
+ * See [**Migrating from Ruby CSV**](docs/migrating_from_csv.md) for a full comparison of options, behavior differences, and a quick-reference table.
41
75
 
42
76
  ## Examples
43
77
 
@@ -67,6 +101,29 @@ Notice how SmarterCSV automatically (all defaults):
67
101
  - Removes empty values → `remove_empty_values: true`
68
102
  - Preserves Unicode and emoji characters
69
103
 
104
+ ### Header Transformation Pipeline
105
+
106
+ Once the header line is read, SmarterCSV normalizes it through these steps:
107
+
108
+ ```
109
+ comment_regexp → strip_chars_from_headers → split on col_sep → strip quote_char
110
+ → strip_whitespace → [gsub spaces/dashes→_ → downcase_header]
111
+ → disambiguate_headers → symbolize → key_mapping
112
+ ```
113
+
114
+ `user_provided_headers` bypasses all of the above. Each step is individually configurable. See [Header Transformations](docs/header_transformations.md) for the full step-by-step table and options.
115
+
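The header steps listed above can be approximated in plain Ruby. This is an illustrative sketch only, not SmarterCSV's implementation: `normalize_headers` is a hypothetical helper, the comment and `strip_chars_from_headers` steps are left out, the duplicate suffix format is an assumption, and the naive `split` does not handle a separator inside a quoted header (the case fixed in #325).

```ruby
# Illustrative only: plain Ruby, not SmarterCSV internals.
# normalize_headers is a hypothetical helper name.
def normalize_headers(line, col_sep: ',', quote_char: '"', key_mapping: {})
  # split on col_sep, strip quote_char, strip whitespace,
  # replace spaces/dashes with underscores, downcase
  names = line.chomp.split(col_sep).map do |header|
    header.delete(quote_char).strip.gsub(/[\s-]+/, '_').downcase
  end

  # disambiguate duplicate names by appending a counter (suffix format assumed)
  seen = Hash.new(0)
  names = names.map do |name|
    seen[name] += 1
    seen[name] > 1 ? "#{name}#{seen[name]}" : name
  end

  # symbolize, then apply the key mapping
  names.map { |name| key_mapping.fetch(name.to_sym, name.to_sym) }
end

headers = normalize_headers(
  %("First Name", Last-Name ,Email,Email\n),
  key_mapping: { email: :primary_email }
)
puts headers.inspect
```

Running the mapping last means `key_mapping` keys refer to the already normalized names, which matches the order shown in the pipeline.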
116
+ ### Value Transformation Pipeline
117
+
118
+ After each row is parsed, SmarterCSV applies a transformation pipeline to field values:
119
+
120
+ ```
121
+ strip_whitespace → nil_values_matching → remove_empty_values → remove_zero_values
122
+ → convert_values_to_numeric → value_converters → remove_empty_hashes
123
+ ```
124
+
125
+ Each step is individually configurable. See [Data Transformations](docs/data_transformations.md) and [Value Converters](docs/value_converters.md) for details.
126
+
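The value pipeline can be sketched the same way. The code below is a plain-Ruby approximation under stated assumptions, not the library's code: `transform_row` is a hypothetical helper, `remove_zero_values` and `remove_empty_hashes` are omitted, and the numeric patterns are simplified.

```ruby
require 'date'

# Illustrative only: plain Ruby, not SmarterCSV internals.
# transform_row is a hypothetical helper name.
def transform_row(row, nil_values_matching: nil, converters: {})
  out = {}
  row.each do |key, value|
    value = value.strip if value.is_a?(String)                      # strip_whitespace
    if nil_values_matching && value.is_a?(String) && value.match?(nil_values_matching)
      value = nil                                                   # nil_values_matching
    end
    next if value.nil? || value == ''                               # remove_empty_values

    # convert_values_to_numeric (simplified integer / decimal patterns)
    if value.match?(/\A[+-]?\d+\z/)
      value = value.to_i
    elsif value.match?(/\A[+-]?\d+\.\d+\z/)
      value = value.to_f
    end

    value = converters[key].call(value) if converters.key?(key)     # value_converters
    out[key] = value
  end
  out
end

row = transform_row(
  { name: '  Ada ', age: '36', score: ' 9.5', note: 'N/A', joined: '2024-01-15' },
  nil_values_matching: %r{\AN/A\z},
  converters: { joined: ->(v) { Date.strptime(v, '%Y-%m-%d') } }
)
puts row.inspect
```

Because numeric conversion runs before the per-key converters, a converter receives an already converted number when the field looks numeric, and the original string otherwise.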
70
127
  ### Batch Processing:
71
128
 
72
129
  Processing large CSV files in chunks minimizes memory usage and enables powerful workflows:
@@ -86,11 +143,46 @@ end
86
143
 
87
144
  # Parallel processing with Sidekiq
88
145
  SmarterCSV.process(filename, chunk_size: 100) do |chunk|
89
- MyWorker.perform_async(chunk) # each chunk processed in parallel
146
+ Sidekiq::Client.push_bulk('class' => MyWorker, 'args' => chunk) # each chunk processed in parallel
147
+ end
148
+ ```
149
+
150
+ ### Modern Enumerator API:
151
+
152
+ `Reader#each` is the modern, idiomatic way to process rows — `Reader` includes `Enumerable`, so all standard Ruby methods work:
153
+
154
+ ```ruby
155
+ reader = SmarterCSV::Reader.new('data.csv', options)
156
+ reader.each { |hash| MyModel.upsert(hash) }
157
+
158
+ # Enumerable methods
159
+ active = reader.select { |h| h[:status] == 'active' }
160
+ names = reader.map { |h| h[:name] }
161
+
162
+ # Lazy — stop early without reading the whole file
163
+ first_ten = reader.lazy.select { |h| h[:active] }.first(10)
164
+
165
+ # Manual batching without chunk_size
166
+ reader.each_slice(500) { |batch| MyModel.insert_all(batch) }
167
+ ```
168
+
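Why a single `each` method is enough to unlock `select`, `map`, `lazy`, and `each_slice` can be seen with a small stand-in class. `TinyReader` below is hypothetical and deliberately naive (no quoting, no type conversion); it is not `SmarterCSV::Reader`, it only demonstrates the `Enumerable` contract that the section above relies on.

```ruby
require 'stringio'

# Hypothetical stand-in for a row reader; NOT SmarterCSV::Reader.
class TinyReader
  include Enumerable

  def initialize(io)
    @io = io
  end

  # Enumerable only requires #each; every other method is built on top of it.
  def each
    return enum_for(:each) unless block_given?

    @io.rewind
    headers = @io.gets.chomp.split(',').map(&:to_sym)
    @io.each_line do |line|
      yield headers.zip(line.chomp.split(',')).to_h
    end
  end
end

reader = TinyReader.new(StringIO.new("name,status\nAda,active\nBob,inactive\nCy,active\n"))

active_names = reader.select { |h| h[:status] == 'active' }.map { |h| h[:name] }
first_active = reader.lazy.select { |h| h[:status] == 'active' }.first(1) # stops reading after the first match
batches      = reader.each_slice(2).to_a                                  # manual batching

puts active_names.inspect
puts batches.map(&:size).inspect
```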
169
+ ### Bad Row Handling:
170
+
171
+ SmarterCSV can quarantine malformed rows instead of crashing the entire import:
172
+
173
+ ```ruby
174
+ reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
175
+ good_rows = reader.process
176
+
177
+ puts "#{good_rows.size} imported, #{reader.errors[:bad_rows].size} bad rows"
178
+ reader.errors[:bad_rows].each do |rec|
179
+ puts "Line #{rec[:file_line_number]}: #{rec[:error_message]}"
90
180
  end
91
181
  ```
92
182
 
93
- See [Examples](docs/examples.md), [Batch Processing](docs/batch_processing.md), and [Configuration Options](docs/options.md) for more.
183
+ See [Bad Row Quarantine](docs/bad_row_quarantine.md) for full details including `bad_row_limit` and `field_size_limit`.
184
+
185
+ See [13 Examples](docs/examples.md) for more, including value converters, header validation, writing CSV, encoding handling, and resumable Rails ActiveJob imports.
94
186
 
95
187
  ## Requirements
96
188
 
@@ -99,7 +191,7 @@ See [Examples](docs/examples.md), [Batch Processing](docs/batch_processing.md),
99
191
  **C Extension:** SmarterCSV includes a native C extension for accelerated CSV parsing.
100
192
  The C extension is automatically compiled on MRI Ruby. For JRuby and TruffleRuby, SmarterCSV falls back to a pure Ruby implementation.
101
193
 
102
- # Installation
194
+ ## Installation
103
195
 
104
196
  Add this line to your application's Gemfile:
105
197
  ```ruby
@@ -114,31 +206,41 @@ Or install it yourself as:
114
206
  $ gem install smarter_csv
115
207
  ```
116
208
 
117
- # Documentation
209
+ ## Documentation
118
210
 
119
211
  * [Introduction](docs/_introduction.md)
212
+ * [**Migrating from Ruby CSV**](docs/migrating_from_csv.md)
213
+ * [Ruby CSV Pitfalls](docs/ruby_csv_pitfalls.md)
120
214
  * [Parsing Strategy](docs/parsing_strategy.md)
121
215
  * [The Basic Read API](docs/basic_read_api.md)
122
216
  * [The Basic Write API](docs/basic_write_api.md)
123
- * [Batch Processing](./docs/batch_processing.md)
217
+ * [Batch Processing](docs/batch_processing.md)
124
218
  * [Configuration Options](docs/options.md)
125
219
  * [Row and Column Separators](docs/row_col_sep.md)
126
220
  * [Header Transformations](docs/header_transformations.md)
127
221
  * [Header Validations](docs/header_validations.md)
222
+ * [Column Selection](docs/column_selection.md)
128
223
  * [Data Transformations](docs/data_transformations.md)
129
224
  * [Value Converters](docs/value_converters.md)
130
-
131
- # Articles
225
+ * [Bad Row Quarantine](docs/bad_row_quarantine.md)
226
+ * [Instrumentation Hooks](docs/instrumentation.md)
227
+ * [Examples](docs/examples.md)
228
+ * [Real-World CSV Files](docs/real_world_csv.md)
229
+ * [SmarterCSV over the Years](docs/history.md)
230
+ * [Release Notes](docs/releases/1.16.0/changes.md)
231
+
232
+ ## Articles
132
233
  * [Parsing CSV Files in Ruby with SmarterCSV](https://tilo-sloboda.medium.com/parsing-csv-files-in-ruby-with-smartercsv-6ce66fb6cf38)
133
234
  * [CSV Writing with SmarterCSV](https://tilo-sloboda.medium.com/csv-writing-with-smartercsv-26136d47ad0c)
134
235
  * [Processing 1.4 Million CSV Records in Ruby, fast ](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
135
236
  * [Faster Parsing CSV with Parallel Processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing) by [Jack lin](https://github.com/xjlin0/)
136
237
  * The original [Stackoverflow Question](https://stackoverflow.com/questions/7788618/update-mongodb-with-array-from-csv-join-table/7788746#7788746) that inspired SmarterCSV
137
238
  * [The original post](http://www.unixgods.org/Ruby/process_csv_as_hashes.html) for SmarterCSV
239
+ * [SmarterCSV over the Years](docs/history.md) — version timeline and performance journey (9×–65× faster than v1.14.4)
138
240
 
139
241
  # [ChangeLog](./CHANGELOG.md)
140
242
 
141
- # Reporting Bugs / Feature Requests
243
+ ## Reporting Bugs / Feature Requests
142
244
 
143
245
  Please [open an Issue on GitHub](https://github.com/tilo/smarter_csv/issues) if you have feedback, new feature requests, or want to report a bug. Thank you!
144
246
 
@@ -147,10 +249,10 @@ For reporting issues, please:
147
249
  * open a pull-request adding a test that demonstrates the issue
148
250
  * mention your version of SmarterCSV, Ruby, Rails
149
251
 
150
- # [A Special Thanks to all 59 Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
252
+ # [A Special Thanks to all 62 Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
151
253
 
152
254
 
153
- # Contributing
255
+ ## Contributing
154
256
 
155
257
  1. Fork it
156
258
  2. Create your feature branch (`git checkout -b my-new-feature`)
data/docs/_introduction.md CHANGED
@@ -2,6 +2,8 @@
2
2
  ### Contents
3
3
 
4
4
  * [**Introduction**](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
5
7
  * [Parsing Strategy](./parsing_strategy.md)
6
8
  * [The Basic Read API](./basic_read_api.md)
7
9
  * [The Basic Write API](./basic_write_api.md)
@@ -10,49 +12,68 @@
10
12
  * [Row and Column Separators](./row_col_sep.md)
11
13
  * [Header Transformations](./header_transformations.md)
12
14
  * [Header Validations](./header_validations.md)
15
+ * [Column Selection](./column_selection.md)
13
16
  * [Data Transformations](./data_transformations.md)
14
17
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
18
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Instrumentation Hooks](./instrumentation.md)
20
+ * [Examples](./examples.md)
21
+ * [Real-World CSV Files](./real_world_csv.md)
22
+ * [SmarterCSV over the Years](./history.md)
23
+ * [Release Notes](./releases/1.16.0/changes.md)
24
+
25
+ --------------
17
26
 
18
27
  # SmarterCSV Introduction
19
28
 
20
- `smarter_csv` is a Ruby Gem for convenient reading and writing of CSV files. It has intelligent defaults, and auto-discovery of column and row separators. It imports CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, kicking-off batch jobs with Sidekiq, parallel processing, or oploading data to S3. Similarly, writing CSV files takes Hashes, or Arrays of Hashes to create a CSV file.
29
+ `smarter_csv` is a Ruby gem for fast & convenient importing and exporting of CSV files. It has intelligent defaults and auto-discovery of column and row separators. Importing returns Rails-ready hashes suitable for direct use with ActiveRecord, Sidekiq, parallel processing, or S3 workflows. Exporting takes hashes or arrays of hashes and writes properly formatted CSV.
21
30
 
22
31
  ## Why another CSV library?
23
32
 
24
- Ruby's original 'csv' library's API is pretty old, and its processing of CSV-files returning an array-of-array format feels unnecessarily 'close to the metal'. Its output is not easy to use - especially not if you need a data hash to create database records, or JSON from it, or pass it to Sidekiq or S3. Another shortcoming is that Ruby's 'csv' library does not have good support for huge CSV-files, e.g. there is no support for batching and/or parallel processing of the CSV-content (e.g. with Sidekiq jobs).
33
+ **Inconvenient.** Ruby's built-in `csv` library returns arrays of arrays, which means your application code must handle column indexing, header normalization, type conversion, and whitespace stripping manually. It also has no built-in support for chunked or parallel processing of large files.
34
+
35
+ **Hidden failure modes.** `CSV.read` has 10 ways to silently corrupt or lose data — no exception, no warning, no log line. Duplicate headers, blank header cells, extra columns, BOMs, whitespace, inconsistent empty-field representation, runaway quoted fields, and encoding issues all fail silently. See [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md) for reproducible examples and the SmarterCSV fix for each.
36
+
37
+ **Slow.** On top of everything else, it is up to 129× slower than SmarterCSV for equivalent end-to-end work.
38
+
39
+ ![SmarterCSV 1.16.0 vs Ruby CSV 3.3.5 speedup](../images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png)
25
40
 
26
- When SmarterCSV was envisioned, I needed to do nightly imports of very large data sets that came in CSV format, that needed to be upserted into a database, and because of the sheer volume of data needed to be processed in parallel.
27
- The CSV processing also needed to be robust against variations in the input data.
41
+ SmarterCSV was created to solve exactly these problems: nightly imports of large datasets that needed to be upserted into a database, processed in parallel, and remain robust against real-world variations in input data.
28
42
 
29
43
  ## Benefits of using SmarterCSV
30
44
 
31
- * Improved Robustness:
32
- Typically you have little control over the data quality of CSV files that need to be imported. Because SmarterCSV has intelligent defaults and auto-detection of typical formats, this improves the robustness of your CSV imports without having to manually tweak options.
45
+ * **Performance:**
46
+ SmarterCSV's C extension accelerates the full ingestion pipeline (parsing, hash construction, and value conversions), not just tokenization. Real-world benchmarks against `CSV.table` (the closest equivalent) show 7×–129× faster end-to-end throughput.
33
47
 
34
- * Easy-to-use Format:
35
- By using a Ruby hash to represent a CSV row, SmarterCSV allows you to directly use this data and insert it into a database, or use it with Sidekiq, S3, message queues, etc
48
+ * **Rails-ready output:**
49
+ Each CSV row is returned as a Ruby hash with symbol keys, numeric conversion, and whitespace stripping applied automatically. No post-processing boilerplate is needed: records can be passed directly to `ActiveRecord`, `insert_all`, Sidekiq, message queues, or JSON serializers.
36
50
 
37
- * Normalized Headers:
38
- SmarterCSV automatically transforms CSV headers to Ruby symbols, stripping leading or trailing whitespace.
39
- There are many ways to customize the header transformation to your liking. You can re-map CSV headers to hash keys, and you can ignore CSV columns.
51
+ * **Intelligent defaults and robustness:**
52
+ SmarterCSV auto-detects row and column separators, handles BOMs, strips extra whitespace, and tolerates common real-world inconsistencies — all without manual configuration. This makes imports robust against data you don't fully control, such as user-uploaded files or third-party exports.
40
53
 
41
- * Normalized Data:
42
- SmarterCSV transforms the data in each CSV row automatically, stripping whitespace, converting numerical data into numbers, ignoring nil or empty fields, and more. There are many ways to customize this. You can even add your own value converters.
54
+ * **Flexible header and value transformations:**
55
+ Headers are automatically downcased, symbolized, and normalized. You can remap or drop columns with `key_mapping`, override headers entirely with `user_provided_headers`, and apply per-field value converters for custom type coercion (dates, booleans, currency, etc.).
43
56
 
44
- * Batch Processing of large CSV files:
45
- Processing large CSV files in chunks, reduces the memory impact and allows for faster / parallel processing.
46
- By adding the option `chunk_size: numeric_value`, you can switch to batch processing. SmarterCSV will then return arrays-of-hashes. This makes parallel processing easy: you can pass whole chunks of data to Sidekiq, bulk-insert into a DB, or pass it to other data sinks.
57
+ * **Batch and streaming processing:**
58
+ `chunk_size` enables memory-efficient batch processing of arbitrarily large files: each chunk is an array of hashes ready for `insert_all`, Sidekiq, or other data sinks. `Reader` includes `Enumerable`, so its `each` method gives you lazy evaluation, `each_slice`, `select`, `map`, and more.
59
+
60
+ * **Bad row quarantine:**
61
+ Malformed rows can be collected or skipped instead of crashing the entire import. `on_bad_row: :collect` lets you inspect and log bad rows after processing completes.
47
62
 
48
63
  ## Additional Features
49
64
 
50
- * Header Validation:
51
- You can validate that a set of hash keys is present in each record after header transformations are applied.
52
- This can help ensure importing data with consistent quality.
65
+ * **Header validation:**
66
+ Use `required_keys` to raise an error before any data rows are processed if expected columns are missing. Works with post-transformation key names, so it's safe to combine with `key_mapping`. See [Header Validations](./header_validations.md).
67
+
68
+ * **Instrumentation hooks:**
69
+ `on_start`, `on_chunk`, and `on_complete` callbacks give you visibility into import progress — useful for logging, progress bars, and alerting in long-running jobs. See [Instrumentation Hooks](./instrumentation.md).
53
70
 
54
- * Data Validations
55
- (planned feature)
71
+ * **Resumable imports:**
72
+ The `chunk_index` parameter pairs naturally with Rails 8.1's `ActiveJob::Continuable` for jobs that can pause and resume mid-import without reprocessing already-completed chunks. See [Examples](./examples.md#example-12-resumable-csv-import-with-rails-activejob-rails-81).
73
+
74
+ * **CSV writing:**
75
+ `SmarterCSV.generate` writes arrays of hashes to CSV, with support for header renaming and value converters on output. See [The Basic Write API](./basic_write_api.md).
56
76
 
57
77
  ---------------
58
- PREVIOUS [README](../README.md) | NEXT: [Parsing Strategy](./parsing_strategy.md)
78
+
79
+ NEXT: [Migrating from Ruby CSV](./migrating_from_csv.md) | UP: [README](../README.md)