smarter_csv 1.17.0.pre5 → 1.17.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 5d2154634f98b9df235995b9c6368e6208027d31da8d6d80ad09a526dd51fbf0
4
- data.tar.gz: 62dd06196aef83b0e2c7dd6391ce9004fe95ba17310eec5f8ae9f2a74c2008a8
3
+ metadata.gz: 702bd7049e83c0beb85f0ca11a122e6f1659eddef6afec66eaf1c37c5b30f43f
4
+ data.tar.gz: dd1915694d041c9b631324de7408f46fc8f426f9e1c60136c35a8f1e754d4590
5
5
  SHA512:
6
- metadata.gz: fdac413102754c859247b876f8a2130ff7dcf700c3531e4c58e72c36e0e53081c0ad7a93654d343428ed7c1c91dd0a35aa90e94877df72dab2b0aa5d9ae0bf65
7
- data.tar.gz: c800d6e4c807ff2c502ac9c7f819de5b6eef582c587b505e1efe2b5ee284a941f809bdb7166d638ac3ee5595763116c72ccfbe76d30b83f83793a53b1f728a6c
6
+ metadata.gz: fa00d07c21cffa711a43ecb4622ad3a09b667f1c1965ad26bee864ada6a3c168076ec04550781c75c4f0acbb28fcab60001278f459cf3065cceaef6820764e30
7
+ data.tar.gz: eab835a356e5343e20a5cc0784ffd9aafa8ab631256d412ec570a0192060b8f3e9c6f619db36e59494364f6c51cda16a9c434c2565ef5a5b6e23f4813a7eaaef
data/.rubocop.yml CHANGED
@@ -13,6 +13,9 @@ Layout/SpaceInsideHashLiteralBraces:
13
13
  Layout/SpaceAroundOperators:
14
14
  Enabled: false
15
15
 
16
+ Lint/ConstantDefinitionInBlock:
17
+ Enabled: false
18
+
16
19
  Lint/UnderscorePrefixedVariableName:
17
20
  Enabled: false
18
21
 
data/CHANGELOG.md CHANGED
@@ -1,9 +1,9 @@
1
1
 
2
2
  # SmarterCSV 1.x Change Log
3
3
 
4
- ## 1.17.0.pre5 (2026-04-28)
4
+ ## 1.17.0 (NOT RELEASED)
5
5
 
6
- RSpec tests: **1,434 → 1,905** (+471 tests)
6
+ RSpec tests: **1,434 → 2,210** (+776 tests)
7
7
 
8
8
  ### New Features
9
9
 
@@ -28,12 +28,33 @@ RSpec tests: **1,434 → 1,905** (+471 tests)
28
28
 
29
29
  * Improved auto-detection of `row_sep` and `col_sep` — giving more accurate results on files with comment headers.
30
30
 
31
- * Default value for `auto_row_sep_chars` changed from `500` to `8192`, providing a larger scan window for accurate row separator detection on files with wide headers or long first lines.
32
- Values below `8192` (and `nil` / `0`) are now rejected and fall back to the default `8192` with a warning message.
33
- This is a change from the previous `nil` / `0` were documented as "scan whole file".
31
+ * Larger scan window for accurate row separator detection on files with wide headers or long first lines.
34
32
 
35
33
  * `guess_line_ending` now scans the input in chunks up to a 64KB hard cap, returning as soon as one separator has a clear majority. Near-tie chunk-boundary artifacts no longer cause spurious warnings; only true ties at the hard cap fall back to `"\n"` and emit a `:no_clear_row_sep` warning at `:error` severity (silent miss-parse risk).
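The majority-vote idea behind `guess_line_ending` can be pictured in a few lines of plain Ruby. This is a toy illustration only, not the gem's code (the real implementation scans in chunks, applies the 64KB cap, and handles near-ties as described above):

```ruby
# Toy majority-vote row-separator detection over a sample string.
# "\r\n" is counted first, then subtracted from the bare "\n" / "\r"
# tallies so a CRLF file doesn't also score for its component characters.
def guess_row_sep(sample)
  crlf = sample.scan("\r\n").size
  counts = {
    "\r\n" => crlf,
    "\n"   => sample.count("\n") - crlf,
    "\r"   => sample.count("\r") - crlf,
  }
  sep, hits = counts.max_by { |_, c| c }
  hits.zero? ? "\n" : sep # nothing found: fall back to "\n"
end

guess_row_sep("a,b\r\nc,d\r\n") # => "\r\n"
guess_row_sep("a,b\rc,d\r")     # => "\r"
```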
36
34
 
35
+ ### New / Changed Options
36
+
37
+ * **`buffer_size` is now a public option** — peek buffer chunk size for non-seekable inputs (pipes, gzip readers, HTTP/S3 bodies). Default `16_384`. Out-of-range values warn and clamp to the supported range rather than raising.
38
+
39
+ * **`auto_row_sep_chars` default changed to `4096`** (was `500` in 1.16.x). Sized to cover wide-header CSVs in a single read. Bump it higher if your files have very wide headers or long comment preambles.
40
+
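The warn-and-clamp behavior for `buffer_size` can be sketched in plain Ruby. The bounds below are made-up placeholders for illustration; the gem's actual supported range is documented in Configuration Options:

```ruby
# Hypothetical bounds for illustration only; not the gem's real limits.
MIN_BUFFER     = 1_024
MAX_BUFFER     = 1_048_576
DEFAULT_BUFFER = 16_384

def effective_buffer_size(requested)
  return DEFAULT_BUFFER if requested.nil?
  clamped = requested.clamp(MIN_BUFFER, MAX_BUFFER)
  # out-of-range values warn and clamp rather than raising
  warn "buffer_size #{requested} out of range; using #{clamped}" if clamped != requested
  clamped
end

effective_buffer_size(nil) # => 16384
effective_buffer_size(64)  # => 1024 (clamped, with a warning)
```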
41
+ ### Bug Fixes
42
+
43
+ * **Files ending in a lone `\r`** are now correctly detected as `\r`-terminated instead of falling through to a "no clear row separator" warning.
44
+
45
+ * **`remove_empty_values` now treats Unicode whitespace as empty** — a field containing only whitespace, including characters like non-breaking space (U+00A0) or ideographic space (U+3000), is now dropped, the same way ActiveSupport's `String#blank?` behaves. Previously only ASCII whitespace counted (and only Rails apps got the Unicode behavior, via `blank?` — an inconsistency that's now gone). Behavior is identical with or without the C extension.
46
+
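The "only Unicode whitespace" test can be expressed with Ruby's POSIX bracket class, the same pattern ActiveSupport's `blank?` uses. A sketch of the rule, not the gem's internal code:

```ruby
# A field counts as empty when it contains only Unicode whitespace.
# /[[:space:]]/ is Unicode-aware in Ruby and matches U+00A0, U+3000, etc.
BLANK_RE = /\A[[:space:]]*\z/

def empty_field?(value)
  value.nil? || value.match?(BLANK_RE)
end

empty_field?("\u00A0\u00A0") # non-breaking spaces => true
empty_field?("\u3000\t ")    # ideographic space   => true
empty_field?(" x ")          # => false
```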
47
+ * **`remove_zero_values` now also removes signed zeros** — `+0`, `-0`, `-0.0`, `+0.00`, etc. are recognized as zero and dropped, just like `0` and `0.0`. (Only applies when `remove_zero_values: true`, which is off by default.)
48
+
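The signed-zero rule amounts to a slightly wider numeric-zero pattern. An illustrative sketch under that reading (not the shipped implementation):

```ruby
# Matches 0, 0.0, +0, -0, -0.00, +0.00 etc.; rejects anything
# non-numeric, so "0x0" or "0.5" are left alone.
ZERO_RE = /\A[+-]?0+(?:\.0+)?\z/

%w[0 0.0 +0 -0 -0.0 +0.00].all? { |v| v.match?(ZERO_RE) } # => true
%w[0.5 00.01 0x0].any?     { |v| v.match?(ZERO_RE) }      # => false
```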
49
+ ### Performance
50
+
51
+ Measured against 1.16.4 (Apple M4, Ruby 3.4.7):
52
+
53
+ * **C-accelerated path (the default):** quote-heavy, large-field, and wide CSVs parse meaningfully faster — roughly **7–22% faster** (city/address-style files ~10–12%; long-field and wide files the most). CSVs with very short lines and many tiny fields are up to ~3% slower — a side effect of the larger default auto-detection scan window (see `auto_row_sep_chars`); set it back to a smaller value if that matters for your workload. Net: solid wins where there's real per-row work, a small cost on the trivially-cheap cases.
54
+ * **Ruby fallback path (`acceleration: false`):** faster on nearly every file — typically **3–20% faster** than 1.16.4, with the biggest gains on wide and many-small-field CSVs.
55
+
56
+ Per-file breakdown: [`docs/releases/1.17.0/performance_notes.md`](docs/releases/1.17.0/performance_notes.md).
57
+
37
58
  ## 1.16.4 (2026-04-21) — Bug Fixes
38
59
 
39
60
  RSpec tests: **1,434 → 1,467** (+33 tests)
data/Gemfile CHANGED
@@ -5,12 +5,17 @@ source 'https://rubygems.org'
5
5
  # Specify your gem's dependencies in smarter_csv.gemspec
6
6
  gemspec
7
7
 
8
- gem "rake"
9
- gem "rake-compiler"
8
+ group :development do
9
+ gem "rake"
10
+ gem "rake-compiler"
11
+ gem "ostruct" # silences rake's stdlib-deprecation warning during dev
12
+ gem "rubocop"
13
+ end
10
14
 
11
- gem "awesome_print"
12
- gem 'pry'
13
- gem "rubocop"
15
+ group :development, :test do
16
+ gem "awesome_print"
17
+ gem "pry" # required in spec_helper.rb; also useful in dev console
18
+ end
14
19
 
15
20
  group :test do
16
21
  gem "rspec"
data/README.md CHANGED
@@ -14,6 +14,8 @@
14
14
 
15
15
  Beyond raw speed, SmarterCSV is designed to provide a significantly more convenient and developer-friendly interface than traditional CSV libraries. Instead of returning raw arrays that require substantial post-processing, SmarterCSV produces Rails-ready hashes for each row, making the data immediately usable with ActiveRecord, Sidekiq pipelines, parallel processing, and JSON-based workflows such as S3.
16
16
 
17
+ In a Rails app, warnings auto-route through `Rails.logger` and instrumentation hooks compose with `ActiveSupport::Notifications` — no setup required. Outside Rails, warnings fall back to `$stderr` and the same APIs work without any framework dependency.
18
+
17
19
  The library includes intelligent defaults, automatic detection of column and row separators, and flexible header/value transformations. These features eliminate much of the boilerplate typically required when working with CSV data and help keep ingestion code concise and maintainable.
18
20
 
19
21
  For large files, SmarterCSV supports both chunked processing (arrays of hashes) and streaming via Enumerable APIs, enabling efficient batch jobs and low-memory pipelines.
@@ -35,22 +37,33 @@ SmarterCSV is designed for **real-world CSV processing**, returning fully usable
35
37
 
36
38
  For a fair comparison, `CSV.table` is the closest Ruby CSV equivalent to SmarterCSV.
37
39
 
38
- | Comparison (SmarterCSV 1.16.0, C-accelerated) | Range |
40
+ | Comparison (SmarterCSV 1.17.0, C-accelerated) | Range |
39
41
  |-------------------------------------------------|-------------------------|
40
- | vs SmarterCSV 1.15.2 (with C acceleration) | up to 2.4× faster |
41
- | vs SmarterCSV 1.14.4 (with C acceleration) | 9×–65× faster |
42
- | vs SmarterCSV 1.14.4 (Ruby path) | 1.7×–10.6× faster |
43
- | vs CSV.read (arrays of arrays) | 1.7×–8.6× faster |
44
- | vs CSV.table (arrays of hashes) | 7×–129× faster |
45
- | vs ZSV (arrays of hashes, equiv. output) | 1.1×–6.6× faster † |
42
+ | vs SmarterCSV 1.15.2 (with C acceleration) | up to 2.8× faster |
43
+ | vs SmarterCSV 1.14.4 (with C acceleration) | 9×–82× faster |
44
+ | vs SmarterCSV 1.14.4 (Ruby path) | 2.4×–19.8× faster |
45
+ | vs CSV.read (arrays of arrays) | 1.3×–7.9× faster |
46
+ | vs CSV.table (arrays of hashes) | 4.9×–132× faster |
47
+ | vs ZSV 1.3.0 (arrays of hashes, equiv. output) | 1.1×–6.6× faster † |
48
+
49
+ † SmarterCSV faster on 15 of 16 files. ZSV raw arrays (no hashes, no conversions) are 2×–14× faster — but that omits the post-processing work needed to produce usable output. ZSV row carried over from the 1.16.0 benchmark; not re-measured for 1.17.0.
50
+
51
+ _Benchmarks: 19 CSV files (20k–240k rows), Ruby 3.4.7, Apple M4._
46
52
 
47
- SmarterCSV faster on 15 of 16 files. ZSV raw arrays (no hashes, no conversions) are 2×–14× faster — but that omits the post-processing work needed to produce usable output.
53
+ > ⁉️ **Why do these numbers look a touch lower than the 1.16.0 charts?**
54
+ > TL;DR: we switched to a more conservative statistical method.
55
+ >
56
+ > Earlier versions of these benchmarks reported the best-of-N sample (the absolute `min` / fastest run) for each measurement. A single lucky run — empty caches lining up, no scheduler interrupts — could shave up to ~10% off and become the headline number, which is misleading.
57
+ > Because of that, we've switched to the 10th percentile (`p10`) of 40 samples per measurement, which discards roughly the four luckiest runs and reports a time much closer to what you'll actually observe in production. On noisier fixtures `p10` is ~5–10% above `min`; on quiet ones it's within 1%. The relative ordering between versions and adapters is unchanged; the absolute speedup figures are simply more honest.
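The statistic itself is simple. A plain-Ruby sketch of `min` vs `p10` over a sample set, using the nearest-rank percentile method (the exact interpolation method used in the benchmarks is an assumption here):

```ruby
# Nearest-rank p10: sort ascending, take index ceil(n/10) - 1.
# Integer arithmetic avoids float-rounding surprises at the boundary.
def p10(samples)
  idx = (samples.size + 9) / 10 - 1
  samples.sort[idx]
end

# One lucky outlier sets the min; p10 ignores the fastest few runs.
times = [9.7, 10.0, 10.1] + Array.new(37) { |i| 10.2 + i * 0.01 }
times.min  # => 9.7
p10(times) # => 10.2
```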
48
58
 
49
- _Benchmarks: 19 CSV files (20k–80k rows), Ruby 3.4.7, Apple M1._
59
+ ### SmarterCSV vs Ruby CSV
60
+ ![SmarterCSV 1.17.0 vs Ruby CSV 3.3.5 speedup](images/SmarterCSV_1.17.0_vs_RubyCSV_3.3.5_speedup.svg)
50
61
 
51
- ![SmarterCSV 1.16.0 vs Ruby CSV 3.3.5 speedup](images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png)
62
+ ### SmarterCSV C Path
63
+ ![SmarterCSV 1.17.0 vs previous versions — C-accelerated path](images/SmarterCSV_1.17.0_vs_previous_C-speedup.svg)
52
64
 
53
- ![SmarterCSV 1.16.0 vs previous versions — C-accelerated path](images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg)
65
+ ### SmarterCSV Ruby Path
66
+ ![SmarterCSV 1.17.0 vs previous versions — Ruby path](images/SmarterCSV_1.17.0_vs_previous_Rb-speedup.svg)
54
67
 
55
68
  See [SmarterCSV 1.15.2: Faster Than Raw CSV Arrays](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e) and [PR #319](https://github.com/tilo/smarter_csv/pull/319) for more details.
56
69
 
@@ -63,7 +76,7 @@ It's a one-line change:
63
76
  # Before
64
77
  rows = CSV.table('data.csv').map(&:to_h)
65
78
 
66
- # After — up to 129× faster, same symbol keys
79
+ # After — up to 132× faster, same symbol keys
67
80
  rows = SmarterCSV.process('data.csv')
68
81
  ```
69
82
 
@@ -126,6 +139,23 @@ strip_whitespace → nil_values_matching → remove_empty_values → remove_zero
126
139
 
127
140
  Each step is individually configurable. See [Data Transformations](docs/data_transformations.md) and [Value Converters](docs/value_converters.md) for details.
128
141
 
142
+ ### Value Converters
143
+
144
+ Per-column lambdas convert raw strings into typed values — dates, currency, booleans:
145
+
146
+ ```ruby
147
+ require 'date'
148
+
149
+ data = SmarterCSV.process('orders.csv',
150
+ value_converters: {
151
+ dob: ->(v) { v && Date.strptime(v, '%m/%d/%Y') },
152
+ price: ->(v) { v&.delete('$,')&.to_f },
153
+ active: ->(v) { v&.match?(/\Atrue\z/i) },
154
+ })
155
+ ```
156
+
157
+ See [Value Converters](docs/value_converters.md).
158
+
129
159
  ### Batch Processing:
130
160
 
131
161
  Processing large CSV files in chunks minimizes memory usage and enables powerful workflows:
@@ -149,6 +179,8 @@ SmarterCSV.process(filename, chunk_size: 100) do |chunk|
149
179
  end
150
180
  ```
151
181
 
182
+ See [Batch Processing](docs/batch_processing.md) for chunk sizing, `each_chunk`, and parallel-worker patterns.
183
+
152
184
  ### Modern Enumerator API:
153
185
 
154
186
  `Reader#each` is the modern, idiomatic way to process rows — `Reader` includes `Enumerable`, so all standard Ruby methods work:
@@ -168,6 +200,29 @@ first_ten = reader.lazy.select { |h| h[:active] }.first(10)
168
200
  reader.each_slice(500) { |batch| MyModel.insert_all(batch) }
169
201
  ```
170
202
 
203
+ See [The Basic Read API](docs/basic_read_api.md) for the full `Reader` interface.
204
+
205
+ ### Streaming / Non-Seekable Inputs (1.17.0+):
206
+
207
+ SmarterCSV reads directly from any IO — no need to materialize the file on disk first. Auto-detection works on streaming inputs without rewinding; the first chunk is buffered transparently.
208
+
209
+ ```ruby
210
+ # Gzipped CSV — stream-decompressed, never written to disk
211
+ require 'zlib'
212
+ Zlib::GzipReader.open('huge.csv.gz') do |io|
213
+ SmarterCSV.process(io) { |chunk| chunk.each { |row| MyModel.upsert(row) } }
214
+ end
215
+
216
+ # STDIN / pipes
217
+ SmarterCSV.process($stdin) { |chunk, _index| ... }
218
+
219
+ # HTTP response body
220
+ require 'open-uri'
221
+ URI.open('https://example.com/data.csv') { |io| SmarterCSV.process(io) }
222
+ ```
223
+
224
+ See [Row and Column Separators](docs/row_col_sep.md) for how `:auto` detection works on non-seekable streams, and [Configuration Options](docs/options.md) for `buffer_size` (the peek-buffer chunk size).
225
+
171
226
  ### Bad Row Handling:
172
227
 
173
228
  SmarterCSV can quarantine malformed rows instead of crashing the entire import:
@@ -184,7 +239,33 @@ end
184
239
 
185
240
  See [Bad Row Quarantine](docs/bad_row_quarantine.md) for full details including `bad_row_limit` and `field_size_limit`.
186
241
 
187
- See [13 Examples](docs/examples.md) for more, including value converters, header validation, writing CSV, encoding handling, and resumable Rails ActiveJob imports.
242
+ ### Header Validation:
243
+
244
+ Raise early if the file is missing required columns, before any data row is processed:
245
+
246
+ ```ruby
247
+ begin
248
+ SmarterCSV.process('transactions.csv',
249
+ required_keys: [:account_id, :amount, :currency])
250
+ rescue SmarterCSV::MissingKeys => e
251
+ abort "CSV missing columns: #{e.keys.join(', ')}"
252
+ end
253
+ ```
254
+
255
+ See [Header Validations](docs/header_validations.md).
256
+
257
+ ### Writing CSV:
258
+
259
+ ```ruby
260
+ SmarterCSV.generate('output.csv') do |csv|
261
+ csv << { name: 'Alice', age: 30, city: 'New York' }
262
+ csv << { name: 'Bob', age: 25, city: 'Chicago' }
263
+ end
264
+ ```
265
+
266
+ Hashes (not arrays) make column-shift bugs impossible — adding a column never silently misaligns existing rows. See [The Basic Write API](docs/basic_write_api.md) for header renaming, value converters, and ordered output.
267
+
268
+ See [18 Examples](docs/examples.md) for more, including encoding and preamble handling, key mapping, instrumentation hooks, and resumable Rails ActiveJob imports.
188
269
 
189
270
  ## Requirements
190
271
 
data/TO_DO.md ADDED
@@ -0,0 +1,109 @@
1
+ # SmarterCSV v2.0 TO DO List
2
+
3
+ DONE:
4
+ [X] Don't call rewind on filehandle
5
+ [X] use Procs for validations and transformations [issue #118](https://github.com/tilo/smarter_csv/issues/118)
6
+ [X] skip file opening, allow reading from CSV string, e.g. reading from S3 file [issue #120](https://github.com/tilo/smarter_csv/issues/120). Or stream large file from S3 (linked in the issue)
7
+ [X] [2.0 BUG] convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
8
+ [X] add enumerable to speed up parallel processing [issue #66](https://github.com/tilo/smarter_csv/issues/66), [issue #32](https://github.com/tilo/smarter_csv/issues/32)
9
+ [X] Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
10
+ [X] Collect all Errors, before surfacing them. Avoid throwing an exception on the first error [issue #133](https://github.com/tilo/smarter_csv/issues/133)
11
+
12
+
13
+ Partially Done:
14
+ [ ] make @errors and @warnings work [issue #118](https://github.com/tilo/smarter_csv/issues/118)
15
+
16
+ Still TO DO:
17
+ [ ] Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)
18
+
19
+ Arguably by design (e.g. exclude these columns from conversion and have them returned as a string)
20
+ [ ] [2.0 BUG] :convert_values_to_numeric_unless_leading_zeros drops leading zeros [issue #151](https://github.com/tilo/smarter_csv/issues/151)
21
+
22
+
23
+ ## Numeric conversion: align the Ruby fallback path with the C path (permissive)
24
+
25
+ Context: `convert_values_to_numeric` runs in two places that currently DISAGREE on edge cases:
26
+ - C path (`acceleration: true`, the default): `ext/smarter_csv/smarter_csv.c#try_numeric_conversion`
27
+ uses `strtol`/`strtod` (base 10; float branch only entered when the field contains a `.`).
28
+ - Ruby fallback (`acceleration: false`): `lib/smarter_csv/hash_transformations.rb` uses the
29
+ strict regex `NUMERIC_REGEX = /\A[+-]?\d+(?:\.\d+)?\z/` plus `to_i` / `to_f`.
30
+
31
+ Divergence (verified empirically):
32
+ | value | C path | Ruby fallback |
33
+ |-----------|------------------|-------------------|
34
+ | ".5" | 0.5 (Float) | ".5" (String) |
35
+ | "3." | 3.0 (Float) | "3." (String) |
36
+ | "1.5e3" | 1500.0 (Float) | "1.5e3" (String) |
37
+ | "1.0e10" | 10000000000.0 | "1.0e10" (String) |
38
+
39
+ Decision: the C path's permissive behavior (corner cases + scientific notation) is the intended
40
+ contract. Fix = make the Ruby fallback match the C path. Do NOT tighten the C path.
41
+
42
+ Ruby-side changes (in `hash_transformations.rb`):
43
+ 1. Swap NUMERIC_REGEX for a permissive one:
44
+ /\A[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?\z/
45
+ matches 1, 1., 1.5, .5, 1e3, 1.5e3, -3.14e-2, etc.; still rejects ".", "e3", "1.2.3",
46
+ "1_000", "0x1F".
47
+ 2. Add `DOT_BYTE = '.'.ord` (46) and include it in the first-byte fast-reject's allowed set
48
+ (the C pre-check already allows a leading `.`; without this, ".5" gets rejected on byte 0).
49
+ 3. Int-vs-float decision: `(v.include?('.') || v.include?('e') || v.include?('E')) ? v.to_f : v.to_i`
50
+ (currently only checks for `.`).
51
+
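Under those three changes, the Ruby-side conversion would look roughly like this — a sketch of the intended fix, not the shipped code:

```ruby
# Permissive numeric regex: accepts 1, 1., 1.5, .5, 1e3, 1.5e3, -3.14e-2;
# still rejects ".", "e3", "1.2.3", "1_000", "0x1F".
PERMISSIVE_NUMERIC = /\A[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?\z/

def convert_numeric(v)
  return v unless v.match?(PERMISSIVE_NUMERIC)
  # float when a dot or exponent marker is present, integer otherwise
  v.match?(/[.eE]/) ? v.to_f : v.to_i
end

convert_numeric(".5")    # => 0.5
convert_numeric("1.5e3") # => 1500.0
convert_numeric("010")   # => 10 (base 10, not octal)
convert_numeric("1_000") # => "1_000" (stays a string)
```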
52
+ Stays a string on BOTH paths (no change needed, but worth characterization tests — there are
53
+ currently NONE):
54
+ - "010" => 10 (NOT octal 8 — both paths use base-10 conversion: String#to_i / strtol(.,10).
55
+ A switch to Kernel#Integer() would break this. Lock it down with a test.)
56
+ - "0x1F", "0b101", "0o17" => string (radix prefixes not honored by base-10 conversion)
57
+ - "1_000" => string (underscores)
58
+ - "1,200.00", "1.300,00" => string (thousands sep / decimal comma — strtod stops at the
59
+ separator → not fully consumed; regex rejects. This is the only safe behavior; "1,200" is
60
+ genuinely ambiguous. Locale-specific number formats are the caller's job via value_converters.)
61
+
62
+ NOT doing: locale sniffing (read LC_NUMERIC at init and adjust the regexes). Rejected because
63
+ the machine locale tells you nothing about the file's number format, it breaks reproducibility
64
+ (same code + same file → different results on a US vs EU box), and `,` can't be both col_sep and
65
+ decimal separator anyway. Note `strtod` IS locale-sensitive (LC_NUMERIC) but it's dormant — Ruby
66
+ runs in the C/POSIX locale; don't deliberately activate it.
67
+
68
+ When done: parity tests (`[true, false].each`) for the now-consistent set (.5, 3., 1.5e3, 1e3)
69
+ plus characterization tests for the stays-a-string set above; CHANGELOG line noting the Ruby
70
+ fallback's numeric conversion now accepts scientific notation and bare-dot forms, matching the
71
+ accelerated path. Behavior change affects `acceleration: false` users only — and aligns them with
72
+ the default.
73
+
74
+
75
+ ## Warn once when the C extension didn't load on a platform that supports it
76
+
77
+ Context: `acceleration: true` is the default. When the C extension fails to build / isn't loaded,
78
+ SmarterCSV silently falls back to the Ruby parser — graceful degradation by design (so the gem
79
+ keeps working for users with broken toolchains, JRuby, TruffleRuby, etc.). Today there is no
80
+ signal to the user that they're not getting the C path; their CSV parsing is just slower than
81
+ they might have expected.
82
+
83
+ Idea: emit a one-time warning when:
84
+ * the C extension is NOT loaded — `!SmarterCSV::Parser.respond_to?(:parse_csv_line_c)`, AND
85
+ * the platform is one where it *should* be available — `RUBY_ENGINE == 'ruby'` (MRI / CRuby).
86
+ JRuby and TruffleRuby don't load CRuby C extensions natively; nothing for the user to do.
87
+
88
+ Where to fire:
89
+ * NOT at `require 'smarter_csv'` time — Rails.logger typically isn't set up yet, so any
90
+ "route through the warnings system" code would just fall through to `Kernel#warn` anyway,
91
+ and the warning would land in stderr instead of the Rails log where ops would see it.
92
+ * At first `Reader.new` / `SmarterCSV.process` call — Rails has booted, the existing
93
+ routing-through-Rails.logger-or-Kernel#warn infra works, and the existing deduped warnings
94
+ histogram means it fires once per process regardless of how many parse calls.
95
+
96
+ Implementation sketch:
97
+ * Add a new warning code (e.g. `:c_extension_unavailable`) alongside the existing ones
98
+ (`:chunk_size_default`, `:header_a_method`, `:utf8_missing_binary_mode`, ...).
99
+ * Severity `:warn`. Suppressible via the existing `verbose: :quiet`.
100
+ * Message points at the fix — e.g. "C acceleration extension not loaded on this Ruby; using
101
+ Ruby parser. To enable acceleration, reinstall with `gem pristine smarter_csv` and check
102
+ the build log." Plus a link/pointer to a troubleshooting section in the docs.
103
+
104
+ Bonus: add a public predicate `SmarterCSV.acceleration_available?` returning
105
+ `Parser.respond_to?(:parse_csv_line_c)`. Zero noise, useful for scripts / CI / future spec
106
+ files that want to branch on the environment fact rather than guess.
107
+
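The dedupe-and-warn logic sketched above, in plain Ruby. Module and method names are the TO_DO's proposed names, not a shipped API, and the real version would route through the gem's warnings system rather than `Kernel#warn`:

```ruby
module AccelCheck
  # Proposed public predicate: true only when the C extension loaded.
  def self.acceleration_available?
    !!(defined?(Parser) && Parser.respond_to?(:parse_csv_line_c))
  end

  def self.warn_once_if_fallback
    return if @warned                   # dedupe: fire once per process
    return if acceleration_available?
    return unless RUBY_ENGINE == 'ruby' # JRuby/TruffleRuby: nothing to do
    @warned = true
    warn "SmarterCSV: C acceleration extension not loaded; using the Ruby parser. " \
         "Try `gem pristine smarter_csv` and check the build log."
  end
end
```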
108
+ NOT doing: a banner at `require` time (every Rails app would print it at boot, too noisy);
109
+ warning when `acceleration: false` was explicitly chosen (the user knows what they're doing).
@@ -186,6 +186,10 @@ reader.each { |hash| MyModel.upsert(hash) }
186
186
  reader.errors[:bad_rows].each { |rec| puts "Bad row: #{rec[:error_message]}" }
187
187
  ```
188
188
 
189
+ ### Read-Transform-Write Pipelines
190
+
191
+ Composing `SmarterCSV.each` with `SmarterCSV.generate` is the idiomatic replacement for Ruby's `CSV.filter` — read CSV, mutate each row, write the result. See [Examples → Filtering and Transforming a CSV File](./examples.md#example-19-filtering-and-transforming-a-csv-file) for the full set of patterns (file → file, STDIN → STDOUT, gzip → gzip, header renaming).
192
+
189
193
  ---
190
194
 
191
195
  ## Value Transformation Pipeline
@@ -189,6 +189,31 @@ File.open('output.csv', 'w') do |f|
189
189
  end
190
190
  ```
191
191
 
192
+ **Write to STDOUT (e.g. piping to another process):**
193
+
194
+ ```ruby
195
+ SmarterCSV.generate($stdout) do |csv|
196
+ records.each { |r| csv << r }
197
+ end
198
+ ```
199
+
200
+ Useful in CLI scripts: `ruby export.rb | gzip > out.csv.gz`.
201
+
202
+ **Stream a CSV upload to S3 — never written to disk:**
203
+
204
+ ```ruby
205
+ require 'aws-sdk-s3'
206
+
207
+ obj = Aws::S3::Object.new(bucket_name: 'exports', key: 'reports/daily.csv')
208
+ obj.upload_stream do |stream|
209
+ SmarterCSV.generate(stream) do |csv|
210
+ Order.find_each { |o| csv << o.attributes }
211
+ end
212
+ end
213
+ ```
214
+
215
+ `upload_stream` performs a multipart upload, so the CSV is sent to S3 incrementally as it's generated — memory usage stays flat regardless of result size.
216
+
192
217
  ### Full Interface
193
218
 
194
219
  The full interface gives you direct access to the `Writer` instance, which is useful when you
@@ -617,6 +642,10 @@ end
617
642
  > **Note:** `write_headers: false` only suppresses the header line. All other
618
643
  > options (`col_sep:`, `row_sep:`, `value_converters:`, etc.) apply as normal.
619
644
 
645
+ ## Read-Transform-Write Pipelines
646
+
647
+ Pairing `SmarterCSV.generate` with `SmarterCSV.each` on the read side is the idiomatic replacement for Ruby's `CSV.filter`. See [Examples → Filtering and Transforming a CSV File](./examples.md#example-19-filtering-and-transforming-a-csv-file) for the full set of patterns, including streaming gzip → gzip pipelines.
648
+
620
649
  ## More Examples
621
650
 
622
651
  Check out the [RSpec tests](../spec/smarter_csv/writer_spec.rb) for more examples.
@@ -211,6 +211,30 @@ SmarterCSV::Reader.new('products.csv', chunk_size: 25).each_chunk do |chunk, _in
211
211
  end
212
212
  ```
213
213
 
214
+ ## Example: Resumable Import (Plain Ruby)
215
+
216
+ Track the chunk cursor in a JSON state file so an interrupted import can resume where it left off — no Rails / ActiveJob required:
217
+
218
+ ```ruby
219
+ require 'json'
220
+
221
+ STATE_FILE = '/var/run/import.state.json'
222
+
223
+ state = File.exist?(STATE_FILE) ? JSON.parse(File.read(STATE_FILE)) : { 'cursor' => 0 }
224
+
225
+ SmarterCSV.process('import.csv', chunk_size: 500) do |chunk, chunk_index|
226
+ next if chunk_index < state['cursor'] # skip already-processed chunks on resume
227
+
228
+ MyModel.import!(chunk)
229
+ state['cursor'] = chunk_index + 1
230
+ File.write(STATE_FILE, JSON.dump(state))
231
+ end
232
+
233
+ File.delete(STATE_FILE) # done — clear the cursor
234
+ ```
235
+
236
+ If the process is killed at chunk 7, the next run skips chunks 0–6 quickly via `next` and resumes at chunk 7. For Rails 8.1+ projects, see [Examples → Resumable CSV Import with Rails ActiveJob](./examples.md#example-12-resumable-csv-import-with-rails-activejob-rails-81) for the framework-native version.
237
+
214
238
  ## Example: Reading a CSV from S3
215
239
 
216
240
  SmarterCSV accepts any IO-like object, so you can stream directly from S3 without
data/docs/examples.md CHANGED
@@ -44,6 +44,12 @@
44
44
  11. [Batch Processing with Sidekiq](#example-11-batch-processing-with-sidekiq)
45
45
  12. [Resumable CSV Import with Rails ActiveJob](#example-12-resumable-csv-import-with-rails-activejob-rails-81)
46
46
  13. [Instrumentation](#example-13-instrumentation)
47
+ 14. [Streaming Inputs (Non-Seekable IO)](#example-14-streaming-inputs-non-seekable-io)
48
+ 15. [Resumable Import (Plain Ruby)](#example-15-resumable-import-plain-ruby)
49
+ 16. [CSV Files with Comment Lines](#example-16-csv-files-with-comment-lines)
50
+ 17. [Tab-Separated Values (TSV)](#example-17-tab-separated-values-tsv)
51
+ 18. [Multi-Line Fields](#example-18-multi-line-fields)
52
+ 19. [Filtering and Transforming a CSV File](#example-19-filtering-and-transforming-a-csv-file)
47
53
 
48
54
  ---
49
55
 
@@ -370,5 +376,124 @@ SmarterCSV.process('large_import.csv',
370
376
 
371
377
  See [Instrumentation Hooks](./instrumentation.md).
372
378
 
379
+ ---
380
+
381
+ ## Example 14: Streaming Inputs (Non-Seekable IO)
382
+
383
+ *(1.17.0+)* SmarterCSV reads from gzipped files, HTTP responses, S3 objects, or piped STDIN — no need to materialize the file on disk first.
384
+
385
+ ```ruby
386
+ require 'zlib'
387
+ Zlib::GzipReader.open('huge.csv.gz') do |io|
388
+ SmarterCSV.process(io) { |chunk| chunk.each { |row| MyModel.upsert(row) } }
389
+ end
390
+ ```
391
+
392
+ See [Real-World CSV Files → I/O Patterns](./real_world_csv.md#io-patterns) for gzip, S3, HTTP, STDIN, and `IO.popen` worked examples.
393
+
394
+ ---
395
+
396
+ ## Example 15: Resumable Import (Plain Ruby)
397
+
398
+ A non-Rails counterpart to Example 12 — track the chunk cursor in a JSON file so an interrupted import resumes where it left off.
399
+
400
+ See [Batch Processing → Resumable Import (Plain Ruby)](./batch_processing.md#example-resumable-import-plain-ruby) for the worked example.
401
+
402
+ ---
403
+
404
+ ## Example 16: CSV Files with Comment Lines
405
+
406
+ Strip lines matching a pattern (e.g. `#`-prefixed comments in DB dumps and log exports) using `comment_regexp`:
407
+
408
+ ```ruby
409
+ SmarterCSV.process('data.csv', comment_regexp: /\A#/)
410
+ ```
411
+
412
+ See [Header Transformations → CSV Files with Comment Lines](./header_transformations.md#csv-files-with-comment-lines) for the worked example.
413
+
414
+ ---
415
+
416
+ ## Example 17: Tab-Separated Values (TSV)
417
+
418
+ ```ruby
419
+ SmarterCSV.process('data.tsv') # auto-detected
420
+ SmarterCSV.process('data.tsv', col_sep: "\t") # explicit
421
+ ```
422
+
423
+ See [Row and Column Separators → Tab-Separated Values (TSV)](./row_col_sep.md#tab-separated-values-tsv) for details.
424
+
425
+ ---
426
+
427
+ ## Example 18: Multi-Line Fields
428
+
429
+ Newlines inside `"..."` are preserved as part of the field — common in addresses, CRM notes, and free-text comments. No configuration needed.
430
+
431
+ See [Real-World CSV Files → Multi-Line Quoted Fields](./real_world_csv.md#multi-line-quoted-fields) for the worked example.
432
+
433
+ ---
434
+
435
+ ## Example 19: Filtering and Transforming a CSV File
436
+
437
+ The Ruby CSV library has `CSV.filter` for "read CSV, mutate each row, write CSV." In SmarterCSV this is a two-line composition of `SmarterCSV.each` and `SmarterCSV.generate`:
438
+
439
+ ```ruby
440
+ SmarterCSV.generate('out.csv') do |csv|
441
+ SmarterCSV.each('in.csv') do |row|
442
+ row[:price] = (row[:price] * 1.1).round(2)
443
+ row.delete(:internal_notes)
444
+ csv << row
445
+ end
446
+ end
447
+ ```
448
+
449
+ The explicit `csv << row` is the win over `CSV.filter` — emission is intentional, not a side effect of mutating the block argument.
450
+
451
+ ### Pipeline (STDIN → STDOUT)
452
+
453
+ ```ruby
454
+ # cat in.csv | ruby filter.rb > out.csv
455
+ SmarterCSV.generate($stdout) do |csv|
456
+ SmarterCSV.each($stdin) { |row| csv << row }
457
+ end
458
+ ```
459
+
460
+ ### Skipping rows
461
+
462
+ ```ruby
463
+ SmarterCSV.generate('out.csv') do |csv|
464
+ SmarterCSV.each('in.csv') do |row|
465
+ next if row[:status] == 'archived' # just skip — no emit
466
+ csv << row
467
+ end
468
+ end
469
+ ```
470
+
471
+ ### Compressed in, compressed out
472
+
473
+ ```ruby
474
+ require 'zlib'
475
+ Zlib::GzipWriter.open('out.csv.gz') do |gz_out|
476
+ SmarterCSV.generate(gz_out) do |csv|
477
+ Zlib::GzipReader.open('in.csv.gz') do |gz_in|
478
+ SmarterCSV.each(gz_in) { |row| csv << row }
479
+ end
480
+ end
481
+ end
482
+ ```
483
+
484
+ Both endpoints are non-seekable streams — and SmarterCSV's `:auto` separator detection still works, because the first peek is buffered rather than rewound.
485
+
486
+ ### Header renaming on the way through
487
+
488
+ ```ruby
489
+ SmarterCSV.generate('out.csv', headers: [:given_name, :family_name, :email]) do |csv|
490
+ SmarterCSV.each('in.csv',
491
+ key_mapping: { first_name: :given_name, last_name: :family_name }
492
+ ) { |row| csv << row }
493
+ end
494
+ ```
495
+
496
+ Use `key_mapping:` on the read side to rename columns and `headers:` on the write side to enforce output column order.
497
+
373
498
  --------------------
374
499
  PREVIOUS: [Instrumentation Hooks](./instrumentation.md) | NEXT: [Real-World CSV Files](./real_world_csv.md) | UP: [README](../README.md)
@@ -62,6 +62,28 @@ See [Configuration Options](./options.md) for full option reference.
62
62
 
63
63
  ---
64
64
 
65
+ ## CSV Files with Comment Lines
66
+
67
+ Strip comment lines anywhere in the file — including before the header — using `comment_regexp`:
68
+
69
+ ```ruby
70
+ $ cat data.csv
71
+ # Generated 2026-01-15 by exporter v3.2
72
+ # Confidential — internal use only
73
+ id,name,amount
74
+ 1,Alice,100
75
+ 2,Bob,200
76
+ # end of file
77
+
78
+ data = SmarterCSV.process('data.csv', comment_regexp: /\A#/)
79
+ # => [{id: 1, name: "Alice", amount: 100},
80
+ # {id: 2, name: "Bob", amount: 200}]
81
+ ```
82
+
83
+ Common in database dumps, log exports, and pipelines that prepend provenance metadata. The regexp is applied per line — any line matching is dropped before parsing.
84
+
85
+ ---
86
+
65
87
  ## Header Normalization
66
88
 
67
89
  When processing the headers, it transforms them into Ruby symbols, stripping extra spaces, lower-casing them and replacing spaces with underscores. e.g. " \t Annual Sales " becomes `:annual_sales`. (see Notes below)