smarter_csv 1.17.0.pre5 → 1.17.1

data/docs/options.md CHANGED
@@ -53,19 +53,19 @@
53
53
 
54
54
  ### File Input & Encoding
55
55
 
56
- | Option | Default | Explanation |
57
- |--------|---------|-------------|
58
- | `:file_encoding` | `utf-8` | Set the file encoding, e.g. `'windows-1252'` or `'iso-8859-1'`. |
59
- | `:invalid_byte_sequence` | `''` | What to replace invalid byte sequences with. |
60
- | `:force_utf8` | `false` | Force UTF-8 encoding of all lines (including headers) in the CSV file. |
56
+ | Option | Default | Explanation |
57
+ |--------------------------|---------|------------------------------------------------------------------------|
58
+ | `:file_encoding` | `utf-8` | Set the file encoding, e.g. `'windows-1252'` or `'iso-8859-1'`. |
59
+ | `:invalid_byte_sequence` | `''` | What to replace invalid byte sequences with. |
60
+ | `:force_utf8` | `false` | Force UTF-8 encoding of all lines (including headers) in the CSV file. |
61
61
 
62
62
  ### File Layout
63
63
 
64
- | Option | Default | Explanation |
65
- |--------|---------|-------------|
66
- | `:skip_lines` | `nil` | How many lines to skip before the first line or header line is processed. |
67
- | `:comment_regexp` | `nil` | Regular expression to ignore comment lines (e.g. `/\A#/`). See NOTE on CSV header. |
68
- | `:chunk_size` | `nil` | If set, data is yielded in chunks of this many rows instead of all at once. Use with `SmarterCSV.each_chunk` for memory-efficient batch processing. |
64
+ | Option | Default | Explanation |
65
+ |-------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
66
+ | `:skip_lines` | `nil` | How many lines to skip before the first line or header line is processed. |
67
+ | `:comment_regexp` | `nil` | Regular expression to ignore comment lines (e.g. `/\A#/`). See NOTE on CSV header. |
68
+ | `:chunk_size` | `nil` | If set, data is yielded in chunks of this many rows instead of all at once. Use with `SmarterCSV.each_chunk` for memory-efficient batch processing. |
69
69
 
70
70
  ### Separators
71
71
 
@@ -73,7 +73,8 @@
73
73
  |--------|---------|-------------|
74
74
  | `:col_sep` | `:auto` | Column separator. `:auto` detects from file content (previous default was `','`). |
75
75
  | `:row_sep` | `:auto` | Row / record separator. `:auto` detects from file content by scanning in chunks of `auto_row_sep_chars` bytes, up to a 64KB hard cap. |
76
- | `:auto_row_sep_chars` | `8192` | Chunk size used while scanning for `:row_sep => :auto`. Detection stops as soon as one separator has a clear majority, with a 64KB hard cap. Must be an Integer ≥ 8192; smaller values, `nil`, or `0` are rejected and fall back to the default with a warning. |
76
+ | `:auto_row_sep_chars` | `4096` | Initial scan size for `:row_sep => :auto` detection. Scan stops as soon as one separator has a clear majority, up to a 64KB cap. Bump this if your files have very wide headers or long comment preambles. Out-of-range values, `nil`, or `0` fall back to the default with a warning. |
77
+ | `:buffer_size` | `16_384` | Peek buffer chunk size for non-seekable inputs (pipes, gzip readers, HTTP/S3 bodies). Out-of-range values warn and clamp to the supported range. Has no effect on seekable inputs (file paths, `File`, `StringIO`, `Tempfile`). |
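
The chunked scan described for `:row_sep => :auto` can be pictured with a small standalone sketch. This is not SmarterCSV's implementation: `guess_row_sep`, `majority_sep`, and the "more than half of all separators seen" rule are assumptions made for illustration. Only the 4096-byte chunk size and the 64KB cap come from the table above.

```ruby
require 'stringio'

# Candidate separators and the patterns that count them without double-counting "\r\n".
CANDIDATE_SEPS = {
  "\r\n" => /\r\n/,
  "\n"   => /(?<!\r)\n/,
  "\r"   => /\r(?!\n)/
}.freeze

# Hypothetical rule: a separator wins when it accounts for more than half of all hits.
def majority_sep(text)
  counts = CANDIDATE_SEPS.transform_values { |pattern| text.scan(pattern).size }
  total = counts.values.sum
  best, best_count = counts.max_by { |_sep, count| count }
  best if total.positive? && best_count * 2 > total
end

# Hypothetical helper: read chunk by chunk and stop early once a winner is clear.
def guess_row_sep(io, chunk_size: 4096, cap: 64 * 1024)
  buffer = String.new
  while buffer.bytesize < cap && (chunk = io.read(chunk_size))
    buffer << chunk
    # A trailing "\r" may be the first half of "\r\n", so hold it back for now.
    sep = majority_sep(buffer.delete_suffix("\r"))
    return sep if sep
  end
  majority_sep(buffer) # final pass over everything that was read
end

guess_row_sep(StringIO.new("id,name\r\n1,Alice\r\n")) # => "\r\n"
```

A larger `chunk_size` in this sketch plays the role that a larger `auto_row_sep_chars` plays in the option table: it gives the first pass more data before a decision is made.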
77
78
 
78
79
  ### Quoting
79
80
 
@@ -122,8 +123,8 @@ See [Parsing Strategy](./parsing_strategy.md) for full details on quote handling
122
123
  | `:strip_whitespace` | `true` | Remove whitespace before/after values and headers. |
123
124
  | `:convert_values_to_numeric` | `true` | Convert strings containing integers or floats to the appropriate numeric type. Accepts `{except: [:key1, :key2]}` or `{only: :key3}` to limit which columns. |
124
125
  | `:value_converters` | `nil` | Hash of `:header => converter`; converter can be a lambda/Proc or a class implementing `self.convert(value)`. See [Value Converters](./value_converters.md). |
125
- | `:remove_empty_values` | `true` | Remove key/value pairs where the value is `nil` or an empty string. |
126
- | `:remove_zero_values` | `false` | Remove key/value pairs where the numeric value equals zero. |
126
+ | `:remove_empty_values` | `true` | Remove key/value pairs where the value is `nil`, empty, or whitespace-only (any Unicode whitespace, same as ActiveSupport's `String#blank?`). |
127
+ | `:remove_zero_values` | `false` | Remove key/value pairs whose value is zero — numeric `0` / `0.0`, or any textual form of zero (`"0"`, `"0.0"`, `"00.00"`, `"+0"`, `"-0.0"`, …). |
127
128
  | `:nil_values_matching` | `nil` | Set matching values to `nil`. Accepts a regular expression matched against the string representation of each value (e.g. `/\ANAN\z/` for NaN, `/\A#VALUE!\z/` for Excel errors). With `remove_empty_values: true` (default), nil-ified values are then removed. With `remove_empty_values: false`, the key is retained with a `nil` value. |
128
129
  | `:remove_empty_hashes` | `true` | Remove result hashes that have no key/value pairs or all-empty values. |
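
To make the `:remove_empty_values` and `:remove_zero_values` descriptions concrete, here is a plain-Ruby sketch of predicates with the same documented behavior. The names `blank_value?` and `zero_value?` and both regular expressions are assumptions for illustration, not the gem's code.

```ruby
# Hypothetical predicates that mirror the documented semantics.
BLANK_RE = /\A[[:space:]]*\z/      # empty or whitespace-only, including Unicode whitespace
ZERO_RE  = /\A[+-]?0+(\.0+)?\z/    # "0", "0.0", "00.00", "+0", "-0.0"

def blank_value?(value)
  value.nil? || (value.is_a?(String) && value.match?(BLANK_RE))
end

def zero_value?(value)
  case value
  when Numeric then value.zero?
  when String  then value.match?(ZERO_RE)
  else false
  end
end

row = { id: "7", note: " \u00A0\t", qty: "00.00", delta: "-0.0", price: "0.5" }
kept = row.reject { |_key, value| blank_value?(value) || zero_value?(value) }
# kept => { id: "7", price: "0.5" }
```

In this sketch both filters are applied together; in SmarterCSV they are controlled separately, and `:remove_zero_values` is off by default.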
129
130
 
@@ -157,9 +158,9 @@ See [Instrumentation Hooks](./instrumentation.md) for full details and payload r
157
158
 
158
159
  ### Performance
159
160
 
160
- | Option | Default | Explanation |
161
- |--------|---------|-------------|
162
- | `:acceleration` | `true` | Use the C extension for parsing (MRI Ruby only). Set to `false` to force the pure-Ruby fallback (always used on JRuby/TruffleRuby). |
161
+ | Option | Default | Explanation |
162
+ |-------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------|
163
+ | `:acceleration` | `true` | Use the C extension for parsing (MRI Ruby only). Set to `false` to force the pure-Ruby fallback (always used on JRuby/TruffleRuby). |
163
164
 
164
165
  ---
165
166
 
@@ -187,7 +187,7 @@ Numeric conversion is one of the most common sources of data loss. SmarterCSV co
187
187
 
188
188
  ### I/O Patterns
189
189
 
190
- SmarterCSV accepts any IO-compatible source — file paths, open `File` handles, `StringIO`, and **non-seekable streams** like pipes, `STDIN`, and `Zlib::GzipReader`. Auto-detection of `row_sep` / `col_sep` works on streaming sources too: SmarterCSV captures the first bytes in an internal peek buffer and replays them, so the underlying source never needs to support `rewind` or `seek`. (Streaming IO support landed in 1.17.0.)
190
+ SmarterCSV accepts any IO-compatible source: file paths, open `File` handles, `StringIO`, and **non-seekable streams** like pipes, `STDIN`, and `Zlib::GzipReader`. Auto-detection of `row_sep` / `col_sep` works on streaming sources too, thanks to internal buffering: the underlying source never needs to support `rewind` or `seek`. (Streaming IO support landed in 1.17.0.)
191
191
 
192
192
  | Source | Issue | Status | Notes |
193
193
  |--------|-------|--------|-------|
@@ -195,6 +195,51 @@ SmarterCSV accepts any IO-compatible source — file paths, open `File` handles,
195
195
  | HTTP streaming | Parsing from a live HTTP response | 🔘 | Pass any IO-compatible object that responds to `#gets`. |
196
196
  | `STDIN` / shell pipes | Non-seekable input | 🔘 | `cat data.csv \| ruby -rsmarter_csv -e 'SmarterCSV.process(STDIN) { \|h\| ... }'` |
197
197
  | `IO.popen` output | Non-seekable subprocess stream | 🔘 | `IO.popen('zcat data.csv.gz') { \|io\| SmarterCSV.process(io) }` |
198
+ | S3 object body | Non-seekable HTTP stream | 🔘 | `SmarterCSV.process(s3.get_object(...).body)` — see worked example below. |
199
+
200
+ #### Streaming Inputs
201
+
202
+ ```ruby
203
+ # Gzipped CSV — stream-decompressed, never written to disk
204
+ require 'zlib'
205
+ Zlib::GzipReader.open('huge.csv.gz') do |io|
206
+   SmarterCSV.process(io) { |row| MyModel.upsert(row.first) }
207
+ end
208
+
209
+ # STDIN / pipes
210
+ SmarterCSV.process($stdin) { |row, _| MyModel.upsert(row.first) }
211
+
212
+ # HTTP response body
213
+ require 'open-uri'
214
+ URI.open('https://example.com/data.csv') { |io| SmarterCSV.process(io) }
215
+
216
+ # S3 — stream the response body directly
217
+ require 'aws-sdk-s3'
218
+ obj = Aws::S3::Client.new.get_object(bucket: 'data', key: 'imports/users.csv')
219
+ SmarterCSV::Reader.new(obj.body, chunk_size: 500).each_chunk do |chunk, _index|
220
+   MyModel.insert_all(chunk)
221
+ end
222
+
223
+ # Subprocess output
224
+ IO.popen('zcat data.csv.gz') { |io| SmarterCSV.process(io) }
225
+ ```
226
+
227
+ #### Multi-Line Quoted Fields
228
+
229
+ Newlines inside `"..."` are preserved as part of the field — useful for address blocks, CRM notes, and free-text comments. No configuration needed:
230
+
231
+ ```ruby
232
+ $ cat addresses.csv
233
+ id,name,address
234
+ 1,Alice,"123 Main St
235
+ Apt 4B
236
+ Brooklyn, NY 11201"
237
+ 2,Bob,"42 Elm Ave"
238
+
239
+ data = SmarterCSV.process('addresses.csv')
240
+ # => [{id: 1, name: "Alice", address: "123 Main St\nApt 4B\nBrooklyn, NY 11201"},
241
+ # {id: 2, name: "Bob", address: "42 Elm Ave"}]
242
+ ```
198
243
 
199
244
  †: Legacy Apple DB Dump and older UNIX data dumps use ASCII control characters as delimiters:
200
245
 
@@ -45,14 +45,14 @@ rows with type conversion applied. SmarterCSV/C is dramatically faster:
45
45
 
46
46
  ### C path
47
47
 
48
- | Gain | Files |
49
- |--------------|---------------------------------------------------------------------|
50
- | **2.4×** | long_fields — biggest win; `memchr` skip-ahead in quoted fields |
51
- | **1.5×** | heavy_quoting — same skip-ahead benefit |
52
- | **1.4×** | tab_separated |
48
+ | Gain | Files |
49
+ |--------------|-----------------------------------------------------------------------------|
50
+ | **2.4×** | long_fields — biggest win; `memchr` skip-ahead in quoted fields |
51
+ | **1.5×** | heavy_quoting — same skip-ahead benefit |
52
+ | **1.4×** | tab_separated |
53
53
  | **1.2–1.3×** | embedded_sep, utf8, PEOPLE_IMPORT_C/NC, worldcities, whitespace, multi_char |
54
- | **1.1–1.2×** | PEOPLE_IMPORT_B/NB, uszips, sample_10M, wide_500_cols |
55
- | **~1.0×** | sensor_data, embedded_newlines (within noise) |
54
+ | **1.1–1.2×** | PEOPLE_IMPORT_B/NB, uszips, sample_10M, wide_500_cols |
55
+ | **~1.0×** | sensor_data, embedded_newlines (within noise) |
56
56
 
57
57
  15 of 19 files are measurably faster; 2 within noise; 2 files show a small regression
58
58
  (PEOPLE_IMPORT_NB −7%, wide_500_cols −5%) attributable to the new `quote_boundary: :standard`
@@ -60,11 +60,11 @@ default adding one extra state check on the unquoted fast path.
60
60
 
61
61
  ### Ruby path
62
62
 
63
- | Gain | Files |
64
- |--------------|---------------------------------------------------------------------|
63
+ | Gain | Files |
64
+ |--------------|-----------------------------------------------------------------------------------|
65
65
  | **1.9×** | PEOPLE_IMPORT_C (117 cols) — direct hash construction bypasses intermediate Array |
66
- | **1.5×** | PEOPLE_IMPORT_NC, multi_char_sep |
67
- | **1.0–1.1×** | most other files |
66
+ | **1.5×** | PEOPLE_IMPORT_NC, multi_char_sep |
67
+ | **1.0–1.1×** | most other files |
68
68
 
69
69
  The Ruby path gains are concentrated on wide/complex files where the direct-hash
70
70
  construction optimization (Opt #11) has the most impact.
@@ -106,9 +106,9 @@ are skipped entirely in the C hot path — no string allocation, no conversion,
106
106
  insertion. Benchmark on `wide_500_cols_20k.csv` (500 columns):
107
107
 
108
108
  | Columns kept | Speedup vs no selection |
109
- |---|---|
110
- | 2 of 500 | ~16× faster |
111
- | 10 of 500 | ~8× faster |
112
- | 50 of 500 | ~3× faster |
109
+ |--------------|-------------------------|
110
+ | 2 of 500 | ~16× faster |
111
+ | 10 of 500 | ~8× faster |
112
+ | 50 of 500 | ~3× faster |
113
113
 
114
114
  This is additive on top of the baseline gains above.
@@ -0,0 +1,121 @@
1
+ # SmarterCSV 1.17.0 — Benchmark Results
2
+
3
+ - **Date:** 2026-05-06
4
+ - **Ruby:** 3.4.7 [arm64-darwin25] on Apple M1 Pro
5
+ - **SmarterCSV:** 1.17.0
6
+ - **Versions compared:** 1.14.4, 1.15.2, 1.16.4, 1.17.0
7
+ - **Ruby CSV:** 3.3.5
8
+ - **Methodology:** best of 40 measured runs (2 warm-up)
9
+ - **Raw data files:**
10
+ - [`2026-05-06_1250_ruby3.4.7.md`](2026-05-06_1250_ruby3.4.7.md) / [`.json`](2026-05-06_1250_ruby3.4.7.json) — version comparison (1.14.4 / 1.15.2 / 1.16.4 / 1.17.0)
11
+ - [`2026-05-06_1511_ruby3.4.7.md`](2026-05-06_1511_ruby3.4.7.md) / [`.json`](2026-05-06_1511_ruby3.4.7.json) — vs Ruby CSV 3.3.5
12
+
13
+ See [performance_notes.md](performance_notes.md) for analysis of these numbers.
14
+
15
+ ---
16
+
17
+ ## SmarterCSV C accelerated — version comparison
18
+
19
+ | File | Rows | v1.14.4 | v1.15.2 | v1.16.4 | v1.17.0 | newest vs oldest |
20
+ |----------------------------------|--------|------------|-----------|-----------|-----------|------------------|
21
+ | PEOPLE_IMPORT_B.csv | 50000 | 1.6175s | 0.1049s | 0.0867s | 0.0872s | 18.54× faster |
22
+ | PEOPLE_IMPORT_C.csv | 50000 | 8.0347s | 0.2055s | 0.1763s | 0.1746s | 46.02× faster |
23
+ | PEOPLE_IMPORT_NB.csv | 50000 | 1.5629s | 0.0994s | 0.0694s | 0.0708s | 22.08× faster |
24
+ | PEOPLE_IMPORT_NC.csv | 50000 | 1.4679s | 0.0855s | 0.0711s | 0.0705s | 20.83× faster |
25
+ | uscities.csv | 31257 | 1.0357s | 0.1129s | 0.0878s | 0.0819s | 12.64× faster |
26
+ | uszips.csv | 33782 | 1.2419s | 0.1121s | 0.0880s | 0.0879s | 14.13× faster |
27
+ | worldcities.csv | 48059 | 1.0420s | 0.1174s | 0.0861s | 0.0773s | 13.49× faster |
28
+ | embedded_newlines_20k.csv | 80000 | 0.5337s | 0.0633s | 0.0591s | 0.0545s | 9.80× faster |
29
+ | embedded_separators_20k.csv | 20000 | 0.2761s | 0.0328s | 0.0215s | 0.0214s | 12.90× faster |
30
+ | heavy_quoting_20k.csv | 20000 | 0.5129s | 0.0561s | 0.0364s | 0.0358s | 14.34× faster |
31
+ | long_fields_20k.csv | 20000 | 2.9215s | 0.1082s | 0.0464s | 0.0392s | 74.54× faster |
32
+ | many_empty_fields_20k.csv | 20000 | 0.3885s | 0.0314s | 0.0240s | 0.0262s | 14.81× faster |
33
+ | multi_char_separator_20k.csv | 20000 | 0.5305s | 0.0340s | 0.0272s | 0.0296s | 17.90× faster |
34
+ | sample_10M.csv | 50000 | 0.4513s | 0.0619s | 0.0480s | 0.0446s | 10.11× faster |
35
+ | sensor_data_50krows_50cols.csv | 50000 | 3.8704s | 0.2714s | 0.2559s | 0.2549s | 15.19× faster |
36
+ | tab_separated_20k.tsv | 20000 | 0.4496s | 0.0337s | 0.0255s | 0.0256s | 17.54× faster |
37
+ | utf8_multibyte_20k.csv | 20000 | 0.2233s | 0.0210s | 0.0152s | 0.0149s | 14.96× faster |
38
+ | whitespace_heavy_20k.csv | 20000 | 0.5244s | 0.0349s | 0.0250s | 0.0286s | 18.34× faster |
39
+ | wide_500_cols_20k.csv | 20000 | 17.3477s | 1.2805s | 1.2798s | 1.2701s | 13.66× faster |
40
+
41
+ ## SmarterCSV Ruby path — version comparison
42
+
43
+ | File | Rows | v1.14.4 | v1.15.2 | v1.16.4 | v1.17.0 | newest vs oldest |
44
+ |----------------------------------|--------|------------|-----------|-----------|-----------|------------------|
45
+ | PEOPLE_IMPORT_B.csv | 50000 | 4.5718s | 0.5635s | 0.5272s | 0.4971s | 9.20× faster |
46
+ | PEOPLE_IMPORT_C.csv | 50000 | 26.0194s | 2.5511s | 1.3401s | 1.3328s | 19.52× faster |
47
+ | PEOPLE_IMPORT_NB.csv | 50000 | 4.4999s | 0.5268s | 0.4757s | 0.4791s | 9.39× faster |
48
+ | PEOPLE_IMPORT_NC.csv | 50000 | 4.3233s | 0.5752s | 0.3989s | 0.4017s | 10.76× faster |
49
+ | uscities.csv | 31257 | 2.6702s | 1.8124s | 1.0662s | 1.0944s | 2.44× faster |
50
+ | uszips.csv | 33782 | 3.1853s | 2.1641s | 1.3332s | 1.3434s | 2.37× faster |
51
+ | worldcities.csv | 48059 | 2.8397s | 1.8978s | 1.0910s | 1.0909s | 2.60× faster |
52
+ | embedded_newlines_20k.csv | 80000 | 0.9578s | 0.4629s | 0.4291s | 0.4314s | 2.22× faster |
53
+ | embedded_separators_20k.csv | 20000 | 0.7074s | 0.4535s | 0.2748s | 0.2748s | 2.57× faster |
54
+ | heavy_quoting_20k.csv | 20000 | 1.4361s | 0.8598s | 0.5241s | 0.5273s | 2.72× faster |
55
+ | long_fields_20k.csv | 20000 | 8.8715s | 4.7839s | 2.5696s | 2.5624s | 3.46× faster |
56
+ | many_empty_fields_20k.csv | 20000 | 0.8635s | 0.2521s | 0.1680s | 0.1664s | 5.19× faster |
57
+ | multi_char_separator_20k.csv | 20000 | 1.4172s | 0.2463s | 0.1853s | 0.1879s | 7.54× faster |
58
+ | sample_10M.csv | 50000 | 1.0547s | 0.2388s | 0.2238s | 0.2211s | 4.77× faster |
59
+ | sensor_data_50krows_50cols.csv | 50000 | 8.9445s | 1.8246s | 1.8348s | 1.8181s | 4.92× faster |
60
+ | tab_separated_20k.tsv | 20000 | 1.2664s | 0.1596s | 0.1553s | 0.1536s | 8.24× faster |
61
+ | utf8_multibyte_20k.csv | 20000 | 0.6484s | 0.1124s | 0.1068s | 0.1066s | 6.08× faster |
62
+ | whitespace_heavy_20k.csv | 20000 | 1.5513s | 0.1613s | 0.1654s | 0.1610s | 9.63× faster |
63
+ | wide_500_cols_20k.csv | 20000 | 44.5782s | 7.2023s | 6.9748s | 6.9261s | 6.44× faster |
64
+
65
+ ---
66
+
67
+ ## SmarterCSV 1.17.0 vs Ruby CSV 3.3.5 — full results
68
+
69
+ | File | Rows | CSV.read¹ | CSV.hashes¹ | SmarterCSV/C | SmarterCSV/Rb |
70
+ |----------------------------------|--------|------------|-------------|---------------|---------------|
71
+ | PEOPLE_IMPORT_B.csv | 50000 | 0.2718s | 0.7750s | 0.0673s | 0.5034s |
72
+ | PEOPLE_IMPORT_C.csv | 50000 | 1.4111s | 8.0199s | 0.1907s | 1.4032s |
73
+ | PEOPLE_IMPORT_NB.csv | 50000 | 0.2659s | 0.7603s | 0.0638s | 0.4800s |
74
+ | PEOPLE_IMPORT_NC.csv | 50000 | 0.2860s | 0.9173s | 0.0630s | 0.4132s |
75
+ | uscities.csv | 31257 | 0.5640s | 0.8803s | 0.0789s | 1.1120s |
76
+ | uszips.csv | 33782 | 0.7414s | 1.1604s | 0.0929s | 1.3645s |
77
+ | worldcities.csv | 48059 | 0.6313s | 0.9906s | 0.0794s | 1.0945s |
78
+ | embedded_newlines_20k.csv | 80000 | 0.1693s | 0.2245s | 0.0554s | 0.4451s |
79
+ | embedded_separators_20k.csv | 20000 | 0.1312s | 0.1838s | 0.0206s | 0.2830s |
80
+ | heavy_quoting_20k.csv | 20000 | 0.1167s | 0.2410s | 0.0338s | 0.5400s |
81
+ | long_fields_20k.csv | 20000 | 0.2373s | 0.2762s | 0.0392s | 2.6172s |
82
+ | many_empty_fields_20k.csv | 20000 | 0.1145s | 0.3622s | 0.0216s | 0.1727s |
83
+ | multi_char_separator_20k.csv | 20000 | 0.0890s | 0.2122s | 0.0293s | 0.1662s |
84
+ | sample_10M.csv | 50000 | 0.1685s | 0.3012s | 0.0357s | 0.2361s |
85
+ | sensor_data_50krows_50cols.csv | 50000 | 0.5655s | 2.6744s | 0.2442s | 1.8878s |
86
+ | tab_separated_20k.tsv | 20000 | 0.0832s | 0.2029s | 0.0219s | 0.1651s |
87
+ | utf8_multibyte_20k.csv | 20000 | 0.0662s | 0.1427s | 0.0156s | 0.1138s |
88
+ | whitespace_heavy_20k.csv | 20000 | 0.0890s | 0.2169s | 0.0278s | 0.1670s |
89
+ | wide_500_cols_20k.csv | 20000 | 2.3351s | 32.4002s | 1.2823s | 7.3504s |
90
+
91
+ ## Ruby CSV 3.3.5 vs SmarterCSV 1.17.0 (C accelerated)
92
+
93
+ | File | Rows | CSV.read¹ | CSV.hashes¹ |
94
+ |----------------------------------|--------|---------------|---------------|
95
+ | PEOPLE_IMPORT_B.csv | 50000 | 4.04× slower | 11.51× slower |
96
+ | PEOPLE_IMPORT_C.csv | 50000 | 7.40× slower | 42.04× slower |
97
+ | PEOPLE_IMPORT_NB.csv | 50000 | 4.17× slower | 11.92× slower |
98
+ | PEOPLE_IMPORT_NC.csv | 50000 | 4.54× slower | 14.55× slower |
99
+ | uscities.csv | 31257 | 7.15× slower | 11.16× slower |
100
+ | uszips.csv | 33782 | 7.98× slower | 12.50× slower |
101
+ | worldcities.csv | 48059 | 7.95× slower | 12.48× slower |
102
+ | embedded_newlines_20k.csv | 80000 | 3.05× slower | 4.05× slower |
103
+ | embedded_separators_20k.csv | 20000 | 6.36× slower | 8.91× slower |
104
+ | heavy_quoting_20k.csv | 20000 | 3.46× slower | 7.14× slower |
105
+ | long_fields_20k.csv | 20000 | 6.05× slower | 7.04× slower |
106
+ | many_empty_fields_20k.csv | 20000 | 5.29× slower | 16.73× slower |
107
+ | multi_char_separator_20k.csv | 20000 | 3.04× slower | 7.25× slower |
108
+ | sample_10M.csv | 50000 | 4.72× slower | 8.43× slower |
109
+ | sensor_data_50krows_50cols.csv | 50000 | 2.32× slower | 10.95× slower |
110
+ | tab_separated_20k.tsv | 20000 | 3.80× slower | 9.28× slower |
111
+ | utf8_multibyte_20k.csv | 20000 | 4.24× slower | 9.14× slower |
112
+ | whitespace_heavy_20k.csv | 20000 | 3.20× slower | 7.81× slower |
113
+ | wide_500_cols_20k.csv | 20000 | 1.82× slower | 25.27× slower |
114
+
115
+ ---
116
+
117
+ ¹ **Raw output** — no post-processing applied. Returns plain arrays or string-keyed hashes. No header normalization, type conversion, whitespace stripping, or empty-value removal. Your own post-processing must be added to produce usable data.
118
+
119
+ ---
120
+
121
+ PREVIOUS: [Performance Notes](./performance_notes.md) | UP: [README](../../../README.md)
@@ -0,0 +1,161 @@
1
+
2
+ ### Contents
3
+
4
+ * [Introduction](../../_introduction.md)
5
+ * [Migrating from Ruby CSV](../../migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](../../ruby_csv_pitfalls.md)
7
+ * [Parsing Strategy](../../parsing_strategy.md)
8
+ * [The Basic Read API](../../basic_read_api.md)
9
+ * [The Basic Write API](../../basic_write_api.md)
10
+ * [Batch Processing](../../batch_processing.md)
11
+ * [Configuration Options](../../options.md)
12
+ * [Row and Column Separators](../../row_col_sep.md)
13
+ * [Header Transformations](../../header_transformations.md)
14
+ * [Header Validations](../../header_validations.md)
15
+ * [Column Selection](../../column_selection.md)
16
+ * [Data Transformations](../../data_transformations.md)
17
+ * [Value Converters](../../value_converters.md)
18
+ * [Bad Row Quarantine](../../bad_row_quarantine.md)
19
+ * [Warnings](../../warnings.md)
20
+ * [Instrumentation Hooks](../../instrumentation.md)
21
+ * [Examples](../../examples.md)
22
+ * [Real-World CSV Files](../../real_world_csv.md)
23
+ * [SmarterCSV over the Years](../../history.md)
24
+ * [**Release Notes**](./changes.md)
25
+
26
+ --------------
27
+
28
+ # SmarterCSV 1.17.0 — Changes
29
+
30
+ RSpec tests: **1,434 → 2,210** (+776 tests since 1.16.4)
31
+
32
+ 1.17.0 is a **features-and-quality** release, focused on three things: streaming IO inputs, a structured warnings system, and Rails-friendly defaults. The C parser's core line parsing (separator splitting, quote/escape handling, multiline stitching) is unchanged from 1.16.0 (see [`docs/releases/1.16.0/`](../1.16.0/changes.md) for the parser performance story); what changed in the C path this cycle is a faster code path for quoted-field-heavy files and Unicode-aware blank detection. On the C-accelerated path, 1.17.0 vs 1.16.4 is a **mixed picture**: quoted-field-heavy and wide files run meaningfully faster, a handful of short-line / many-small-field files run a little slower, and the rest are within noise. The Ruby path is at parity throughout. The wins come from the faster quoted-field handling; the small regressions trace to the new auto-detection default (`auto_row_sep_chars` 500→4096) plus a tiny per-line overhead. See [performance_notes.md](performance_notes.md) and [benchmarks.md](benchmarks.md) for the per-file breakdown.
33
+
34
+ ---
35
+
36
+ ## Compatibility
37
+
38
+ * **No breaking changes.** All 1.16.x code continues to work without modification.
39
+ * **Behavior change worth noting:** `auto_row_sep_chars: nil` / `0` no longer means "scan whole file" — these values fall back to the default with a warning. The total scan is hard-capped at 64KB. If you relied on the previous undocumented "scan whole file" semantics, this is a visible change.
40
+
41
+ ---
42
+
43
+ ## Headline Features
44
+
45
+ ### 1. Non-Seekable Streaming Inputs
46
+
47
+ SmarterCSV now reads directly from any IO source — including streams that don't support `rewind` or `seek`. No need to materialize the file on disk first.
48
+
49
+ ```ruby
50
+ # Gzipped CSV — stream-decompressed, never written to disk
51
+ require 'zlib'
52
+ Zlib::GzipReader.open('huge.csv.gz') do |io|
53
+   SmarterCSV.process(io) { |row| MyModel.upsert(row.first) }
54
+ end
55
+
56
+ # STDIN / pipes
57
+ SmarterCSV.process($stdin) { |row, _| MyModel.upsert(row.first) }
58
+
59
+ # HTTP response body
60
+ require 'open-uri'
61
+ URI.open('https://example.com/data.csv') { |io| SmarterCSV.process(io) }
62
+
63
+ # S3 — stream the response body directly
64
+ require 'aws-sdk-s3'
65
+ obj = Aws::S3::Client.new.get_object(bucket: 'data', key: 'imports/users.csv')
66
+ SmarterCSV::Reader.new(obj.body, chunk_size: 500).each_chunk do |chunk, _|
67
+   MyModel.insert_all(chunk)
68
+ end
69
+ ```
70
+
71
+ Auto-detection of `row_sep` and `col_sep` works on these streaming sources thanks to internal buffering — the underlying source never needs to support `rewind` or `seek`. See [Real-World CSV Files → I/O Patterns](../../real_world_csv.md#io-patterns) and [Examples → Streaming Inputs](../../examples.md#example-14-streaming-inputs-non-seekable-io).
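
The "internal buffering" idea can be sketched as a small wrapper that reads ahead from a non-seekable source, lets a detector look at those bytes, and then hands the same bytes to the parser. `PeekableIO` and its methods are hypothetical names used only for this illustration; the sketch does not reproduce SmarterCSV's internals. The only figure taken from the notes is the `16_384` default chunk size.

```ruby
# Hypothetical wrapper: peeked bytes stay queued, so nothing needs rewind or seek.
class PeekableIO
  def initialize(io, buffer_size: 16_384)
    @io = io
    @buffer_size = buffer_size
    @pending = String.new # bytes read from the source but not yet handed out
    @eof = false
  end

  # Return up to nbytes from the front of the stream without consuming them.
  def peek(nbytes)
    fill while @pending.bytesize < nbytes && !@eof
    @pending.byteslice(0, nbytes)
  end

  # Return the next line, serving queued bytes first and the source afterwards.
  def gets(sep = "\n")
    fill until @pending.include?(sep) || @eof
    return nil if @pending.empty?

    idx = @pending.index(sep)
    @pending.slice!(0, idx ? idx + sep.bytesize : @pending.bytesize)
  end

  private

  # Pull one more chunk from the source, or remember that it is exhausted.
  def fill
    chunk = @io.read(@buffer_size)
    if chunk
      @pending << chunk
    else
      @eof = true
    end
  end
end

# A pipe cannot be rewound, which makes it a convenient stand-in for gzip or HTTP bodies.
reader, writer = IO.pipe
writer.write("id;name\r\n1;Alice\r\n")
writer.close

stream = PeekableIO.new(reader)
head = stream.peek(64)             # look at the first bytes for separator detection
first_line = stream.gets("\r\n")   # the peeked bytes are still delivered to the reader
```

Because `IO#read` with a length blocks until that many bytes arrive or the source closes, a wrapper like this suits finite streams; an interactive source would need a different read strategy.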
72
+
73
+ ### 2. Structured Warnings Collection
74
+
75
+ Auto-detection and configuration warnings are now collected on the Reader as a deduped histogram, in addition to being emitted to a log sink:
76
+
77
+ ```ruby
78
+ reader = SmarterCSV::Reader.new('data.csv')
79
+ reader.process
80
+ reader.warnings
81
+ # => [
82
+ # { type: :config, code: :chunk_size_default, severity: :warn,
83
+ # message: "chunk_size not set, defaulting to 100. ...", count: 1 },
84
+ # ...
85
+ # ]
86
+ ```
87
+
88
+ Repeated warnings of the same `(type, code)` are deduped — `count` tracks occurrences across the run. This lets you surface warnings programmatically (dashboards, fail-deploys-on-codes, etc.) without parsing stderr text.
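
A deduped histogram of this shape is easy to picture in plain Ruby. The `WarningHistogram` class below is a hypothetical stand-in written for this note; only the record keys (`type`, `code`, `severity`, `message`, `count`) are taken from the example output above.

```ruby
# Hypothetical collector: one record per (type, code) pair, with a running count.
class WarningHistogram
  def initialize
    @records = {}
  end

  def add(type:, code:, severity:, message:)
    key = [type, code]
    record = (@records[key] ||= { type: type, code: code, severity: severity,
                                  message: message, count: 0 })
    record[:count] += 1
    record
  end

  def to_a
    @records.values
  end
end

histogram = WarningHistogram.new
2.times do
  histogram.add(type: :config, code: :chunk_size_default, severity: :warn,
                message: "chunk_size not set, defaulting to 100.")
end
histogram.add(type: :row_sep, code: :no_clear_row_sep, severity: :error,
              message: "no clear row separator")

# Fail a deploy step on specific codes, without parsing log text.
blocking = histogram.to_a.select { |w| w[:severity] == :error }
```

The same `select` works on the array returned by `reader.warnings`, since it is described as a list of hashes with these keys.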
89
+
90
+ **Warning codes available in 1.17.0:**
91
+
92
+ | Code | Type | Severity | Triggered when |
93
+ |-------------------------------|----------------|----------|-----------------------------------------------------------------------------------------------|
94
+ | `:chunk_size_default` | `:config` | `:warn` | `each_chunk` is called without `chunk_size:` and the default of `100` is used. |
95
+ | `:header_a_method` | `:deprecation` | `:warn` | The deprecated `Reader#headerA` accessor is called. |
96
+ | `:utf8_missing_binary_mode` | `:encoding` | `:warn` | UTF-8 input is being processed but the IO was not opened with `"b:utf-8"`. |
97
+ | `:no_clear_row_sep` | `:row_sep` | `:error` | Auto-detection found a true tie between separators after scanning 64KB. Silent mis-parse risk. |
98
+ | `:no_row_sep_found` | `:row_sep` | `:error` | No known row separator was found in the first 64KB. Likely an exotic separator. |
99
+
100
+ See [Warnings](../../warnings.md) for the full record shape, suppression options, and Rails integration details.
101
+
102
+ ### 3. Class-Level `SmarterCSV.warnings` Accessor
103
+
104
+ Mirrors `SmarterCSV.errors`. Returns warnings from the most recent call to `process`, `parse`, `each`, or `each_chunk` on the current thread. Cleared at the start of each new call.
105
+
106
+ ```ruby
107
+ SmarterCSV.process('data.csv')
108
+ SmarterCSV.warnings.each do |w|
109
+ logger.warn("[#{w[:type]}/#{w[:code]}] #{w[:message]} (×#{w[:count]})")
110
+ end
111
+ ```
112
+
113
+ Per-thread (uses `Thread.current`) — safe under Puma and Sidekiq. Not fiber-safe; use `SmarterCSV::Reader` directly if processing CSV concurrently with `Async`/`Falcon`/manual `Fiber` scheduling.
114
+
115
+ ### 4. Rails.logger Auto-Routing
116
+
117
+ When `Rails.logger` is present, warnings are routed through it at the severity declared at the call site (`:debug` / `:info` / `:warn` / `:error` / `:fatal`):
118
+
119
+ ```
120
+ # In log/development.log
121
+ [WARN] SmarterCSV: chunk_size not set, defaulting to 100. ...
122
+ ```
123
+
124
+ Without Rails, falls back to `Kernel#warn` (writes to `$stderr`). Detection is one-shot at Reader construction — no per-call overhead. The programmatic `reader.warnings` collection is identical in both modes.
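
One way to picture one-shot sink detection is a helper that decides once and returns a callable. This is a minimal sketch under assumptions: `build_warning_sink` is a made-up name, and the `SmarterCSV:` message prefix is copied from the log sample above rather than from the gem's source.

```ruby
# Hypothetical helper: choose the sink once, then reuse the returned lambda.
def build_warning_sink
  if defined?(Rails) && Rails.respond_to?(:logger) && Rails.logger
    logger = Rails.logger
    # Severity is one of :debug, :info, :warn, :error, :fatal.
    ->(severity, message) { logger.public_send(severity, "SmarterCSV: #{message}") }
  else
    # Kernel#warn writes to $stderr and ignores the severity.
    ->(_severity, message) { warn("SmarterCSV: #{message}") }
  end
end

sink = build_warning_sink
sink.call(:warn, "chunk_size not set, defaulting to 100.")
```

Resolving the sink up front keeps the per-warning path to a single lambda call, which is the property the paragraph above describes as "no per-call overhead".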
125
+
126
+ See [Warnings → Log sink routing](../../warnings.md#log-sink-routing).
127
+
128
+ ---
129
+
130
+ ## Improvements
131
+
132
+ * **Better auto-detection of `row_sep` and `col_sep`** — more accurate results on files with comment headers and other irregularities at the start of the stream.
133
+
134
+ * **`auto_row_sep_chars` default changed to `4096`** (was `500` in 1.16.x). Sized to cover wide-header CSVs in a single read. Out-of-range values, `nil`, or `0` fall back to the default with a warning. **Behavior change vs 1.16.x:** the previous undocumented "scan whole file" semantics on `nil`/`0` is removed; the total scan is hard-capped at 64KB.
135
+
136
+ * **`buffer_size` is now a public option** — peek buffer chunk size for non-seekable inputs (pipes, gzip readers, HTTP/S3 bodies). Default `16_384`. Out-of-range values warn and clamp to the supported range rather than raising. Has no effect on seekable inputs (file paths, `File`, `StringIO`).
137
+
138
+ * **Files ending in a lone `\r`** are now correctly detected as `\r`-terminated instead of falling through to a "no clear row separator" warning.
139
+
140
+ * **`SmarterCSV.errors` mid-stream preservation** *(merged from 1.16.4)* — fixed a bug where collected error records could be lost when processing raised mid-stream (e.g. `bad_row_limit:` exceeded → `TooManyBadRows`, or a user block raising through `.process` / `.each` / `.each_chunk`).
141
+
142
+ * **`enforce_utf8_encoding` for `ASCII-8BIT` inputs** *(merged from 1.16.4)* — fixed incorrect replacement of all non-ASCII bytes when the input was tagged binary. Encoding is now relabeled to UTF-8 before transcoding so only genuinely invalid byte sequences are replaced.
143
+
144
+ ---
145
+
146
+ ## Documentation
147
+
148
+ Substantive expansion of the user-facing docs to match the new capabilities:
149
+
150
+ * **`docs/examples.md`** — six new cookbook entries (Examples 14–19): Streaming Inputs, Resumable Plain-Ruby Import, CSV Files with Comment Lines, Tab-Separated Values (TSV), Multi-Line Fields, and Filtering and Transforming a CSV File (the `CSV.filter` replacement pattern).
151
+ * **`docs/real_world_csv.md`** — expanded I/O Patterns section with worked examples for gzip, S3, HTTP, STDIN, and `IO.popen`. Added a Multi-Line Quoted Fields worked example.
152
+ * **`docs/warnings.md`** *(new)* — full coverage of the structured warnings system: record shape, available codes, log-sink routing for Rails vs non-Rails, suppression via `verbose: :quiet`.
153
+ * **`docs/header_transformations.md`** — added a worked example for `comment_regexp:` (CSV files with comment lines).
154
+ * **`docs/row_col_sep.md`** — added a worked TSV example.
155
+ * **`docs/batch_processing.md`** — added a Resumable Import (Plain Ruby) example using `chunk_index` + a JSON state file (companion to the Rails 8.1 ActiveJob version in `examples.md`).
156
+ * **`docs/basic_read_api.md`** / **`docs/basic_write_api.md`** — cross-references to the read-transform-write composition pattern; added `$stdout` and S3 streaming write examples.
157
+ * **`README.md`** — added inline examples for streaming inputs, value converters, header validation, and writing CSV; one-sentence note on Rails.logger auto-routing.
158
+
159
+ ---
160
+
161
+ PREVIOUS: [SmarterCSV over the Years](../../history.md) | UP: [README](../../../README.md)
@@ -0,0 +1,126 @@
1
+ # SmarterCSV 1.17.0 — Performance Notes
2
+
3
+ The per-file tables below were measured on an Apple M4 with Ruby 3.4.7 [arm64]: 40 iterations per run × 8 runs, median across runs (p10-trimmed), on 2026-05-11–12. They cover the 19-file corpus for `1.16.4 → 1.17.0`. Times are in seconds; lower is better. (The "vs Ruby CSV" tables further down are from the earlier 2026-05-06 run; see Methodology.)
4
+
5
+ ---
6
+
7
+ ## 1.16.4 → 1.17.0 — C-accelerated path (the default)
8
+
9
+ The C parser's core line-parsing (separator splitting, quote/escape handling, multiline stitching) is unchanged from 1.16.0. The C-path changes this cycle are a faster code path for quoted-field-heavy files — the big wins — and Unicode-aware blank detection.
10
+
11
+ | file | 1.16.4 (s) | 1.17.0 (s) | 1.17.0 vs 1.16.4 |
12
+ | ------------------------------ | ---------- | ---------- | ---------------- |
13
+ | PEOPLE_IMPORT_B.csv | 0.06255 | 0.06305 | ~1% noise |
14
+ | PEOPLE_IMPORT_C.csv | 0.13072 | 0.13274 | ~2% noise |
15
+ | PEOPLE_IMPORT_NB.csv | 0.05985 | 0.06079 | ~2% noise |
16
+ | PEOPLE_IMPORT_NC.csv | 0.05273 | 0.05420 | ~3% noise |
17
+ | uscities.csv | 0.06325 | 0.05545 | 12.3% faster |
18
+ | uszips.csv | 0.06957 | 0.06255 | 10.1% faster |
19
+ | worldcities.csv | 0.06824 | 0.06134 | 10.1% faster |
20
+ | embedded_newlines_60k.csv | 0.12795 | 0.11951 | 6.6% faster |
21
+ | embedded_separators_60k.csv | 0.05093 | 0.04591 | 9.9% faster |
22
+ | heavy_quoting_60k.csv | 0.08926 | 0.07490 | 16.1% faster |
23
+ | long_fields_40k.csv | 0.06375 | 0.04970 | 22.0% faster |
24
+ | many_empty_fields_60k.csv | 0.06813 | 0.06888 | ~1% noise |
25
+ | multi_char_separator_60k.csv | 0.07720 | 0.07830 | ~1% noise |
26
+ | sample_100k.csv | 0.07051 | 0.07139 | ~1% noise |
27
+ | sensor_data_50krows_50cols.csv | 0.17839 | 0.17897 | ~1% noise |
28
+ | tab_separated_60k.tsv | 0.06704 | 0.06798 | ~1% noise |
29
+ | utf8_multibyte_60k.csv | 0.04391 | 0.04376 | ~ same |
30
+ | whitespace_heavy_60k.csv | 0.06803 | 0.06897 | ~1% noise |
31
+ | wide_500_cols_20k.csv | 1.07019 | 1.07348 | ~1% noise |
32
+
33
+ *`~N% noise` means the measured difference (≈N%, always a small slowdown here) is within the run-to-run variance of this setup (8 runs × 40 iterations, median across runs, p10-trimmed) — i.e. effectively unchanged, not a real regression. The raw per-version times are in the table for the exact figure.*
34
+
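The percentage column in the table above appears to be the relative time saved, `(old - new) / old`. The helper below is a minimal sketch of that arithmetic; `percent_faster` is a hypothetical name used for illustration and is not part of SmarterCSV.

```ruby
# Relative time saved between two timings, as a percentage:
# (old - new) / old * 100, rounded to one decimal place.
# A negative result means the new version was slower.
def percent_faster(old_seconds, new_seconds)
  ((old_seconds - new_seconds) / old_seconds * 100.0).round(1)
end

puts percent_faster(0.06325, 0.05545) # uscities.csv
puts percent_faster(0.06375, 0.04970) # long_fields_40k.csv
puts percent_faster(0.06255, 0.06305) # PEOPLE_IMPORT_B.csv (slightly negative)
```

Feeding the raw per-version times from any row into this helper gives the exact figure behind a `~N% noise` label.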
35
+ Quote-heavy / large-field / wide files run **7–22% faster** than 1.16.4 (`long_fields_40k` 22%, `heavy_quoting_60k` 16%, the city files 10–12%, `embedded_separators` 10%, `embedded_newlines` 7%). Everything else is within ±3% of 1.16.4 — effectively unchanged. (The short-line / many-small-field files do show a small, *consistent* uptick at the bottom of that band, traceable to the larger default auto-detection scan window plus a tiny per-line overhead; if that matters for your workload, set `auto_row_sep_chars` lower. See [What's driving the mixed C-path picture](#whats-driving-the-mixed-c-path-picture) below.)
36
+
37
+ ---
38
+
39
+ ## 1.16.4 → 1.17.0 — Ruby fallback path (`acceleration: false`)
40
+
41
+ Faster on nearly every file this cycle, from three changes: in-place stripping in the no-quote split path, a first-byte fast-reject before numeric conversion, and per-row / per-value overhead removed from the hash transformations.
42
+
43
+ | file | 1.16.4 (s) | 1.17.0 (s) | 1.17.0 vs 1.16.4 |
44
+ | ------------------------------ | ---------- | ---------- | ---------------- |
45
+ | PEOPLE_IMPORT_B.csv | 0.38220 | 0.35281 | 7.7% faster |
46
+ | PEOPLE_IMPORT_C.csv | 0.99047 | 0.95728 | 3.4% faster |
47
+ | PEOPLE_IMPORT_NB.csv | 0.36110 | 0.31716 | 12.2% faster |
48
+ | PEOPLE_IMPORT_NC.csv | 0.28762 | 0.25849 | 10.1% faster |
49
+ | uscities.csv | 0.74246 | 0.71183 | 4.1% faster |
50
+ | uszips.csv | 0.90817 | 0.87628 | 3.5% faster |
51
+ | worldcities.csv | 0.75714 | 0.72641 | 4.1% faster |
52
+ | embedded_newlines_60k.csv | 0.88887 | 0.86252 | 3.0% faster |
53
+ | embedded_separators_60k.csv | 0.57053 | 0.53401 | 6.4% faster |
54
+ | heavy_quoting_60k.csv | 1.09395 | 1.02829 | 6.0% faster |
55
+ | long_fields_40k.csv | 3.27964 | 3.29366 | ~ same |
56
+ | many_empty_fields_60k.csv | 0.37815 | 0.33153 | 12.3% faster |
57
+ | multi_char_separator_60k.csv | 0.45717 | 0.38380 | 16.0% faster |
58
+ | sample_100k.csv | 0.34527 | 0.30690 | 11.1% faster |
59
+ | sensor_data_50krows_50cols.csv | 1.32705 | 1.33218 | ~ same |
60
+ | tab_separated_60k.tsv | 0.38261 | 0.31359 | 18.0% faster |
61
+ | utf8_multibyte_60k.csv | 0.24212 | 0.21281 | 12.1% faster |
62
+ | whitespace_heavy_60k.csv | 0.37635 | 0.30848 | 18.0% faster |
63
+ | wide_500_cols_20k.csv | 5.28395 | 4.23045 | 19.9% faster |
64
+
65
+ Gains run **3–20%** vs 1.16.4, biggest on wide / many-small-field files (`wide_500_cols` 20%, `whitespace_heavy` / `tab_separated` 18%, `multi_char_separator` 16%). Only `long_fields_40k` (dominated by large-field allocation, not per-field work) and `sensor_data` (numeric-heavy — the fast-reject's per-value cost and a saved per-value method call cancel out) sit at parity.
66
+
67
+ ---
68
+
69
+ ## What's driving the mixed C-path picture
70
+
71
+ The C parser's core line-parsing — separator splitting, quote/escape handling, multiline stitching — is unchanged from 1.16.0; all of that hot-path work carries forward (see [the 1.16.0 changes](../1.16.0/changes.md) for the parser performance story). So why the split — some files faster, a band of small files a hair slower?
72
+
73
+ **The wins are the quoted-field handling.** 1.17.0 added a faster path for fields wrapped in quotes: the common case — a quoted field with no doubled `""` inside — now skips a copy step. Files where most or all fields are quoted (city/address-style data, long quoted text, wide rows) pick up 7–22%.
74
+
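The branching behind that fast path can be sketched in plain Ruby. This is an illustration only, not the C implementation: `unquote_field` is a hypothetical helper, and the Ruby version mirrors the decision (does the field contain a doubled quote?) rather than the memory behavior of the C code.

```ruby
# Illustrative only: unquote a single CSV field that is known to be
# wrapped in quote characters.
def unquote_field(raw, quote_char = '"')
  inner = raw[1...-1]          # drop the surrounding quotes
  doubled = quote_char * 2

  # Common case: no doubled quote inside, so the inner text is the value.
  return inner unless inner.include?(doubled)

  # Rare case: collapse each doubled quote into a single one.
  inner.gsub(doubled, quote_char)
end

puts unquote_field('"San Francisco, CA"') # common case
puts unquote_field('"the ""big"" one"')   # doubled-quote case
```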
75
+ **The bigger default auto-detection window.** The benchmark leaves `row_sep` at `:auto` for every file, so each run reads `auto_row_sep_chars` bytes up front — now `4096`, was `500` — and scans them for the row separator.
76
+ * On tiny files where total parse time is only ~50–80 ms, that one-time scan shows up as a ≤3% uptick.
77
+ * On larger files it's noise (and often net-positive — the wider window usually settles the separator on the first read, avoiding the doubling-escalation loop).
78
+ If you parse lots of very small files and care about that 1–3%, set `auto_row_sep_chars` lower, or pin `row_sep` explicitly to skip detection entirely. (The related `guess_line_ending` change — a chunked scan that doubles up to a 64 KB hard cap, replacing the old undocumented "scan whole file" on `nil`/`0` — is the same trade-off.)
79
+
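The trade-off described above (a larger first read versus more doubling rounds) is easier to see in a small model. The following is a plain-Ruby sketch under stated assumptions, not SmarterCSV's actual detection code: `guess_row_sep`, `count_separators`, the "more than all others combined" majority rule, and the `"\n"` fallback are all hypothetical choices made for illustration. Only the 4096-byte default and the 64 KB cap come from the text.

```ruby
require 'stringio'

HARD_CAP = 64 * 1024 # upper bound on bytes scanned

# Count each candidate row separator in the sample.
# "\r\n" pairs are subtracted so they are not double counted.
# (A "\r" at the very end of the sample may be miscounted by one.)
def count_separators(sample)
  crlf = sample.scan("\r\n").size
  {
    "\r\n" => crlf,
    "\n"   => sample.count("\n") - crlf,
    "\r"   => sample.count("\r") - crlf
  }
end

# Read an initial window, then double it until one separator clearly
# dominates, the input ends, or the hard cap is reached.
def guess_row_sep(io, initial_window: 4096)
  window = initial_window
  sample = String.new # binary buffer, matches what IO#read(n) returns

  loop do
    chunk = io.read(window - sample.bytesize)
    sample << chunk if chunk

    counts = count_separators(sample)
    best, best_count = counts.max_by { |_, n| n }
    others = counts.values.sum - best_count
    return best if best_count > others # assumed "clear majority" rule

    if chunk.nil? || window >= HARD_CAP
      return best_count.zero? ? "\n" : best # assumed fallback
    end

    window = [window * 2, HARD_CAP].min
  end
end

p guess_row_sep(StringIO.new("a,b\r\n1,2\r\n"))
p guess_row_sep(StringIO.new("a,b,c,d,e\n1,2\n"), initial_window: 4)
```

With a tiny `initial_window` the second call needs several doubling rounds before it sees a separator, which is the escalation loop that a wider default window usually avoids.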
80
+ **Not a factor here:** the buffering layer for non-seekable streams. The benchmark passes file paths to `SmarterCSV.process`, which opens them as seekable `File` objects, so the seekable fast path is taken and no buffering wrapper is instantiated. That layer only runs for pipes / gzip readers / HTTP/S3 bodies, which have much higher latency anyway — any extra work the buffer does there is negligible.
81
+
82
+ ---
83
+
84
+ ## vs Ruby CSV 3.3.5 (1.17.0 reference)
85
+
86
+ ### vs `CSV.read` (raw arrays — minimum equivalent work)
87
+
88
+ `CSV.read` is the *fastest* Ruby CSV mode: plain string arrays, no symbol keys, no numeric conversion. SmarterCSV/C delivers fully processed hashes — and still beats it on every file:
89
+
90
+ | Range | Files |
91
+ |-----------|-------------------------------------------------------------------------|
92
+ | **7–8×** | PEOPLE_IMPORT_C (7.8×), uszips (7.8×) |
93
+ | **6–7×** | long_fields (6.9×), uscities (6.8×), worldcities (6.8×) |
94
+ | **5–6×** | embedded_separators (5.4×) |
95
+ | **3–4×** | utf8_multibyte (3.9×), PEOPLE_IMPORT_NC (3.7×), many_empty (3.5×), heavy_quoting (3.4×), sample_100k (3.4×), PEOPLE_IMPORT_NB (3.2×) |
96
+ | **2–3×** | PEOPLE_IMPORT_B (2.9×), embedded_newlines (2.9×), whitespace_heavy (2.9×), sensor_data (2.5×) |
97
+ | **1–2×** | wide_500_cols (1.7×), tab_separated (1.6×), multi_char_separator (1.4×) |
98
+
99
+ **Summary: 1.4×–7.8× faster than `CSV.read`, while returning fully processed hashes.**
100
+
101
+ ### vs `CSV.hashes` (string-keyed hashes — closer to SmarterCSV output)
102
+
103
+ | Range | Files |
104
+ |------------|------------------------------------------------------------------------|
105
+ | **40–50×** | PEOPLE_IMPORT_C (47.3×) |
106
+ | **20–25×** | wide_500_cols (22.1×) |
107
+ | **10–15×** | uszips (12.5×), PEOPLE_IMPORT_NC (12.1×), many_empty (11.8×), worldcities (11.4×), uscities (11.2×), sensor_data (11.1×) |
108
+ | **7–10×** | embedded_separators (8.3×), long_fields (8.1×), PEOPLE_IMPORT_NB (8.1×), PEOPLE_IMPORT_B (7.9×), heavy_quoting (7.0×) |
109
+ | **5–7×** | whitespace_heavy (6.9×), utf8_multibyte (6.7×), sample_100k (6.2×) |
110
+ | **4–5×** | embedded_newlines (4.2×) |
111
+ | **2–3×** | tab_separated (2.3×), multi_char_separator (2.2×) |
112
+
113
+ **Summary: 2.2×–47.3× faster than `CSV.hashes`.**
114
+
115
+ ---
116
+
117
+ ## Methodology
118
+
119
+ Same as 1.16.0:
120
+ - Apple M4, Ruby 3.4.7
121
+ - 40 iterations per run × 8 runs (2 warm-up), median across runs (p10-trimmed)
122
+ - Raw .json captures preserved alongside the .md tables for reproducibility
123
+
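A harness with the shape listed above might look like the sketch below. It is an approximation under assumptions, not the project's benchmark script: `measure` and `median` are hypothetical names, the two warm-ups are treated as untimed calls, timings are reported per iteration, and the p10 trimming step is not reproduced because the text does not define it.

```ruby
require 'benchmark'

# Median of a list of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Run the block `iterations` times per run, for `runs` runs, after
# `warmup` untimed calls. Returns the median per-iteration time in seconds.
def measure(runs: 8, iterations: 40, warmup: 2)
  warmup.times { yield }
  per_run = Array.new(runs) do
    Benchmark.realtime { iterations.times { yield } } / iterations
  end
  median(per_run)
end

# Stand-in workload; a real run would parse a CSV file here.
seconds = measure { (1..1_000).sum }
puts format('%.6f s per iteration', seconds)
```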
124
+ ---
125
+
126
+ PREVIOUS: [Changes](./changes.md) | UP: [README](../../../README.md)
data/docs/row_col_sep.md CHANGED
@@ -31,7 +31,7 @@
31
31
 
32
32
  Convenient defaults allow automatic detection of the column and row separators: `row_sep: :auto`, `col_sep: :auto`. This makes it easier to process any CSV files without having to examine the line endings or column separators, e.g. when users upload CSV files to your service and you have no control over the incoming files.
33
33
 
34
- The setting `:auto_row_sep_chars` controls the chunk size used while scanning for the row separator (default is 8192). Detection reads in chunks of this size and stops as soon as one separator has a clear majority, with a 64KB hard cap. Values below 8192 (and `nil` / `0`) are rejected and fall back to the default with a warning. Of course you can also set the `:row_sep` manually.
34
+ The setting `:auto_row_sep_chars` controls the initial scan size used while detecting the row separator (default is `4096`). Detection stops as soon as one separator has a clear majority, up to a 64KB cap. Bump it higher if your files have very wide headers or long comment preambles; out-of-range values, `nil`, or `0` fall back to the default with a warning. Of course you can also set the `:row_sep` manually to skip auto-detection entirely.
35
35
 
36
36
 
37
37
  ## Column Separator `col_sep`
@@ -40,6 +40,25 @@ The automatic detection of column separators considers: `,`, `\t`, `;`, `:`, `|`
40
40
 
41
41
  Some CSV files may contain an unusual column separator, which could even be a control character.
42
42
 
43
+ ### Tab-Separated Values (TSV)
44
+
45
+ Tab-separated files are auto-detected by default — no options needed:
46
+
47
+ ```ruby
48
+ # $ cat data.tsv
49
+ # id<TAB>name<TAB>amount
50
+ # 1<TAB>Alice<TAB>100
51
+ # 2<TAB>Bob<TAB>200
52
+
53
+ # Auto-detected — col_sep: :auto is the default
54
+ SmarterCSV.process('data.tsv')
55
+
56
+ # Or set the separator explicitly
57
+ SmarterCSV.process('data.tsv', col_sep: "\t")
58
+ ```
59
+
60
+ The default `col_sep: :auto` picks tab when it's the dominant delimiter in the first chunk of the file. The explicit form is useful in test fixtures or when you want to fail fast on unexpected formats.
61
+
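To make "dominant delimiter" concrete, here is a simplified plain-Ruby model of the idea. It is a sketch, not the library's detection logic: `guess_col_sep` is a hypothetical helper, it ignores quoting, and the comma fallback is an assumption. The candidate list is the one this document gives for automatic column-separator detection.

```ruby
# Candidate separators considered by auto-detection, per this document.
CANDIDATES = [",", "\t", ";", ":", "|"].freeze

# Pick the candidate that occurs most often in the sample.
# Simplification: characters inside quoted fields are counted too.
def guess_col_sep(sample)
  counts = CANDIDATES.to_h { |sep| [sep, sample.count(sep)] }
  sep, occurrences = counts.max_by { |_, n| n }
  occurrences.zero? ? "," : sep # assumed fallback when nothing matches
end

sample = "id\tname\tamount\n1\tAlice\t100\n2\tBob\t200\n"
p guess_col_sep(sample)
```

A file whose data contains many commas inside tab-separated fields could mislead a count like this, which is one reason to pin `col_sep: "\t"` explicitly when the format is known.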
43
62
  ## Row Separator `row_sep`
44
63
 
45
64
  The automatic detection of row separators considers: `\n`, `\r\n`, `\r`.