smarter_csv 1.16.4 → 1.17.0
- checksums.yaml +4 -4
- data/.rubocop.yml +10 -1
- data/CHANGELOG.md +54 -0
- data/Gemfile +10 -5
- data/README.md +98 -14
- data/TO_DO.md +109 -0
- data/docs/_introduction.md +1 -0
- data/docs/bad_row_quarantine.md +2 -1
- data/docs/basic_read_api.md +6 -1
- data/docs/basic_write_api.md +30 -0
- data/docs/batch_processing.md +25 -0
- data/docs/column_selection.md +1 -0
- data/docs/data_transformations.md +1 -0
- data/docs/examples.md +126 -0
- data/docs/header_transformations.md +23 -0
- data/docs/header_validations.md +1 -0
- data/docs/history.md +1 -0
- data/docs/instrumentation.md +2 -1
- data/docs/migrating_from_csv.md +1 -0
- data/docs/options.md +20 -18
- data/docs/parsing_strategy.md +1 -0
- data/docs/real_world_csv.md +51 -1
- data/docs/releases/1.16.0/performance_notes.md +15 -15
- data/docs/releases/1.17.0/benchmarks.md +121 -0
- data/docs/releases/1.17.0/changes.md +161 -0
- data/docs/releases/1.17.0/performance_notes.md +126 -0
- data/docs/row_col_sep.md +21 -1
- data/docs/ruby_csv_pitfalls.md +1 -0
- data/docs/value_converters.md +24 -0
- data/docs/warnings.md +141 -0
- data/ext/smarter_csv/smarter_csv.c +98 -32
- data/images/SmarterCSV_1.17.0_vs_RubyCSV_3.3.5_speedup.svg +106 -0
- data/images/SmarterCSV_1.17.0_vs_previous_C-speedup.svg +181 -0
- data/images/SmarterCSV_1.17.0_vs_previous_Rb-speedup.svg +179 -0
- data/lib/smarter_csv/auto_detection.rb +215 -30
- data/lib/smarter_csv/file_io.rb +2 -2
- data/lib/smarter_csv/hash_transformations.rb +29 -13
- data/lib/smarter_csv/parser.rb +42 -33
- data/lib/smarter_csv/peekable_io.rb +453 -0
- data/lib/smarter_csv/reader.rb +119 -23
- data/lib/smarter_csv/reader_options.rb +61 -1
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +40 -12
- metadata +12 -5
- data/TO_DO_v2.md +0 -14
- data/ext/smarter_csv/Makefile +0 -270
|
@@ -0,0 +1,161 @@
|
|
|
1
|
+
|
|
2
|
+
### Contents
|
|
3
|
+
|
|
4
|
+
* [Introduction](../../_introduction.md)
|
|
5
|
+
* [Migrating from Ruby CSV](../../migrating_from_csv.md)
|
|
6
|
+
* [Ruby CSV Pitfalls](../../ruby_csv_pitfalls.md)
|
|
7
|
+
* [Parsing Strategy](../../parsing_strategy.md)
|
|
8
|
+
* [The Basic Read API](../../basic_read_api.md)
|
|
9
|
+
* [The Basic Write API](../../basic_write_api.md)
|
|
10
|
+
* [Batch Processing](../../batch_processing.md)
|
|
11
|
+
* [Configuration Options](../../options.md)
|
|
12
|
+
* [Row and Column Separators](../../row_col_sep.md)
|
|
13
|
+
* [Header Transformations](../../header_transformations.md)
|
|
14
|
+
* [Header Validations](../../header_validations.md)
|
|
15
|
+
* [Column Selection](../../column_selection.md)
|
|
16
|
+
* [Data Transformations](../../data_transformations.md)
|
|
17
|
+
* [Value Converters](../../value_converters.md)
|
|
18
|
+
* [Bad Row Quarantine](../../bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](../../warnings.md)
|
|
20
|
+
* [Instrumentation Hooks](../../instrumentation.md)
|
|
21
|
+
* [Examples](../../examples.md)
|
|
22
|
+
* [Real-World CSV Files](../../real_world_csv.md)
|
|
23
|
+
* [SmarterCSV over the Years](../../history.md)
|
|
24
|
+
* [**Release Notes**](./changes.md)
|
|
25
|
+
|
|
26
|
+
--------------
|
|
27
|
+
|
|
28
|
+
# SmarterCSV 1.17.0 — Changes
|
|
29
|
+
|
|
30
|
+
RSpec tests: **1,434 → 2,210** (+776 tests since 1.16.4)
|
|
31
|
+
|
|
32
|
+
1.17.0 is a **features-and-quality** release, focused on three things: streaming IO inputs, a structured warnings system, and Rails-friendly defaults. The C parser's core line-parsing (separator splitting, quote/escape handling, multiline stitching) is unchanged from 1.16.0 (see [`docs/releases/1.16.0/`](../1.16.0/changes.md) for the parser performance story); what changed in the C path this cycle is a faster code path for quoted-field-heavy files and Unicode-aware blank detection. On the C-accelerated path, 1.17.0 vs 1.16.4 is a **mixed picture**: quote-heavy, large-field, and wide files run 7–22% faster, a handful of short-line / many-small-field files run up to about 3% slower, and the rest are within noise. The Ruby fallback path (`acceleration: false`) is at parity or faster on every file, with gains of 3–20% on most. The C-path wins come from the faster quoted-field handling; the small regressions trace to the new auto-detection default (`auto_row_sep_chars` 500→4096) plus a tiny per-line overhead. See [performance_notes.md](performance_notes.md) and [benchmarks.md](benchmarks.md) for the per-file breakdown.
|
|
33
|
+
|
|
34
|
+
---
|
|
35
|
+
|
|
36
|
+
## Compatibility
|
|
37
|
+
|
|
38
|
+
* **No breaking changes.** All 1.16.x code continues to work without modification.
|
|
39
|
+
* **Behavior change worth noting:** `auto_row_sep_chars: nil` / `0` no longer means "scan whole file" — these values fall back to the default with a warning. The total scan is hard-capped at 64KB. If you relied on the previous undocumented "scan whole file" semantics, this is a visible change.
|
|
40
|
+
|
|
41
|
+
---
|
|
42
|
+
|
|
43
|
+
## Headline Features
|
|
44
|
+
|
|
45
|
+
### 1. Non-Seekable Streaming Inputs
|
|
46
|
+
|
|
47
|
+
SmarterCSV now reads directly from any IO source — including streams that don't support `rewind` or `seek`. No need to materialize the file on disk first.
|
|
48
|
+
|
|
49
|
+
```ruby
|
|
50
|
+
# Gzipped CSV — stream-decompressed, never written to disk
|
|
51
|
+
require 'zlib'
|
|
52
|
+
Zlib::GzipReader.open('huge.csv.gz') do |io|
|
|
53
|
+
SmarterCSV.process(io) { |row| MyModel.upsert(row.first) }
|
|
54
|
+
end
|
|
55
|
+
|
|
56
|
+
# STDIN / pipes
|
|
57
|
+
SmarterCSV.process($stdin) { |row, _| MyModel.upsert(row.first) }
|
|
58
|
+
|
|
59
|
+
# HTTP response body
|
|
60
|
+
require 'open-uri'
|
|
61
|
+
URI.open('https://example.com/data.csv') { |io| SmarterCSV.process(io) }
|
|
62
|
+
|
|
63
|
+
# S3 — stream the response body directly
|
|
64
|
+
require 'aws-sdk-s3'
|
|
65
|
+
obj = Aws::S3::Client.new.get_object(bucket: 'data', key: 'imports/users.csv')
|
|
66
|
+
SmarterCSV::Reader.new(obj.body, chunk_size: 500).each_chunk do |chunk, _|
|
|
67
|
+
MyModel.insert_all(chunk)
|
|
68
|
+
end
|
|
69
|
+
```
|
|
70
|
+
|
|
71
|
+
Auto-detection of `row_sep` and `col_sep` works on these streaming sources thanks to internal buffering — the underlying source never needs to support `rewind` or `seek`. See [Real-World CSV Files → I/O Patterns](../../real_world_csv.md#io-patterns) and [Examples → Streaming Inputs](../../examples.md#example-14-streaming-inputs-non-seekable-io).
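The buffering idea can be sketched in plain Ruby. This is a minimal illustration, not SmarterCSV's internal implementation: the `PeekBuffer` class, its `peek` / `gets` methods, and the 8-byte chunk size are hypothetical names and values chosen for the example. It shows how a reader can inspect the head of a pipe for separator detection and still hand every byte to the parser afterwards, without ever calling `rewind` or `seek`.

```ruby
# Hypothetical sketch of a peek buffer over a non-seekable IO.
# Not SmarterCSV's internal class; names and sizes are illustrative.
class PeekBuffer
  def initialize(io, buffer_size: 16_384)
    @io = io
    @buffer_size = buffer_size
    @buffer = String.new # binary (ASCII-8BIT) accumulation buffer
  end

  # Return up to n bytes from the front of the stream without consuming them.
  def peek(n)
    while @buffer.bytesize < n
      break unless fill
    end
    @buffer.byteslice(0, n)
  end

  # Return the next line (including sep), or nil at end of stream.
  # Buffered bytes are served first, so peeked data is never lost.
  def gets(sep = "\n")
    until (idx = @buffer.index(sep))
      break unless fill
    end
    return nil if @buffer.empty?

    length = idx ? idx + sep.bytesize : @buffer.bytesize
    @buffer.slice!(0, length)
  end

  private

  # Append the next chunk from the underlying IO; false at end of stream.
  def fill
    @buffer << @io.readpartial(@buffer_size)
    true
  rescue EOFError
    false
  end
end

# A pipe cannot rewind, so it stands in for STDIN or a gzip stream here.
reader, writer = IO.pipe
writer.write("id,name\n1,Alice\n2,Bob\n")
writer.close

source = PeekBuffer.new(reader, buffer_size: 8)
sample = source.peek(10)   # inspect the head of the stream for detection
first_line = source.gets   # parsing still starts at byte 0
```

The design point is that detection and parsing read through the same object, so the bytes consumed while sniffing separators are replayed from memory instead of being re-read from the source.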
|
|
72
|
+
|
|
73
|
+
### 2. Structured Warnings Collection
|
|
74
|
+
|
|
75
|
+
Auto-detection and configuration warnings are now collected on the Reader as a deduped histogram, in addition to being emitted to a log sink:
|
|
76
|
+
|
|
77
|
+
```ruby
|
|
78
|
+
reader = SmarterCSV::Reader.new('data.csv')
|
|
79
|
+
reader.process
|
|
80
|
+
reader.warnings
|
|
81
|
+
# => [
|
|
82
|
+
# { type: :config, code: :chunk_size_default, severity: :warn,
|
|
83
|
+
# message: "chunk_size not set, defaulting to 100. ...", count: 1 },
|
|
84
|
+
# ...
|
|
85
|
+
# ]
|
|
86
|
+
```
|
|
87
|
+
|
|
88
|
+
Repeated warnings of the same `(type, code)` are deduped — `count` tracks occurrences across the run. This lets you surface warnings programmatically (dashboards, fail-deploys-on-codes, etc.) without parsing stderr text.
|
|
89
|
+
|
|
90
|
+
**Warning codes available in 1.17.0:**
|
|
91
|
+
|
|
92
|
+
| Code | Type | Severity | Triggered when |
|
|
93
|
+
|-------------------------------|----------------|----------|-----------------------------------------------------------------------------------------------|
|
|
94
|
+
| `:chunk_size_default` | `:config` | `:warn` | `each_chunk` is called without `chunk_size:` and the default of `100` is used. |
|
|
95
|
+
| `:header_a_method` | `:deprecation` | `:warn` | The deprecated `Reader#headerA` accessor is called. |
|
|
96
|
+
| `:utf8_missing_binary_mode` | `:encoding` | `:warn` | UTF-8 input is being processed but the IO was not opened with `"b:utf-8"`. |
|
|
97
|
+
| `:no_clear_row_sep`          | `:row_sep`     | `:error` | Auto-detection found a true tie between separators after scanning 64KB. Silent mis-parse risk. |
|
|
98
|
+
| `:no_row_sep_found`          | `:row_sep`     | `:error` | No known row separator was found in the first 64KB. Likely an exotic separator like `\u2028`. |
|
|
99
|
+
|
|
100
|
+
See [Warnings](../../warnings.md) for the full record shape, suppression options, and Rails integration details.
|
|
101
|
+
|
|
102
|
+
### 3. Class-Level `SmarterCSV.warnings` Accessor
|
|
103
|
+
|
|
104
|
+
Mirrors `SmarterCSV.errors`. Returns warnings from the most recent call to `process`, `parse`, `each`, or `each_chunk` on the current thread. Cleared at the start of each new call.
|
|
105
|
+
|
|
106
|
+
```ruby
|
|
107
|
+
SmarterCSV.process('data.csv')
|
|
108
|
+
SmarterCSV.warnings.each do |w|
|
|
109
|
+
logger.warn("[#{w[:type]}/#{w[:code]}] #{w[:message]} (×#{w[:count]})")
|
|
110
|
+
end
|
|
111
|
+
```
|
|
112
|
+
|
|
113
|
+
Per-thread (uses `Thread.current`) — safe under Puma and Sidekiq. Not fiber-safe; use `SmarterCSV::Reader` directly if processing CSV concurrently with `Async`/`Falcon`/manual `Fiber` scheduling.
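The thread-versus-fiber distinction can be illustrated with Ruby's thread variables. This is a sketch of the general mechanism only, not a statement about how SmarterCSV stores its warnings internally; the `:demo_warnings` key is a made-up name. A value stored with `Thread#thread_variable_set` is invisible to other threads, but every fiber running on the same thread sees the same slot, which is one way an accessor ends up thread-safe without being fiber-safe.

```ruby
# Illustrative only: :demo_warnings is a made-up key, not a SmarterCSV internal.
Thread.current.thread_variable_set(:demo_warnings, [:chunk_size_default])

# Another thread gets its own, empty slot.
seen_in_other_thread = Thread.new do
  Thread.current.thread_variable_get(:demo_warnings)
end.value

# A fiber on the same thread shares the slot with its parent.
seen_in_fiber = Fiber.new do
  Thread.current.thread_variable_get(:demo_warnings)
end.resume

puts "other thread: #{seen_in_other_thread.inspect}"
puts "fiber:        #{seen_in_fiber.inspect}"
```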
|
|
114
|
+
|
|
115
|
+
### 4. Rails.logger Auto-Routing
|
|
116
|
+
|
|
117
|
+
When `Rails.logger` is present, warnings are routed through it at the severity declared at the call site (`:debug` / `:info` / `:warn` / `:error` / `:fatal`):
|
|
118
|
+
|
|
119
|
+
```
|
|
120
|
+
# In log/development.log
|
|
121
|
+
[WARN] SmarterCSV: chunk_size not set, defaulting to 100. ...
|
|
122
|
+
```
|
|
123
|
+
|
|
124
|
+
Without Rails, falls back to `Kernel#warn` (writes to `$stderr`). Detection is one-shot at Reader construction — no per-call overhead. The programmatic `reader.warnings` collection is identical in both modes.
|
|
125
|
+
|
|
126
|
+
See [Warnings → Log sink routing](../../warnings.md#log-sink-routing).
|
|
127
|
+
|
|
128
|
+
---
|
|
129
|
+
|
|
130
|
+
## Improvements
|
|
131
|
+
|
|
132
|
+
* **Better auto-detection of `row_sep` and `col_sep`** — more accurate results on files with comment headers and other irregularities at the start of the stream.
|
|
133
|
+
|
|
134
|
+
* **`auto_row_sep_chars` default changed to `4096`** (was `500` in 1.16.x). Sized to cover wide-header CSVs in a single read. Out-of-range values, `nil`, or `0` fall back to the default with a warning. **Behavior change vs 1.16.x:** the previous undocumented "scan whole file" semantics on `nil`/`0` is removed; the total scan is hard-capped at 64KB.
|
|
135
|
+
|
|
136
|
+
* **`buffer_size` is now a public option** — peek buffer chunk size for non-seekable inputs (pipes, gzip readers, HTTP/S3 bodies). Default `16_384`. Out-of-range values warn and clamp to the supported range rather than raising. Has no effect on seekable inputs (file paths, `File`, `StringIO`).
|
|
137
|
+
|
|
138
|
+
* **Files ending in a lone `\r`** are now correctly detected as `\r`-terminated instead of falling through to a "no clear row separator" warning.
|
|
139
|
+
|
|
140
|
+
* **`SmarterCSV.errors` mid-stream preservation** *(merged from 1.16.4)* — fixed a bug where collected error records could be lost when processing raised mid-stream (e.g. `bad_row_limit:` exceeded → `TooManyBadRows`, or a user block raising through `.process` / `.each` / `.each_chunk`).
|
|
141
|
+
|
|
142
|
+
* **`enforce_utf8_encoding` for `ASCII-8BIT` inputs** *(merged from 1.16.4)* — fixed incorrect replacement of all non-ASCII bytes when the input was tagged binary. Encoding is now relabeled to UTF-8 before transcoding so only genuinely invalid byte sequences are replaced.
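The `enforce_utf8_encoding` fix in the last item can be illustrated with plain Ruby strings. This is a sketch of the underlying encoding behavior, not SmarterCSV's code, and the sample string is made up. Transcoding a binary-tagged string straight to UTF-8 treats every non-ASCII byte as unconvertible, while relabeling first and then scrubbing replaces only the bytes that are genuinely invalid.

```ruby
# "café" followed by a stray 0xFF byte, tagged as binary (ASCII-8BIT).
raw = "caf\xC3\xA9 \xFF".b

# Transcoding from binary: bytes >= 0x80 have no mapping, so the valid
# two-byte "é" is lost along with the stray byte.
transcoded = raw.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")

# Relabel as UTF-8 first, then replace only the invalid sequences.
relabeled = raw.dup.force_encoding("UTF-8").scrub("?")

puts transcoded
puts relabeled
```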
|
|
143
|
+
|
|
144
|
+
---
|
|
145
|
+
|
|
146
|
+
## Documentation
|
|
147
|
+
|
|
148
|
+
Substantive expansion of the user-facing docs to match the new capabilities:
|
|
149
|
+
|
|
150
|
+
* **`docs/examples.md`** — six new cookbook entries (Examples 14–19): Streaming Inputs, Resumable Plain-Ruby Import, CSV Files with Comment Lines, Tab-Separated Values (TSV), Multi-Line Fields, and Filtering and Transforming a CSV File (the `CSV.filter` replacement pattern).
|
|
151
|
+
* **`docs/real_world_csv.md`** — expanded I/O Patterns section with worked examples for gzip, S3, HTTP, STDIN, and `IO.popen`. Added a Multi-Line Quoted Fields worked example.
|
|
152
|
+
* **`docs/warnings.md`** *(new)* — full coverage of the structured warnings system: record shape, available codes, log-sink routing for Rails vs non-Rails, suppression via `verbose: :quiet`.
|
|
153
|
+
* **`docs/header_transformations.md`** — added a worked example for `comment_regexp:` (CSV files with comment lines).
|
|
154
|
+
* **`docs/row_col_sep.md`** — added a worked TSV example.
|
|
155
|
+
* **`docs/batch_processing.md`** — added a Resumable Import (Plain Ruby) example using `chunk_index` + a JSON state file (companion to the Rails 8.1 ActiveJob version in `examples.md`).
|
|
156
|
+
* **`docs/basic_read_api.md`** / **`docs/basic_write_api.md`** — cross-references to the read-transform-write composition pattern; added `$stdout` and S3 streaming write examples.
|
|
157
|
+
* **`README.md`** — added inline examples for streaming inputs, value converters, header validation, and writing CSV; one-sentence note on Rails.logger auto-routing.
|
|
158
|
+
|
|
159
|
+
---
|
|
160
|
+
|
|
161
|
+
PREVIOUS: [SmarterCSV over the Years](../../history.md) | UP: [README](../../../README.md)
|
|
@@ -0,0 +1,126 @@
|
|
|
1
|
+
# SmarterCSV 1.17.0 — Performance Notes
|
|
2
|
+
|
|
3
|
+
The per-file tables below were measured on an Apple M4 with Ruby 3.4.7 [arm64] on 2026-05-11–12: 40 iterations per run × 8 runs, reporting the p10-trimmed median across runs. The corpus has 19 files and compares `1.16.4 → 1.17.0`. Times are in seconds; lower is better. (The "vs Ruby CSV" tables further down are from the earlier 2026-05-06 run; see Methodology.)
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## 1.16.4 → 1.17.0 — C-accelerated path (the default)
|
|
8
|
+
|
|
9
|
+
The C parser's core line-parsing (separator splitting, quote/escape handling, multiline stitching) is unchanged from 1.16.0. The C-path changes this cycle are a faster code path for quoted-field-heavy files — the big wins — and Unicode-aware blank detection.
|
|
10
|
+
|
|
11
|
+
| file | 1.16.4 (s) | 1.17.0 (s) | 1.17.0 vs 1.16.4 |
|
|
12
|
+
| ------------------------------ | ---------- | ---------- | ---------------- |
|
|
13
|
+
| PEOPLE_IMPORT_B.csv | 0.06255 | 0.06305 | ~1% noise |
|
|
14
|
+
| PEOPLE_IMPORT_C.csv | 0.13072 | 0.13274 | ~2% noise |
|
|
15
|
+
| PEOPLE_IMPORT_NB.csv | 0.05985 | 0.06079 | ~2% noise |
|
|
16
|
+
| PEOPLE_IMPORT_NC.csv | 0.05273 | 0.05420 | ~3% noise |
|
|
17
|
+
| uscities.csv | 0.06325 | 0.05545 | 12.3% faster |
|
|
18
|
+
| uszips.csv | 0.06957 | 0.06255 | 10.1% faster |
|
|
19
|
+
| worldcities.csv | 0.06824 | 0.06134 | 10.1% faster |
|
|
20
|
+
| embedded_newlines_60k.csv | 0.12795 | 0.11951 | 6.6% faster |
|
|
21
|
+
| embedded_separators_60k.csv | 0.05093 | 0.04591 | 9.9% faster |
|
|
22
|
+
| heavy_quoting_60k.csv | 0.08926 | 0.07490 | 16.1% faster |
|
|
23
|
+
| long_fields_40k.csv | 0.06375 | 0.04970 | 22.0% faster |
|
|
24
|
+
| many_empty_fields_60k.csv | 0.06813 | 0.06888 | ~1% noise |
|
|
25
|
+
| multi_char_separator_60k.csv | 0.07720 | 0.07830 | ~1% noise |
|
|
26
|
+
| sample_100k.csv | 0.07051 | 0.07139 | ~1% noise |
|
|
27
|
+
| sensor_data_50krows_50cols.csv | 0.17839 | 0.17897 | ~1% noise |
|
|
28
|
+
| tab_separated_60k.tsv | 0.06704 | 0.06798 | ~1% noise |
|
|
29
|
+
| utf8_multibyte_60k.csv | 0.04391 | 0.04376 | ~ same |
|
|
30
|
+
| whitespace_heavy_60k.csv | 0.06803 | 0.06897 | ~1% noise |
|
|
31
|
+
| wide_500_cols_20k.csv | 1.07019 | 1.07348 | ~1% noise |
|
|
32
|
+
|
|
33
|
+
*`~N% noise` means the measured difference (≈N%, always a small slowdown here) is within the run-to-run variance of this setup (8 runs × 40 iterations, median across runs, p10-trimmed) — i.e. effectively unchanged, not a real regression. The raw per-version times are in the table for the exact figure.*
|
|
34
|
+
|
|
35
|
+
Quote-heavy / large-field / wide files run **7–22% faster** than 1.16.4 (`long_fields_40k` 22%, `heavy_quoting_60k` 16%, the city files 10–12%, `embedded_separators` 10%, `embedded_newlines` 7%). Everything else is within ±3% of 1.16.4 — effectively unchanged. (The short-line / many-small-field files do show a small, *consistent* uptick at the bottom of that band, traceable to the larger default auto-detection scan window plus a tiny per-line overhead; if that matters for your workload, set `auto_row_sep_chars` lower. See [What's driving the mixed C-path picture](#whats-driving-the-mixed-c-path-picture) below.)
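The "N% faster" figures are consistent with the relative reduction in median run time, `(old - new) / old`. A small sketch that recomputes two rows of the table above from their raw timings (the `percent_faster` helper is a hypothetical name used only here):

```ruby
# Relative reduction in run time, in percent, rounded to one decimal place.
def percent_faster(old_seconds, new_seconds)
  ((old_seconds - new_seconds) / old_seconds * 100).round(1)
end

puts percent_faster(0.06375, 0.04970) # long_fields_40k.csv
puts percent_faster(0.08926, 0.07490) # heavy_quoting_60k.csv
```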
|
|
36
|
+
|
|
37
|
+
---
|
|
38
|
+
|
|
39
|
+
## 1.16.4 → 1.17.0 — Ruby fallback path (`acceleration: false`)
|
|
40
|
+
|
|
41
|
+
Faster on nearly every file this cycle, from three changes: in-place stripping in the no-quote split path, a first-byte fast-reject before numeric conversion, and per-row / per-value overhead removed from the hash transformations.
|
|
42
|
+
|
|
43
|
+
| file | 1.16.4 (s) | 1.17.0 (s) | 1.17.0 vs 1.16.4 |
|
|
44
|
+
| ------------------------------ | ---------- | ---------- | ---------------- |
|
|
45
|
+
| PEOPLE_IMPORT_B.csv | 0.38220 | 0.35281 | 7.7% faster |
|
|
46
|
+
| PEOPLE_IMPORT_C.csv | 0.99047 | 0.95728 | 3.4% faster |
|
|
47
|
+
| PEOPLE_IMPORT_NB.csv | 0.36110 | 0.31716 | 12.2% faster |
|
|
48
|
+
| PEOPLE_IMPORT_NC.csv | 0.28762 | 0.25849 | 10.1% faster |
|
|
49
|
+
| uscities.csv | 0.74246 | 0.71183 | 4.1% faster |
|
|
50
|
+
| uszips.csv | 0.90817 | 0.87628 | 3.5% faster |
|
|
51
|
+
| worldcities.csv | 0.75714 | 0.72641 | 4.1% faster |
|
|
52
|
+
| embedded_newlines_60k.csv | 0.88887 | 0.86252 | 3.0% faster |
|
|
53
|
+
| embedded_separators_60k.csv | 0.57053 | 0.53401 | 6.4% faster |
|
|
54
|
+
| heavy_quoting_60k.csv | 1.09395 | 1.02829 | 6.0% faster |
|
|
55
|
+
| long_fields_40k.csv | 3.27964 | 3.29366 | ~ same |
|
|
56
|
+
| many_empty_fields_60k.csv | 0.37815 | 0.33153 | 12.3% faster |
|
|
57
|
+
| multi_char_separator_60k.csv | 0.45717 | 0.38380 | 16.0% faster |
|
|
58
|
+
| sample_100k.csv | 0.34527 | 0.30690 | 11.1% faster |
|
|
59
|
+
| sensor_data_50krows_50cols.csv | 1.32705 | 1.33218 | ~ same |
|
|
60
|
+
| tab_separated_60k.tsv | 0.38261 | 0.31359 | 18.0% faster |
|
|
61
|
+
| utf8_multibyte_60k.csv | 0.24212 | 0.21281 | 12.1% faster |
|
|
62
|
+
| whitespace_heavy_60k.csv | 0.37635 | 0.30848 | 18.0% faster |
|
|
63
|
+
| wide_500_cols_20k.csv | 5.28395 | 4.23045 | 19.9% faster |
|
|
64
|
+
|
|
65
|
+
Gains run **3–20%** vs 1.16.4, biggest on wide / many-small-field files (`wide_500_cols` 20%, `whitespace_heavy` / `tab_separated` 18%, `multi_char_separator` 16%). Only `long_fields_40k` (dominated by large-field allocation, not per-field work) and `sensor_data` (numeric-heavy — the fast-reject's per-value cost and a saved per-value method call cancel out) sit at parity.
|
|
66
|
+
|
|
67
|
+
---
|
|
68
|
+
|
|
69
|
+
## What's driving the mixed C-path picture
|
|
70
|
+
|
|
71
|
+
The C parser's core line-parsing — separator splitting, quote/escape handling, multiline stitching — is unchanged from 1.16.0; all of that hot-path work carries forward (see [the 1.16.0 changes](../1.16.0/changes.md) for the parser performance story). So why the split — some files faster, a band of small files a hair slower?
|
|
72
|
+
|
|
73
|
+
**The wins are the quoted-field handling.** 1.17.0 added a faster path for fields wrapped in quotes: the common case — a quoted field with no doubled `""` inside — now skips a copy step. Files where most or all fields are quoted (city/address-style data, long quoted text, wide rows) pick up 7–22%.
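The idea behind that fast path can be sketched in Ruby, although the real change lives in the C extension and is not reproduced here. In this illustration (the `unquote` helper is hypothetical), a quoted field that contains no doubled `""` is returned as a plain slice, and only fields with escaped quotes pay for the extra rewrite pass.

```ruby
# Hypothetical helper illustrating the fast/slow split; not the C implementation.
def unquote(field, quote_char = '"')
  # Unquoted fields pass through untouched.
  return field unless field.length >= 2 &&
                      field.start_with?(quote_char) && field.end_with?(quote_char)

  inner = field[1..-2]
  doubled = quote_char * 2
  # Fast path: no escaped quotes inside, so the slice is already the value.
  return inner unless inner.include?(doubled)

  # Slow path: collapse each doubled quote into a single one.
  inner.gsub(doubled, quote_char)
end

puts unquote('"Springfield, IL"')
puts unquote('"She said ""hi"""')
```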
|
|
74
|
+
|
|
75
|
+
**The bigger default auto-detection window.** The benchmark leaves `row_sep` at `:auto` for every file, so each run reads `auto_row_sep_chars` bytes up front — now `4096`, was `500` — and scans them for the row separator.
|
|
76
|
+
* On tiny files where total parse time is only ~50–80 ms, that one-time scan shows up as a ≤3% uptick.
|
|
77
|
+
* On larger files it's noise (and often net-positive — the wider window usually settles the separator on the first read, avoiding the doubling-escalation loop).
|
|
78
|
+
If you parse lots of very small files and care about that 1–3%, set `auto_row_sep_chars` lower, or pin `row_sep` explicitly to skip detection entirely. (The related `guess_line_ending` change — a chunked scan that doubles up to a 64 KB hard cap, replacing the old undocumented "scan whole file" on `nil`/`0` — is the same trade-off.)
|
|
79
|
+
|
|
80
|
+
**Not a factor here:** the buffering layer for non-seekable streams. The benchmark passes file paths to `SmarterCSV.process`, which opens them as seekable `File` objects, so the seekable fast path is taken and no buffering wrapper is instantiated. That layer only runs for pipes / gzip readers / HTTP/S3 bodies, which have much higher latency anyway — any extra work the buffer does there is negligible.
|
|
81
|
+
|
|
82
|
+
---
|
|
83
|
+
|
|
84
|
+
## vs Ruby CSV 3.3.5 (1.17.0 reference)
|
|
85
|
+
|
|
86
|
+
### vs `CSV.read` (raw arrays — minimum equivalent work)
|
|
87
|
+
|
|
88
|
+
`CSV.read` is the *fastest* Ruby CSV mode: plain string arrays, no symbol keys, no numeric conversion. SmarterCSV/C delivers fully processed hashes — and still beats it on every file:
|
|
89
|
+
|
|
90
|
+
| Range | Files |
|
|
91
|
+
|-----------|-------------------------------------------------------------------------|
|
|
92
|
+
| **7–8×** | PEOPLE_IMPORT_C (7.8×), uszips (7.8×) |
|
|
93
|
+
| **6–7×** | long_fields (6.9×), uscities (6.8×), worldcities (6.8×) |
|
|
94
|
+
| **5–6×** | embedded_separators (5.4×) |
|
|
95
|
+
| **3–4×** | utf8_multibyte (3.9×), PEOPLE_IMPORT_NC (3.7×), many_empty (3.5×), heavy_quoting (3.4×), sample_100k (3.4×), PEOPLE_IMPORT_NB (3.2×) |
|
|
96
|
+
| **2–3×** | PEOPLE_IMPORT_B (2.9×), embedded_newlines (2.9×), whitespace_heavy (2.9×), sensor_data (2.5×) |
|
|
97
|
+
| **1–2×** | wide_500_cols (1.7×), tab_separated (1.6×), multi_char_separator (1.4×) |
|
|
98
|
+
|
|
99
|
+
**Summary: 1.4×–7.8× faster than `CSV.read`, while returning fully processed hashes.**
|
|
100
|
+
|
|
101
|
+
### vs `CSV.hashes` (string-keyed hashes — closer to SmarterCSV output)
|
|
102
|
+
|
|
103
|
+
| Range | Files |
|
|
104
|
+
|------------|------------------------------------------------------------------------|
|
|
105
|
+
| **40–50×** | PEOPLE_IMPORT_C (47.3×) |
|
|
106
|
+
| **20–25×** | wide_500_cols (22.1×) |
|
|
107
|
+
| **10–15×** | uszips (12.5×), PEOPLE_IMPORT_NC (12.1×), many_empty (11.8×), worldcities (11.4×), uscities (11.2×), sensor_data (11.1×) |
|
|
108
|
+
| **7–10×** | embedded_separators (8.3×), long_fields (8.1×), PEOPLE_IMPORT_NB (8.1×), PEOPLE_IMPORT_B (7.9×), heavy_quoting (7.0×) |
|
|
109
|
+
| **5–7×** | whitespace_heavy (6.9×), utf8_multibyte (6.7×), sample_100k (6.2×) |
|
|
110
|
+
| **4–5×** | embedded_newlines (4.2×) |
|
|
111
|
+
| **2–3×** | tab_separated (2.3×), multi_char_separator (2.2×) |
|
|
112
|
+
|
|
113
|
+
**Summary: 2.2×–47.3× faster than `CSV.hashes`.**
|
|
114
|
+
|
|
115
|
+
---
|
|
116
|
+
|
|
117
|
+
## Methodology
|
|
118
|
+
|
|
119
|
+
Same as 1.16.0:
|
|
120
|
+
- Apple M4, Ruby 3.4.7
|
|
121
|
+
- 40 iterations per run × 8 runs (2 warm-up), median across runs (p10-trimmed)
|
|
122
|
+
- Raw .json captures preserved alongside the .md tables for reproducibility
|
|
123
|
+
|
|
124
|
+
---
|
|
125
|
+
|
|
126
|
+
PREVIOUS: [Changes](./changes.md) | UP: [README](../../../README.md)
|
data/docs/row_col_sep.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
|
@@ -30,7 +31,7 @@
|
|
|
30
31
|
|
|
31
32
|
Convenient defaults allow automatic detection of the column and row separators: `row_sep: :auto`, `col_sep: :auto`. This makes it easier to process any CSV files without having to examine the line endings or column separators, e.g. when users upload CSV files to your service and you have no control over the incoming files.
|
|
32
33
|
|
|
33
|
-
|
|
34
|
+
The setting `:auto_row_sep_chars` controls the initial scan size used while detecting the row separator (default is `4096`). Detection stops as soon as one separator has a clear majority, up to a 64KB cap. Bump it higher if your files have very wide headers or long comment preambles; out-of-range values, `nil`, or `0` fall back to the default with a warning. Of course you can also set the `:row_sep` manually to skip auto-detection entirely.
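The scan described above can be sketched as a widening window. This is an illustration under stated assumptions, not SmarterCSV's detection code: `guess_row_sep` and `ROW_SEP_PATTERNS` are hypothetical names, "clear majority" is simplified to "one candidate strictly ahead of the others", and the `"\n"` fallback mirrors the fallback described in the Warnings documentation.

```ruby
# Hypothetical sketch; SmarterCSV's real detection logic is more involved.
ROW_SEP_PATTERNS = {
  "\r\n" => /\r\n/,
  "\r"   => /\r(?!\n)/,   # lone CR
  "\n"   => /(?<!\r)\n/   # lone LF
}.freeze

def guess_row_sep(data, initial: 4096, cap: 65_536)
  window = initial
  loop do
    # Work on raw bytes so a window that cuts a multibyte character is harmless.
    sample = data.byteslice(0, window).b
    counts = ROW_SEP_PATTERNS.transform_values { |pattern| sample.scan(pattern).size }
    best, best_count = counts.max_by { |_sep, count| count }
    runner_up = counts.reject { |sep, _count| sep == best }.values.max

    # Assumed rule: one candidate strictly ahead of the others wins.
    return best if best_count > runner_up
    # Stop widening at the cap or once the whole input has been scanned.
    break if window >= cap || window >= data.bytesize

    window = [window * 2, cap].min
  end
  "\n" # fallback when the scan is inconclusive
end

# A long comment preamble pushes the first separator past the initial window,
# so the scan has to widen once before it can decide.
preamble = "#" * 5000
puts guess_row_sep("#{preamble}\r\nid,name\r\n1,Alice\r\n").inspect
```

With an explicit `row_sep:` none of this work happens, which is why pinning the separator is the cheapest option when the format is known.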
|
|
34
35
|
|
|
35
36
|
|
|
36
37
|
## Column Separator `col_sep`
|
|
@@ -39,6 +40,25 @@ The automatic detection of column separators considers: `,`, `\t`, `;`, `:`, `|`
|
|
|
39
40
|
|
|
40
41
|
Some CSV files may contain an unusual column separator, which could even be a control character.
|
|
41
42
|
|
|
43
|
+
### Tab-Separated Values (TSV)
|
|
44
|
+
|
|
45
|
+
Tab-separated files are auto-detected by default — no options needed:
|
|
46
|
+
|
|
47
|
+
```ruby
|
|
48
|
+
# $ cat data.tsv
|
|
49
|
+
# id<TAB>name<TAB>amount
|
|
50
|
+
# 1<TAB>Alice<TAB>100
|
|
51
|
+
# 2<TAB>Bob<TAB>200
|
|
52
|
+
|
|
53
|
+
# Auto-detected — col_sep: :auto is the default
|
|
54
|
+
SmarterCSV.process('data.tsv')
|
|
55
|
+
|
|
56
|
+
# Or set the separator explicitly
|
|
57
|
+
SmarterCSV.process('data.tsv', col_sep: "\t")
|
|
58
|
+
```
|
|
59
|
+
|
|
60
|
+
The default `col_sep: :auto` picks tab when it's the dominant delimiter in the first chunk of the file. The explicit form is useful in test fixtures or when you want to fail fast on unexpected formats.
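A rough sketch of what "dominant delimiter" means, using the candidate list given at the top of this section. The `guess_col_sep` helper is hypothetical and much simpler than real detection, which also has to account for delimiters inside quoted fields:

```ruby
# Hypothetical sketch: pick the candidate that occurs most often in a sample.
# Real detection also has to deal with quoted fields; this one does not.
COL_SEP_CANDIDATES = [",", "\t", ";", ":", "|"].freeze

def guess_col_sep(sample)
  counts = COL_SEP_CANDIDATES.to_h { |sep| [sep, sample.count(sep)] }
  best, best_count = counts.max_by { |_sep, count| count }
  best_count.zero? ? nil : best
end

tsv = "id\tname\tamount\n1\tAlice\t100\n2\tBob\t200\n"
puts guess_col_sep(tsv).inspect
```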
|
|
61
|
+
|
|
42
62
|
## Row Separator `row_sep`
|
|
43
63
|
|
|
44
64
|
The automatic detection of row separators considers: `\n`, `\r\n`, `\r`.
|
data/docs/ruby_csv_pitfalls.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/value_converters.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [**Value Converters**](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
|
@@ -113,6 +114,29 @@ def self.convert(value)
|
|
|
113
114
|
end
|
|
114
115
|
```
|
|
115
116
|
|
|
117
|
+
## Handling Numeric Inputs
|
|
118
|
+
|
|
119
|
+
Converters run **after** `convert_values_to_numeric`, so a field that looks like a
|
|
120
|
+
number (e.g. `"42"`, `"3.14"`) will already be an `Integer` or `Float` by the time
|
|
121
|
+
your converter sees it. If your converter expects a string, guard against this:
|
|
122
|
+
|
|
123
|
+
```ruby
|
|
124
|
+
# Safe: passes already-numeric values through unchanged
|
|
125
|
+
dollar = ->(v) { v.is_a?(String) ? v.sub('$', '').to_f : v }
|
|
126
|
+
|
|
127
|
+
# Unsafe: raises NoMethodError on Integer/Float (no #sub)
|
|
128
|
+
dollar = ->(v) { v.sub('$', '').to_f }
|
|
129
|
+
```
|
|
130
|
+
|
|
131
|
+
Alternatively, exclude the column from numeric conversion so the converter always
|
|
132
|
+
receives a string:
|
|
133
|
+
|
|
134
|
+
```ruby
|
|
135
|
+
SmarterCSV.process(file,
|
|
136
|
+
convert_values_to_numeric: { except: [:price] },
|
|
137
|
+
value_converters: { price: ->(v) { v&.sub('$', '')&.to_f } })
|
|
138
|
+
```
|
|
139
|
+
|
|
116
140
|
## Class-Based Converters
|
|
117
141
|
|
|
118
142
|
For converters you want to reuse across the codebase or test independently, define a class
|
data/docs/warnings.md
ADDED
|
@@ -0,0 +1,141 @@
|
|
|
1
|
+
|
|
2
|
+
### Contents
|
|
3
|
+
|
|
4
|
+
* [Introduction](./_introduction.md)
|
|
5
|
+
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
6
|
+
* [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
|
|
7
|
+
* [Parsing Strategy](./parsing_strategy.md)
|
|
8
|
+
* [The Basic Read API](./basic_read_api.md)
|
|
9
|
+
* [The Basic Write API](./basic_write_api.md)
|
|
10
|
+
* [Batch Processing](./batch_processing.md)
|
|
11
|
+
* [Configuration Options](./options.md)
|
|
12
|
+
* [Row and Column Separators](./row_col_sep.md)
|
|
13
|
+
* [Header Transformations](./header_transformations.md)
|
|
14
|
+
* [Header Validations](./header_validations.md)
|
|
15
|
+
* [Column Selection](./column_selection.md)
|
|
16
|
+
* [Data Transformations](./data_transformations.md)
|
|
17
|
+
* [Value Converters](./value_converters.md)
|
|
18
|
+
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [**Warnings**](./warnings.md)
|
|
20
|
+
* [Instrumentation Hooks](./instrumentation.md)
|
|
21
|
+
* [Examples](./examples.md)
|
|
22
|
+
* [Real-World CSV Files](./real_world_csv.md)
|
|
23
|
+
* [SmarterCSV over the Years](./history.md)
|
|
24
|
+
* [Release Notes](./releases/1.16.0/changes.md)
|
|
25
|
+
|
|
26
|
+
--------------
|
|
27
|
+
|
|
28
|
+
# Warnings
|
|
29
|
+
|
|
30
|
+
SmarterCSV records auto-detection and configuration warnings into a structured
|
|
31
|
+
collection on the Reader, in addition to emitting them to a log sink. This lets
|
|
32
|
+
you inspect warnings programmatically (e.g. surface them in dashboards, fail
|
|
33
|
+
deploys on unexpected codes) without parsing stderr text.
|
|
34
|
+
|
|
35
|
+
## Accessing warnings
|
|
36
|
+
|
|
37
|
+
### Via the Reader API
|
|
38
|
+
|
|
39
|
+
```ruby
|
|
40
|
+
reader = SmarterCSV::Reader.new('data.csv')
|
|
41
|
+
reader.process
|
|
42
|
+
|
|
43
|
+
reader.warnings
|
|
44
|
+
# => [
|
|
45
|
+
# { type: :config, code: :chunk_size_default, severity: :warn,
|
|
46
|
+
# message: "chunk_size not set, defaulting to 100. ...", count: 1 },
|
|
47
|
+
# ...
|
|
48
|
+
# ]
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
### Via the class-level API (`SmarterCSV.warnings`)
|
|
52
|
+
|
|
53
|
+
Mirrors `SmarterCSV.errors`. Returns the warnings from the most recent call to
|
|
54
|
+
`process`, `parse`, `each`, or `each_chunk` on the current thread. Cleared at
|
|
55
|
+
the start of each new call.
|
|
56
|
+
|
|
57
|
+
```ruby
|
|
58
|
+
SmarterCSV.process('data.csv')
|
|
59
|
+
SmarterCSV.warnings.each do |w|
|
|
60
|
+
logger.warn("[#{w[:type]}/#{w[:code]}] #{w[:message]} (×#{w[:count]})")
|
|
61
|
+
end
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
> **Note:** `SmarterCSV.warnings` is per-thread (uses `Thread.current`). It is
|
|
65
|
+
> safe in multi-threaded environments (Puma, Sidekiq), but **not fiber-safe**.
|
|
66
|
+
> If you process CSV files concurrently in fibers (e.g. with `Async`, `Falcon`,
|
|
67
|
+
> or manual `Fiber` scheduling), use `SmarterCSV::Reader` directly so warnings
|
|
68
|
+
> are scoped to the reader instance.
|
|
69
|
+
|
|
70
|
+
## Warning record shape
|
|
71
|
+
|
|
72
|
+
| Field | Description |
|
|
73
|
+
|---|---|
|
|
74
|
+
| `type` | Coarse semantic grouping. Currently: `:config`, `:deprecation`, `:encoding`, `:row_sep`. |
|
|
75
|
+
| `code` | Unique identifier for the specific warning. |
|
|
76
|
+
| `severity` | Log level: `:debug` / `:info` / `:warn` / `:error` / `:fatal`. |
|
|
77
|
+
| `message` | Human-readable description. |
|
|
78
|
+
| `count` | Number of times this `(type, code)` was triggered during the run. |
|
|
79
|
+
|
|
80
|
+
Repeated warnings of the same `(type, code)` are deduped — `count` tracks
|
|
81
|
+
occurrences. The `message` is the first one emitted.
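The dedupe rule can be sketched with a hash keyed by `[type, code]`. The `WarningLog` class below is a hypothetical illustration of the record shape described above, not SmarterCSV's implementation, and the messages are invented for the example:

```ruby
# Hypothetical collector illustrating the dedupe rule; not SmarterCSV's class.
class WarningLog
  def initialize
    @records = {} # insertion-ordered, keyed by [type, code]
  end

  # The first occurrence creates the record and fixes its message;
  # later occurrences only bump the count.
  def record(type:, code:, severity:, message:)
    entry = @records[[type, code]] ||=
      { type: type, code: code, severity: severity, message: message, count: 0 }
    entry[:count] += 1
    entry
  end

  def to_a
    @records.values
  end
end

log = WarningLog.new
log.record(type: :row_sep, code: :no_clear_row_sep, severity: :error, message: "tie on first scan")
log.record(type: :row_sep, code: :no_clear_row_sep, severity: :error, message: "tie on second scan")
log.record(type: :config, code: :chunk_size_default, severity: :warn, message: "chunk_size not set")

log.to_a.each { |w| puts "#{w[:type]}/#{w[:code]} x#{w[:count]}: #{w[:message]}" }
```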
|
|
82
|
+
|
|
83
|
+
## Available codes
|
|
84
|
+
|
|
85
|
+
| Code | Type | Severity | Triggered when |
|
|
86
|
+
|---|---|---|---|
|
|
87
|
+
| `:chunk_size_default` | `:config` | `:warn` | `each_chunk` is called without `chunk_size:` and the default of `100` is used. |
|
|
88
|
+
| `:header_a_method` | `:deprecation` | `:warn` | The deprecated `Reader#headerA` accessor is called. |
|
|
89
|
+
| `:utf8_missing_binary_mode` | `:encoding` | `:warn` | UTF-8 input is being processed but the IO was not opened with `"b:utf-8"`. |
|
|
90
|
+
| `:no_clear_row_sep` | `:row_sep` | `:error` | Auto-detection found a true tie between separators after scanning 64KB. Falls back to `"\n"`, with a risk of silent mis-parsing. |
|
|
91
|
+
| `:no_row_sep_found` | `:row_sep` | `:error` | No known row separator was found in the first 64KB. Falls back to `"\n"`. Likely an exotic separator like `\u2028`. |
|
|
92
|
+
|
|
93
|
+
## Log sink routing
|
|
94
|
+
|
|
95
|
+
The sink that warnings are emitted to is selected once, at Reader construction time:
|
|
96
|
+
|
|
97
|
+
* **Rails.logger present** — the warning is routed through `Rails.logger` at
|
|
98
|
+
the declared `severity`. `Rails.logger.warn(...)`, `Rails.logger.error(...)`,
|
|
99
|
+
etc.
|
|
100
|
+
* **No Rails.logger** — falls back to `Kernel#warn` (writes to `$stderr`).
|
|
101
|
+
|
|
102
|
+
Detection is one-shot at construct time, so there is no per-call overhead.
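One-shot sink selection can be sketched as a method that decides once and returns a callable. This is an illustration, not SmarterCSV's actual code: `build_warning_sink` is a hypothetical name, and a standard-library `Logger` stands in for `Rails.logger` so the example runs outside Rails.

```ruby
require "logger"
require "stringio"

# Hypothetical sketch of one-shot sink selection; not SmarterCSV's actual code.
# Returns a callable that takes (severity, message).
def build_warning_sink(logger = nil)
  # Resolve Rails.logger once, if Rails is loaded and no logger was passed in.
  logger ||= ::Rails.logger if defined?(::Rails) && ::Rails.respond_to?(:logger)

  if logger
    ->(severity, message) { logger.public_send(severity, "SmarterCSV: #{message}") }
  else
    ->(_severity, message) { warn("SmarterCSV: #{message}") }
  end
end

# Stand-in for Rails.logger so the example runs outside Rails.
log_output = StringIO.new
sink = build_warning_sink(Logger.new(log_output))
sink.call(:warn, "chunk_size not set, defaulting to 100.")
puts log_output.string
```

Because the branch is taken when the sink is built, each later call is a plain method dispatch with no repeated `defined?` check.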
|
|
103
|
+
|
|
104
|
+
### In a Rails app
|
|
105
|
+
|
|
106
|
+
No setup needed. SmarterCSV detects `Rails.logger` automatically, and warnings appear in your Rails log at their declared severity:
|
|
107
|
+
|
|
108
|
+
```ruby
|
|
109
|
+
SmarterCSV.process('data.csv')
|
|
110
|
+
# In log/development.log:
|
|
111
|
+
# [WARN] SmarterCSV: chunk_size not set, defaulting to 100. ...
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
### Without Rails (CLI scripts, plain Ruby, Sinatra, etc.)
|
|
115
|
+
|
|
116
|
+
Falls back to `Kernel#warn`, which writes to `$stderr`:
|
|
117
|
+
|
|
118
|
+
```ruby
|
|
119
|
+
SmarterCSV.process('data.csv')
|
|
120
|
+
# stderr:
|
|
121
|
+
# SmarterCSV: chunk_size not set, defaulting to 100. ...
|
|
122
|
+
```
|
|
123
|
+
|
|
124
|
+
The programmatic `reader.warnings` / `SmarterCSV.warnings` collection is identical in both modes — you can always inspect warnings without parsing log output.
|
|
125
|
+
|
|
126
|
+
## Suppressing warnings
|
|
127
|
+
|
|
128
|
+
Pass `verbose: :quiet` to suppress both the recording and the log emission of
|
|
129
|
+
all warnings. Currently this affects every code listed above.
|
|
130
|
+
|
|
131
|
+
```ruby
|
|
132
|
+
SmarterCSV.process('data.csv', verbose: :quiet)
|
|
133
|
+
SmarterCSV.warnings # => []
|
|
134
|
+
```
|
|
135
|
+
|
|
136
|
+
> ⚠️ Suppressing `:row_sep` warnings hides a genuine risk of silent mis-parsing on
|
|
137
|
+
> ambiguous files. Prefer passing `row_sep:` explicitly over silencing.
|
|
138
|
+
|
|
139
|
+
----------------
|
|
140
|
+
|
|
141
|
+
PREVIOUS: [Bad Row Quarantine](./bad_row_quarantine.md) | NEXT: [Instrumentation Hooks](./instrumentation.md) | UP: [README](../README.md)
|