smarter_csv 1.16.1 → 1.16.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +16 -1
- data/CONTRIBUTORS.md +2 -1
- data/README.md +1 -1
- data/docs/options.md +1 -1
- data/docs/releases/1.16.0/changes.md +0 -2
- data/docs/ruby_csv_pitfalls.md +228 -197
- data/lib/smarter_csv/hash_transformations.rb +1 -1
- data/lib/smarter_csv/header_transformations.rb +11 -9
- data/lib/smarter_csv/reader.rb +2 -2
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv/writer.rb +1 -1
- metadata +2 -2
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6d4cf6c7123e6048cb2c8de80ad92625a9985954e4308084f0a5b86cae4df03c
+  data.tar.gz: 5bd6237f017a8d4c54e4ee9ce6f9c3863d65d744c49ed1d6409c78c07f84ec88
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 25259d4b0b4edfe05c8e5d83e5ba691f6c82effaacc3a2cb9b78490f89b1e7a132ae70c8d1ea32731d1c3cce1f62e1dd98d05ae23576cc9872bd1ff4ac635ea3
+  data.tar.gz: 045fae96155913ff53c7661c7b4fe59946a792b72ff833eb723b49138406f30b3d6763e196c2cb6d5286ce014dc3d07f7bda94434a58857701b97d24a407fb4f
data/CHANGELOG.md CHANGED

@@ -1,6 +1,22 @@
 
 # SmarterCSV 1.x Change Log
 
+## 1.16.2 (2026-03-30) — Bug Fixes
+
+RSpec tests: **1,410 → 1,425** (+15 tests)
+
+### Bug Fixes
+
+* Fixed `value_converters` to accept lambdas and Procs in addition to class-based converters.
+  Thanks to [Jonas Staškevičius](https://github.com/pirminis) for issue [#329](https://github.com/tilo/smarter_csv/issues/329).
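As of this release both converter shapes work; the dispatch sketch below is illustrative plain Ruby (the `apply_converter` helper and `CentsConverter` class are hypothetical names, not SmarterCSV internals):

```ruby
# Sketch of the two converter shapes SmarterCSV accepts as of 1.16.2.
# CentsConverter and apply_converter are illustrative names, not gem internals.

# Class-based converter (supported before 1.16.2): implements self.convert(value)
class CentsConverter
  def self.convert(value)
    (value.to_f * 100).round # "9.99" -> 999 cents
  end
end

# Lambda converter (accepted as of 1.16.2): responds to call(value)
to_cents = ->(value) { (value.to_f * 100).round }

# One plausible way to dispatch between the two shapes:
def apply_converter(converter, value)
  converter.respond_to?(:call) ? converter.call(value) : converter.convert(value)
end

apply_converter(CentsConverter, '9.99') # => 999
apply_converter(to_cents, '9.99')       # => 999
```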
+
+* Fixed blank header auto-naming to use **absolute column position**, consistent with extra data column naming.
+  `name,,` now produces `column_2`/`column_3` instead of `column_1`/`column_2`.
+  ⚠️ If your code references auto-generated keys for blank headers, update those to use the absolute column position.
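The new naming rule is easy to reproduce in plain Ruby; the `name_blank_headers` helper below is purely illustrative, not SmarterCSV's internal code:

```ruby
# Illustrative reconstruction of the 1.16.2 blank-header naming rule.
# name_blank_headers is a hypothetical helper, not SmarterCSV's internal code.
def name_blank_headers(headers, prefix: 'column_')
  headers.each_with_index.map do |header, index|
    if header.nil? || header.strip.empty?
      :"#{prefix}#{index + 1}" # absolute 1-based column position
    else
      header.to_sym
    end
  end
end

name_blank_headers(['name', '', '']) # headers parsed from "name,,"
# => [:name, :column_2, :column_3]   (was [:name, :column_1, :column_2] before 1.16.2)
```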
+
+* Fixed `Writer`: when both `map_headers:` and `header_converter:` were used together, `map_headers` was silently ignored.
+  `map_headers` is now applied first, then `header_converter` on top.
+
 ## 1.16.1 (2026-03-16) — Bug Fixes & New Features
 
 RSpec tests: **1,247 → 1,410** (+163 tests)

@@ -101,7 +117,6 @@ Measured on 19 benchmark files, Apple M1, Ruby 3.4.7. See [benchmarks](docs/rele
 * `remove_values_matching:` → use `nil_values_matching:`
 * `strict:` → use `missing_headers: :raise/:auto`
 * `verbose: true/false` → use `verbose: :debug/:normal`
-* `only_headers:` / `except_headers:` → use `headers: { only: }` / `headers: { except: }`
 
 ### Bug Fixes
 
data/CONTRIBUTORS.md CHANGED

@@ -1,4 +1,4 @@
-# A Big Thank You to all
+# A Big Thank You to all 63 Contributors!!
 
 
 A Big Thank you to everyone who filed issues, sent comments, and who contributed with pull requests:

@@ -65,3 +65,4 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
 * [Tophe](https://github.com/tophe)
 * [Dom Lebron](https://github.com/biglebronski)
 * [Paho Lurie-Gregg](https://github.com/paholg)
+* [Jonas Staškevičius](https://github.com/pirminis)
data/README.md CHANGED

@@ -249,7 +249,7 @@ For reporting issues, please:
 * open a pull-request adding a test that demonstrates the issue
 * mention your version of SmarterCSV, Ruby, Rails
 
-# [A Special Thanks to all
+# [A Special Thanks to all 63 Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
 
 
 ## Contributing
data/docs/options.md CHANGED

@@ -119,7 +119,7 @@ See [Parsing Strategy](./parsing_strategy.md) for full details on quote handling
 |--------|---------|-------------|
 | `:strip_whitespace` | `true` | Remove whitespace before/after values and headers. |
 | `:convert_values_to_numeric` | `true` | Convert strings containing integers or floats to the appropriate numeric type. Accepts `{except: [:key1, :key2]}` or `{only: :key3}` to limit which columns. |
-| `:value_converters` | `nil` | Hash of `:header =>
+| `:value_converters` | `nil` | Hash of `:header => converter`; converter can be a lambda/Proc or a class implementing `self.convert(value)`. See [Value Converters](./value_converters.md). |
 | `:remove_empty_values` | `true` | Remove key/value pairs where the value is `nil` or an empty string. |
 | `:remove_zero_values` | `false` | Remove key/value pairs where the numeric value equals zero. |
 | `:nil_values_matching` | `nil` | Set matching values to `nil`. Accepts a regular expression matched against the string representation of each value (e.g. `/\ANAN\z/` for NaN, `/\A#VALUE!\z/` for Excel errors). With `remove_empty_values: true` (default), nil-ified values are then removed. With `remove_empty_values: false`, the key is retained with a `nil` value. |

@@ -195,8 +195,6 @@ See [performance_notes.md](performance_notes.md) and [benchmarks.md](benchmarks.
 
 **Deprecations:**
 
-- `only_headers:` → use `headers: { only: }`
-- `except_headers:` → use `headers: { except: }`
 - `remove_values_matching:` → use `nil_values_matching:`
 - `strict: true` → use `missing_headers: :raise`
 - `strict: false` → use `missing_headers: :auto`
data/docs/ruby_csv_pitfalls.md CHANGED

@@ -26,50 +26,84 @@
 
 # Ruby CSV Pitfalls: Silent Data Corruption and Loss
 
-
+When having to parse CSV files, many developers go straight to the Ruby `CSV` library — it ships with Ruby and requires no dependencies.
 
-
+But it comes at the cost of boilerplate post-processing you have to write, test, and maintain yourself. Worse, there are some failure modes that produce **no exception, no warning, and no indication that anything went wrong**. Your import runs, your tests pass, and your data is quietly wrong.
 
-
+`CSV.read` is fine for small, trusted, well-formed files — particularly when you control the source. This page is about what can happen with **messy real-world files your partners produce, or users upload** — ten reproducible ways `CSV.read` and `CSV.table` can silently corrupt or lose data, with examples you can run yourself, and how SmarterCSV handles each case.
+
+> Not all ten may be equally surprising — some are odd behavior that bites you anyway, others are genuine traps. All ten are silent.
+
+---
+
+> 💡 **Want to follow along?** Download the [example CSV files](https://raw.githubusercontent.com/tilo/articles/main/ruby/smarter_csv/10-ways-ruby_csv-can-silently-corrupt-or-lose-your-data/images/10-ways-ruby_csv-can-silently-corrupt-or-lose-your-data-examples.tgz) and run the examples locally.
 
 ---
 
 ## At a Glance
 
-| # | Ruby CSV Issue | Failure Mode | SmarterCSV fix | SmarterCSV Details |
-|
-| 1 | Extra columns silently dropped | Values beyond header count compete for the `nil` key —
-| 2 | Duplicate headers —
-| 3 | Empty headers — `
-| 4 |
-| 5 | Whitespace in headers
-| 6 | `
-| 7 | `nil` vs `""` for empty fields | Unquoted empty → `nil`, quoted empty → `""` — inconsistent empty checks | by default ✅ | Default `remove_empty_values: true` removes both; `false` normalizes both to `
-| 8 | Backslash-escaped quotes (MySQL/Unix) | `\"` treated as field-closing quote — crash or garbled data | by default ✅ | Default `quote_escaping: :auto` handles both RFC 4180 and backslash escaping |
-| 9 |
-| 10 | No encoding auto-detection | Non-UTF-8 files either crash or silently produce mojibake | via option | `file_encoding:`, `force_utf8: true`, `invalid_byte_sequence
-
-¹
+| # | Severity | Ruby CSV Issue | Failure Mode | SmarterCSV fix | SmarterCSV Details |
+|---|:--------:|-------|-------------|:--------------:|---------|
+| 1 | 🔴 | Extra columns silently dropped | Values beyond header count compete for the `nil` key — only the first survives, the rest are discarded | by default ✅ | Default `missing_headers: :auto` auto-generates `:column_N` keys |
+| 2 | 🔴 | Duplicate headers — first wins | `.to_h` keeps only the first value for a repeated header; later values silently lost | by default ✅ | Default `duplicate_header_suffix:` → `:score`, `:score2`, `:score3` |
+| 3 | 🔴 | Empty headers — `nil` key collision | Blank header cells become `nil` keys; multiple blanks collide and only the first value survives | by default ✅ | Default `missing_header_prefix:` → `:column_1`, `:column_2` |
+| 4 | 🔴 | `converters: :numeric` silently corrupts leading-zero values as octal ¹ | `Integer()` interprets leading zeros as octal — `"00123"` → `83` ❌ | by default ✅ | Default `convert_values_to_numeric: true` uses decimal — no octal trap; `convert_values_to_numeric: false` preserves strings exactly |
+| 5 | 🟡 | Whitespace in headers ² | `" Age"` ≠ `"Age"` — lookup silently returns `nil` | by default ✅ | Default `strip_whitespace: true` strips headers and values |
+| 6 | 🟡 | Whitespace around values | `"active " == "active"` → `false` — leading/trailing spaces or tabs cause status/type checks to silently return wrong results | by default ✅ | Default `strip_whitespace: true` strips all values; set `false` to preserve spaces |
+| 7 | 🟠 | `nil` vs `""` for empty fields | Unquoted empty → `nil`, quoted empty → `""` — inconsistent empty checks | by default ✅ | Default `remove_empty_values: true` removes both; `false` normalizes both to `""` |
+| 8 | 🟠 | Backslash-escaped quotes (MySQL/Unix) | `\"` treated as field-closing quote — crash or garbled data | by default ✅ | Default `quote_escaping: :auto` handles both RFC 4180 and backslash escaping |
+| 9 | 🔴 | TSV file read as CSV — completely breaks ❌ | Default `col_sep: ","` on a tab-delimited file returns each row as a single string; all column structure lost | by default ✅ | Default `col_sep: :auto` detects the actual delimiter — no option needed |
+| 10 | 🔴 | No encoding auto-detection | Non-UTF-8 files either crash or silently produce mojibake | via option | `file_encoding:`, `force_utf8: true`, `invalid_byte_sequence: ''` |
+
+¹ Issue #4 can be triggered two ways: `CSV.table` enables `converters: :numeric` by default (no opt-in required), and `CSV.read` triggers the same corruption when passed `converters: :numeric` explicitly. Either way, any leading-zero string field — ZIP codes, customer IDs, product codes — is silently converted to a wrong integer.
+
+² The one case where `CSV.table` does better than `CSV.read`: its `header_converters: :symbol` option includes `.strip`, so whitespace is removed from headers (#5). Values (#6) are not stripped — `CSV.table` has the same whitespace-around-values problem. For all other issues `CSV.table` is identical to or worse than `CSV.read`.
+
+> `CSV.table` is a convenience wrapper for `CSV.read` with `headers: true`, `header_converters: :symbol`, and `converters: :numeric`.
+
+---
+
+## The Real Cost of Handling This Yourself
+
+Experienced users of `CSV.read` know some of these gotchas and handle them in post-processing — but not all of them can be: some are serious bugs that will silently corrupt your data regardless. And even for the ones you can handle, manual post-processing has five hidden costs:
+
+* **You hand-craft boilerplate for every use case.** The right fix for whitespace differs when headers have spaces vs. values have spaces vs. both. Encoding handling depends on the source system. There is no generic post-processing snippet — you write a slightly different version every time.
+
+* **You have to remember all of it, every time.** Every new import, service, or data source needs the same gotchas handled — consistently. But boilerplate doesn't enforce itself. A fix you wrote for one importer doesn't automatically apply to the next. The gotchas don't announce themselves — you only catch them if you remember to look.
+
+* **Your boilerplate is probably undertested.** Post-processing code that wraps `CSV.read` rarely gets the same test coverage as business logic. Developers don't think of it as the risky part. Data edge cases — files with blank headers, leading-zero IDs, quoted empty fields, mixed encoding — don't make it into the test suite until they cause a production incident. You don't know what your boilerplate misses until a file breaks it.
+
+> ❓ Do your tests for your CSV wrapper just test the mechanics, or include data corner cases?
+
+* **Your benchmarks probably don't include the boilerplate code.** When you chose `CSV.read`, you probably looked at raw parsing performance — but did you measure the end-to-end cost of your post-processing? Whitespace stripping, header cleanup, empty normalization: none of that is free. Your end-to-end data pipeline is much slower than what you initially measured.
+
+* **One library that handles it predictably and performantly is worth more than the sum of its parts.** The value isn't "these ten cases are covered." It is that you stop maintaining a bespoke cleaning pipeline, stop writing one-off fixes after production surprises, and don't have to worry about test coverage or performance — you can trust that the default behavior handles edge cases sensibly, without silently damaging your data.
+
+Predictable behavior in a well-tested library beats hand-crafted boilerplate that anticipates fewer edge cases.
 
 ---
 
 ## Why These Failures Are Dangerous
 
-Every failure in this list is
+**Every single failure in this list is silent.** No exception, no warning, no log line — your import completes successfully and your data is quietly wrong. That's what makes these issues so dangerous: they don't surface in tests, they don't cause immediate errors, and they're easy to miss during code review.
+
+The root cause is that `CSV.read` is a **tokenizer**, not a data pipeline. It splits bytes into fields and hands them back with no normalization, no validation, and no defensive handling of real-world messiness. Every assumption about what "clean" input looks like is left to the caller.
 
-
+Issue #4 deserves special mention: `CSV.table`'s default `converters: :numeric` silently turns `"00123"` into `83`³ and `"01234"` into `668`³ — values that look like perfectly valid integers. ZIP codes, customer IDs, and product codes are quietly replaced with wrong numbers that pass every validation, get stored in your database, and are indistinguishable from real data until someone notices the numbers don't match.
 
-
+These aren't obscure edge cases. Extra columns, trailing commas, Windows-1252 encoding, duplicate headers, blank header cells, TSV-vs-CSV confusion, leading-zero identifiers, and whitespace-padded values are all common in CSV files exported from Excel, reporting tools, ERP systems, and legacy data pipelines. If your application accepts user-uploaded CSV files, you will encounter these.
 
-
+The defensive post-processing code required to handle all ten cases correctly — octal-safe numeric conversion, whitespace normalization, duplicate header disambiguation, extra column naming, consistent empty value handling, backslash quote escaping, delimiter auto-detection, encoding detection — is non-trivial to write, test, and maintain. Most applications never bother, because the failures are silent.
 
-
+³ These aren't rounding errors or truncations — they are completely different numbers. [Octal](https://en.wikipedia.org/wiki/Octal) is a base-8 number system from the early days of computing, still used in low-level Unix file permissions and C integer literals. It has no place in CSV data. No spreadsheet, ERP system, or database exports ZIP codes or customer IDs in octal — but Ruby CSV silently assumes that's exactly what a leading zero means.
+
+Read on for a detailed explanation and reproducible example for each issue.
 
 ---
 
 ## 1. Extra Columns Without Headers — Values Silently Discarded
 
-When a row has more fields than there are headers, `CSV.read` maps every extra field to the `nil` key. If there are multiple extra fields, they all compete for the same `nil` key — **only the
+When a row has more fields than there are headers, `CSV.read` maps every extra field to the `nil` key. If there are multiple extra fields, they all compete for the same `nil` key — **only the first one survives**, the rest are silently discarded.
 
 ```
 $ cat example1.csv
@@ -78,36 +112,46 @@ Alice , Smith, 30, VIP, Gold ,
 Bob, Jones, 25
 ```
 
-**With Ruby CSV:**
-
 ```ruby
 rows = CSV.read('example1.csv', headers: true).map(&:to_h)
 rows.first
-# => {
-#
+# => {
+# " First Name " => "Alice ",
+# " Last Name " => " Smith",
+# " Age" => " 30",
+# nil => " VIP"
+# ^^^^^^^^^^^^^
+# data from unnamed column with "Gold" is silently lost
+# }
 ```
 
-Alice's row has 6 fields but only 3 headers. The extra fields `"VIP"`, `"Gold"`, and `""` (trailing comma) all land on `nil` —
+Alice's row has 6 fields but only 3 headers. The extra fields `" VIP"`, `" Gold"`, and `""` (trailing comma) all land on `nil` — only the first one wins. No error, no warning.
 
 This is common in real-world exports: tools frequently append audit columns, status flags, or trailing commas that don't correspond to headers.
 
 **`CSV.table` has the same problem.**
 
-**
+**SmarterCSV:** The default `missing_headers: :auto` auto-generates distinct names for extra columns using `missing_header_prefix` (default: `"column_"`). The trailing empty field is dropped by the default `remove_empty_values: true` setting. No data loss.
 
 ```ruby
 rows = SmarterCSV.process('example1.csv')
 rows.first
-# => {
+# => {
+# first_name: "Alice",
+# last_name: "Smith",
+# age: 30,
+# column_4: "VIP",
+# column_5: "Gold"
+# ^^^^^^^^^^^^^^^^
+# extra data columns are handled, no data is lost
+# }
 ```
 
-The default `missing_headers: :auto` auto-generates distinct names for extra columns using `missing_header_prefix` (default: `"column_"`). The trailing empty field is dropped by the default `remove_empty_values: true` setting. No data loss.
-
 ---
 
-## 2. Duplicate Header Names —
+## 2. Duplicate Header Names — Second Value Silently Dropped
 
-When two columns share the same header name, `CSV::Row#to_h` keeps only the **
+When two columns share the same header name, `CSV::Row#to_h` keeps only the **first** value. Later values are silently dropped.
 
 ```
 $ cat example2.csv
@@ -115,18 +159,18 @@ score,name,score
 95,Alice,87
 ```
 
-**With Ruby CSV:**
-
 ```ruby
 rows = CSV.read('example2.csv', headers: true).map(&:to_h)
 rows.first
-# => {"score" => "
-# ^^^
+# => {"score" => "95", "name" => "Alice"}
+# ^^^ second score (87) silently lost
 ```
 
 Common with reporting tool exports that repeat a column (e.g., two date columns both labeled `"Date"`).
 
-
+**`CSV.table` has the same problem.**
+
+**SmarterCSV:** disambiguates duplicate headers by appending a number directly: `:score`, `:score2`, `:score3`.
 
 ```ruby
 rows = SmarterCSV.process('example2.csv')
@@ -136,15 +180,15 @@ rows.first
 
 * The default `duplicate_header_suffix: ""` disambiguates by appending a counter: `:score`, `:score2`, `:score3`.
 * Use `duplicate_header_suffix: '_'` to get `:score_2`, `:score_3`.
-* Set `
+* Set `duplicate_header_suffix: nil` to raise `DuplicateHeaders` instead.
 
 ---
 
-## 3. Empty Header Fields — `
+## 3. Empty Header Fields — `nil` Key Collision
 
-A CSV file with blank header
+A CSV file with blank header fields (e.g., `name,,age`) gives those columns a `nil` key. Multiple blank headers all collide on `nil` — same overwrite problem as issue #1, and only the first value survives.
 
->
+> Note: this is distinct from issue #1. Issue #1 is about extra *data* fields beyond the header count, which get keyed under `nil`. Issue #3 is about blank cells *in the header row itself*, which also get keyed under `nil`.
 
 ```
 $ cat example3.csv
@@ -152,25 +196,23 @@ name,,,age
 Alice,foo,bar,30
 ```
 
-**With Ruby CSV:**
-
 ```ruby
 rows = CSV.read('example3.csv', headers: true).map(&:to_h)
 rows.first
-# => {"name" => "Alice",
-# ^^^ "
+# => {"name" => "Alice", nil => "foo", "age" => "30"}
+# ^^^ "bar" silently lost — both blank headers map to nil, first value wins
 ```
 
-`CSV.table`
+`CSV.table` has the same `nil` key collision:
 
 ```ruby
 rows = CSV.table('example3.csv').map(&:to_h)
 rows.first
-# => {name: "Alice",
-# ^^^ "
+# => {name: "Alice", nil => "foo", age: 30}
+# ^^^ "bar" still silently lost
 ```
 
-**
+**SmarterCSV:** `missing_header_prefix:` (default `"column_"`) auto-generates names for blank headers: `:column_1`, `:column_2`, etc. No collision, no data loss.
 
 ```ruby
 rows = SmarterCSV.process('example3.csv')
@@ -178,47 +220,62 @@ rows.first
 # => {name: "Alice", column_1: "foo", column_2: "bar", age: 30}
 ```
 
-`missing_header_prefix:` (default `"column_"`) auto-generates names for blank headers: `:column_1`, `:column_2`, etc. No collision, no data loss.
-
 ---
 
-## 4.
+## 4. `converters: :numeric` Silently Corrupts Leading-Zero Values as Octal
 
-
+With `converters: :numeric`, numbers that have leading zeroes don't just lose them — the entire number is silently converted to a completely different value³ that looks plausible but is incorrect ❌.
+
+`CSV.table` enables `converters: :numeric` by default without any opt-in, **triggering the bug by default**. `CSV.read` is safe by default, but triggers the same corruption when `converters: :numeric` (or `converters: :integer`) is passed explicitly.
 
 ```
 $ cat example4.csv
-
-
+customer_id,zip_code,amount
+00123,01234,99.50
+00456,90210,9.99
 ```
 
-
-
-
-
+**With Ruby CSV:**
+
+```ruby
+# CSV.table — converters: :numeric on by default, no opt-in needed
+rows = CSV.table('example4.csv').map(&:to_h)
+rows.first
+# => {customer_id: 83, zip_code: 668, amount: 99.5}
+# ^^^ "00123" → 83 (octal 0123 = decimal 83)
+# ^^^ "01234" → 668 (octal 1234 = decimal 668)
+
+# CSV.read with explicit converters: :numeric — same result
+rows = CSV.read('example4.csv', headers: true, converters: :numeric).map(&:to_h)
+rows.first
+# => {"customer_id" => 83, "zip_code" => 668, "amount" => 99.5}
 ```
 
-
+`"00123"` becomes `83`. `"01234"` becomes `668`. ZIP codes, customer IDs, order numbers, product codes — any field with a leading zero becomes a completely wrong integer. No exception, no warning. The resulting values look plausible and pass all type validations.
 
-
+`CSV.read` without converters is safe — strings are returned as-is:
 
 ```ruby
 rows = CSV.read('example4.csv', headers: true).map(&:to_h)
-rows.first
-
-rows.first['name'] # => nil ← first column unreachable
+rows.first
+# => {"customer_id" => "00123", "zip_code" => "01234", "amount" => "99.50"}
 ```
 
-
-
-**With SmarterCSV:**
+**SmarterCSV:**
 
 ```ruby
+# Default (convert_values_to_numeric: true) — decimal conversion, no octal trap
 rows = SmarterCSV.process('example4.csv')
-rows.first
+rows.first
+# => {customer_id: 123, zip_code: 1234, amount: 99.5}
+
+# convert_values_to_numeric: false — preserves strings exactly, including leading zeros
+rows = SmarterCSV.process('example4.csv', convert_values_to_numeric: false)
+rows.first
+# => {customer_id: "00123", zip_code: "01234", amount: "99.50"}
 ```
 
-
+SmarterCSV's default `convert_values_to_numeric: true` uses `to_i` / `to_f`, which always treats strings as decimal — no octal interpretation. Use `convert_values_to_numeric: false` when leading zeros must be preserved (ZIP codes, IDs, product codes).
 
 ---
 
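The octal trap above comes down to plain-Ruby behavior you can verify directly: `Kernel#Integer` honors a leading `0` as an octal radix prefix, while `String#to_i` always parses decimal:

```ruby
# Kernel#Integer honors radix prefixes: a leading 0 means octal.
Integer('00123') # => 83   (octal 123 = decimal 83)
Integer('01234') # => 668  (octal 1234 = decimal 668)

# String#to_i always parses decimal and simply skips leading zeros.
'00123'.to_i # => 123
'01234'.to_i # => 1234
```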
@@ -232,94 +289,71 @@ $ cat example5.csv
 Alice,30
 ```
 
-**With Ruby CSV:**
-
 ```ruby
 rows = CSV.read('example5.csv', headers: true).map(&:to_h)
 rows.first
-# => {" name " => "Alice", " age
+# => {" name " => "Alice", " age" => "30"}
 
-rows.first['name'] # => nil ← key is " name ", not "name"
+rows.first['name'] # => nil ← silent miss; key is " name ", not "name"
 rows.first['age'] # => nil
 ```
 
-
+**`CSV.table` mitigates this:** ² the `:symbol` header converter includes `.strip`, so whitespace is removed from headers. This is the one issue where `CSV.table` behaves better than `CSV.read`.
 
-**
+**SmarterCSV:**
 
 ```ruby
 rows = SmarterCSV.process('example5.csv')
 rows.first
 # => {name: "Alice", age: 30}
 ```
-
 The default setting `strip_whitespace: true` strips leading/trailing whitespace from both headers and values.
 
+
 ---
 
-## 6.
+## 6. Whitespace Around Values — Silent Comparison Failure
 
-`CSV.read`
+`CSV.read` returns field values exactly as they appear in the file — leading spaces, trailing spaces, and tab characters all preserved. Exporters from fixed-width database systems (Oracle `CHAR` columns, COBOL-era systems) routinely pad string fields to a fixed width; other tools leave accidental leading spaces. The values look correct when printed, but equality checks silently return `false`.
 
-
+This pairs with Example 5 (whitespace in headers): Ruby CSV strips neither headers nor values by default.
 
 ```
 $ cat example6.csv
-name,
-Alice,
-Bob,
+name,status,city
+Alice,active ,New York ← trailing spaces after 'active'
+Bob,inactive,Chicago
+Carol, active,Boston ← leading space before 'active'
 ```
 
-**With Ruby CSV:**
-
 ```ruby
-
-CSV.read('example6.csv', headers: true)
-# => CSV::MalformedCSVError: Unclosed quoted field on line 2
+rows = CSV.read('example6.csv', headers: true).map(&:to_h)
 
-#
-rows
-
-rows[
-# =>
-# ^^^ Alice's note field swallowed the rest of the file; Bob vanished
+rows[0]['status'] # => "active "
+rows[2]['status'] # => " active"
+
+rows.select { |r| r['status'] == 'active' }
+# => [] ← Alice and Carol are not found. No error raised.
 ```
 
-The
+The values look fine in logs and `puts` output. The bug only surfaces when the comparison silently returns the wrong result.
 
-**
+**Workaround:** pass `strip: true` to `CSV.read`. This correctly strips spaces and tab characters. Note it also strips intentional leading/trailing spaces from any field — including quoted fields where spaces may be meaningful.
 
-
-reader = SmarterCSV::Reader.new('example6.csv', on_bad_row: :collect)
-good_rows = reader.process
-reader.errors
-# => {
-# :bad_row_count => 1,
-# :bad_rows => [
-# {
-# :csv_line_number => 2,
-# :file_line_number => 2,
-# :file_lines_consumed => 2,
-# :error_class => SmarterCSV::MalformedCSV,
-# :error_message => "Unclosed quoted field detected in multiline data",
-# :raw_logical_line => "Alice,\"unclosed quote,99\nBob,normal,87\n"
-# }
-# ]
-# }
-```
+**`CSV.table` has the same problem** — its `:symbol` converter strips header names but does not touch field values.
 
-
+**SmarterCSV:**
 
 ```ruby
-
-
-
+rows = SmarterCSV.process('example6.csv')
+
+rows[0][:status] # => "active"
+rows[2][:status] # => "active"
+
+rows.select { |r| r[:status] == 'active' }.length # => 2
 ```
 
-
-* `on_bad_row: :collect` quarantines them — use `reader.errors` to access.
-* `on_bad_row: ->(rec) { ... }` calls your lambda per bad row; works with `SmarterCSV.process`.
-* `on_bad_row: :skip` discards bad rows silently.
+`strip_whitespace: true` (default) strips all leading and trailing whitespace (spaces and tabs) from values. Set `strip_whitespace: false` to preserve spaces when needed.
 
 ---
 
@@ -337,8 +371,6 @@ Alice,
|
|
|
337
371
|
Bob,""
|
|
338
372
|
```
|
|
339
373
|
|
|
340
|
-
**With Ruby CSV:**
|
|
341
|
-
|
|
342
374
|
```ruby
|
|
343
375
|
rows = CSV.read('example7.csv', headers: true).map(&:to_h)
|
|
344
376
|
|
|
@@ -349,9 +381,11 @@ rows[0]['city'].nil? # => true
|
|
|
349
381
|
rows[1]['city'].nil? # => false ← same semantic meaning, different Ruby type
|
|
350
382
|
```
|
|
351
383
|
|
|
352
|
-
Both rows have no city
|
|
384
|
+
Both rows have no city, but your code sees two different things. Any check using `.nil?`, `.blank?`, `.present?`, or a simple `if row['city']` will behave differently depending on how the upstream exporter happened to quote the empty field. Exporters do not agree on which form to emit, so you cannot rely on either.
|
|
353
385
|
|
|
354
|
-
|
|
386
|
+
**`CSV.table` has the same problem.**
|
|
387
|
+
|
|
388
|
+
**SmarterCSV:** `remove_empty_values: true` (default) removes both from the hash. With `remove_empty_values: false`, both are normalized to `""`. Consistent either way.
|
|
355
389
|
|
|
356
390
|
```ruby
|
|
357
391
|
# remove_empty_values: true (default) — both empty cities are dropped from the hash
|
|
@@ -359,17 +393,17 @@ rows = SmarterCSV.process('example7.csv')
|
|
|
359
393
|
rows[0] # => {name: "Alice"}
|
|
360
394
|
rows[1] # => {name: "Bob"}
|
|
361
395
|
|
|
362
|
-
# remove_empty_values: false — both normalized to
|
|
396
|
+
# remove_empty_values: false — both normalized to ""
|
|
363
397
|
rows = SmarterCSV.process('example7.csv', remove_empty_values: false)
|
|
364
|
-
rows[0] # => {name: "Alice", city:
|
|
365
|
-
rows[1] # => {name: "Bob", city:
|
|
398
|
+
rows[0] # => {name: "Alice", city: ""}
|
|
399
|
+
rows[1] # => {name: "Bob", city: ""}
|
|
366
400
|
```
|
|
367
401
|
|
|
368
402
|
---
|
|
369
403
|
|
|
370
404
|
## 8. Backslash-Escaped Quotes — MySQL / Unix Dump Format
|
|
371
405
|
|
|
372
|
-
MySQL's `SELECT INTO OUTFILE`, PostgreSQL `COPY TO`, and many Unix data-pipeline tools escape embedded double quotes as `\"` — not as `""` (the RFC 4180 standard). Ruby's `CSV` only understands RFC 4180, so a backslash before a quote is treated as two separate characters: a literal `\` followed by a `"` that immediately **closes the field**.
|
|
406
|
+
MySQL's `SELECT INTO OUTFILE`, PostgreSQL `COPY TO`, and many Unix data-pipeline tools escape embedded double quotes as `\"` — not as `""` (the RFC 4180 standard). Ruby's `CSV` only understands the RFC 4180 convention, so a backslash before a quote is treated as two separate characters: a literal `\` followed by a `"` that immediately **closes the field**.
|
|
373
407
|
|
|
374
408
|
```
|
|
375
409
|
$ cat example8.csv
|
|
@@ -378,92 +412,69 @@ Alice,"She said \"hello\" to everyone"
|
|
|
378
412
|
Bob,"Normal note"
|
|
379
413
|
```
|
|
380
414
|
|
|
381
|
-
**
|
|
415
|
+
**Scenario 1 — crash** (at least you know something went wrong):
|
|
382
416
|
|
|
383
417
|
```ruby
|
|
384
418
|
rows = CSV.read('example8.csv', headers: true)
|
|
385
|
-
# => CSV::MalformedCSVError:
|
|
419
|
+
# => CSV::MalformedCSVError: Any value after quoted field isn't allowed in line 2.
|
|
386
420
|
```
|
|
387
421
|
|
|
388
|
-
**
|
|
422
|
+
**Scenario 2 — silent garbling** with `liberal_parsing: true`:
|
|
389
423
|
|
|
390
424
|
```ruby
|
|
391
425
|
rows = CSV.read('example8.csv', headers: true, liberal_parsing: true)
|
|
392
|
-
rows[0]['
|
|
393
|
-
rows[0]['note'] # => "She said \\" ← field closed at the backslash-quote; rest lost
|
|
394
|
-
rows[1]['name'] # => "hello" ← Alice's leftovers eaten as Bob's name
|
|
395
|
-
rows[1]['note'] # => nil
|
|
426
|
+
rows[0]['note'] # => '"She said \"hello\" to everyone"' ← wrapping quotes and backslashes kept literally
|
|
396
427
|
```
|
|
397
428
|
|
|
398
|
-
No exception. No warning.
|
|
429
|
+
No exception. No warning. The note field has extra wrapping quotes and mangled escaping — it won't compare, display, or serialize correctly.
|
|
430
|
+
|
|
431
|
+
**`CSV.table` has the same problem** — and adding `liberal_parsing: true` makes it silently worse.
|
|
399
432
|
|
|
400
|
-
**
|
|
433
|
+
**SmarterCSV:** `quote_escaping: :auto` (default since 1.0) detects and handles both `""` and `\"` escaping row-by-row. No option required.
|
|
401
434
|
|
|
402
435
|
```ruby
|
|
403
436
|
rows = SmarterCSV.process('example8.csv')
|
|
404
|
-
rows[0] # => {name: "Alice", note:
|
|
437
|
+
rows[0] # => {name: "Alice", note: 'She said "hello" to everyone'}
|
|
405
438
|
rows[1] # => {name: "Bob", note: "Normal note"}
|
|
406
439
|
```
|
|
407
440
|
|
|
408
|
-
`quote_escaping: :auto` (default) detects and handles both `""` and `\"` escaping row-by-row. No option required. This covers MySQL `SELECT INTO OUTFILE`, PostgreSQL `COPY TO`, and Unix `csvkit`/`awk`-generated files.
|
|
409
|
-
|
|
410
441
|
---
|
|
411
442
|
|
|
412
|
-
## 9.
|
|
443
|
+
## 9. TSV File Read as CSV — Completely Breaks ❌
|
|
413
444
|
|
|
414
|
-
|
|
445
|
+
`CSV.read` defaults to `col_sep: ","`. When given a tab-delimited file (TSV), it finds no commas and treats each entire row as a single field. The header row becomes one giant key; each data row becomes one giant value. All column structure is silently lost — no error, no warning, and `rows.length` looks correct.
|
|
415
446
|
|
|
416
447
|
```
|
|
417
|
-
$ cat
|
|
418
|
-
name
|
|
419
|
-
|
|
420
|
-
Bob
|
|
421
|
-
Carol,40
|
|
448
|
+
$ cat example9.csv
|
|
449
|
+
name city score
|
|
450
|
+
Alice New York 95
|
|
451
|
+
Bob Chicago 87
|
|
422
452
|
```
|
|
423
453
|
|
|
424
|
-
**With Ruby CSV:**
|
|
425
|
-
|
|
426
454
|
```ruby
|
|
427
|
-
rows = CSV.read('
|
|
428
|
-
rows.length # => 1 (not 3)
|
|
429
|
-
rows.first['name'] # => "Alice,30\nBob,25\nCarol,40"
|
|
430
|
-
# ^^^ entire remainder of file in one field
|
|
431
|
-
```
|
|
455
|
+
rows = CSV.read('example9.csv', headers: true).map(&:to_h)
|
|
432
456
|
|
|
433
|
-
|
|
457
|
+
rows.length # => 2 (looks right — but...)
|
|
458
|
+
rows.first.keys # => ["name\tcity\tscore"] ← entire header is one key
|
|
459
|
+
rows.first['name'] # => nil ← column unreachable
|
|
460
|
+
rows.first.values # => ["Alice\tNew York\t95"] ← entire row is one value
|
|
461
|
+
```
|
|
434
462
|
|
|
435
|
-
|
|
463
|
+
This can happen when users upload a TSV file instead of a CSV file: the file name may still end in `.csv`, so nothing about the name distinguishes it from real comma-separated data.
|
|
436
464
|
|
|
437
|
-
|
|
438
|
-
reader = SmarterCSV::Reader.new('example8.csv',
|
|
439
|
-
on_bad_row: :collect,
|
|
440
|
-
)
|
|
441
|
-
good_rows = reader.process
|
|
442
|
-
reader.errors
|
|
443
|
-
# => {
|
|
444
|
-
# :bad_row_count => 1,
|
|
445
|
-
# :bad_rows => [
|
|
446
|
-
# {
|
|
447
|
-
# :csv_line_number => 2,
|
|
448
|
-
# :file_line_number => 2,
|
|
449
|
-
# :file_lines_consumed => 3,
|
|
450
|
-
# :error_class => SmarterCSV::MalformedCSV,
|
|
451
|
-
# :error_message => "Unclosed quoted field detected in multiline data",
|
|
452
|
-
# :raw_logical_line => "\"Alice,30\nBob,25\nCarol,40\n"
|
|
453
|
-
# }
|
|
454
|
-
# ]
|
|
455
|
-
# }
|
|
456
|
-
```
|
|
465
|
+
**`CSV.table` has the same problem.**
|
|
457
466
|
|
|
458
|
-
|
|
467
|
+
**SmarterCSV:**
|
|
459
468
|
|
|
460
469
|
```ruby
|
|
461
|
-
|
|
462
|
-
|
|
463
|
-
|
|
470
|
+
rows = SmarterCSV.process('example9.csv')
|
|
471
|
+
# col_sep: :auto detects the tab separator automatically
|
|
472
|
+
|
|
473
|
+
rows.first
|
|
474
|
+
# => {name: "Alice", city: "New York", score: 95}
|
|
464
475
|
```
|
|
465
476
|
|
|
466
|
-
`
|
|
477
|
+
`col_sep: :auto` (default) samples the file and detects the actual delimiter. No option required.
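For intuition, here is a naive sketch of delimiter sniffing. This is an illustration only, not SmarterCSV's actual implementation:

```ruby
# Naive delimiter sniffing: count each candidate separator across a few
# sample lines and pick the most frequent one.
def sniff_col_sep(sample_lines, candidates = [',', "\t", ';', '|'])
  counts = candidates.to_h { |sep| [sep, sample_lines.sum { |line| line.count(sep) }] }
  sep, count = counts.max_by { |_, n| n }
  count.zero? ? ',' : sep  # fall back to comma when no candidate appears
end

sniff_col_sep(["name\tcity\tscore", "Alice\tNew York\t95"])  # => "\t"
```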
|
|
467
478
|
|
|
468
479
|
---
|
|
469
480
|
|
|
@@ -472,42 +483,62 @@ good_rows = SmarterCSV.process('example8.csv',
|
|
|
472
483
|
`CSV.read` assumes UTF-8. CSV files exported from Excel on Windows are typically Windows-1252 (CP1252), which encodes accented characters (é, ü, ñ) differently from UTF-8.
|
|
473
484
|
|
|
474
485
|
```
|
|
475
|
-
$ cat
|
|
486
|
+
$ cat example10.csv
|
|
476
487
|
last_name,first_name
|
|
477
488
|
Müller,Hans
|
|
478
489
|
```
|
|
479
490
|
|
|
480
491
|
The file is saved in Windows-1252 encoding — `ü` is stored as `\xFC`, not as UTF-8.
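Plain `String#encode` illustrates the transcoding involved; this sketch is independent of any CSV library:

```ruby
# \xFC is "ü" in Windows-1252; transcoding converts it to proper UTF-8.
cp1252 = "M\xFCller".b.force_encoding('Windows-1252')
cp1252.encode('UTF-8')  # => "Müller"

# Invalid bytes can be dropped instead of raising an error:
"bad\xFFbyte".force_encoding('UTF-8').scrub('')  # => "badbyte"
```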
|
|
481
492
|
|
|
482
|
-
**
|
|
493
|
+
**Scenario 1 — crash** (the better outcome — at least you know):
|
|
483
494
|
|
|
484
495
|
```ruby
|
|
485
|
-
rows = CSV.read('
|
|
486
|
-
# =>
|
|
496
|
+
rows = CSV.read('example10.csv', headers: true)
|
|
497
|
+
# => CSV::InvalidEncodingError: Invalid byte sequence in UTF-8 in line 2.
|
|
487
498
|
```
|
|
488
499
|
|
|
489
|
-
**
|
|
500
|
+
**Scenario 2 — silent mojibake** (the worse outcome):
|
|
490
501
|
|
|
491
502
|
```ruby
|
|
492
503
|
# Specifying the wrong encoding suppresses the error
|
|
493
|
-
rows = CSV.read('
|
|
504
|
+
rows = CSV.read('example10.csv', headers: true, encoding: 'binary')
|
|
494
505
|
rows.first['last_name'] # => "M\xFCller" ← garbled string
|
|
495
|
-
rows.first['last_name'].valid_encoding? # => true ← Ruby thinks it's fine
|
|
506
|
+
rows.first['last_name'].valid_encoding? # => true ← Ruby thinks it's fine!
|
|
496
507
|
```
|
|
497
508
|
|
|
498
|
-
The mojibake string passes `.valid_encoding?`, passes database validations, gets stored, and surfaces as a display bug in production.
|
|
509
|
+
The mojibake string passes `.valid_encoding?`, passes database validations, gets stored, and surfaces as a display bug weeks later in production.
|
|
510
|
+
|
|
511
|
+
**`CSV.table` has the same problem.**
|
|
499
512
|
|
|
500
|
-
**
|
|
513
|
+
**SmarterCSV:** `file_encoding:` accepts Ruby's `'external:internal'` transcoding notation; `force_utf8: true` transcodes to UTF-8 automatically; `invalid_byte_sequence:` sets the replacement string for bytes that cannot be transcoded, e.g. `''` to drop them.
|
|
501
514
|
|
|
502
515
|
```ruby
|
|
503
|
-
rows = SmarterCSV.process('
|
|
516
|
+
rows = SmarterCSV.process('example10.csv',
|
|
504
517
|
file_encoding: 'windows-1252:utf-8')
|
|
505
518
|
rows.first[:last_name] # => "Müller"
|
|
506
519
|
```
|
|
507
520
|
|
|
508
|
-
|
|
509
|
-
|
|
510
|
-
|
|
521
|
+
---
|
|
522
|
+
|
|
523
|
+
## The Alternative
|
|
524
|
+
|
|
525
|
+
```ruby
|
|
526
|
+
gem 'smarter_csv'
|
|
527
|
+
```
|
|
528
|
+
|
|
529
|
+
```ruby
|
|
530
|
+
# Before
|
|
531
|
+
rows = CSV.read('data.csv', headers: true).map(&:to_h)
|
|
532
|
+
|
|
533
|
+
# After
|
|
534
|
+
rows = SmarterCSV.process('data.csv')
|
|
535
|
+
```
|
|
536
|
+
|
|
537
|
+
SmarterCSV handles nine of the ten cases out of the box, including octal-safe numeric conversion, whitespace normalization, duplicate header disambiguation, extra column naming, consistent empty value handling, backslash quote escaping, and delimiter auto-detection.
|
|
538
|
+
|
|
539
|
+
The remaining one (encoding control) requires explicit opt-in options, but the building blocks are there. No boilerplate, no post-processing pipeline, no silent data loss.
|
|
540
|
+
|
|
541
|
+
> **Ready to switch?** → [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
511
542
|
|
|
512
543
|
---
|
|
513
544
|
|
|
@@ -62,7 +62,7 @@ module SmarterCSV
|
|
|
62
62
|
# Apply value converters
|
|
63
63
|
if value_converters
|
|
64
64
|
converter = value_converters[k]
|
|
65
|
-
hash[k] = converter.convert(hash[k]) if converter
|
|
65
|
+
hash[k] = converter.respond_to?(:convert) ? converter.convert(hash[k]) : converter.call(hash[k]) if converter
|
|
66
66
|
end
|
|
67
67
|
end
|
|
68
68
|
|
|
@@ -27,19 +27,21 @@ module SmarterCSV
|
|
|
27
27
|
|
|
28
28
|
def disambiguate_headers(headers, options)
|
|
29
29
|
counts = Hash.new(0)
|
|
30
|
-
empty_count = 0
|
|
31
30
|
prefix = options[:missing_header_prefix] || 'column_'
|
|
32
31
|
# Pre-collect non-blank header names so auto-generated names can avoid collisions.
|
|
33
32
|
used = headers.reject { |h| blank?(h) }
|
|
34
|
-
headers.map do |header|
|
|
33
|
+
headers.each_with_index.map do |header, idx|
|
|
35
34
|
if blank?(header)
|
|
36
|
-
#
|
|
37
|
-
#
|
|
38
|
-
#
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
35
|
+
# Use absolute 1-based column position, consistent with how extra data columns
|
|
36
|
+
# beyond the header count are named. If the positional name collides with an
|
|
37
|
+
# existing header, append underscores until a free name is found — this avoids
|
|
38
|
+
# stealing the positional name from any subsequent blank header.
|
|
39
|
+
candidate = "#{prefix}#{idx + 1}"
|
|
40
|
+
suffix = ''
|
|
41
|
+
while used.include?(candidate)
|
|
42
|
+
suffix += '_'
|
|
43
|
+
candidate = "#{prefix}#{idx + 1}#{suffix}"
|
|
44
|
+
end
|
|
43
45
|
used << candidate
|
|
44
46
|
candidate
|
|
45
47
|
else
|
data/lib/smarter_csv/reader.rb
CHANGED
|
@@ -357,7 +357,7 @@ module SmarterCSV
|
|
|
357
357
|
|
|
358
358
|
if options[:value_converters]
|
|
359
359
|
options[:value_converters].each do |key, converter|
|
|
360
|
-
hash[key] = converter.convert(hash[key]) if hash.key?(key)
|
|
360
|
+
hash[key] = converter.respond_to?(:convert) ? converter.convert(hash[key]) : converter.call(hash[key]) if hash.key?(key)
|
|
361
361
|
end
|
|
362
362
|
end
|
|
363
363
|
else
|
|
@@ -755,7 +755,7 @@ module SmarterCSV
|
|
|
755
755
|
|
|
756
756
|
if options[:value_converters]
|
|
757
757
|
options[:value_converters].each do |key, converter|
|
|
758
|
-
hash[key] = converter.convert(hash[key]) if hash.key?(key)
|
|
758
|
+
hash[key] = converter.respond_to?(:convert) ? converter.convert(hash[key]) : converter.call(hash[key]) if hash.key?(key)
|
|
759
759
|
end
|
|
760
760
|
end
|
|
761
761
|
else
|
data/lib/smarter_csv/version.rb
CHANGED
data/lib/smarter_csv/writer.rb
CHANGED
|
@@ -149,7 +149,7 @@ module SmarterCSV
|
|
|
149
149
|
|
|
150
150
|
def write_header_line
|
|
151
151
|
mapped_headers = @headers.map { |header| @map_headers[header] || header }
|
|
152
|
-
mapped_headers =
|
|
152
|
+
mapped_headers = mapped_headers.map { |header| @header_converter.call(header) } if @header_converter
|
|
153
153
|
force_quotes = @quote_headers || @force_quotes
|
|
154
154
|
mapped_headers = mapped_headers.map { |x| escape_csv_field(x, force_quotes) }
|
|
155
155
|
@output_file.write(mapped_headers.join(@col_sep) + @row_sep) unless mapped_headers.empty?
|
metadata
CHANGED
|
@@ -1,13 +1,13 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: smarter_csv
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 1.16.
|
|
4
|
+
version: 1.16.2
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Tilo Sloboda
|
|
8
8
|
bindir: bin
|
|
9
9
|
cert_chain: []
|
|
10
|
-
date: 2026-03-
|
|
10
|
+
date: 2026-03-30 00:00:00.000000000 Z
|
|
11
11
|
dependencies: []
|
|
12
12
|
description: |
|
|
13
13
|
SmarterCSV is a high-performance CSV reader and writer for Ruby focused on
|