smarter_csv 1.17.1 → 1.17.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +12 -3
- data/CONTRIBUTORS.md +2 -1
- data/README.md +4 -4
- data/ext/smarter_csv/smarter_csv.c +111 -53
- data/lib/smarter_csv/parser.rb +36 -10
- data/lib/smarter_csv/version.rb +1 -1
- metadata +2 -3
- data/TO_DO.md +0 -109
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 2e665f0dc98db44950aa9cbb2cac430068e91df8886062068413dfbcefc74fc3
|
|
4
|
+
data.tar.gz: def43fb66886b16ec13bd429b4fd6923b09aa1a01757a696390a38c18b59fa31
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 2fb7793ed4eca64cfef1f7dd82a417b44988832280b373c1748213d9f7c879cd0a2d17c4e3b72c82be6acedb01b0fec26b70e6daaefb645ee2c3bf64b7aedcd8
|
|
7
|
+
data.tar.gz: a0b8842d5a69d8526af81d4e2a64c31fd6a54d6d610c5ce0dbb16298dfd03c3d296546c049a4a76a710908390ec1ea1739bc7530b1ab652c7ead1ceaa02b431d
|
data/CHANGELOG.md
CHANGED
|
@@ -1,8 +1,17 @@
|
|
|
1
1
|
|
|
2
2
|
# SmarterCSV 1.x Change Log
|
|
3
3
|
|
|
4
|
+
## 1.17.2 (2026-05-21)
|
|
4
5
|
|
|
5
|
-
|
|
6
|
+
RSpec tests: **2,220→ 2,274** (+54 tests)
|
|
7
|
+
|
|
8
|
+
### Bug Fix
|
|
9
|
+
|
|
10
|
+
- fixed [Issue #334](https://github.com/tilo/smarter_csv/issues/334) with escaped double quote followed by comma. Thanks to [conorg](https://github.com/conorg)
|
|
11
|
+
- fixed bug when using `headers: { except: }`
|
|
12
|
+
- added more tests
|
|
13
|
+
|
|
14
|
+
## 1.17.1 (2026-05-17)
|
|
6
15
|
|
|
7
16
|
RSpec tests: **2,210→ 2,220** (+10 tests)
|
|
8
17
|
|
|
@@ -64,7 +73,7 @@ Measured against 1.16.4 (Apple M4, Ruby 3.4.7):
|
|
|
64
73
|
|
|
65
74
|
Per-file breakdown: [`docs/releases/1.17.0/performance_notes.md`](docs/releases/1.17.0/performance_notes.md).
|
|
66
75
|
|
|
67
|
-
## 1.16.5 (2026-05-
|
|
76
|
+
## 1.16.5 (2026-05-17)
|
|
68
77
|
|
|
69
78
|
### Bug Fix
|
|
70
79
|
|
|
@@ -222,7 +231,7 @@ Measured on 19 benchmark files, Apple M1, Ruby 3.4.7. See [benchmarks](docs/rele
|
|
|
222
231
|
* **Writer temp file** no longer hardcoded to `/tmp` (fixes Windows); properly cleaned up with `Tempfile#close!`.
|
|
223
232
|
* **Writer `StringIO`**: `finalize` no longer attempts to close a caller-owned `StringIO`.
|
|
224
233
|
|
|
225
|
-
## 1.15.3 (2026-05-
|
|
234
|
+
## 1.15.3 (2026-05-17)
|
|
226
235
|
|
|
227
236
|
### Bug Fix
|
|
228
237
|
|
data/CONTRIBUTORS.md
CHANGED
|
@@ -1,4 +1,4 @@
|
|
|
1
|
-
# A Big Thank You to all
|
|
1
|
+
# A Big Thank You to all 64 Contributors!!
|
|
2
2
|
|
|
3
3
|
|
|
4
4
|
A Big Thank you to everyone who filed issues, sent comments, and who contributed with pull requests:
|
|
@@ -66,3 +66,4 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
|
|
|
66
66
|
* [Dom Lebron](https://github.com/biglebronski)
|
|
67
67
|
* [Paho Lurie-Gregg](https://github.com/paholg)
|
|
68
68
|
* [Jonas Staškevičius](https://github.com/pirminis)
|
|
69
|
+
* [conorg](https://github.com/conorg)
|
data/README.md
CHANGED
|
@@ -1,8 +1,8 @@
|
|
|
1
1
|
|
|
2
2
|
# SmarterCSV
|
|
3
3
|
|
|
4
|
-
|
|
5
|
-
|
|
4
|
+
 [](https://codecov.io/gh/tilo/smarter_csv) [](https://rubygems.org/gems/smarter_csv) [](https://rubygems.org/gems/smarter_csv) [](https://www.ruby-toolbox.com/projects/smarter_csv)
|
|
5
|
+
|
|
6
6
|
SmarterCSV is a high-performance CSV ingestion and generation for Ruby, focused on fast end-to-end CSV ingestion of real-world data — no silent failures, no surprises, not just tokenization.
|
|
7
7
|
|
|
8
8
|
⭐ If SmarterCSV saved you hours of import time, please star the repo, and consider sponsoring this project.
|
|
@@ -311,7 +311,7 @@ Or install it yourself as:
|
|
|
311
311
|
* [Examples](docs/examples.md)
|
|
312
312
|
* [Real-World CSV Files](docs/real_world_csv.md)
|
|
313
313
|
* [SmarterCSV over the Years](docs/history.md)
|
|
314
|
-
* [Release Notes](docs/releases/1.
|
|
314
|
+
* [Release Notes](docs/releases/1.17.0/changes.md)
|
|
315
315
|
|
|
316
316
|
## Articles
|
|
317
317
|
* [Parsing CSV Files in Ruby with SmarterCSV](https://tilo-sloboda.medium.com/parsing-csv-files-in-ruby-with-smartercsv-6ce66fb6cf38)
|
|
@@ -333,7 +333,7 @@ For reporting issues, please:
|
|
|
333
333
|
* open a pull-request adding a test that demonstrates the issue
|
|
334
334
|
* mention your version of SmarterCSV, Ruby, Rails
|
|
335
335
|
|
|
336
|
-
# [A Special Thanks to all
|
|
336
|
+
# [A Special Thanks to all 64 Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
|
|
337
337
|
|
|
338
338
|
|
|
339
339
|
## Contributing
|
data/ext/smarter_csv/smarter_csv.c
CHANGED
|
@@ -331,25 +331,40 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
|
|
|
331
331
|
if (!allow_escaped_quotes || backslash_count % 2 == 0) {
|
|
332
332
|
if (__builtin_expect(quote_boundary_standard, 1)) {
|
|
333
333
|
if (in_quotes) {
|
|
334
|
-
|
|
335
|
-
|
|
336
|
-
|
|
337
|
-
|
|
338
|
-
|
|
339
|
-
|
|
334
|
+
if (p + 2 < endP && *(p + 1) == quote_char_val) {
|
|
335
|
+
/* RFC doubled quote inside a quoted field ("" → ").
|
|
336
|
+
* Give this precedence over the closing-quote check, but only
|
|
337
|
+
* when another byte follows the doubled pair.
|
|
338
|
+
*
|
|
339
|
+
* Compatibility note: we intentionally do NOT force terminal
|
|
340
|
+
* "" to be consumed here. SmarterCSV has a long-standing lenient
|
|
341
|
+
* behavior for malformed tails like ...\"" in :double_quotes mode:
|
|
342
|
+
* the final quote may still close the field instead of turning the
|
|
343
|
+
* row into an unclosed-quote error. Issue #334 needs doubled-quote
|
|
344
|
+
* precedence for ..."",... (more content follows), but we keep the
|
|
345
|
+
* historical leniency for terminal ..."". */
|
|
346
|
+
p++;
|
|
347
|
+
} else {
|
|
348
|
+
// closing quote: only valid if followed by col_sep, row_sep, or end of line
|
|
349
|
+
bool valid_close = (p + 1 >= endP);
|
|
350
|
+
if (!valid_close) {
|
|
351
|
+
valid_close = true;
|
|
352
|
+
for (long j = 0; j < col_sep_len; j++) {
|
|
353
|
+
if (*(p + 1 + j) != *(col_sepP + j)) { valid_close = false; break; }
|
|
354
|
+
}
|
|
340
355
|
}
|
|
341
|
-
|
|
342
|
-
|
|
343
|
-
|
|
344
|
-
|
|
345
|
-
|
|
356
|
+
if (!valid_close && row_sep_len > 0) {
|
|
357
|
+
valid_close = true;
|
|
358
|
+
for (long j = 0; j < row_sep_len; j++) {
|
|
359
|
+
if (*(p + 1 + j) != *(row_sepP + j)) { valid_close = false; break; }
|
|
360
|
+
}
|
|
346
361
|
}
|
|
362
|
+
if (valid_close) {
|
|
363
|
+
in_quotes = false;
|
|
364
|
+
field_started = true;
|
|
365
|
+
}
|
|
366
|
+
// else: quote inside quoted field → literal
|
|
347
367
|
}
|
|
348
|
-
if (valid_close) {
|
|
349
|
-
in_quotes = false;
|
|
350
|
-
field_started = true;
|
|
351
|
-
}
|
|
352
|
-
// else: quote inside quoted field → literal (handles "" doubling)
|
|
353
368
|
} else if (!field_started) {
|
|
354
369
|
in_quotes = true; // opening quote at field boundary
|
|
355
370
|
field_started = true;
|
|
@@ -829,6 +844,11 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
|
|
|
829
844
|
* the frame stays well below 4 KB and ___chkstk_darwin never fires on ARM64 macOS.
|
|
830
845
|
*/
|
|
831
846
|
bool *keep_bitmap = NULL;
|
|
847
|
+
/* In THIS (non-ctx) function the bitmap is alloca'd to headers_len on every call (see the alloca
|
|
848
|
+
* sites below), so keep_bitmap[] is exactly headers_len long and headers_len is the correct bound
|
|
849
|
+
* at all access sites. Do NOT mirror rb_parse_line_to_hash_ctx's keep_bitmap_len here: that variant
|
|
850
|
+
* caches its bitmap across rows (where @headers can grow), so it must use the captured length; this
|
|
851
|
+
* one rebuilds per call and does not. */
|
|
832
852
|
bool keep_extra_columns = true; /* extra cols (> headers_len): keep by default */
|
|
833
853
|
bool has_only = false; /* true when only_headers: filtering is active */
|
|
834
854
|
long early_exit_after = -1; /* column index after which we stop; -1 = no early exit */
|
|
@@ -1147,25 +1167,40 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
|
|
|
1147
1167
|
if (!allow_escaped_quotes || backslash_count % 2 == 0) {
|
|
1148
1168
|
if (__builtin_expect(quote_boundary_standard, 1)) {
|
|
1149
1169
|
if (in_quotes) {
|
|
1150
|
-
|
|
1151
|
-
|
|
1152
|
-
|
|
1153
|
-
|
|
1154
|
-
|
|
1155
|
-
|
|
1170
|
+
if (p + 2 < endP && *(p + 1) == quote_char_val) {
|
|
1171
|
+
/* RFC doubled quote inside a quoted field ("" → ").
|
|
1172
|
+
* Give this precedence over the closing-quote check, but only
|
|
1173
|
+
* when another byte follows the doubled pair.
|
|
1174
|
+
*
|
|
1175
|
+
* Compatibility note: we intentionally do NOT force terminal
|
|
1176
|
+
* "" to be consumed here. SmarterCSV has a long-standing lenient
|
|
1177
|
+
* behavior for malformed tails like ...\"" in :double_quotes mode:
|
|
1178
|
+
* the final quote may still close the field instead of turning the
|
|
1179
|
+
* row into an unclosed-quote error. Issue #334 needs doubled-quote
|
|
1180
|
+
* precedence for ..."",... (more content follows), but we keep the
|
|
1181
|
+
* historical leniency for terminal ..."". */
|
|
1182
|
+
p++;
|
|
1183
|
+
} else {
|
|
1184
|
+
// closing quote: only valid if followed by col_sep, row_sep, or end of line
|
|
1185
|
+
bool valid_close = (p + 1 >= endP);
|
|
1186
|
+
if (!valid_close) {
|
|
1187
|
+
valid_close = true;
|
|
1188
|
+
for (long j = 0; j < col_sep_len; j++) {
|
|
1189
|
+
if (*(p + 1 + j) != *(col_sepP + j)) { valid_close = false; break; }
|
|
1190
|
+
}
|
|
1156
1191
|
}
|
|
1157
|
-
|
|
1158
|
-
|
|
1159
|
-
|
|
1160
|
-
|
|
1161
|
-
|
|
1192
|
+
if (!valid_close && row_sep_len2 > 0) {
|
|
1193
|
+
valid_close = true;
|
|
1194
|
+
for (long j = 0; j < row_sep_len2; j++) {
|
|
1195
|
+
if (*(p + 1 + j) != *(row_sepP2 + j)) { valid_close = false; break; }
|
|
1196
|
+
}
|
|
1162
1197
|
}
|
|
1198
|
+
if (valid_close) {
|
|
1199
|
+
in_quotes = false;
|
|
1200
|
+
field_started = true;
|
|
1201
|
+
}
|
|
1202
|
+
// else: quote inside quoted field → literal
|
|
1163
1203
|
}
|
|
1164
|
-
if (valid_close) {
|
|
1165
|
-
in_quotes = false;
|
|
1166
|
-
field_started = true;
|
|
1167
|
-
}
|
|
1168
|
-
// else: quote inside quoted field → literal (handles "" doubling)
|
|
1169
1204
|
} else if (!field_started) {
|
|
1170
1205
|
in_quotes = true; // opening quote at field boundary
|
|
1171
1206
|
field_started = true;
|
|
@@ -1495,6 +1530,14 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
|
|
|
1495
1530
|
int numeric_mode = ctx->numeric_mode;
|
|
1496
1531
|
VALUE numeric_keys = ctx->numeric_keys;
|
|
1497
1532
|
bool *keep_bitmap = ctx->keep_bitmap;
|
|
1533
|
+
/* keep_bitmap is cached in the context (xmalloc'd once at construction, sized to the header count
|
|
1534
|
+
* THEN). @headers can grow in place as undeclared extra columns appear, so the live headers_len
|
|
1535
|
+
* (re-read each call below) may exceed the bitmap's length. Every keep_bitmap[] access in this
|
|
1536
|
+
* function MUST be bounded by keep_bitmap_len, never headers_len — indices past the bitmap are
|
|
1537
|
+
* extra columns and follow keep_extra_columns. Bounding by the grown headers_len was an
|
|
1538
|
+
* out-of-bounds heap read (the bug). The sibling rb_parse_line_to_hash safely uses headers_len
|
|
1539
|
+
* because it re-allocs its bitmap to headers_len on every call. */
|
|
1540
|
+
long keep_bitmap_len = ctx->keep_bitmap_len;
|
|
1498
1541
|
bool keep_extra_columns = ctx->keep_extra_columns;
|
|
1499
1542
|
long early_exit_after = ctx->early_exit_after;
|
|
1500
1543
|
|
|
@@ -1596,7 +1639,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
|
|
|
1596
1639
|
while (trim_end >= trim_start && (*trim_end == ' ' || *trim_end == '\t')) trim_end--;
|
|
1597
1640
|
}
|
|
1598
1641
|
long trimmed_len = (trim_end >= trim_start) ? (trim_end - trim_start + 1) : 0;
|
|
1599
|
-
if (!keep_bitmap || (element_count <
|
|
1642
|
+
if (!keep_bitmap || (element_count < keep_bitmap_len ? keep_bitmap[element_count] : keep_extra_columns)) {
|
|
1600
1643
|
if (insert_field_into_hash(&xform, trim_start, trimmed_len, element_count, false, quote_char_val, encoding))
|
|
1601
1644
|
all_blank = false;
|
|
1602
1645
|
}
|
|
@@ -1617,7 +1660,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
|
|
|
1617
1660
|
while (trim_end >= trim_start && (*trim_end == ' ' || *trim_end == '\t')) trim_end--;
|
|
1618
1661
|
}
|
|
1619
1662
|
long trimmed_len = (trim_end >= trim_start) ? (trim_end - trim_start + 1) : 0;
|
|
1620
|
-
if (!keep_bitmap || (element_count <
|
|
1663
|
+
if (!keep_bitmap || (element_count < keep_bitmap_len ? keep_bitmap[element_count] : keep_extra_columns)) {
|
|
1621
1664
|
if (insert_field_into_hash(&xform, trim_start, trimmed_len, element_count, false, quote_char_val, encoding))
|
|
1622
1665
|
all_blank = false;
|
|
1623
1666
|
}
|
|
@@ -1680,7 +1723,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
|
|
|
1680
1723
|
|
|
1681
1724
|
bool has_embedded_quotes = quoted || (trimmed_len > 0 && memchr(trim_start, quote_char_val, trimmed_len));
|
|
1682
1725
|
|
|
1683
|
-
if (!keep_bitmap || (element_count <
|
|
1726
|
+
if (!keep_bitmap || (element_count < keep_bitmap_len ? keep_bitmap[element_count] : keep_extra_columns)) {
|
|
1684
1727
|
if (insert_field_into_hash(&xform, trim_start, trimmed_len, element_count, has_embedded_quotes, quote_char_val, encoding))
|
|
1685
1728
|
all_blank = false;
|
|
1686
1729
|
}
|
|
@@ -1714,25 +1757,40 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
|
|
|
1714
1757
|
if (!allow_escaped_quotes || backslash_count % 2 == 0) {
|
|
1715
1758
|
if (__builtin_expect(quote_boundary_standard, 1)) {
|
|
1716
1759
|
if (in_quotes) {
|
|
1717
|
-
|
|
1718
|
-
|
|
1719
|
-
|
|
1720
|
-
|
|
1721
|
-
|
|
1722
|
-
|
|
1760
|
+
if (p + 2 < endP && *(p + 1) == quote_char_val) {
|
|
1761
|
+
/* RFC doubled quote inside a quoted field ("" → ").
|
|
1762
|
+
* Give this precedence over the closing-quote check, but only
|
|
1763
|
+
* when another byte follows the doubled pair.
|
|
1764
|
+
*
|
|
1765
|
+
* Compatibility note: we intentionally do NOT force terminal
|
|
1766
|
+
* "" to be consumed here. SmarterCSV has a long-standing lenient
|
|
1767
|
+
* behavior for malformed tails like ...\"" in :double_quotes mode:
|
|
1768
|
+
* the final quote may still close the field instead of turning the
|
|
1769
|
+
* row into an unclosed-quote error. Issue #334 needs doubled-quote
|
|
1770
|
+
* precedence for ..."",... (more content follows), but we keep the
|
|
1771
|
+
* historical leniency for terminal ..."". */
|
|
1772
|
+
p++;
|
|
1773
|
+
} else {
|
|
1774
|
+
/* closing quote: only valid if followed by col_sep, row_sep, or end */
|
|
1775
|
+
bool valid_close = (p + 1 >= endP);
|
|
1776
|
+
if (!valid_close) {
|
|
1777
|
+
valid_close = true;
|
|
1778
|
+
for (long j = 0; j < col_sep_len; j++) {
|
|
1779
|
+
if (*(p + 1 + j) != *(col_sepP + j)) { valid_close = false; break; }
|
|
1780
|
+
}
|
|
1723
1781
|
}
|
|
1724
|
-
|
|
1725
|
-
|
|
1726
|
-
|
|
1727
|
-
|
|
1728
|
-
|
|
1782
|
+
if (!valid_close && row_sep_len2 > 0) {
|
|
1783
|
+
valid_close = true;
|
|
1784
|
+
for (long j = 0; j < row_sep_len2; j++) {
|
|
1785
|
+
if (*(p + 1 + j) != *(row_sepP2 + j)) { valid_close = false; break; }
|
|
1786
|
+
}
|
|
1729
1787
|
}
|
|
1788
|
+
if (valid_close) {
|
|
1789
|
+
in_quotes = false;
|
|
1790
|
+
field_started = true;
|
|
1791
|
+
}
|
|
1792
|
+
/* else: quote inside quoted field → literal */
|
|
1730
1793
|
}
|
|
1731
|
-
if (valid_close) {
|
|
1732
|
-
in_quotes = false;
|
|
1733
|
-
field_started = true;
|
|
1734
|
-
}
|
|
1735
|
-
/* else: quote inside quoted field → literal (handles "" doubling) */
|
|
1736
1794
|
} else if (!field_started) {
|
|
1737
1795
|
in_quotes = true; /* opening quote at field boundary */
|
|
1738
1796
|
field_started = true;
|
|
@@ -1791,7 +1849,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
|
|
|
1791
1849
|
|
|
1792
1850
|
bool has_embedded_quotes = quoted || (trimmed_len > 0 && memchr(trim_start, quote_char_val, trimmed_len));
|
|
1793
1851
|
|
|
1794
|
-
if (!keep_bitmap || (element_count <
|
|
1852
|
+
if (!keep_bitmap || (element_count < keep_bitmap_len ? keep_bitmap[element_count] : keep_extra_columns)) {
|
|
1795
1853
|
if (insert_field_into_hash(&xform, trim_start, trimmed_len, element_count, has_embedded_quotes, quote_char_val, encoding))
|
|
1796
1854
|
all_blank = false;
|
|
1797
1855
|
}
|
|
@@ -1819,7 +1877,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
|
|
|
1819
1877
|
if (!remove_empty_values) {
|
|
1820
1878
|
ensure_hash_allocated(&xform);
|
|
1821
1879
|
for (long i = element_count; i < headers_len; i++) {
|
|
1822
|
-
if (!keep_bitmap || keep_bitmap[i]) {
|
|
1880
|
+
if (!keep_bitmap || (i < keep_bitmap_len ? keep_bitmap[i] : keep_extra_columns)) {
|
|
1823
1881
|
rb_hash_aset(xform.hash, rb_ary_entry(headers, i), Qnil);
|
|
1824
1882
|
}
|
|
1825
1883
|
}
|
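The bounds rule applied in the `rb_parse_line_to_hash_ctx` hunks above can be sketched in Ruby. This is only an illustration of the decision logic: `keep_column?` and the sample data are hypothetical names, not part of SmarterCSV, and a Ruby array returns `nil` on an out-of-range read instead of touching foreign memory as the C code could.

```ruby
# Hypothetical helper mirroring the C expression
#   !keep_bitmap || (i < keep_bitmap_len ? keep_bitmap[i] : keep_extra_columns)
def keep_column?(keep_bitmap, keep_bitmap_len, index, keep_extra_columns)
  return true if keep_bitmap.nil?                       # no column filtering active
  return keep_bitmap[index] if index < keep_bitmap_len  # inside the cached bitmap
  keep_extra_columns                                    # past the bitmap: extra column
end

bitmap      = [true, false, true].freeze # sized when there were 3 headers
bitmap_len  = bitmap.length              # captured once, like ctx->keep_bitmap_len
headers_len = 5                          # headers later grew by two extra columns

decisions = (0...headers_len).map { |i| keep_column?(bitmap, bitmap_len, i, true) }
p decisions
```

Indices 3 and 4 lie past the captured bitmap length, so they follow `keep_extra_columns` rather than reading the bitmap. That is the distinction the C comment draws between the cached (ctx) variant and the per-call variant.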
data/lib/smarter_csv/parser.rb
CHANGED
|
@@ -410,15 +410,28 @@ module SmarterCSV
|
|
|
410
410
|
if !allow_escaped_quotes || backslash_count % 2 == 0
|
|
411
411
|
if quote_boundary_standard
|
|
412
412
|
if in_quotes
|
|
413
|
-
# closing quote: only valid if followed by col_sep, row_sep, or end of line
|
|
414
413
|
next_i = i + 1
|
|
415
|
-
if next_i
|
|
416
|
-
|
|
417
|
-
|
|
414
|
+
if next_i + 1 < bytesize && line.getbyte(next_i) == quote_byte
|
|
415
|
+
# RFC doubled quote inside a quoted field ("" → ").
|
|
416
|
+
# Give this precedence over the closing-quote check, but only
|
|
417
|
+
# when another byte follows the doubled pair.
|
|
418
|
+
#
|
|
419
|
+
# Compatibility note: we intentionally do NOT force terminal
|
|
420
|
+
# "" to be consumed here. SmarterCSV has a long-standing lenient
|
|
421
|
+
# behavior for malformed tails like ...\"" in :double_quotes mode:
|
|
422
|
+
# the final quote may still close the field instead of turning the
|
|
423
|
+
# row into an unclosed-quote error. Issue #334 needs doubled-quote
|
|
424
|
+
# precedence for ..."",... (more content follows), but we keep the
|
|
425
|
+
# historical leniency for terminal ..."".
|
|
426
|
+
i = next_i
|
|
427
|
+
# closing quote: only valid if followed by col_sep, row_sep, or end of line
|
|
428
|
+
elsif next_i >= bytesize ||
|
|
429
|
+
line.getbyte(next_i) == col_sep_byte ||
|
|
430
|
+
(row_sep_bytesize > 0 && line.byteslice(next_i, row_sep_bytesize) == row_sep)
|
|
418
431
|
in_quotes = false
|
|
419
432
|
field_started = true
|
|
420
433
|
end
|
|
421
|
-
# else: quote inside quoted field → literal
|
|
434
|
+
# else: quote inside quoted field → literal
|
|
422
435
|
elsif !field_started # at field boundary: open quoted field
|
|
423
436
|
in_quotes = true
|
|
424
437
|
field_started = true
|
|
@@ -519,15 +532,28 @@ module SmarterCSV
|
|
|
519
532
|
if !allow_escaped_quotes || backslash_count % 2 == 0
|
|
520
533
|
if quote_boundary_standard
|
|
521
534
|
if in_quotes
|
|
522
|
-
# closing quote: only valid if followed by col_sep, row_sep, or end of line
|
|
523
535
|
next_i = i + 1
|
|
524
|
-
if next_i
|
|
525
|
-
|
|
526
|
-
|
|
536
|
+
if next_i + 1 < line_size && line[next_i] == quote
|
|
537
|
+
# RFC doubled quote inside a quoted field ("" → ").
|
|
538
|
+
# Give this precedence over the closing-quote check, but only
|
|
539
|
+
# when another character follows the doubled pair.
|
|
540
|
+
#
|
|
541
|
+
# Compatibility note: we intentionally do NOT force terminal
|
|
542
|
+
# "" to be consumed here. SmarterCSV has a long-standing lenient
|
|
543
|
+
# behavior for malformed tails like ...\"" in :double_quotes mode:
|
|
544
|
+
# the final quote may still close the field instead of turning the
|
|
545
|
+
# row into an unclosed-quote error. Issue #334 needs doubled-quote
|
|
546
|
+
# precedence for ..."",... (more content follows), but we keep the
|
|
547
|
+
# historical leniency for terminal ..."".
|
|
548
|
+
i = next_i
|
|
549
|
+
# closing quote: only valid if followed by col_sep, row_sep, or end of line
|
|
550
|
+
elsif next_i >= line_size ||
|
|
551
|
+
line[next_i...next_i + col_sep_size] == col_sep ||
|
|
552
|
+
(row_sep_size > 0 && line[next_i...next_i + row_sep_size] == row_sep)
|
|
527
553
|
in_quotes = false
|
|
528
554
|
field_started = true
|
|
529
555
|
end
|
|
530
|
-
# else: quote inside quoted field → literal
|
|
556
|
+
# else: quote inside quoted field → literal
|
|
531
557
|
elsif !field_started # at field boundary: open quoted field
|
|
532
558
|
in_quotes = true
|
|
533
559
|
field_started = true
|
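The order of checks introduced in both parser paths can be sketched as a small standalone splitter. This is a simplified illustration, not the gem's implementation: `split_fields` is a hypothetical name, it assumes single-character separators, ignores `row_sep` and backslash escapes, and unescapes doubled quotes inline.

```ruby
# Minimal sketch of the quote-handling order: doubled quote first (only when
# more content follows the pair), then the closing-quote check, else literal.
def split_fields(line, col_sep: ',', quote: '"')
  fields = []
  field = +''
  in_quotes = false
  field_started = false
  i = 0
  while i < line.size
    ch = line[i]
    if ch == quote
      if in_quotes
        next_i = i + 1
        if next_i + 1 < line.size && line[next_i] == quote
          field << quote # doubled quote with content after the pair: one literal quote
          i = next_i     # skip the second quote of the pair
        elsif next_i >= line.size || line[next_i] == col_sep
          in_quotes = false # valid closing quote
        else
          field << quote # stray quote inside a quoted field: literal
        end
      elsif !field_started
        in_quotes = true # opening quote at field boundary
        field_started = true
      else
        field << quote
      end
    elsif ch == col_sep && !in_quotes
      fields << field
      field = +''
      field_started = false
    else
      field << ch
      field_started = true
    end
    i += 1
  end
  fields << field
end

issue_334_style = split_fields('"say ""hi"",ok",next')
terminal_pair   = split_fields('a,"b""')
p issue_334_style # => ["say \"hi\",ok", "next"]
p terminal_pair
```

In the first input the `""` before the comma is consumed as a pair, so the comma stays inside the quoted field. In the second input the pair sits at the very end of the line, so the first quote is kept as a literal and the final quote still closes the field, which matches the leniency described in the compatibility note.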
data/lib/smarter_csv/version.rb
CHANGED
metadata
CHANGED
|
@@ -1,13 +1,13 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: smarter_csv
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 1.17.
|
|
4
|
+
version: 1.17.2
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Tilo Sloboda
|
|
8
8
|
bindir: bin
|
|
9
9
|
cert_chain: []
|
|
10
|
-
date: 2026-05-
|
|
10
|
+
date: 2026-05-21 00:00:00.000000000 Z
|
|
11
11
|
dependencies: []
|
|
12
12
|
description: |
|
|
13
13
|
SmarterCSV is a high-performance CSV reader and writer for Ruby focused on
|
|
@@ -39,7 +39,6 @@ files:
|
|
|
39
39
|
- LICENSE.txt
|
|
40
40
|
- README.md
|
|
41
41
|
- Rakefile
|
|
42
|
-
- TO_DO.md
|
|
43
42
|
- docs/_introduction.md
|
|
44
43
|
- docs/bad_row_quarantine.md
|
|
45
44
|
- docs/basic_read_api.md
|
data/TO_DO.md
DELETED
|
@@ -1,109 +0,0 @@
|
|
|
1
|
-
# SmarterCSV v2.0 TO DO List
|
|
2
|
-
|
|
3
|
-
DONE:
|
|
4
|
-
[X] Don't call rewind on filehandle
|
|
5
|
-
[X] use Procs for validations and transformatoins [issue #118](https://github.com/tilo/smarter_csv/issues/118)
|
|
6
|
-
[X] skip file opening, allow reading from CSV string, e.g. reading from S3 file [issue #120](https://github.com/tilo/smarter_csv/issues/120). Or stream large file from S3 (linked in the issue)
|
|
7
|
-
[X] [2.0 BUG] convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
|
|
8
|
-
[X] add enumerable to speed up parallel processing [issue #66](https://github.com/tilo/smarter_csv/issues/66), [issue #32](https://github.com/tilo/smarter_csv/issues/32)
|
|
9
|
-
[X] Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
|
|
10
|
-
[X] Collect all Errors, before surfacing them. Avoid throwing an exception on the first error [issue #133](https://github.com/tilo/smarter_csv/issues/133)
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
Partially Done:
|
|
14
|
-
[ ] make @errors and @warnings work [issue #118](https://github.com/tilo/smarter_csv/issues/118)
|
|
15
|
-
|
|
16
|
-
StilL TO DO:
|
|
17
|
-
[ ] Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)
|
|
18
|
-
|
|
19
|
-
Arguably by design (e.g. exclude these columns from conversion and have them returned as a string)
|
|
20
|
-
[ ] [2.0 BUG] :convert_values_to_numeric_unless_leading_zeros drops leading zeros [issue #151](https://github.com/tilo/smarter_csv/issues/151)
|
|
21
|
-
|
|
22
|
-
|
|
23
|
-
## Numeric conversion: align the Ruby fallback path with the C path (permissive)
|
|
24
|
-
|
|
25
|
-
Context: `convert_values_to_numeric` runs in two places that currently DISAGREE on edge cases:
|
|
26
|
-
- C path (`acceleration: true`, the default): `ext/smarter_csv/smarter_csv.c#try_numeric_conversion`
|
|
27
|
-
uses `strtol`/`strtod` (base 10; float branch only entered when the field contains a `.`).
|
|
28
|
-
- Ruby fallback (`acceleration: false`): `lib/smarter_csv/hash_transformations.rb` uses the
|
|
29
|
-
strict regex `NUMERIC_REGEX = /\A[+-]?\d+(?:\.\d+)?\z/` plus `to_i` / `to_f`.
|
|
30
|
-
|
|
31
|
-
Divergence (verified empirically):
|
|
32
|
-
| value | C path | Ruby fallback |
|
|
33
|
-
|-----------|------------------|-------------------|
|
|
34
|
-
| ".5" | 0.5 (Float) | ".5" (String) |
|
|
35
|
-
| "3." | 3.0 (Float) | "3." (String) |
|
|
36
|
-
| "1.5e3" | 1500.0 (Float) | "1.5e3" (String) |
|
|
37
|
-
| "1.0e10" | 10000000000.0 | "1.0e10" (String) |
|
|
38
|
-
|
|
39
|
-
Decision: the C path's permissive behavior (corner cases + scientific notation) is the intended
|
|
40
|
-
contract. Fix = make the Ruby fallback match the C path. Do NOT tighten the C path.
|
|
41
|
-
|
|
42
|
-
Ruby-side changes (in `hash_transformations.rb`):
|
|
43
|
-
1. Swap NUMERIC_REGEX for a permissive one:
|
|
44
|
-
/\A[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?\z/
|
|
45
|
-
matches 1, 1., 1.5, .5, 1e3, 1.5e3, -3.14e-2, etc.; still rejects ".", "e3", "1.2.3",
|
|
46
|
-
"1_000", "0x1F".
|
|
47
|
-
2. Add `DOT_BYTE = '.'.ord` (46) and include it in the first-byte fast-reject's allowed set
|
|
48
|
-
(the C pre-check already allows a leading `.`; without this, ".5" gets rejected on byte 0).
|
|
49
|
-
3. Int-vs-float decision: `(v.include?('.') || v.include?('e') || v.include?('E')) ? v.to_f : v.to_i`
|
|
50
|
-
(currently only checks for `.`).
|
|
51
|
-
|
|
52
|
-
Stays a string on BOTH paths (no change needed, but worth characterization tests — there are
|
|
53
|
-
currently NONE):
|
|
54
|
-
- "010" => 10 (NOT octal 8 — both paths use base-10 conversion: String#to_i / strtol(.,10).
|
|
55
|
-
A switch to Kernel#Integer() would break this. Lock it down with a test.)
|
|
56
|
-
- "0x1F", "0b101", "0o17" => string (radix prefixes not honored by base-10 conversion)
|
|
57
|
-
- "1_000" => string (underscores)
|
|
58
|
-
- "1,200.00", "1.300,00" => string (thousands sep / decimal comma — strtod stops at the
|
|
59
|
-
separator → not fully consumed; regex rejects. This is the only safe behavior; "1,200" is
|
|
60
|
-
genuinely ambiguous. Locale-specific number formats are the caller's job via value_converters.)
|
|
61
|
-
|
|
62
|
-
NOT doing: locale sniffing (read LC_NUMERIC at init and adjust the regexes). Rejected because
|
|
63
|
-
the machine locale tells you nothing about the file's number format, it breaks reproducibility
|
|
64
|
-
(same code + same file → different results on a US vs EU box), and `,` can't be both col_sep and
|
|
65
|
-
decimal separator anyway. Note `strtod` IS locale-sensitive (LC_NUMERIC) but it's dormant — Ruby
|
|
66
|
-
runs in the C/POSIX locale; don't deliberately activate it.
|
|
67
|
-
|
|
68
|
-
When done: parity tests (`[true, false].each`) for the now-consistent set (.5, 3., 1.5e3, 1e3)
|
|
69
|
-
plus characterization tests for the stays-a-string set above; CHANGELOG line noting the Ruby
|
|
70
|
-
fallback's numeric conversion now accepts scientific notation and bare-dot forms, matching the
|
|
71
|
-
accelerated path. Behavior change affects `acceleration: false` users only — and aligns them with
|
|
72
|
-
the default.
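The proposed Ruby-side change can be tried out in isolation. The sketch below copies the two regexes from the notes above; `to_numeric` and the constant names `STRICT_NUMERIC` / `PERMISSIVE_NUMERIC` are hypothetical and do not claim to match the gem's code.

```ruby
STRICT_NUMERIC     = /\A[+-]?\d+(?:\.\d+)?\z/
PERMISSIVE_NUMERIC = /\A[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?\z/

# Hypothetical converter: permissive match, then the int-vs-float rule
# from step 3 (float when the value contains '.', 'e' or 'E').
def to_numeric(value)
  return value unless PERMISSIVE_NUMERIC.match?(value)

  value.match?(/[.eE]/) ? value.to_f : value.to_i
end

%w[.5 3. 1.5e3 1e3 010 0x1F 1_000 1,200.00 .].each do |v|
  puts format('%-9s strict=%-5s permissive=%-5s => %p',
              v, STRICT_NUMERIC.match?(v), PERMISSIVE_NUMERIC.match?(v), to_numeric(v))
end
```

Values such as `0x1F`, `1_000` and `1,200.00` fail both patterns and come back unchanged as strings, while `010` converts to the base-10 integer 10 via `String#to_i`.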
|
|
73
|
-
|
|
74
|
-
|
|
75
|
-
## Warn once when the C extension didn't load on a platform that supports it
|
|
76
|
-
|
|
77
|
-
Context: `acceleration: true` is the default. When the C extension fails to build / isn't loaded,
|
|
78
|
-
SmarterCSV silently falls back to the Ruby parser — graceful degradation by design (so the gem
|
|
79
|
-
keeps working for users with broken toolchains, JRuby, TruffleRuby, etc.). Today there is no
|
|
80
|
-
signal to the user that they're not getting the C path; their CSV parsing is just slower than
|
|
81
|
-
they might have expected.
|
|
82
|
-
|
|
83
|
-
Idea: emit a one-time warning when:
|
|
84
|
-
* the C extension is NOT loaded — `!SmarterCSV::Parser.respond_to?(:parse_csv_line_c)`, AND
|
|
85
|
-
* the platform is one where it *should* be available — `RUBY_ENGINE == 'ruby'` (MRI / CRuby).
|
|
86
|
-
JRuby and TruffleRuby don't load CRuby C extensions natively; nothing for the user to do.
|
|
87
|
-
|
|
88
|
-
Where to fire:
|
|
89
|
-
* NOT at `require 'smarter_csv'` time — Rails.logger typically isn't set up yet, so any
|
|
90
|
-
"route through the warnings system" code would just fall through to `Kernel#warn` anyway,
|
|
91
|
-
and the warning would land in stderr instead of the Rails log where ops would see it.
|
|
92
|
-
* At first `Reader.new` / `SmarterCSV.process` call — Rails has booted, the existing
|
|
93
|
-
routing-through-Rails.logger-or-Kernel#warn infra works, and the existing deduped warnings
|
|
94
|
-
histogram means it fires once per process regardless of how many parse calls.
|
|
95
|
-
|
|
96
|
-
Implementation sketch:
|
|
97
|
-
* Add a new warning code (e.g. `:c_extension_unavailable`) alongside the existing ones
|
|
98
|
-
(`:chunk_size_default`, `:header_a_method`, `:utf8_missing_binary_mode`, ...).
|
|
99
|
-
* Severity `:warn`. Suppressible via the existing `verbose: :quiet`.
|
|
100
|
-
* Message points at the fix — e.g. "C acceleration extension not loaded on this Ruby; using
|
|
101
|
-
Ruby parser. To enable acceleration, reinstall with `gem pristine smarter_csv` and check
|
|
102
|
-
the build log." Plus a link/pointer to a troubleshooting section in the docs.
|
|
103
|
-
|
|
104
|
-
Bonus: add a public predicate `SmarterCSV.acceleration_available?` returning
|
|
105
|
-
`Parser.respond_to?(:parse_csv_line_c)`. Zero noise, useful for scripts / CI / future spec
|
|
106
|
-
files that want to branch on the environment fact rather than guess.
|
|
107
|
-
|
|
108
|
-
NOT doing: a banner at `require` time (every Rails app would print it at boot, too noisy);
|
|
109
|
-
warning when `acceleration: false` was explicitly chosen (the user knows what they're doing).
|
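A minimal sketch of the warn-once idea described above, under stated assumptions: the module `AccelerationCheck`, its method names, and the `engine:` / `io:` parameters are all hypothetical, and the parser object is passed in so the example runs without the gem. Only the `respond_to?(:parse_csv_line_c)` check and the message text come from the notes.

```ruby
require 'stringio'

# Hypothetical module; names are illustrative, not SmarterCSV API.
module AccelerationCheck
  MESSAGE = 'C acceleration extension not loaded on this Ruby; using Ruby parser. ' \
            'To enable acceleration, reinstall with `gem pristine smarter_csv` ' \
            'and check the build log.'
  @warned = false

  class << self
    # True when the given parser object exposes the C entry point.
    def acceleration_available?(parser)
      parser.respond_to?(:parse_csv_line_c)
    end

    # Emits the warning at most once per process, and only on MRI / CRuby.
    def warn_once_if_unavailable(parser, engine: RUBY_ENGINE, io: $stderr)
      return false if @warned
      return false unless engine == 'ruby'
      return false if acceleration_available?(parser)

      io.puts MESSAGE
      @warned = true
    end
  end
end

out = StringIO.new
fake_parser = Object.new # stands in for a parser without the C method
3.times { AccelerationCheck.warn_once_if_unavailable(fake_parser, engine: 'ruby', io: out) }
puts out.string.lines.size
```

Calling the check from the first read call rather than at `require` time, as the notes suggest, would let the message reach whichever logger is configured by then.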