smarter_csv 1.17.0 → 1.17.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 702bd7049e83c0beb85f0ca11a122e6f1659eddef6afec66eaf1c37c5b30f43f
4
- data.tar.gz: dd1915694d041c9b631324de7408f46fc8f426f9e1c60136c35a8f1e754d4590
3
+ metadata.gz: 2e665f0dc98db44950aa9cbb2cac430068e91df8886062068413dfbcefc74fc3
4
+ data.tar.gz: def43fb66886b16ec13bd429b4fd6923b09aa1a01757a696390a38c18b59fa31
5
5
  SHA512:
6
- metadata.gz: fa00d07c21cffa711a43ecb4622ad3a09b667f1c1965ad26bee864ada6a3c168076ec04550781c75c4f0acbb28fcab60001278f459cf3065cceaef6820764e30
7
- data.tar.gz: eab835a356e5343e20a5cc0784ffd9aafa8ab631256d412ec570a0192060b8f3e9c6f619db36e59494364f6c51cda16a9c434c2565ef5a5b6e23f4813a7eaaef
6
+ metadata.gz: 2fb7793ed4eca64cfef1f7dd82a417b44988832280b373c1748213d9f7c879cd0a2d17c4e3b72c82be6acedb01b0fec26b70e6daaefb645ee2c3bf64b7aedcd8
7
+ data.tar.gz: a0b8842d5a69d8526af81d4e2a64c31fd6a54d6d610c5ce0dbb16298dfd03c3d296546c049a4a76a710908390ec1ea1739bc7530b1ab652c7ead1ceaa02b431d
data/CHANGELOG.md CHANGED
@@ -1,7 +1,25 @@
1
1
 
2
2
  # SmarterCSV 1.x Change Log
3
3
 
4
- ## 1.17.0 (NOT RELEASED)
4
+ ## 1.17.2 (2026-05-21)
5
+
6
+ RSpec tests: **2,220 → 2,274** (+54 tests)
7
+
8
+ ### Bug Fix
9
+
10
+ - fixed [Issue #334](https://github.com/tilo/smarter_csv/issues/334) with escaped double quote followed by comma. Thanks to [conorg](https://github.com/conorg)
11
+ - fixed bug when using `headers: { except: }`
12
+ - added more tests
13
+
14
+ ## 1.17.1 (2026-05-17)
15
+
16
+ RSpec tests: **2,210 → 2,220** (+10 tests)
17
+
18
+ ### Bug Fix
19
+
20
+ - fixing issue with `remove_empty_hashes: false` not being honored in accelerated path (does not affect you when you use default settings)
21
+
22
+ ## 1.17.0 (2026-05-14)
5
23
 
6
24
  RSpec tests: **1,434 → 2,210** (+776 tests)
7
25
 
@@ -55,6 +73,13 @@ Measured against 1.16.4 (Apple M4, Ruby 3.4.7):
55
73
 
56
74
  Per-file breakdown: [`docs/releases/1.17.0/performance_notes.md`](docs/releases/1.17.0/performance_notes.md).
57
75
 
76
+ ## 1.16.5 (2026-05-17)
77
+
78
+ ### Bug Fix
79
+
80
+ - fixing issue with `remove_empty_hashes: false` not being honored in accelerated path (does not affect you when you use default settings)
81
+
82
+
58
83
  ## 1.16.4 (2026-04-21) — Bug Fixes
59
84
 
60
85
  RSpec tests: **1,434 → 1,467** (+33 tests)
@@ -206,6 +231,11 @@ Measured on 19 benchmark files, Apple M1, Ruby 3.4.7. See [benchmarks](docs/rele
206
231
  * **Writer temp file** no longer hardcoded to `/tmp` (fixes Windows); properly cleaned up with `Tempfile#close!`.
207
232
  * **Writer `StringIO`**: `finalize` no longer attempts to close a caller-owned `StringIO`.
208
233
 
234
+ ## 1.15.3 (2026-05-17)
235
+
236
+ ### Bug Fix
237
+
238
+ - fixing issue with `remove_empty_hashes: false` not being honored in accelerated path (does not affect you when you use default settings)
209
239
 
210
240
  ## 1.15.2 (2026-02-20)
211
241
 
data/CONTRIBUTORS.md CHANGED
@@ -1,4 +1,4 @@
1
- # A Big Thank You to all 63 Contributors!!
1
+ # A Big Thank You to all 64 Contributors!!
2
2
 
3
3
 
4
4
  A Big Thank you to everyone who filed issues, sent comments, and who contributed with pull requests:
@@ -66,3 +66,4 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
66
66
  * [Dom Lebron](https://github.com/biglebronski)
67
67
  * [Paho Lurie-Gregg](https://github.com/paholg)
68
68
  * [Jonas Staškevičius](https://github.com/pirminis)
69
+ * [conorg](https://github.com/conorg)
data/README.md CHANGED
@@ -1,8 +1,8 @@
1
1
 
2
2
  # SmarterCSV
3
3
 
4
- ![Gem Version](https://img.shields.io/gem/v/smarter_csv) [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [View on RubyGems](https://rubygems.org/gems/smarter_csv) [View on RubyToolbox](https://www.ruby-toolbox.com/search?q=smarter_csv)
5
-
4
+ ![Gem Version](https://img.shields.io/gem/v/smarter_csv) [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Downloads](https://img.shields.io/gem/dt/smarter_csv)](https://rubygems.org/gems/smarter_csv) [![RubyGems](https://img.shields.io/badge/RubyGems-smarter__csv-brightgreen?logo=rubygems&logoColor=white)](https://rubygems.org/gems/smarter_csv) [![Ruby Toolbox](https://img.shields.io/badge/Ruby%20Toolbox-smarter__csv-brightgreen)](https://www.ruby-toolbox.com/projects/smarter_csv)
5
+
6
6
  SmarterCSV is a high-performance CSV ingestion and generation for Ruby, focused on fast end-to-end CSV ingestion of real-world data — no silent failures, no surprises, not just tokenization.
7
7
 
8
8
  ⭐ If SmarterCSV saved you hours of import time, please star the repo, and consider sponsoring this project.
@@ -311,7 +311,7 @@ Or install it yourself as:
311
311
  * [Examples](docs/examples.md)
312
312
  * [Real-World CSV Files](docs/real_world_csv.md)
313
313
  * [SmarterCSV over the Years](docs/history.md)
314
- * [Release Notes](docs/releases/1.16.0/changes.md)
314
+ * [Release Notes](docs/releases/1.17.0/changes.md)
315
315
 
316
316
  ## Articles
317
317
  * [Parsing CSV Files in Ruby with SmarterCSV](https://tilo-sloboda.medium.com/parsing-csv-files-in-ruby-with-smartercsv-6ce66fb6cf38)
@@ -333,7 +333,7 @@ For reporting issues, please:
333
333
  * open a pull-request adding a test that demonstrates the issue
334
334
  * mention your version of SmarterCSV, Ruby, Rails
335
335
 
336
- # [A Special Thanks to all 63 Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
336
+ # [A Special Thanks to all 64 Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
337
337
 
338
338
 
339
339
  ## Contributing
@@ -331,25 +331,40 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
331
331
  if (!allow_escaped_quotes || backslash_count % 2 == 0) {
332
332
  if (__builtin_expect(quote_boundary_standard, 1)) {
333
333
  if (in_quotes) {
334
- // closing quote: only valid if followed by col_sep, row_sep, or end of line
335
- bool valid_close = (p + 1 >= endP);
336
- if (!valid_close) {
337
- valid_close = true;
338
- for (long j = 0; j < col_sep_len; j++) {
339
- if (*(p + 1 + j) != *(col_sepP + j)) { valid_close = false; break; }
334
+ if (p + 2 < endP && *(p + 1) == quote_char_val) {
335
+ /* RFC doubled quote inside a quoted field ("" → ").
336
+ * Give this precedence over the closing-quote check, but only
337
+ * when another byte follows the doubled pair.
338
+ *
339
+ * Compatibility note: we intentionally do NOT force terminal
340
+ * "" to be consumed here. SmarterCSV has a long-standing lenient
341
+ * behavior for malformed tails like ...\"" in :double_quotes mode:
342
+ * the final quote may still close the field instead of turning the
343
+ * row into an unclosed-quote error. Issue #334 needs doubled-quote
344
+ * precedence for ..."",... (more content follows), but we keep the
345
+ * historical leniency for terminal ..."". */
346
+ p++;
347
+ } else {
348
+ // closing quote: only valid if followed by col_sep, row_sep, or end of line
349
+ bool valid_close = (p + 1 >= endP);
350
+ if (!valid_close) {
351
+ valid_close = true;
352
+ for (long j = 0; j < col_sep_len; j++) {
353
+ if (*(p + 1 + j) != *(col_sepP + j)) { valid_close = false; break; }
354
+ }
340
355
  }
341
- }
342
- if (!valid_close && row_sep_len > 0) {
343
- valid_close = true;
344
- for (long j = 0; j < row_sep_len; j++) {
345
- if (*(p + 1 + j) != *(row_sepP + j)) { valid_close = false; break; }
356
+ if (!valid_close && row_sep_len > 0) {
357
+ valid_close = true;
358
+ for (long j = 0; j < row_sep_len; j++) {
359
+ if (*(p + 1 + j) != *(row_sepP + j)) { valid_close = false; break; }
360
+ }
346
361
  }
362
+ if (valid_close) {
363
+ in_quotes = false;
364
+ field_started = true;
365
+ }
366
+ // else: quote inside quoted field → literal
347
367
  }
348
- if (valid_close) {
349
- in_quotes = false;
350
- field_started = true;
351
- }
352
- // else: quote inside quoted field → literal (handles "" doubling)
353
368
  } else if (!field_started) {
354
369
  in_quotes = true; // opening quote at field boundary
355
370
  field_started = true;
@@ -829,6 +844,11 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
829
844
  * the frame stays well below 4 KB and ___chkstk_darwin never fires on ARM64 macOS.
830
845
  */
831
846
  bool *keep_bitmap = NULL;
847
+ /* In THIS (non-ctx) function the bitmap is alloca'd to headers_len on every call (see the alloca
848
+ * sites below), so keep_bitmap[] is exactly headers_len long and headers_len is the correct bound
849
+ * at all access sites. Do NOT mirror rb_parse_line_to_hash_ctx's keep_bitmap_len here: that variant
850
+ * caches its bitmap across rows (where @headers can grow), so it must use the captured length; this
851
+ * one rebuilds per call and does not. */
832
852
  bool keep_extra_columns = true; /* extra cols (> headers_len): keep by default */
833
853
  bool has_only = false; /* true when only_headers: filtering is active */
834
854
  long early_exit_after = -1; /* column index after which we stop; -1 = no early exit */
@@ -1147,25 +1167,40 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
1147
1167
  if (!allow_escaped_quotes || backslash_count % 2 == 0) {
1148
1168
  if (__builtin_expect(quote_boundary_standard, 1)) {
1149
1169
  if (in_quotes) {
1150
- // closing quote: only valid if followed by col_sep, row_sep, or end of line
1151
- bool valid_close = (p + 1 >= endP);
1152
- if (!valid_close) {
1153
- valid_close = true;
1154
- for (long j = 0; j < col_sep_len; j++) {
1155
- if (*(p + 1 + j) != *(col_sepP + j)) { valid_close = false; break; }
1170
+ if (p + 2 < endP && *(p + 1) == quote_char_val) {
1171
+ /* RFC doubled quote inside a quoted field ("" → ").
1172
+ * Give this precedence over the closing-quote check, but only
1173
+ * when another byte follows the doubled pair.
1174
+ *
1175
+ * Compatibility note: we intentionally do NOT force terminal
1176
+ * "" to be consumed here. SmarterCSV has a long-standing lenient
1177
+ * behavior for malformed tails like ...\"" in :double_quotes mode:
1178
+ * the final quote may still close the field instead of turning the
1179
+ * row into an unclosed-quote error. Issue #334 needs doubled-quote
1180
+ * precedence for ..."",... (more content follows), but we keep the
1181
+ * historical leniency for terminal ..."". */
1182
+ p++;
1183
+ } else {
1184
+ // closing quote: only valid if followed by col_sep, row_sep, or end of line
1185
+ bool valid_close = (p + 1 >= endP);
1186
+ if (!valid_close) {
1187
+ valid_close = true;
1188
+ for (long j = 0; j < col_sep_len; j++) {
1189
+ if (*(p + 1 + j) != *(col_sepP + j)) { valid_close = false; break; }
1190
+ }
1156
1191
  }
1157
- }
1158
- if (!valid_close && row_sep_len2 > 0) {
1159
- valid_close = true;
1160
- for (long j = 0; j < row_sep_len2; j++) {
1161
- if (*(p + 1 + j) != *(row_sepP2 + j)) { valid_close = false; break; }
1192
+ if (!valid_close && row_sep_len2 > 0) {
1193
+ valid_close = true;
1194
+ for (long j = 0; j < row_sep_len2; j++) {
1195
+ if (*(p + 1 + j) != *(row_sepP2 + j)) { valid_close = false; break; }
1196
+ }
1162
1197
  }
1198
+ if (valid_close) {
1199
+ in_quotes = false;
1200
+ field_started = true;
1201
+ }
1202
+ // else: quote inside quoted field → literal
1163
1203
  }
1164
- if (valid_close) {
1165
- in_quotes = false;
1166
- field_started = true;
1167
- }
1168
- // else: quote inside quoted field → literal (handles "" doubling)
1169
1204
  } else if (!field_started) {
1170
1205
  in_quotes = true; // opening quote at field boundary
1171
1206
  field_started = true;
@@ -1242,12 +1277,20 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
1242
1277
  * return nil instead of the hash so the row can be skipped.
1243
1278
  * With lazy allocation, if all_blank is true, xform.hash is still Qnil —
1244
1279
  * no hash was ever allocated.
1280
+ *
1281
+ * If remove_empty_hashes is disabled, preserve the row as an empty hash.
1282
+ * This keeps parity with the Ruby path without impacting the normal
1283
+ * non-blank hot path.
1245
1284
  */
1246
- if (remove_empty && all_blank) {
1247
- VALUE result = rb_ary_new_capa(2);
1248
- rb_ary_push(result, Qnil);
1249
- rb_ary_push(result, LONG2FIX(element_count));
1250
- return result;
1285
+ if (all_blank) {
1286
+ if (remove_empty) {
1287
+ VALUE result = rb_ary_new_capa(2);
1288
+ rb_ary_push(result, Qnil);
1289
+ rb_ary_push(result, LONG2FIX(element_count));
1290
+ return result;
1291
+ }
1292
+
1293
+ ensure_hash_allocated(&xform);
1251
1294
  }
1252
1295
 
1253
1296
  /* ----------------------------------------
@@ -1487,6 +1530,14 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
1487
1530
  int numeric_mode = ctx->numeric_mode;
1488
1531
  VALUE numeric_keys = ctx->numeric_keys;
1489
1532
  bool *keep_bitmap = ctx->keep_bitmap;
1533
+ /* keep_bitmap is cached in the context (xmalloc'd once at construction, sized to the header count
1534
+ * THEN). @headers can grow in place as undeclared extra columns appear, so the live headers_len
1535
+ * (re-read each call below) may exceed the bitmap's length. Every keep_bitmap[] access in this
1536
+ * function MUST be bounded by keep_bitmap_len, never headers_len — indices past the bitmap are
1537
+ * extra columns and follow keep_extra_columns. Bounding by the grown headers_len was an
1538
+ * out-of-bounds heap read (the bug). The sibling rb_parse_line_to_hash safely uses headers_len
1539
+ * because it re-allocs its bitmap to headers_len on every call. */
1540
+ long keep_bitmap_len = ctx->keep_bitmap_len;
1490
1541
  bool keep_extra_columns = ctx->keep_extra_columns;
1491
1542
  long early_exit_after = ctx->early_exit_after;
1492
1543
 
@@ -1588,7 +1639,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
1588
1639
  while (trim_end >= trim_start && (*trim_end == ' ' || *trim_end == '\t')) trim_end--;
1589
1640
  }
1590
1641
  long trimmed_len = (trim_end >= trim_start) ? (trim_end - trim_start + 1) : 0;
1591
- if (!keep_bitmap || (element_count < headers_len ? keep_bitmap[element_count] : keep_extra_columns)) {
1642
+ if (!keep_bitmap || (element_count < keep_bitmap_len ? keep_bitmap[element_count] : keep_extra_columns)) {
1592
1643
  if (insert_field_into_hash(&xform, trim_start, trimmed_len, element_count, false, quote_char_val, encoding))
1593
1644
  all_blank = false;
1594
1645
  }
@@ -1609,7 +1660,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
1609
1660
  while (trim_end >= trim_start && (*trim_end == ' ' || *trim_end == '\t')) trim_end--;
1610
1661
  }
1611
1662
  long trimmed_len = (trim_end >= trim_start) ? (trim_end - trim_start + 1) : 0;
1612
- if (!keep_bitmap || (element_count < headers_len ? keep_bitmap[element_count] : keep_extra_columns)) {
1663
+ if (!keep_bitmap || (element_count < keep_bitmap_len ? keep_bitmap[element_count] : keep_extra_columns)) {
1613
1664
  if (insert_field_into_hash(&xform, trim_start, trimmed_len, element_count, false, quote_char_val, encoding))
1614
1665
  all_blank = false;
1615
1666
  }
@@ -1672,7 +1723,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
1672
1723
 
1673
1724
  bool has_embedded_quotes = quoted || (trimmed_len > 0 && memchr(trim_start, quote_char_val, trimmed_len));
1674
1725
 
1675
- if (!keep_bitmap || (element_count < headers_len ? keep_bitmap[element_count] : keep_extra_columns)) {
1726
+ if (!keep_bitmap || (element_count < keep_bitmap_len ? keep_bitmap[element_count] : keep_extra_columns)) {
1676
1727
  if (insert_field_into_hash(&xform, trim_start, trimmed_len, element_count, has_embedded_quotes, quote_char_val, encoding))
1677
1728
  all_blank = false;
1678
1729
  }
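The `keep_bitmap_len` change in the hunks above boils down to one bounded lookup. A minimal Ruby sketch of that rule follows, assuming a hypothetical helper `keep_column?` that is not part of SmarterCSV. In Ruby an out-of-range index would simply return `nil`; in the C extension the same mistake reads past the end of a heap allocation, which is why the bound has to be the bitmap's own length rather than the live header count.

```ruby
# Hypothetical helper for illustration only; not part of SmarterCSV.
# bitmap:     array of booleans sized to the header count at construction time, or nil
# index:      zero-based column index of the current field
# keep_extra: policy for columns beyond the bitmap (mirrors keep_extra_columns)
def keep_column?(bitmap, index, keep_extra)
  return true if bitmap.nil? # no column filtering is active

  # Bound by the bitmap's own length, never by a header count that may have grown.
  index < bitmap.length ? bitmap[index] : keep_extra
end

bitmap = [true, false]              # built when only two headers were known
keep_column?(bitmap, 1, true)       # inside the bitmap: uses the stored flag
keep_column?(bitmap, 5, true)       # extra column: falls back to keep_extra
```

Columns that appear after the bitmap was built are treated as extra columns, which matches the comment added in `rb_parse_line_to_hash_ctx`.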
@@ -1706,25 +1757,40 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
1706
1757
  if (!allow_escaped_quotes || backslash_count % 2 == 0) {
1707
1758
  if (__builtin_expect(quote_boundary_standard, 1)) {
1708
1759
  if (in_quotes) {
1709
- /* closing quote: only valid if followed by col_sep, row_sep, or end */
1710
- bool valid_close = (p + 1 >= endP);
1711
- if (!valid_close) {
1712
- valid_close = true;
1713
- for (long j = 0; j < col_sep_len; j++) {
1714
- if (*(p + 1 + j) != *(col_sepP + j)) { valid_close = false; break; }
1760
+ if (p + 2 < endP && *(p + 1) == quote_char_val) {
1761
+ /* RFC doubled quote inside a quoted field ("" → ").
1762
+ * Give this precedence over the closing-quote check, but only
1763
+ * when another byte follows the doubled pair.
1764
+ *
1765
+ * Compatibility note: we intentionally do NOT force terminal
1766
+ * "" to be consumed here. SmarterCSV has a long-standing lenient
1767
+ * behavior for malformed tails like ...\"" in :double_quotes mode:
1768
+ * the final quote may still close the field instead of turning the
1769
+ * row into an unclosed-quote error. Issue #334 needs doubled-quote
1770
+ * precedence for ..."",... (more content follows), but we keep the
1771
+ * historical leniency for terminal ..."". */
1772
+ p++;
1773
+ } else {
1774
+ /* closing quote: only valid if followed by col_sep, row_sep, or end */
1775
+ bool valid_close = (p + 1 >= endP);
1776
+ if (!valid_close) {
1777
+ valid_close = true;
1778
+ for (long j = 0; j < col_sep_len; j++) {
1779
+ if (*(p + 1 + j) != *(col_sepP + j)) { valid_close = false; break; }
1780
+ }
1715
1781
  }
1716
- }
1717
- if (!valid_close && row_sep_len2 > 0) {
1718
- valid_close = true;
1719
- for (long j = 0; j < row_sep_len2; j++) {
1720
- if (*(p + 1 + j) != *(row_sepP2 + j)) { valid_close = false; break; }
1782
+ if (!valid_close && row_sep_len2 > 0) {
1783
+ valid_close = true;
1784
+ for (long j = 0; j < row_sep_len2; j++) {
1785
+ if (*(p + 1 + j) != *(row_sepP2 + j)) { valid_close = false; break; }
1786
+ }
1721
1787
  }
1788
+ if (valid_close) {
1789
+ in_quotes = false;
1790
+ field_started = true;
1791
+ }
1792
+ /* else: quote inside quoted field → literal */
1722
1793
  }
1723
- if (valid_close) {
1724
- in_quotes = false;
1725
- field_started = true;
1726
- }
1727
- /* else: quote inside quoted field → literal (handles "" doubling) */
1728
1794
  } else if (!field_started) {
1729
1795
  in_quotes = true; /* opening quote at field boundary */
1730
1796
  field_started = true;
@@ -1783,7 +1849,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
1783
1849
 
1784
1850
  bool has_embedded_quotes = quoted || (trimmed_len > 0 && memchr(trim_start, quote_char_val, trimmed_len));
1785
1851
 
1786
- if (!keep_bitmap || (element_count < headers_len ? keep_bitmap[element_count] : keep_extra_columns)) {
1852
+ if (!keep_bitmap || (element_count < keep_bitmap_len ? keep_bitmap[element_count] : keep_extra_columns)) {
1787
1853
  if (insert_field_into_hash(&xform, trim_start, trimmed_len, element_count, has_embedded_quotes, quote_char_val, encoding))
1788
1854
  all_blank = false;
1789
1855
  }
@@ -1794,11 +1860,15 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
1794
1860
  /* ----------------------------------------
1795
1861
  * SECTION 6: Handle blank rows
1796
1862
  * ---------------------------------------- */
1797
- if (remove_empty && all_blank) {
1798
- VALUE result = rb_ary_new_capa(2);
1799
- rb_ary_push(result, Qnil);
1800
- rb_ary_push(result, LONG2FIX(element_count));
1801
- return result;
1863
+ if (all_blank) {
1864
+ if (remove_empty) {
1865
+ VALUE result = rb_ary_new_capa(2);
1866
+ rb_ary_push(result, Qnil);
1867
+ rb_ary_push(result, LONG2FIX(element_count));
1868
+ return result;
1869
+ }
1870
+
1871
+ ensure_hash_allocated(&xform);
1802
1872
  }
1803
1873
 
1804
1874
  /* ----------------------------------------
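The reworked blank-row branch can be read as a small decision table. The Ruby sketch below is an illustration under stated assumptions: `finalize_blank_row` is a hypothetical name, and it returns the row directly, whereas the C code returns a two-element array of the row and `element_count`.

```ruby
# Hypothetical helper for illustration only; not part of SmarterCSV.
# hash is nil when lazy allocation never created a hash for the row.
def finalize_blank_row(hash, all_blank:, remove_empty:)
  return hash unless all_blank   # normal rows pass through untouched
  return nil if remove_empty     # blank row is skipped (default behavior)

  hash || {}                     # remove_empty_hashes: false keeps an empty hash
end

finalize_blank_row(nil, all_blank: true, remove_empty: true)   # row is dropped
finalize_blank_row(nil, all_blank: true, remove_empty: false)  # row is kept as an empty hash
```

Before this change the accelerated path only handled the `remove_empty && all_blank` case, so a blank row with `remove_empty_hashes: false` never had its hash allocated at this point.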
@@ -1807,7 +1877,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
1807
1877
  if (!remove_empty_values) {
1808
1878
  ensure_hash_allocated(&xform);
1809
1879
  for (long i = element_count; i < headers_len; i++) {
1810
- if (!keep_bitmap || keep_bitmap[i]) {
1880
+ if (!keep_bitmap || (i < keep_bitmap_len ? keep_bitmap[i] : keep_extra_columns)) {
1811
1881
  rb_hash_aset(xform.hash, rb_ary_entry(headers, i), Qnil);
1812
1882
  }
1813
1883
  }
@@ -410,15 +410,28 @@ module SmarterCSV
410
410
  if !allow_escaped_quotes || backslash_count % 2 == 0
411
411
  if quote_boundary_standard
412
412
  if in_quotes
413
- # closing quote: only valid if followed by col_sep, row_sep, or end of line
414
413
  next_i = i + 1
415
- if next_i >= bytesize ||
416
- line.getbyte(next_i) == col_sep_byte ||
417
- (row_sep_bytesize > 0 && line.byteslice(next_i, row_sep_bytesize) == row_sep)
414
+ if next_i + 1 < bytesize && line.getbyte(next_i) == quote_byte
415
+ # RFC doubled quote inside a quoted field ("" → ").
416
+ # Give this precedence over the closing-quote check, but only
417
+ # when another byte follows the doubled pair.
418
+ #
419
+ # Compatibility note: we intentionally do NOT force terminal
420
+ # "" to be consumed here. SmarterCSV has a long-standing lenient
421
+ # behavior for malformed tails like ...\"" in :double_quotes mode:
422
+ # the final quote may still close the field instead of turning the
423
+ # row into an unclosed-quote error. Issue #334 needs doubled-quote
424
+ # precedence for ..."",... (more content follows), but we keep the
425
+ # historical leniency for terminal ..."".
426
+ i = next_i
427
+ # closing quote: only valid if followed by col_sep, row_sep, or end of line
428
+ elsif next_i >= bytesize ||
429
+ line.getbyte(next_i) == col_sep_byte ||
430
+ (row_sep_bytesize > 0 && line.byteslice(next_i, row_sep_bytesize) == row_sep)
418
431
  in_quotes = false
419
432
  field_started = true
420
433
  end
421
- # else: quote inside quoted field → literal (handles "" doubling)
434
+ # else: quote inside quoted field → literal
422
435
  elsif !field_started # at field boundary: open quoted field
423
436
  in_quotes = true
424
437
  field_started = true
@@ -519,15 +532,28 @@ module SmarterCSV
519
532
  if !allow_escaped_quotes || backslash_count % 2 == 0
520
533
  if quote_boundary_standard
521
534
  if in_quotes
522
- # closing quote: only valid if followed by col_sep, row_sep, or end of line
523
535
  next_i = i + 1
524
- if next_i >= line_size ||
525
- line[next_i...next_i + col_sep_size] == col_sep ||
526
- (row_sep_size > 0 && line[next_i...next_i + row_sep_size] == row_sep)
536
+ if next_i + 1 < line_size && line[next_i] == quote
537
+ # RFC doubled quote inside a quoted field ("" → ").
538
+ # Give this precedence over the closing-quote check, but only
539
+ # when another character follows the doubled pair.
540
+ #
541
+ # Compatibility note: we intentionally do NOT force terminal
542
+ # "" to be consumed here. SmarterCSV has a long-standing lenient
543
+ # behavior for malformed tails like ...\"" in :double_quotes mode:
544
+ # the final quote may still close the field instead of turning the
545
+ # row into an unclosed-quote error. Issue #334 needs doubled-quote
546
+ # precedence for ..."",... (more content follows), but we keep the
547
+ # historical leniency for terminal ..."".
548
+ i = next_i
549
+ # closing quote: only valid if followed by col_sep, row_sep, or end of line
550
+ elsif next_i >= line_size ||
551
+ line[next_i...next_i + col_sep_size] == col_sep ||
552
+ (row_sep_size > 0 && line[next_i...next_i + row_sep_size] == row_sep)
527
553
  in_quotes = false
528
554
  field_started = true
529
555
  end
530
- # else: quote inside quoted field → literal (handles "" doubling)
556
+ # else: quote inside quoted field → literal
531
557
  elsif !field_started # at field boundary: open quoted field
532
558
  in_quotes = true
533
559
  field_started = true
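To make the doubled-quote precedence from the hunks above concrete, here is a minimal, self-contained Ruby sketch of the same rule. It is not SmarterCSV's parser: `split_quoted_line` is a hypothetical helper, it handles a single line only, ignores backslash escapes and row separators, and unescapes doubled quotes while scanning, which the real code may do elsewhere.

```ruby
# Hypothetical single-line splitter for illustration only; not SmarterCSV's API.
def split_quoted_line(line, col_sep: ",", quote: '"')
  fields = []
  field = +""
  in_quotes = false
  field_started = false
  i = 0
  size = line.size

  while i < size
    ch = line[i]
    if ch == quote
      if in_quotes
        next_i = i + 1
        if next_i + 1 < size && line[next_i] == quote
          # Doubled quote with more content after the pair: literal quote.
          field << quote
          i = next_i
        elsif next_i >= size || line[next_i, col_sep.size] == col_sep
          in_quotes = false      # closing quote before col_sep or end of line
        else
          field << quote         # stray quote inside a quoted field: literal
        end
      elsif !field_started
        in_quotes = true         # opening quote at field boundary
        field_started = true
      else
        field << quote
      end
    elsif !in_quotes && line[i, col_sep.size] == col_sep
      fields << field
      field = +""
      field_started = false
      i += col_sep.size
      next
    else
      field << ch
      field_started = true
    end
    i += 1
  end
  fields << field
end

split_quoted_line('a,"say ""hi"",ok",b')
```

With the older check, the second quote of the `""` that precedes the comma would have been accepted as a closing quote, splitting the field early. The sketch also keeps the lenient handling of a terminal `""`: the first quote stays literal and the last one closes the field.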
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module SmarterCSV
4
- VERSION = "1.17.0"
4
+ VERSION = "1.17.2"
5
5
  end
metadata CHANGED
@@ -1,13 +1,13 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: smarter_csv
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.17.0
4
+ version: 1.17.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Tilo Sloboda
8
8
  bindir: bin
9
9
  cert_chain: []
10
- date: 2026-05-14 00:00:00.000000000 Z
10
+ date: 2026-05-21 00:00:00.000000000 Z
11
11
  dependencies: []
12
12
  description: |
13
13
  SmarterCSV is a high-performance CSV reader and writer for Ruby focused on
@@ -39,7 +39,6 @@ files:
39
39
  - LICENSE.txt
40
40
  - README.md
41
41
  - Rakefile
42
- - TO_DO.md
43
42
  - docs/_introduction.md
44
43
  - docs/bad_row_quarantine.md
45
44
  - docs/basic_read_api.md
data/TO_DO.md DELETED
@@ -1,109 +0,0 @@
1
- # SmarterCSV v2.0 TO DO List
2
-
3
- DONE:
4
- [X] Don't call rewind on filehandle
5
- [X] use Procs for validations and transformations [issue #118](https://github.com/tilo/smarter_csv/issues/118)
6
- [X] skip file opening, allow reading from CSV string, e.g. reading from S3 file [issue #120](https://github.com/tilo/smarter_csv/issues/120). Or stream large file from S3 (linked in the issue)
7
- [X] [2.0 BUG] convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
8
- [X] add enumerable to speed up parallel processing [issue #66](https://github.com/tilo/smarter_csv/issues/66), [issue #32](https://github.com/tilo/smarter_csv/issues/32)
9
- [X] Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
10
- [X] Collect all Errors, before surfacing them. Avoid throwing an exception on the first error [issue #133](https://github.com/tilo/smarter_csv/issues/133)
11
-
12
-
13
- Partially Done:
14
- [ ] make @errors and @warnings work [issue #118](https://github.com/tilo/smarter_csv/issues/118)
15
-
16
- Still TO DO:
17
- [ ] Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)
18
-
19
- Arguably by design (e.g. exclude these columns from conversion and have them returned as a string)
20
- [ ] [2.0 BUG] :convert_values_to_numeric_unless_leading_zeros drops leading zeros [issue #151](https://github.com/tilo/smarter_csv/issues/151)
21
-
22
-
23
- ## Numeric conversion: align the Ruby fallback path with the C path (permissive)
24
-
25
- Context: `convert_values_to_numeric` runs in two places that currently DISAGREE on edge cases:
26
- - C path (`acceleration: true`, the default): `ext/smarter_csv/smarter_csv.c#try_numeric_conversion`
27
- uses `strtol`/`strtod` (base 10; float branch only entered when the field contains a `.`).
28
- - Ruby fallback (`acceleration: false`): `lib/smarter_csv/hash_transformations.rb` uses the
29
- strict regex `NUMERIC_REGEX = /\A[+-]?\d+(?:\.\d+)?\z/` plus `to_i` / `to_f`.
30
-
31
- Divergence (verified empirically):
32
- | value | C path | Ruby fallback |
33
- |-----------|------------------|-------------------|
34
- | ".5" | 0.5 (Float) | ".5" (String) |
35
- | "3." | 3.0 (Float) | "3." (String) |
36
- | "1.5e3" | 1500.0 (Float) | "1.5e3" (String) |
37
- | "1.0e10" | 10000000000.0 | "1.0e10" (String) |
38
-
39
- Decision: the C path's permissive behavior (corner cases + scientific notation) is the intended
40
- contract. Fix = make the Ruby fallback match the C path. Do NOT tighten the C path.
41
-
42
- Ruby-side changes (in `hash_transformations.rb`):
43
- 1. Swap NUMERIC_REGEX for a permissive one:
44
- /\A[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?\z/
45
- matches 1, 1., 1.5, .5, 1e3, 1.5e3, -3.14e-2, etc.; still rejects ".", "e3", "1.2.3",
46
- "1_000", "0x1F".
47
- 2. Add `DOT_BYTE = '.'.ord` (46) and include it in the first-byte fast-reject's allowed set
48
- (the C pre-check already allows a leading `.`; without this, ".5" gets rejected on byte 0).
49
- 3. Int-vs-float decision: `(v.include?('.') || v.include?('e') || v.include?('E')) ? v.to_f : v.to_i`
50
- (currently only checks for `.`).
51
-
52
- Stays a string on BOTH paths (no change needed, but worth characterization tests — there are
53
- currently NONE):
54
- - "010" => 10 (NOT octal 8 — both paths use base-10 conversion: String#to_i / strtol(.,10).
55
- A switch to Kernel#Integer() would break this. Lock it down with a test.)
56
- - "0x1F", "0b101", "0o17" => string (radix prefixes not honored by base-10 conversion)
57
- - "1_000" => string (underscores)
58
- - "1,200.00", "1.300,00" => string (thousands sep / decimal comma — strtod stops at the
59
- separator → not fully consumed; regex rejects. This is the only safe behavior; "1,200" is
60
- genuinely ambiguous. Locale-specific number formats are the caller's job via value_converters.)
61
-
62
- NOT doing: locale sniffing (read LC_NUMERIC at init and adjust the regexes). Rejected because
63
- the machine locale tells you nothing about the file's number format, it breaks reproducibility
64
- (same code + same file → different results on a US vs EU box), and `,` can't be both col_sep and
65
- decimal separator anyway. Note `strtod` IS locale-sensitive (LC_NUMERIC) but it's dormant — Ruby
66
- runs in the C/POSIX locale; don't deliberately activate it.
67
-
68
- When done: parity tests (`[true, false].each`) for the now-consistent set (.5, 3., 1.5e3, 1e3)
69
- plus characterization tests for the stays-a-string set above; CHANGELOG line noting the Ruby
70
- fallback's numeric conversion now accepts scientific notation and bare-dot forms, matching the
71
- accelerated path. Behavior change affects `acceleration: false` users only — and aligns them with
72
- the default.
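The proposed Ruby-side change can be tried out in isolation. The sketch below uses the two regexes quoted in the note above; the constant names `STRICT_NUMERIC` and `PERMISSIVE_NUMERIC` and the helper `convert_numeric` are hypothetical, and the first-byte fast-reject mentioned in step 2 is left out.

```ruby
# Regexes copied from the note above; constant and method names are hypothetical.
STRICT_NUMERIC     = /\A[+-]?\d+(?:\.\d+)?\z/
PERMISSIVE_NUMERIC = /\A[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?\z/

# Converts a field to Integer or Float when it looks numeric, else returns it unchanged.
def convert_numeric(value)
  return value unless PERMISSIVE_NUMERIC.match?(value)

  # Float when the field has a dot or an exponent, Integer otherwise (base 10).
  value.match?(/[.eE]/) ? value.to_f : value.to_i
end

%w[.5 3. 1.5e3 1e3 010 1_000 0x1F 1,200.00].each do |raw|
  puts "#{raw.inspect} strict=#{STRICT_NUMERIC.match?(raw)} converted=#{convert_numeric(raw).inspect}"
end
```

Because the conversion relies on `String#to_i` and `String#to_f`, `"010"` becomes `10` rather than an octal value, and inputs with underscores, radix prefixes, or thousands separators stay strings since the regex rejects them first.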
73
-
74
-
75
- ## Warn once when the C extension didn't load on a platform that supports it
76
-
77
- Context: `acceleration: true` is the default. When the C extension fails to build / isn't loaded,
78
- SmarterCSV silently falls back to the Ruby parser — graceful degradation by design (so the gem
79
- keeps working for users with broken toolchains, JRuby, TruffleRuby, etc.). Today there is no
80
- signal to the user that they're not getting the C path; their CSV parsing is just slower than
81
- they might have expected.
82
-
83
- Idea: emit a one-time warning when:
84
- * the C extension is NOT loaded — `!SmarterCSV::Parser.respond_to?(:parse_csv_line_c)`, AND
85
- * the platform is one where it *should* be available — `RUBY_ENGINE == 'ruby'` (MRI / CRuby).
86
- JRuby and TruffleRuby don't load CRuby C extensions natively; nothing for the user to do.
87
-
88
- Where to fire:
89
- * NOT at `require 'smarter_csv'` time — Rails.logger typically isn't set up yet, so any
90
- "route through the warnings system" code would just fall through to `Kernel#warn` anyway,
91
- and the warning would land in stderr instead of the Rails log where ops would see it.
92
- * At first `Reader.new` / `SmarterCSV.process` call — Rails has booted, the existing
93
- routing-through-Rails.logger-or-Kernel#warn infra works, and the existing deduped warnings
94
- histogram means it fires once per process regardless of how many parse calls.
95
-
96
- Implementation sketch:
97
- * Add a new warning code (e.g. `:c_extension_unavailable`) alongside the existing ones
98
- (`:chunk_size_default`, `:header_a_method`, `:utf8_missing_binary_mode`, ...).
99
- * Severity `:warn`. Suppressible via the existing `verbose: :quiet`.
100
- * Message points at the fix — e.g. "C acceleration extension not loaded on this Ruby; using
101
- Ruby parser. To enable acceleration, reinstall with `gem pristine smarter_csv` and check
102
- the build log." Plus a link/pointer to a troubleshooting section in the docs.
103
-
104
- Bonus: add a public predicate `SmarterCSV.acceleration_available?` returning
105
- `Parser.respond_to?(:parse_csv_line_c)`. Zero noise, useful for scripts / CI / future spec
106
- files that want to branch on the environment fact rather than guess.
107
-
108
- NOT doing: a banner at `require` time (every Rails app would print it at boot, too noisy);
109
- warning when `acceleration: false` was explicitly chosen (the user knows what they're doing).
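The warn-once idea and the proposed predicate could look roughly like the sketch below. Everything here is an assumption for illustration: `AccelerationCheckSketch` and its nested `Parser` are stand-ins so the example runs without the gem, and the message text and `warn_once_if_unaccelerated` name are made up. Only the `respond_to?(:parse_csv_line_c)` check and the `RUBY_ENGINE == 'ruby'` condition come from the note above.

```ruby
# Stand-in module for illustration only; not SmarterCSV's actual code.
module AccelerationCheckSketch
  # Stand-in for SmarterCSV::Parser. Per the note above, the real parser
  # responds to :parse_csv_line_c only when the C extension loaded.
  module Parser; end

  @warned = false

  # True when the C entry point is defined on the parser.
  def self.acceleration_available?
    Parser.respond_to?(:parse_csv_line_c)
  end

  # Emits the warning at most once per process, and only on MRI / CRuby.
  # Returns true when a warning was written, false otherwise.
  def self.warn_once_if_unaccelerated(engine: RUBY_ENGINE, out: $stderr)
    return false if @warned || acceleration_available? || engine != "ruby"

    @warned = true
    out.puts "C acceleration extension not loaded on this Ruby; using Ruby parser."
    true
  end
end

AccelerationCheckSketch.warn_once_if_unaccelerated(engine: "jruby") # stays silent
```

Calling this from the first reader construction instead of at `require` time would match the "Where to fire" reasoning above, since logging is more likely to be configured by then.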