smarter_csv 1.17.4 → 1.18.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 140f9359c26f8903b9075faeb59e9c1fc4b5c4b9dd5fcef664e12bb53fe13073
4
- data.tar.gz: 0e7be3195610bcddb77870744a24f0eee9431643c4c28fcbea12d4b8663bb2db
3
+ metadata.gz: 3335e39a1c0792f01df9e95401c7f3885c49a0d64eeb9c76e5c20e25d01a62f5
4
+ data.tar.gz: e43f00228777b56fc1ee0814a74acaa6a23c51fe8da6f64e42ad92fe1b54002f
5
5
  SHA512:
6
- metadata.gz: 176daa024372ade1d6431e5e4fd5355175cd1ebf03e7a216d70b8ff4554eb259a023969783b53118eba6c3fbf747a08e286ed7daad455ebc5cf8d7b61475d3d1
7
- data.tar.gz: 8b1ca7263e5a54fc642c8f76bfd56577f6056784910acdcdcd0f18c6222c8793280016e749f2e08a61485bf8255cf341250ac1250a8d3ff424f8b07b1edbd51b
6
+ metadata.gz: 2abcd136f30d284c3c27cbd2b6c9782aec4235ec62cc27ea7620380ae9efc889f9f1c05c10ad957d6e7c84d65be75d35d96d4ac59ba5780dc5e65e0151c661e6
7
+ data.tar.gz: 6c61062c08d0a89dea2c91a7faafd88f6d09c88015ad8c7f4facb2eb474b44b18e0c5eaffc36fa5e2885f0b2c02487497f2460d66ce2433f62c09bf42d92a4ce
data/CHANGELOG.md CHANGED
@@ -4,6 +4,47 @@
4
4
  > [!TIP]
5
5
  > **Upgrading?** The [SmarterCSV Upgrade Wizard](https://tilo.github.io/smarter_csv/upgrade_wizard.html) walks you through what (if anything) you need to change for your specific version. Most steps do not require any changes.
6
6
 
7
+ ## 1.18.0 (2026-06-17)
8
+
9
+ This release focuses on performance and on automatic conversion of decimals to `BigDecimal` or `Float`, which preserves precision and adds support for scientific notation.
10
+
11
+ ⚠️ This version is particularly interesting if you have geolocation, scientific, or high-precision data.
12
+
13
+ ### New Features
14
+
15
+ - **`decimal_precision` option** (`:auto` default, or `:float` / `:bigdecimal`) — controls how decimal values are converted. `:auto` returns a `Float` unless the value carries more than 16 significant digits, in which case it returns a `BigDecimal` so no precision is lost; `:float` always returns `Float`; `:bigdecimal` always returns `BigDecimal`. Integers are unaffected (always `Integer`). Works identically on the C and Ruby paths. (Ruby's standard-library CSV has no high-precision option — its `:numeric`/`:float` converters use `Float()` and lose precision.)
16
+ - **Float** conversion on the C path now uses the fast **Eisel-Lemire** algorithm (fast_float, vendored) for mantissas up to 19 significant digits — correctly rounded, bit-for-bit identical to `String#to_f` — with a `strtod` fallback beyond that (more than 19 digits / extreme exponents). High-precision values that become `BigDecimal` under `:auto`/`:bigdecimal` are parsed by Ruby's `BigDecimal`.
17
+
18
+ ### Behavior Changes
19
+
20
+ - **Scientific notation now converts to a number** (e.g. `"1e3"`, `"1.5e-5"`, `"6.022e23"`). Previously the Ruby path left these as Strings and the C path was inconsistent.
21
+ - **The C and Ruby numeric-conversion paths are now aligned.** Bare-dot forms like `".5"` and `"3."` stay Strings on **both** paths (the shared grammar requires an integer part and, when a dot is present, a fraction digit). Previously the C path converted these and the Ruby path did not.
22
+ - With the default `decimal_precision: :auto`, decimal values carrying more than 16 significant digits are now returned as `BigDecimal` instead of `Float`. Pass `decimal_precision: :float` to keep the previous always-`Float` behavior.
23
+ - `bigdecimal` is now a runtime dependency (it is no longer a default gem on Ruby 3.4+).
24
+
25
+ ### Performance
26
+
27
+ The C-accelerated path is faster across the board, **up to ~1.5× on the right shapes** — numeric-heavy data and backslash-escaped quoted fields — and ~1.04–1.08× on typical files.
28
+
29
+ - Eisel-Lemire (Mushtak-Lemire) algorithm on the C path to convert decimals to `Float`; values that become `BigDecimal` are parsed by Ruby's `BigDecimal`. Numeric-heavy data (many float/decimal columns) parses significantly faster.
30
+ - SIMD scanner for backslash-escaped quoted fields (C-path), using NEON (arm64) and SSE2 (x86-64) with a scalar fallback. Speeds up `quote_escaping: :backslash` parsing of long quoted fields.
31
+
32
+ | File | C-path | driver |
33
+ |---------------------------------|----------------------------------|-----------------------|
34
+ | backslash_long_fields_60k | 1.48× faster (0.1880s → 0.1273s) | SIMD quote/backslash scanner |
35
+ | sensor_data_50krows_50cols | 1.40× faster (0.2763s → 0.1975s) | Eisel-Lemire numeric conversion |
36
+
37
+ ### Improvements
38
+
39
+ - Improved robustness of symbol-valued enum option processing.
40
+
41
+ ### Tests
42
+
43
+ - Added parity tests for long quoted-field scanning across 16-byte boundaries, running on both the C and Ruby paths.
44
+ - Added tests for string-to-symbol coercion of the enum options.
45
+
46
+
47
+
7
48
  ## 1.17.4 (2026-06-03)
8
49
 
9
50
  ### Bug Fix
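The 1.18.0 entries above describe the numeric grammar and the `:auto` rule in prose. The following is a minimal Ruby sketch of that decision, not SmarterCSV's implementation. It assumes the grammar quoted as `NUMERIC_REGEX` in the C extension comment elsewhere in this diff; the helper names `significant_digits` and `convert_numeric` are hypothetical.

```ruby
require 'bigdecimal'

# Grammar as quoted in the C extension comment: an integer part is required,
# a dot needs a fraction digit, an exponent needs an exponent digit.
NUMERIC_REGEX = /\A[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?\z/

# Hypothetical helper: count mantissa digits, leading zeros excluded.
def significant_digits(token)
  mantissa = token.sub(/[eE].*\z/, '').delete('+-.')
  mantissa.sub(/\A0+/, '').length
end

# Hypothetical helper mirroring the documented option values.
def convert_numeric(token, decimal_precision: :auto)
  return token unless NUMERIC_REGEX.match?(token)       # ".5", "3." stay Strings
  return Integer(token, 10) unless token.match?(/[.eE]/) # integers are unaffected

  case decimal_precision
  when :bigdecimal then BigDecimal(token)
  when :float      then token.to_f
  else significant_digits(token) > 16 ? BigDecimal(token) : token.to_f
  end
end

p convert_numeric('3.14')                  # Float
p convert_numeric('1234567890.123456789')  # BigDecimal (19 significant digits)
p convert_numeric('.5')                    # unchanged String
```

The threshold of 16 comes straight from the changelog text; everything else in the sketch is an illustration of the rule rather than a description of the gem's internals.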
data/README.md CHANGED
@@ -15,6 +15,9 @@
15
15
 
16
16
  > See [**Ruby CSV Pitfalls**](docs/ruby_csv_pitfalls.md) for 10 ways `CSV.read` silently corrupts or loses data, and how SmarterCSV handles them.
17
17
 
18
+ > [!TIP]
19
+ > **No silent precision loss (new in 1.18.0).** For scientific data, GPS/geo coordinates, and financial figures — which routinely carry 16+ significant digits — Ruby's standard CSV converts with `Float()`, so a value like `1234567890.123456789` is silently rounded to `1234567890.1234567`. SmarterCSV's default `decimal_precision: :auto` returns a `BigDecimal` for values beyond 16 significant digits (and `Float` otherwise) — full precision, no data loss. Floats are decoded with the Eisel-Lemire algorithm: correctly rounded, bit-for-bit identical to `String#to_f`.
20
+
18
21
  Beyond raw speed, SmarterCSV is designed to provide a significantly more convenient and developer-friendly interface than traditional CSV libraries. Instead of returning raw arrays that require substantial post-processing, SmarterCSV produces Rails-ready hashes for each row, making the data immediately usable with ActiveRecord, Sidekiq pipelines, parallel processing, and JSON-based workflows such as S3.
19
22
 
20
23
  In a Rails app, warnings auto-route through `Rails.logger` and instrumentation hooks compose with `ActiveSupport::Notifications` — no setup required. Outside Rails, warnings fall back to `$stderr` and the same APIs work without any framework dependency.
@@ -89,6 +92,8 @@ rows = SmarterCSV.process('data.csv')
89
92
  data = SmarterCSV.parse(csv_string)
90
93
  ```
91
94
 
95
+ Numeric conversion is also more accurate: where Ruby's `:numeric`/`:float` converters round high-precision decimals through `Float()`, SmarterCSV's default `decimal_precision: :auto` returns a `BigDecimal` past 16 significant digits, so no precision is lost (pass `decimal_precision: :float` for like-for-like `Float` output).
96
+
92
97
  * See [**Migrating from Ruby CSV**](docs/migrating_from_csv.md) for a full comparison of options, behavior differences, and a quick-reference table.
93
98
 
94
99
  ## Examples
@@ -156,6 +156,39 @@ data = SmarterCSV.process(file,
156
156
  convert_values_to_numeric: { only: [:quantity, :price] })
157
157
  ```
158
158
 
159
+ Scientific notation (e.g. `"1.5e3"`, `"6.022e23"`) is recognized and converted too. Bare-dot forms like `".5"` and `"3."` are left as Strings (they are not valid numbers here). Integers and floats convert identically on the C-accelerated and pure-Ruby paths.
160
+
161
+ ---
162
+
163
+ ## `decimal_precision`
164
+
165
+ **Default: `:auto`**
166
+
167
+ Controls how decimal values (those with a `.` or an exponent) are converted. Integers are unaffected — they are always returned as `Integer`.
168
+
169
+ | Value | Result |
170
+ |---------------|-----------------------------------------------------------------------------------------|
171
+ | `:auto` | `Float`, unless the value carries more than 16 significant digits — then `BigDecimal`. |
172
+ | `:float` | Always `Float` (correctly rounded; matches `String#to_f`). |
173
+ | `:bigdecimal` | Always `BigDecimal` (full precision). |
174
+
175
+ ```ruby
176
+ # :auto (default) — keeps full precision only when needed
177
+ SmarterCSV.process(file)
178
+ # "3.14" => 3.14 (Float)
179
+ # "1234567890.123456789" => 0.1234567890123456789e10 (BigDecimal — >16 sig digits)
180
+
181
+ # :float — always Float (faster, may lose precision on long decimals)
182
+ SmarterCSV.process(file, decimal_precision: :float)
183
+ # "1234567890.123456789" => 1234567890.1234567 (Float)
184
+
185
+ # :bigdecimal — always BigDecimal
186
+ SmarterCSV.process(file, decimal_precision: :bigdecimal)
187
+ # "3.14" => 0.314e1 (BigDecimal)
188
+ ```
189
+
190
+ Unlike Ruby's standard-library CSV — whose `:numeric`/`:float` converters use `Float()` and silently lose precision — `:auto` preserves high-precision decimals as `BigDecimal`. Decimal values are decoded on the C path with the Eisel-Lemire algorithm (correctly rounded, identical to `String#to_f`).
191
+
159
192
  ---
160
193
 
161
194
  ## `remove_empty_hashes`
@@ -223,6 +223,24 @@ rows = SmarterCSV.process('sample.csv',
223
223
  convert_values_to_numeric: { except: [:zip_code, :phone, :account_number] })
224
224
  ```
225
225
 
226
+ **High-precision decimals — scientific data and geo coordinates.** GPS/geo coordinates, scientific measurements, and financial figures routinely carry 16+ significant digits, where Ruby's `Float()`-based conversion (`converters: :numeric` / `:float`) silently rounds the value. SmarterCSV's default `decimal_precision: :auto` returns a `BigDecimal` once a value exceeds 16 significant digits (and a `Float` otherwise), so the full value is preserved; scientific notation (`6.022e23`, `1.6e-19`) is recognized as numeric too.
227
+
228
+ **With Ruby CSV (precision lost):**
229
+ ```ruby
230
+ CSV.read('locations.csv', headers: true, converters: :float).first['lat']
231
+ # => -122.42200352825247 ← Float() dropped the last digits of -122.422003528252475
232
+ ```
233
+
234
+ **With SmarterCSV (full precision kept):**
235
+ ```ruby
236
+ SmarterCSV.process('locations.csv').first[:lat]
237
+ # => -0.122422003528252475e3 (BigDecimal — all 18 significant digits preserved)
238
+
239
+ # Force Float everywhere, like-for-like with Ruby CSV:
240
+ SmarterCSV.process('locations.csv', decimal_precision: :float).first[:lat]
241
+ # => -122.42200352825247 (Float)
242
+ ```
243
+
226
244
  ### 3. Empty values are removed by default
227
245
 
228
246
  SmarterCSV drops key/value pairs where the value is `nil` or blank
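The precision-loss comparison in the hunk above can be reproduced with the standard library alone, without any CSV parsing. This sketch uses the latitude string from that example and only `Float()` and `BigDecimal()`; it shows the underlying effect and makes no claim about how either CSV library is implemented.

```ruby
require 'bigdecimal'

raw = '-122.422003528252475' # 18 significant digits, taken from the example above

as_float   = Float(raw)      # what a Float()-based converter produces
as_decimal = BigDecimal(raw) # what a BigDecimal-based conversion keeps

puts as_float.to_s           # shortest round-trip form of the nearest double
puts as_decimal.to_s('F')    # plain decimal notation

# Float#to_s needs at most 17 significant digits, so it cannot reproduce
# an 18-digit input; the BigDecimal round-trips exactly.
puts as_float.to_s == raw         # false
puts as_decimal.to_s('F') == raw  # true
```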
data/docs/options.md CHANGED
@@ -121,7 +121,8 @@ See [Parsing Strategy](./parsing_strategy.md) for full details on quote handling
121
121
  | Option | Default | Explanation |
122
122
  |--------|---------|-------------|
123
123
  | `:strip_whitespace` | `true` | Remove whitespace before/after values and headers. |
124
- | `:convert_values_to_numeric` | `true` | Convert strings containing integers or floats to the appropriate numeric type. Accepts `{except: [:key1, :key2]}` or `{only: :key3}` to limit which columns. |
124
+ | `:convert_values_to_numeric` | `true` | Convert strings containing integers or floats (including scientific notation like `1.5e3`) to the appropriate numeric type. Accepts `{except: [:key1, :key2]}` or `{only: :key3}` to limit which columns. |
125
+ | `:decimal_precision` | `:auto` | How decimals are converted: `:auto` returns `Float` but `BigDecimal` above 16 significant digits (no precision loss); `:float` always returns `Float`; `:bigdecimal` always returns `BigDecimal`. Integers are unaffected. |
125
126
  | `:value_converters` | `nil` | Hash of `:header => converter`; converter can be a lambda/Proc or a class implementing `self.convert(value)`. See [Value Converters](./value_converters.md). |
126
127
  | `:remove_empty_values` | `true` | Remove key/value pairs where the value is `nil`, empty, or whitespace-only — any Unicode whitespace, same as Ruby's `String#blank?`. |
127
128
  | `:remove_zero_values` | `false` | Remove key/value pairs whose value is zero — numeric `0` / `0.0`, or any textual form of zero (`"0"`, `"0.0"`, `"00.00"`, `"+0"`, `"-0.0"`, …). |
@@ -7,6 +7,14 @@
7
7
  #include <stdlib.h>
8
8
  #include <errno.h>
9
9
 
10
+ #ifdef __ARM_NEON
11
+ #include <arm_neon.h>
12
+ #elif defined(__SSE2__)
13
+ #include <immintrin.h>
14
+ #endif
15
+
16
+ #include "vendor/eisel_lemire.h" /* Eisel-Lemire decimal->double, correctly rounded (fast_float) */
17
+
10
18
  #ifndef bool
11
19
  #define bool int
12
20
  #define false ((bool)0)
@@ -41,6 +49,8 @@ static ID id_only, id_except, id_quote_boundary;
41
49
  static ID id_only_headers, id_except_headers, id_keep_cols, id_strict;
42
50
  static ID id_keep_bitmap, id_keep_extra_cols, id_early_exit_after_sym;
43
51
  static ID id_backslash, id_standard;
52
+ static ID id_decimal_precision, id_float, id_bigdecimal;
53
+ static ID id_BigDecimal; /* the Kernel#BigDecimal() method (require 'bigdecimal' done in Ruby) */
44
54
 
45
55
  /* ================================================================================
46
56
  * ParseContext — wraps all per-file parse options as a GC-managed TypedData object.
@@ -70,6 +80,9 @@ typedef struct {
70
80
  /* Numeric conversion: 0=off, 1=all, 2=only listed keys, 3=except listed keys */
71
81
  int numeric_mode;
72
82
 
83
+ /* Decimal handling: 0=float, 1=auto (BigDecimal above 16 sig digits), 2=bigdecimal */
84
+ int decimal_precision;
85
+
73
86
  /* Column filter bitmap (xmalloc'd; NULL when no filtering active) */
74
87
  bool *keep_bitmap;
75
88
  long keep_bitmap_len;
@@ -133,6 +146,51 @@ static const rb_data_type_t parse_context_type = {
133
146
  RUBY_TYPED_FREE_IMMEDIATELY | RUBY_TYPED_WB_PROTECTED
134
147
  };
135
148
 
149
+ /* Scan [p, end) for the first `quote` char or backslash; returns a pointer to it,
150
+ * or `end` if neither occurs. NEON (arm64) or SSE2 (x86-64) processes 16 bytes per
151
+ * iteration; scalar fallback elsewhere. Ported from smarter_json's fj_scan_str.
152
+ *
153
+ * Used by the quoted-field slow path in :backslash escaping mode, where the only bytes
154
+ * that can change parser state inside a quoted field are the quote char (closing /
155
+ * doubled) and the backslash (escape). Bulk-skipping the plain content between them
156
+ * keeps the byte-by-byte state machine's behavior but avoids stepping every byte.
157
+ * In RFC mode the slow path uses a plain memchr-to-quote instead (only one byte class
158
+ * matters there), so this two-class scan is reserved for backslash mode. */
159
+ static inline const char *scan_quote_or_backslash(const char *p, const char *end, char quote) {
160
+ #ifdef __ARM_NEON
161
+ const uint8x16_t vq = vdupq_n_u8((uint8_t)quote);
162
+ const uint8x16_t vbs = vdupq_n_u8((uint8_t)'\\');
163
+ while (p + 16 <= end) {
164
+ uint8x16_t chunk = vld1q_u8((const uint8_t *)p);
165
+ uint8x16_t m = vorrq_u8(vceqq_u8(chunk, vq), vceqq_u8(chunk, vbs));
166
+ /* movemask emulation (Oj's technique): pack to 4 bits/byte, then ctz/4. */
167
+ uint8x8_t res = vshrn_n_u16(vreinterpretq_u16_u8(m), 4);
168
+ uint64_t mask = vget_lane_u64(vreinterpret_u64_u8(res), 0);
169
+ if (__builtin_expect(mask != 0, 0)) { /* most 16-byte chunks contain neither */
170
+ mask &= 0x8888888888888888ull;
171
+ return p + (__builtin_ctzll(mask) >> 2);
172
+ }
173
+ p += 16;
174
+ }
175
+ #elif defined(__SSE2__)
176
+ const __m128i vq = _mm_set1_epi8(quote);
177
+ const __m128i vbs = _mm_set1_epi8('\\');
178
+ while (p + 16 <= end) {
179
+ __m128i chunk = _mm_loadu_si128((const __m128i *)p);
180
+ __m128i m = _mm_or_si128(_mm_cmpeq_epi8(chunk, vq), _mm_cmpeq_epi8(chunk, vbs));
181
+ int mask = _mm_movemask_epi8(m); /* one bit per byte that matched */
182
+ if (__builtin_expect(mask != 0, 0)) { /* most 16-byte chunks contain neither */
183
+ return p + __builtin_ctz(mask);
184
+ }
185
+ p += 16;
186
+ }
187
+ #endif
188
+ for (; p < end; p++) {
189
+ if (*p == quote || *p == '\\') return p;
190
+ }
191
+ return end;
192
+ }
193
+
136
194
  static VALUE unescape_quotes(char *str, long len, char quote_char, rb_encoding *encoding) {
137
195
  // Fast path: scan for any doubled quote pair. If none present, the field has
138
196
  // nothing to unescape — emit it directly via rb_enc_str_new and skip the
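The contract of `scan_quote_or_backslash` in the hunk above (return the position of the first quote or backslash, or the end when neither occurs) can be modeled in Ruby to see why a chunked scan and a byte-by-byte scan must agree. This is a hypothetical reference model for illustration: `scan_scalar` and `scan_chunked` are invented names, and the 16-byte slice only imitates the shape of the SIMD loop, not its speed.

```ruby
BACKSLASH = '\\'.ord

# Byte-by-byte reference: index of the first quote or backslash at or after
# `from`, or the byte length when neither occurs.
def scan_scalar(str, from, quote)
  q = quote.ord
  i = from
  i += 1 while i < str.bytesize && str.getbyte(i) != q && str.getbyte(i) != BACKSLASH
  i
end

# Same contract, but inspecting 16 bytes per iteration and finishing the
# tail with the scalar loop, like the NEON/SSE2 branches followed by the
# scalar fallback.
def scan_chunked(str, from, quote)
  q = quote.ord
  i = from
  while i + 16 <= str.bytesize
    hit = str.byteslice(i, 16).bytes.index { |b| b == q || b == BACKSLASH }
    return i + hit if hit
    i += 16
  end
  scan_scalar(str, i, quote)
end

field = ('x' * 20) + '\\' + '"rest'
p scan_scalar(field, 0, '"')   # 20
p scan_chunked(field, 0, '"')  # 20
```

Running both functions over inputs whose special byte sits just before, on, and just after a 16-byte boundary is one way to write the kind of parity test the changelog mentions.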
@@ -386,6 +444,20 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
386
444
  backslash_count = 0;
387
445
  field_started = false; // reset for next field
388
446
  } else {
447
+ /* Backslash mode: SIMD scan-ahead (NEON/SSE2, scalar fallback) to the next quote OR backslash (Opt #7).
448
+ * Inside a quoted field the only state-changing bytes are the quote char and the
449
+ * backslash; bulk-skip the plain content between them. Skipped bytes are plain
450
+ * content, which the byte-by-byte loop resets backslash_count to 0 on, so reset
451
+ * it here whenever we actually move p. */
452
+ if (allow_escaped_quotes && in_quotes) {
453
+ const char *hit = scan_quote_or_backslash(p, endP, quote_char_val);
454
+ if (hit != p) {
455
+ backslash_count = 0;
456
+ p = (char *)hit;
457
+ if (p == endP) continue; /* no quote/backslash before end → unclosed */
458
+ }
459
+ }
460
+
389
461
  if (allow_escaped_quotes && *p == '\\') {
390
462
  backslash_count++;
391
463
  if (__builtin_expect(quote_boundary_standard, 1) && !in_quotes) field_started = true;
@@ -525,47 +597,101 @@ static inline VALUE get_key_for_index(long index, VALUE headers, long headers_le
525
597
  * Handles overflow: if strtol overflows (ERANGE), falls back to rb_cstr_to_inum
526
598
  * which produces a Ruby Bignum.
527
599
  */
528
- static inline VALUE try_numeric_conversion(char *trim_start, long trimmed_len) {
529
- // Quick pre-check: first char must be digit, +, -, or .
530
- char first = trim_start[0];
531
- if (!((first >= '0' && first <= '9') || first == '+' || first == '-' || first == '.')) {
600
+ static inline VALUE try_numeric_conversion(char *s, long n, int decimal_precision) {
601
+ // Quick pre-check: first char must be a digit or a sign.
602
+ char first = s[0];
603
+ if (!((first >= '0' && first <= '9') || first == '+' || first == '-')) {
532
604
  return Qundef;
533
605
  }
534
606
 
535
- // Need null-terminated string for strtol/strtod; use stack buffer for typical fields
536
- if (trimmed_len >= 63) return Qundef; // very long fields are unlikely to be simple numbers
537
-
538
- char num_buf[64];
539
- memcpy(num_buf, trim_start, trimmed_len);
540
- num_buf[trimmed_len] = '\0';
541
-
542
- char *endptr;
543
-
544
- // Try integer first (most common numeric type in CSV)
545
- // Don't try integer if field starts with '.' (e.g., ".5")
546
- if (first != '.') {
547
- errno = 0;
548
- long int_val = strtol(num_buf, &endptr, 10);
549
- if (endptr == num_buf + trimmed_len) {
550
- // Entire string was consumed valid integer
551
- if (errno == ERANGE) {
552
- // Overflow: fall back to Ruby Bignum
553
- return rb_cstr_to_inum(num_buf, 10, false);
607
+ /* Single pass: validate the token against the same grammar as the Ruby path's
608
+ * NUMERIC_REGEX = /\A[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?\z/ and, in the same pass,
609
+ * extract everything the fast paths need:
610
+ * - mantissa value m10 (exact for <= 19 digits; `overflow` flags beyond)
611
+ * - significant-digit count `sig` (leading zeros excluded; matches the Ruby
612
+ * significant_digits helper / Oj dec_cnt) — drives the :auto Float/BigDecimal split
613
+ * - base-10 exponent e10 (from the fraction length and any explicit exponent)
614
+ * Anything the grammar rejects returns Qundef (stays a String), keeping the C and
615
+ * Ruby paths byte-identical on what does and does not convert. */
616
+ long i = 0;
617
+ int neg = 0;
618
+ if (s[i] == '+' || s[i] == '-') { neg = (s[i] == '-'); i++; }
619
+
620
+ uint64_t m10 = 0;
621
+ int m10digits = 0; /* mantissa digits accumulated into m10 (capped at 19) */
622
+ long sig = 0; /* significant digits (leading zeros excluded) */
623
+ int sig_started = 0;
624
+ bool overflow = false;
625
+ long int_digits = 0, frac_digits = 0;
626
+ bool seen_dot = false, seen_exp = false, any_digit = false, exp_any = false;
627
+ int64_t exp_val = 0; int exp_neg = 0;
628
+
629
+ for (; i < n; i++) {
630
+ char c = s[i];
631
+ if (c >= '0' && c <= '9') {
632
+ any_digit = true;
633
+ if (!seen_exp) {
634
+ if (seen_dot) frac_digits++; else int_digits++;
635
+ if (sig_started) sig++;
636
+ else if (c != '0') { sig_started = 1; sig = 1; }
637
+ if (m10digits < 19) { m10 = m10 * 10 + (uint64_t)(c - '0'); m10digits++; }
638
+ else overflow = true;
639
+ } else {
640
+ exp_any = true;
641
+ exp_val = exp_val * 10 + (c - '0');
642
+ if (exp_val > 1000000) overflow = true; /* extreme exponent → strtod fallback */
554
643
  }
555
- return LONG2NUM(int_val);
644
+ } else if (c == '.' && !seen_dot && !seen_exp) {
645
+ seen_dot = true;
646
+ } else if ((c == 'e' || c == 'E') && !seen_exp && any_digit) {
647
+ seen_exp = true;
648
+ if (i + 1 < n && (s[i + 1] == '+' || s[i + 1] == '-')) { exp_neg = (s[i + 1] == '-'); i++; }
649
+ } else {
650
+ return Qundef; /* invalid char for a number → not numeric */
556
651
  }
557
652
  }
558
653
 
559
- // Try float (only if contains '.')
560
- if (memchr(num_buf, '.', trimmed_len)) {
561
- errno = 0;
562
- double float_val = strtod(num_buf, &endptr);
563
- if (endptr == num_buf + trimmed_len && errno != ERANGE) {
564
- return DBL2NUM(float_val);
654
+ /* Enforce NUMERIC_REGEX exactly: an integer part is required; a dot requires a
655
+ * fraction digit; an exponent requires an exponent digit. */
656
+ if (int_digits == 0) return Qundef;
657
+ if (seen_dot && frac_digits == 0) return Qundef;
658
+ if (seen_exp && !exp_any) return Qundef;
659
+
660
+ bool is_decimal = seen_dot || seen_exp;
661
+
662
+ if (!is_decimal) {
663
+ /* Integer. Fast path when it fits in a long; otherwise a Ruby Integer/Bignum. */
664
+ if (!overflow && m10digits <= 18) {
665
+ long v = (long)m10;
666
+ return LONG2NUM(neg ? -v : v);
565
667
  }
668
+ VALUE str = rb_str_new(s, n);
669
+ return rb_cstr_to_inum(RSTRING_PTR(str), 10, false);
566
670
  }
567
671
 
568
- return Qundef; // not numeric
672
+ /* Decimal (has a '.' or an exponent) — honor decimal_precision. 0=float, 1=auto, 2=bigdecimal */
673
+ if (decimal_precision == 2 || (decimal_precision == 1 && sig > 16)) {
674
+ VALUE str = rb_str_new(s, n);
675
+ return rb_funcall(rb_cObject, id_BigDecimal, 1, str);
676
+ }
677
+
678
+ /* Float. base-10 exponent = explicit exponent minus the fraction length. */
679
+ int64_t e10 = (exp_neg ? -exp_val : exp_val) - (int64_t)frac_digits;
680
+ double d;
681
+ if (!overflow && m10digits >= 1 && m10digits <= 19 && ((long)m10digits + e10) >= -307) {
682
+ /* Eisel-Lemire is correctly-rounded for any nonzero mantissa that fits exactly in a
683
+ * uint64 — i.e. up to 19 significant digits (the max 19-digit value ~1.0e19 is below
684
+ * UINT64_MAX ~1.8e19). Verified bit-for-bit vs the stdlib over 1..19-digit ties. */
685
+ d = (m10 == 0) ? (neg ? -0.0 : 0.0) : fj_eisel_lemire_s2d(e10, m10, neg);
686
+ } else {
687
+ /* >19 digits / extreme or subnormal exponent: fall back to Ruby's own correctly-rounded
688
+ * strtod (rb_cstr_to_dbl) — the exact conversion String#to_f uses — so the C path and the
689
+ * Ruby path produce the identical double on every platform, not just where the system
690
+ * strtod happens to be correctly rounded. The token is pre-validated, so badcheck=0. */
691
+ VALUE str = rb_str_new(s, n);
692
+ d = rb_cstr_to_dbl(RSTRING_PTR(str), 0);
693
+ }
694
+ return DBL2NUM(d);
569
695
  }
570
696
 
571
697
  /*
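The Float branch in the hunk above passes Eisel-Lemire an integer mantissa `m10` and a base-10 exponent `e10`, where `e10` is the explicit exponent minus the fraction length. A small Ruby sketch of that decomposition, under the assumption that the token has already passed the numeric grammar; `decompose` and `exact_value` are hypothetical names, and Ruby's arbitrary-size Integer means the 19-digit cap and `overflow` flag of the C code are not modeled here.

```ruby
# Split a decimal token into sign, integer mantissa m10 and base-10 exponent
# e10, so that the exact value equals sign * m10 * 10**e10.
def decompose(token)
  m = /\A([+-]?)(\d+)(?:\.(\d+))?(?:[eE]([+-]?\d+))?\z/.match(token)
  return nil unless m

  sign = m[1] == '-' ? -1 : 1
  frac = m[3].to_s                 # "" when there is no fraction
  m10  = (m[2] + frac).to_i        # all mantissa digits as one integer
  e10  = m[4].to_i - frac.length   # explicit exponent minus fraction length
  [sign, m10, e10]
end

# Exact rational value, useful for checking the decomposition.
def exact_value(token)
  sign, m10, e10 = decompose(token)
  sign * m10 * Rational(10)**e10
end

p decompose('1.5e-5')    # [1, 15, -6]
p decompose('6.022e23')  # [1, 6022, 20]
p exact_value('-122.5')  # (-245/2)
```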
@@ -614,6 +740,7 @@ typedef struct {
614
740
  long headers_len;
615
741
  long hash_capa; // Pre-computed capacity for lazy hash allocation
616
742
  int numeric_mode; // 0=off, 1=all, 2=only, 3=except
743
+ int decimal_precision; // 0=float, 1=auto (BigDecimal above 16 sig digits), 2=bigdecimal
617
744
  bool remove_empty_values;
618
745
  bool remove_zero_values;
619
746
  } field_transform_opts;
@@ -705,7 +832,7 @@ static inline __attribute__((always_inline)) bool insert_field_into_hash(
705
832
  (opts->numeric_mode == 2 && rb_ary_includes(opts->numeric_keys, key) == Qtrue) ||
706
833
  (opts->numeric_mode == 3 && rb_ary_includes(opts->numeric_keys, key) != Qtrue);
707
834
  if (do_convert) {
708
- VALUE numeric = try_numeric_conversion(trim_start, trimmed_len);
835
+ VALUE numeric = try_numeric_conversion(trim_start, trimmed_len, opts->decimal_precision);
709
836
  if (numeric != Qundef) {
710
837
  ensure_hash_allocated(opts);
711
838
  rb_hash_aset(opts->hash, key, numeric);
@@ -752,6 +879,18 @@ void parse_numeric_option(VALUE options_hash, int *out_mode, VALUE *out_keys) {
752
879
  }
753
880
  }
754
881
 
882
+ /* Read decimal_precision into 0=float, 1=auto, 2=bigdecimal. Default :auto (1).
883
+ * The option is validated and coerced to a symbol on the Ruby side before we get here. */
884
+ static inline int parse_decimal_precision(VALUE options_hash) {
885
+ VALUE v = rb_hash_aref(options_hash, ID2SYM(id_decimal_precision));
886
+ if (RB_TYPE_P(v, T_SYMBOL)) {
887
+ ID s = SYM2ID(v);
888
+ if (s == id_float) return 0;
889
+ if (s == id_bigdecimal) return 2;
890
+ }
891
+ return 1; // :auto (also the default when unset)
892
+ }
893
+
755
894
  /*
756
895
  * ================================================================================
757
896
  * rb_parse_line_to_hash - Parse CSV line directly into a Ruby Hash
@@ -826,6 +965,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
826
965
  int numeric_mode = 0;
827
966
  VALUE numeric_keys = Qnil;
828
967
  parse_numeric_option(options_hash, &numeric_mode, &numeric_keys);
968
+ int decimal_precision = parse_decimal_precision(options_hash);
829
969
 
830
970
  // quote_escaping and quote_boundary are only needed in Section 5 (quoted/slow path).
831
971
  // They are declared here as forward declarations so Section 5 can set them lazily.
@@ -990,6 +1130,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
990
1130
  .headers_len = headers_len,
991
1131
  .hash_capa = hash_size,
992
1132
  .numeric_mode = numeric_mode,
1133
+ .decimal_precision = decimal_precision,
993
1134
  .remove_empty_values = remove_empty_values,
994
1135
  .remove_zero_values = remove_zero_values,
995
1136
  };
@@ -1160,6 +1301,20 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
1160
1301
  p = next_quote; /* jump to quote char; fall through to quote-handling code */
1161
1302
  }
1162
1303
 
1304
+ /* Backslash mode: SIMD scan-ahead (NEON/SSE2, scalar fallback) to the next quote OR backslash (Opt #7).
1305
+ * The RFC memchr skip above only matters for one byte class; with escaping on
1306
+ * a backslash also changes state, so scan for both. Skipped bytes are plain
1307
+ * content (the byte-by-byte loop resets backslash_count to 0 on them), so reset
1308
+ * it here whenever we actually move p. */
1309
+ if (allow_escaped_quotes && in_quotes) {
1310
+ const char *hit = scan_quote_or_backslash(p, endP, quote_char_val);
1311
+ if (hit != p) {
1312
+ backslash_count = 0;
1313
+ p = (char *)hit;
1314
+ if (p == endP) continue; /* no quote/backslash before end → unclosed */
1315
+ }
1316
+ }
1317
+
1163
1318
  if (allow_escaped_quotes && *p == '\\') {
1164
1319
  // Count consecutive backslashes for escape sequence detection
1165
1320
  backslash_count++;
@@ -1354,6 +1509,7 @@ __attribute__((cold)) static VALUE rb_new_parse_context(VALUE self, VALUE header
1354
1509
 
1355
1510
  /* Numeric conversion */
1356
1511
  parse_numeric_option(options_hash, &ctx->numeric_mode, &ctx->numeric_keys);
1512
+ ctx->decimal_precision = parse_decimal_precision(options_hash);
1357
1513
 
1358
1514
  /* quote_escaping → allow_escaped_quotes */
1359
1515
  VALUE quote_escaping_val = rb_hash_aref(options_hash, ID2SYM(id_quote_escaping));
@@ -1474,6 +1630,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
1474
1630
  bool remove_empty_values = ctx->remove_empty_values;
1475
1631
  bool remove_zero_values = ctx->remove_zero_values;
1476
1632
  int numeric_mode = ctx->numeric_mode;
1633
+ int decimal_precision = ctx->decimal_precision;
1477
1634
  VALUE numeric_keys = ctx->numeric_keys;
1478
1635
  bool *keep_bitmap = ctx->keep_bitmap;
1479
1636
  /* keep_bitmap is cached in the context (xmalloc'd once at construction, sized to the header count
@@ -1525,6 +1682,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
1525
1682
  .headers_len = headers_len,
1526
1683
  .hash_capa = hash_size,
1527
1684
  .numeric_mode = numeric_mode,
1685
+ .decimal_precision = decimal_precision,
1528
1686
  .remove_empty_values = remove_empty_values,
1529
1687
  .remove_zero_values = remove_zero_values,
1530
1688
  };
@@ -1654,6 +1812,16 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
1654
1812
  p = next_quote; /* fall through to quote-handling code */
1655
1813
  }
1656
1814
 
1815
+ /* Backslash mode: SIMD scan-ahead (NEON/SSE2, scalar fallback) to the next quote OR backslash (Opt #7). */
1816
+ if (allow_escaped_quotes && in_quotes) {
1817
+ const char *hit = scan_quote_or_backslash(p, endP, quote_char_val);
1818
+ if (hit != p) {
1819
+ backslash_count = 0;
1820
+ p = (char *)hit;
1821
+ if (p == endP) continue; /* no quote/backslash before end → unclosed */
1822
+ }
1823
+ }
1824
+
1657
1825
  if (allow_escaped_quotes && *p == '\\') {
1658
1826
  backslash_count++;
1659
1827
  if (__builtin_expect(quote_boundary_standard, 1) && !in_quotes) field_started = true;
@@ -1878,6 +2046,10 @@ void Init_smarter_csv(void) {
1878
2046
  id_strict = rb_intern("strict");
1879
2047
  id_backslash = rb_intern("backslash");
1880
2048
  id_standard = rb_intern("standard");
2049
+ id_decimal_precision = rb_intern("decimal_precision");
2050
+ id_float = rb_intern("float");
2051
+ id_bigdecimal = rb_intern("bigdecimal");
2052
+ id_BigDecimal = rb_intern("BigDecimal"); /* Kernel#BigDecimal(); 'bigdecimal' is required in lib/smarter_csv.rb */
1881
2053
 
1882
2054
  rb_define_module_function(Parser, "parse_csv_line_c", rb_parse_csv_line, 9);
1883
2055
  rb_define_module_function(Parser, "count_quote_chars_c", rb_count_quote_chars, 4);
@@ -0,0 +1,27 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2021 The fast_float authors
4
+
5
+ Permission is hereby granted, free of charge, to any
6
+ person obtaining a copy of this software and associated
7
+ documentation files (the "Software"), to deal in the
8
+ Software without restriction, including without
9
+ limitation the rights to use, copy, modify, merge,
10
+ publish, distribute, sublicense, and/or sell copies of
11
+ the Software, and to permit persons to whom the Software
12
+ is furnished to do so, subject to the following
13
+ conditions:
14
+
15
+ The above copyright notice and this permission notice
16
+ shall be included in all copies or substantial portions
17
+ of the Software.
18
+
19
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF
20
+ ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
21
+ TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
22
+ PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
23
+ SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
24
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
25
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
26
+ IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
27
+ DEALINGS IN THE SOFTWARE.