smarter_csv 1.17.4 → 1.18.0

Files changed (17)

checksums.yaml +4 -4
data/CHANGELOG.md +41 -0
data/README.md +5 -0
data/docs/data_transformations.md +33 -0
data/docs/migrating_from_csv.md +18 -0
data/docs/options.md +2 -1
data/ext/smarter_csv/smarter_csv.c +204 -32
data/ext/smarter_csv/vendor/LICENSE-fast_float-MIT +27 -0
data/ext/smarter_csv/vendor/eisel_lemire.h +117 -0
data/ext/smarter_csv/vendor/eisel_lemire.md +29 -0
data/ext/smarter_csv/vendor/eisel_lemire_powers.h +663 -0
data/lib/smarter_csv/hash_transformations.rb +51 -2
data/lib/smarter_csv/reader_options.rb +24 -0
data/lib/smarter_csv/version.rb +1 -1
data/lib/smarter_csv.rb +1 -0
data/smarter_csv.gemspec +3 -0
metadata +22 -4

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 140f9359c26f8903b9075faeb59e9c1fc4b5c4b9dd5fcef664e12bb53fe13073
-  data.tar.gz: 0e7be3195610bcddb77870744a24f0eee9431643c4c28fcbea12d4b8663bb2db
+  metadata.gz: 3335e39a1c0792f01df9e95401c7f3885c49a0d64eeb9c76e5c20e25d01a62f5
+  data.tar.gz: e43f00228777b56fc1ee0814a74acaa6a23c51fe8da6f64e42ad92fe1b54002f
 SHA512:
-  metadata.gz: 176daa024372ade1d6431e5e4fd5355175cd1ebf03e7a216d70b8ff4554eb259a023969783b53118eba6c3fbf747a08e286ed7daad455ebc5cf8d7b61475d3d1
-  data.tar.gz: 8b1ca7263e5a54fc642c8f76bfd56577f6056784910acdcdcd0f18c6222c8793280016e749f2e08a61485bf8255cf341250ac1250a8d3ff424f8b07b1edbd51b
+  metadata.gz: 2abcd136f30d284c3c27cbd2b6c9782aec4235ec62cc27ea7620380ae9efc889f9f1c05c10ad957d6e7c84d65be75d35d96d4ac59ba5780dc5e65e0151c661e6
+  data.tar.gz: 6c61062c08d0a89dea2c91a7faafd88f6d09c88015ad8c7f4facb2eb474b44b18e0c5eaffc36fa5e2885f0b2c02487497f2460d66ce2433f62c09bf42d92a4ce

data/CHANGELOG.md CHANGED

@@ -4,6 +4,47 @@
 > [!TIP]
 > **Upgrading?** The [SmarterCSV Upgrade Wizard](https://tilo.github.io/smarter_csv/upgrade_wizard.html) walks you through what (if anything) you need to change for your specific version. Most steps do not require any changes.
+## 1.18.0 (2026-06-17)
+This release focuses on performance and on automatic conversion of decimals to `Float` or `BigDecimal` without losing precision, including values written in scientific notation.
+⚠️ This version is particularly interesting if you have geolocation, scientific, or high-precision data.
+### New Features
+  - **`decimal_precision` option** (`:auto` default, or `:float` / `:bigdecimal`) — controls how decimal values are converted. `:auto` returns a `Float` unless the value carries more than 16 significant digits, in which case it returns a `BigDecimal` so no precision is lost; `:float` always returns `Float`; `:bigdecimal` always returns `BigDecimal`. Integers are unaffected (always `Integer`). Works identically on the C and Ruby paths. (Ruby's standard-library CSV has no high-precision option — its `:numeric`/`:float` converters use `Float()` and lose precision.)
+  - **Float** conversion on the C path now uses the fast **Eisel-Lemire** algorithm (fast_float, vendored) for mantissas up to 19 significant digits — correctly rounded, bit-for-bit identical to `String#to_f` — with a `strtod` fallback beyond that (more than 19 digits / extreme exponents). High-precision values that become `BigDecimal` under `:auto`/`:bigdecimal` are parsed by Ruby's `BigDecimal`.
+### Behavior Changes
+  - **Scientific notation now converts to a number** (e.g. `"1e3"`, `"1.5e-5"`, `"6.022e23"`). Previously the Ruby path left these as Strings and the C path was inconsistent.
+  - **The C and Ruby numeric-conversion paths are now aligned.** Bare-dot forms like `".5"` and `"3."` stay Strings on **both** paths (the shared grammar requires an integer part and, when a dot is present, a fraction digit). Previously the C path converted these and the Ruby path did not.
+  - With the default `decimal_precision: :auto`, decimal values carrying more than 16 significant digits are now returned as `BigDecimal` instead of `Float`. Pass `decimal_precision: :float` to keep the previous always-`Float` behavior.
+  - `bigdecimal` is now a runtime dependency (it is no longer a default gem on Ruby 3.4+).
+### Performance
+  The C-accelerated path is faster across the board, **up to ~1.5× on the right shapes** — numeric-heavy data and backslash-escaped quoted fields — and ~1.04–1.08× on typical files.
+  - Eisel-Lemire (Mushtak-Lemire) algorithm on the C path to convert decimals to `Float` (values that become `BigDecimal` are parsed by Ruby's `BigDecimal`). Numeric-heavy data (many float/decimal columns) parses significantly faster.
+  - SIMD scanner for backslash-escaped quoted fields (C-path), using NEON (arm64) and SSE2 (x86-64) with a scalar fallback. Speeds up `quote_escaping: :backslash` parsing of long quoted fields.
+  | File                       | C-path                           | Driver                          |
+  |----------------------------|----------------------------------|---------------------------------|
+  | backslash_long_fields_60k  | 1.48× faster (0.1880s → 0.1273s) | SIMD quote/backslash scanner    |
+  | sensor_data_50krows_50cols | 1.40× faster (0.2763s → 0.1975s) | Eisel-Lemire numeric conversion |
+### Improvements
+  - Improved robustness of symbol-valued enum option processing.
+### Tests
+  - Added parity tests for long quoted-field scanning across 16-byte boundaries, running on both the C and Ruby paths.
+  - Added tests for string-to-symbol coercion of the enum options.
 ## 1.17.4 (2026-06-03)
 ### Bug Fix
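
The behavior changes in the 1.18.0 entry above hinge on one shared grammar. The C extension's comments in this diff quote it as `NUMERIC_REGEX`; the sketch below reuses that pattern to show which tokens would convert and which would stay Strings. The `numeric_token?` helper is hypothetical and exists only for illustration; it is not part of the gem's API.

```ruby
# Pattern as quoted in the comments of ext/smarter_csv/smarter_csv.c:
# optional sign, required integer part, optional ".digits", optional exponent.
NUMERIC_REGEX = /\A[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?\z/

# Hypothetical helper (not the gem's API): would this field be seen as a number?
def numeric_token?(field)
  NUMERIC_REGEX.match?(field)
end

# Scientific notation matches; bare-dot forms and dangling exponents do not.
%w[42 -3.14 1e3 1.5e-5 6.022e23 .5 3. 1e 1.2.3].each do |token|
  verdict = numeric_token?(token) ? "number" : "stays a String"
  puts format("%-10s %s", token, verdict)
end
```

Because both paths are described as enforcing this same grammar, a token such as `".5"` that the C path used to convert is now left untouched on both.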

data/README.md CHANGED

@@ -15,6 +15,9 @@
   > See [**Ruby CSV Pitfalls**](docs/ruby_csv_pitfalls.md) for 10 ways `CSV.read` silently corrupts or loses data, and how SmarterCSV handles them.
+  > [!TIP]
+  > **No silent precision loss (new in 1.18.0).** For scientific data, GPS/geo coordinates, and financial figures — which routinely carry 16+ significant digits — Ruby's standard CSV converts with `Float()`, so a value like `1234567890.123456789` is silently rounded to `1234567890.1234567`. SmarterCSV's default `decimal_precision: :auto` returns a `BigDecimal` for values beyond 16 significant digits (and `Float` otherwise) — full precision, no data loss. Floats are decoded with the Eisel-Lemire algorithm: correctly rounded, bit-for-bit identical to `String#to_f`.
   Beyond raw speed, SmarterCSV is designed to provide a significantly more convenient and developer-friendly interface than traditional CSV libraries. Instead of returning raw arrays that require substantial post-processing, SmarterCSV produces Rails-ready hashes for each row, making the data immediately usable with ActiveRecord, Sidekiq pipelines, parallel processing, and JSON-based workflows such as S3.
   In a Rails app, warnings auto-route through `Rails.logger` and instrumentation hooks compose with `ActiveSupport::Notifications` — no setup required. Outside Rails, warnings fall back to `$stderr` and the same APIs work without any framework dependency.
@@ -89,6 +92,8 @@ rows = SmarterCSV.process('data.csv')
 data = SmarterCSV.parse(csv_string)
 ```
+Numeric conversion is also more accurate: where Ruby's `:numeric`/`:float` converters round high-precision decimals through `Float()`, SmarterCSV's default `decimal_precision: :auto` returns a `BigDecimal` past 16 significant digits, so no precision is lost (pass `decimal_precision: :float` for like-for-like `Float` output).
 * See [**Migrating from Ruby CSV**](docs/migrating_from_csv.md) for a full comparison of options, behavior differences, and a quick-reference table.
 ## Examples
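
The precision claim in the README hunks above can be checked without the gem. This is a minimal sketch using only Ruby's `Float()` and the `bigdecimal` library, with the sample value taken from the README text; it illustrates the underlying number types, not SmarterCSV's own code path.

```ruby
require "bigdecimal"

raw = "1234567890.123456789"   # 19 significant digits, as in the README example

as_float   = Float(raw)        # binary double, roughly 15 to 17 significant decimal digits
as_decimal = BigDecimal(raw)   # arbitrary-precision decimal

puts as_float                  # shortest string that round-trips the double
puts as_decimal.to_s("F")      # plain notation, every input digit retained

# Comparing both through BigDecimal shows whether the Float dropped digits.
lossy = BigDecimal(as_float.to_s) != as_decimal
puts "Float conversion lost digits: #{lossy}"
```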

data/docs/data_transformations.md CHANGED

@@ -156,6 +156,39 @@ data = SmarterCSV.process(file,
   convert_values_to_numeric: { only: [:quantity, :price] })
 ```
+Scientific notation (e.g. `"1.5e3"`, `"6.022e23"`) is recognized and converted too. Bare-dot forms like `".5"` and `"3."` are left as Strings (they are not valid numbers here). Integers and floats convert identically on the C-accelerated and pure-Ruby paths.
+---
+## `decimal_precision`
+**Default: `:auto`**
+Controls how decimal values (those with a `.` or an exponent) are converted. Integers are unaffected — they are always returned as `Integer`.
+| Value         | Result                                                                                  |
+|---------------|-----------------------------------------------------------------------------------------|
+| `:auto`       | `Float`, unless the value carries more than 16 significant digits — then `BigDecimal`.   |
+| `:float`      | Always `Float` (correctly rounded; matches `String#to_f`).                               |
+| `:bigdecimal` | Always `BigDecimal` (full precision).                                                    |
+```ruby
+# :auto (default) — keeps full precision only when needed
+SmarterCSV.process(file)
+# "3.14"                 => 3.14                              (Float)
+# "1234567890.123456789" => 0.1234567890123456789e10          (BigDecimal — >16 sig digits)
+# :float — always Float (faster, may lose precision on long decimals)
+SmarterCSV.process(file, decimal_precision: :float)
+# "1234567890.123456789" => 1234567890.1234567               (Float)
+# :bigdecimal — always BigDecimal
+SmarterCSV.process(file, decimal_precision: :bigdecimal)
+# "3.14" => 0.314e1 (BigDecimal)
+```
+Unlike Ruby's standard-library CSV — whose `:numeric`/`:float` converters use `Float()` and silently lose precision — `:auto` preserves high-precision decimals as `BigDecimal`. Decimal values are decoded on the C path with the Eisel-Lemire algorithm (correctly rounded, identical to `String#to_f`).
 ---
 ## `remove_empty_hashes`
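
The `decimal_precision` table above amounts to a small decision rule. The following is a sketch of that rule in plain Ruby, assuming the grammar quoted in the C source and a significant-digit count that skips leading zeros. `convert_numeric`, `significant_digits` and `NUMERIC_TOKEN` are hypothetical names chosen for this illustration; the gem's real implementation lives in `hash_transformations.rb` and the C extension and may differ in detail.

```ruby
require "bigdecimal"

NUMERIC_TOKEN = /\A[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?\z/

# Count mantissa digits from the first nonzero digit onward (leading zeros
# excluded, trailing zeros included). Exponent digits are ignored.
def significant_digits(token)
  mantissa = token.sub(/[eE].*\z/, "").delete("+-.")
  mantissa.sub(/\A0+/, "").length
end

# Hypothetical stand-in for the conversion step, not the gem's API.
def convert_numeric(token, decimal_precision: :auto)
  return token unless NUMERIC_TOKEN.match?(token)          # not numeric: keep the String
  return Integer(token, 10) unless token.match?(/[.eE]/)   # integers are unaffected

  case decimal_precision
  when :float      then Float(token)
  when :bigdecimal then BigDecimal(token)
  else significant_digits(token) > 16 ? BigDecimal(token) : Float(token)
  end
end

["42", "3.14", "1234567890.123456789", ".5"].each do |token|
  puts "#{token.inspect} => #{convert_numeric(token).class}"
end
```

Passing `decimal_precision: :float` to the sketch skips the digit count entirely, which mirrors the documented trade-off of speed against precision on long decimals.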

data/docs/migrating_from_csv.md CHANGED

@@ -223,6 +223,24 @@ rows = SmarterCSV.process('sample.csv',
   convert_values_to_numeric: { except: [:zip_code, :phone, :account_number] })
 ```
+**High-precision decimals — scientific data and geo coordinates.** GPS/geo coordinates, scientific measurements, and financial figures routinely carry 16+ significant digits, where Ruby's `Float()`-based conversion (`converters: :numeric` / `:float`) silently rounds the value. SmarterCSV's default `decimal_precision: :auto` returns a `BigDecimal` once a value exceeds 16 significant digits (and a `Float` otherwise), so the full value is preserved; scientific notation (`6.022e23`, `1.6e-19`) is recognized as numeric too.
+**With Ruby CSV (precision lost):**
+```ruby
+CSV.read('locations.csv', headers: true, converters: :float).first['lat']
+# => -122.42200352825247   ← Float() dropped the last digits of -122.422003528252475
+```
+**With SmarterCSV (full precision kept):**
+```ruby
+SmarterCSV.process('locations.csv').first[:lat]
+# => -0.122422003528252475e3   (BigDecimal — all 18 significant digits preserved)
+# Force Float everywhere, like-for-like with Ruby CSV:
+SmarterCSV.process('locations.csv', decimal_precision: :float).first[:lat]
+# => -122.42200352825247   (Float)
+```
 ### 3. Empty values are removed by default
 SmarterCSV drops key/value pairs where the value is `nil` or blank

data/docs/options.md CHANGED

@@ -121,7 +121,8 @@ See [Parsing Strategy](./parsing_strategy.md) for full details on quote handling
 | Option | Default | Explanation |
 |--------|---------|-------------|
 | `:strip_whitespace` | `true` | Remove whitespace before/after values and headers. |
-| `:convert_values_to_numeric` | `true` | Convert strings containing integers or floats to the appropriate numeric type. Accepts `{except: [:key1, :key2]}` or `{only: :key3}` to limit which columns. |
+| `:convert_values_to_numeric` | `true` | Convert strings containing integers or floats (including scientific notation like `1.5e3`) to the appropriate numeric type. Accepts `{except: [:key1, :key2]}` or `{only: :key3}` to limit which columns. |
+| `:decimal_precision` | `:auto` | How decimals are converted: `:auto` returns `Float` but `BigDecimal` above 16 significant digits (no precision loss); `:float` always returns `Float`; `:bigdecimal` always returns `BigDecimal`. Integers are unaffected. |
 | `:value_converters` | `nil` | Hash of `:header => converter`; converter can be a lambda/Proc or a class implementing `self.convert(value)`. See [Value Converters](./value_converters.md). |
 | `:remove_empty_values` | `true` | Remove key/value pairs where the value is `nil`, empty, or whitespace-only — any Unicode whitespace, same as Ruby's `String#blank?`. |
 | `:remove_zero_values` | `false` | Remove key/value pairs whose value is zero — numeric `0` / `0.0`, or any textual form of zero (`"0"`, `"0.0"`, `"00.00"`, `"+0"`, `"-0.0"`, …). |

data/ext/smarter_csv/smarter_csv.c CHANGED

@@ -7,6 +7,14 @@
 #include <stdlib.h>
 #include <errno.h>
+#ifdef __ARM_NEON
+  #include <arm_neon.h>
+#elif defined(__SSE2__)
+  #include <immintrin.h>
+#endif
+#include "vendor/eisel_lemire.h" /* Eisel-Lemire decimal->double, correctly rounded (fast_float) */
 #ifndef bool
   #define bool int
   #define false ((bool)0)
@@ -41,6 +49,8 @@ static ID id_only, id_except, id_quote_boundary;
 static ID id_only_headers, id_except_headers, id_keep_cols, id_strict;
 static ID id_keep_bitmap, id_keep_extra_cols, id_early_exit_after_sym;
 static ID id_backslash, id_standard;
+static ID id_decimal_precision, id_float, id_bigdecimal;
+static ID id_BigDecimal; /* the Kernel#BigDecimal() method (require 'bigdecimal' done in Ruby) */
 /* ================================================================================
  * ParseContext — wraps all per-file parse options as a GC-managed TypedData object.
@@ -70,6 +80,9 @@ typedef struct {
   /* Numeric conversion: 0=off, 1=all, 2=only listed keys, 3=except listed keys */
   int  numeric_mode;
+  /* Decimal handling: 0=float, 1=auto (BigDecimal above 16 sig digits), 2=bigdecimal */
+  int  decimal_precision;
   /* Column filter bitmap (xmalloc'd; NULL when no filtering active) */
   bool *keep_bitmap;
   long  keep_bitmap_len;
@@ -133,6 +146,51 @@ static const rb_data_type_t parse_context_type = {
   RUBY_TYPED_FREE_IMMEDIATELY | RUBY_TYPED_WB_PROTECTED
 };
+/* Scan [p, end) for the first `quote` char or backslash; returns a pointer to it,
+ * or `end` if neither occurs. NEON (arm64) or SSE2 (x86-64) processes 16 bytes per
+ * iteration; scalar fallback elsewhere. Ported from smarter_json's fj_scan_str.
+ *
+ * Used by the quoted-field slow path in :backslash escaping mode, where the only bytes
+ * that can change parser state inside a quoted field are the quote char (closing /
+ * doubled) and the backslash (escape). Bulk-skipping the plain content between them
+ * keeps the byte-by-byte state machine's behavior but avoids stepping every byte.
+ * In RFC mode the slow path uses a plain memchr-to-quote instead (only one byte class
+ * matters there), so this two-class scan is reserved for backslash mode. */
+static inline const char *scan_quote_or_backslash(const char *p, const char *end, char quote) {
+#ifdef __ARM_NEON
+  const uint8x16_t vq  = vdupq_n_u8((uint8_t)quote);
+  const uint8x16_t vbs = vdupq_n_u8((uint8_t)'\\');
+  while (p + 16 <= end) {
+    uint8x16_t chunk = vld1q_u8((const uint8_t *)p);
+    uint8x16_t m     = vorrq_u8(vceqq_u8(chunk, vq), vceqq_u8(chunk, vbs));
+    /* movemask emulation (Oj's technique): pack to 4 bits/byte, then ctz/4. */
+    uint8x8_t  res   = vshrn_n_u16(vreinterpretq_u16_u8(m), 4);
+    uint64_t   mask  = vget_lane_u64(vreinterpret_u64_u8(res), 0);
+    if (__builtin_expect(mask != 0, 0)) {  /* most 16-byte chunks contain neither */
+      mask &= 0x8888888888888888ull;
+      return p + (__builtin_ctzll(mask) >> 2);
+    }
+    p += 16;
+  }
+#elif defined(__SSE2__)
+  const __m128i vq  = _mm_set1_epi8(quote);
+  const __m128i vbs = _mm_set1_epi8('\\');
+  while (p + 16 <= end) {
+    __m128i chunk = _mm_loadu_si128((const __m128i *)p);
+    __m128i m     = _mm_or_si128(_mm_cmpeq_epi8(chunk, vq), _mm_cmpeq_epi8(chunk, vbs));
+    int     mask  = _mm_movemask_epi8(m);  /* one bit per byte that matched */
+    if (__builtin_expect(mask != 0, 0)) {  /* most 16-byte chunks contain neither */
+      return p + __builtin_ctz(mask);
+    }
+    p += 16;
+  }
+#endif
+  for (; p < end; p++) {
+    if (*p == quote || *p == '\\') return p;
+  }
+  return end;
+}
 static VALUE unescape_quotes(char *str, long len, char quote_char, rb_encoding *encoding) {
   // Fast path: scan for any doubled quote pair. If none present, the field has
   // nothing to unescape — emit it directly via rb_enc_str_new and skip the
@@ -386,6 +444,20 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
       backslash_count = 0;
       field_started = false;  // reset for next field
     } else {
+      /* Backslash mode: SIMD scan-ahead (NEON/SSE2, scalar fallback) to the next quote OR backslash (Opt #7).
+       * Inside a quoted field the only state-changing bytes are the quote char and the
+       * backslash; bulk-skip the plain content between them. Skipped bytes are plain
+       * content, which the byte-by-byte loop resets backslash_count to 0 on, so reset
+       * it here whenever we actually move p. */
+      if (allow_escaped_quotes && in_quotes) {
+        const char *hit = scan_quote_or_backslash(p, endP, quote_char_val);
+        if (hit != p) {
+          backslash_count = 0;
+          p = (char *)hit;
+          if (p == endP) continue;  /* no quote/backslash before end → unclosed */
+        }
+      }
       if (allow_escaped_quotes && *p == '\\') {
         backslash_count++;
         if (__builtin_expect(quote_boundary_standard, 1) && !in_quotes) field_started = true;
@@ -525,47 +597,101 @@ static inline VALUE get_key_for_index(long index, VALUE headers, long headers_le
  * Handles overflow: if strtol overflows (ERANGE), falls back to rb_cstr_to_inum
  * which produces a Ruby Bignum.
  */
-static inline VALUE try_numeric_conversion(char *trim_start, long trimmed_len) {
-  // Quick pre-check: first char must be digit, +, -, or .
-  char first = trim_start[0];
-  if (!((first >= '0' && first <= '9') || first == '+' || first == '-' || first == '.')) {
+static inline VALUE try_numeric_conversion(char *s, long n, int decimal_precision) {
+  // Quick pre-check: first char must be a digit or a sign.
+  char first = s[0];
+  if (!((first >= '0' && first <= '9') || first == '+' || first == '-')) {
     return Qundef;
   }
-  // Need null-terminated string for strtol/strtod; use stack buffer for typical fields
-  if (trimmed_len >= 63) return Qundef;  // very long fields are unlikely to be simple numbers
-  char num_buf[64];
-  memcpy(num_buf, trim_start, trimmed_len);
-  num_buf[trimmed_len] = '\0';
-  char *endptr;
-  // Try integer first (most common numeric type in CSV)
-  // Don't try integer if field starts with '.' (e.g., ".5")
-  if (first != '.') {
-    errno = 0;
-    long int_val = strtol(num_buf, &endptr, 10);
-    if (endptr == num_buf + trimmed_len) {
-      // Entire string was consumed → valid integer
-      if (errno == ERANGE) {
-        // Overflow: fall back to Ruby Bignum
-        return rb_cstr_to_inum(num_buf, 10, false);
+  /* Single pass: validate the token against the same grammar as the Ruby path's
+   * NUMERIC_REGEX = /\A[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?\z/ and, in the same pass,
+   * extract everything the fast paths need:
+   *   - mantissa value m10 (exact for <= 18 digits; `overflow` flags beyond)
+   *   - significant-digit count `sig` (leading zeros excluded; matches the Ruby
+   *     significant_digits helper / Oj dec_cnt) — drives the :auto Float/BigDecimal split
+   *   - base-10 exponent e10 (from the fraction length and any explicit exponent)
+   * Anything the grammar rejects returns Qundef (stays a String), keeping the C and
+   * Ruby paths byte-identical on what does and does not convert. */
+  long i = 0;
+  int neg = 0;
+  if (s[i] == '+' || s[i] == '-') { neg = (s[i] == '-'); i++; }
+  uint64_t m10 = 0;
+  int  m10digits = 0;       /* mantissa digits accumulated into m10 (capped at 19) */
+  long sig = 0;             /* significant digits (leading zeros excluded) */
+  int  sig_started = 0;
+  bool overflow = false;
+  long int_digits = 0, frac_digits = 0;
+  bool seen_dot = false, seen_exp = false, any_digit = false, exp_any = false;
+  int64_t exp_val = 0; int exp_neg = 0;
+  for (; i < n; i++) {
+    char c = s[i];
+    if (c >= '0' && c <= '9') {
+      any_digit = true;
+      if (!seen_exp) {
+        if (seen_dot) frac_digits++; else int_digits++;
+        if (sig_started) sig++;
+        else if (c != '0') { sig_started = 1; sig = 1; }
+        if (m10digits < 19) { m10 = m10 * 10 + (uint64_t)(c - '0'); m10digits++; }
+        else overflow = true;
+      } else {
+        exp_any = true;
+        exp_val = exp_val * 10 + (c - '0');
+        if (exp_val > 1000000) overflow = true; /* extreme exponent → strtod fallback */
       }
-      return LONG2NUM(int_val);
+    } else if (c == '.' && !seen_dot && !seen_exp) {
+      seen_dot = true;
+    } else if ((c == 'e' || c == 'E') && !seen_exp && any_digit) {
+      seen_exp = true;
+      if (i + 1 < n && (s[i + 1] == '+' || s[i + 1] == '-')) { exp_neg = (s[i + 1] == '-'); i++; }
+    } else {
+      return Qundef; /* invalid char for a number → not numeric */
     }
   }
-  // Try float (only if contains '.')
-  if (memchr(num_buf, '.', trimmed_len)) {
-    errno = 0;
-    double float_val = strtod(num_buf, &endptr);
-    if (endptr == num_buf + trimmed_len && errno != ERANGE) {
-      return DBL2NUM(float_val);
+  /* Enforce NUMERIC_REGEX exactly: an integer part is required; a dot requires a
+   * fraction digit; an exponent requires an exponent digit. */
+  if (int_digits == 0) return Qundef;
+  if (seen_dot && frac_digits == 0) return Qundef;
+  if (seen_exp && !exp_any) return Qundef;
+  bool is_decimal = seen_dot || seen_exp;
+  if (!is_decimal) {
+    /* Integer. Fast path when it fits in a long; otherwise a Ruby Integer/Bignum. */
+    if (!overflow && m10digits <= 18) {
+      long v = (long)m10;
+      return LONG2NUM(neg ? -v : v);
     }
+    VALUE str = rb_str_new(s, n);
+    return rb_cstr_to_inum(RSTRING_PTR(str), 10, false);
   }
-  return Qundef;  // not numeric
+  /* Decimal (has a '.' or an exponent) — honor decimal_precision. 0=float, 1=auto, 2=bigdecimal */
+  if (decimal_precision == 2 || (decimal_precision == 1 && sig > 16)) {
+    VALUE str = rb_str_new(s, n);
+    return rb_funcall(rb_cObject, id_BigDecimal, 1, str);
+  }
+  /* Float. base-10 exponent = explicit exponent minus the fraction length. */
+  int64_t e10 = (exp_neg ? -exp_val : exp_val) - (int64_t)frac_digits;
+  double d;
+  if (!overflow && m10digits >= 1 && m10digits <= 19 && ((long)m10digits + e10) >= -307) {
+    /* Eisel-Lemire is correctly-rounded for any nonzero mantissa that fits exactly in a
+     * uint64 — i.e. up to 19 significant digits (the max 19-digit value ~1.0e19 is below
+     * UINT64_MAX ~1.8e19). Verified bit-for-bit vs the stdlib over 1..19-digit ties. */
+    d = (m10 == 0) ? (neg ? -0.0 : 0.0) : fj_eisel_lemire_s2d(e10, m10, neg);
+  } else {
+    /* >19 digits / extreme or subnormal exponent: fall back to Ruby's own correctly-rounded
+     * strtod (rb_cstr_to_dbl) — the exact conversion String#to_f uses — so the C path and the
+     * Ruby path produce the identical double on every platform, not just where the system
+     * strtod happens to be correctly rounded. The token is pre-validated, so badcheck=0. */
+    VALUE str = rb_str_new(s, n);
+    d = rb_cstr_to_dbl(RSTRING_PTR(str), 0);
+  }
+  return DBL2NUM(d);
 }
 /*
@@ -614,6 +740,7 @@ typedef struct {
   long headers_len;
   long hash_capa;           // Pre-computed capacity for lazy hash allocation
   int numeric_mode;         // 0=off, 1=all, 2=only, 3=except
+  int decimal_precision;    // 0=float, 1=auto (BigDecimal above 16 sig digits), 2=bigdecimal
   bool remove_empty_values;
   bool remove_zero_values;
 } field_transform_opts;
@@ -705,7 +832,7 @@ static inline __attribute__((always_inline)) bool insert_field_into_hash(
                       (opts->numeric_mode == 2 && rb_ary_includes(opts->numeric_keys, key) == Qtrue) ||
                       (opts->numeric_mode == 3 && rb_ary_includes(opts->numeric_keys, key) != Qtrue);
     if (do_convert) {
-      VALUE numeric = try_numeric_conversion(trim_start, trimmed_len);
+      VALUE numeric = try_numeric_conversion(trim_start, trimmed_len, opts->decimal_precision);
       if (numeric != Qundef) {
         ensure_hash_allocated(opts);
         rb_hash_aset(opts->hash, key, numeric);
@@ -752,6 +879,18 @@ void parse_numeric_option(VALUE options_hash, int *out_mode, VALUE *out_keys) {
   }
 }
+/* Read decimal_precision into 0=float, 1=auto, 2=bigdecimal. Default :auto (1).
+ * The option is validated and coerced to a symbol on the Ruby side before we get here. */
+static inline int parse_decimal_precision(VALUE options_hash) {
+  VALUE v = rb_hash_aref(options_hash, ID2SYM(id_decimal_precision));
+  if (RB_TYPE_P(v, T_SYMBOL)) {
+    ID s = SYM2ID(v);
+    if (s == id_float) return 0;
+    if (s == id_bigdecimal) return 2;
+  }
+  return 1; // :auto (also the default when unset)
+}
 /*
  * ================================================================================
  * rb_parse_line_to_hash - Parse CSV line directly into a Ruby Hash
@@ -826,6 +965,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
   int numeric_mode = 0;
   VALUE numeric_keys = Qnil;
   parse_numeric_option(options_hash, &numeric_mode, &numeric_keys);
+  int decimal_precision = parse_decimal_precision(options_hash);
   // quote_escaping and quote_boundary are only needed in Section 5 (quoted/slow path).
   // They are declared here as forward declarations so Section 5 can set them lazily.
@@ -990,6 +1130,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
     .headers_len = headers_len,
     .hash_capa = hash_size,
     .numeric_mode = numeric_mode,
+    .decimal_precision = decimal_precision,
     .remove_empty_values = remove_empty_values,
     .remove_zero_values = remove_zero_values,
   };
@@ -1160,6 +1301,20 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
           p = next_quote;  /* jump to quote char; fall through to quote-handling code */
         }
+        /* Backslash mode: SIMD scan-ahead (NEON/SSE2, scalar fallback) to the next quote OR backslash (Opt #7).
+         * The RFC memchr skip above only matters for one byte class; with escaping on
+         * a backslash also changes state, so scan for both. Skipped bytes are plain
+         * content (the byte-by-byte loop resets backslash_count to 0 on them), so reset
+         * it here whenever we actually move p. */
+        if (allow_escaped_quotes && in_quotes) {
+          const char *hit = scan_quote_or_backslash(p, endP, quote_char_val);
+          if (hit != p) {
+            backslash_count = 0;
+            p = (char *)hit;
+            if (p == endP) continue;  /* no quote/backslash before end → unclosed */
+          }
+        }
         if (allow_escaped_quotes && *p == '\\') {
           // Count consecutive backslashes for escape sequence detection
           backslash_count++;
@@ -1354,6 +1509,7 @@ __attribute__((cold)) static VALUE rb_new_parse_context(VALUE self, VALUE header
   /* Numeric conversion */
   parse_numeric_option(options_hash, &ctx->numeric_mode, &ctx->numeric_keys);
+  ctx->decimal_precision = parse_decimal_precision(options_hash);
   /* quote_escaping → allow_escaped_quotes */
   VALUE quote_escaping_val = rb_hash_aref(options_hash, ID2SYM(id_quote_escaping));
@@ -1474,6 +1630,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
   bool remove_empty_values = ctx->remove_empty_values;
   bool remove_zero_values  = ctx->remove_zero_values;
   int  numeric_mode        = ctx->numeric_mode;
+  int  decimal_precision   = ctx->decimal_precision;
   VALUE numeric_keys       = ctx->numeric_keys;
   bool *keep_bitmap         = ctx->keep_bitmap;
   /* keep_bitmap is cached in the context (xmalloc'd once at construction, sized to the header count
@@ -1525,6 +1682,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
     .headers_len       = headers_len,
     .hash_capa         = hash_size,
     .numeric_mode      = numeric_mode,
+    .decimal_precision = decimal_precision,
     .remove_empty_values = remove_empty_values,
     .remove_zero_values  = remove_zero_values,
   };
@@ -1654,6 +1812,16 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
           p = next_quote;  /* fall through to quote-handling code */
         }
+        /* Backslash mode: SIMD scan-ahead (NEON/SSE2, scalar fallback) to the next quote OR backslash (Opt #7). */
+        if (allow_escaped_quotes && in_quotes) {
+          const char *hit = scan_quote_or_backslash(p, endP, quote_char_val);
+          if (hit != p) {
+            backslash_count = 0;
+            p = (char *)hit;
+            if (p == endP) continue;  /* no quote/backslash before end → unclosed */
+          }
+        }
         if (allow_escaped_quotes && *p == '\\') {
           backslash_count++;
           if (__builtin_expect(quote_boundary_standard, 1) && !in_quotes) field_started = true;
@@ -1878,6 +2046,10 @@ void Init_smarter_csv(void) {
   id_strict             = rb_intern("strict");
   id_backslash      = rb_intern("backslash");
   id_standard       = rb_intern("standard");
+  id_decimal_precision = rb_intern("decimal_precision");
+  id_float          = rb_intern("float");
+  id_bigdecimal     = rb_intern("bigdecimal");
+  id_BigDecimal     = rb_intern("BigDecimal"); /* Kernel#BigDecimal(); 'bigdecimal' is required in lib/smarter_csv.rb */
   rb_define_module_function(Parser, "parse_csv_line_c", rb_parse_csv_line, 9);
   rb_define_module_function(Parser, "count_quote_chars_c", rb_count_quote_chars, 4);
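
In the new `try_numeric_conversion`, the Float branch hands Eisel-Lemire an integer mantissa `m10` and a base-10 exponent `e10`, where `e10` is the explicit exponent minus the number of fraction digits. The Ruby sketch below reproduces only that decomposition so the arithmetic can be checked by hand. `decompose` is a hypothetical helper, and because Ruby integers are arbitrary precision it omits the 19-digit cap and the `overflow` flag that the C code needs.

```ruby
# Split a validated decimal token into sign, integer mantissa (m10) and
# base-10 exponent (e10), so that value == sign * m10 * 10**e10.
# Hypothetical helper for illustration; not part of the gem.
def decompose(token)
  match = /\A([+-]?)(\d+)(?:\.(\d+))?(?:[eE]([+-]?\d+))?\z/.match(token)
  raise ArgumentError, "not a numeric token: #{token.inspect}" unless match

  sign, int_part, frac_part, exp_part = match.captures
  frac_part ||= ""
  m10 = (int_part + frac_part).to_i          # all mantissa digits as one integer
  e10 = exp_part.to_i - frac_part.length     # nil.to_i is 0 when no exponent is given
  [sign == "-" ? -1 : 1, m10, e10]
end

sign, m10, e10 = decompose("1.5e-5")
exact = sign * m10 * Rational(10)**e10       # exact rational value of the token
puts "m10=#{m10} e10=#{e10} exact=#{exact}"
```

For `"1.5e-5"` the mantissa digits are `15` and one fraction digit shifts the exponent from -5 to -6, which is the pair the C code would pass on.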

data/ext/smarter_csv/vendor/LICENSE-fast_float-MIT ADDED

@@ -0,0 +1,27 @@
+MIT License
+Copyright (c) 2021 The fast_float authors
+Permission is hereby granted, free of charge, to any
+person obtaining a copy of this software and associated
+documentation files (the "Software"), to deal in the
+Software without restriction, including without
+limitation the rights to use, copy, modify, merge,
+publish, distribute, sublicense, and/or sell copies of
+the Software, and to permit persons to whom the Software
+is furnished to do so, subject to the following
+conditions:
+The above copyright notice and this permission notice
+shall be included in all copies or substantial portions
+of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF
+ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
+TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
+PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
+SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
+IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.