smarter_csv 1.17.4 → 1.18.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +41 -0
- data/README.md +5 -0
- data/docs/data_transformations.md +33 -0
- data/docs/migrating_from_csv.md +18 -0
- data/docs/options.md +2 -1
- data/ext/smarter_csv/smarter_csv.c +204 -32
- data/ext/smarter_csv/vendor/LICENSE-fast_float-MIT +27 -0
- data/ext/smarter_csv/vendor/eisel_lemire.h +117 -0
- data/ext/smarter_csv/vendor/eisel_lemire.md +29 -0
- data/ext/smarter_csv/vendor/eisel_lemire_powers.h +663 -0
- data/lib/smarter_csv/hash_transformations.rb +51 -2
- data/lib/smarter_csv/reader_options.rb +24 -0
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +1 -0
- data/smarter_csv.gemspec +3 -0
- metadata +22 -4
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 3335e39a1c0792f01df9e95401c7f3885c49a0d64eeb9c76e5c20e25d01a62f5
|
|
4
|
+
data.tar.gz: e43f00228777b56fc1ee0814a74acaa6a23c51fe8da6f64e42ad92fe1b54002f
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 2abcd136f30d284c3c27cbd2b6c9782aec4235ec62cc27ea7620380ae9efc889f9f1c05c10ad957d6e7c84d65be75d35d96d4ac59ba5780dc5e65e0151c661e6
|
|
7
|
+
data.tar.gz: 6c61062c08d0a89dea2c91a7faafd88f6d09c88015ad8c7f4facb2eb474b44b18e0c5eaffc36fa5e2885f0b2c02487497f2460d66ce2433f62c09bf42d92a4ce
|
data/CHANGELOG.md
CHANGED
|
@@ -4,6 +4,47 @@
|
|
|
4
4
|
> [!TIP]
|
|
5
5
|
> **Upgrading?** The [SmarterCSV Upgrade Wizard](https://tilo.github.io/smarter_csv/upgrade_wizard.html) walks you through what (if anything) you need to change for your specific version. Most steps do not require any changes.
|
|
6
6
|
|
|
7
|
+
## 1.18.0 (2026-06-17)
|
|
8
|
+
|
|
9
|
+
This release focuses on performance and on automatic conversion of decimals to `BigDecimal` or `Float`, which preserves precision and also supports scientific notation.
|
|
10
|
+
|
|
11
|
+
⚠️ This version is particularly interesting if you have geolocation, scientific, or high-precision data.
|
|
12
|
+
|
|
13
|
+
### New Features
|
|
14
|
+
|
|
15
|
+
- **`decimal_precision` option** (`:auto` default, or `:float` / `:bigdecimal`) — controls how decimal values are converted. `:auto` returns a `Float` unless the value carries more than 16 significant digits, in which case it returns a `BigDecimal` so no precision is lost; `:float` always returns `Float`; `:bigdecimal` always returns `BigDecimal`. Integers are unaffected (always `Integer`). Works identically on the C and Ruby paths. (Ruby's standard-library CSV has no high-precision option — its `:numeric`/`:float` converters use `Float()` and lose precision.)
|
|
16
|
+
- **Float** conversion on the C path now uses the fast **Eisel-Lemire** algorithm (fast_float, vendored) for mantissas up to 19 significant digits — correctly rounded, bit-for-bit identical to `String#to_f` — with a `strtod` fallback beyond that (more than 19 digits / extreme exponents). High-precision values that become `BigDecimal` under `:auto`/`:bigdecimal` are parsed by Ruby's `BigDecimal`.
|
|
17
|
+
|
|
18
|
+
### Behavior Changes
|
|
19
|
+
|
|
20
|
+
- **Scientific notation now converts to a number** (e.g. `"1e3"`, `"1.5e-5"`, `"6.022e23"`). Previously the Ruby path left these as Strings and the C path was inconsistent.
|
|
21
|
+
- **The C and Ruby numeric-conversion paths are now aligned.** Bare-dot forms like `".5"` and `"3."` stay Strings on **both** paths (the shared grammar requires an integer part and, when a dot is present, a fraction digit). Previously the C path converted these and the Ruby path did not.
|
|
22
|
+
- With the default `decimal_precision: :auto`, decimal values carrying more than 16 significant digits are now returned as `BigDecimal` instead of `Float`. Pass `decimal_precision: :float` to keep the previous always-`Float` behavior.
|
|
23
|
+
- `bigdecimal` is now a runtime dependency (it is no longer a default gem on Ruby 3.4+).
|
|
24
|
+
|
|
25
|
+
### Performance
|
|
26
|
+
|
|
27
|
+
The C-accelerated path is faster across the board, **up to ~1.5× on the right shapes** — numeric-heavy data and backslash-escaped quoted fields — and ~1.04–1.08× on typical files.
|
|
28
|
+
|
|
29
|
+
- Eisel-Lemire (Mushtak-Lemire) algorithm on the C path to convert decimals to `Float` (values that become `BigDecimal` are still parsed by Ruby's `BigDecimal`). Numeric-heavy data (many float/decimal columns) parses significantly faster.
|
|
30
|
+
- SIMD scanner for backslash-escaped quoted fields (C-path), using NEON (arm64) and SSE2 (x86-64) with a scalar fallback. Speeds up `quote_escaping: :backslash` parsing of long quoted fields.
|
|
31
|
+
|
|
32
|
+
| File | C-path | driver |
|
|
33
|
+
|---------------------------------|----------------------------------|-----------------------|
|
|
34
|
+
| backslash_long_fields_60k | 1.48× faster (0.1880s → 0.1273s) | SIMD quote/backslash scanner |
|
|
35
|
+
| sensor_data_50krows_50cols | 1.40× faster (0.2763s → 0.1975s) | Eisel-Lemire numeric conversion |
|
|
36
|
+
|
|
37
|
+
### Improvements
|
|
38
|
+
|
|
39
|
+
- Improved robustness of symbol-valued enum option processing.
|
|
40
|
+
|
|
41
|
+
### Tests
|
|
42
|
+
|
|
43
|
+
- Added parity tests for long quoted-field scanning across 16-byte boundaries, running on both the C and Ruby paths.
|
|
44
|
+
- Added tests for string-to-symbol coercion of the enum options.
|
|
45
|
+
|
|
46
|
+
|
|
47
|
+
|
|
7
48
|
## 1.17.4 (2026-06-03)
|
|
8
49
|
|
|
9
50
|
### Bug Fix
|
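The grammar behind the behavior changes in the 1.18.0 entry above can be sketched in plain Ruby. The regex is the `NUMERIC_REGEX` quoted in a comment in `smarter_csv.c`; the `numeric_token?` helper is hypothetical and only illustrates which strings would be treated as numbers, not how the gem implements the check.

```ruby
# Regex as quoted in the C extension's comment. numeric_token? is a
# hypothetical helper for illustration, not part of the gem's API.
NUMERIC_REGEX = /\A[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?\z/

def numeric_token?(str)
  NUMERIC_REGEX.match?(str)
end

# An integer part is required; a dot needs a fraction digit; an exponent
# needs an exponent digit. Everything else stays a String.
%w[42 -3.14 1e3 1.5e-5 6.022e23 .5 3. 1e abc].each do |token|
  verdict = numeric_token?(token) ? 'numeric' : 'stays a String'
  puts format('%-10s %s', token, verdict)
end
```

Under this grammar the scientific-notation examples from the changelog match, while the bare-dot forms `".5"` and `"3."` do not.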
data/README.md
CHANGED
|
@@ -15,6 +15,9 @@
|
|
|
15
15
|
|
|
16
16
|
> See [**Ruby CSV Pitfalls**](docs/ruby_csv_pitfalls.md) for 10 ways `CSV.read` silently corrupts or loses data, and how SmarterCSV handles them.
|
|
17
17
|
|
|
18
|
+
> [!TIP]
|
|
19
|
+
> **No silent precision loss (new in 1.18.0).** For scientific data, GPS/geo coordinates, and financial figures — which routinely carry 16+ significant digits — Ruby's standard CSV converts with `Float()`, so a value like `1234567890.123456789` is silently rounded to `1234567890.1234567`. SmarterCSV's default `decimal_precision: :auto` returns a `BigDecimal` for values beyond 16 significant digits (and `Float` otherwise) — full precision, no data loss. Floats are decoded with the Eisel-Lemire algorithm: correctly rounded, bit-for-bit identical to `String#to_f`.
|
|
20
|
+
|
|
18
21
|
Beyond raw speed, SmarterCSV is designed to provide a significantly more convenient and developer-friendly interface than traditional CSV libraries. Instead of returning raw arrays that require substantial post-processing, SmarterCSV produces Rails-ready hashes for each row, making the data immediately usable with ActiveRecord, Sidekiq pipelines, parallel processing, and JSON-based workflows such as S3.
|
|
19
22
|
|
|
20
23
|
In a Rails app, warnings auto-route through `Rails.logger` and instrumentation hooks compose with `ActiveSupport::Notifications` — no setup required. Outside Rails, warnings fall back to `$stderr` and the same APIs work without any framework dependency.
|
|
@@ -89,6 +92,8 @@ rows = SmarterCSV.process('data.csv')
|
|
|
89
92
|
data = SmarterCSV.parse(csv_string)
|
|
90
93
|
```
|
|
91
94
|
|
|
95
|
+
Numeric conversion is also more accurate: where Ruby's `:numeric`/`:float` converters round high-precision decimals through `Float()`, SmarterCSV's default `decimal_precision: :auto` returns a `BigDecimal` past 16 significant digits, so no precision is lost (pass `decimal_precision: :float` for like-for-like `Float` output).
|
|
96
|
+
|
|
92
97
|
* See [**Migrating from Ruby CSV**](docs/migrating_from_csv.md) for a full comparison of options, behavior differences, and a quick-reference table.
|
|
93
98
|
|
|
94
99
|
## Examples
|
|
@@ -156,6 +156,39 @@ data = SmarterCSV.process(file,
|
|
|
156
156
|
convert_values_to_numeric: { only: [:quantity, :price] })
|
|
157
157
|
```
|
|
158
158
|
|
|
159
|
+
Scientific notation (e.g. `"1.5e3"`, `"6.022e23"`) is recognized and converted too. Bare-dot forms like `".5"` and `"3."` are left as Strings (they are not valid numbers here). Integers and floats convert identically on the C-accelerated and pure-Ruby paths.
|
|
160
|
+
|
|
161
|
+
---
|
|
162
|
+
|
|
163
|
+
## `decimal_precision`
|
|
164
|
+
|
|
165
|
+
**Default: `:auto`**
|
|
166
|
+
|
|
167
|
+
Controls how decimal values (those with a `.` or an exponent) are converted. Integers are unaffected — they are always returned as `Integer`.
|
|
168
|
+
|
|
169
|
+
| Value | Result |
|
|
170
|
+
|---------------|-----------------------------------------------------------------------------------------|
|
|
171
|
+
| `:auto` | `Float`, unless the value carries more than 16 significant digits — then `BigDecimal`. |
|
|
172
|
+
| `:float` | Always `Float` (correctly rounded; matches `String#to_f`). |
|
|
173
|
+
| `:bigdecimal` | Always `BigDecimal` (full precision). |
|
|
174
|
+
|
|
175
|
+
```ruby
|
|
176
|
+
# :auto (default) — keeps full precision only when needed
|
|
177
|
+
SmarterCSV.process(file)
|
|
178
|
+
# "3.14" => 3.14 (Float)
|
|
179
|
+
# "1234567890.123456789" => 0.1234567890123456789e10 (BigDecimal — >16 sig digits)
|
|
180
|
+
|
|
181
|
+
# :float — always Float (faster, may lose precision on long decimals)
|
|
182
|
+
SmarterCSV.process(file, decimal_precision: :float)
|
|
183
|
+
# "1234567890.123456789" => 1234567890.1234567 (Float)
|
|
184
|
+
|
|
185
|
+
# :bigdecimal — always BigDecimal
|
|
186
|
+
SmarterCSV.process(file, decimal_precision: :bigdecimal)
|
|
187
|
+
# "3.14" => 0.314e1 (BigDecimal)
|
|
188
|
+
```
|
|
189
|
+
|
|
190
|
+
Unlike Ruby's standard-library CSV, whose `:numeric`/`:float` converters use `Float()` and silently lose precision, `:auto` preserves high-precision decimals as `BigDecimal`. Values returned as `Float` are decoded on the C path with the Eisel-Lemire algorithm (correctly rounded, identical to `String#to_f`).
|
|
191
|
+
|
|
159
192
|
---
|
|
160
193
|
|
|
161
194
|
## `remove_empty_hashes`
|
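The `decimal_precision` rules documented above can be approximated in a few lines of plain Ruby. This is a minimal sketch, not the gem's code: `significant_digits` and `convert_decimal` are hypothetical names, and the digit count follows the description in the C source (mantissa digits with leading zeros excluded).

```ruby
require 'bigdecimal'

# Hypothetical helpers mirroring the documented rules; not the gem's code.
NUMERIC = /\A[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?\z/

# Significant digits of the mantissa, leading zeros excluded.
def significant_digits(str)
  mantissa = str.sub(/[eE].*\z/, '').delete('+-.')
  mantissa.sub(/\A0+/, '').length
end

def convert_decimal(str, decimal_precision: :auto)
  return str unless NUMERIC.match?(str)        # not numeric: stays a String
  return str.to_i unless str.match?(/[.eE]/)   # integers are unaffected

  case decimal_precision
  when :float      then str.to_f
  when :bigdecimal then BigDecimal(str)
  else significant_digits(str) > 16 ? BigDecimal(str) : str.to_f
  end
end

p convert_decimal('3.14').class                                             # Float
p convert_decimal('1234567890.123456789').class                             # BigDecimal
p convert_decimal('1234567890.123456789', decimal_precision: :float).class  # Float
```

The threshold of 16 comes from the option table; everything else in the sketch is an assumption made for illustration.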
data/docs/migrating_from_csv.md
CHANGED
|
@@ -223,6 +223,24 @@ rows = SmarterCSV.process('sample.csv',
|
|
|
223
223
|
convert_values_to_numeric: { except: [:zip_code, :phone, :account_number] })
|
|
224
224
|
```
|
|
225
225
|
|
|
226
|
+
**High-precision decimals — scientific data and geo coordinates.** GPS/geo coordinates, scientific measurements, and financial figures routinely carry 16+ significant digits, where Ruby's `Float()`-based conversion (`converters: :numeric` / `:float`) silently rounds the value. SmarterCSV's default `decimal_precision: :auto` returns a `BigDecimal` once a value exceeds 16 significant digits (and a `Float` otherwise), so the full value is preserved; scientific notation (`6.022e23`, `1.6e-19`) is recognized as numeric too.
|
|
227
|
+
|
|
228
|
+
**With Ruby CSV (precision lost):**
|
|
229
|
+
```ruby
|
|
230
|
+
CSV.read('locations.csv', headers: true, converters: :float).first['lat']
|
|
231
|
+
# => -122.42200352825247 ← Float() dropped the last digits of -122.422003528252475
|
|
232
|
+
```
|
|
233
|
+
|
|
234
|
+
**With SmarterCSV (full precision kept):**
|
|
235
|
+
```ruby
|
|
236
|
+
SmarterCSV.process('locations.csv').first[:lat]
|
|
237
|
+
# => -0.122422003528252475e3 (BigDecimal — all 18 significant digits preserved)
|
|
238
|
+
|
|
239
|
+
# Force Float everywhere, like-for-like with Ruby CSV:
|
|
240
|
+
SmarterCSV.process('locations.csv', decimal_precision: :float).first[:lat]
|
|
241
|
+
# => -122.42200352825247 (Float)
|
|
242
|
+
```
|
|
243
|
+
|
|
226
244
|
### 3. Empty values are removed by default
|
|
227
245
|
|
|
228
246
|
SmarterCSV drops key/value pairs where the value is `nil` or blank
|
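The precision loss described in the high-precision example above can be reproduced with the standard library alone, without either CSV library. The snippet below is a small sketch using the latitude string from that example; it only contrasts `Float()` with `BigDecimal()` and makes no claim about either library's internals.

```ruby
require 'bigdecimal'

raw = '-122.422003528252475'   # 18 significant digits

as_float   = Float(raw)        # what a Float()-based converter yields
as_decimal = BigDecimal(raw)   # what a BigDecimal keeps

puts as_float                  # shortest round-trip form, at most 17 significant digits
puts as_decimal.to_s('F')      # the decimal digits exactly as written

# The Float no longer represents the original decimal value exactly.
puts BigDecimal(as_float.to_s) == as_decimal   # false
```

A double's shortest round-trip text never needs more than 17 significant digits, so an 18-digit input cannot survive a trip through `Float` unchanged.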
data/docs/options.md
CHANGED
|
@@ -121,7 +121,8 @@ See [Parsing Strategy](./parsing_strategy.md) for full details on quote handling
|
|
|
121
121
|
| Option | Default | Explanation |
|
|
122
122
|
|--------|---------|-------------|
|
|
123
123
|
| `:strip_whitespace` | `true` | Remove whitespace before/after values and headers. |
|
|
124
|
-
| `:convert_values_to_numeric` | `true` | Convert strings containing integers or floats to the appropriate numeric type. Accepts `{except: [:key1, :key2]}` or `{only: :key3}` to limit which columns. |
|
|
124
|
+
| `:convert_values_to_numeric` | `true` | Convert strings containing integers or floats (including scientific notation like `1.5e3`) to the appropriate numeric type. Accepts `{except: [:key1, :key2]}` or `{only: :key3}` to limit which columns. |
|
|
125
|
+
| `:decimal_precision` | `:auto` | How decimals are converted: `:auto` returns `Float` but `BigDecimal` above 16 significant digits (no precision loss); `:float` always returns `Float`; `:bigdecimal` always returns `BigDecimal`. Integers are unaffected. |
|
|
125
126
|
| `:value_converters` | `nil` | Hash of `:header => converter`; converter can be a lambda/Proc or a class implementing `self.convert(value)`. See [Value Converters](./value_converters.md). |
|
|
126
127
|
| `:remove_empty_values` | `true` | Remove key/value pairs where the value is `nil`, empty, or whitespace-only — any Unicode whitespace, same as Ruby's `String#blank?`. |
|
|
127
128
|
| `:remove_zero_values` | `false` | Remove key/value pairs whose value is zero — numeric `0` / `0.0`, or any textual form of zero (`"0"`, `"0.0"`, `"00.00"`, `"+0"`, `"-0.0"`, …). |
|
data/ext/smarter_csv/smarter_csv.c
CHANGED
|
@@ -7,6 +7,14 @@
|
|
|
7
7
|
#include <stdlib.h>
|
|
8
8
|
#include <errno.h>
|
|
9
9
|
|
|
10
|
+
#ifdef __ARM_NEON
|
|
11
|
+
#include <arm_neon.h>
|
|
12
|
+
#elif defined(__SSE2__)
|
|
13
|
+
#include <immintrin.h>
|
|
14
|
+
#endif
|
|
15
|
+
|
|
16
|
+
#include "vendor/eisel_lemire.h" /* Eisel-Lemire decimal->double, correctly rounded (fast_float) */
|
|
17
|
+
|
|
10
18
|
#ifndef bool
|
|
11
19
|
#define bool int
|
|
12
20
|
#define false ((bool)0)
|
|
@@ -41,6 +49,8 @@ static ID id_only, id_except, id_quote_boundary;
|
|
|
41
49
|
static ID id_only_headers, id_except_headers, id_keep_cols, id_strict;
|
|
42
50
|
static ID id_keep_bitmap, id_keep_extra_cols, id_early_exit_after_sym;
|
|
43
51
|
static ID id_backslash, id_standard;
|
|
52
|
+
static ID id_decimal_precision, id_float, id_bigdecimal;
|
|
53
|
+
static ID id_BigDecimal; /* the Kernel#BigDecimal() method (require 'bigdecimal' done in Ruby) */
|
|
44
54
|
|
|
45
55
|
/* ================================================================================
|
|
46
56
|
* ParseContext — wraps all per-file parse options as a GC-managed TypedData object.
|
|
@@ -70,6 +80,9 @@ typedef struct {
|
|
|
70
80
|
/* Numeric conversion: 0=off, 1=all, 2=only listed keys, 3=except listed keys */
|
|
71
81
|
int numeric_mode;
|
|
72
82
|
|
|
83
|
+
/* Decimal handling: 0=float, 1=auto (BigDecimal above 16 sig digits), 2=bigdecimal */
|
|
84
|
+
int decimal_precision;
|
|
85
|
+
|
|
73
86
|
/* Column filter bitmap (xmalloc'd; NULL when no filtering active) */
|
|
74
87
|
bool *keep_bitmap;
|
|
75
88
|
long keep_bitmap_len;
|
|
@@ -133,6 +146,51 @@ static const rb_data_type_t parse_context_type = {
|
|
|
133
146
|
RUBY_TYPED_FREE_IMMEDIATELY | RUBY_TYPED_WB_PROTECTED
|
|
134
147
|
};
|
|
135
148
|
|
|
149
|
+
/* Scan [p, end) for the first `quote` char or backslash; returns a pointer to it,
|
|
150
|
+
* or `end` if neither occurs. NEON (arm64) or SSE2 (x86-64) processes 16 bytes per
|
|
151
|
+
* iteration; scalar fallback elsewhere. Ported from smarter_json's fj_scan_str.
|
|
152
|
+
*
|
|
153
|
+
* Used by the quoted-field slow path in :backslash escaping mode, where the only bytes
|
|
154
|
+
* that can change parser state inside a quoted field are the quote char (closing /
|
|
155
|
+
* doubled) and the backslash (escape). Bulk-skipping the plain content between them
|
|
156
|
+
* keeps the byte-by-byte state machine's behavior but avoids stepping every byte.
|
|
157
|
+
* In RFC mode the slow path uses a plain memchr-to-quote instead (only one byte class
|
|
158
|
+
* matters there), so this two-class scan is reserved for backslash mode. */
|
|
159
|
+
static inline const char *scan_quote_or_backslash(const char *p, const char *end, char quote) {
|
|
160
|
+
#ifdef __ARM_NEON
|
|
161
|
+
const uint8x16_t vq = vdupq_n_u8((uint8_t)quote);
|
|
162
|
+
const uint8x16_t vbs = vdupq_n_u8((uint8_t)'\\');
|
|
163
|
+
while (p + 16 <= end) {
|
|
164
|
+
uint8x16_t chunk = vld1q_u8((const uint8_t *)p);
|
|
165
|
+
uint8x16_t m = vorrq_u8(vceqq_u8(chunk, vq), vceqq_u8(chunk, vbs));
|
|
166
|
+
/* movemask emulation (Oj's technique): pack to 4 bits/byte, then ctz/4. */
|
|
167
|
+
uint8x8_t res = vshrn_n_u16(vreinterpretq_u16_u8(m), 4);
|
|
168
|
+
uint64_t mask = vget_lane_u64(vreinterpret_u64_u8(res), 0);
|
|
169
|
+
if (__builtin_expect(mask != 0, 0)) { /* most 16-byte chunks contain neither */
|
|
170
|
+
mask &= 0x8888888888888888ull;
|
|
171
|
+
return p + (__builtin_ctzll(mask) >> 2);
|
|
172
|
+
}
|
|
173
|
+
p += 16;
|
|
174
|
+
}
|
|
175
|
+
#elif defined(__SSE2__)
|
|
176
|
+
const __m128i vq = _mm_set1_epi8(quote);
|
|
177
|
+
const __m128i vbs = _mm_set1_epi8('\\');
|
|
178
|
+
while (p + 16 <= end) {
|
|
179
|
+
__m128i chunk = _mm_loadu_si128((const __m128i *)p);
|
|
180
|
+
__m128i m = _mm_or_si128(_mm_cmpeq_epi8(chunk, vq), _mm_cmpeq_epi8(chunk, vbs));
|
|
181
|
+
int mask = _mm_movemask_epi8(m); /* one bit per byte that matched */
|
|
182
|
+
if (__builtin_expect(mask != 0, 0)) { /* most 16-byte chunks contain neither */
|
|
183
|
+
return p + __builtin_ctz(mask);
|
|
184
|
+
}
|
|
185
|
+
p += 16;
|
|
186
|
+
}
|
|
187
|
+
#endif
|
|
188
|
+
for (; p < end; p++) {
|
|
189
|
+
if (*p == quote || *p == '\\') return p;
|
|
190
|
+
}
|
|
191
|
+
return end;
|
|
192
|
+
}
|
|
193
|
+
|
|
136
194
|
static VALUE unescape_quotes(char *str, long len, char quote_char, rb_encoding *encoding) {
|
|
137
195
|
// Fast path: scan for any doubled quote pair. If none present, the field has
|
|
138
196
|
// nothing to unescape — emit it directly via rb_enc_str_new and skip the
|
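The NEON branch of `scan_quote_or_backslash` above finds the first matching byte through a 64-bit mask with four bits per input byte. The following Ruby simulation is a sketch of that arithmetic only, assuming the compare step marks each matching byte as `0xFF`; `first_quote_or_backslash` is a hypothetical name and nothing here calls the C code.

```ruby
BACKSLASH = 0x5C

# Simulates the NEON path on one 16-byte chunk (illustration only).
def first_quote_or_backslash(chunk, quote = '"')
  raise ArgumentError, 'expected a 16-byte chunk' unless chunk.bytesize == 16

  # After the compare, each matching byte is 0xFF. The narrowing shift keeps
  # 4 bits per byte, so byte i owns bits 4*i .. 4*i+3 of a 64-bit mask.
  mask = 0
  chunk.each_byte.with_index do |byte, i|
    mask |= 0xF << (4 * i) if byte == quote.ord || byte == BACKSLASH
  end
  return nil if mask.zero?            # no hit: the C loop advances 16 bytes

  mask &= 0x8888_8888_8888_8888       # keep one bit per nibble
  trailing_zeros = (mask & -mask).bit_length - 1
  trailing_zeros >> 2                 # 4 mask bits per input byte
end

p first_quote_or_backslash('abc"defghijklmno')   # 3
p first_quote_or_backslash('abcdefghijklmnop')   # nil
```

Dividing the trailing-zero count by four (the `>> 2`) is what turns a bit position in the packed mask back into a byte offset within the chunk.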
|
@@ -386,6 +444,20 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
|
|
|
386
444
|
backslash_count = 0;
|
|
387
445
|
field_started = false; // reset for next field
|
|
388
446
|
} else {
|
|
447
|
+
/* Backslash mode: SIMD scan-ahead (NEON/SSE2, scalar fallback) to the next quote OR backslash (Opt #7).
|
|
448
|
+
* Inside a quoted field the only state-changing bytes are the quote char and the
|
|
449
|
+
* backslash; bulk-skip the plain content between them. Skipped bytes are plain
|
|
450
|
+
* content, which the byte-by-byte loop resets backslash_count to 0 on, so reset
|
|
451
|
+
* it here whenever we actually move p. */
|
|
452
|
+
if (allow_escaped_quotes && in_quotes) {
|
|
453
|
+
const char *hit = scan_quote_or_backslash(p, endP, quote_char_val);
|
|
454
|
+
if (hit != p) {
|
|
455
|
+
backslash_count = 0;
|
|
456
|
+
p = (char *)hit;
|
|
457
|
+
if (p == endP) continue; /* no quote/backslash before end → unclosed */
|
|
458
|
+
}
|
|
459
|
+
}
|
|
460
|
+
|
|
389
461
|
if (allow_escaped_quotes && *p == '\\') {
|
|
390
462
|
backslash_count++;
|
|
391
463
|
if (__builtin_expect(quote_boundary_standard, 1) && !in_quotes) field_started = true;
|
|
@@ -525,47 +597,101 @@ static inline VALUE get_key_for_index(long index, VALUE headers, long headers_le
|
|
|
525
597
|
* Handles overflow: if strtol overflows (ERANGE), falls back to rb_cstr_to_inum
|
|
526
598
|
* which produces a Ruby Bignum.
|
|
527
599
|
*/
|
|
528
|
-
static inline VALUE try_numeric_conversion(char *
|
|
529
|
-
// Quick pre-check: first char must be digit
|
|
530
|
-
char first =
|
|
531
|
-
if (!((first >= '0' && first <= '9') || first == '+' || first == '-'
|
|
600
|
+
static inline VALUE try_numeric_conversion(char *s, long n, int decimal_precision) {
|
|
601
|
+
// Quick pre-check: first char must be a digit or a sign.
|
|
602
|
+
char first = s[0];
|
|
603
|
+
if (!((first >= '0' && first <= '9') || first == '+' || first == '-')) {
|
|
532
604
|
return Qundef;
|
|
533
605
|
}
|
|
534
606
|
|
|
535
|
-
|
|
536
|
-
|
|
537
|
-
|
|
538
|
-
|
|
539
|
-
|
|
540
|
-
|
|
541
|
-
|
|
542
|
-
|
|
543
|
-
|
|
544
|
-
|
|
545
|
-
|
|
546
|
-
if (
|
|
547
|
-
|
|
548
|
-
|
|
549
|
-
|
|
550
|
-
|
|
551
|
-
|
|
552
|
-
|
|
553
|
-
|
|
607
|
+
/* Single pass: validate the token against the same grammar as the Ruby path's
|
|
608
|
+
* NUMERIC_REGEX = /\A[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?\z/ and, in the same pass,
|
|
609
|
+
* extract everything the fast paths need:
|
|
610
|
+
* - mantissa value m10 (exact for <= 19 digits; `overflow` flags beyond)
|
|
611
|
+
* - significant-digit count `sig` (leading zeros excluded; matches the Ruby
|
|
612
|
+
* significant_digits helper / Oj dec_cnt) — drives the :auto Float/BigDecimal split
|
|
613
|
+
* - base-10 exponent e10 (from the fraction length and any explicit exponent)
|
|
614
|
+
* Anything the grammar rejects returns Qundef (stays a String), keeping the C and
|
|
615
|
+
* Ruby paths byte-identical on what does and does not convert. */
|
|
616
|
+
long i = 0;
|
|
617
|
+
int neg = 0;
|
|
618
|
+
if (s[i] == '+' || s[i] == '-') { neg = (s[i] == '-'); i++; }
|
|
619
|
+
|
|
620
|
+
uint64_t m10 = 0;
|
|
621
|
+
int m10digits = 0; /* mantissa digits accumulated into m10 (capped at 19) */
|
|
622
|
+
long sig = 0; /* significant digits (leading zeros excluded) */
|
|
623
|
+
int sig_started = 0;
|
|
624
|
+
bool overflow = false;
|
|
625
|
+
long int_digits = 0, frac_digits = 0;
|
|
626
|
+
bool seen_dot = false, seen_exp = false, any_digit = false, exp_any = false;
|
|
627
|
+
int64_t exp_val = 0; int exp_neg = 0;
|
|
628
|
+
|
|
629
|
+
for (; i < n; i++) {
|
|
630
|
+
char c = s[i];
|
|
631
|
+
if (c >= '0' && c <= '9') {
|
|
632
|
+
any_digit = true;
|
|
633
|
+
if (!seen_exp) {
|
|
634
|
+
if (seen_dot) frac_digits++; else int_digits++;
|
|
635
|
+
if (sig_started) sig++;
|
|
636
|
+
else if (c != '0') { sig_started = 1; sig = 1; }
|
|
637
|
+
if (m10digits < 19) { m10 = m10 * 10 + (uint64_t)(c - '0'); m10digits++; }
|
|
638
|
+
else overflow = true;
|
|
639
|
+
} else {
|
|
640
|
+
exp_any = true;
|
|
641
|
+
exp_val = exp_val * 10 + (c - '0');
|
|
642
|
+
if (exp_val > 1000000) overflow = true; /* extreme exponent → strtod fallback */
|
|
554
643
|
}
|
|
555
|
-
|
|
644
|
+
} else if (c == '.' && !seen_dot && !seen_exp) {
|
|
645
|
+
seen_dot = true;
|
|
646
|
+
} else if ((c == 'e' || c == 'E') && !seen_exp && any_digit) {
|
|
647
|
+
seen_exp = true;
|
|
648
|
+
if (i + 1 < n && (s[i + 1] == '+' || s[i + 1] == '-')) { exp_neg = (s[i + 1] == '-'); i++; }
|
|
649
|
+
} else {
|
|
650
|
+
return Qundef; /* invalid char for a number → not numeric */
|
|
556
651
|
}
|
|
557
652
|
}
|
|
558
653
|
|
|
559
|
-
|
|
560
|
-
|
|
561
|
-
|
|
562
|
-
|
|
563
|
-
|
|
564
|
-
|
|
654
|
+
/* Enforce NUMERIC_REGEX exactly: an integer part is required; a dot requires a
|
|
655
|
+
* fraction digit; an exponent requires an exponent digit. */
|
|
656
|
+
if (int_digits == 0) return Qundef;
|
|
657
|
+
if (seen_dot && frac_digits == 0) return Qundef;
|
|
658
|
+
if (seen_exp && !exp_any) return Qundef;
|
|
659
|
+
|
|
660
|
+
bool is_decimal = seen_dot || seen_exp;
|
|
661
|
+
|
|
662
|
+
if (!is_decimal) {
|
|
663
|
+
/* Integer. Fast path when it fits in a long; otherwise a Ruby Integer/Bignum. */
|
|
664
|
+
if (!overflow && m10digits <= 18) {
|
|
665
|
+
long v = (long)m10;
|
|
666
|
+
return LONG2NUM(neg ? -v : v);
|
|
565
667
|
}
|
|
668
|
+
VALUE str = rb_str_new(s, n);
|
|
669
|
+
return rb_cstr_to_inum(RSTRING_PTR(str), 10, false);
|
|
566
670
|
}
|
|
567
671
|
|
|
568
|
-
|
|
672
|
+
/* Decimal (has a '.' or an exponent) — honor decimal_precision. 0=float, 1=auto, 2=bigdecimal */
|
|
673
|
+
if (decimal_precision == 2 || (decimal_precision == 1 && sig > 16)) {
|
|
674
|
+
VALUE str = rb_str_new(s, n);
|
|
675
|
+
return rb_funcall(rb_cObject, id_BigDecimal, 1, str);
|
|
676
|
+
}
|
|
677
|
+
|
|
678
|
+
/* Float. base-10 exponent = explicit exponent minus the fraction length. */
|
|
679
|
+
int64_t e10 = (exp_neg ? -exp_val : exp_val) - (int64_t)frac_digits;
|
|
680
|
+
double d;
|
|
681
|
+
if (!overflow && m10digits >= 1 && m10digits <= 19 && ((long)m10digits + e10) >= -307) {
|
|
682
|
+
/* Eisel-Lemire is correctly-rounded for any nonzero mantissa that fits exactly in a
|
|
683
|
+
* uint64 — i.e. up to 19 significant digits (the max 19-digit value ~1.0e19 is below
|
|
684
|
+
* UINT64_MAX ~1.8e19). Verified bit-for-bit vs the stdlib over 1..19-digit ties. */
|
|
685
|
+
d = (m10 == 0) ? (neg ? -0.0 : 0.0) : fj_eisel_lemire_s2d(e10, m10, neg);
|
|
686
|
+
} else {
|
|
687
|
+
/* >19 digits / extreme or subnormal exponent: fall back to Ruby's own correctly-rounded
|
|
688
|
+
* strtod (rb_cstr_to_dbl) — the exact conversion String#to_f uses — so the C path and the
|
|
689
|
+
* Ruby path produce the identical double on every platform, not just where the system
|
|
690
|
+
* strtod happens to be correctly rounded. The token is pre-validated, so badcheck=0. */
|
|
691
|
+
VALUE str = rb_str_new(s, n);
|
|
692
|
+
d = rb_cstr_to_dbl(RSTRING_PTR(str), 0);
|
|
693
|
+
}
|
|
694
|
+
return DBL2NUM(d);
|
|
569
695
|
}
|
|
570
696
|
|
|
571
697
|
/*
|
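The Float branch of `try_numeric_conversion` above reduces a token to an integer mantissa `m10` and a base-10 exponent `e10` (the explicit exponent minus the fraction length) before handing them to Eisel-Lemire. A Ruby sketch of that decomposition follows; `decompose` is a hypothetical helper, it assumes the token already passed the numeric grammar, and it uses arbitrary-precision integers rather than the 19-digit `uint64_t` cap of the C code.

```ruby
# Hypothetical helper: splits an already-validated decimal token into
# [negative?, m10, e10] so that value = (-1)^neg * m10 * 10**e10.
def decompose(token)
  mantissa, exponent = token.split(/[eE]/)
  negative = mantissa.start_with?('-')
  int_part, frac_part = mantissa.delete('+-').split('.')
  frac_part ||= ''
  m10 = (int_part + frac_part).to_i          # all mantissa digits as one integer
  e10 = exponent.to_i - frac_part.length     # nil.to_i is 0 when no exponent is given
  [negative, m10, e10]
end

p decompose('3.14')       # [false, 314, -2]
p decompose('1.5e-5')     # [false, 15, -6]
p decompose('6.022e23')   # [false, 6022, 20]

# Exact check with rationals: 15 * 10**-6 is 3/200000.
_neg, m10, e10 = decompose('1.5e-5')
p Rational(m10) * (10r**e10)
```

Keeping the mantissa as an exact integer is what lets a correctly rounded conversion work from `(m10, e10)` alone, without re-reading the text.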
|
@@ -614,6 +740,7 @@ typedef struct {
|
|
|
614
740
|
long headers_len;
|
|
615
741
|
long hash_capa; // Pre-computed capacity for lazy hash allocation
|
|
616
742
|
int numeric_mode; // 0=off, 1=all, 2=only, 3=except
|
|
743
|
+
int decimal_precision; // 0=float, 1=auto (BigDecimal above 16 sig digits), 2=bigdecimal
|
|
617
744
|
bool remove_empty_values;
|
|
618
745
|
bool remove_zero_values;
|
|
619
746
|
} field_transform_opts;
|
|
@@ -705,7 +832,7 @@ static inline __attribute__((always_inline)) bool insert_field_into_hash(
|
|
|
705
832
|
(opts->numeric_mode == 2 && rb_ary_includes(opts->numeric_keys, key) == Qtrue) ||
|
|
706
833
|
(opts->numeric_mode == 3 && rb_ary_includes(opts->numeric_keys, key) != Qtrue);
|
|
707
834
|
if (do_convert) {
|
|
708
|
-
VALUE numeric = try_numeric_conversion(trim_start, trimmed_len);
|
|
835
|
+
VALUE numeric = try_numeric_conversion(trim_start, trimmed_len, opts->decimal_precision);
|
|
709
836
|
if (numeric != Qundef) {
|
|
710
837
|
ensure_hash_allocated(opts);
|
|
711
838
|
rb_hash_aset(opts->hash, key, numeric);
|
|
@@ -752,6 +879,18 @@ void parse_numeric_option(VALUE options_hash, int *out_mode, VALUE *out_keys) {
|
|
|
752
879
|
}
|
|
753
880
|
}
|
|
754
881
|
|
|
882
|
+
/* Read decimal_precision into 0=float, 1=auto, 2=bigdecimal. Default :auto (1).
|
|
883
|
+
* The option is validated and coerced to a symbol on the Ruby side before we get here. */
|
|
884
|
+
static inline int parse_decimal_precision(VALUE options_hash) {
|
|
885
|
+
VALUE v = rb_hash_aref(options_hash, ID2SYM(id_decimal_precision));
|
|
886
|
+
if (RB_TYPE_P(v, T_SYMBOL)) {
|
|
887
|
+
ID s = SYM2ID(v);
|
|
888
|
+
if (s == id_float) return 0;
|
|
889
|
+
if (s == id_bigdecimal) return 2;
|
|
890
|
+
}
|
|
891
|
+
return 1; // :auto (also the default when unset)
|
|
892
|
+
}
|
|
893
|
+
|
|
755
894
|
/*
|
|
756
895
|
* ================================================================================
|
|
757
896
|
* rb_parse_line_to_hash - Parse CSV line directly into a Ruby Hash
|
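The comment on `parse_decimal_precision` above says the option is validated and coerced to a symbol on the Ruby side. That Ruby code is not part of this excerpt, so the following is only a hypothetical sketch of what such a step could look like; the helper name, the error class, and the message are assumptions.

```ruby
# Hypothetical validation helper; the gem's actual reader_options.rb code and
# error class are not shown in this diff.
DECIMAL_PRECISION_VALUES = %i[auto float bigdecimal].freeze

def normalize_decimal_precision(value)
  sym = value.nil? ? :auto : value.to_s.to_sym   # accept "float" as well as :float
  return sym if DECIMAL_PRECISION_VALUES.include?(sym)

  raise ArgumentError,
        "decimal_precision must be one of #{DECIMAL_PRECISION_VALUES.inspect}, got #{value.inspect}"
end

# Integer codes documented in the C struct comment: 0=float, 1=auto, 2=bigdecimal.
C_CODES = { float: 0, auto: 1, bigdecimal: 2 }.freeze

p C_CODES.fetch(normalize_decimal_precision('float'))   # 0
p C_CODES.fetch(normalize_decimal_precision(nil))       # 1
```

Normalizing once in Ruby means the C side only has to compare interned symbols, which matches the simple symbol checks in `parse_decimal_precision`.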
|
@@ -826,6 +965,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
|
|
|
826
965
|
int numeric_mode = 0;
|
|
827
966
|
VALUE numeric_keys = Qnil;
|
|
828
967
|
parse_numeric_option(options_hash, &numeric_mode, &numeric_keys);
|
|
968
|
+
int decimal_precision = parse_decimal_precision(options_hash);
|
|
829
969
|
|
|
830
970
|
// quote_escaping and quote_boundary are only needed in Section 5 (quoted/slow path).
|
|
831
971
|
// They are declared here as forward declarations so Section 5 can set them lazily.
|
|
@@ -990,6 +1130,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
|
|
|
990
1130
|
.headers_len = headers_len,
|
|
991
1131
|
.hash_capa = hash_size,
|
|
992
1132
|
.numeric_mode = numeric_mode,
|
|
1133
|
+
.decimal_precision = decimal_precision,
|
|
993
1134
|
.remove_empty_values = remove_empty_values,
|
|
994
1135
|
.remove_zero_values = remove_zero_values,
|
|
995
1136
|
};
|
|
@@ -1160,6 +1301,20 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash(VALUE self, VALUE line,
|
|
|
1160
1301
|
p = next_quote; /* jump to quote char; fall through to quote-handling code */
|
|
1161
1302
|
}
|
|
1162
1303
|
|
|
1304
|
+
/* Backslash mode: SIMD scan-ahead (NEON/SSE2, scalar fallback) to the next quote OR backslash (Opt #7).
|
|
1305
|
+
* The RFC memchr skip above only matters for one byte class; with escaping on
|
|
1306
|
+
* a backslash also changes state, so scan for both. Skipped bytes are plain
|
|
1307
|
+
* content (the byte-by-byte loop resets backslash_count to 0 on them), so reset
|
|
1308
|
+
* it here whenever we actually move p. */
|
|
1309
|
+
if (allow_escaped_quotes && in_quotes) {
|
|
1310
|
+
const char *hit = scan_quote_or_backslash(p, endP, quote_char_val);
|
|
1311
|
+
if (hit != p) {
|
|
1312
|
+
backslash_count = 0;
|
|
1313
|
+
p = (char *)hit;
|
|
1314
|
+
if (p == endP) continue; /* no quote/backslash before end → unclosed */
|
|
1315
|
+
}
|
|
1316
|
+
}
|
|
1317
|
+
|
|
1163
1318
|
if (allow_escaped_quotes && *p == '\\') {
|
|
1164
1319
|
// Count consecutive backslashes for escape sequence detection
|
|
1165
1320
|
backslash_count++;
|
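The scan-ahead added in the hunk above jumps straight to the next quote or backslash instead of stepping through every byte of a quoted field. The idea can be sketched in Ruby with `String#index`. This is a simplified, hypothetical helper: it treats a backslash as escaping the next character and ignores doubled quotes and the `quote_boundary` handling that the real parser also deals with.

```ruby
# Hypothetical, simplified helper: returns the index of the closing quote of a
# quoted field whose content starts at `pos`, or nil if the field is unclosed.
def closing_quote_index(line, pos, quote = '"')
  stop = /[#{Regexp.escape(quote)}\\]/
  while (hit = line.index(stop, pos))   # bulk-skip plain content
    return hit if line[hit] == quote    # unescaped quote ends the field
    pos = hit + 2                       # backslash: skip it and the escaped char
  end
  nil
end

line = '"a \"quoted\" word",next'
p closing_quote_index(line, 1)     # 18
p closing_quote_index('"abc', 1)   # nil
```

Only the two byte classes that can change parser state are inspected, which is the property the C comment relies on when it resets `backslash_count` after a skip.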
|
@@ -1354,6 +1509,7 @@ __attribute__((cold)) static VALUE rb_new_parse_context(VALUE self, VALUE header
|
|
|
1354
1509
|
|
|
1355
1510
|
/* Numeric conversion */
|
|
1356
1511
|
parse_numeric_option(options_hash, &ctx->numeric_mode, &ctx->numeric_keys);
|
|
1512
|
+
ctx->decimal_precision = parse_decimal_precision(options_hash);
|
|
1357
1513
|
|
|
1358
1514
|
/* quote_escaping → allow_escaped_quotes */
|
|
1359
1515
|
VALUE quote_escaping_val = rb_hash_aref(options_hash, ID2SYM(id_quote_escaping));
|
|
@@ -1474,6 +1630,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
|
|
|
1474
1630
|
bool remove_empty_values = ctx->remove_empty_values;
|
|
1475
1631
|
bool remove_zero_values = ctx->remove_zero_values;
|
|
1476
1632
|
int numeric_mode = ctx->numeric_mode;
|
|
1633
|
+
int decimal_precision = ctx->decimal_precision;
|
|
1477
1634
|
VALUE numeric_keys = ctx->numeric_keys;
|
|
1478
1635
|
bool *keep_bitmap = ctx->keep_bitmap;
|
|
1479
1636
|
/* keep_bitmap is cached in the context (xmalloc'd once at construction, sized to the header count
|
|
@@ -1525,6 +1682,7 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
|
|
|
1525
1682
|
.headers_len = headers_len,
|
|
1526
1683
|
.hash_capa = hash_size,
|
|
1527
1684
|
.numeric_mode = numeric_mode,
|
|
1685
|
+
.decimal_precision = decimal_precision,
|
|
1528
1686
|
.remove_empty_values = remove_empty_values,
|
|
1529
1687
|
.remove_zero_values = remove_zero_values,
|
|
1530
1688
|
};
|
|
@@ -1654,6 +1812,16 @@ __attribute__((hot)) static VALUE rb_parse_line_to_hash_ctx(VALUE self, VALUE li
|
|
|
1654
1812
|
p = next_quote; /* fall through to quote-handling code */
|
|
1655
1813
|
}
|
|
1656
1814
|
|
|
1815
|
+
/* Backslash mode: SIMD scan-ahead (NEON/SSE2, scalar fallback) to the next quote OR backslash (Opt #7). */
|
|
1816
|
+
if (allow_escaped_quotes && in_quotes) {
|
|
1817
|
+
const char *hit = scan_quote_or_backslash(p, endP, quote_char_val);
|
|
1818
|
+
if (hit != p) {
|
|
1819
|
+
backslash_count = 0;
|
|
1820
|
+
p = (char *)hit;
|
|
1821
|
+
if (p == endP) continue; /* no quote/backslash before end → unclosed */
|
|
1822
|
+
}
|
|
1823
|
+
}
|
|
1824
|
+
|
|
1657
1825
|
if (allow_escaped_quotes && *p == '\\') {
|
|
1658
1826
|
backslash_count++;
|
|
1659
1827
|
if (__builtin_expect(quote_boundary_standard, 1) && !in_quotes) field_started = true;
|
|
@@ -1878,6 +2046,10 @@ void Init_smarter_csv(void) {
|
|
|
1878
2046
|
id_strict = rb_intern("strict");
|
|
1879
2047
|
id_backslash = rb_intern("backslash");
|
|
1880
2048
|
id_standard = rb_intern("standard");
|
|
2049
|
+
id_decimal_precision = rb_intern("decimal_precision");
|
|
2050
|
+
id_float = rb_intern("float");
|
|
2051
|
+
id_bigdecimal = rb_intern("bigdecimal");
|
|
2052
|
+
id_BigDecimal = rb_intern("BigDecimal"); /* Kernel#BigDecimal(); 'bigdecimal' is required in lib/smarter_csv.rb */
|
|
1881
2053
|
|
|
1882
2054
|
rb_define_module_function(Parser, "parse_csv_line_c", rb_parse_csv_line, 9);
|
|
1883
2055
|
rb_define_module_function(Parser, "count_quote_chars_c", rb_count_quote_chars, 4);
|
data/ext/smarter_csv/vendor/LICENSE-fast_float-MIT
ADDED
|
@@ -0,0 +1,27 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2021 The fast_float authors
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any
|
|
6
|
+
person obtaining a copy of this software and associated
|
|
7
|
+
documentation files (the "Software"), to deal in the
|
|
8
|
+
Software without restriction, including without
|
|
9
|
+
limitation the rights to use, copy, modify, merge,
|
|
10
|
+
publish, distribute, sublicense, and/or sell copies of
|
|
11
|
+
the Software, and to permit persons to whom the Software
|
|
12
|
+
is furnished to do so, subject to the following
|
|
13
|
+
conditions:
|
|
14
|
+
|
|
15
|
+
The above copyright notice and this permission notice
|
|
16
|
+
shall be included in all copies or substantial portions
|
|
17
|
+
of the Software.
|
|
18
|
+
|
|
19
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF
|
|
20
|
+
ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
|
|
21
|
+
TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
|
|
22
|
+
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
|
|
23
|
+
SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
|
|
24
|
+
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
|
|
25
|
+
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
|
|
26
|
+
IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
|
|
27
|
+
DEALINGS IN THE SOFTWARE.
|