smarter_csv 1.15.1 → 1.15.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +10 -0
- data/README.md +12 -6
- data/docs/basic_read_api.md +16 -8
- data/ext/smarter_csv/extconf.rb +2 -0
- data/lib/smarter_csv/parser.rb +101 -34
- data/lib/smarter_csv/reader.rb +23 -11
- data/lib/smarter_csv/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 41a8d63c5aea4500d77b4268079521194f0d2d34de2b3e5f2264c48181159273
+data.tar.gz: 586facc801af166270eebf0ece90949061ccfeaadfa3e7837678cb935e032bcb
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: ed4072e64c4e66fb5b982dfaffe49d32370b087aa9a1ff689c2f73bfa6450ae275547bb17818ff227e8843834bcb981a8a906b5e7936bbf999f497e89b2cb91d
+data.tar.gz: 31ecb71b2b50e1bb5f2aa037583550eb878f2e1faf66adf0803c8dcdeafbd52b0fa24c3b78bcc9bcdc3a3c759b53667004541257c32799d08b944a4ed53d9b49
data/CHANGELOG.md
CHANGED
@@ -1,6 +1,16 @@

 # SmarterCSV 1.x Change Log

+## 1.15.2 (2026-02-20)
+
+* Performance Optimizations
+- 1.6× to 7.2× faster than CSV.read
+- 6× to 113× faster than Ruby’s CSV.table
+- 5.4× to 37.4× faster than SmarterCSV 1.14.4 (with C-acceleration)
+- 1.4× to 9.5× faster than SmarterCSV 1.14.4 (without C-acceleration, pure Ruby path)
+
+[More details here](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e) and [here](https://github.com/tilo/smarter_csv/pull/319)
+
 ## 1.15.1 (2026-02-17)

 ### Bug Fix
data/README.md
CHANGED
@@ -25,13 +25,19 @@ SmarterCSV is designed for **real-world CSV processing**, returning fully usable

 For a fair comparison, `CSV.table` is the closest Ruby CSV equivalent to SmarterCSV.

-| Comparison
-
-| vs SmarterCSV 1.14.4 |
-| vs
-| vs CSV
+| Comparison | Range |
+|------------------------------------------|----------------------|
+| vs SmarterCSV 1.14.4 (with acceleration) | 5.4× to 37.4x faster |
+| vs SmarterCSV 1.14.4 (pure Ruby) | 1.4× to 9.5× faster |
+| vs CSV.read (arrays of arrays) | 1.6x to 7.2x faster |
+| vs CSV.table (arrays of hashes) | 6× to 113× faster |
+| vs ZSV (arrays of hashes) | 1.4× to 6.3× faster |

-
+[More details here](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e) and [here](https://github.com/tilo/smarter_csv/pull/319)
+
+SmarterCSV also wins 14 of 16 benchmark files head-to-head against ZSV+wrapper (SIMD-accelerated C parser with Ruby wrapper to produce equivalent hash output).
+
+_Benchmarks: 16 CSV files (43k–80k rows), Ruby 3.4.7, Apple M1. Memory: 39% less allocated, 43% fewer objects. See [CHANGELOG](./CHANGELOG.md) and [PR #319](https://github.com/tilo/smarter_csv/pull/319) for details._

 ## Examples

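The README hunk above distinguishes "arrays of arrays" (`CSV.read`) from "arrays of hashes" (`CSV.table`). A minimal sketch of that difference, using only Ruby's `csv` library on an in-memory string: the sample data is made up, and `CSV.parse` is used here with the options that `CSV.table` applies to files, so that no file is needed.

```ruby
require "csv"

data = "name,age\nAnn,30\nBob,41\n"

# Without options: an array of arrays of strings, header row included.
rows = CSV.parse(data)

# CSV.table is a file-reading shortcut for these three options; applied to
# CSV.parse they give the same shape for a string.
table = CSV.parse(data, headers: true, converters: :numeric, header_converters: :symbol)

# Each CSV::Row converts to a hash with symbol keys and numeric values.
hashes = table.map(&:to_h)
```

The second form does more work per row (header mapping, key conversion, value conversion), which is presumably why the README treats it as the fairer baseline.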
data/docs/basic_read_api.md
CHANGED
@@ -29,18 +29,26 @@ Learn more about this [in this section](docs/examples/row_col_sep.md).
 The simplified call to read CSV files is:

 ```
-array_of_hashes = SmarterCSV.process(file_or_input, options
+array_of_hashes = SmarterCSV.process(file_or_input, options)

 ```
-It can also be used with a block:
+It can also be used with a block. The block always receives an array of hashes and an optional chunk index:

 ```
-SmarterCSV.process(file_or_input, options
-
+SmarterCSV.process(file_or_input, options) do |array_of_hashes|
+# without chunk_size, each yield contains a one-element array (one row)
 end
 ```

-
+or
+
+```
+SmarterCSV.process(file_or_input, options) do |array_of_hashes, chunk_index|
+# the chunk_index can be used to track chunks for parallel processing
+end
+```
+
+When processing batches of rows, use the `chunk_size` option. The block receives an array of up to `chunk_size` hashes per yield:

 ```
 SmarterCSV.process(file_or_input, {chunk_size: 100}) do |array_of_hashes, chunk_index|
@@ -59,11 +67,11 @@ The simplified API works in most cases, but if you need access to the internal s

 puts reader.raw_headers
 ```
-It
+It can also be used with a block. The block always receives an array of hashes and an optional chunk index:

-```
+```
 reader = SmarterCSV::Reader.new(file_or_input, options)
-data = reader.process do
+data = reader.process do |array_of_hashes, chunk_index|
 # do something here
 end

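The yield contract documented above (an array of hashes plus an optional chunk index) can be mimicked in plain Ruby without the gem. In this sketch `each_chunk` is a hypothetical stand-in, not SmarterCSV's implementation, and the zero-based `chunk_index` is an assumption that the documentation shown here does not confirm.

```ruby
# Hypothetical stand-in for the documented block contract; it does not read
# CSV and is not how SmarterCSV is implemented.
def each_chunk(rows, chunk_size: nil)
  # Assumption from the docs: without chunk_size each yield carries a
  # one-element array; with chunk_size, up to chunk_size hashes per yield.
  rows.each_slice(chunk_size || 1).with_index do |array_of_hashes, chunk_index|
    yield array_of_hashes, chunk_index
  end
end

rows = [{ id: 1 }, { id: 2 }, { id: 3 }]

# A block with one parameter simply ignores the chunk index.
sizes = []
each_chunk(rows) { |array_of_hashes| sizes << array_of_hashes.size }

# A block with two parameters also receives the running index.
batches = []
each_chunk(rows, chunk_size: 2) do |array_of_hashes, chunk_index|
  batches << [chunk_index, array_of_hashes.map { |h| h[:id] }]
end
```

Because the block always receives an array, the same block body works with or without `chunk_size`; only the array length changes.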
data/ext/smarter_csv/extconf.rb
CHANGED
@@ -12,6 +12,8 @@ end
 optflags = "-O3 -flto -fomit-frame-pointer -DNDEBUG".dup
 optflags << " -march=native" unless RUBY_PLATFORM.start_with?("arm64-darwin")

+append_cflags('-Wno-compound-token-split-by-macro')
+
 CONFIG["optflags"] = optflags
 CONFIG["debugflags"] = ""

data/lib/smarter_csv/parser.rb
CHANGED
@@ -41,8 +41,9 @@ module SmarterCSV
 [elements, elements.size]
 # :nocov:
 else
-
-
+# Optimization #4: cache merged options hashes for :auto mode
+@backslash_options ||= options.merge(quote_escaping: :backslash)
+parse_csv_line_ruby(line, @backslash_options, header_size, has_quotes)
 end
 rescue MalformedCSV
 # Backslash interpretation failed — fall back to RFC 4180
@@ -52,8 +53,9 @@ module SmarterCSV
 [elements, elements.size]
 # :nocov:
 else
-
-
+# Optimization #4: cache merged options hashes for :auto mode
+@rfc_options ||= options.merge(quote_escaping: :double_quotes)
+parse_csv_line_ruby(line, @rfc_options, header_size, has_quotes)
 end
 end
 end
@@ -80,25 +82,29 @@ module SmarterCSV
 # Try backslash-escape interpretation first
 if options[:acceleration] && has_acceleration
 # :nocov:
-
-
+# Optimization #4: cache merged options hashes for :auto mode
+@backslash_options ||= options.merge(quote_escaping: :backslash)
+parse_line_to_hash_c(line, headers, @backslash_options)
 # :nocov:
 else
 has_quotes = line.include?(options[:quote_char])
-
-
+# Optimization #4: cache merged options hashes for :auto mode
+@backslash_options ||= options.merge(quote_escaping: :backslash)
+parse_line_to_hash_ruby(line, headers, @backslash_options, has_quotes)
 end
 rescue MalformedCSV
 # Backslash interpretation failed — fall back to RFC 4180
 if options[:acceleration] && has_acceleration
 # :nocov:
-
-
+# Optimization #4: cache merged options hashes for :auto mode
+@rfc_options ||= options.merge(quote_escaping: :double_quotes)
+parse_line_to_hash_c(line, headers, @rfc_options)
 # :nocov:
 else
 has_quotes = line.include?(options[:quote_char])
-
-
+# Optimization #4: cache merged options hashes for :auto mode
+@rfc_options ||= options.merge(quote_escaping: :double_quotes)
+parse_line_to_hash_ruby(line, headers, @rfc_options, has_quotes)
 end
 end
 end
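"Optimization #4" in the hunks above replaces a per-line `options.merge(...)` with a memoized instance variable. A minimal sketch of why that saves allocations follows; the `OptionCache` class and its method names are hypothetical and exist only to isolate the `||=` plus `Hash#merge` pattern.

```ruby
# Hypothetical class for illustration; only the `||=` + Hash#merge pattern
# mirrors the diff.
class OptionCache
  def initialize(options)
    @options = options
  end

  # Hash#merge returns a new Hash, so this allocates on every call.
  def merged_each_time
    @options.merge(quote_escaping: :backslash)
  end

  # `||=` runs the merge once and then returns the same Hash object.
  def merged_once
    @backslash_options ||= @options.merge(quote_escaping: :backslash)
  end
end

cache = OptionCache.new({ col_sep: ",", quote_escaping: :auto })
same_object  = cache.merged_once.equal?(cache.merged_once)
fresh_object = !cache.merged_each_time.equal?(cache.merged_each_time)
```

The trade-off is that a memoized hash does not notice later changes to the source options, so the pattern is only safe if the options stay fixed for the lifetime of the object holding the cache (assumed here, not verified against the gem).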
@@ -113,9 +119,16 @@ module SmarterCSV
 # Parse the line into values
 elements, data_size = parse_csv_line_ruby(line, options, nil, has_quotes)

-#
-
-
+# Optimization #6: elements are always String or nil from parse_csv_line_ruby,
+# so .to_s is unnecessary. If strip_whitespace is on, fields are already
+# stripped, so .strip is also redundant — just check .empty?.
+if options[:remove_empty_hashes]
+all_blank = if options[:strip_whitespace]
+elements.empty? || elements.all? { |v| v.nil? || v.empty? }
+else
+elements.empty? || elements.all? { |v| v.nil? || v.strip.empty? }
+end
+return [nil, data_size] if all_blank
 end

 # Build the hash - only include keys for values that exist
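The blank-row check added by "Optimization #6" can be isolated as a small predicate. This is a sketch under the assumption stated in the diff's own comment (fields are `String` or `nil`, and are already stripped when `strip_whitespace` is on); `all_blank?` is a hypothetical helper name.

```ruby
# Hypothetical helper mirroring the check in the hunk above.
def all_blank?(elements, strip_whitespace:)
  return true if elements.empty?

  if strip_whitespace
    # Fields were stripped upstream, so empty? is sufficient.
    elements.all? { |v| v.nil? || v.empty? }
  else
    # Fields may still carry whitespace, so strip before testing.
    elements.all? { |v| v.nil? || v.strip.empty? }
  end
end

stripped_blank   = all_blank?(["", nil], strip_whitespace: true)
unstripped_blank = all_blank?([" ", ""], strip_whitespace: false)
precondition_gap = all_blank?([" ", ""], strip_whitespace: true)
```

The last call returns `false`: the fast branch is only correct when the "already stripped" precondition actually holds, which is why the diff keeps the `.strip` branch for the other case.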
@@ -161,11 +174,33 @@ module SmarterCSV
 #
 # Our convention is that empty fields are returned as empty strings, not as nil.

-def parse_csv_line_ruby(line, options, header_size = nil,
+def parse_csv_line_ruby(line, options, header_size = nil, has_quotes = false)
 return [[], 0] if line.nil?

-line_size = line.size
 col_sep = options[:col_sep]
+strip = options[:strip_whitespace]
+
+# Ensure has_quotes is set correctly (callers via parse/parse_line_to_hash
+# always pass this, but direct callers may not)
+has_quotes = line.include?(options[:quote_char]) unless has_quotes
+
+# Optimization #7: when line has no quotes, use String#split (C-implemented)
+# to bypass the entire character-by-character loop.
+# Note: String#split(" ") has special whitespace-collapsing behavior in Ruby,
+# so we must use a literal string pattern only for non-space separators,
+# or fall through to the character loop for space separators.
+unless has_quotes || col_sep == ' '
+if header_size && header_size <= 0
+return [[], 0]
+end
+elements = line.split(col_sep, -1) # -1 preserves trailing empty fields
+elements = elements[0, header_size] if header_size
+elements.map!(&:strip) if strip
+return [elements, elements.size]
+end
+
+# Quoted-line path: character-by-character parsing required
+line_size = line.size
 col_sep_size = col_sep.size
 quote = options[:quote_char]
 elements = []
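The two `String#split` details that "Optimization #7" depends on (the `-1` limit and the special case for a single-space pattern) are standard Ruby behavior and can be checked directly. The sample strings below are made up for illustration.

```ruby
line = "a,b,,"

# Default split drops trailing empty fields, which would lose CSV columns.
default_split = line.split(",")

# A negative limit keeps trailing empty fields.
preserve_split = line.split(",", -1)

spaced = "a  b"

# A single-space string pattern collapses runs of whitespace (awk-style).
awk_split = spaced.split(" ")

# A regexp matching one space splits on every occurrence instead.
regex_split = spaced.split(/ /)
```

This is why the fast path in the hunk above is skipped when `col_sep == ' '`: a space-separated line with an empty field between two spaces would silently lose that field.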
@@ -176,27 +211,58 @@ module SmarterCSV
 in_quotes = false
 allow_escaped_quotes = options[:quote_escaping] == :backslash

-
-
-
-
+# Optimization #1: for the common single-char separator, use direct
+# character comparison instead of allocating a substring via line[i...i+n].
+if col_sep_size == 1
+while i < line_size
+if line[i] == col_sep && !in_quotes
+break if !header_size.nil? && elements.size >= header_size

-
-
-
-
-
-
-backslash_count += 1
+field = line[start...i]
+field = cleanup_quotes(field, quote)
+elements << (strip ? field.strip : field)
+i += 1
+start = i
+backslash_count = 0
 else
-if line[i] ==
-
-
+if allow_escaped_quotes && line[i] == '\\'
+backslash_count += 1
+else
+if line[i] == quote
+if !allow_escaped_quotes || backslash_count % 2 == 0
+in_quotes = !in_quotes
+end
 end
+backslash_count = 0
 end
+i += 1
+end
+end
+else
+# Multi-char col_sep: use substring comparison (original path)
+while i < line_size
+if line[i...i+col_sep_size] == col_sep && !in_quotes
+break if !header_size.nil? && elements.size >= header_size
+
+field = line[start...i]
+field = cleanup_quotes(field, quote)
+elements << (strip ? field.strip : field)
+i += col_sep_size
+start = i
 backslash_count = 0
+else
+if allow_escaped_quotes && line[i] == '\\'
+backslash_count += 1
+else
+if line[i] == quote
+if !allow_escaped_quotes || backslash_count % 2 == 0
+in_quotes = !in_quotes
+end
+end
+backslash_count = 0
+end
+i += 1
 end
-i += 1
 end
 end

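The core of the quoted-line loop above is a quote toggle: a separator only ends a field while `in_quotes` is false. A deliberately reduced sketch of that idea follows. `split_quoted` is a hypothetical function that handles only a single-character separator and RFC 4180 style quoting; it omits the backslash parity check, `header_size`, `cleanup_quotes` and whitespace stripping that the real method performs.

```ruby
# Hypothetical, simplified splitter: single-character separator, no
# backslash escapes, quotes are left in the returned fields.
def split_quoted(line, col_sep: ",", quote: '"')
  fields = []
  start = 0
  in_quotes = false

  line.each_char.with_index do |char, i|
    if char == quote
      # Entering or leaving a quoted section.
      in_quotes = !in_quotes
    elsif char == col_sep && !in_quotes
      # A separator outside quotes closes the current field.
      fields << line[start...i]
      start = i + 1
    end
  end

  # The remaining text after the last separator is the final field.
  fields << line[start..-1]
  fields
end

quoted   = split_quoted('a,"b,c",d')
trailing = split_quoted("x,")
```

In the diff, backslash mode adds one more condition: a quote only toggles `in_quotes` when the number of immediately preceding backslashes is even.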
@@ -209,10 +275,11 @@ module SmarterCSV

 # Process the remaining field
 if header_size.nil? || elements.size < header_size
-
+field = line[start..-1]
+field = cleanup_quotes(field, quote)
+elements << (strip ? field.strip : field)
 end

-elements.map!(&:strip) if options[:strip_whitespace]
 [elements, elements.size]
 end

data/lib/smarter_csv/reader.rb
CHANGED
@@ -102,8 +102,10 @@ module SmarterCSV
 if detect_multiline(line, options)
 raise MalformedCSV, "Unclosed quoted field detected in multiline data"
 else
+# :nocov:
 # Quotes are balanced; proceed without raising an error.
 break
+# :nocov:
 end
 end
 next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
@@ -188,9 +190,9 @@ module SmarterCSV
 end

 # Fallback to Ruby implementation
-count = 0
-
 if quote_escaping == :backslash
+# Backslash mode: must walk character-by-character to track escape state
+count = 0
 escaped = false

 line.each_char do |char|
@@ -203,14 +205,12 @@ module SmarterCSV
 escaped = false
 end
 end
+count
 else
-# :double_quotes mode —
-
-
-end
+# Optimization #3: double_quotes mode — use String#count (single C call,
+# no per-character String allocation)
+line.count(quote_char)
 end
-
-count
 end

 # Returns [escaped_count, rfc_count] for :auto mode dual counting.
@@ -223,13 +223,21 @@ module SmarterCSV
 return SmarterCSV::Parser.count_quote_chars_auto_c(line, quote_char, col_sep)
 end

-rfc_count
+# Optimization #3: rfc_count uses String#count (single C call)
+rfc_count = line.count(quote_char)
+
+# Optimization #9: if no backslashes in line, escaped_count == rfc_count
+# (no escaping possible), skip the character-by-character walk entirely.
+unless line.include?('\\')
+return [rfc_count, rfc_count]
+end
+
+# escaped_count needs character-by-character walk for backslash tracking
 escaped_count = 0
 escaped = false

 line.each_char do |char|
 if char == quote_char
-rfc_count += 1
 escaped_count += 1 unless escaped
 escaped = false
 elsif char == '\\'
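The dual counting in the hunk above (an RFC 4180 count via `String#count`, plus a backslash-aware count that is skipped when the line has no backslash) can be sketched as a standalone function. `count_quotes_auto` is a hypothetical name, and the handling of the backslash branch is an assumption, because that part of the method lies outside the lines shown in the diff.

```ruby
# Hypothetical sketch of dual quote counting; returns
# [escaped_count, rfc_count] like the method documented in the diff.
def count_quotes_auto(line, quote_char = '"')
  # Every quote character counts in RFC 4180 mode.
  rfc_count = line.count(quote_char)

  # Without any backslash nothing can be escaped, so both counts agree.
  return [rfc_count, rfc_count] unless line.include?('\\')

  escaped_count = 0
  escaped = false
  line.each_char do |char|
    if char == quote_char
      escaped_count += 1 unless escaped
      escaped = false
    elsif char == '\\'
      # Assumption: a backslash toggles the escape state, so a doubled
      # backslash does not escape the character after it.
      escaped = !escaped
    else
      escaped = false
    end
  end
  [escaped_count, rfc_count]
end

plain   = count_quotes_auto('a,"b",c')
escaped = count_quotes_auto('a,"b\\"c",d')
```

One caveat when reusing the pattern: `String#count` interprets its argument as a set of characters, so it matches the per-character loop only when `quote_char` is a single character.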
@@ -246,7 +254,10 @@ module SmarterCSV

 # Determine if a line has unbalanced quotes requiring multiline stitching.
 # For :auto mode, uses dual counting to avoid false multiline detection.
+# Optimization #8: skip quote counting entirely when line has no quote chars.
 def detect_multiline(line, options)
+return false unless line.include?(options[:quote_char])
+
 if options[:quote_escaping] == :auto
 escaped_count, rfc_count = count_quote_chars_auto(line, options[:quote_char], options[:col_sep])
 # If backslash-aware count is even → line is self-contained either way
@@ -265,10 +276,11 @@ module SmarterCSV
 # and in the future we might also include UTF-8 space characters: https://www.compart.com/en/unicode/category/Zs
 BLANK_RE = /\A\s*\z/.freeze

+# Optimization #5: fast-path empty string and nil checks before regex
 def blank?(value)
 case value
 when String
-BLANK_RE.match?(value)
+value.empty? || BLANK_RE.match?(value)
 when NilClass
 true
 when Array
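"Optimization #5" puts a cheap `empty?` test in front of the regex so that the most common blank value, the empty string, never reaches the regex engine. A minimal sketch for the `String` branch only: `blank_string?` is a hypothetical helper, and `BLANK_RE` is copied from the diff.

```ruby
BLANK_RE = /\A\s*\z/.freeze

# Hypothetical helper covering only the String branch of blank?.
def blank_string?(value)
  # `||` short-circuits: the regex runs only for non-empty strings.
  value.empty? || BLANK_RE.match?(value)
end

empty_result      = blank_string?("")
whitespace_result = blank_string?(" \t ")
content_result    = blank_string?(" a ")
```

The result is the same as with the regex alone, since `BLANK_RE` also matches the empty string; only the cost of the common case changes.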
data/lib/smarter_csv/version.rb
CHANGED
metadata
CHANGED
@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-version: 1.15.
+version: 1.15.2
 platform: ruby
 authors:
 - Tilo Sloboda
 bindir: bin
 cert_chain: []
-date: 2026-02-
+date: 2026-02-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: awesome_print