smarter_csv 1.15.1 → 1.15.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: df37543c55dff7b37543c32787704664b6b4b6c187b7d9d69f02bb7472bfc85e
4
- data.tar.gz: 4cd09212aa83588e8dd533b3ef1ed1b742b35a8a63e24f963760890646c17116
3
+ metadata.gz: 41a8d63c5aea4500d77b4268079521194f0d2d34de2b3e5f2264c48181159273
4
+ data.tar.gz: 586facc801af166270eebf0ece90949061ccfeaadfa3e7837678cb935e032bcb
5
5
  SHA512:
6
- metadata.gz: 4010ed4d675e979512c632a0173f8f4e660e707a8f2677489132c3e1e65d1e63199a314a03379e3ef3cf6157c8821b2880ec4ba83119cdcf5551fb9d7d7fdbff
7
- data.tar.gz: adb848ec9d97796ff85331dae23cdb8fe121ba42ee12fa1ebc9056cddfe09ba9015c89d85237fbb4065d1525544a405877e8e2bbb6f8f661b886746ba0532e57
6
+ metadata.gz: ed4072e64c4e66fb5b982dfaffe49d32370b087aa9a1ff689c2f73bfa6450ae275547bb17818ff227e8843834bcb981a8a906b5e7936bbf999f497e89b2cb91d
7
+ data.tar.gz: 31ecb71b2b50e1bb5f2aa037583550eb878f2e1faf66adf0803c8dcdeafbd52b0fa24c3b78bcc9bcdc3a3c759b53667004541257c32799d08b944a4ed53d9b49
data/CHANGELOG.md CHANGED
@@ -1,6 +1,16 @@
1
1
 
2
2
  # SmarterCSV 1.x Change Log
3
3
 
4
+ ## 1.15.2 (2026-02-20)
5
+
6
+ * Performance Optimizations
7
+ - 1.6× to 7.2× faster than CSV.read
8
+ - 6× to 113× faster than Ruby’s CSV.table
9
+ - 5.4× to 37.4× faster than SmarterCSV 1.14.4 (with C-acceleration)
10
+ - 1.4× to 9.5× faster than SmarterCSV 1.14.4 (without C-acceleration, pure Ruby path)
11
+
12
+ [More details here](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e) and [here](https://github.com/tilo/smarter_csv/pull/319)
13
+
4
14
  ## 1.15.1 (2026-02-17)
5
15
 
6
16
  ### Bug Fix
data/README.md CHANGED
@@ -25,13 +25,19 @@ SmarterCSV is designed for **real-world CSV processing**, returning fully usable
25
25
 
26
26
  For a fair comparison, `CSV.table` is the closest Ruby CSV equivalent to SmarterCSV.
27
27
 
28
- | Comparison | Speedup (P90) |
29
- |----------------------|------------------|
30
- | vs SmarterCSV 1.14.4 | ~5× faster |
31
- | vs CSV.table | ~7× faster |
32
- | vs CSV hashes | ~3× faster |
28
+ | Comparison | Range |
29
+ |------------------------------------------|----------------------|
30
+ | vs SmarterCSV 1.14.4 (with acceleration) | 5.4× to 37.4x faster |
31
+ | vs SmarterCSV 1.14.4 (pure Ruby) | 1.4× to 9.5× faster |
32
+ | vs CSV.read (arrays of arrays) | 1.6x to 7.2x faster |
33
+ | vs CSV.table (arrays of hashes) | 6× to 113× faster |
34
+ | vs ZSV (arrays of hashes) | 1.4× to 6.3× faster |
33
35
 
34
- _Benchmarks: Ruby 3.4.7, M1 Apple Silicon. Memory: 39% less allocated, 43% fewer objects. See [CHANGELOG](./CHANGELOG.md) for details._
36
+ [More details here](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e) and [here](https://github.com/tilo/smarter_csv/pull/319)
37
+
38
+ SmarterCSV also wins 14 of 16 benchmark files head-to-head against ZSV+wrapper (SIMD-accelerated C parser with Ruby wrapper to produce equivalent hash output).
39
+
40
+ _Benchmarks: 16 CSV files (43k–80k rows), Ruby 3.4.7, Apple M1. Memory: 39% less allocated, 43% fewer objects. See [CHANGELOG](./CHANGELOG.md) and [PR #319](https://github.com/tilo/smarter_csv/pull/319) for details._
35
41
 
36
42
  ## Examples
37
43
 
@@ -29,18 +29,26 @@ Learn more about this [in this section](docs/examples/row_col_sep.md).
29
29
  The simplified call to read CSV files is:
30
30
 
31
31
  ```
32
- array_of_hashes = SmarterCSV.process(file_or_input, options, &block)
32
+ array_of_hashes = SmarterCSV.process(file_or_input, options)
33
33
 
34
34
  ```
35
- It can also be used with a block:
35
+ It can also be used with a block. The block always receives an array of hashes and an optional chunk index:
36
36
 
37
37
  ```
38
- SmarterCSV.process(file_or_input, options, &block) do |hash|
39
- # process one row of CSV
38
+ SmarterCSV.process(file_or_input, options) do |array_of_hashes|
39
+ # without chunk_size, each yield contains a one-element array (one row)
40
40
  end
41
41
  ```
42
42
 
43
- It can also be used for processing batches of rows. An optional second block parameter provides the 0-based chunk index:
43
+ or
44
+
45
+ ```
46
+ SmarterCSV.process(file_or_input, options) do |array_of_hashes, chunk_index|
47
+ # the chunk_index can be used to track chunks for parallel processing
48
+ end
49
+ ```
50
+
51
+ When processing batches of rows, use the `chunk_size` option. The block receives an array of up to `chunk_size` hashes per yield:
44
52
 
45
53
  ```
46
54
  SmarterCSV.process(file_or_input, {chunk_size: 100}) do |array_of_hashes, chunk_index|
@@ -59,11 +67,11 @@ The simplified API works in most cases, but if you need access to the internal s
59
67
 
60
68
  puts reader.raw_headers
61
69
  ```
62
- It cal also be used with a block:
70
+ It can also be used with a block. The block always receives an array of hashes and an optional chunk index:
63
71
 
64
- ```
72
+ ```
65
73
  reader = SmarterCSV::Reader.new(file_or_input, options)
66
- data = reader.process do
74
+ data = reader.process do |array_of_hashes, chunk_index|
67
75
  # do something here
68
76
  end
69
77
 
@@ -12,6 +12,8 @@ end
12
12
  optflags = "-O3 -flto -fomit-frame-pointer -DNDEBUG".dup
13
13
  optflags << " -march=native" unless RUBY_PLATFORM.start_with?("arm64-darwin")
14
14
 
15
+ append_cflags('-Wno-compound-token-split-by-macro')
16
+
15
17
  CONFIG["optflags"] = optflags
16
18
  CONFIG["debugflags"] = ""
17
19
 
@@ -41,8 +41,9 @@ module SmarterCSV
41
41
  [elements, elements.size]
42
42
  # :nocov:
43
43
  else
44
- backslash_options = options.merge(quote_escaping: :backslash)
45
- parse_csv_line_ruby(line, backslash_options, header_size, has_quotes)
44
+ # Optimization #4: cache merged options hashes for :auto mode
45
+ @backslash_options ||= options.merge(quote_escaping: :backslash)
46
+ parse_csv_line_ruby(line, @backslash_options, header_size, has_quotes)
46
47
  end
47
48
  rescue MalformedCSV
48
49
  # Backslash interpretation failed — fall back to RFC 4180
@@ -52,8 +53,9 @@ module SmarterCSV
52
53
  [elements, elements.size]
53
54
  # :nocov:
54
55
  else
55
- rfc_options = options.merge(quote_escaping: :double_quotes)
56
- parse_csv_line_ruby(line, rfc_options, header_size, has_quotes)
56
+ # Optimization #4: cache merged options hashes for :auto mode
57
+ @rfc_options ||= options.merge(quote_escaping: :double_quotes)
58
+ parse_csv_line_ruby(line, @rfc_options, header_size, has_quotes)
57
59
  end
58
60
  end
59
61
  end
@@ -80,25 +82,29 @@ module SmarterCSV
80
82
  # Try backslash-escape interpretation first
81
83
  if options[:acceleration] && has_acceleration
82
84
  # :nocov:
83
- backslash_options = options.merge(quote_escaping: :backslash)
84
- parse_line_to_hash_c(line, headers, backslash_options)
85
+ # Optimization #4: cache merged options hashes for :auto mode
86
+ @backslash_options ||= options.merge(quote_escaping: :backslash)
87
+ parse_line_to_hash_c(line, headers, @backslash_options)
85
88
  # :nocov:
86
89
  else
87
90
  has_quotes = line.include?(options[:quote_char])
88
- backslash_options = options.merge(quote_escaping: :backslash)
89
- parse_line_to_hash_ruby(line, headers, backslash_options, has_quotes)
91
+ # Optimization #4: cache merged options hashes for :auto mode
92
+ @backslash_options ||= options.merge(quote_escaping: :backslash)
93
+ parse_line_to_hash_ruby(line, headers, @backslash_options, has_quotes)
90
94
  end
91
95
  rescue MalformedCSV
92
96
  # Backslash interpretation failed — fall back to RFC 4180
93
97
  if options[:acceleration] && has_acceleration
94
98
  # :nocov:
95
- rfc_options = options.merge(quote_escaping: :double_quotes)
96
- parse_line_to_hash_c(line, headers, rfc_options)
99
+ # Optimization #4: cache merged options hashes for :auto mode
100
+ @rfc_options ||= options.merge(quote_escaping: :double_quotes)
101
+ parse_line_to_hash_c(line, headers, @rfc_options)
97
102
  # :nocov:
98
103
  else
99
104
  has_quotes = line.include?(options[:quote_char])
100
- rfc_options = options.merge(quote_escaping: :double_quotes)
101
- parse_line_to_hash_ruby(line, headers, rfc_options, has_quotes)
105
+ # Optimization #4: cache merged options hashes for :auto mode
106
+ @rfc_options ||= options.merge(quote_escaping: :double_quotes)
107
+ parse_line_to_hash_ruby(line, headers, @rfc_options, has_quotes)
102
108
  end
103
109
  end
104
110
  end
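Optimization #4 in the hunks above replaces a per-line `options.merge(...)` with a memoized instance variable. `Hash#merge` returns a new hash on every call, so the old code allocated one hash per parsed line. The pattern can be sketched in isolation as follows; `OptionsCache` and its method names are hypothetical and are not part of SmarterCSV.

```ruby
# Hypothetical stand-in for the parser; only the caching pattern matters here.
class OptionsCache
  def initialize(options)
    @options = options
  end

  # Pre-change pattern: a fresh Hash is allocated on every call.
  def merged_uncached
    @options.merge(quote_escaping: :backslash)
  end

  # Post-change pattern: the merged Hash is built once, then reused.
  def merged_cached
    @backslash_options ||= @options.merge(quote_escaping: :backslash)
  end
end

cache = OptionsCache.new({ col_sep: ",", quote_char: '"', quote_escaping: :auto })
puts cache.merged_uncached.equal?(cache.merged_uncached) # false (two objects)
puts cache.merged_cached.equal?(cache.merged_cached)     # true  (same object)
```

The trade-off of `||=` is that the cached hash is never rebuilt, so the pattern presumably relies on `options` staying the same for the lifetime of the receiver.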
@@ -113,9 +119,16 @@ module SmarterCSV
113
119
  # Parse the line into values
114
120
  elements, data_size = parse_csv_line_ruby(line, options, nil, has_quotes)
115
121
 
116
- # Check if all values are blank
117
- if options[:remove_empty_hashes] && (elements.empty? || elements.all? { |v| v.nil? || v.to_s.strip.empty? })
118
- return [nil, data_size]
122
+ # Optimization #6: elements are always String or nil from parse_csv_line_ruby,
123
+ # so .to_s is unnecessary. If strip_whitespace is on, fields are already
124
+ # stripped, so .strip is also redundant — just check .empty?.
125
+ if options[:remove_empty_hashes]
126
+ all_blank = if options[:strip_whitespace]
127
+ elements.empty? || elements.all? { |v| v.nil? || v.empty? }
128
+ else
129
+ elements.empty? || elements.all? { |v| v.nil? || v.strip.empty? }
130
+ end
131
+ return [nil, data_size] if all_blank
119
132
  end
120
133
 
121
134
  # Build the hash - only include keys for values that exist
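The all-blank check introduced as Optimization #6 can be read on its own as the sketch below. `all_blank?` is a hypothetical helper name, and it assumes, as the diff comment states, that every element is either a `String` or `nil`.

```ruby
# Hypothetical helper mirroring the all-blank check from the hunk above.
# Assumes every element is a String or nil.
def all_blank?(elements, strip_whitespace:)
  if strip_whitespace
    # Fields are assumed to be stripped already, so .empty? is enough.
    elements.all? { |v| v.nil? || v.empty? }
  else
    # Fields may still carry surrounding whitespace.
    elements.all? { |v| v.nil? || v.strip.empty? }
  end
end

puts all_blank?(["", nil], strip_whitespace: true)     # true
puts all_blank?(["  ", nil], strip_whitespace: false)  # true
puts all_blank?(["a", ""], strip_whitespace: false)    # false
```

`Array#all?` already returns `true` for an empty array, so the explicit `elements.empty? ||` guard in the diff is harmless but not strictly needed. The `strip_whitespace` branch is only correct if the fields really were stripped upstream: an unstripped `"  "` would count as non-blank there.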
@@ -161,11 +174,33 @@ module SmarterCSV
161
174
  #
162
175
  # Our convention is that empty fields are returned as empty strings, not as nil.
163
176
 
164
- def parse_csv_line_ruby(line, options, header_size = nil, _has_quotes = false)
177
+ def parse_csv_line_ruby(line, options, header_size = nil, has_quotes = false)
165
178
  return [[], 0] if line.nil?
166
179
 
167
- line_size = line.size
168
180
  col_sep = options[:col_sep]
181
+ strip = options[:strip_whitespace]
182
+
183
+ # Ensure has_quotes is set correctly (callers via parse/parse_line_to_hash
184
+ # always pass this, but direct callers may not)
185
+ has_quotes = line.include?(options[:quote_char]) unless has_quotes
186
+
187
+ # Optimization #7: when line has no quotes, use String#split (C-implemented)
188
+ # to bypass the entire character-by-character loop.
189
+ # Note: String#split(" ") has special whitespace-collapsing behavior in Ruby,
190
+ # so we must use a literal string pattern only for non-space separators,
191
+ # or fall through to the character loop for space separators.
192
+ unless has_quotes || col_sep == ' '
193
+ if header_size && header_size <= 0
194
+ return [[], 0]
195
+ end
196
+ elements = line.split(col_sep, -1) # -1 preserves trailing empty fields
197
+ elements = elements[0, header_size] if header_size
198
+ elements.map!(&:strip) if strip
199
+ return [elements, elements.size]
200
+ end
201
+
202
+ # Quoted-line path: character-by-character parsing required
203
+ line_size = line.size
169
204
  col_sep_size = col_sep.size
170
205
  quote = options[:quote_char]
171
206
  elements = []
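The fast path added as Optimization #7 depends on two details of `String#split`: a negative limit keeps trailing empty fields, and a single-space pattern switches to whitespace-collapsing behavior. A minimal sketch of the unquoted path in isolation, with `split_unquoted` as a hypothetical helper name:

```ruby
# Hypothetical helper: the unquoted fast path from the hunk above, in isolation.
# Callers are assumed to have checked that the line contains no quote
# characters and that col_sep is not a single space.
def split_unquoted(line, col_sep, header_size = nil, strip: false)
  return [[], 0] if header_size && header_size <= 0

  elements = line.split(col_sep, -1)                  # -1 keeps trailing empty fields
  elements = elements[0, header_size] if header_size  # truncate to the header width
  elements.map!(&:strip) if strip
  [elements, elements.size]
end

p split_unquoted("a,b,,", ",")                   # [["a", "b", "", ""], 4]
p split_unquoted("a, b ,c", ",", 2, strip: true) # [["a", "b"], 2]

# Why the limit and the space guard matter:
p "a,b,,".split(",")  # ["a", "b"]  trailing empty fields are dropped without -1
p "a  b".split(" ")   # ["a", "b"]  a " " pattern collapses runs of whitespace
```

With a single-space separator, two adjacent spaces should produce an empty field between them, which `split(" ")` would silently lose. That is why the diff falls through to the character loop in that case.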
@@ -176,27 +211,58 @@ module SmarterCSV
176
211
  in_quotes = false
177
212
  allow_escaped_quotes = options[:quote_escaping] == :backslash
178
213
 
179
- while i < line_size
180
- # Check if the current position matches the column separator and we're not inside quotes
181
- if line[i...i+col_sep_size] == col_sep && !in_quotes
182
- break if !header_size.nil? && elements.size >= header_size
214
+ # Optimization #1: for the common single-char separator, use direct
215
+ # character comparison instead of allocating a substring via line[i...i+n].
216
+ if col_sep_size == 1
217
+ while i < line_size
218
+ if line[i] == col_sep && !in_quotes
219
+ break if !header_size.nil? && elements.size >= header_size
183
220
 
184
- elements << cleanup_quotes(line[start...i], quote)
185
- i += col_sep_size
186
- start = i
187
- backslash_count = 0
188
- else
189
- if allow_escaped_quotes && line[i] == '\\'
190
- backslash_count += 1
221
+ field = line[start...i]
222
+ field = cleanup_quotes(field, quote)
223
+ elements << (strip ? field.strip : field)
224
+ i += 1
225
+ start = i
226
+ backslash_count = 0
191
227
  else
192
- if line[i] == quote
193
- if !allow_escaped_quotes || backslash_count % 2 == 0
194
- in_quotes = !in_quotes
228
+ if allow_escaped_quotes && line[i] == '\\'
229
+ backslash_count += 1
230
+ else
231
+ if line[i] == quote
232
+ if !allow_escaped_quotes || backslash_count % 2 == 0
233
+ in_quotes = !in_quotes
234
+ end
195
235
  end
236
+ backslash_count = 0
196
237
  end
238
+ i += 1
239
+ end
240
+ end
241
+ else
242
+ # Multi-char col_sep: use substring comparison (original path)
243
+ while i < line_size
244
+ if line[i...i+col_sep_size] == col_sep && !in_quotes
245
+ break if !header_size.nil? && elements.size >= header_size
246
+
247
+ field = line[start...i]
248
+ field = cleanup_quotes(field, quote)
249
+ elements << (strip ? field.strip : field)
250
+ i += col_sep_size
251
+ start = i
197
252
  backslash_count = 0
253
+ else
254
+ if allow_escaped_quotes && line[i] == '\\'
255
+ backslash_count += 1
256
+ else
257
+ if line[i] == quote
258
+ if !allow_escaped_quotes || backslash_count % 2 == 0
259
+ in_quotes = !in_quotes
260
+ end
261
+ end
262
+ backslash_count = 0
263
+ end
264
+ i += 1
198
265
  end
199
- i += 1
200
266
  end
201
267
  end
202
268
 
@@ -209,10 +275,11 @@ module SmarterCSV
209
275
 
210
276
  # Process the remaining field
211
277
  if header_size.nil? || elements.size < header_size
212
- elements << cleanup_quotes(line[start..-1], quote)
278
+ field = line[start..-1]
279
+ field = cleanup_quotes(field, quote)
280
+ elements << (strip ? field.strip : field)
213
281
  end
214
282
 
215
- elements.map!(&:strip) if options[:strip_whitespace]
216
283
  [elements, elements.size]
217
284
  end
218
285
 
@@ -102,8 +102,10 @@ module SmarterCSV
102
102
  if detect_multiline(line, options)
103
103
  raise MalformedCSV, "Unclosed quoted field detected in multiline data"
104
104
  else
105
+ # :nocov:
105
106
  # Quotes are balanced; proceed without raising an error.
106
107
  break
108
+ # :nocov:
107
109
  end
108
110
  end
109
111
  next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
@@ -188,9 +190,9 @@ module SmarterCSV
188
190
  end
189
191
 
190
192
  # Fallback to Ruby implementation
191
- count = 0
192
-
193
193
  if quote_escaping == :backslash
194
+ # Backslash mode: must walk character-by-character to track escape state
195
+ count = 0
194
196
  escaped = false
195
197
 
196
198
  line.each_char do |char|
@@ -203,14 +205,12 @@ module SmarterCSV
203
205
  escaped = false
204
206
  end
205
207
  end
208
+ count
206
209
  else
207
- # :double_quotes mode — backslash has no special meaning
208
- line.each_char do |char|
209
- count += 1 if char == quote_char
210
- end
210
+ # Optimization #3: double_quotes mode — use String#count (single C call,
211
+ # no per-character String allocation)
212
+ line.count(quote_char)
211
213
  end
212
-
213
- count
214
214
  end
215
215
 
216
216
  # Returns [escaped_count, rfc_count] for :auto mode dual counting.
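Optimization #3 swaps a per-character Ruby loop for a single `String#count` call. For an ordinary single-character quote the two approaches give the same result, as the sketch below illustrates; both helper names are hypothetical.

```ruby
# Hypothetical helpers; both count occurrences of one quote character.
def count_quotes_each_char(line, quote_char)
  count = 0
  line.each_char { |char| count += 1 if char == quote_char } # one String per char
  count
end

def count_quotes_builtin(line, quote_char)
  line.count(quote_char) # single call into the C implementation
end

line = %q{a,"b ""c""",d}
puts count_quotes_each_char(line, '"') # 6
puts count_quotes_builtin(line, '"')   # 6
```

`String#count` interprets its argument as a character selector, in which `^`, `-`, and `\` can have special meaning. The equivalence is therefore only claimed here for plain single-character quotes such as `"` or `'`.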
@@ -223,13 +223,21 @@ module SmarterCSV
223
223
  return SmarterCSV::Parser.count_quote_chars_auto_c(line, quote_char, col_sep)
224
224
  end
225
225
 
226
- rfc_count = 0
226
+ # Optimization #3: rfc_count uses String#count (single C call)
227
+ rfc_count = line.count(quote_char)
228
+
229
+ # Optimization #9: if no backslashes in line, escaped_count == rfc_count
230
+ # (no escaping possible), skip the character-by-character walk entirely.
231
+ unless line.include?('\\')
232
+ return [rfc_count, rfc_count]
233
+ end
234
+
235
+ # escaped_count needs character-by-character walk for backslash tracking
227
236
  escaped_count = 0
228
237
  escaped = false
229
238
 
230
239
  line.each_char do |char|
231
240
  if char == quote_char
232
- rfc_count += 1
233
241
  escaped_count += 1 unless escaped
234
242
  escaped = false
235
243
  elsif char == '\\'
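The shortcut in Optimization #9 rests on a simple observation: a quote can only be escaped by a preceding backslash, so a line without backslashes has identical backslash-aware and RFC 4180 counts. The sketch below shows the idea end to end. The hunk cuts off after `elsif char == '\\'`, so the toggle and reset branches here are an assumption rather than the library's code, and `count_quotes_auto` is a hypothetical name.

```ruby
# Sketch only: the branches after `elsif char == '\\'` are not visible in the
# diff, so the toggle/reset logic below is an assumption.
# Returns [escaped_count, rfc_count].
def count_quotes_auto(line, quote_char)
  rfc_count = line.count(quote_char)

  # No backslash means no quote can be escaped: both counts are equal.
  return [rfc_count, rfc_count] unless line.include?('\\')

  escaped_count = 0
  escaped = false
  line.each_char do |char|
    if char == quote_char
      escaped_count += 1 unless escaped
      escaped = false
    elsif char == '\\'
      escaped = !escaped # assumed: a doubled backslash cancels itself out
    else
      escaped = false    # assumed: any other character ends the escape
    end
  end
  [escaped_count, rfc_count]
end

p count_quotes_auto('a,"b",c', '"')      # [2, 2]
p count_quotes_auto('a,"b \" c",d', '"') # [2, 3]
```

In the second call the backslash-escaped quote is ignored by the backslash-aware count but still counted under RFC 4180 rules, which is the disagreement that `:auto` mode has to resolve.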
@@ -246,7 +254,10 @@ module SmarterCSV
246
254
 
247
255
  # Determine if a line has unbalanced quotes requiring multiline stitching.
248
256
  # For :auto mode, uses dual counting to avoid false multiline detection.
257
+ # Optimization #8: skip quote counting entirely when line has no quote chars.
249
258
  def detect_multiline(line, options)
259
+ return false unless line.include?(options[:quote_char])
260
+
250
261
  if options[:quote_escaping] == :auto
251
262
  escaped_count, rfc_count = count_quote_chars_auto(line, options[:quote_char], options[:col_sep])
252
263
  # If backslash-aware count is even → line is self-contained either way
@@ -265,10 +276,11 @@ module SmarterCSV
265
276
  # and in the future we might also include UTF-8 space characters: https://www.compart.com/en/unicode/category/Zs
266
277
  BLANK_RE = /\A\s*\z/.freeze
267
278
 
279
+ # Optimization #5: fast-path empty string and nil checks before regex
268
280
  def blank?(value)
269
281
  case value
270
282
  when String
271
- BLANK_RE.match?(value)
283
+ value.empty? || BLANK_RE.match?(value)
272
284
  when NilClass
273
285
  true
274
286
  when Array
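Optimization #5 puts a cheap `empty?` check in front of the regex match, which avoids invoking the regex engine for the common case of an empty field. A minimal sketch of the `String` branch only, with `blank_string?` as a hypothetical helper and `BLANK_PATTERN` as a local copy of the pattern shown in the diff:

```ruby
# Local copy of the pattern from the diff, renamed so it is not mistaken
# for the library constant.
BLANK_PATTERN = /\A\s*\z/.freeze

# Hypothetical helper covering only the String branch of blank?.
def blank_string?(value)
  value.empty? || BLANK_PATTERN.match?(value) # fast path first, regex second
end

puts blank_string?("")      # true (fast path, regex never runs)
puts blank_string?("  \t")  # true (regex path)
puts blank_string?(" a ")   # false
```

The result is unchanged because the pattern already matches the empty string. The `empty?` guard only changes how quickly that answer is reached.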
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module SmarterCSV
4
- VERSION = "1.15.1"
4
+ VERSION = "1.15.2"
5
5
  end
metadata CHANGED
@@ -1,13 +1,13 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: smarter_csv
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.15.1
4
+ version: 1.15.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Tilo Sloboda
8
8
  bindir: bin
9
9
  cert_chain: []
10
- date: 2026-02-17 00:00:00.000000000 Z
10
+ date: 2026-02-20 00:00:00.000000000 Z
11
11
  dependencies:
12
12
  - !ruby/object:Gem::Dependency
13
13
  name: awesome_print