RubyGems - smarter_csv - Versions diffs - 1.15.1 → 1.15.2 - Mend

smarter_csv 1.15.1 → 1.15.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: df37543c55dff7b37543c32787704664b6b4b6c187b7d9d69f02bb7472bfc85e
-  data.tar.gz: 4cd09212aa83588e8dd533b3ef1ed1b742b35a8a63e24f963760890646c17116
+  metadata.gz: 41a8d63c5aea4500d77b4268079521194f0d2d34de2b3e5f2264c48181159273
+  data.tar.gz: 586facc801af166270eebf0ece90949061ccfeaadfa3e7837678cb935e032bcb
 SHA512:
-  metadata.gz: 4010ed4d675e979512c632a0173f8f4e660e707a8f2677489132c3e1e65d1e63199a314a03379e3ef3cf6157c8821b2880ec4ba83119cdcf5551fb9d7d7fdbff
-  data.tar.gz: adb848ec9d97796ff85331dae23cdb8fe121ba42ee12fa1ebc9056cddfe09ba9015c89d85237fbb4065d1525544a405877e8e2bbb6f8f661b886746ba0532e57
+  metadata.gz: ed4072e64c4e66fb5b982dfaffe49d32370b087aa9a1ff689c2f73bfa6450ae275547bb17818ff227e8843834bcb981a8a906b5e7936bbf999f497e89b2cb91d
+  data.tar.gz: 31ecb71b2b50e1bb5f2aa037583550eb878f2e1faf66adf0803c8dcdeafbd52b0fa24c3b78bcc9bcdc3a3c759b53667004541257c32799d08b944a4ed53d9b49

data/CHANGELOG.md CHANGED

@@ -1,6 +1,16 @@
 # SmarterCSV 1.x Change Log
+## 1.15.2 (2026-02-20)
+* Performance Optimizations
+ - 1.6× to 7.2× faster than CSV.read
+ - 6× to 113× faster than Ruby’s CSV.table
+ - 5.4× to 37.4× faster than SmarterCSV 1.14.4 (with C-acceleration)
+ - 1.4× to 9.5× faster than SmarterCSV 1.14.4 (without C-acceleration, pure Ruby path)
+ [More details here](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e) and [here](https://github.com/tilo/smarter_csv/pull/319)
 ## 1.15.1 (2026-02-17)
 ### Bug Fix

data/README.md CHANGED

@@ -25,13 +25,19 @@ SmarterCSV is designed for **real-world CSV processing**, returning fully usable
 For a fair comparison, `CSV.table` is the closest Ruby CSV equivalent to SmarterCSV.
-| Comparison           | Speedup (P90)    |
-|----------------------|------------------|
-| vs SmarterCSV 1.14.4 | ~5× faster       |
-| vs CSV.table         | ~7× faster       |
-| vs CSV hashes        | ~3× faster       |
+| Comparison                               | Range                |
+|------------------------------------------|----------------------|
+| vs SmarterCSV 1.14.4 (with acceleration) | 5.4× to 37.4× faster |
+| vs SmarterCSV 1.14.4 (pure Ruby)         | 1.4× to 9.5× faster  |
+| vs CSV.read  (arrays of arrays)          | 1.6× to 7.2× faster  |
+| vs CSV.table (arrays of hashes)          | 6× to 113× faster    |
+| vs ZSV (arrays of hashes)                | 1.4× to 6.3× faster  |
-_Benchmarks: Ruby 3.4.7, M1 Apple Silicon. Memory: 39% less allocated, 43% fewer objects. See [CHANGELOG](./CHANGELOG.md) for details._
+ [More details here](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e) and [here](https://github.com/tilo/smarter_csv/pull/319)
+SmarterCSV also wins 14 of 16 benchmark files head-to-head against ZSV+wrapper (SIMD-accelerated C parser with Ruby wrapper to produce equivalent hash output).
+_Benchmarks: 16 CSV files (43k–80k rows), Ruby 3.4.7, Apple M1. Memory: 39% less allocated, 43% fewer objects. See [CHANGELOG](./CHANGELOG.md) and [PR #319](https://github.com/tilo/smarter_csv/pull/319) for details._
 ## Examples

data/docs/basic_read_api.md CHANGED

@@ -29,18 +29,26 @@ Learn more about this [in this section](docs/examples/row_col_sep.md).
 The simplified call to read CSV files is:
       ```
-         array_of_hashes = SmarterCSV.process(file_or_input, options, &block)
+         array_of_hashes = SmarterCSV.process(file_or_input, options)
       ```
-It can also be used with a block:
+It can also be used with a block. The block always receives an array of hashes and an optional chunk index:
       ```
-         SmarterCSV.process(file_or_input, options, &block) do |hash|
-            # process one row of CSV
+         SmarterCSV.process(file_or_input, options) do |array_of_hashes|
+           # without chunk_size, each yield contains a one-element array (one row)
          end
       ```
-It can also be used for processing batches of rows. An optional second block parameter provides the 0-based chunk index:
+or
+      ```
+         SmarterCSV.process(file_or_input, options) do |array_of_hashes, chunk_index|
+            # the chunk_index can be used to track chunks for parallel processing
+         end
+      ```
+When processing batches of rows, use the `chunk_size` option. The block receives an array of up to `chunk_size` hashes per yield:
       ```
          SmarterCSV.process(file_or_input, {chunk_size: 100}) do |array_of_hashes, chunk_index|
@@ -59,11 +67,11 @@ The simplified API works in most cases, but if you need access to the internal s
         puts reader.raw_headers
       ```
-It cal also be used with a block:
+It can also be used with a block. The block always receives an array of hashes and an optional chunk index:
-      ```
+      ```
         reader = SmarterCSV::Reader.new(file_or_input, options)
-        data = reader.process do
+        data = reader.process do |array_of_hashes, chunk_index|
            # do something here
         end
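
The block contract described in this documentation change (the block always receives an array of hashes plus an optional chunk index) can be sketched without the gem. This is only an illustration: `yield_chunks` is a hypothetical stand-in built on `Enumerable#each_slice`, not SmarterCSV's implementation, and the sample rows are made up.

```ruby
# Hypothetical stand-in for the documented block contract.
# It does NOT call the smarter_csv gem; it only mimics the yield shape.
def yield_chunks(rows, chunk_size: nil)
  # Without chunk_size, fall back to slices of one row each.
  rows.each_slice(chunk_size || 1).with_index do |chunk, chunk_index|
    yield chunk, chunk_index
  end
end

rows = [{ id: 1 }, { id: 2 }, { id: 3 }]

# No chunk_size: every yield is a one-element array of hashes.
single = []
yield_chunks(rows) { |array_of_hashes| single << array_of_hashes }

# chunk_size: 2: every yield holds up to two hashes, with a 0-based index.
batched = []
yield_chunks(rows, chunk_size: 2) do |array_of_hashes, chunk_index|
  batched << [chunk_index, array_of_hashes.size]
end
```

A block that declares only one parameter simply ignores the chunk index, which is why both block forms shown in the documentation can coexist.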

data/ext/smarter_csv/extconf.rb CHANGED

@@ -12,6 +12,8 @@ end
 optflags = "-O3 -flto -fomit-frame-pointer -DNDEBUG".dup
 optflags << " -march=native" unless RUBY_PLATFORM.start_with?("arm64-darwin")
+append_cflags('-Wno-compound-token-split-by-macro')
 CONFIG["optflags"] = optflags
 CONFIG["debugflags"] = ""

data/lib/smarter_csv/parser.rb CHANGED

@@ -41,8 +41,9 @@ module SmarterCSV
           [elements, elements.size]
           # :nocov:
         else
-          backslash_options = options.merge(quote_escaping: :backslash)
-          parse_csv_line_ruby(line, backslash_options, header_size, has_quotes)
+          # Optimization #4: cache merged options hashes for :auto mode
+          @backslash_options ||= options.merge(quote_escaping: :backslash)
+          parse_csv_line_ruby(line, @backslash_options, header_size, has_quotes)
         end
       rescue MalformedCSV
         # Backslash interpretation failed — fall back to RFC 4180
@@ -52,8 +53,9 @@ module SmarterCSV
           [elements, elements.size]
           # :nocov:
         else
-          rfc_options = options.merge(quote_escaping: :double_quotes)
-          parse_csv_line_ruby(line, rfc_options, header_size, has_quotes)
+          # Optimization #4: cache merged options hashes for :auto mode
+          @rfc_options ||= options.merge(quote_escaping: :double_quotes)
+          parse_csv_line_ruby(line, @rfc_options, header_size, has_quotes)
         end
       end
     end
@@ -80,25 +82,29 @@ module SmarterCSV
         # Try backslash-escape interpretation first
         if options[:acceleration] && has_acceleration
           # :nocov:
-          backslash_options = options.merge(quote_escaping: :backslash)
-          parse_line_to_hash_c(line, headers, backslash_options)
+          # Optimization #4: cache merged options hashes for :auto mode
+          @backslash_options ||= options.merge(quote_escaping: :backslash)
+          parse_line_to_hash_c(line, headers, @backslash_options)
           # :nocov:
         else
           has_quotes = line.include?(options[:quote_char])
-          backslash_options = options.merge(quote_escaping: :backslash)
-          parse_line_to_hash_ruby(line, headers, backslash_options, has_quotes)
+          # Optimization #4: cache merged options hashes for :auto mode
+          @backslash_options ||= options.merge(quote_escaping: :backslash)
+          parse_line_to_hash_ruby(line, headers, @backslash_options, has_quotes)
         end
       rescue MalformedCSV
         # Backslash interpretation failed — fall back to RFC 4180
         if options[:acceleration] && has_acceleration
           # :nocov:
-          rfc_options = options.merge(quote_escaping: :double_quotes)
-          parse_line_to_hash_c(line, headers, rfc_options)
+          # Optimization #4: cache merged options hashes for :auto mode
+          @rfc_options ||= options.merge(quote_escaping: :double_quotes)
+          parse_line_to_hash_c(line, headers, @rfc_options)
           # :nocov:
         else
           has_quotes = line.include?(options[:quote_char])
-          rfc_options = options.merge(quote_escaping: :double_quotes)
-          parse_line_to_hash_ruby(line, headers, rfc_options, has_quotes)
+          # Optimization #4: cache merged options hashes for :auto mode
+          @rfc_options ||= options.merge(quote_escaping: :double_quotes)
+          parse_line_to_hash_ruby(line, headers, @rfc_options, has_quotes)
         end
       end
     end
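
The "Optimization #4" hunks above replace a per-line `options.merge(...)` with a memoized instance variable. A minimal sketch of that pattern follows, assuming the options hash stays the same for the lifetime of the object. `AutoModeOptions` and `merge_calls` are hypothetical names used only to make the effect observable; they are not part of the gem.

```ruby
# Hypothetical illustration of memoizing merged option hashes with ||=.
class AutoModeOptions
  attr_reader :merge_calls

  def initialize(options)
    @options = options
    @merge_calls = 0
  end

  # Built on first use, then reused for every subsequent line.
  def backslash_options
    @backslash_options ||= build(:backslash)
  end

  def rfc_options
    @rfc_options ||= build(:double_quotes)
  end

  private

  def build(mode)
    @merge_calls += 1
    @options.merge(quote_escaping: mode)
  end
end

cache = AutoModeOptions.new({ col_sep: ",", quote_escaping: :auto })
3.times do
  cache.backslash_options
  cache.rfc_options
end
first = cache.backslash_options
second = cache.backslash_options
```

One consequence of `||=` is that the cached hash reflects whatever options were seen on the first call; if a different options hash were passed later, the cached copy would not pick it up.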
@@ -113,9 +119,16 @@ module SmarterCSV
       # Parse the line into values
       elements, data_size = parse_csv_line_ruby(line, options, nil, has_quotes)
-      # Check if all values are blank
-      if options[:remove_empty_hashes] && (elements.empty? || elements.all? { |v| v.nil? || v.to_s.strip.empty? })
-        return [nil, data_size]
+      # Optimization #6: elements are always String or nil from parse_csv_line_ruby,
+      # so .to_s is unnecessary. If strip_whitespace is on, fields are already
+      # stripped, so .strip is also redundant — just check .empty?.
+      if options[:remove_empty_hashes]
+        all_blank = if options[:strip_whitespace]
+                      elements.empty? || elements.all? { |v| v.nil? || v.empty? }
+                    else
+                      elements.empty? || elements.all? { |v| v.nil? || v.strip.empty? }
+                    end
+        return [nil, data_size] if all_blank
       end
       # Build the hash - only include keys for values that exist
@@ -161,11 +174,33 @@ module SmarterCSV
     #
     # Our convention is that empty fields are returned as empty strings, not as nil.
-    def parse_csv_line_ruby(line, options, header_size = nil, _has_quotes = false)
+    def parse_csv_line_ruby(line, options, header_size = nil, has_quotes = false)
       return [[], 0] if line.nil?
-      line_size = line.size
       col_sep = options[:col_sep]
+      strip = options[:strip_whitespace]
+      # Ensure has_quotes is set correctly (callers via parse/parse_line_to_hash
+      # always pass this, but direct callers may not)
+      has_quotes = line.include?(options[:quote_char]) unless has_quotes
+      # Optimization #7: when line has no quotes, use String#split (C-implemented)
+      # to bypass the entire character-by-character loop.
+      # Note: String#split(" ") has special whitespace-collapsing behavior in Ruby,
+      # so we must use a literal string pattern only for non-space separators,
+      # or fall through to the character loop for space separators.
+      unless has_quotes || col_sep == ' '
+        if header_size && header_size <= 0
+          return [[], 0]
+        end
+        elements = line.split(col_sep, -1) # -1 preserves trailing empty fields
+        elements = elements[0, header_size] if header_size
+        elements.map!(&:strip) if strip
+        return [elements, elements.size]
+      end
+      # Quoted-line path: character-by-character parsing required
+      line_size = line.size
       col_sep_size = col_sep.size
       quote = options[:quote_char]
       elements = []
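
The comments in the hunk above mention two `String#split` details: the `-1` limit and the special handling of a single-space pattern. The sketch below shows both with plain Ruby. `split_unquoted` is a hypothetical helper that mirrors only the fast path's shape, not the gem's method.

```ruby
# Hypothetical helper mirroring the unquoted fast path.
def split_unquoted(line, col_sep, header_size = nil)
  # A " " pattern makes String#split skip leading whitespace and collapse
  # runs of whitespace, so it cannot be used as a literal separator here.
  raise ArgumentError, "space separator needs the character loop" if col_sep == " "

  fields = line.split(col_sep, -1) # -1 keeps trailing empty fields
  header_size ? fields[0, header_size] : fields
end

default_limit = "a,b,,".split(",")             # trailing empty fields dropped
keep_trailing = split_unquoted("a,b,,", ",")   # trailing empty fields kept
awk_style     = "a  b ".split(" ")             # whitespace runs collapsed
truncated     = split_unquoted("a,b,c,d", ",", 2)
```

Here `default_limit` is `["a", "b"]` while `keep_trailing` is `["a", "b", "", ""]`, which is why the limit matters for rows that end in empty columns.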
@@ -176,27 +211,58 @@ module SmarterCSV
       in_quotes = false
       allow_escaped_quotes = options[:quote_escaping] == :backslash
-      while i < line_size
-        # Check if the current position matches the column separator and we're not inside quotes
-        if line[i...i+col_sep_size] == col_sep && !in_quotes
-          break if !header_size.nil? && elements.size >= header_size
+      # Optimization #1: for the common single-char separator, use direct
+      # character comparison instead of allocating a substring via line[i...i+n].
+      if col_sep_size == 1
+        while i < line_size
+          if line[i] == col_sep && !in_quotes
+            break if !header_size.nil? && elements.size >= header_size
-          elements << cleanup_quotes(line[start...i], quote)
-          i += col_sep_size
-          start = i
-          backslash_count = 0
-        else
-          if allow_escaped_quotes && line[i] == '\\'
-            backslash_count += 1
+            field = line[start...i]
+            field = cleanup_quotes(field, quote)
+            elements << (strip ? field.strip : field)
+            i += 1
+            start = i
+            backslash_count = 0
           else
-            if line[i] == quote
-              if !allow_escaped_quotes || backslash_count % 2 == 0
-                in_quotes = !in_quotes
+            if allow_escaped_quotes && line[i] == '\\'
+              backslash_count += 1
+            else
+              if line[i] == quote
+                if !allow_escaped_quotes || backslash_count % 2 == 0
+                  in_quotes = !in_quotes
+                end
               end
+              backslash_count = 0
             end
+            i += 1
+          end
+        end
+      else
+        # Multi-char col_sep: use substring comparison (original path)
+        while i < line_size
+          if line[i...i+col_sep_size] == col_sep && !in_quotes
+            break if !header_size.nil? && elements.size >= header_size
+            field = line[start...i]
+            field = cleanup_quotes(field, quote)
+            elements << (strip ? field.strip : field)
+            i += col_sep_size
+            start = i
             backslash_count = 0
+          else
+            if allow_escaped_quotes && line[i] == '\\'
+              backslash_count += 1
+            else
+              if line[i] == quote
+                if !allow_escaped_quotes || backslash_count % 2 == 0
+                  in_quotes = !in_quotes
+                end
+              end
+              backslash_count = 0
+            end
+            i += 1
           end
-          i += 1
         end
       end
@@ -209,10 +275,11 @@ module SmarterCSV
       # Process the remaining field
       if header_size.nil? || elements.size < header_size
-        elements << cleanup_quotes(line[start..-1], quote)
+        field = line[start..-1]
+        field = cleanup_quotes(field, quote)
+        elements << (strip ? field.strip : field)
       end
-      elements.map!(&:strip) if options[:strip_whitespace]
       [elements, elements.size]
     end
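
For lines that do contain quotes, the diff keeps a character-by-character loop and adds a branch for single-character separators. A simplified sketch of that loop is shown below. It assumes a one-character separator and RFC 4180 style quoting only; backslash escaping, the `header_size` limit, quote cleanup and whitespace stripping are deliberately left out. `split_quote_aware` is a hypothetical name.

```ruby
# Simplified, hypothetical version of the single-char separator loop.
# Quotes are left in place (the real code passes fields through cleanup_quotes).
def split_quote_aware(line, col_sep = ",", quote = '"')
  fields = []
  start = 0
  in_quotes = false
  i = 0
  while i < line.size
    ch = line[i]
    if ch == col_sep && !in_quotes
      fields << line[start...i] # separator outside quotes ends the field
      start = i + 1
    elsif ch == quote
      in_quotes = !in_quotes    # toggle on every quote character
    end
    i += 1
  end
  fields << line[start..-1]     # remaining field after the last separator
  fields
end

quoted   = split_quote_aware('a,"b,c",d')
trailing = split_quote_aware('a,')
```

The separator inside the quoted field does not split it, so `quoted` has three fields rather than four.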

data/lib/smarter_csv/reader.rb CHANGED

@@ -102,8 +102,10 @@ module SmarterCSV
               if detect_multiline(line, options)
                 raise MalformedCSV, "Unclosed quoted field detected in multiline data"
               else
+                # :nocov:
                 # Quotes are balanced; proceed without raising an error.
                 break
+                # :nocov:
               end
             end
             next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
@@ -188,9 +190,9 @@ module SmarterCSV
       end
       # Fallback to Ruby implementation
-      count = 0
       if quote_escaping == :backslash
+        # Backslash mode: must walk character-by-character to track escape state
+        count = 0
         escaped = false
         line.each_char do |char|
@@ -203,14 +205,12 @@ module SmarterCSV
             escaped = false
           end
         end
+        count
       else
-        # :double_quotes mode — backslash has no special meaning
-        line.each_char do |char|
-          count += 1 if char == quote_char
-        end
+        # Optimization #3: double_quotes mode — use String#count (single C call,
+        # no per-character String allocation)
+        line.count(quote_char)
       end
-      count
     end
     # Returns [escaped_count, rfc_count] for :auto mode dual counting.
@@ -223,13 +223,21 @@ module SmarterCSV
         return SmarterCSV::Parser.count_quote_chars_auto_c(line, quote_char, col_sep)
       end
-      rfc_count = 0
+      # Optimization #3: rfc_count uses String#count (single C call)
+      rfc_count = line.count(quote_char)
+      # Optimization #9: if no backslashes in line, escaped_count == rfc_count
+      # (no escaping possible), skip the character-by-character walk entirely.
+      unless line.include?('\\')
+        return [rfc_count, rfc_count]
+      end
+      # escaped_count needs character-by-character walk for backslash tracking
       escaped_count = 0
       escaped = false
       line.each_char do |char|
         if char == quote_char
-          rfc_count += 1
           escaped_count += 1 unless escaped
           escaped = false
         elsif char == '\\'
@@ -246,7 +254,10 @@ module SmarterCSV
     # Determine if a line has unbalanced quotes requiring multiline stitching.
     # For :auto mode, uses dual counting to avoid false multiline detection.
+    # Optimization #8: skip quote counting entirely when line has no quote chars.
     def detect_multiline(line, options)
+      return false unless line.include?(options[:quote_char])
       if options[:quote_escaping] == :auto
         escaped_count, rfc_count = count_quote_chars_auto(line, options[:quote_char], options[:col_sep])
         # If backslash-aware count is even → line is self-contained either way
@@ -265,10 +276,11 @@ module SmarterCSV
     # and in the future we might also include UTF-8 space characters: https://www.compart.com/en/unicode/category/Zs
     BLANK_RE = /\A\s*\z/.freeze
+    # Optimization #5: fast-path empty string and nil checks before regex
     def blank?(value)
       case value
       when String
-        BLANK_RE.match?(value)
+        value.empty? || BLANK_RE.match?(value)
       when NilClass
         true
       when Array
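
The quote-counting changes in this file (Optimizations #3 and #9) can be sketched as follows. This is a standalone approximation rather than the gem's code: `count_quotes_auto` is a hypothetical name, and the backslash branch of the loop is an assumption, because the hunk above is cut off at that point.

```ruby
# Hypothetical sketch of dual quote counting for :auto mode.
# Returns [escaped_count, rfc_count].
def count_quotes_auto(line, quote = '"')
  # String#count treats its argument as a character set; for a single
  # ordinary character such as '"' it counts occurrences of that character.
  rfc_count = line.count(quote)

  # Without any backslash, no quote can be escaped, so both counts agree.
  return [rfc_count, rfc_count] unless line.include?("\\")

  escaped_count = 0
  escaped = false
  line.each_char do |char|
    if char == quote
      escaped_count += 1 unless escaped
      escaped = false
    elsif char == "\\"
      escaped = !escaped # assumption: a doubled backslash cancels the escape
    else
      escaped = false
    end
  end
  [escaped_count, rfc_count]
end

plain  = count_quotes_auto('a,"b",c')      # no backslash: shortcut path
mixed  = count_quotes_auto('a,"b\"c",d')   # one backslash-escaped quote
```

In `mixed`, the RFC count sees three quote characters while the backslash-aware count sees two, which is the kind of disagreement the dual counting in `detect_multiline` is meant to resolve.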

data/lib/smarter_csv/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module SmarterCSV
-  VERSION = "1.15.1"
+  VERSION = "1.15.2"
 end

metadata CHANGED

@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.15.1
+  version: 1.15.2
 platform: ruby
 authors:
 - Tilo Sloboda
 bindir: bin
 cert_chain: []
-date: 2026-02-17 00:00:00.000000000 Z
+date: 2026-02-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print