RubyGems - smarter_csv - Versions diffs - 1.15.0 → 1.15.1 - Mend

smarter_csv 1.15.0 → 1.15.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (24)

checksums.yaml +4 -4
data/CHANGELOG.md +15 -2
data/CONTRIBUTORS.md +1 -1
data/README.md +62 -1
data/docs/_introduction.md +2 -1
data/docs/basic_read_api.md +2 -1
data/docs/basic_write_api.md +1 -0
data/docs/batch_processing.md +2 -1
data/docs/data_transformations.md +1 -0
data/docs/examples.md +1 -0
data/docs/header_transformations.md +1 -0
data/docs/header_validations.md +1 -0
data/docs/options.md +7 -2
data/docs/parsing_strategy.md +99 -0
data/docs/row_col_sep.md +1 -0
data/docs/value_converters.md +1 -0
data/ext/smarter_csv/smarter_csv.c +399 -95
data/lib/smarter_csv/hash_transformations.rb +11 -5
data/lib/smarter_csv/options.rb +4 -0
data/lib/smarter_csv/parser.rb +86 -31
data/lib/smarter_csv/reader.rb +127 -43
data/lib/smarter_csv/version.rb +1 -1
data/smarter_csv.gemspec +2 -1
metadata +4 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 40ec747f330628f6aebd66fd7478349007cd7d2f049ae5bddbbc3d67cc5f07be
-  data.tar.gz: 3d02147aee5983e9fabcd05aed3d0e4ac9da3399d7a231fa2905a5c1f061b9b3
+  metadata.gz: df37543c55dff7b37543c32787704664b6b4b6c187b7d9d69f02bb7472bfc85e
+  data.tar.gz: 4cd09212aa83588e8dd533b3ef1ed1b742b35a8a63e24f963760890646c17116
 SHA512:
-  metadata.gz: cc825395fc200eca00ff37fb2a8d07d5e05a28c0b9fe2307f5ee5d8cbb7a6286d15856fbdb8b85904e2e685fa29a73a097ed4281068ad0f92833d35a3254fde3
-  data.tar.gz: 4f3050d6c33535d4b12e6737c5241708c87408bfb9c0cb320fb89470867bb338cd3359f6bf8ba4cdaa8d6566c1884f23436285ed50d61aef0f4356c772f83fa6
+  metadata.gz: 4010ed4d675e979512c632a0173f8f4e660e707a8f2677489132c3e1e65d1e63199a314a03379e3ef3cf6157c8821b2880ec4ba83119cdcf5551fb9d7d7fdbff
+  data.tar.gz: adb848ec9d97796ff85331dae23cdb8fe121ba42ee12fa1ebc9056cddfe09ba9015c89d85237fbb4065d1525544a405877e8e2bbb6f8f661b886746ba0532e57
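
To check digests like the ones above against a local copy, something along these lines may help. It is a minimal sketch that assumes the members `metadata.gz` and `data.tar.gz` have already been extracted from the `.gem` archive into a directory; `verify_checksums` is a hypothetical helper name, not part of RubyGems or SmarterCSV.

```ruby
require "digest"
require "yaml"

# Hypothetical helper (assumption, not a RubyGems API): compares the digests
# of already-extracted gem members in `dir` with a checksums.yaml document.
# Returns one [algorithm, file name, matches?] triple per listed entry.
def verify_checksums(checksums_yaml, dir)
  algorithms = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }
  checksums = YAML.safe_load(checksums_yaml)

  checksums.flat_map do |algorithm, files|
    files.map do |name, expected|
      actual = algorithms.fetch(algorithm).file(File.join(dir, name)).hexdigest
      [algorithm, name, actual == expected.to_s]
    end
  end
end
```

Any triple whose last element is `false` indicates that the local file does not match the published digest.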

data/CHANGELOG.md CHANGED

@@ -1,6 +1,19 @@
 # SmarterCSV 1.x Change Log
+## 1.15.1 (2026-02-17)
+### Bug Fix
+ * **Fix for quoted fields ending with backslash** ([issue #316](https://github.com/tilo/smarter_csv/issues/316), [issue #252](https://github.com/tilo/smarter_csv/issues/252)): Since v1.8.5, SmarterCSV unconditionally treated `\"` as an escaped quote, which caused `MalformedCSV` or `EOFError` for CSV files containing literal backslashes in quoted fields (e.g. Windows paths like `"C:\Users\"`).
+### New Option
+ * **New option `quote_escaping`**: Controls how quotes are escaped inside quoted fields. Default: `:auto`. See [Parsing Strategy](docs/parsing_strategy.md) for details.
+   - `:auto` (default): Tries backslash-escape interpretation first, falls back to RFC 4180 if parsing fails. This handles both conventions automatically without breaking existing data.
+   - `:double_quotes` (RFC 4180): Only doubled quotes (`""`) escape a quote character. Backslash is always literal.
+   - `:backslash` (MySQL/Unix): `\"` is treated as an escaped quote.
 ## 1.15.0 (2026-02-04)
 * Dropping support for Ruby 2.5
@@ -80,7 +93,7 @@ _P90 measured over the full set of benchmarked files_
 |---------------------------|--------|------|--------|--------|------------|
 | worldcities.csv           |   5 MB |  48K |  1.27s |  0.49s |  **2.6x**  |
 | LANDSAT_ETM_C2_L1_50k.csv |  31 MB |  50K |  6.73s |  1.99s |  **3.4x**  |
-| PILOT_CERT.csv            |  62 MB |  50K |  8.43s |  2.43s |  **3.5x**  |
+| PEOPLE_IMPORT.csv         |  62 MB |  50K |  8.43s |  2.43s |  **3.5x**  |
 | wide_500_cols_20k.csv     |  98 MB |  20K | 19.38s |  5.09s |  **3.8x**  |
 | long_fields_20k.csv       |  22 MB |  20K |  3.05s |  0.15s | **20.5x**  |
 | embedded_newlines_20k.csv | 1.5 MB |  20K |  0.59s |  0.12s |  **5.1x**  |
@@ -99,7 +112,7 @@ For this reason, **CSV.table is the closest equivalent to SmarterCSV.**
 |---------------------------|--------|------|------------|-----------|--------|-----------|------------|
 | worldcities.csv           |   5 MB |  48K |    1.06s   |   2.12s   |  0.49s | **2.2x**  |  **4.3x**  |
 | LANDSAT_ETM_C2_L1_50k.csv |  31 MB |  50K |    3.85s   |   9.25s   |  1.99s | **1.9x**  |  **4.7x**  |
-| PILOT_CERT.csv            |  62 MB |  50K |    9.10s   |  24.39s   |  2.43s | **3.8x**  | **10.1x**  |
+| PEOPLE_IMPORT.csv         |  62 MB |  50K |    9.10s   |  24.39s   |  2.43s | **3.8x**  | **10.1x**  |
 | wide_500_cols_20k.csv     |  98 MB |  20K |   34.24s   |  61.24s   |  5.09s | **6.7x**  | **12.0x**  |
 | long_fields_20k.csv       |  22 MB |  20K |    0.34s   |   0.81s   |  0.15s | **2.3x**  |  **5.5x**  |
 | whitespace_heavy_20k.csv  | 3.3 MB |  20K |    0.30s   |   0.83s   |  0.12s | **2.5x**  |  **7.0x**  |
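
The try-then-fall-back behaviour that the 1.15.1 changelog entry describes for `quote_escaping: :auto` can be illustrated with a small, self-contained sketch. This is not SmarterCSV's parser: `split_line`, `split_auto`, and `UnclosedQuote` are hypothetical names, the splitter handles a single line only, and it leaves the surrounding quotes in the returned fields.

```ruby
# Hypothetical error class for this sketch (SmarterCSV raises MalformedCSV).
class UnclosedQuote < StandardError; end

# Hypothetical single-line splitter. With backslash: true, a quote preceded
# by an odd run of backslashes neither opens nor closes a quoted field.
def split_line(line, backslash:, col_sep: ",", quote: '"')
  fields = ["".dup]
  in_quotes = false
  backslashes = 0 # length of the backslash run directly before `char`

  line.each_char do |char|
    escaped = backslash && backslashes.odd?
    if char == quote && !escaped
      in_quotes = !in_quotes
      fields.last.concat(char)
    elsif char == col_sep && !in_quotes
      fields.push("".dup)
    else
      fields.last.concat(char)
    end
    backslashes = char == "\\" ? backslashes + 1 : 0
  end

  raise UnclosedQuote, line if in_quotes
  fields
end

# Try the backslash-escape reading first, then fall back to the RFC 4180
# reading if that leaves a quoted field unclosed.
def split_auto(line)
  split_line(line, backslash: true)
rescue UnclosedQuote
  split_line(line, backslash: false)
end

split_auto('"C:\Users\Docs\",important')
# => ["\"C:\\Users\\Docs\\\"", "important"]
```

In this sketch the Windows path line fails under the backslash reading, because the final `\"` is treated as escaped, and then succeeds under the RFC 4180 reading, which mirrors the failure mode reported in issue #316.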

data/CONTRIBUTORS.md CHANGED

@@ -1,4 +1,4 @@
-# A Big Thank You to all the Contributors!!
+# A Big Thank You to all 59 Contributors!!
 A Big Thank you to everyone who filed issues, sent comments, and who contributed with pull requests:

data/README.md CHANGED

@@ -33,6 +33,66 @@ For a fair comparison, `CSV.table` is the closest Ruby CSV equivalent to Smarter
 _Benchmarks: Ruby 3.4.7, M1 Apple Silicon. Memory: 39% less allocated, 43% fewer objects. See [CHANGELOG](./CHANGELOG.md) for details._
+## Examples
+### Simple Example:
+SmarterCSV is designed for robustness — real-world CSV data often has inconsistent formatting, extra whitespace, and varied column separators. Its intelligent defaults automatically clean and normalize data, returning high-quality hashes ready for direct use with ActiveRecord, Sidekiq, or any data pipeline — no post-processing required. See [Parsing CSV Files in Ruby with SmarterCSV](https://tilo-sloboda.medium.com/parsing-csv-files-in-ruby-with-smartercsv-6ce66fb6cf38) for more background.
+```ruby
+$ cat spec/fixtures/sample.csv
+   First Name  , Last	 Name , Emoji , Posts
+José ,Corüazón, ❤️, 12
+Jürgen, Müller ,😐,3
+ Michael, May ,😞, 7
+$ irb
+>> require 'smarter_csv'
+=> true
+>> data = SmarterCSV.process('spec/fixtures/sample.csv')
+=> [{:first_name=>"José", :last_name=>"Corüazón", :emoji=>"❤️", :posts=>12},
+    {:first_name=>"Jürgen", :last_name=>"Müller", :emoji=>"😐", :posts=>3},
+    {:first_name=>"Michael", :last_name=>"May", :emoji=>"😞", :posts=>7}]
+```
+Notice how SmarterCSV automatically (all defaults):
+- Normalizes headers → `downcase_header: true`, `strings_as_keys: false`
+- Strips whitespace → `strip_whitespace: true`
+- Converts numbers → `convert_values_to_numeric: true`
+- Removes empty values → `remove_empty_values: true`
+- Preserves Unicode and emoji characters
+### Batch Processing:
+Processing large CSV files in chunks minimizes memory usage and enables powerful workflows:
+- **Database imports** — bulk insert records in batches for better performance
+- **Parallel processing** — distribute chunks across Sidekiq, Resque, or other background workers
+- **Progress tracking** — the optional `chunk_index` parameter enables progress reporting
+- **Memory efficiency** — only one chunk is held in memory at a time, regardless of file size
+The block receives a `chunk` (array of hashes) and an optional `chunk_index` (0-based sequence number):
+```ruby
+# Database bulk import
+SmarterCSV.process(filename, chunk_size: 100) do |chunk, chunk_index|
+  puts "Processing chunk #{chunk_index}..."
+  MyModel.insert_all(chunk)  # chunk is an array of hashes
+end
+# Parallel processing with Sidekiq
+SmarterCSV.process(filename, chunk_size: 100) do |chunk|
+  MyWorker.perform_async(chunk)  # each chunk processed in parallel
+end
+```
+See [Examples](docs/examples.md), [Batch Processing](docs/batch_processing.md), and [Configuration Options](docs/options.md) for more.
+## Requirements
+**Minimum Ruby Version:** >= 2.6
+**C Extension:** SmarterCSV includes a native C extension for accelerated CSV parsing.
+The C extension is automatically compiled on MRI Ruby. For JRuby and TruffleRuby, SmarterCSV falls back to a pure Ruby implementation.
 # Installation
 Add this line to your application's Gemfile:
@@ -51,6 +111,7 @@ Or install it yourself as:
 # Documentation
   * [Introduction](docs/_introduction.md)
+  * [Parsing Strategy](docs/parsing_strategy.md)
   * [The Basic Read API](docs/basic_read_api.md)
   * [The Basic Write API](docs/basic_write_api.md)
   * [Batch Processing](./docs/batch_processing.md)
@@ -80,7 +141,7 @@ For reporting issues, please:
   * open a pull-request adding a test that demonstrates the issue
   * mention your version of SmarterCSV, Ruby, Rails
-# [A Special Thanks to all Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
+# [A Special Thanks to all 59 Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
 # Contributing

data/docs/_introduction.md CHANGED

@@ -2,6 +2,7 @@
 ### Contents
   * [**Introduction**](./_introduction.md)
+  * [Parsing Strategy](./parsing_strategy.md)
   * [The Basic Read API](./basic_read_api.md)
   * [The Basic Write API](./basic_write_api.md)
   * [Batch Processing](././batch_processing.md)
@@ -54,4 +55,4 @@ The CSV processing also needed to be robust against variations in the input data
   (planned feature)
 ---------------
-PREVIOUS [README](../README.md) | NEXT: [The Basic Read API](./basic_read_api.md)
+PREVIOUS [README](../README.md) | NEXT: [Parsing Strategy](./parsing_strategy.md)

data/docs/basic_read_api.md CHANGED

@@ -2,6 +2,7 @@
 ### Contents
   * [Introduction](./_introduction.md)
+  * [Parsing Strategy](./parsing_strategy.md)
   * [**The Basic Read API**](./basic_read_api.md)
   * [The Basic Write API](./basic_write_api.md)
   * [Batch Processing](././batch_processing.md)
@@ -116,4 +117,4 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
 ```
 ----------------
-PREVIOUS: [Introduction](./_introduction.md) | NEXT: [The Basic Write API](./basic_write_api.md)
+PREVIOUS: [Parsing Strategy](./parsing_strategy.md) | NEXT: [The Basic Write API](./basic_write_api.md)

data/docs/basic_write_api.md CHANGED

@@ -2,6 +2,7 @@
 ### Contents
   * [Introduction](./_introduction.md)
+  * [Parsing Strategy](./parsing_strategy.md)
   * [The Basic Read API](./basic_read_api.md)
   * [**The Basic Write API**](./basic_write_api.md)
   * [Batch Processing](././batch_processing.md)

data/docs/batch_processing.md CHANGED

@@ -2,8 +2,9 @@
 ### Contents
   * [Introduction](./_introduction.md)
+  * [Parsing Strategy](./parsing_strategy.md)
   * [The Basic Read API](./basic_read_api.md)
-  * [The Basic Write API](./basic_write_api.md)
+  * [The Basic Write API](./basic_write_api.md)
   * [**Batch Processing**](././batch_processing.md)
   * [Configuration Options](./options.md)
   * [Row and Column Separators](./row_col_sep.md)

data/docs/data_transformations.md CHANGED

@@ -2,6 +2,7 @@
 ### Contents
   * [Introduction](./_introduction.md)
+  * [Parsing Strategy](./parsing_strategy.md)
   * [The Basic Read API](./basic_read_api.md)
   * [The Basic Write API](./basic_write_api.md)
   * [Batch Processing](././batch_processing.md)

data/docs/examples.md CHANGED

@@ -2,6 +2,7 @@
 ### Contents
   * [Introduction](./_introduction.md)
+  * [Parsing Strategy](./parsing_strategy.md)
   * [The Basic Read API](./basic_read_api.md)
   * [The Basic Write API](./basic_write_api.md)
   * [Batch Processing](././batch_processing.md)

data/docs/header_transformations.md CHANGED

@@ -2,6 +2,7 @@
 ### Contents
   * [Introduction](./_introduction.md)
+  * [Parsing Strategy](./parsing_strategy.md)
   * [The Basic Read API](./basic_read_api.md)
   * [The Basic Write API](./basic_write_api.md)
   * [Batch Processing](././batch_processing.md)

data/docs/header_validations.md CHANGED

@@ -2,6 +2,7 @@
 ### Contents
   * [Introduction](./_introduction.md)
+  * [Parsing Strategy](./parsing_strategy.md)
   * [The Basic Read API](./basic_read_api.md)
   * [The Basic Write API](./basic_write_api.md)
   * [Batch Processing](././batch_processing.md)

data/docs/options.md CHANGED

@@ -2,6 +2,7 @@
 ### Contents
   * [Introduction](./_introduction.md)
+  * [Parsing Strategy](./parsing_strategy.md)
   * [The Basic Read API](./basic_read_api.md)
   * [The Basic Write API](./basic_write_api.md)
   * [Batch Processing](././batch_processing.md)
@@ -11,8 +12,8 @@
   * [Header Validations](./header_validations.md)
   * [Data Transformations](./data_transformations.md)
   * [Value Converters](./value_converters.md)
---------------
+--------------
 # Configuration Options
@@ -56,6 +57,10 @@
      |                             |          | This can also be set to :auto, but will process the whole csv file first  (slow!)    |
      | :auto_row_sep_chars         |   500    | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
      | :quote_char                 |   '"'    | quotation character                                                                  |
+     | :quote_escaping             | :auto    | How quotes are escaped inside quoted fields. See [Parsing Strategy](./parsing_strategy.md). |
+     |                             |          | `:auto` (default): tries backslash-escape first, falls back to RFC 4180.             |
+     |                             |          | `:double_quotes` (RFC 4180): only `""` escapes a quote. Backslash is literal.        |
+     |                             |          | `:backslash` (MySQL/Unix): `\"` also escapes a quote.                                |
      ---------------------------------------------------------------------------------------------------------------------------------
      | :headers_in_file            |  true(1) | Whether or not the file contains headers as the first line.                          |
      |                             |          | (1): if `user_provided_headers` is given, the default is `false`,                    |

data/docs/parsing_strategy.md ADDED

@@ -0,0 +1,99 @@
+### Contents
+  * [Introduction](./_introduction.md)
+  * [**Parsing Strategy**](./parsing_strategy.md)
+  * [The Basic Read API](./basic_read_api.md)
+  * [The Basic Write API](./basic_write_api.md)
+  * [Batch Processing](././batch_processing.md)
+  * [Configuration Options](./options.md)
+  * [Row and Column Separators](./row_col_sep.md)
+  * [Header Transformations](./header_transformations.md)
+  * [Header Validations](./header_validations.md)
+  * [Data Transformations](./data_transformations.md)
+  * [Value Converters](./value_converters.md)
+--------------
+# Parsing Strategy
+In the real world, you rarely get to choose the quality of the CSV data you need to process. Files come from different systems, different export tools, different people — and they don't always follow the same rules. A header row might have extra whitespace, column separators vary, and quoting conventions differ from one source to the next.
+Beyond parsing, consuming CSV data in Ruby and Rails has its own requirements. Working with database records, Sidekiq jobs, or JSON APIs means you need each row as a hash — and with symbol keys rather than strings, because symbols are interned and reused in memory, while duplicate strings allocate new objects for every row. For large CSV files with millions of rows, reading the entire file into memory is not practical. Instead, the data needs to be processed in chunks, where each chunk is an array of hashes that can be bulk-inserted into a database, passed to a background job, or uploaded to S3 — enabling parallel processing without ever holding the full dataset in memory.
+SmarterCSV is designed around this reality. Rather than requiring you to know the exact format of your input upfront, it uses sensible defaults and auto-detection to handle the most common variations automatically. Column and row separators are auto-detected, headers are normalized, whitespace is stripped, and numeric values are converted. The output is an array of hashes with symbols as keys — ideal for direct consumption in Ruby and Rails. All of this works out of the box, without configuration.
+SmarterCSV auto-detects CSV column and row separators. The same philosophy extends to how quoted fields are parsed. The `quote_escaping: :auto` default means you don't need to know whether your CSV producer uses RFC 4180 doubled quotes or MySQL-style backslash escapes — SmarterCSV figures it out for you, row by row.
+The goal is simple: **make the common case work without options, and provide explicit options when you need control.**
+## Quote Escaping: The `quote_escaping` Option
+CSV files use quote characters (typically `"`) to wrap fields that contain special characters like the column separator or newlines. But there are two common conventions for how a literal quote character is represented *inside* a quoted field:
+| Convention | Example field value | How it appears in CSV |
+|---|---|---|
+| **RFC 4180** (doubled quotes) | `She said "hello"` | `"She said ""hello"""` |
+| **MySQL / Unix** (backslash escape) | `She said "hello"` | `"She said \"hello\""` |
+The `quote_escaping` option controls which convention SmarterCSV uses when parsing.
+## `:auto` (default)
+The `:auto` mode handles both conventions automatically. It tries backslash-escape interpretation first. If that produces a malformed result (unclosed quoted field), it falls back to RFC 4180 interpretation.
+This means both styles of CSV files work out of the box:
+```ruby
+# RFC 4180 style — works
+csv = %Q{name\n"She said ""hello"""}
+SmarterCSV.process(StringIO.new(csv))
+# => [{name: 'She said "hello"'}]
+# MySQL/Unix style — also works
+csv = %Q{name\n"She said \\"hello\\""}
+SmarterCSV.process(StringIO.new(csv))
+# => [{name: 'She said \\"hello\\"'}]
+```
+The `:auto` mode also correctly handles fields that end with a literal backslash (a common source of parsing errors, see [Issue #316](https://github.com/tilo/smarter_csv/issues/316)):
+```ruby
+# Field value is a Windows path ending in backslash
+csv = %Q{path,label\n"C:\\Users\\Docs\\",important}
+SmarterCSV.process(StringIO.new(csv))
+# => [{path: "C:\\Users\\Docs\\", label: "important"}]
+```
+### How `:auto` works internally
+1. **Multiline detection** uses dual counting: it computes both a backslash-aware quote count and an RFC (plain) quote count in a single pass. A line is only considered multiline if *both* counts are odd. This prevents false multiline stitching when a field simply ends with `\"`.
+2. **Parsing** tries the backslash-escape interpretation first. If the parser raises `MalformedCSV` (unclosed quote), it retries with RFC 4180 interpretation.
+3. The fallback is per-line, so different rows in the same file can use different conventions.
+## `:double_quotes`
+Strict RFC 4180 mode. Backslash has no special meaning — it is always a literal character. Only `""` (doubled quotes) inside a quoted field represents a single `"`.
+Use this when you know your data follows RFC 4180 and want to avoid the small overhead of the try/fallback logic.
+```ruby
+SmarterCSV.process("file.csv", quote_escaping: :double_quotes)
+```
+## `:backslash`
+MySQL / Unix mode. A backslash before a quote character (`\"`) is treated as an escaped quote — the quote does not close the field. An even number of backslashes before a quote (e.g. `\\"`) means the backslashes are literal and the quote closes normally.
+Use this when your data was exported from MySQL or another system that uses backslash escaping.
+```ruby
+SmarterCSV.process("file.csv", quote_escaping: :backslash)
+```
+**Note:** In `:backslash` mode, a field like `"abc\"` will raise `MalformedCSV` because the closing quote is escaped, leaving the field unclosed.
+--------------
+PREVIOUS: [Introduction](./_introduction.md) | NEXT: [The Basic Read API](./basic_read_api.md)
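
The dual counting described under "How `:auto` works internally" in the added file can be sketched as follows. This is an illustration under assumptions, not the gem's Ruby or C implementation: `quote_counts` and `continues_on_next_line?` are hypothetical names, and the sketch only looks at one physical line at a time.

```ruby
# Hypothetical helper: counts quote characters in a single pass, once
# ignoring backslashes (RFC 4180 view) and once skipping quotes that follow
# an odd run of backslashes (backslash-escape view).
def quote_counts(line, quote_char = '"')
  plain = 0
  backslash_aware = 0
  backslashes = 0 # length of the backslash run directly before the current char

  line.each_char do |char|
    if char == "\\"
      backslashes += 1
      next
    end
    if char == quote_char
      plain += 1
      backslash_aware += 1 if backslashes.even?
    end
    backslashes = 0
  end

  [plain, backslash_aware]
end

# A line is treated as continuing on the next line only if both counts are odd.
def continues_on_next_line?(line)
  plain, backslash_aware = quote_counts(line)
  plain.odd? && backslash_aware.odd?
end

quote_counts('"C:\Users\Docs\",important')            # => [2, 1]
continues_on_next_line?('"C:\Users\Docs\",important') # => false
continues_on_next_line?('"first line of a')           # => true
```

For the Windows path line, the backslash-aware count alone is odd, which would suggest an open field. Requiring the plain count to be odd as well is what avoids stitching that line to the next one.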

data/docs/row_col_sep.md CHANGED

@@ -2,6 +2,7 @@
 ### Contents
   * [Introduction](./_introduction.md)
+  * [Parsing Strategy](./parsing_strategy.md)
   * [The Basic Read API](./basic_read_api.md)
   * [The Basic Write API](./basic_write_api.md)
   * [Batch Processing](././batch_processing.md)

data/docs/value_converters.md CHANGED

@@ -2,6 +2,7 @@
 ### Contents
   * [Introduction](./_introduction.md)
+  * [Parsing Strategy](./parsing_strategy.md)
   * [The Basic Read API](./basic_read_api.md)
   * [The Basic Write API](./basic_write_api.md)
   * [Batch Processing](././batch_processing.md)