smarter_csv 1.15.0 → 1.15.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 40ec747f330628f6aebd66fd7478349007cd7d2f049ae5bddbbc3d67cc5f07be
4
- data.tar.gz: 3d02147aee5983e9fabcd05aed3d0e4ac9da3399d7a231fa2905a5c1f061b9b3
3
+ metadata.gz: df37543c55dff7b37543c32787704664b6b4b6c187b7d9d69f02bb7472bfc85e
4
+ data.tar.gz: 4cd09212aa83588e8dd533b3ef1ed1b742b35a8a63e24f963760890646c17116
5
5
  SHA512:
6
- metadata.gz: cc825395fc200eca00ff37fb2a8d07d5e05a28c0b9fe2307f5ee5d8cbb7a6286d15856fbdb8b85904e2e685fa29a73a097ed4281068ad0f92833d35a3254fde3
7
- data.tar.gz: 4f3050d6c33535d4b12e6737c5241708c87408bfb9c0cb320fb89470867bb338cd3359f6bf8ba4cdaa8d6566c1884f23436285ed50d61aef0f4356c772f83fa6
6
+ metadata.gz: 4010ed4d675e979512c632a0173f8f4e660e707a8f2677489132c3e1e65d1e63199a314a03379e3ef3cf6157c8821b2880ec4ba83119cdcf5551fb9d7d7fdbff
7
+ data.tar.gz: adb848ec9d97796ff85331dae23cdb8fe121ba42ee12fa1ebc9056cddfe09ba9015c89d85237fbb4065d1525544a405877e8e2bbb6f8f661b886746ba0532e57
data/CHANGELOG.md CHANGED
@@ -1,6 +1,19 @@
1
1
 
2
2
  # SmarterCSV 1.x Change Log
3
3
 
4
+ ## 1.15.1 (2026-02-17)
5
+
6
+ ### Bug Fix
7
+
8
+ * **Fix for quoted fields ending with backslash** ([issue #316](https://github.com/tilo/smarter_csv/issues/316), [issue #252](https://github.com/tilo/smarter_csv/issues/252)): Since v1.8.5, SmarterCSV unconditionally treated `\"` as an escaped quote, which caused `MalformedCSV` or `EOFError` for CSV files containing literal backslashes in quoted fields (e.g. Windows paths like `"C:\Users\"`).
9
+
10
+ ### New Option
11
+
12
+ * **New option `quote_escaping`**: Controls how quotes are escaped inside quoted fields. Default: `:auto`. See [Parsing Strategy](docs/parsing_strategy.md) for details.
13
+ - `:auto` (default): Tries backslash-escape interpretation first, falls back to RFC 4180 if parsing fails. This handles both conventions automatically without breaking existing data.
14
+ - `:double_quotes` (RFC 4180): Only doubled quotes (`""`) escape a quote character. Backslash is always literal.
15
+ - `:backslash` (MySQL/Unix): `\"` is treated as an escaped quote.
16
+
4
17
  ## 1.15.0 (2026-02-04)
5
18
 
6
19
  * Dropping support for Ruby 2.5
@@ -80,7 +93,7 @@ _P90 measured over the full set of benchmarked files_
80
93
  |---------------------------|--------|------|--------|--------|------------|
81
94
  | worldcities.csv | 5 MB | 48K | 1.27s | 0.49s | **2.6x** |
82
95
  | LANDSAT_ETM_C2_L1_50k.csv | 31 MB | 50K | 6.73s | 1.99s | **3.4x** |
83
- | PILOT_CERT.csv | 62 MB | 50K | 8.43s | 2.43s | **3.5x** |
96
+ | PEOPLE_IMPORT.csv | 62 MB | 50K | 8.43s | 2.43s | **3.5x** |
84
97
  | wide_500_cols_20k.csv | 98 MB | 20K | 19.38s | 5.09s | **3.8x** |
85
98
  | long_fields_20k.csv | 22 MB | 20K | 3.05s | 0.15s | **20.5x** |
86
99
  | embedded_newlines_20k.csv | 1.5 MB | 20K | 0.59s | 0.12s | **5.1x** |
@@ -99,7 +112,7 @@ For this reason, **CSV.table is the closest equivalent to SmarterCSV.**
99
112
  |---------------------------|--------|------|------------|-----------|--------|-----------|------------|
100
113
  | worldcities.csv | 5 MB | 48K | 1.06s | 2.12s | 0.49s | **2.2x** | **4.3x** |
101
114
  | LANDSAT_ETM_C2_L1_50k.csv | 31 MB | 50K | 3.85s | 9.25s | 1.99s | **1.9x** | **4.7x** |
102
- | PILOT_CERT.csv | 62 MB | 50K | 9.10s | 24.39s | 2.43s | **3.8x** | **10.1x** |
115
+ | PEOPLE_IMPORT.csv | 62 MB | 50K | 9.10s | 24.39s | 2.43s | **3.8x** | **10.1x** |
103
116
  | wide_500_cols_20k.csv | 98 MB | 20K | 34.24s | 61.24s | 5.09s | **6.7x** | **12.0x** |
104
117
  | long_fields_20k.csv | 22 MB | 20K | 0.34s | 0.81s | 0.15s | **2.3x** | **5.5x** |
105
118
  | whitespace_heavy_20k.csv | 3.3 MB | 20K | 0.30s | 0.83s | 0.12s | **2.5x** | **7.0x** |
data/CONTRIBUTORS.md CHANGED
@@ -1,4 +1,4 @@
1
- # A Big Thank You to all the Contributors!!
1
+ # A Big Thank You to all 59 Contributors!!
2
2
 
3
3
 
4
4
  A Big Thank you to everyone who filed issues, sent comments, and who contributed with pull requests:
data/README.md CHANGED
@@ -33,6 +33,66 @@ For a fair comparison, `CSV.table` is the closest Ruby CSV equivalent to Smarter
33
33
 
34
34
  _Benchmarks: Ruby 3.4.7, M1 Apple Silicon. Memory: 39% less allocated, 43% fewer objects. See [CHANGELOG](./CHANGELOG.md) for details._
35
35
 
36
+ ## Examples
37
+
38
+ ### Simple Example:
39
+
40
+ SmarterCSV is designed for robustness — real-world CSV data often has inconsistent formatting, extra whitespace, and varied column separators. Its intelligent defaults automatically clean and normalize data, returning high-quality hashes ready for direct use with ActiveRecord, Sidekiq, or any data pipeline — no post-processing required. See [Parsing CSV Files in Ruby with SmarterCSV](https://tilo-sloboda.medium.com/parsing-csv-files-in-ruby-with-smartercsv-6ce66fb6cf38) for more background.
41
+
42
+ ```ruby
43
+ $ cat spec/fixtures/sample.csv
44
+ First Name , Last Name , Emoji , Posts
45
+ José ,Corüazón, ❤️, 12
46
+ Jürgen, Müller ,😐,3
47
+ Michael, May ,😞, 7
48
+
49
+ $ irb
50
+ >> require 'smarter_csv'
51
+ => true
52
+ >> data = SmarterCSV.process('spec/fixtures/sample.csv')
53
+ => [{:first_name=>"José", :last_name=>"Corüazón", :emoji=>"❤️", :posts=>12},
54
+ {:first_name=>"Jürgen", :last_name=>"Müller", :emoji=>"😐", :posts=>3},
55
+ {:first_name=>"Michael", :last_name=>"May", :emoji=>"😞", :posts=>7}]
56
+ ```
57
+ Notice how SmarterCSV automatically (all defaults):
58
+ - Normalizes headers → `downcase_header: true`, `strings_as_keys: false`
59
+ - Strips whitespace → `strip_whitespace: true`
60
+ - Converts numbers → `convert_values_to_numeric: true`
61
+ - Removes empty values → `remove_empty_values: true`
62
+ - Preserves Unicode and emoji characters
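The defaults listed above can be approximated in a few lines of plain Ruby. This is only a rough sketch of the visible behavior, not SmarterCSV's implementation; the helper names `normalize_header` and `convert_value` are hypothetical, and the naive `split(',')` assumes no quoted fields.

```ruby
# Hypothetical approximation of the default clean-up steps (not library code).

# Strip whitespace, downcase, replace inner whitespace with "_", symbolize.
def normalize_header(raw)
  raw.strip.downcase.gsub(/\s+/, '_').to_sym
end

# Strip whitespace and convert plain integers and decimals to numbers.
def convert_value(raw)
  value = raw.strip
  case value
  when /\A[+-]?\d+\z/      then value.to_i
  when /\A[+-]?\d+\.\d+\z/ then value.to_f
  else value
  end
end

headers = ' First Name , Last Name , Emoji , Posts'.split(',').map { |h| normalize_header(h) }
values  = 'José ,Corüazón, ❤️, 12'.split(',').map { |v| convert_value(v) }

# Pair headers with values and drop empty strings, as remove_empty_values would.
row = headers.zip(values).to_h.reject { |_key, value| value == '' }
p row
```

The real library handles quoting, separators, and many more options; the sketch only shows why the resulting hash has symbol keys, trimmed strings, and an integer for `posts`.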
63
+
64
+ ### Batch Processing:
65
+
66
+ Processing large CSV files in chunks minimizes memory usage and enables powerful workflows:
67
+ - **Database imports** — bulk insert records in batches for better performance
68
+ - **Parallel processing** — distribute chunks across Sidekiq, Resque, or other background workers
69
+ - **Progress tracking** — the optional `chunk_index` parameter enables progress reporting
70
+ - **Memory efficiency** — only one chunk is held in memory at a time, regardless of file size
71
+
72
+ The block receives a `chunk` (array of hashes) and an optional `chunk_index` (0-based sequence number):
73
+
74
+ ```ruby
75
+ # Database bulk import
76
+ SmarterCSV.process(filename, chunk_size: 100) do |chunk, chunk_index|
77
+ puts "Processing chunk #{chunk_index}..."
78
+ MyModel.insert_all(chunk) # chunk is an array of hashes
79
+ end
80
+
81
+ # Parallel processing with Sidekiq
82
+ SmarterCSV.process(filename, chunk_size: 100) do |chunk|
83
+ MyWorker.perform_async(chunk) # each chunk processed in parallel
84
+ end
85
+ ```
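To make the shape of `chunk` and `chunk_index` concrete, here is a stdlib-only sketch that mimics the chunking pattern with `each_slice`. It does not use SmarterCSV at all; the one-column input and the hand-rolled parsing are assumptions made purely for illustration.

```ruby
require 'stringio'

# Stand-in input: one header line followed by five data lines.
io = StringIO.new("id\n1\n2\n3\n4\n5\n")
header = io.gets.chomp.to_sym

chunk_sizes = []
# each_line yields lines lazily, so only one slice of lines is built at a time.
io.each_line.each_slice(2).with_index do |lines, chunk_index|
  chunk = lines.map { |line| { header => line.chomp.to_i } } # array of hashes
  chunk_sizes << chunk.size
  puts "chunk #{chunk_index}: #{chunk.size} rows"
end
```

Each iteration sees an array of hashes plus a 0-based index, which is the same contract the block passed to `SmarterCSV.process` receives when `chunk_size` is set.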
86
+
87
+ See [Examples](docs/examples.md), [Batch Processing](docs/batch_processing.md), and [Configuration Options](docs/options.md) for more.
88
+
89
+ ## Requirements
90
+
91
+ **Minimum Ruby Version:** >= 2.6
92
+
93
+ **C Extension:** SmarterCSV includes a native C extension for accelerated CSV parsing.
94
+ The C extension is automatically compiled on MRI Ruby. For JRuby and TruffleRuby, SmarterCSV falls back to a pure Ruby implementation.
95
+
36
96
  # Installation
37
97
 
38
98
  Add this line to your application's Gemfile:
@@ -51,6 +111,7 @@ Or install it yourself as:
51
111
  # Documentation
52
112
 
53
113
  * [Introduction](docs/_introduction.md)
114
+ * [Parsing Strategy](docs/parsing_strategy.md)
54
115
  * [The Basic Read API](docs/basic_read_api.md)
55
116
  * [The Basic Write API](docs/basic_write_api.md)
56
117
  * [Batch Processing](./docs/batch_processing.md)
@@ -80,7 +141,7 @@ For reporting issues, please:
80
141
  * open a pull-request adding a test that demonstrates the issue
81
142
  * mention your version of SmarterCSV, Ruby, Rails
82
143
 
83
- # [A Special Thanks to all Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
144
+ # [A Special Thanks to all 59 Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
84
145
 
85
146
 
86
147
  # Contributing
data/docs/_introduction.md CHANGED
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [**Introduction**](./_introduction.md)
5
+ * [Parsing Strategy](./parsing_strategy.md)
5
6
  * [The Basic Read API](./basic_read_api.md)
6
7
  * [The Basic Write API](./basic_write_api.md)
7
8
  * [Batch Processing](././batch_processing.md)
@@ -54,4 +55,4 @@ The CSV processing also needed to be robust against variations in the input data
54
55
  (planned feature)
55
56
 
56
57
  ---------------
57
- PREVIOUS [README](../README.md) | NEXT: [The Basic Read API](./basic_read_api.md)
58
+ PREVIOUS [README](../README.md) | NEXT: [Parsing Strategy](./parsing_strategy.md)
data/docs/basic_read_api.md CHANGED
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Parsing Strategy](./parsing_strategy.md)
5
6
  * [**The Basic Read API**](./basic_read_api.md)
6
7
  * [The Basic Write API](./basic_write_api.md)
7
8
  * [Batch Processing](././batch_processing.md)
@@ -116,4 +117,4 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
116
117
  ```
117
118
 
118
119
  ----------------
119
- PREVIOUS: [Introduction](./_introduction.md) | NEXT: [The Basic Write API](./basic_write_api.md)
120
+ PREVIOUS: [Parsing Strategy](./parsing_strategy.md) | NEXT: [The Basic Write API](./basic_write_api.md)
data/docs/basic_write_api.md CHANGED
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Parsing Strategy](./parsing_strategy.md)
5
6
  * [The Basic Read API](./basic_read_api.md)
6
7
  * [**The Basic Write API**](./basic_write_api.md)
7
8
  * [Batch Processing](././batch_processing.md)
data/docs/batch_processing.md CHANGED
@@ -2,8 +2,9 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Parsing Strategy](./parsing_strategy.md)
5
6
  * [The Basic Read API](./basic_read_api.md)
6
- * [The Basic Write API](./basic_write_api.md)
7
+ * [The Basic Write API](./basic_write_api.md)
7
8
  * [**Batch Processing**](././batch_processing.md)
8
9
  * [Configuration Options](./options.md)
9
10
  * [Row and Column Separators](./row_col_sep.md)
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Parsing Strategy](./parsing_strategy.md)
5
6
  * [The Basic Read API](./basic_read_api.md)
6
7
  * [The Basic Write API](./basic_write_api.md)
7
8
  * [Batch Processing](././batch_processing.md)
data/docs/examples.md CHANGED
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Parsing Strategy](./parsing_strategy.md)
5
6
  * [The Basic Read API](./basic_read_api.md)
6
7
  * [The Basic Write API](./basic_write_api.md)
7
8
  * [Batch Processing](././batch_processing.md)
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Parsing Strategy](./parsing_strategy.md)
5
6
  * [The Basic Read API](./basic_read_api.md)
6
7
  * [The Basic Write API](./basic_write_api.md)
7
8
  * [Batch Processing](././batch_processing.md)
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Parsing Strategy](./parsing_strategy.md)
5
6
  * [The Basic Read API](./basic_read_api.md)
6
7
  * [The Basic Write API](./basic_write_api.md)
7
8
  * [Batch Processing](././batch_processing.md)
data/docs/options.md CHANGED
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Parsing Strategy](./parsing_strategy.md)
5
6
  * [The Basic Read API](./basic_read_api.md)
6
7
  * [The Basic Write API](./basic_write_api.md)
7
8
  * [Batch Processing](././batch_processing.md)
@@ -11,8 +12,8 @@
11
12
  * [Header Validations](./header_validations.md)
12
13
  * [Data Transformations](./data_transformations.md)
13
14
  * [Value Converters](./value_converters.md)
14
-
15
- --------------
15
+
16
+ --------------
16
17
 
17
18
  # Configuration Options
18
19
 
@@ -56,6 +57,10 @@
56
57
  | | | This can also be set to :auto, but will process the whole CSV file first (slow!) |
57
58
  | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
58
59
  | :quote_char | '"' | quotation character |
60
+ | :quote_escaping | :auto | How quotes are escaped inside quoted fields. See [Parsing Strategy](./parsing_strategy.md). |
61
+ | | | `:auto` (default): tries backslash-escape first, falls back to RFC 4180. |
62
+ | | | `:double_quotes` (RFC 4180): only `""` escapes a quote. Backslash is literal. |
63
+ | | | `:backslash` (MySQL/Unix): `\"` also escapes a quote. |
59
64
  ---------------------------------------------------------------------------------------------------------------------------------
60
65
  | :headers_in_file | true(1) | Whether or not the file contains headers as the first line. |
61
66
  | | | (1): if `user_provided_headers` is given, the default is `false`, |
data/docs/parsing_strategy.md ADDED
@@ -0,0 +1,99 @@
1
+
2
+ ### Contents
3
+
4
+ * [Introduction](./_introduction.md)
5
+ * [**Parsing Strategy**](./parsing_strategy.md)
6
+ * [The Basic Read API](./basic_read_api.md)
7
+ * [The Basic Write API](./basic_write_api.md)
8
+ * [Batch Processing](././batch_processing.md)
9
+ * [Configuration Options](./options.md)
10
+ * [Row and Column Separators](./row_col_sep.md)
11
+ * [Header Transformations](./header_transformations.md)
12
+ * [Header Validations](./header_validations.md)
13
+ * [Data Transformations](./data_transformations.md)
14
+ * [Value Converters](./value_converters.md)
15
+
16
+ --------------
17
+
18
+ # Parsing Strategy
19
+
20
+ In the real world, you rarely get to choose the quality of the CSV data you need to process. Files come from different systems, different export tools, different people — and they don't always follow the same rules. A header row might have extra whitespace, column separators vary, and quoting conventions differ from one source to the next.
21
+
22
+ Beyond parsing, consuming CSV data in Ruby and Rails has its own requirements. Working with database records, Sidekiq jobs, or JSON APIs means you need each row as a hash — and with symbol keys rather than strings, because symbols are interned and reused in memory, while duplicate strings allocate new objects for every row. For large CSV files with millions of rows, reading the entire file into memory is not practical. Instead, the data needs to be processed in chunks, where each chunk is an array of hashes that can be bulk-inserted into a database, passed to a background job, or uploaded to S3 — enabling parallel processing without ever holding the full dataset in memory.
23
+
24
+ SmarterCSV is designed around this reality. Rather than requiring you to know the exact format of your input upfront, it uses sensible defaults and auto-detection to handle the most common variations automatically. Column and row separators are auto-detected, headers are normalized, whitespace is stripped, and numeric values are converted. The output is an array of hashes with symbols as keys — ideal for direct consumption in Ruby and Rails. All of this works out of the box, without configuration.
25
+
26
+ SmarterCSV auto-detects CSV column and row separators. The same philosophy extends to how quoted fields are parsed. The `quote_escaping: :auto` default means you don't need to know whether your CSV producer uses RFC 4180 doubled quotes or MySQL-style backslash escapes — SmarterCSV figures it out for you, row by row.
27
+
28
+ The goal is simple: **make the common case work without options, and provide explicit options when you need control.**
29
+
30
+ ## Quote Escaping: The `quote_escaping` Option
31
+
32
+ CSV files use quote characters (typically `"`) to wrap fields that contain special characters like the column separator or newlines. But there are two common conventions for how a literal quote character is represented *inside* a quoted field:
33
+
34
+ | Convention | Example field value | How it appears in CSV |
35
+ |---|---|---|
36
+ | **RFC 4180** (doubled quotes) | `She said "hello"` | `"She said ""hello"""` |
37
+ | **MySQL / Unix** (backslash escape) | `She said "hello"` | `"She said \"hello\""` |
38
+
39
+ The `quote_escaping` option controls which convention SmarterCSV uses when parsing.
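For illustration, the two rows of the table above can be reproduced from the writer's side in plain Ruby. This sketch only shows how each convention encodes the same field value; it is not part of SmarterCSV, and the variable names are arbitrary.

```ruby
value = 'She said "hello"'

# RFC 4180: double every quote, then wrap the field in quotes.
rfc = '"' + value.gsub('"', '""') + '"'

# Backslash convention: prefix every quote with a backslash, then wrap.
# The block form sidesteps gsub's special treatment of backslashes
# in replacement strings.
backslash = '"' + value.gsub('"') { '\\"' } + '"'

puts rfc        # => "She said ""hello"""
puts backslash  # => "She said \"hello\""
```

A parser that sees only the encoded text has to decide which of the two rules was applied, which is the choice `quote_escaping` makes explicit.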
40
+
41
+ ## `:auto` (default)
42
+
43
+ The `:auto` mode handles both conventions automatically. It tries backslash-escape interpretation first. If that produces a malformed result (unclosed quoted field), it falls back to RFC 4180 interpretation.
44
+
45
+ This means both styles of CSV files work out of the box:
46
+
47
+ ```ruby
48
+ # RFC 4180 style — works
49
+ csv = %Q{name\n"She said ""hello"""}
50
+ SmarterCSV.process(StringIO.new(csv))
51
+ # => [{name: 'She said "hello"'}]
52
+
53
+ # MySQL/Unix style — also works
54
+ csv = %Q{name\n"She said \\"hello\\""}
55
+ SmarterCSV.process(StringIO.new(csv))
56
+ # => [{name: 'She said \\"hello\\"'}]
57
+ ```
58
+
59
+ The `:auto` mode also correctly handles fields that end with a literal backslash (a common source of parsing errors, see [Issue #316](https://github.com/tilo/smarter_csv/issues/316)):
60
+
61
+ ```ruby
62
+ # Field value is a Windows path ending in backslash
63
+ csv = %Q{path,label\n"C:\\Users\\Docs\\",important}
64
+ SmarterCSV.process(StringIO.new(csv))
65
+ # => [{path: "C:\\Users\\Docs\\", label: "important"}]
66
+ ```
67
+
68
+ ### How `:auto` works internally
69
+
70
+ 1. **Multiline detection** uses dual counting: it computes both a backslash-aware quote count and an RFC (plain) quote count in a single pass. A line is only considered multiline if *both* counts are odd. This prevents false multiline stitching when a field simply ends with `\"`.
71
+
72
+ 2. **Parsing** tries the backslash-escape interpretation first. If the parser raises `MalformedCSV` (unclosed quote), it retries with RFC 4180 interpretation.
73
+
74
+ 3. The fallback is per-line, so different rows in the same file can use different conventions.
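The dual-counting rule from step 1 can be sketched in plain Ruby. This is an illustration under simplifying assumptions, not SmarterCSV's actual code (the gem also ships a C extension); the helper names `quote_counts` and `multiline?` are hypothetical.

```ruby
# Hypothetical helper: count quote characters on one line in two ways.
#   :rfc       - every quote character counts
#   :backslash - a quote preceded by an odd number of backslashes is
#                considered escaped and is not counted
def quote_counts(line, quote_char = '"')
  rfc = 0
  backslash = 0
  pending_backslashes = 0
  line.each_char do |ch|
    if ch == "\\"
      pending_backslashes += 1
      next
    end
    if ch == quote_char
      rfc += 1
      backslash += 1 if pending_backslashes.even?
    end
    pending_backslashes = 0
  end
  { rfc: rfc, backslash: backslash }
end

# A line is treated as continuing on the next line only if BOTH counts are odd.
def multiline?(line)
  counts = quote_counts(line)
  counts[:rfc].odd? && counts[:backslash].odd?
end

puts multiline?('"C:\Users\Docs\",important') # => false (field just ends with \")
puts multiline?('"first line of a field')     # => true  (quote is still open)
```

For the Windows-path line the plain count is 2 and the backslash-aware count is 1, so only one of them is odd and the line is not stitched to the next one.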
75
+
76
+ ## `:double_quotes`
77
+
78
+ Strict RFC 4180 mode. Backslash has no special meaning — it is always a literal character. Only `""` (doubled quotes) inside a quoted field represents a single `"`.
79
+
80
+ Use this when you know your data follows RFC 4180 and want to avoid the small overhead of the try/fallback logic.
81
+
82
+ ```ruby
83
+ SmarterCSV.process("file.csv", quote_escaping: :double_quotes)
84
+ ```
85
+
86
+ ## `:backslash`
87
+
88
+ MySQL / Unix mode. A backslash before a quote character (`\"`) is treated as an escaped quote — the quote does not close the field. An even number of backslashes before a quote (e.g. `\\"`) means the backslashes are literal and the quote closes normally.
89
+
90
+ Use this when your data was exported from MySQL or another system that uses backslash escaping.
91
+
92
+ ```ruby
93
+ SmarterCSV.process("file.csv", quote_escaping: :backslash)
94
+ ```
95
+
96
+ **Note:** In `:backslash` mode, a field like `"abc\"` will raise `MalformedCSV` because the closing quote is escaped, leaving the field unclosed.
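The odd/even backslash rule, and the reason `"abc\"` is left unclosed, can be sketched as a small scanner. This is a simplified illustration, not the library's parser: the helper name `closing_quote_index` is hypothetical, and the sketch deliberately ignores doubled quotes (`""`).

```ruby
# Hypothetical helper: index of the quote that closes a field whose opening
# quote is at str[0], or nil if no unescaped closing quote is found.
def closing_quote_index(str, quote_char = '"')
  backslashes = 0
  str.each_char.with_index do |ch, i|
    next if i.zero? # skip the opening quote
    if ch == "\\"
      backslashes += 1
    else
      # An even run of backslashes (including zero) leaves the quote unescaped.
      return i if ch == quote_char && backslashes.even?
      backslashes = 0
    end
  end
  nil
end

bs = "\\" # a single backslash character

p closing_quote_index('"abc' + bs + '"')     # => nil (the quote is escaped)
p closing_quote_index('"abc' + bs * 2 + '"') # => 6   (literal backslashes, quote closes)
```

A `nil` result corresponds to the unclosed-field situation in which `:backslash` mode raises `MalformedCSV`.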
97
+
98
+ --------------
99
+ PREVIOUS: [Introduction](./_introduction.md) | NEXT: [The Basic Read API](./basic_read_api.md)
data/docs/row_col_sep.md CHANGED
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Parsing Strategy](./parsing_strategy.md)
5
6
  * [The Basic Read API](./basic_read_api.md)
6
7
  * [The Basic Write API](./basic_write_api.md)
7
8
  * [Batch Processing](././batch_processing.md)
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Parsing Strategy](./parsing_strategy.md)
5
6
  * [The Basic Read API](./basic_read_api.md)
6
7
  * [The Basic Write API](./basic_write_api.md)
7
8
  * [Batch Processing](././batch_processing.md)