smarter_csv 1.15.2 → 1.16.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (48)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +9 -0
  3. data/CHANGELOG.md +68 -1
  4. data/CONTRIBUTORS.md +3 -1
  5. data/Gemfile +1 -0
  6. data/README.md +123 -27
  7. data/docs/_introduction.md +40 -24
  8. data/docs/bad_row_quarantine.md +285 -0
  9. data/docs/basic_read_api.md +151 -9
  10. data/docs/basic_write_api.md +474 -59
  11. data/docs/batch_processing.md +161 -4
  12. data/docs/column_selection.md +183 -0
  13. data/docs/data_transformations.md +162 -29
  14. data/docs/examples.md +339 -46
  15. data/docs/header_transformations.md +93 -12
  16. data/docs/header_validations.md +56 -18
  17. data/docs/history.md +117 -0
  18. data/docs/instrumentation.md +165 -0
  19. data/docs/migrating_from_csv.md +290 -0
  20. data/docs/options.md +150 -87
  21. data/docs/parsing_strategy.md +63 -1
  22. data/docs/real_world_csv.md +262 -0
  23. data/docs/releases/1.16.0/benchmarks.md +223 -0
  24. data/docs/releases/1.16.0/changes.md +272 -0
  25. data/docs/releases/1.16.0/performance_notes.md +114 -0
  26. data/docs/row_col_sep.md +14 -5
  27. data/docs/value_converters.md +193 -57
  28. data/ext/smarter_csv/extconf.rb +3 -0
  29. data/ext/smarter_csv/smarter_csv.c +1007 -71
  30. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
  31. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
  32. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
  33. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
  34. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
  35. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
  36. data/lib/smarter_csv/errors.rb +8 -0
  37. data/lib/smarter_csv/file_io.rb +1 -1
  38. data/lib/smarter_csv/hash_transformations.rb +14 -13
  39. data/lib/smarter_csv/header_transformations.rb +21 -2
  40. data/lib/smarter_csv/headers.rb +2 -1
  41. data/lib/smarter_csv/options.rb +124 -7
  42. data/lib/smarter_csv/parser.rb +362 -75
  43. data/lib/smarter_csv/reader.rb +494 -46
  44. data/lib/smarter_csv/version.rb +1 -1
  45. data/lib/smarter_csv/writer.rb +71 -19
  46. data/lib/smarter_csv.rb +95 -12
  47. data/smarter_csv.gemspec +20 -10
  48. metadata +37 -80
@@ -0,0 +1,262 @@
1
+
2
+ ### Contents
3
+
4
+ * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Parsing Strategy](./parsing_strategy.md)
7
+ * [The Basic Read API](./basic_read_api.md)
8
+ * [The Basic Write API](./basic_write_api.md)
9
+ * [Batch Processing](./batch_processing.md)
10
+ * [Configuration Options](./options.md)
11
+ * [Row and Column Separators](./row_col_sep.md)
12
+ * [Header Transformations](./header_transformations.md)
13
+ * [Header Validations](./header_validations.md)
14
+ * [Column Selection](./column_selection.md)
15
+ * [Data Transformations](./data_transformations.md)
16
+ * [Value Converters](./value_converters.md)
17
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
18
+ * [Instrumentation Hooks](./instrumentation.md)
19
+ * [Examples](./examples.md)
20
+ * [**Real-World CSV Files**](./real_world_csv.md)
21
+ * [SmarterCSV over the Years](./history.md)
22
+ * [Release Notes](./releases/1.16.0/changes.md)
23
+
24
+ ---
25
+
26
+ # Real-World CSV Files in Production
27
+
28
+ CSV is the most common data exchange format in enterprise software — and also one of the most inconsistently implemented. This page documents what you will actually encounter when processing production CSV files, and how SmarterCSV handles each case.
29
+
30
+ ## Status Legend
31
+
32
+ | Symbol | Meaning |
33
+ |--------|---------|
34
+ | ✅ | Handled automatically — no configuration needed |
35
+ | 🔘 | Handled — but requires the user to specify an option |
36
+ | ❌ | Not handled — caller must pre-process or work around |
37
+
38
+ ---
39
+
40
+ ## Encoding & BOM
41
+
42
+ Real-world files come from dozens of different systems, each with its own default encoding. Excel in particular is notorious for writing UTF-8 files with a Byte Order Mark (BOM) that trips up many parsers.
43
+
44
+ | Issue | Status | Notes |
45
+ |-------|--------|-------|
46
+ | UTF-8 with BOM (`\xEF\xBB\xBF`) | ✅ | Stripped automatically from the first line. Excel writes this when saving as "CSV UTF-8". |
47
+ | `\r\n` CRLF line endings | ✅ | Auto-detected. The default on Windows and most enterprise exports. |
48
+ | `\r` only (classic Mac) | ✅ | Auto-detected. Rare today but still seen in legacy pipelines. |
49
+ | Windows-1252 / Latin-1 | 🔘 | Specify `file_encoding: 'windows-1252'`. Common in European financial exports, older SAP systems, QuickBooks. |
50
+ | UTF-16 LE with BOM | 🔘 | Specify `file_encoding: 'utf-16le'`. Some Microsoft SQL Server and Access exports default to this. |
51
+ | Shift-JIS / EUC-JP | 🔘 | Specify `file_encoding: 'shift_jis'` or `'euc-jp'`. Japanese ERP and POS systems. |
52
+
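As a rough illustration of what the encoding rows above involve, the following plain-Ruby sketch transcodes legacy bytes to UTF-8 and removes a leading BOM. The helper names `to_utf8` and `strip_bom` are hypothetical and are not part of SmarterCSV; the library's own handling may differ.

```ruby
# Illustrative helpers only; these are not SmarterCSV internals.
UTF8_BOM = "\uFEFF"

# Interpret raw bytes as the given legacy encoding and transcode them to UTF-8.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode('UTF-8')
end

# Remove a leading UTF-8 byte order mark, if present.
def strip_bom(text)
  text.delete_prefix(UTF8_BOM)
end
```

Without the BOM strip, the first header would carry three invisible bytes, which is why a lookup by the expected key can fail silently.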
53
+ ---
54
+
55
+ ## Quoting & Escaping
56
+
57
+ Two competing quoting conventions are both common in the wild: RFC 4180 (used by Excel) and backslash escaping (used by MySQL and PostgreSQL). SmarterCSV defaults to `:auto` mode, which tries backslash escaping first and falls back to RFC 4180.
58
+
59
+ | Issue | Status | Notes |
60
+ |-------|--------|-------|
61
+ | RFC 4180 double-quote escaping (`""`) | ✅ | The Excel standard. Handled by `quote_escaping: :auto` (default). |
62
+ | Backslash escaping (`\"`) | ✅ | Used by MySQL `SELECT INTO OUTFILE`, `mysqldump`, PostgreSQL `COPY`. Handled by `:auto` default. |
63
+ | Newlines inside quoted fields | ✅ | Multi-line field stitching. Common in address fields, notes, and CRM comment exports. |
64
+ | Mid-field quote characters (`5'10"`, inch marks, apostrophes) | ✅ | `quote_boundary: :standard` (default since 1.16.0) only recognizes quotes at field boundaries. Mid-field quotes are treated as literal characters. |
65
+ | Semicolon-delimited files mislabeled as CSV | ✅ | `col_sep: :auto` (default) detects the actual separator. Common in European locales where comma is the decimal separator. |
66
+ | Tab-delimited TSV files | ✅ | `col_sep: :auto` detects tabs. Common in bioinformatics and some government data portals. |
67
+
68
+ ---
69
+
70
+ ## Header Quirks
71
+
72
+ Headers in production files are rarely as clean as you'd expect. They carry units, source system field names, BOM characters, duplicates, and sometimes no headers at all.
73
+
74
+ | Issue | Status | Notes |
75
+ |-------|--------|-------|
76
+ | BOM on first header field | ✅ | Stripped automatically. Without this, the first key would be `:\xEF\xBB\xBFname` instead of `:name`. |
77
+ | Duplicate headers | ✅ | Disambiguated using `duplicate_header_suffix` (default `''` → `:email`, `:email_2`, `:email_3`). |
78
+ | Empty or whitespace-only headers | ✅ | Auto-named using `missing_header_prefix` (default `column_`) → `:column_1`, `:column_2`. Values are never silently dropped. |
79
+ | Trailing comma on header row (phantom empty column) | ✅ | The phantom column is auto-named just like any other empty header. |
80
+ | Headers with spaces and special characters (`Revenue (USD)`) | ✅ | Spaces and dashes normalized to underscores → `:revenue_(usd)`. Parentheses, slashes, etc. are preserved. |
81
+ | Extra data columns beyond the header row | ✅ | Auto-generates `column_N` names for extra fields. Controlled by `missing_headers:` option. |
82
+ | No header row at all | 🔘 | Use `headers_in_file: false, user_provided_headers: [:col1, :col2, ...]`. Common in raw database dumps and fixed-format legacy exports. |
83
+
84
+ ---
85
+
86
+ ## Numeric & Data Type Landmines
87
+
88
+ Numeric conversion is one of the most common sources of data loss. SmarterCSV converts values that look like numbers by default — which is correct for most cases — but certain fields must be excluded explicitly.
89
+
90
+ | Issue | Status | Notes |
91
+ |-------|--------|-------|
92
+ | Integer and float conversion | ✅ | `convert_values_to_numeric: true` (default). `"42"` → `42`, `"3.14"` → `3.14`. |
93
+ | Currency symbols in values (`$1,234.56`, `€1.234,56`) | ✅ / 🔘 | Won't match the numeric pattern, so the value is safely left as a string. Use `value_converters` if a numeric value is needed. |
94
+ | Percentage values (`12.5%`) | ✅ / 🔘 | Won't match the numeric pattern, so the value is safely left as a string. Use `value_converters` if a numeric value is needed. |
95
+ | Leading zeros (ZIP codes, phone numbers, SKUs, account numbers) | 🔘 | `convert_values_to_numeric: { except: [:zip, :phone, :sku] }`. Without this, `"01234"` becomes `1234`. This is one of the most common silent data loss bugs in CSV processing; many US ZIP codes start with a zero. |
96
+ | NULL / empty value variants (`NULL`, `\N`, `N/A`, `(null)`, `#N/A`) | 🔘 | Use `nil_values_matching: /\A(NULL\|\\N\|N\/A\|#N\/A\|\(null\))\z/i`. Without configuration these are left as literal strings. |
97
+ | Date values (`2023-01-15`, `01/02/2023`, `Jan 2, 2023`) | 🔘 | Use `value_converters` with a date parsing lambda. SmarterCSV does not auto-convert dates — format ambiguity (`01/02/2023` = Jan 2 or Feb 1?) makes auto-conversion unsafe. |
98
+ | Boolean variants (`Y/N`, `Yes/No`, `TRUE/FALSE`, `1/0`, `X/` in SAP) | 🔘 | Use `value_converters` for the relevant columns. |
99
+ | European number format (`1.234,56` meaning 1234.56) | 🔘 | Use a `value_converters` lambda that removes the thousands dots and turns the decimal comma into a dot before parsing. Common in German, French, Italian, and Spanish exports. |
100
+
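The converter rows in the table above can be sketched in plain Ruby. The constant names below are hypothetical, and the `%m/%d/%Y` date format is an assumption about the source system. Each lambda is meant to be attached to one specific column through `value_converters`, as the table describes, and not applied globally.

```ruby
require 'date'

# Hypothetical per-column converters; the names are illustrative only.

# "1.234,56" -> 1234.56. Use only on columns known to hold European-formatted
# numbers, because "1.234" is ambiguous without that context.
EUROPEAN_NUMBER_PATTERN = /\A-?(\d{1,3}(\.\d{3})+|\d+)(,\d+)?\z/
EUROPEAN_NUMBER = lambda do |value|
  next value unless value.is_a?(String) && value.match?(EUROPEAN_NUMBER_PATTERN)

  value.delete('.').tr(',', '.').to_f
end

# "Y"/"Yes"/"TRUE"/"1" -> true, "N"/"No"/"FALSE"/"0" -> false, otherwise unchanged.
TRUE_STRINGS  = %w[y yes true 1].freeze
FALSE_STRINGS = %w[n no false 0].freeze
BOOLEAN = lambda do |value|
  key = value.to_s.strip.downcase
  next true if TRUE_STRINGS.include?(key)
  next false if FALSE_STRINGS.include?(key)

  value
end

# "01/02/2023" -> Date for January 2, 2023 (assumes month-first input).
# Input that does not parse is returned unchanged.
US_DATE = lambda do |value|
  Date.strptime(value.to_s, '%m/%d/%Y')
rescue ArgumentError
  value
end
```

Returning the original value on a failed match keeps unexpected input visible downstream instead of turning it into `0.0` or `nil`.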
101
+ ---
102
+
103
+ ## File Size & Structure
104
+
105
+ | Issue | Status | Notes |
106
+ |-------|--------|-------|
107
+ | Millions of rows | ✅ | Use `chunk_size: N` for batch processing. SmarterCSV streams the file and never loads it entirely into memory. |
108
+ | Gigabyte-sized files | ✅ | Streaming architecture. Memory usage is proportional to chunk size, not file size. |
109
+ | Single-column CSV files | ✅ | `col_sep: :auto` handles files with no detected separator gracefully (fixed in issue #222). |
110
+ | Ragged rows — fewer fields than headers | ✅ | Missing trailing fields produce no key in the hash. Combined with `remove_empty_values: true` (default), short rows are handled cleanly. |
111
+ | Ragged rows — more fields than headers | ✅ | Extra columns are auto-named `column_N` via `missing_header_prefix`. |
112
+ | Empty or whitespace-only file | ✅ | Raises `SmarterCSV::EmptyFileError` with a clear message instead of a cryptic internal error. |
113
+
114
+ ---
115
+
116
+ ## Enterprise & Application-Specific Patterns
117
+
118
+ ### Databases & Data Warehouses
119
+
120
+ | Source | Issue | Status | Notes |
121
+ |--------|-------|--------|-------|
122
+ | MySQL `SELECT INTO OUTFILE` | Backslash quote escaping | ✅ | `quote_escaping: :auto` default. |
123
+ | PostgreSQL `COPY TO` | Backslash quote escaping, `\N` for NULL | ✅ / 🔘 | Escaping handled automatically; `\N` as nil requires `nil_values_matching`. |
124
+ | SEC EDGAR | Pipe-delimited, UTF-8, clean format | ✅ | `col_sep: :auto` detects the pipe separator. |
125
+ | UNIX DB Dumps† | CTRL-A col separator, CTRL-B row separator, `#` comment lines | 🔘 | `col_sep: "\cA", row_sep: "\cB", comment_regexp: /^#/` |
126
+
127
+ ### ERP & CRM
128
+
129
+ | Source | Issue | Status | Notes |
130
+ |--------|-------|--------|-------|
131
+ | SAP ALV / IDOC exports | Space-padded fixed-width fields | ✅ | `strip_whitespace: true` (default) trims all field values. |
132
+ | SAP BW/BEx | Very wide exports (300–500+ columns) | ✅ | No column count limit. |
133
+ | Salesforce reports | Trailing empty columns, quoted address fields with newlines | ✅ | Both handled by default. |
134
+
135
+ ### Spreadsheets
136
+
137
+ | Source | Issue | Status | Notes |
138
+ |--------|-------|--------|-------|
139
+ | Excel `Save As CSV` | UTF-8 BOM, RFC 4180 quoting, 1,048,576 row limit | ✅ | BOM stripped, quoting handled. Row limit is an Excel constraint — SmarterCSV will parse whatever Excel wrote. |
140
+
141
+ ### Finance & Banking
142
+
143
+ | Source | Issue | Status | Notes |
144
+ |--------|-------|--------|-------|
145
+ | Stripe / Coinbase / modern fintechs | Clean UTF-8 CSV, ISO 8601 dates, no BOM | ✅ | No special configuration needed. |
146
+ | Bank statement exports (Chase, Wells Fargo, Barclays, …) | Metadata preamble rows before the header (account number, date range, institution name) | 🔘 | Use `skip_lines: N` to skip the preamble. N varies by bank and may change with format updates. |
147
+ | Accounting negative notation | `(1,234.56)` instead of `-1234.56` — used by QuickBooks, Xero, SAP, and most bank exports | 🔘 | Use a `value_converters` lambda: `->(v) { v&.match?(/\A\(.*\)\z/) ? -v.gsub(/[(),]/, '').to_f : v }` |
148
+ | PayPal transaction exports | Preamble rows, mixed currency/amount columns, locale-specific date format | 🔘 | Use `skip_lines:` for preamble; use `value_converters` for dates and signed amounts. |
149
+ | Bloomberg / Refinitiv terminal exports | `\|` separator, `N.A.` for nulls, proprietary date formats | 🔘 | `col_sep: "\|"`, `nil_values_matching: /\AN\.A\.\z/`, `value_converters` for dates. |
150
+ | QuickBooks exports | Windows-1252 encoding, currency-formatted values | 🔘 | Specify `file_encoding: 'windows-1252'`. Currency values like `"$1,234.56"` stay as strings. |
151
+
152
+ ### Government & Public Data
153
+
154
+ | Source | Issue | Status | Notes |
155
+ |--------|-------|--------|-------|
156
+ | Government open data portals | Semicolons as separator, Latin-1, inconsistent quoting | ✅ / 🔘 | `col_sep: :auto` handles semicolons; specify `file_encoding:` if non-UTF-8. |
157
+ | US Census Bureau | Very large files (millions of rows), heavily coded values | ✅ | Use `chunk_size:` for memory-efficient processing. |
158
+ | US Treasury / USASpending.gov | Large files, many empty columns, dollar amounts as plain strings | ✅ | Works out of the box; `remove_empty_values: true` (default) drops empty columns. |
159
+ | World Bank / IMF data exports | 4–5 preamble rows (title, source, notes) before the header | 🔘 | `skip_lines: N` to skip the preamble. N is typically 4 for World Bank, 5 for IMF. |
160
+ | Australian ABS (Bureau of Statistics) | UTF-8 BOM, preamble metadata rows before the header | 🔘 | BOM stripped automatically; use `skip_lines: N` for the preamble. |
161
+
162
+ ### Healthcare & Life Sciences
163
+
164
+ | Source | Issue | Status | Notes |
165
+ |--------|-------|--------|-------|
166
+ | HL7 / FHIR flattened exports | Very wide files (100+ columns), many empty fields, cryptic column names (`component_0_valueQuantity_value`) | ✅ | Parses fine. `remove_empty_values: true` (default) drops empty fields automatically. |
167
+ | Epic / Cerner EHR exports | Windows-1252 encoding, locale-specific date formats | 🔘 | `file_encoding: 'windows-1252'`; use `value_converters` for date columns. |
168
+ | Lab instrument exports (Roche, Abbott, Siemens) | Semicolon separator (European instruments), preamble rows with instrument metadata | 🔘 | `col_sep: :auto` detects the separator; `skip_lines: N` for the preamble. |
169
+ | DICOM-SR flattened to CSV | Nested structured report data squashed into column names | ✅ | Parses fine. Data model is messy but no special configuration needed. |
170
+ | FDA adverse event / MedWatch exports | Pipe-delimited, `null` literal strings, long free-text fields with embedded newlines | 🔘 | `col_sep: "\|"`, `nil_values_matching: /\Anull\z/i`; embedded newlines handled automatically. |
171
+ | Bioinformatics (VCF-derived) | Thousands of columns (one sample per column) | ✅ | No column count limit in the parsing hot path. |
172
+
173
+ ### E-commerce & Survey Tools
174
+
175
+ | Source | Issue | Status | Notes |
176
+ |--------|-------|--------|-------|
177
+ | Shopify / WooCommerce | Pipe-delimited values within a field (`tag1\|tag2\|tag3`) | 🔘 | Use `value_converters` to split on `\|` for the relevant column. |
178
+ | Qualtrics / SurveyMonkey | 200–800 columns, multi-row headers, HTML in values | 🔘 | Multi-row headers require pre-processing; HTML in values is left as-is (use `value_converters` to strip it). |
179
+
180
+ ### Legacy & Unusual Formats
181
+
182
+ | Source | Issue | Status | Notes |
183
+ |--------|-------|--------|-------|
184
+ | Apple iTunes DB export† | CTRL-A col separator, CTRL-B row separator, `#` comment lines | 🔘 | `col_sep: "\cA", row_sep: "\cB", comment_regexp: /^#/` |
185
+
186
+ ### I/O Patterns
187
+
188
+ | Source | Issue | Status | Notes |
189
+ |--------|-------|--------|-------|
190
+ | Gzipped CSV (`.csv.gz`) | Compressed file | 🔘 | Decompress and pass the resulting IO object: `SmarterCSV.process(Zlib::GzipReader.open(path))`. |
191
+ | HTTP streaming | Parsing from a live HTTP response | 🔘 | Pass any IO-compatible object that responds to `#gets`. |
192
+
193
+ †: Legacy Apple DB dumps and older UNIX data dumps use ASCII control characters as delimiters:
194
+
195
+ ```
196
+ col_sep = "\x01" # CTRL-A
197
+ row_sep = "\x02" # CTRL-B
198
+ comment_prefix = "#"
199
+ ```
200
+
201
+ This is a clever design: since CTRL-A and CTRL-B never appear in normal text, fields never need quoting or escaping — eliminating an entire class of parsing ambiguity.
202
+
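To make the point about control-character delimiters concrete, here is a minimal plain-Ruby sketch that splits such a dump with no quoting logic at all. The method name `parse_control_delimited` is hypothetical, and the sketch assumes records carry no extra newline after CTRL-B; it is not how SmarterCSV parses these files.

```ruby
COL_SEP = "\x01"       # CTRL-A
ROW_SEP = "\x02"       # CTRL-B
COMMENT_PREFIX = '#'

# Split on the row separator, skip empty records and comment lines,
# then split each record into fields. No quote or escape handling is needed.
def parse_control_delimited(data)
  data.split(ROW_SEP)
      .reject { |record| record.empty? || record.start_with?(COMMENT_PREFIX) }
      .map { |record| record.split(COL_SEP, -1) }
end
```

Commas, quotes, and apostrophes inside a field pass through untouched, because none of them collide with the delimiters.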
203
+ ---
204
+
205
+ ## Pathological Cases ❌
206
+
207
+ These formats have structural problems that no CSV parser can transparently resolve. Pre-processing the file before passing it to SmarterCSV is the only reliable solution.
208
+
209
+ | Issue | Why it breaks | Workaround |
210
+ |-------|--------------|------------|
211
+ | Mixed line endings within one file | Row separator is detected once from the first N bytes. A file mixing `\r\n` and `\n` will produce rows with stray `\r` on some values. | Pre-process with `dos2unix` or equivalent. |
212
+ | Mixed encodings within one file | Happens when CSVs are concatenated from multiple sources. `force_utf8: true` with `invalid_byte_sequence: ''` is the best available mitigation, but true mixed-encoding files cannot be reliably fixed by any parser. | Identify and re-encode each source file before concatenating. |
213
+ | Unquoted fields containing the column separator | Malformed CSV — the field will be split incorrectly and there is no way to recover the original value. | Fix upstream at the data source. |
214
+ | Repeated header row mid-file | Happens when files are assembled with `cat chunk_1.csv chunk_2.csv`. The repeated header lands as a data row: `{name: "name", age: "age"}`. | Strip repeated header lines before parsing, or post-filter rows where all values equal their key names. |
215
+ | Trailer / summary rows | Totals or citation rows at end of file have no consistent marker. | Pre-process to remove, or post-filter with a sentinel check: `rows.reject { \|r\| r[:date].nil? }`. |
216
+ | REDCap (clinical trial data) | Two-row header: field names row + field labels row. The labels row lands as the first data row. | Drop post-parse: `rows.drop(1)`, or pre-process to remove the labels row. |
217
+ | IRS / SOI Tax Stats — footnote rows mixed into data | Footnote rows (e.g. `* Data suppressed`) appear mid-file with no consistent column structure. No option to filter mid-file rows by pattern. | Pre-process to strip footnote lines before parsing. |
218
+ | UK ONS (Office for National Statistics) — multi-row headers | Title row + unit row before the actual header row. SmarterCSV reads one header row; the extra rows land as data. | Pre-process to collapse or remove the extra header rows. |
219
+ | UK ONS — footer footnotes (`[note]`, `[x]`) | Footnote rows at end of file use inline markers with no consistent structure. | Pre-process to strip footer lines, or post-filter rows where key fields are nil. |
220
+ | World Bank / IMF — footer with source citation | Last 1–3 lines contain source attribution text, not data. | Pre-process to strip, or use `rows[0...-N]` to drop the last N rows post-parse. |
221
+ | Australian ABS — merged cell artifacts | Excel merged cells export as a value in the first occurrence and blank in subsequent rows. The blank column becomes `:column_1` with empty values. | Post-process: forward-fill the blank column from the previous non-empty value. |
222
+
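Two of the post-processing workarounds in the table above, dropping repeated header rows and forward-filling merged-cell blanks, can be sketched on an array of row hashes. The method names are hypothetical. The header check assumes the repeated header text equals the hash key exactly, as in the `{name: "name", age: "age"}` example, which will not hold when header normalization changes the text.

```ruby
# Drop rows that merely repeat the header, e.g. { name: "name", age: "age" }.
def drop_repeated_headers(rows)
  rows.reject do |row|
    !row.empty? && row.all? { |key, value| value.to_s == key.to_s }
  end
end

# Forward-fill `key` where it is absent or blank, using the last non-blank value.
# Rows that appear before any non-blank value are returned unchanged.
def forward_fill(rows, key)
  last = nil
  rows.map do |row|
    value = row[key]
    if value.nil? || value.to_s.strip.empty?
      last.nil? ? row : row.merge(key => last)
    else
      last = value
      row
    end
  end
end
```

Both helpers return new arrays and leave the parsed rows untouched, so they can be chained after parsing without side effects.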
223
+ ---
224
+
225
+ ## Quick Reference: Common Option Combinations
226
+
227
+ ```ruby
228
+ # Legacy enterprise export (Windows, Latin-1, BOM, CRLF)
229
+ SmarterCSV.process(file, file_encoding: 'windows-1252')
230
+
231
+ # MySQL dump (backslash escaping, \N for NULL)
232
+ SmarterCSV.process(file,
233
+ quote_escaping: :backslash,
234
+ nil_values_matching: /\A\\N\z/)
235
+
236
+ # Financial data (preserve leading zeros, no numeric conversion on key fields)
237
+ SmarterCSV.process(file,
238
+ convert_values_to_numeric: { except: [:account_number, :zip, :routing_number] })
239
+
240
+ # SAP wide export with duplicate column names
241
+ SmarterCSV.process(file,
242
+ duplicate_header_suffix: '_',
243
+ strip_whitespace: true)
244
+
245
+ # Survey export with boolean and N/A values
246
+ SmarterCSV.process(file,
247
+ nil_values_matching: /\A(N\/A|NA|n\/a)\z/,
248
+ value_converters: {
249
+ completed: ->(v) { v&.upcase == 'Y' }
250
+ })
251
+
252
+ # Gzipped CSV
253
+ require 'zlib'
254
+ SmarterCSV.process(Zlib::GzipReader.open('data.csv.gz'))
255
+
256
+ # HTTP source (open-uri buffers the whole response before returning an IO)
257
+ require 'open-uri'
258
+ SmarterCSV.process(URI.open('https://example.com/data.csv'))
259
+ ```
260
+
261
+ --------------------
262
+ PREVIOUS: [Examples](./examples.md) | NEXT: [SmarterCSV over the Years](./history.md) | UP: [README](../README.md)
@@ -0,0 +1,223 @@
1
+ # SmarterCSV 1.16.0 — Benchmark Results
2
+
3
+ - **Date:** 2026-03-11 (two runs, best of each taken)
4
+ - **Ruby:** 3.4.7 [arm64-darwin25] on Apple M1 Pro
5
+ - **SmarterCSV:** 1.16.0.dev10
6
+ - **Versions compared:** 1.14.4, 1.15.0, 1.15.2, 1.16.0
7
+ - **Ruby CSV:** 3.3.5
8
+ - **ZSV:** 1.3.1
9
+ - **Methodology:** best of 30 measured runs (2 warm-up), best result taken across two independent sessions
10
+
11
+ > **Note:** ZSV results have GC disabled during calls (zsv-ruby 1.3.1 GC bug on Ruby 3.4.x).
12
+ > This gives ZSV a slight speed advantage — no GC pauses during measurement.
13
+
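The "best of N measured runs with warm-up" methodology listed above can be approximated as follows. This is a sketch under assumptions, not the benchmark harness used for these results; the method name `best_of` and its defaults are hypothetical.

```ruby
# Run the block `warmup` times unmeasured, then `runs` times measured,
# and return the fastest wall-clock time in seconds.
def best_of(runs: 30, warmup: 2, &block)
  warmup.times { block.call }
  timings = Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    block.call
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
  timings.min
end
```

Taking the minimum instead of the mean reduces the influence of GC pauses and background load, which is relevant when reading the ZSV note above.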
14
+ ---
15
+
16
+ ## SmarterCSV C accelerated — version comparison
17
+
18
+ | File | Rows | v1.14.4 | v1.15.0 | v1.15.2 | v1.16.0 | newest vs oldest |
19
+ |--------------------------------------|---------|---------------|---------------|---------------|---------------|------------------|
20
+ | PEOPLE_IMPORT_B.csv | 50000 | 1.6556s | 0.3952s | 0.1012s | 0.0869s | 19.05× faster |
21
+ | PEOPLE_IMPORT_C.csv | 50000 | 8.1715s | 1.9714s | 0.2065s | 0.1691s | 48.32× faster |
22
+ | PEOPLE_IMPORT_NB.csv | 50000 | 1.6053s | 0.6043s | 0.0859s | 0.0799s | 20.09× faster |
23
+ | PEOPLE_IMPORT_NC.csv | 50000 | 1.4952s | 0.6202s | 0.0763s | 0.0630s | 23.73× faster |
24
+ | uscities.csv | 31257 | 1.0576s | 0.3395s | 0.1126s | 0.1079s | 9.80× faster |
25
+ | uszips.csv | 33782 | 1.2769s | 0.4532s | 0.1113s | 0.1019s | 12.53× faster |
26
+ | worldcities.csv | 48059 | 1.0703s | 0.4362s | 0.1160s | 0.0973s | 11.00× faster |
27
+ | embedded_newlines_20k.csv | 80000 | 0.5404s | 0.0962s | 0.0564s | 0.0543s | 9.95× faster |
28
+ | embedded_separators_20k.csv | 20000 | 0.2779s | 0.0831s | 0.0320s | 0.0248s | 11.21× faster |
29
+ | heavy_quoting_20k.csv | 20000 | 0.5222s | 0.1330s | 0.0540s | 0.0359s | 14.55× faster |
30
+ | long_fields_20k.csv | 20000 | 2.9604s | 0.1357s | 0.1101s | 0.0451s | 65.64× faster |
31
+ | many_empty_fields_20k.csv | 20000 | 0.3946s | 0.3787s | 0.0313s | 0.0251s | 15.72× faster |
32
+ | multi_char_separator_20k.csv | 20000 | 0.5390s | 0.5452s | 0.0328s | 0.0260s | 20.73× faster |
33
+ | sample_10M.csv | 50000 | 0.4593s | 0.1642s | 0.0534s | 0.0461s | 9.96× faster |
34
+ | sensor_data_50krows_50cols.csv | 50000 | 3.9848s | 1.4278s | 0.2722s | 0.2640s | 15.09× faster |
35
+ | tab_separated_20k.tsv | 20000 | 0.4618s | 0.1111s | 0.0343s | 0.0245s | 18.85× faster |
36
+ | utf8_multibyte_20k.csv | 20000 | 0.2276s | 0.0688s | 0.0204s | 0.0167s | 13.63× faster |
37
+ | whitespace_heavy_20k.csv | 20000 | 0.5360s | 0.1206s | 0.0355s | 0.0281s | 19.07× faster |
38
+ | wide_500_cols_20k.csv | 20000 | 17.6581s | 5.2151s | 1.4185s | 1.3519s | 13.06× faster |
39
+
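The "newest vs oldest" column appears to be the oldest version's time divided by the newest version's time. A small sketch of that arithmetic, with the hypothetical method names `speedup` and `format_speedup`:

```ruby
# Ratio of the oldest timing to the newest timing, rounded to two decimals.
def speedup(oldest_seconds, newest_seconds)
  (oldest_seconds / newest_seconds).round(2)
end

# Render the ratio the way the table does, e.g. "19.05× faster".
def format_speedup(oldest_seconds, newest_seconds)
  format('%.2f× faster', speedup(oldest_seconds, newest_seconds))
end
```

For PEOPLE_IMPORT_B.csv this gives 1.6556 / 0.0869 ≈ 19.05, matching the table row.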
40
+ ## SmarterCSV Ruby path — version comparison
41
+
42
+ | File | Rows | v1.14.4 | v1.15.0 | v1.15.2 | v1.16.0 | newest vs oldest |
43
+ |--------------------------------------|---------|---------------|---------------|---------------|---------------|------------------|
44
+ | PEOPLE_IMPORT_B.csv | 50000 | 4.6704s | 3.6190s | 0.5382s | 0.5174s | 9.03× faster |
45
+ | PEOPLE_IMPORT_C.csv | 50000 | 26.6781s | 22.8627s | 2.5588s | 1.3184s | 20.24× faster |
46
+ | PEOPLE_IMPORT_NB.csv | 50000 | 4.6031s | 3.5647s | 0.5325s | 0.4649s | 9.90× faster |
47
+ | PEOPLE_IMPORT_NC.csv | 50000 | 4.4299s | 3.7989s | 0.5843s | 0.3963s | 11.18× faster |
48
+ | uscities.csv | 31257 | 2.7374s | 2.1679s | 1.8397s | 1.0811s | 2.53× faster |
49
+ | uszips.csv | 33782 | 3.2771s | 2.6214s | 2.1987s | 1.3326s | 2.46× faster |
50
+ | worldcities.csv | 48059 | 2.8980s | 2.3094s | 1.9354s | 1.0869s | 2.67× faster |
51
+ | embedded_newlines_20k.csv | 80000 | 0.9685s | 0.5729s | 0.4696s | 0.4275s | 2.27× faster |
52
+ | embedded_separators_20k.csv | 20000 | 0.7177s | 0.5696s | 0.4620s | 0.2725s | 2.63× faster |
53
+ | heavy_quoting_20k.csv | 20000 | 1.4473s | 1.1282s | 0.8769s | 0.5295s | 2.73× faster |
54
+ | long_fields_20k.csv | 20000 | 9.0238s | 6.4373s | 4.8163s | 2.5469s | 3.54× faster |
55
+ | many_empty_fields_20k.csv | 20000 | 0.8739s | 0.7527s | 0.2603s | 0.1652s | 5.29× faster |
56
+ | multi_char_separator_20k.csv | 20000 | 1.4261s | 1.1569s | 0.2457s | 0.1645s | 8.67× faster |
57
+ | sample_10M.csv | 50000 | 1.0699s | 0.8684s | 0.2419s | 0.2220s | 4.82× faster |
58
+ | sensor_data_50krows_50cols.csv | 50000 | 9.2662s | 6.8954s | 1.8555s | 1.8147s | 5.11× faster |
59
+ | tab_separated_20k.tsv | 20000 | 1.2786s | 0.9850s | 0.1620s | 0.1551s | 8.24× faster |
60
+ | utf8_multibyte_20k.csv | 20000 | 0.6595s | 0.5650s | 0.1154s | 0.1054s | 6.26× faster |
61
+ | whitespace_heavy_20k.csv | 20000 | 1.5723s | 1.2288s | 0.1684s | 0.1555s | 10.11× faster |
62
+ | wide_500_cols_20k.csv | 20000 | 45.2838s | 34.7364s | 7.2952s | 6.9952s | 6.47× faster |
63
+
64
+ ---
65
+
66
+ ## Full Results — all adapters (seconds, best of 2 sessions × 30 runs)
67
+
68
+ | File | Rows | CSV.read¹ | CSV.hashes¹ | CSV.table² | SmarterCSV/C | SmarterCSV/Rb | ZSV.read¹ | ZSV+wrapper² |
69
+ |--------------------------------------|---------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
70
+ | PEOPLE_IMPORT_B.csv | 50000 | 0.2537s | 0.7059s | 2.1440s | 0.0887s | 0.4895s | 0.0323s | 0.2380s |
71
+ | PEOPLE_IMPORT_C.csv | 50000 | 1.4265s | 8.1133s | 22.6230s | 0.1755s | 1.3401s | 0.2209s | 1.2759s |
72
+ | PEOPLE_IMPORT_NB.csv | 50000 | 0.2241s | 0.7087s | 2.2152s | 0.0838s | 0.4749s | 0.0312s | 0.2429s |
73
+ | PEOPLE_IMPORT_NC.csv | 50000 | 0.2847s | 0.8949s | 2.8887s | 0.0598s | 0.4015s | 0.0367s | 0.2192s |
74
+ | uscities.csv | 31257 | 0.5273s | 0.8796s | 1.7620s | 0.0830s | 1.0875s | 0.0244s | 0.2227s |
75
+ | uszips.csv | 33782 | 0.6994s | 1.1180s | 2.2444s | 0.0814s | 1.3326s | 0.0299s | 0.2448s |
76
+ | worldcities.csv | 48059 | 0.6033s | 0.9531s | 1.9404s | 0.0965s | 1.0869s | 0.0262s | 0.2125s |
77
+ | embedded_newlines_20k.csv | 80000 | 0.1511s | 0.2185s | 0.3908s | 0.0545s | 0.4275s | 0.0045s | 0.0373s |
78
+ | embedded_separators_20k.csv | 20000 | 0.1187s | 0.1769s | 0.3856s | 0.0197s | 0.2725s | 0.0051s | 0.0467s |
79
+ | heavy_quoting_20k.csv | 20000 | 0.1128s | 0.2315s | 0.6996s | 0.0367s | 0.5295s | 0.0096s | 0.0740s |
80
+ | long_fields_20k.csv | 20000 | 0.2411s | 0.2812s | 0.6809s | 0.0437s | 2.5469s | 0.0255s | 0.0528s |
81
+ | many_empty_fields_20k.csv | 20000 | 0.1075s | 0.3515s | 0.9626s | 0.0208s | 0.1652s | 0.0145s | 0.0740s |
82
+ | multi_char_separator_20k.csv | 20000 | 0.0790s | 0.1946s | 0.6649s | 0.0334s | 0.1645s | N/A | N/A |
83
+ | sample_10M.csv | 50000 | 0.1506s | 0.2846s | 0.7051s | 0.0347s | 0.2220s | 0.0095s | 0.0759s |
84
+ | sensor_data_50krows_50cols.csv | 50000 | 0.5643s | 2.6419s | 6.2180s | 0.2587s | 1.8147s | 0.0946s | 1.2241s |
85
+ | tab_separated_20k.tsv | 20000 | 0.0805s | 0.2009s | 0.6594s | 0.0244s | 0.1571s | 0.0094s | 0.0740s |
86
+ | utf8_multibyte_20k.csv | 20000 | 0.0638s | 0.1253s | 0.3405s | 0.0150s | 0.1054s | 0.0050s | 0.0420s |
87
+ | whitespace_heavy_20k.csv | 20000 | 0.0897s | 0.2035s | 0.7104s | 0.0294s | 0.1555s | 0.0111s | 0.0834s |
88
+ | wide_500_cols_20k.csv | 20000 | 2.4090s | 32.2438s | 57.6183s | 1.3898s | 6.9952s | 0.3565s | 4.6425s |
89
+
90
+ ---
91
+
92
+ ## Throughput (rows/second) — SmarterCSV 1.16.0 (C accelerated)
93
+
94
+ Higher is better.
95
+
96
+ | File | Rows | CSV.read¹ | CSV.hashes¹ | CSV.table² | SmarterCSV/C | SmarterCSV/Rb | ZSV.read¹ | ZSV+wrapper² |
97
+ |--------------------------------------|---------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
98
+ | PEOPLE_IMPORT_B.csv | 50000 | 197087 | 70828 | 23321 | 563946 | 96638 | 1548001 | 210084 |
99
+ | PEOPLE_IMPORT_C.csv | 50000 | 35052 | 6163 | 2210 | 284899 | 37311 | 226347 | 39190 |
100
+ | PEOPLE_IMPORT_NB.csv | 50000 | 223123 | 70549 | 22571 | 596658 | 105270 | 1602564 | 205983 |
101
+ | PEOPLE_IMPORT_NC.csv | 50000 | 175620 | 55873 | 17309 | 835452 | 124535 | 1362398 | 228086 |
102
+ | uscities.csv | 31257 | 59277 | 35534 | 17740 | 376590 | 28741 | 1281148 | 140367 |
103
+ | uszips.csv | 33782 | 48298 | 30217 | 15051 | 415012 | 25351 | 1130399 | 138005 |
104
+ | worldcities.csv | 48059 | 79667 | 50423 | 24768 | 498016 | 44218 | 1833862 | 226161 |
105
+ | embedded_newlines_20k.csv | 80000 | 529538 | 366143 | 204710 | 1467890 | 186709 | 17586283 | 2144255 |
106
+ | embedded_separators_20k.csv | 20000 | 168462 | 113084 | 51872 | 1015228 | 73382 | 3921569 | 428265 |
107
+ | heavy_quoting_20k.csv | 20000 | 176235 | 86408 | 28588 | 544796 | 37773 | 2088773 | 270270 |
108
+ | long_fields_20k.csv | 20000 | 82969 | 71112 | 29373 | 457666 | 7853 | 784314 | 378788 |
109
+ | many_empty_fields_20k.csv | 20000 | 186043 | 56891 | 20776 | 961538 | 121068 | 1379310 | 270270 |
110
+ | multi_char_separator_20k.csv | 20000 | 253165 | 102769 | 30082 | 598802 | 121618 | N/A | N/A |
111
+ | sample_10M.csv | 50000 | 331928 | 175694 | 70916 | 1441753 | 225225 | 5263158 | 658623 |
112
+ | sensor_data_50krows_50cols.csv | 50000 | 88607 | 18926 | 8041 | 193280 | 27553 | 528434 | 40847 |
113
+ | tab_separated_20k.tsv | 20000 | 248490 | 99572 | 30331 | 819672 | 127370 | 2127660 | 270270 |
114
+ | utf8_multibyte_20k.csv | 20000 | 313480 | 159579 | 58741 | 1333333 | 189753 | 4000000 | 476190 |
115
+ | whitespace_heavy_20k.csv | 20000 | 222933 | 98286 | 28152 | 680272 | 128617 | 1801802 | 239808 |
116
+ | wide_500_cols_20k.csv | 20000 | 8302 | 620 | 347 | 14391 | 2859 | 56108 | 4308 |
117
+
118
+ ---
119
+
120
+ ## Speedup vs SmarterCSV 1.16.0 (C accelerated)
121
+
122
+ | File | Rows | CSV.read¹ | CSV.hashes¹ | CSV.table² | SmarterCSV/C | ZSV.read¹ | ZSV+wrapper² |
123
+ |--------------------------------------|---------|---------------|---------------|---------------|---------------|---------------|---------------|
124
+ | PEOPLE_IMPORT_B.csv | 50000 | 2.86× slower | 7.96× slower | 24.17× slower | ref | 2.74× faster | 2.68× slower |
125
+ | PEOPLE_IMPORT_C.csv | 50000 | 8.13× slower | 46.23× slower | 128.90× slower | ref | 1.26× slower | 7.27× slower |
126
+ | PEOPLE_IMPORT_NB.csv | 50000 | 2.67× slower | 8.46× slower | 26.43× slower | ref | 2.61× faster | 2.90× slower |
127
+ | PEOPLE_IMPORT_NC.csv | 50000 | 4.76× slower | 14.96× slower | 48.31× slower | ref | 1.63× faster | 3.67× slower |
128
+ | uscities.csv | 31257 | 6.35× slower | 10.60× slower | 21.23× slower | ref | 3.40× faster | 2.68× slower |
129
+ | uszips.csv | 33782 | 8.59× slower | 13.74× slower | 27.57× slower | ref | 2.72× faster | 3.01× slower |
130
+ | worldcities.csv | 48059 | 6.25× slower | 9.88× slower | 20.11× slower | ref | 3.68× faster | 2.20× slower |
131
+ | embedded_newlines_20k.csv | 80000 | 2.77× slower | 4.01× slower | 7.17× slower | ref | 12.11× faster | 1.46× faster |
132
+ | embedded_separators_20k.csv | 20000 | 6.02× slower | 8.98× slower | 19.57× slower | ref | 3.87× faster | 2.37× slower |
133
+ | heavy_quoting_20k.csv | 20000 | 3.07× slower | 6.31× slower | 19.06× slower | ref | 3.84× faster | 2.02× slower |
134
+ | long_fields_20k.csv | 20000 | 5.52× slower | 6.43× slower | 15.58× slower | ref | 1.71× faster | 1.21× slower |
135
+ | many_empty_fields_20k.csv | 20000 | 5.17× slower | 16.90× slower | 46.28× slower | ref | 1.40× faster | 3.56× slower |
136
+ | multi_char_separator_20k.csv | 20000 | 2.37× slower | 5.83× slower | 19.91× slower | ref | N/A | N/A |
137
+ | sample_10M.csv | 50000 | 4.34× slower | 8.20× slower | 20.32× slower | ref | 3.65× faster | 2.19× slower |
138
+ | sensor_data_50krows_50cols.csv | 50000 | 2.18× slower | 10.21× slower | 24.04× slower | ref | 2.73× faster | 4.73× slower |
139
+ | tab_separated_20k.tsv | 20000 | 3.30× slower | 8.23× slower | 27.02× slower | ref | 2.57× faster | 3.03× slower |
140
+ | utf8_multibyte_20k.csv | 20000 | 4.25× slower | 8.35× slower | 22.70× slower | ref | 3.33× faster | 2.80× slower |
141
+ | whitespace_heavy_20k.csv | 20000 | 3.05× slower | 6.92× slower | 24.16× slower | ref | 2.78× faster | 2.84× slower |
142
+ | wide_500_cols_20k.csv | 20000 | 1.73× slower | 23.20× slower | 41.46× slower | ref | 3.88× faster | 3.34× slower |
143
+
144
+ ## Fair Comparison: equivalent-output adapters vs CSV.table
145
+
146
+ | File | Rows | CSV.table² | SmarterCSV/C | ZSV+wrapper² |
147
+ |--------------------------------------|---------|---------------|---------------|---------------|
148
+ | PEOPLE_IMPORT_B.csv | 50000 | ref | 24.17× faster | 9.01× faster |
149
+ | PEOPLE_IMPORT_C.csv | 50000 | ref | 128.90× faster | 17.73× faster |
150
+ | PEOPLE_IMPORT_NB.csv | 50000 | ref | 26.43× faster | 9.12× faster |
151
+ | PEOPLE_IMPORT_NC.csv | 50000 | ref | 48.31× faster | 13.12× faster |
152
+ | uscities.csv | 31257 | ref | 21.23× faster | 7.91× faster |
153
+ | uszips.csv | 33782 | ref | 27.57× faster | 9.17× faster |
154
+ | worldcities.csv | 48059 | ref | 20.11× faster | 9.08× faster |
155
+ | embedded_newlines_20k.csv | 80000 | ref | 7.17× faster | 9.56× faster |
156
+ | embedded_separators_20k.csv | 20000 | ref | 19.57× faster | 8.25× faster |
157
+ | heavy_quoting_20k.csv | 20000 | ref | 19.06× faster | 9.46× faster |
158
+ | long_fields_20k.csv | 20000 | ref | 15.58× faster | 12.89× faster |
159
+ | many_empty_fields_20k.csv | 20000 | ref | 46.28× faster | 12.93× faster |
160
+ | multi_char_separator_20k.csv | 20000 | ref | 19.91× faster | N/A |
161
+ | sample_10M.csv | 50000 | ref | 20.32× faster | 9.29× faster |
162
+ | sensor_data_50krows_50cols.csv | 50000 | ref | 24.04× faster | 5.08× faster |
163
+ | tab_separated_20k.tsv | 20000 | ref | 27.02× faster | 8.86× faster |
164
+ | utf8_multibyte_20k.csv | 20000 | ref | 22.70× faster | 8.10× faster |
165
+ | whitespace_heavy_20k.csv | 20000 | ref | 24.16× faster | 8.52× faster |
166
+ | wide_500_cols_20k.csv | 20000 | ref | 41.46× faster | 12.37× faster |
167
+
168
+ ## Head-to-Head: SmarterCSV 1.16.0 (C accelerated) vs ZSV+wrapper
169
+
170
+ | File | Rows | SmarterCSV/C | ZSV+wrapper² |
171
+ |--------------------------------------|---------|---------------|---------------|
172
+ | PEOPLE_IMPORT_B.csv | 50000 | ref | 2.68× slower |
173
+ | PEOPLE_IMPORT_C.csv | 50000 | ref | 7.27× slower |
174
+ | PEOPLE_IMPORT_NB.csv | 50000 | ref | 2.90× slower |
175
+ | PEOPLE_IMPORT_NC.csv | 50000 | ref | 3.67× slower |
176
+ | uscities.csv | 31257 | ref | 2.68× slower |
177
+ | uszips.csv | 33782 | ref | 3.01× slower |
178
+ | worldcities.csv | 48059 | ref | 2.20× slower |
179
+ | embedded_newlines_20k.csv | 80000 | ref | 1.46× faster |
180
+ | embedded_separators_20k.csv | 20000 | ref | 2.37× slower |
181
+ | heavy_quoting_20k.csv | 20000 | ref | 2.02× slower |
182
+ | long_fields_20k.csv | 20000 | ref | 1.21× slower |
183
+ | many_empty_fields_20k.csv | 20000 | ref | 3.56× slower |
184
+ | multi_char_separator_20k.csv         |   20000 |           ref |           N/A |
185
+ | sample_10M.csv | 50000 | ref | 2.19× slower |
186
+ | sensor_data_50krows_50cols.csv | 50000 | ref | 4.73× slower |
187
+ | tab_separated_20k.tsv | 20000 | ref | 3.03× slower |
188
+ | utf8_multibyte_20k.csv | 20000 | ref | 2.80× slower |
189
+ | whitespace_heavy_20k.csv | 20000 | ref | 2.84× slower |
190
+ | wide_500_cols_20k.csv | 20000 | ref | 3.34× slower |
191
+
192
+ ## Raw Parsing: SmarterCSV 1.16.0 (C accelerated) vs ZSV.read
193
+
194
+ | File | Rows | SmarterCSV/C | ZSV.read¹ |
195
+ |--------------------------------------|---------|---------------|---------------|
196
+ | PEOPLE_IMPORT_B.csv | 50000 | ref | 2.74× faster |
197
+ | PEOPLE_IMPORT_C.csv | 50000 | ref | 1.26× slower |
198
+ | PEOPLE_IMPORT_NB.csv | 50000 | ref | 2.61× faster |
199
+ | PEOPLE_IMPORT_NC.csv | 50000 | ref | 1.63× faster |
200
+ | uscities.csv | 31257 | ref | 3.40× faster |
201
+ | uszips.csv | 33782 | ref | 2.72× faster |
202
+ | worldcities.csv | 48059 | ref | 3.68× faster |
203
+ | embedded_newlines_20k.csv | 80000 | ref | 12.11× faster |
204
+ | embedded_separators_20k.csv | 20000 | ref | 3.87× faster |
205
+ | heavy_quoting_20k.csv | 20000 | ref | 3.84× faster |
206
+ | long_fields_20k.csv | 20000 | ref | 1.71× faster |
207
+ | many_empty_fields_20k.csv | 20000 | ref | 1.40× faster |
208
+ | multi_char_separator_20k.csv         |   20000 |           ref |           N/A |
209
+ | sample_10M.csv | 50000 | ref | 3.65× faster |
210
+ | sensor_data_50krows_50cols.csv | 50000 | ref | 2.73× faster |
211
+ | tab_separated_20k.tsv | 20000 | ref | 2.57× faster |
212
+ | utf8_multibyte_20k.csv | 20000 | ref | 3.33× faster |
213
+ | whitespace_heavy_20k.csv | 20000 | ref | 2.78× faster |
214
+ | wide_500_cols_20k.csv | 20000 | ref | 3.88× faster |
215
+
216
+ ---
217
+
218
+ ¹ **Raw output** — no post-processing applied. Returns plain arrays or string-keyed hashes.
219
+ No header normalization, type conversion, whitespace stripping, or empty-value removal.
220
+ You must add your own post-processing to produce usable data.
221
+
222
+ ² **Near-equivalent** to SmarterCSV output (symbol keys, numeric conversion), but not 100%
223
+ identical. Whitespace handling, empty-value removal, and duplicate-header behavior may differ.
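
The footnotes distinguish raw output from near-equivalent output. To make that gap concrete, here is a minimal sketch of the kind of post-processing raw rows would need, written against Ruby's standard `csv` library. The helper names `normalize_rows` and `convert_numeric` are hypothetical and exist only for illustration; this is not the wrapper used in the benchmarks, nor SmarterCSV's own implementation, and the header and numeric rules are deliberately simplified assumptions.

```ruby
require "csv"

# Hypothetical helper: converts a numeric-looking string to Integer or Float,
# and returns any other string unchanged.
def convert_numeric(value)
  case value
  when /\A-?\d+\z/      then value.to_i
  when /\A-?\d+\.\d+\z/ then value.to_f
  else value
  end
end

# Hypothetical helper: parses CSV text into raw string-keyed rows, then applies
# the four steps named in footnote 1: header normalization, whitespace
# stripping, empty-value removal, and numeric conversion.
def normalize_rows(csv_text)
  CSV.parse(csv_text, headers: true).map do |row|
    row.to_h.each_with_object({}) do |(header, value), out|
      next if header.nil?                  # skip fields that have no header

      value = value.to_s.strip             # whitespace stripping
      next if value.empty?                 # empty-value removal

      key = header.strip.downcase.gsub(/\s+/, "_").to_sym  # header normalization
      out[key] = convert_numeric(value)    # numeric conversion
    end
  end
end

sample = <<~CSV
  First Name , Age ,City
  Ada , 36 ,
  Bob,41.5, Paris
CSV

rows = normalize_rows(sample)
p rows
# => [{first_name: "Ada", age: 36}, {first_name: "Bob", age: 41.5, city: "Paris"}]
# (Ruby versions before 3.4 print the same hashes as {:first_name=>"Ada", ...})
```

Every step here runs in Ruby for each field of each row, which is one plausible reason why the raw-output columns and the equivalent-output columns in the tables above are not directly comparable.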