smarter_csv 1.15.2 → 1.16.1

Files changed (50)
  1. checksums.yaml +4 -4
  2. data/.rspec +2 -0
  3. data/.rubocop.yml +9 -0
  4. data/CHANGELOG.md +112 -1
  5. data/CONTRIBUTORS.md +4 -1
  6. data/Gemfile +1 -0
  7. data/README.md +129 -27
  8. data/docs/_introduction.md +45 -24
  9. data/docs/bad_row_quarantine.md +342 -0
  10. data/docs/basic_read_api.md +152 -9
  11. data/docs/basic_write_api.md +475 -59
  12. data/docs/batch_processing.md +162 -4
  13. data/docs/column_selection.md +184 -0
  14. data/docs/data_transformations.md +163 -29
  15. data/docs/examples.md +340 -46
  16. data/docs/header_transformations.md +94 -12
  17. data/docs/header_validations.md +57 -18
  18. data/docs/history.md +119 -0
  19. data/docs/instrumentation.md +166 -0
  20. data/docs/migrating_from_csv.md +565 -0
  21. data/docs/options.md +151 -87
  22. data/docs/parsing_strategy.md +64 -1
  23. data/docs/real_world_csv.md +263 -0
  24. data/docs/releases/1.16.0/benchmarks.md +223 -0
  25. data/docs/releases/1.16.0/changes.md +273 -0
  26. data/docs/releases/1.16.0/performance_notes.md +114 -0
  27. data/docs/row_col_sep.md +15 -5
  28. data/docs/ruby_csv_pitfalls.md +514 -0
  29. data/docs/value_converters.md +194 -57
  30. data/ext/smarter_csv/extconf.rb +3 -0
  31. data/ext/smarter_csv/smarter_csv.c +1017 -82
  32. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
  33. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
  34. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
  35. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
  36. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
  37. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
  38. data/lib/smarter_csv/errors.rb +8 -0
  39. data/lib/smarter_csv/file_io.rb +1 -1
  40. data/lib/smarter_csv/hash_transformations.rb +14 -13
  41. data/lib/smarter_csv/header_transformations.rb +21 -2
  42. data/lib/smarter_csv/headers.rb +2 -1
  43. data/lib/smarter_csv/options.rb +124 -7
  44. data/lib/smarter_csv/parser.rb +358 -74
  45. data/lib/smarter_csv/reader.rb +494 -46
  46. data/lib/smarter_csv/version.rb +1 -1
  47. data/lib/smarter_csv/writer.rb +71 -19
  48. data/lib/smarter_csv.rb +134 -13
  49. data/smarter_csv.gemspec +20 -10
  50. metadata +38 -80
@@ -0,0 +1,273 @@
1
+
2
+ ### Contents
3
+
4
+ * [Introduction](../../_introduction.md)
5
+ * [Migrating from Ruby CSV](../../migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](../../ruby_csv_pitfalls.md)
7
+ * [Parsing Strategy](../../parsing_strategy.md)
8
+ * [The Basic Read API](../../basic_read_api.md)
9
+ * [The Basic Write API](../../basic_write_api.md)
10
+ * [Batch Processing](../../batch_processing.md)
11
+ * [Configuration Options](../../options.md)
12
+ * [Row and Column Separators](../../row_col_sep.md)
13
+ * [Header Transformations](../../header_transformations.md)
14
+ * [Header Validations](../../header_validations.md)
15
+ * [Column Selection](../../column_selection.md)
16
+ * [Data Transformations](../../data_transformations.md)
17
+ * [Value Converters](../../value_converters.md)
18
+ * [Bad Row Quarantine](../../bad_row_quarantine.md)
19
+ * [Instrumentation Hooks](../../instrumentation.md)
20
+ * [Examples](../../examples.md)
21
+ * [Real-World CSV Files](../../real_world_csv.md)
22
+ * [SmarterCSV over the Years](../../history.md)
23
+ * [**Release Notes**](./changes.md)
24
+
25
+ --------------
26
+
27
+ # SmarterCSV 1.16.0 — Changes
28
+
29
+ RSpec tests: **714 → 1,247** (+533 tests)
30
+
31
+ ---
32
+
33
+ ## Minor Breaking Change
34
+
35
+ New option **`quote_boundary:`**
36
+ * defaults to **`:standard`**: quotes are now only recognized as field delimiters at field boundaries;
37
+ mid-field quotes are treated as literal characters.
38
+
39
+ This aligns SmarterCSV with RFC 4180 and other CSV libraries. In practice, mid-field quotes
40
+ were already producing silently corrupt output in previous versions — so most users will see
41
+ correctness improve, not regress.
42
+
43
+ * Use `quote_boundary: :legacy` only in exceptional cases to restore previous behavior. See [Parsing Strategy](../../parsing_strategy.md).
44
+
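Illustration, not part of the released file: a toy splitter that shows what the two modes mean for a mid-field quote such as an inch mark. `split_fields` and its `boundary:` keyword are hypothetical names chosen for this sketch. It is not SmarterCSV's parser, and it ignores escaped quotes and multiline rows.

```ruby
# Toy field splitter (hypothetical helper, NOT SmarterCSV's parser).
# boundary: :standard -> a quote only opens a quoted field at the start of a field
# boundary: :legacy   -> any quote toggles the quoted state
def split_fields(line, col_sep: ",", quote: '"', boundary: :standard)
  fields = []
  field = +""
  in_quotes = false
  line.each_char do |ch|
    if ch == quote
      if in_quotes
        in_quotes = false                       # closing quote
      elsif boundary == :legacy || field.empty?
        in_quotes = true                        # opening quote
      else
        field << ch                             # mid-field quote kept as a literal
      end
    elsif ch == col_sep && !in_quotes
      fields << field
      field = +""
    else
      field << ch
    end
  end
  fields << field
end

line = 'a,5" pipe,c'
p split_fields(line, boundary: :standard)   # => ["a", "5\" pipe", "c"]
p split_fields(line, boundary: :legacy)     # => ["a", "5 pipe,c"]
```

In the toy's `:legacy` mode the inch mark opens a quoted state that never closes, so the following separator is swallowed into the field. That is the kind of silent corruption the new default is meant to avoid.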
45
+ ---
46
+
47
+ ## Performance Improvements
48
+
49
+ ### Net Benchmark Result (C-accelerated, Apple M1, Ruby 3.4.7)
50
+
51
+ | Comparison | Range |
52
+ |---|---|
53
+ | vs Ruby `CSV.read` † | **2×–8× faster** |
54
+ | vs Ruby `CSV.table` ‡ | **7×–129× faster** |
55
+ | vs SmarterCSV 1.14.4 (C-path) | **9×–65× faster** |
56
+ | vs SmarterCSV 1.15.2 (C-path) | **up to 2.4× faster** |
57
+ | vs SmarterCSV 1.15.2 (Ruby-path) | **up to 2× faster** |
58
+
59
+ † `CSV.read` returns raw arrays of arrays; hash construction, key normalization, and type conversion still need to happen, so this comparison understates the real cost difference.
60
+
61
+ ‡ `CSV.table` is the closest Ruby equivalent to SmarterCSV: both return symbol-keyed rows with numeric conversion applied.
62
+
63
+ ![SmarterCSV 1.16.0 vs previous versions — C-accelerated path](../../../images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg)
64
+
65
+ ![SmarterCSV 1.16.0 vs previous versions — Ruby path](../../../images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg)
66
+
67
+ See [performance_notes.md](performance_notes.md) and [benchmarks.md](benchmarks.md).
68
+
69
+ ### C Extension
70
+
71
+ - **ParseContext architecture**: All per-file parse options are now wrapped in a GC-managed
72
+ `TypedData` object (`parse_context_t`) built once after headers are loaded. Eliminates
73
+ ~10 `rb_hash_aref` calls per row that previously read directly from the options hash on
74
+ every row.
75
+ - **Column-filter bitmap**: `_keep_bitmap` precomputed as a packed binary `String` — one
76
+ `memcpy`-style check per row replaces N `rb_ary_entry` calls. Loop invariants
77
+ `_keep_extra_cols` and `_early_exit_after` precomputed once; `_keep_cols=false` sentinel
78
+ skips bitmap logic entirely on files without column selection (one `!= Qfalse` test per row).
79
+ - **Section 4 fast-path split**: The C unquoted inner loop is split into two sub-paths —
80
+ plain unquoted vs. boundary-aware `:standard` mode — so the common case avoids all
81
+ quote-boundary state tracking. `__builtin_expect` hints applied to both guards.
82
+ - **Section 2 lazy lookups**: `quote_escaping` / `quote_boundary` reads moved from
83
+ unconditional Section 2 (every row) to Section 5 (quoted-field path only).
84
+ `only_headers` / `except_headers` / `strict` lookups guarded by `_keep_cols` nil-check.
85
+ Duplicate `row_sep` lookup removed.
86
+ - **Byte-level indexing**: All `line[i]` character lookups inside inner loops replaced with
87
+ `line.getbyte(i)` (returns an Integer directly, ~5–10 ns, zero allocation vs. ~30–50 ns
88
+ one-char String per call). Field extraction switched to `line.byteslice(start, len)`.
89
+ `col_sep_byte` and `quote_byte` precomputed as integers.
90
+ - **Skip-ahead in quoted fields**: `memchr` jump to next quote character instead of advancing
91
+ one byte at a time inside quoted fields.
92
+ - **Skip-ahead for unquoted fields in `:standard` mode**: Once a field is confirmed unquoted,
93
+ `String#index` jumps directly to the next `col_sep`, bypassing per-character state checks.
94
+ - **Compiler flag `-fno-semantic-interposition`**: Added to `extconf.rb` for GCC/Clang
95
+ (excluded from MSVC). Enables more aggressive LTO inlining and bypasses the PLT for
96
+ intra-library calls on Linux.
97
+ - **`cold`/`hot` function attributes + compiler hints**: Applied to rarely-executed paths and
98
+ hot inner loops respectively to guide branch predictor and instruction cache layout.
99
+
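Illustration, not part of the released file: the byte-level indexing bullet can be sketched in plain Ruby. `split_unquoted_bytes` is a hypothetical helper written for this note, assuming a single-byte separator and no quoted fields. It only shows how `String#getbyte` and `String#byteslice` replace per-character `line[i]` lookups.

```ruby
# Split an unquoted line on a single-byte separator using byte offsets only.
# Hypothetical helper for illustration; not taken from SmarterCSV.
def split_unquoted_bytes(line, col_sep = ",")
  sep_byte = col_sep.getbyte(0)   # precomputed once as an Integer
  fields = []
  start = 0
  i = 0
  len = line.bytesize
  while i < len
    if line.getbyte(i) == sep_byte                 # Integer compare, no String allocated
      fields << line.byteslice(start, i - start)   # one allocation per field
      start = i + 1
    end
    i += 1
  end
  fields << line.byteslice(start, len - start)
end

p split_unquoted_bytes("id,name,café")   # => ["id", "name", "café"]
p "a,b".getbyte(1)                       # => 44
```

Because the comma is a single ASCII byte, it can never match a byte inside a multi-byte UTF-8 character, so slicing at byte offsets keeps `café` intact.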
100
+ ### Ruby Path
101
+
102
+ - **Unquoted fast path — direct hash construction**: `parse_line_to_hash_ruby` builds the
103
+ result hash directly from `String#split` for unquoted lines. Eliminates the intermediate
104
+ `Array` from `parse_csv_line_ruby` and a second full-row iteration. Uses integer-index
105
+ `while` loops instead of Ruby enumerators.
106
+ - **`byteindex` skip-ahead**: Inside quoted fields, `String#byteindex` (Ruby 3.2+) or inline
107
+ `getbyte` scan jumps to next quote or col_sep at C speed. Falls back correctly on
108
+ JRuby/TruffleRuby.
109
+ - **Empty field skipping inline**: `remove_empty_values` now filters empty fields inline
110
+ during hash building rather than post-processing. Combined with `strip_whitespace: true`
111
+ (default), catches both empty and whitespace-only fields without regex.
112
+ - **Quoted field extraction**: Content extracted directly with `byteslice` excluding
113
+ surrounding quotes; avoids double allocation. In-place `.strip!` on fresh byteslice avoids
114
+ a second allocation.
115
+ - **Backslash detection fast-path**: In `:auto` quote_escaping mode, when the line contains no
116
+ backslash character, skips the backslash-try dance and calls RFC 4180 mode directly.
117
+ - **Hot-path option caching**: `@hot_path_options`, `@quote_escaping_backslash`,
118
+ `@quote_escaping_double`, `@delete_nil_keys`, `@delete_empty_keys`, `@quote_char`, and
119
+ `@field_size_limit` precomputed as ivars once after headers are loaded — all per-row
120
+ option-hash lookups replaced by cheap ivar reads.
121
+ - **Multiline gate optimization**: `detect_multiline_strict` used as a cheap gate in the
122
+ stitch loop; avoids N-2 full re-parses per multiline row in the Ruby path.
123
+
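Illustration, not part of the released file: a simplified stand-in for the "direct hash construction" and "empty field skipping inline" ideas. `row_to_hash` and its keyword arguments are hypothetical; the real `parse_line_to_hash_ruby` is not shown in this diff excerpt and will differ.

```ruby
# Build the row hash straight from String#split, skipping empty fields inline.
# Hypothetical helper; a simplified sketch of the idea, not the library's code.
def row_to_hash(line, headers, col_sep: ",", remove_empty_values: true)
  values = line.split(col_sep, -1)   # -1 keeps trailing empty fields
  hash = {}
  i = 0
  n = headers.size
  while i < n                        # integer-index loop, no enumerator or block
    value = values[i]
    value = value.strip if value     # whitespace-only fields become empty
    unless remove_empty_values && (value.nil? || value.empty?)
      hash[headers[i]] = value
    end
    i += 1
  end
  hash
end

headers = %i[id name city]
p row_to_hash("1, Ada ,", headers)   # id and name are kept; the empty city field is dropped
```

Filtering while the hash is being built avoids a second pass over the row, which is the saving the bullets above describe.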
124
+ ---
125
+
126
+ ## New Features
127
+
128
+ ### Reader
129
+
130
+ **New top-level API:**
131
+
132
+ - **`SmarterCSV.parse(csv_string, options = {})`**: Parse a CSV string directly without
133
+ wrapping in `StringIO`. Drop-in equivalent of `CSV.parse(str, headers: true,
134
+ header_converters: :symbol)` with numeric conversion included. See
135
+ [Migrating from Ruby CSV](../../migrating_from_csv.md).
136
+ - **`SmarterCSV.each(input, options = {}, &block)`**: Row-by-row enumerator yielding each
137
+ row as a `Hash`. Returns an `Enumerator` when called without a block.
138
+ - **`SmarterCSV.each_chunk(input, options = {}, &block)`**: Chunked enumerator yielding
139
+ `(Array<Hash>, chunk_index)`. Requires `chunk_size` in options. Returns an `Enumerator`
140
+ without a block.
141
+
142
+ **New `Reader` instance methods:**
143
+
144
+ - **`Reader#each { |hash| }`**: Yields each row as a `Hash`. `Reader` now includes
145
+ `Enumerable` (enables `map`, `select`, `lazy`, etc.).
146
+ - **`Reader#each_chunk { |chunk, index| }`**: Yields each chunk plus 0-based chunk index.
147
+
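Illustration, not part of the released file: the usual Ruby pattern behind an `each` that yields rows with a block and returns an `Enumerator` without one. `ToyReader` is a hypothetical class over an in-memory array. Unlike the library, which reads `chunk_size` from its options, the toy takes it as a method argument.

```ruby
# Toy reader (hypothetical class) showing the Enumerable pattern.
class ToyReader
  include Enumerable

  def initialize(rows)
    @rows = rows
  end

  # Yields each row hash; returns an Enumerator when no block is given.
  def each
    return enum_for(:each) unless block_given?

    @rows.each { |row| yield row }
    self
  end

  # Yields [chunk, index] pairs with a 0-based chunk index.
  def each_chunk(chunk_size)
    return enum_for(:each_chunk, chunk_size) unless block_given?

    each_slice(chunk_size).with_index { |chunk, index| yield chunk, index }
  end
end

reader = ToyReader.new([{ id: 1, name: "Ada" }, { id: 2, name: "Bob" }])
p reader.map { |row| row[:name] }          # => ["Ada", "Bob"]
p reader.each.class                        # => Enumerator
reader.each_chunk(1) { |chunk, index| puts "chunk #{index}: #{chunk.size} row(s)" }
```

Defining `each` and including `Enumerable` is all that is needed for `map`, `select`, `lazy`, and `each_slice` to work on the object.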
148
+ **New options:**
149
+
150
+ - **`quote_boundary: :standard`** *(default — breaking change)*: Quotes are only recognized
151
+ as field delimiters at field boundaries; mid-field quotes are treated as literal characters.
152
+ Use `quote_boundary: :legacy` to restore previous behavior.
153
+ - **`quote_escaping: :auto`** *(default)*: Tries backslash interpretation first; automatically
154
+ downgrades to RFC 4180 when no backslash is present in the line. Also accepts `:backslash`
155
+ and `:double_quotes`.
156
+ - **`headers: { only: [...] }`**: Keep only the specified columns in each result hash.
157
+ Excluded columns are skipped in the C hot path — no string allocation, no conversion, no
158
+ hash insertion. See [Column Selection](../../column_selection.md).
159
+ - **`headers: { except: [...] }`**: Remove the specified columns from each result hash. Same
160
+ hot-path optimization. Cannot be combined with `headers: { only: }`.
161
+ - **`on_bad_row:`**: Controls behavior when a row raises a parse error. Values: `:raise`
162
+ (default), `:skip`, `:collect`, or a callable. With `:collect`, error records accumulate in
163
+ `reader.errors[:bad_rows]`. See [Bad Row Quarantine](../../bad_row_quarantine.md).
164
+ - **`bad_row_limit: N`**: Raises `SmarterCSV::TooManyBadRows` after N bad rows. Default: `nil`
165
+ (unlimited).
166
+ - **`collect_raw_lines: true`** *(default)*: Include the raw stitched line in bad-row error
167
+ records. Set to `false` for privacy or memory savings.
168
+ - **`field_size_limit: N`**: Maximum size of any extracted field in bytes. Raises
169
+ `SmarterCSV::FieldSizeLimitExceeded` if a field or accumulating multiline buffer exceeds
170
+ the limit. Prevents DoS from runaway quoted fields. See
171
+ [Bad Row Quarantine](../../bad_row_quarantine.md#limiting-field-size-field_size_limit).
172
+ - **`nil_values_matching: regex`**: Set matching values to `nil` via regular expression. With
173
+ `remove_empty_values: true` (default), nil-ified values are removed. With
174
+ `remove_empty_values: false`, the key is retained with a `nil` value. Replaces deprecated
175
+ `remove_values_matching:`.
176
+ - **`missing_headers: :auto`** *(default)*: Auto-generate names for extra columns using
177
+ `missing_header_prefix` (e.g. `column_7`, `column_8`). Use `:raise` to raise
178
+ `HeaderSizeMismatch` instead. Replaces deprecated `strict:`.
179
+ - **`verbose: :quiet / :normal / :debug`**: Symbol-based verbosity levels. `:quiet` suppresses
180
+ all output; `:normal` (default) shows behavioral warnings; `:debug` adds computed options and
181
+ per-row diagnostics to `$stderr`. Replaces deprecated `verbose: true/false`.
182
+ - New Instrumentation Hooks: See [Instrumentation Hooks](../../instrumentation.md).
183
+ - **`on_start: callable`**: Fires once before the first row with
184
+ `{ input:, file_size:, col_sep:, row_sep: }`.
185
+ - **`on_chunk: callable`**: Fires after each chunk (chunked mode only) with
186
+ `{ chunk_number:, rows_in_chunk:, total_rows_so_far: }`.
187
+ - **`on_complete: callable`**: Fires after the file is exhausted with
188
+ `{ total_rows:, total_chunks:, duration:, bad_rows: }`.
189
+
190
+
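Illustration, not part of the released file: the documented interaction between `nil_values_matching:` and `remove_empty_values:` can be mimicked with a few lines of plain Ruby. `apply_nil_values` is a hypothetical helper that works on an already-parsed hash; it is an assumption about the observable behavior described above, not the library's implementation.

```ruby
# Hypothetical helper mimicking the documented behavior; not the library's code.
def apply_nil_values(row, pattern, remove_empty_values: true)
  result = {}
  row.each do |key, value|
    value = nil if value.is_a?(String) && value.match?(pattern)
    next if remove_empty_values && value.nil?   # nil-ified values are removed

    result[key] = value                         # otherwise the key is kept, possibly with nil
  end
  result
end

row = { id: "7", score: "N/A", note: "ok" }
p apply_nil_values(row, /\AN\/A\z/)                              # score is dropped
p apply_nil_values(row, /\AN\/A\z/, remove_empty_values: false)  # score is kept as nil
```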
191
+ **New exceptions:**
192
+
193
+ - **`SmarterCSV::FieldSizeLimitExceeded`**: Raised when `field_size_limit` is exceeded.
194
+ - **`SmarterCSV::TooManyBadRows`**: Raised when `bad_row_limit` is exceeded.
195
+
196
+ **Deprecations:**
197
+
198
+ - `only_headers:` → use `headers: { only: }`
199
+ - `except_headers:` → use `headers: { except: }`
200
+ - `remove_values_matching:` → use `nil_values_matching:`
201
+ - `strict: true` → use `missing_headers: :raise`
202
+ - `strict: false` → use `missing_headers: :auto`
203
+ - `verbose: true` → use `verbose: :debug`
204
+ - `verbose: false` → use `verbose: :normal`
205
+
206
+ ### Writer
207
+
208
+ - **IO and StringIO support**: `SmarterCSV.generate` and `SmarterCSV::Writer.new` now accept
209
+ any `IO`-compatible object (responding to `#write`) in addition to a file path or
210
+ `Pathname`. The caller retains ownership of passed-in IO objects.
211
+ - **`SmarterCSV.generate` returns a String when called without a destination**: Omit the file
212
+ argument and the CSV is written to an internal buffer and returned as a `String`. Options
213
+ hash can be passed as the sole argument.
214
+ - **Streaming mode for known headers**: When `headers:` or `map_headers:` is provided at
215
+ construction time, the Writer skips the internal temp file entirely — the header line is
216
+ written immediately and each `<<` streams directly to the output file. No API change;
217
+ existing code benefits automatically. See [The Basic Write API](../../basic_write_api.md).
218
+ - **`encoding:` option**: Specifies the file encoding (e.g. `'UTF-8'`, `'ISO-8859-1'`).
219
+ Supports Ruby's `'external:internal'` transcoding notation. Only applies when writing to a
220
+ file path; ignored for IO objects.
221
+ - **`write_nil_value:` option** *(default: `''`)*: String written in place of `nil` field
222
+ values.
223
+ - **`write_empty_value:` option** *(default: `''`)*: String written in place of empty-string
224
+ field values, including missing keys.
225
+ - **`write_bom:` option** *(default: `false`)*: Prepends a UTF-8 BOM (`\xEF\xBB\xBF`) to the
226
+ output. Useful for Excel compatibility with non-ASCII content.
227
+
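Illustration, not part of the released file: what prepending a UTF-8 BOM to an IO-like destination amounts to. `write_with_bom` and `UTF8_BOM` are hypothetical names used only in this sketch, which relies on `StringIO` from the standard library rather than on the Writer itself.

```ruby
require "stringio"

# U+FEFF encodes to the three bytes EF BB BF in UTF-8.
UTF8_BOM = "\uFEFF".freeze

# Hypothetical helper: write the BOM, then the CSV text, to any object responding to #write.
def write_with_bom(io, csv_text)
  io.write(UTF8_BOM)
  io.write(csv_text)
  io                      # the caller keeps ownership and decides when to close
end

out = StringIO.new(+"")
write_with_bom(out, "name\nZoë\n")
p out.string.bytes.first(3)   # => [239, 187, 191]
```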
228
+ ---
229
+
230
+ ## Bug Fixes
231
+
232
+ ### Reader
233
+
234
+ - **Mid-field quotes no longer corrupt unquoted fields**: `quote_boundary: :standard` (now the
235
+ default) prevents a quote character mid-field (e.g. `b"bb`) from toggling quoted state. This
236
+ silently corrupted rows in 1.15.2 when data contained apostrophes or inch marks.
237
+ - **Unclosed-quote fallback in `:auto` mode**: When backslash mode encounters an unclosed quote
238
+ at EOL, the parser now tries RFC 4180 mode as a fallback before treating the row as multiline.
239
+ - **Empty headers bug fixed** ([#324](https://github.com/tilo/smarter_csv/issues/324),
240
+ [#312](https://github.com/tilo/smarter_csv/issues/312)): CSV files with empty or
241
+ whitespace-only header fields (e.g. `name,,`) now auto-generate column names using
242
+ `missing_header_prefix` (default: `column_1`, `column_2`, …).
243
+ - **All library output now goes to `$stderr`**: Behavioral warnings use `warn` (suppressible
244
+ via `-W0` or `verbose: :quiet`); debug diagnostics use `$stderr.puts`. Nothing is written to
245
+ `$stdout`.
246
+ - **`SmarterCSV.generate` raises `ArgumentError`** (not a blank `RuntimeError`) when called
247
+ without a block.
248
+
249
+ ### Writer
250
+
251
+ - **Temp file no longer hardcoded to `/tmp`**: Fixes `Errno::ENOENT` on Windows.
252
+ - **Temp file properly cleaned up**: `Tempfile#close!` now used instead of `Tempfile#delete`,
253
+ ensuring the file is both closed and unlinked.
254
+ - **`StringIO` handling**: Writing to a `StringIO` no longer attempts to close it on
255
+ `finalize`.
256
+
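Illustration, not part of the released file: the difference between the two `Tempfile` calls named in the cleanup fix, shown with the standard library alone. The variable names are arbitrary.

```ruby
require "tempfile"

# Tempfile#delete (alias of #unlink) removes the file but leaves the handle open;
# Tempfile#close! closes the handle and unlinks the file in one call.
tmp = Tempfile.new("sketch")   # Tempfile.new places the file under Dir.tmpdir by default
path = tmp.path
tmp.write("a,b\n")
tmp.close!

p tmp.closed?          # => true
p File.exist?(path)    # => false
```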
257
+ ---
258
+
259
+ ## Misc
260
+
261
+ - **`@mapped_keys` changed from `Array` to `Set`**: O(1) lookup per field instead of O(n)
262
+ scan on the `value_converters` key check.
263
+ - **`escape_csv_field` micro-optimizations**: `@escaped_quote_char` precomputed once in
264
+ `initialize`; redundant `.to_s` call removed; row separator appended with `<<` (mutating)
265
+ instead of `+` to save one string allocation per row.
266
+ - **`Reader` includes `Enumerable`**: Enables `map`, `select`, `reject`, `lazy`, and other
267
+ Enumerable methods on `Reader#each` results.
268
+ - **`DEFAULT_CHUNK_SIZE = 100`**: Constant added; warning emitted when `each_chunk` is called
269
+ without explicit `chunk_size`.
270
+
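Illustration, not part of the released file: the two micro-optimizations above rest on standard Ruby behavior that is easy to check in isolation. The variable names below are made up for the sketch.

```ruby
require "set"

# Set membership is a hash lookup; Array#include? scans element by element.
mapped_keys = Set[:first_name, :last_name, :email]
p mapped_keys.include?(:email)   # => true

# String#<< appends in place; String#+ allocates a new String.
line = +"a,b,c"
same = line << "\n"
copy = line + "\n"
p same.equal?(line)   # => true  (same object, mutated)
p copy.equal?(line)   # => false (new object)
```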
271
+ ---
272
+
273
+ PREVIOUS: [SmarterCSV over the Years](../../history.md) | UP: [README](../../../README.md)
@@ -0,0 +1,114 @@
1
+ # SmarterCSV 1.16.0 — Performance Notes
2
+
3
+ Measured on Apple M1 Pro, Ruby 3.4.7, best of two benchmark sessions (30 runs each).
4
+ See [benchmarks.md](benchmarks.md) for full tables.
5
+
6
+ ---
7
+
8
+ ## vs Ruby CSV
9
+
10
+ ### vs CSV.read (raw tokenization only — no hashes, no post-processing)
11
+
12
+ `CSV.read` is the *fastest* Ruby CSV mode. It returns plain string arrays with no header
13
+ handling, no symbol keys, no numeric conversion. SmarterCSV/C delivers fully processed
14
+ hashes — and still beats it on every single file:
15
+
16
+ | Range | Files |
17
+ |--------------|--------------------------------------------------------------------|
18
+ | **8–9×** | PEOPLE_IMPORT_C (8.1×), uszips (8.6×) |
19
+ | **6–7×** | uscities (6.4×), worldcities (6.3×), embedded_sep (6.0×) |
20
+ | **4–5×** | PEOPLE_IMPORT_NC (4.8×), long_fields (5.5×), many_empty (5.2×), sample_10M (4.3×), utf8 (4.3×) |
21
+ | **3×** | heavy_quoting (3.1×), tab_sep (3.3×), whitespace (3.1×), embedded_newlines (2.8×) |
22
+ | **2–3×** | PEOPLE_IMPORT_B (2.9×), PEOPLE_IMPORT_NB (2.7×), sensor_data (2.2×), multi_char (2.4×) |
23
+ | **~1.7×** | wide_500_cols (1.7×) — most column-heavy file, hash overhead visible |
24
+
25
+ **Summary: 1.7×–8.6× faster than CSV.read, while returning fully processed hashes.**
26
+
27
+ ### vs CSV.table (symbol keys + numeric conversion — nearest equivalent output)
28
+
29
+ `CSV.table` is the fairest apples-to-apples comparison: it also produces symbol-keyed
30
+ rows with type conversion applied. SmarterCSV/C is dramatically faster:
31
+
32
+ | Range | Files |
33
+ |----------------|-----------------------------------------------------------------|
34
+ | **100×+** | PEOPLE_IMPORT_C (129×) |
35
+ | **40–50×** | PEOPLE_IMPORT_NC (48×), many_empty (46×), wide_500_cols (41×) |
36
+ | **20–30×** | PEOPLE_IMPORT_B (24×), PEOPLE_IMPORT_NB (26×), uszips (28×), tab_sep (27×), whitespace (24×), sensor_data (24×), utf8 (23×), multi_char (20×), worldcities (20×), sample_10M (20×) |
37
+ | **15–20×** | uscities (21×), long_fields (16×), heavy_quoting (19×), embedded_sep (20×) |
38
+ | **7×** | embedded_newlines (7×) — multiline rows, overhead unavoidable |
39
+
40
+ **Summary: 7×–129× faster than CSV.table.**
41
+
42
+ ---
43
+
44
+ ## vs SmarterCSV 1.15.2
45
+
46
+ ### C path
47
+
48
+ | Gain | Files |
49
+ |--------------|---------------------------------------------------------------------|
50
+ | **2.4×** | long_fields — biggest win; `memchr` skip-ahead in quoted fields |
51
+ | **1.5×** | heavy_quoting — same skip-ahead benefit |
52
+ | **1.4×** | tab_separated |
53
+ | **1.2–1.3×** | embedded_sep, utf8, PEOPLE_IMPORT_C/NC, worldcities, whitespace, multi_char |
54
+ | **1.1–1.2×** | PEOPLE_IMPORT_B/NB, uszips, sample_10M, wide_500_cols |
55
+ | **~1.0×** | sensor_data, embedded_newlines (within noise) |
56
+
57
+ 15 of 19 files are measurably faster; 2 within noise; 2 files show a small regression
58
+ (PEOPLE_IMPORT_NB −7%, wide_500_cols −5%) attributable to the new `quote_boundary: :standard`
59
+ default adding one extra state check on the unquoted fast path.
60
+
61
+ ### Ruby path
62
+
63
+ | Gain | Files |
64
+ |--------------|---------------------------------------------------------------------|
65
+ | **1.9×** | PEOPLE_IMPORT_C (117 cols) — direct hash construction bypasses intermediate Array |
66
+ | **1.5×** | PEOPLE_IMPORT_NC, multi_char_sep |
67
+ | **1.0–1.1×** | most other files |
68
+
69
+ The Ruby path gains are concentrated on wide/complex files where the direct-hash
70
+ construction optimization (Opt #11) has the most impact.
71
+
72
+ ---
73
+
74
+ ## vs SmarterCSV 1.14.4
75
+
76
+ C path is **9×–65× faster** across all 19 benchmark files:
77
+
78
+ - Long fields: **65×** (v1.15.0 introduced `memchr` skip-ahead)
79
+ - PEOPLE_IMPORT_C: **48×** (117 cols × 50k rows)
80
+ - PEOPLE_IMPORT_NC, multi_char_sep: **~21–24×**
81
+ - Typical real-world file: **10–20×**
82
+ - Minimum: **9.8×** (uscities, embedded_newlines)
83
+
84
+ ---
85
+
86
+ ## vs ZSV (C library, GC disabled)
87
+
88
+ ZSV is a dedicated C CSV library with GC disabled during measurement (working around a
89
+ bug in zsv-ruby 1.3.1 on Ruby 3.4.x). Despite this advantage:
90
+
91
+ **SmarterCSV/C beats ZSV+wrapper** (the fair comparison — both return processed hashes)
92
+ on 18 of 19 files, by **2–7×**. ZSV+wrapper is faster only on `embedded_newlines`
93
+ (1.5×), where ZSV's chunked I/O is particularly efficient.
94
+
95
+ **SmarterCSV/C vs ZSV.read** (raw arrays, GC disabled): ZSV.read is faster on most files
96
+ (2–12×), which is expected — it does far less work and has GC disabled. SmarterCSV/C
97
+ matches or beats ZSV.read on PEOPLE_IMPORT_C (the 117-column file) and PEOPLE_IMPORT_NC,
98
+ where our C hash-building overhead is proportionally small.
99
+
100
+ ---
101
+
102
+ ## column_selection speedup (`headers: { only: }`)
103
+
104
+ When using `headers: { only: [...] }` to select a subset of columns, excluded columns
105
+ are skipped entirely in the C hot path — no string allocation, no conversion, no hash
106
+ insertion. Benchmark on `wide_500_cols_20k.csv` (500 columns):
107
+
108
+ | Columns kept | Speedup vs no selection |
109
+ |---|---|
110
+ | 2 of 500 | ~16× faster |
111
+ | 10 of 500 | ~8× faster |
112
+ | 50 of 500 | ~3× faster |
113
+
114
+ This is additive on top of the baseline gains above.
data/docs/row_col_sep.md CHANGED
@@ -2,6 +2,8 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
5
7
  * [Parsing Strategy](./parsing_strategy.md)
6
8
  * [The Basic Read API](./basic_read_api.md)
7
9
  * [The Basic Write API](./basic_write_api.md)
@@ -10,10 +12,17 @@
10
12
  * [**Row and Column Separators**](./row_col_sep.md)
11
13
  * [Header Transformations](./header_transformations.md)
12
14
  * [Header Validations](./header_validations.md)
15
+ * [Column Selection](./column_selection.md)
13
16
  * [Data Transformations](./data_transformations.md)
14
17
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
18
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Instrumentation Hooks](./instrumentation.md)
20
+ * [Examples](./examples.md)
21
+ * [Real-World CSV Files](./real_world_csv.md)
22
+ * [SmarterCSV over the Years](./history.md)
23
+ * [Release Notes](./releases/1.16.0/changes.md)
24
+
25
+ --------------
17
26
 
18
27
  # Row and Column Separators
19
28
 
@@ -52,7 +61,7 @@ This data format uses CTRL-A as the column separator, and CTRL-B as the record s
52
61
  ```ruby
53
62
  filename = '/tmp/itunes_db_dump'
54
63
  options = {
55
- :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
64
+ :col_sep => "\cA", :row_sep => "\cB", :comment_regexp => /^#/,
56
65
  :chunk_size => 100 , :key_mapping => {export_date: nil, name: :genre},
57
66
  }
58
67
  n = SmarterCSV.process(filename, options) do |chunk|
@@ -93,7 +102,7 @@ In this example, we use `comment_regexp` to filter out and ignore any lines star
93
102
  # Consider a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes!)
94
103
  filename = '/tmp/strange_db_dump'
95
104
  options = {
96
- :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
105
+ :col_sep => "\cA", :row_sep => "\cB", :comment_regexp => /^#/,
97
106
  :chunk_size => 100 , :key_mapping => {:export_date => nil, :name => :genre},
98
107
  }
99
108
  n = SmarterCSV.process(filename, options) do |chunk|
@@ -103,4 +112,5 @@ In this example, we use `comment_regexp` to filter out and ignore any lines star
103
112
  ```
104
113
 
105
114
  ----------------
106
- PREVIOUS: [Configuration Options](./options.md) | NEXT: [Header Transformations](./header_transformations.md)
115
+
116
+ PREVIOUS: [Configuration Options](./options.md) | NEXT: [Header Transformations](./header_transformations.md) | UP: [README](../README.md)