smarter_csv 1.15.2 → 1.16.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (48)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +9 -0
  3. data/CHANGELOG.md +68 -1
  4. data/CONTRIBUTORS.md +3 -1
  5. data/Gemfile +1 -0
  6. data/README.md +123 -27
  7. data/docs/_introduction.md +40 -24
  8. data/docs/bad_row_quarantine.md +285 -0
  9. data/docs/basic_read_api.md +151 -9
  10. data/docs/basic_write_api.md +474 -59
  11. data/docs/batch_processing.md +161 -4
  12. data/docs/column_selection.md +183 -0
  13. data/docs/data_transformations.md +162 -29
  14. data/docs/examples.md +339 -46
  15. data/docs/header_transformations.md +93 -12
  16. data/docs/header_validations.md +56 -18
  17. data/docs/history.md +117 -0
  18. data/docs/instrumentation.md +165 -0
  19. data/docs/migrating_from_csv.md +290 -0
  20. data/docs/options.md +150 -87
  21. data/docs/parsing_strategy.md +63 -1
  22. data/docs/real_world_csv.md +262 -0
  23. data/docs/releases/1.16.0/benchmarks.md +223 -0
  24. data/docs/releases/1.16.0/changes.md +272 -0
  25. data/docs/releases/1.16.0/performance_notes.md +114 -0
  26. data/docs/row_col_sep.md +14 -5
  27. data/docs/value_converters.md +193 -57
  28. data/ext/smarter_csv/extconf.rb +3 -0
  29. data/ext/smarter_csv/smarter_csv.c +1007 -71
  30. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
  31. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
  32. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
  33. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
  34. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
  35. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
  36. data/lib/smarter_csv/errors.rb +8 -0
  37. data/lib/smarter_csv/file_io.rb +1 -1
  38. data/lib/smarter_csv/hash_transformations.rb +14 -13
  39. data/lib/smarter_csv/header_transformations.rb +21 -2
  40. data/lib/smarter_csv/headers.rb +2 -1
  41. data/lib/smarter_csv/options.rb +124 -7
  42. data/lib/smarter_csv/parser.rb +362 -75
  43. data/lib/smarter_csv/reader.rb +494 -46
  44. data/lib/smarter_csv/version.rb +1 -1
  45. data/lib/smarter_csv/writer.rb +71 -19
  46. data/lib/smarter_csv.rb +95 -12
  47. data/smarter_csv.gemspec +20 -10
  48. metadata +37 -80
@@ -0,0 +1,272 @@
1
+
2
+ ### Contents
3
+
4
+ * [Introduction](../../_introduction.md)
5
+ * [Migrating from Ruby CSV](../../migrating_from_csv.md)
6
+ * [Parsing Strategy](../../parsing_strategy.md)
7
+ * [The Basic Read API](../../basic_read_api.md)
8
+ * [The Basic Write API](../../basic_write_api.md)
9
+ * [Batch Processing](../../batch_processing.md)
10
+ * [Configuration Options](../../options.md)
11
+ * [Row and Column Separators](../../row_col_sep.md)
12
+ * [Header Transformations](../../header_transformations.md)
13
+ * [Header Validations](../../header_validations.md)
14
+ * [Column Selection](../../column_selection.md)
15
+ * [Data Transformations](../../data_transformations.md)
16
+ * [Value Converters](../../value_converters.md)
17
+ * [Bad Row Quarantine](../../bad_row_quarantine.md)
18
+ * [Instrumentation Hooks](../../instrumentation.md)
19
+ * [Examples](../../examples.md)
20
+ * [Real-World CSV Files](../../real_world_csv.md)
21
+ * [SmarterCSV over the Years](../../history.md)
22
+ * [**Release Notes**](./changes.md)
23
+
24
+ --------------
25
+
26
+ # SmarterCSV 1.16.0 — Changes
27
+
28
+ RSpec tests: **714 → 1,247** (+533 tests)
29
+
30
+ ---
31
+
32
+ ## Minor Breaking Change
33
+
34
+ New option **`quote_boundary:`**
35
+ * defaults to `:standard`: quotes are now only recognized as field delimiters at field boundaries;
36
+ mid-field quotes are treated as literal characters.
37
+
38
+ This aligns SmarterCSV with RFC 4180 and other CSV libraries. In practice, mid-field quotes
39
+ were already producing silently corrupt output in previous versions — so most users will see
40
+ correctness improve, not regress.
41
+
42
+ * Use `quote_boundary: :legacy` only in exceptional cases to restore previous behavior. See [Parsing Strategy](../../parsing_strategy.md).
43
+
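To make the `:standard` rule above concrete, here is a minimal sketch of a splitter that only treats a quote as a delimiter when it opens a field. It is a toy written for illustration, not SmarterCSV's parser; the method name `split_standard` and its keyword arguments are hypothetical, and it assumes single-character separators and RFC 4180 style doubled quotes.

```ruby
# Toy illustration of the ":standard" quote-boundary rule (not SmarterCSV code).
# A quote opens a quoted field only at a field boundary; elsewhere it is literal.
def split_standard(line, col_sep: ",", quote: '"')
  fields = []
  field = +""
  in_quotes = false
  at_field_start = true
  i = 0
  while i < line.length
    ch = line[i]
    if in_quotes
      if ch == quote
        if line[i + 1] == quote
          field << quote          # doubled quote inside a quoted field -> literal quote
          i += 1
        else
          in_quotes = false       # closing quote
        end
      else
        field << ch
      end
    elsif ch == quote && at_field_start
      in_quotes = true            # opening quote at a field boundary
    elsif ch == col_sep
      fields << field
      field = +""
      at_field_start = true
      i += 1
      next
    else
      field << ch                 # mid-field quotes land here and are kept literally
    end
    at_field_start = false
    i += 1
  end
  fields << field
  fields
end

p split_standard('a,b"bb,c')
p split_standard('"x,y",z')
```

The first call keeps the inch-mark style quote inside `b"bb` as data, while the second still honors a quote that opens a field.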
44
+ ---
45
+
46
+ ## Performance Improvements
47
+
48
+ ### Net Benchmark Result (C-accelerated, Apple M1, Ruby 3.4.7)
49
+
50
+ | Comparison | Range |
51
+ |---|---|
52
+ | vs Ruby `CSV.read` † | **2×–8× faster** |
53
+ | vs Ruby `CSV.table` ‡ | **7×–129× faster** |
54
+ | vs SmarterCSV 1.14.4 (C-path) | **9×–65× faster** |
55
+ | vs SmarterCSV 1.15.2 (C-path) | **up to 2.4× faster** |
56
+ | vs SmarterCSV 1.15.2 (Ruby-path) | **up to 2× faster** |
57
+
58
+ † `CSV.read` returns raw arrays of arrays — hash construction, key normalization, and type conversion still need to happen, understating the real cost difference.
59
+
60
+ ‡ `CSV.table` is the closest Ruby equivalent to SmarterCSV — both return symbol-keyed hashes.
61
+
62
+ ![SmarterCSV 1.16.0 vs previous versions — C-accelerated path](../../../images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg)
63
+
64
+ ![SmarterCSV 1.16.0 vs previous versions — Ruby path](../../../images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg)
65
+
66
+ See [performance_notes.md](performance_notes.md) and [benchmarks.md](benchmarks.md).
67
+
68
+ ### C Extension
69
+
70
+ - **ParseContext architecture**: All per-file parse options are now wrapped in a GC-managed
71
+ `TypedData` object (`parse_context_t`) built once after headers are loaded. Eliminates
72
+ ~10 `rb_hash_aref` calls per row that previously read directly from the options hash on
73
+ every row.
74
+ - **Column-filter bitmap**: `_keep_bitmap` precomputed as a packed binary `String` — one
75
+ `memcpy`-style check per row replaces N `rb_ary_entry` calls. Loop invariants
76
+ `_keep_extra_cols` and `_early_exit_after` precomputed once; `_keep_cols=false` sentinel
77
+ skips bitmap logic entirely on files without column selection (one `!= Qfalse` test per row).
78
+ - **Section 4 fast-path split**: The C unquoted inner loop is split into two sub-paths —
79
+ plain unquoted vs. boundary-aware `:standard` mode — so the common case avoids all
80
+ quote-boundary state tracking. `__builtin_expect` hints applied to both guards.
81
+ - **Section 2 lazy lookups**: `quote_escaping` / `quote_boundary` reads moved from
82
+ unconditional Section 2 (every row) to Section 5 (quoted-field path only).
83
+ `only_headers` / `except_headers` / `strict` lookups guarded by `_keep_cols` nil-check.
84
+ Duplicate `row_sep` lookup removed.
85
+ - **Byte-level indexing**: All `line[i]` character lookups inside inner loops replaced with
86
+ `line.getbyte(i)` (returns an Integer directly, ~5–10 ns, zero allocation vs. ~30–50 ns
87
+ one-char String per call). Field extraction switched to `line.byteslice(start, len)`.
88
+ `col_sep_byte` and `quote_byte` precomputed as integers.
89
+ - **Skip-ahead in quoted fields**: `memchr` jump to next quote character instead of advancing
90
+ one byte at a time inside quoted fields.
91
+ - **Skip-ahead for unquoted fields in `:standard` mode**: Once a field is confirmed unquoted,
92
+ `String#index` jumps directly to the next `col_sep`, bypassing per-character state checks.
93
+ - **Compiler flag `-fno-semantic-interposition`**: Added to `extconf.rb` for GCC/Clang
94
+ (excluded from MSVC). Enables more aggressive LTO inlining and bypasses the PLT for
95
+ intra-library calls on Linux.
96
+ - **`cold`/`hot` function attributes + compiler hints**: Applied to rarely-executed paths and
97
+ hot inner loops respectively to guide branch predictor and instruction cache layout.
98
+
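The byte-level indexing item above can be sketched in plain Ruby. This is an illustrative loop, not the gem's implementation: `split_unquoted` is a hypothetical name, and it assumes a single-byte column separator and a line without quoted fields.

```ruby
# Sketch of byte-level scanning with getbyte/byteslice (hypothetical helper).
# getbyte returns an Integer, so the inner loop allocates no one-char Strings.
def split_unquoted(line, col_sep = ",")
  sep_byte = col_sep.getbyte(0)   # precomputed once as an Integer
  fields = []
  start = 0
  i = 0
  n = line.bytesize
  while i < n
    if line.getbyte(i) == sep_byte
      fields << line.byteslice(start, i - start)   # extract the field by byte offsets
      start = i + 1
    end
    i += 1
  end
  fields << line.byteslice(start, n - start)       # last field on the line
  fields
end

p split_unquoted("a,bé,c")
```

Because UTF-8 continuation bytes are always 0x80 or higher, comparing against an ASCII separator byte is safe even when fields contain multi-byte characters.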
99
+ ### Ruby Path
100
+
101
+ - **Unquoted fast path — direct hash construction**: `parse_line_to_hash_ruby` builds the
102
+ result hash directly from `String#split` for unquoted lines. Eliminates the intermediate
103
+ `Array` from `parse_csv_line_ruby` and a second full-row iteration. Uses integer-index
104
+ `while` loops instead of Ruby enumerators.
105
+ - **`byteindex` skip-ahead**: Inside quoted fields, `String#byteindex` (Ruby 3.2+) or inline
106
+ `getbyte` scan jumps to next quote or col_sep at C speed. Falls back correctly on
107
+ JRuby/TruffleRuby.
108
+ - **Empty field skipping inline**: `remove_empty_values` now filters empty fields inline
109
+ during hash building rather than post-processing. Combined with `strip_whitespace: true`
110
+ (default), catches both empty and whitespace-only fields without regex.
111
+ - **Quoted field extraction**: Content extracted directly with `byteslice` excluding
112
+ surrounding quotes; avoids double allocation. In-place `.strip!` on fresh byteslice avoids
113
+ a second allocation.
114
+ - **Backslash detection fast-path**: In `:auto` quote_escaping mode, when the line contains no
115
+ backslash character, skips the backslash-try dance and calls RFC 4180 mode directly.
116
+ - **Hot-path option caching**: `@hot_path_options`, `@quote_escaping_backslash`,
117
+ `@quote_escaping_double`, `@delete_nil_keys`, `@delete_empty_keys`, `@quote_char`, and
118
+ `@field_size_limit` precomputed as ivars once after headers are loaded — all per-row
119
+ option-hash lookups replaced by cheap ivar reads.
120
+ - **Multiline gate optimization**: `detect_multiline_strict` used as a cheap gate in the
121
+ stitch loop; avoids N-2 full re-parses per multiline row in the Ruby path.
122
+
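The `byteindex` skip-ahead item above, with its fallback, might look roughly like the following. The helper name `next_quote_pos` is an assumption made for this sketch; only `String#byteindex` (Ruby 3.2+), `getbyte`, and `bytesize` are real Ruby methods.

```ruby
# Sketch: jump to the next quote by byte offset instead of stepping per character.
# next_quote_pos is a hypothetical helper, not part of SmarterCSV.
def next_quote_pos(line, from, quote = '"')
  if line.respond_to?(:byteindex)
    line.byteindex(quote, from)            # Ruby 3.2+: search runs in C
  else
    quote_byte = quote.getbyte(0)          # fallback for runtimes without byteindex
    i = from
    n = line.bytesize
    i += 1 while i < n && line.getbyte(i) != quote_byte
    i < n ? i : nil
  end
end

line = '"hello, world",x'
closing = next_quote_pos(line, 1)          # start searching after the opening quote
p line.byteslice(1, closing - 1)
```

Both branches return a byte offset or `nil`, so the caller can use `byteslice` on the result without caring which branch ran.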
123
+ ---
124
+
125
+ ## New Features
126
+
127
+ ### Reader
128
+
129
+ **New top-level API:**
130
+
131
+ - **`SmarterCSV.parse(csv_string, options = {})`**: Parse a CSV string directly without
132
+ wrapping in `StringIO`. Drop-in equivalent of `CSV.parse(str, headers: true,
133
+ header_converters: :symbol)` with numeric conversion included. See
134
+ [Migrating from Ruby CSV](../../migrating_from_csv.md).
135
+ - **`SmarterCSV.each(input, options = {}, &block)`**: Row-by-row enumerator yielding each
136
+ row as a `Hash`. Returns an `Enumerator` when called without a block.
137
+ - **`SmarterCSV.each_chunk(input, options = {}, &block)`**: Chunked enumerator yielding
138
+ `(Array<Hash>, chunk_index)`. Requires `chunk_size` in options. Returns an `Enumerator`
139
+ without a block.
140
+
141
+ **New `Reader` instance methods:**
142
+
143
+ - **`Reader#each { |hash| }`**: Yields each row as a `Hash`. `Reader` now includes
144
+ `Enumerable` (enables `map`, `select`, `lazy`, etc.).
145
+ - **`Reader#each_chunk { |chunk, index| }`**: Yields each chunk plus 0-based chunk index.
146
+
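The "returns an `Enumerator` without a block" and "includes `Enumerable`" behavior described above follows a common Ruby pattern. The sketch below uses a hypothetical `RowSource` class over an in-memory array; it is not `SmarterCSV::Reader` and makes no claim about how the gem implements these methods.

```ruby
# Hypothetical stand-in showing the each / each_chunk / Enumerable pattern.
class RowSource
  include Enumerable

  def initialize(rows)
    @rows = rows
  end

  # Yields each row; returns an Enumerator when no block is given.
  def each
    return enum_for(:each) unless block_given?

    @rows.each { |row| yield row }
    self
  end

  # Yields (chunk, 0-based chunk index); returns an Enumerator when no block is given.
  def each_chunk(chunk_size)
    return enum_for(:each_chunk, chunk_size) unless block_given?

    each_slice(chunk_size).with_index { |chunk, index| yield chunk, index }
    self
  end
end

source = RowSource.new([{ a: 1 }, { a: 2 }, { a: 3 }])
p source.map { |row| row[:a] }
p source.lazy.select { |row| row[:a].odd? }.first(1)
source.each_chunk(2) { |chunk, index| p [index, chunk.size] }
```

Defining only `each` and including `Enumerable` is what makes `map`, `select`, `lazy`, and `each_slice` available for free.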
147
+ **New options:**
148
+
149
+ - **`quote_boundary: :standard`** *(default — breaking change)*: Quotes are only recognized
150
+ as field delimiters at field boundaries; mid-field quotes are treated as literal characters.
151
+ Use `quote_boundary: :legacy` to restore previous behavior.
152
+ - **`quote_escaping: :auto`** *(default)*: Tries backslash interpretation first; automatically
153
+ downgrades to RFC 4180 when no backslash is present in the line. Also accepts `:backslash`
154
+ and `:double_quotes`.
155
+ - **`headers: { only: [...] }`**: Keep only the specified columns in each result hash.
156
+ Excluded columns are skipped in the C hot path — no string allocation, no conversion, no
157
+ hash insertion. See [Column Selection](../../column_selection.md).
158
+ - **`headers: { except: [...] }`**: Remove the specified columns from each result hash. Same
159
+ hot-path optimization. Cannot be combined with `headers: { only: }`.
160
+ - **`on_bad_row:`**: Controls behavior when a row raises a parse error. Values: `:raise`
161
+ (default), `:skip`, `:collect`, or a callable. With `:collect`, error records accumulate in
162
+ `reader.errors[:bad_rows]`. See [Bad Row Quarantine](../../bad_row_quarantine.md).
163
+ - **`bad_row_limit: N`**: Raises `SmarterCSV::TooManyBadRows` after N bad rows. Default: `nil`
164
+ (unlimited).
165
+ - **`collect_raw_lines: true`** *(default)*: Include the raw stitched line in bad-row error
166
+ records. Set to `false` for privacy or memory savings.
167
+ - **`field_size_limit: N`**: Maximum size of any extracted field in bytes. Raises
168
+ `SmarterCSV::FieldSizeLimitExceeded` if a field or accumulating multiline buffer exceeds
169
+ the limit. Prevents DoS from runaway quoted fields. See
170
+ [Bad Row Quarantine](../../bad_row_quarantine.md#limiting-field-size-field_size_limit).
171
+ - **`nil_values_matching: regex`**: Set matching values to `nil` via regular expression. With
172
+ `remove_empty_values: true` (default), nil-ified values are removed. With
173
+ `remove_empty_values: false`, the key is retained with a `nil` value. Replaces deprecated
174
+ `remove_values_matching:`.
175
+ - **`missing_headers: :auto`** *(default)*: Auto-generate names for extra columns using
176
+ `missing_header_prefix` (e.g. `column_7`, `column_8`). Use `:raise` to raise
177
+ `HeaderSizeMismatch` instead. Replaces deprecated `strict:`.
178
+ - **`verbose: :quiet / :normal / :debug`**: Symbol-based verbosity levels. `:quiet` suppresses
179
+ all output; `:normal` (default) shows behavioral warnings; `:debug` adds computed options and
180
+ per-row diagnostics to `$stderr`. Replaces deprecated `verbose: true/false`.
181
+ - New Instrumentation Hooks: See [Instrumentation Hooks](../../instrumentation.md).
182
+ - **`on_start: callable`**: Fires once before the first row with
183
+ `{ input:, file_size:, col_sep:, row_sep: }`.
184
+ - **`on_chunk: callable`**: Fires after each chunk (chunked mode only) with
185
+ `{ chunk_number:, rows_in_chunk:, total_rows_so_far: }`.
186
+ - **`on_complete: callable`**: Fires after the file is exhausted with
187
+ `{ total_rows:, total_chunks:, duration:, bad_rows: }`.
188
+
189
+
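As a rough illustration of how the new options listed above fit together, here is an options hash using only the option names and values named in these notes. The concrete numbers, the column names, and the regular expression are made up for the example, and the assumption that a hook receives its payload as a single Hash argument is ours, not something the notes state.

```ruby
# Illustrative options hash; values are examples, not recommended settings.
options = {
  quote_boundary: :standard,                 # new default; :legacy restores old behavior
  quote_escaping: :auto,                     # also accepts :backslash and :double_quotes
  headers: { only: [:id, :email] },          # cannot be combined with headers: { except: }
  on_bad_row: :collect,                      # :raise (default), :skip, :collect, or a callable
  bad_row_limit: 100,                        # example value
  collect_raw_lines: false,                  # omit raw lines from bad-row records
  field_size_limit: 1_000_000,               # bytes; example value
  nil_values_matching: /\A(?:NULL|N\/A)\z/,  # example pattern
  missing_headers: :auto,                    # or :raise
  verbose: :quiet,
  # Assumption: the hook is called with one Hash containing the documented keys.
  on_complete: ->(stats) { warn "rows: #{stats[:total_rows]}, bad: #{stats[:bad_rows]}" }
}

# With the gem installed, such a hash would be passed as the second argument,
# e.g. SmarterCSV.process(filename, options), as in the docs' existing examples.
p options.keys
```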
190
+ **New exceptions:**
191
+
192
+ - **`SmarterCSV::FieldSizeLimitExceeded`**: Raised when `field_size_limit` is exceeded.
193
+ - **`SmarterCSV::TooManyBadRows`**: Raised when `bad_row_limit` is exceeded.
194
+
195
+ **Deprecations:**
196
+
197
+ - `only_headers:` → use `headers: { only: }`
198
+ - `except_headers:` → use `headers: { except: }`
199
+ - `remove_values_matching:` → use `nil_values_matching:`
200
+ - `strict: true` → use `missing_headers: :raise`
201
+ - `strict: false` → use `missing_headers: :auto`
202
+ - `verbose: true` → use `verbose: :debug`
203
+ - `verbose: false` → use `verbose: :normal`
204
+
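The deprecation table above is mechanical enough to express as a small translation helper. `migrate_options` is a hypothetical function written for this sketch (the gem does not necessarily ship one); it applies exactly the seven mappings listed and leaves every other key untouched.

```ruby
# Hypothetical helper: rewrite deprecated option names to their 1.16.0 replacements.
def migrate_options(old_options)
  opts = old_options.dup

  if opts.key?(:only_headers)
    opts[:headers] = { only: opts.delete(:only_headers) }
  elsif opts.key?(:except_headers)
    opts[:headers] = { except: opts.delete(:except_headers) }
  end

  if opts.key?(:remove_values_matching)
    opts[:nil_values_matching] = opts.delete(:remove_values_matching)
  end

  if opts.key?(:strict)
    opts[:missing_headers] = opts.delete(:strict) ? :raise : :auto
  end

  case opts[:verbose]
  when true  then opts[:verbose] = :debug
  when false then opts[:verbose] = :normal
  end

  opts
end

p migrate_options({ only_headers: [:a], strict: true, verbose: false, chunk_size: 50 })
```

The helper does not try to merge with an existing `headers:` hash; a real migration would need to decide how to handle that case.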
205
+ ### Writer
206
+
207
+ - **IO and StringIO support**: `SmarterCSV.generate` and `SmarterCSV::Writer.new` now accept
208
+ any `IO`-compatible object (responding to `#write`) in addition to a file path or
209
+ `Pathname`. The caller retains ownership of passed-in IO objects.
210
+ - **`SmarterCSV.generate` returns a String when called without a destination**: Omit the file
211
+ argument and the CSV is written to an internal buffer and returned as a `String`. Options
212
+ hash can be passed as the sole argument.
213
+ - **Streaming mode for known headers**: When `headers:` or `map_headers:` is provided at
214
+ construction time, the Writer skips the internal temp file entirely — the header line is
215
+ written immediately and each `<<` streams directly to the output file. No API change;
216
+ existing code benefits automatically. See [The Basic Write API](../../basic_write_api.md).
217
+ - **`encoding:` option**: Specifies the file encoding (e.g. `'UTF-8'`, `'ISO-8859-1'`).
218
+ Supports Ruby's `'external:internal'` transcoding notation. Only applies when writing to a
219
+ file path; ignored for IO objects.
220
+ - **`write_nil_value:` option** *(default: `''`)*: String written in place of `nil` field
221
+ values.
222
+ - **`write_empty_value:` option** *(default: `''`)*: String written in place of empty-string
223
+ field values, including missing keys.
224
+ - **`write_bom:` option** *(default: `false`)*: Prepends a UTF-8 BOM (`\xEF\xBB\xBF`) to the
225
+ output. Useful for Excel compatibility with non-ASCII content.
226
+
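To show what the IO support, `write_bom:`, and `write_nil_value:` items above amount to at the byte level, here is a stdlib-only sketch. `write_csv` is a hypothetical function, not the SmarterCSV Writer; it does no quoting or escaping and assumes the caller passes any object that responds to `#write`.

```ruby
require "stringio"

# Hypothetical sketch: stream rows to a caller-owned IO, optionally with a UTF-8 BOM.
def write_csv(io, headers, rows, write_bom: false, write_nil_value: "")
  io.write("\uFEFF") if write_bom          # U+FEFF encodes to EF BB BF in UTF-8
  io.write(headers.join(",") << "\n")      # header line is written immediately
  rows.each do |row|
    values = headers.map do |header|
      value = row[header]
      value.nil? ? write_nil_value : value.to_s
    end
    io.write(values.join(",") << "\n")     # each row streams straight to the IO
  end
  io                                       # not closed here: the caller owns it
end

buffer = StringIO.new(+"")
write_csv(buffer, [:a, :b], [{ a: 1, b: nil }], write_bom: true, write_nil_value: "NULL")
p buffer.string.bytes.first(3)
```

Leaving the IO open mirrors the ownership rule stated above: whoever opened the stream decides when to close it.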
227
+ ---
228
+
229
+ ## Bug Fixes
230
+
231
+ ### Reader
232
+
233
+ - **Mid-field quotes no longer corrupt unquoted fields**: `quote_boundary: :standard` (now the
234
+ default) prevents a quote character mid-field (e.g. `b"bb`) from toggling quoted state. This
235
+ silently corrupted rows in 1.15.2 when data contained apostrophes or inch marks.
236
+ - **Unclosed-quote fallback in `:auto` mode**: When backslash mode encounters an unclosed quote
237
+ at EOL, the parser now tries RFC 4180 mode as a fallback before treating the row as multiline.
238
+ - **Empty headers bug fixed** ([#324](https://github.com/tilo/smarter_csv/issues/324),
239
+ [#312](https://github.com/tilo/smarter_csv/issues/312)): CSV files with empty or
240
+ whitespace-only header fields (e.g. `name,,`) now auto-generate column names using
241
+ `missing_header_prefix` (default: `column_1`, `column_2`, …).
242
+ - **All library output now goes to `$stderr`**: Behavioral warnings use `warn` (suppressible
243
+ via `-W0` or `verbose: :quiet`); debug diagnostics use `$stderr.puts`. Nothing is written to
244
+ `$stdout`.
245
+ - **`SmarterCSV.generate` raises `ArgumentError`** (not a blank `RuntimeError`) when called
246
+ without a block.
247
+
248
+ ### Writer
249
+
250
+ - **Temp file no longer hardcoded to `/tmp`**: Fixes `Errno::ENOENT` on Windows.
251
+ - **Temp file properly cleaned up**: `Tempfile#close!` now used instead of `Tempfile#delete`,
252
+ ensuring the file is both closed and unlinked.
253
+ - **`StringIO` handling**: Writing to a `StringIO` no longer attempts to close it on
254
+ `finalize`.
255
+
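The two temp-file fixes above rely on standard `Tempfile` behavior, which can be checked in isolation. The file name prefix below is arbitrary and chosen only for this sketch.

```ruby
require "tempfile"

# Tempfile.new places the file under Dir.tmpdir, so no path is hardcoded.
tmp = Tempfile.new(["smarter_csv_demo", ".csv"])
tmp.write("a,b\n1,2\n")
path = tmp.path

tmp.close!                         # closes the handle and unlinks the file in one step
still_there = File.exist?(path)
p still_there
```

`Tempfile#delete` (an alias of `unlink`) only removes the file, whereas `close!` also closes the handle.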
256
+ ---
257
+
258
+ ## Misc
259
+
260
+ - **`@mapped_keys` changed from `Array` to `Set`**: O(1) lookup per field instead of O(n)
261
+ scan on the `value_converters` key check.
262
+ - **`escape_csv_field` micro-optimizations**: `@escaped_quote_char` precomputed once in
263
+ `initialize`; redundant `.to_s` call removed; row separator appended with `<<` (mutating)
264
+ instead of `+` to save one string allocation per row.
265
+ - **`Reader` includes `Enumerable`**: Enables `map`, `select`, `reject`, `lazy`, and other
266
+ Enumerable methods on `Reader#each` results.
267
+ - **`DEFAULT_CHUNK_SIZE = 100`**: Constant added; warning emitted when `each_chunk` is called
268
+ without explicit `chunk_size`.
269
+
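The `Array` to `Set` change above is the usual membership-test optimization. The sketch below is an assumption about the general shape of such a check, with made-up keys and converters; it does not reproduce the gem's internal `@mapped_keys` logic.

```ruby
require "set"

# Made-up data: which keys have a converter, and the converters themselves.
mapped_keys = Set.new(%i[price quantity])
converters  = { price: ->(v) { Float(v) }, quantity: ->(v) { Integer(v) } }

row = { name: "widget", price: "9.50", quantity: "3" }

# Set#include? is a hash lookup, so the per-field check does not scan a list.
converted = row.to_h do |key, value|
  [key, mapped_keys.include?(key) ? converters[key].call(value) : value]
end

p converted
```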
270
+ ---
271
+
272
+ PREVIOUS: [SmarterCSV over the Years](../../history.md) | UP: [README](../../../README.md)
@@ -0,0 +1,114 @@
1
+ # SmarterCSV 1.16.0 — Performance Notes
2
+
3
+ Measured on Apple M1 Pro, Ruby 3.4.7, best of two benchmark sessions (30 runs each).
4
+ See [benchmarks.md](benchmarks.md) for full tables.
5
+
6
+ ---
7
+
8
+ ## vs Ruby CSV
9
+
10
+ ### vs CSV.read (raw tokenization only — no hashes, no post-processing)
11
+
12
+ `CSV.read` is the *fastest* Ruby CSV mode. It returns plain string arrays with no header
13
+ handling, no symbol keys, no numeric conversion. SmarterCSV/C delivers fully processed
14
+ hashes — and still beats it on every single file:
15
+
16
+ | Range | Files |
17
+ |--------------|--------------------------------------------------------------------|
18
+ | **8–9×** | PEOPLE_IMPORT_C (8.1×), uszips (8.6×) |
19
+ | **6–7×** | uscities (6.4×), worldcities (6.3×), embedded_sep (6.0×) |
20
+ | **4–5×** | PEOPLE_IMPORT_NC (4.8×), long_fields (5.5×), many_empty (5.2×), sample_10M (4.3×), utf8 (4.3×) |
21
+ | **3×** | heavy_quoting (3.1×), tab_sep (3.3×), whitespace (3.1×), embedded_newlines (2.8×) |
22
+ | **2–3×** | PEOPLE_IMPORT_B (2.9×), PEOPLE_IMPORT_NB (2.7×), sensor_data (2.2×), multi_char (2.4×) |
23
+ | **~1.7×** | wide_500_cols (1.7×) — most column-heavy file, hash overhead visible |
24
+
25
+ **Summary: 1.7×–8.6× faster than CSV.read, while returning fully processed hashes.**
26
+
27
+ ### vs CSV.table (symbol keys + numeric conversion — nearest equivalent output)
28
+
29
+ `CSV.table` is the fairest apples-to-apples comparison: it also produces symbol-keyed
30
+ rows with type conversion applied. SmarterCSV/C is dramatically faster:
31
+
32
+ | Range | Files |
33
+ |----------------|-----------------------------------------------------------------|
34
+ | **100×+** | PEOPLE_IMPORT_C (129×) |
35
+ | **40–50×** | PEOPLE_IMPORT_NC (48×), many_empty (46×), wide_500_cols (41×) |
36
+ | **20–30×** | PEOPLE_IMPORT_B (24×), PEOPLE_IMPORT_NB (26×), uszips (28×), tab_sep (27×), whitespace (24×), sensor_data (24×), utf8 (23×), multi_char (20×), worldcities (20×), sample_10M (20×) |
37
+ | **15–20×** | uscities (21×), long_fields (16×), heavy_quoting (19×), embedded_sep (20×) |
38
+ | **7×** | embedded_newlines (7×) — multiline rows, overhead unavoidable |
39
+
40
+ **Summary: 7×–129× faster than CSV.table.**
41
+
42
+ ---
43
+
44
+ ## vs SmarterCSV 1.15.2
45
+
46
+ ### C path
47
+
48
+ | Gain | Files |
49
+ |--------------|---------------------------------------------------------------------|
50
+ | **2.4×** | long_fields — biggest win; `memchr` skip-ahead in quoted fields |
51
+ | **1.5×** | heavy_quoting — same skip-ahead benefit |
52
+ | **1.4×** | tab_separated |
53
+ | **1.2–1.3×** | embedded_sep, utf8, PEOPLE_IMPORT_C/NC, worldcities, whitespace, multi_char |
54
+ | **1.1–1.2×** | PEOPLE_IMPORT_B/NB, uszips, sample_10M, wide_500_cols |
55
+ | **~1.0×** | sensor_data, embedded_newlines (within noise) |
56
+
57
+ 15 of 19 files are measurably faster; 2 within noise; 2 files show a small regression
58
+ (PEOPLE_IMPORT_NB −7%, wide_500_cols −5%) attributable to the new `quote_boundary: :standard`
59
+ default adding one extra state check on the unquoted fast path.
60
+
61
+ ### Ruby path
62
+
63
+ | Gain | Files |
64
+ |--------------|---------------------------------------------------------------------|
65
+ | **1.9×** | PEOPLE_IMPORT_C (117 cols) — direct hash construction bypasses intermediate Array |
66
+ | **1.5×** | PEOPLE_IMPORT_NC, multi_char_sep |
67
+ | **1.0–1.1×** | most other files |
68
+
69
+ The Ruby path gains are concentrated on wide/complex files where the direct-hash
70
+ construction optimization (Opt #11) has the most impact.
71
+
72
+ ---
73
+
74
+ ## vs SmarterCSV 1.14.4
75
+
76
+ C path is **9×–65× faster** across all 19 benchmark files:
77
+
78
+ - Long fields: **65×** (v1.15.0 introduced `memchr` skip-ahead)
79
+ - PEOPLE_IMPORT_C: **48×** (117 cols × 50k rows)
80
+ - PEOPLE_IMPORT_NC, multi_char_sep: **~21–24×**
81
+ - Typical real-world file: **10–20×**
82
+ - Minimum: **9.8×** (uscities, embedded_newlines)
83
+
84
+ ---
85
+
86
+ ## vs ZSV (C library, GC disabled)
87
+
88
+ ZSV is a dedicated C CSV library with GC disabled during measurement (working around a
89
+ bug in zsv-ruby 1.3.1 on Ruby 3.4.x). Despite this advantage:
90
+
91
+ **SmarterCSV/C beats ZSV+wrapper** (the fair comparison — both return processed hashes)
92
+ on 18 of 19 files, by **2–7×**. ZSV+wrapper is faster only on `embedded_newlines`
93
+ (1.5×), where ZSV's chunked I/O is particularly efficient.
94
+
95
+ **SmarterCSV/C vs ZSV.read** (raw arrays, GC disabled): ZSV.read is faster on most files
96
+ (2–12×), which is expected — it does far less work and has GC disabled. SmarterCSV/C
97
+ matches or beats ZSV.read on PEOPLE_IMPORT_C (the 117-column file) and PEOPLE_IMPORT_NC,
98
+ where our C hash-building overhead is proportionally small.
99
+
100
+ ---
101
+
102
+ ## column_selection speedup (`headers: { only: }`)
103
+
104
+ When using `headers: { only: [...] }` to select a subset of columns, excluded columns
105
+ are skipped entirely in the C hot path — no string allocation, no conversion, no hash
106
+ insertion. Benchmark on `wide_500_cols_20k.csv` (500 columns):
107
+
108
+ | Columns kept | Speedup vs no selection |
109
+ |---|---|
110
+ | 2 of 500 | ~16× faster |
111
+ | 10 of 500 | ~8× faster |
112
+ | 50 of 500 | ~3× faster |
113
+
114
+ This is additive on top of the baseline gains above.
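The column-selection speedup above comes from deciding once, per file, which column positions to keep. A minimal Ruby sketch of that idea follows; the one-byte-per-column layout and the `build_row` helper are assumptions made for illustration, since the notes only say the bitmap is a packed binary `String`.

```ruby
# Precompute a keep-bitmap once per file (assumed layout: one byte per column).
headers = %i[id name email phone]
only    = %i[id email]
keep_bitmap = headers.map { |h| only.include?(h) ? 1 : 0 }.pack("C*")

# Hypothetical per-row builder: excluded columns never reach the hash.
def build_row(fields, headers, keep_bitmap)
  row = {}
  i = 0
  while i < fields.length
    row[headers[i]] = fields[i] if keep_bitmap.getbyte(i) == 1
    i += 1
  end
  row
end

p build_row(%w[1 Ann ann@example.com 555], headers, keep_bitmap)
```

In this sketch the excluded field strings already exist; the C path described above goes further and avoids allocating them at all.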
data/docs/row_col_sep.md CHANGED
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
5
6
  * [Parsing Strategy](./parsing_strategy.md)
6
7
  * [The Basic Read API](./basic_read_api.md)
7
8
  * [The Basic Write API](./basic_write_api.md)
@@ -10,10 +11,17 @@
10
11
  * [**Row and Column Separators**](./row_col_sep.md)
11
12
  * [Header Transformations](./header_transformations.md)
12
13
  * [Header Validations](./header_validations.md)
14
+ * [Column Selection](./column_selection.md)
13
15
  * [Data Transformations](./data_transformations.md)
14
16
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
17
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
18
+ * [Instrumentation Hooks](./instrumentation.md)
19
+ * [Examples](./examples.md)
20
+ * [Real-World CSV Files](./real_world_csv.md)
21
+ * [SmarterCSV over the Years](./history.md)
22
+ * [Release Notes](./releases/1.16.0/changes.md)
23
+
24
+ --------------
17
25
 
18
26
  # Row and Column Separators
19
27
 
@@ -52,7 +60,7 @@ This data format uses CTRL-A as the column separator, and CTRL-B as the record s
52
60
  ```ruby
53
61
  filename = '/tmp/itunes_db_dump'
54
62
  options = {
55
- :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
63
+ :col_sep => "\cA", :row_sep => "\cB", :comment_regexp => /^#/,
56
64
  :chunk_size => 100 , :key_mapping => {export_date: nil, name: :genre},
57
65
  }
58
66
  n = SmarterCSV.process(filename, options) do |chunk|
@@ -93,7 +101,7 @@ In this example, we use `comment_regexp` to filter out and ignore any lines star
93
101
  # Consider a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes!)
94
102
  filename = '/tmp/strange_db_dump'
95
103
  options = {
96
- :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
104
+ :col_sep => "\cA", :row_sep => "\cB", :comment_regexp => /^#/,
97
105
  :chunk_size => 100 , :key_mapping => {:export_date => nil, :name => :genre},
98
106
  }
99
107
  n = SmarterCSV.process(filename, options) do |chunk|
@@ -103,4 +111,5 @@ In this example, we use `comment_regexp` to filter out and ignore any lines star
103
111
  ```
104
112
 
105
113
  ----------------
106
- PREVIOUS: [Configuration Options](./options.md) | NEXT: [Header Transformations](./header_transformations.md)
114
+
115
+ PREVIOUS: [Configuration Options](./options.md) | NEXT: [Header Transformations](./header_transformations.md) | UP: [README](../README.md)
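For readers unfamiliar with the control-character separators used in the hunks above, this stdlib-only sketch shows what `"\cA"` and `"\cB"` mean at the byte level. The sample data is invented, and plain `String#split` is used purely to illustrate the separators; it says nothing about how SmarterCSV itself handles `row_sep`.

```ruby
# "\cA" is CTRL-A (byte 0x01) and "\cB" is CTRL-B (byte 0x02).
col_sep = "\cA"
row_sep = "\cB"

# Invented sample: three records, each terminated by CTRL-B.
data = "id\cAname\cB1\cAAnn\cB2\cABob\cB"

records = data.split(row_sep).map { |record| record.split(col_sep) }
p [col_sep.bytes, row_sep.bytes]
p records
```

The separator passed as `row_sep` has to match the exact bytes that terminate each record in the input for the split to come out clean.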