smarter_csv 1.15.2 → 1.16.0
- checksums.yaml +4 -4
- data/.rubocop.yml +9 -0
- data/CHANGELOG.md +68 -1
- data/CONTRIBUTORS.md +3 -1
- data/Gemfile +1 -0
- data/README.md +123 -27
- data/docs/_introduction.md +40 -24
- data/docs/bad_row_quarantine.md +285 -0
- data/docs/basic_read_api.md +151 -9
- data/docs/basic_write_api.md +474 -59
- data/docs/batch_processing.md +161 -4
- data/docs/column_selection.md +183 -0
- data/docs/data_transformations.md +162 -29
- data/docs/examples.md +339 -46
- data/docs/header_transformations.md +93 -12
- data/docs/header_validations.md +56 -18
- data/docs/history.md +117 -0
- data/docs/instrumentation.md +165 -0
- data/docs/migrating_from_csv.md +290 -0
- data/docs/options.md +150 -87
- data/docs/parsing_strategy.md +63 -1
- data/docs/real_world_csv.md +262 -0
- data/docs/releases/1.16.0/benchmarks.md +223 -0
- data/docs/releases/1.16.0/changes.md +272 -0
- data/docs/releases/1.16.0/performance_notes.md +114 -0
- data/docs/row_col_sep.md +14 -5
- data/docs/value_converters.md +193 -57
- data/ext/smarter_csv/extconf.rb +3 -0
- data/ext/smarter_csv/smarter_csv.c +1007 -71
- data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
- data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
- data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
- data/lib/smarter_csv/errors.rb +8 -0
- data/lib/smarter_csv/file_io.rb +1 -1
- data/lib/smarter_csv/hash_transformations.rb +14 -13
- data/lib/smarter_csv/header_transformations.rb +21 -2
- data/lib/smarter_csv/headers.rb +2 -1
- data/lib/smarter_csv/options.rb +124 -7
- data/lib/smarter_csv/parser.rb +362 -75
- data/lib/smarter_csv/reader.rb +494 -46
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv/writer.rb +71 -19
- data/lib/smarter_csv.rb +95 -12
- data/smarter_csv.gemspec +20 -10
- metadata +37 -80
|
@@ -0,0 +1,272 @@
|
|
|
1
|
+
|
|
2
|
+
### Contents
|
|
3
|
+
|
|
4
|
+
* [Introduction](../../_introduction.md)
|
|
5
|
+
* [Migrating from Ruby CSV](../../migrating_from_csv.md)
|
|
6
|
+
* [Parsing Strategy](../../parsing_strategy.md)
|
|
7
|
+
* [The Basic Read API](../../basic_read_api.md)
|
|
8
|
+
* [The Basic Write API](../../basic_write_api.md)
|
|
9
|
+
* [Batch Processing](../../batch_processing.md)
|
|
10
|
+
* [Configuration Options](../../options.md)
|
|
11
|
+
* [Row and Column Separators](../../row_col_sep.md)
|
|
12
|
+
* [Header Transformations](../../header_transformations.md)
|
|
13
|
+
* [Header Validations](../../header_validations.md)
|
|
14
|
+
* [Column Selection](../../column_selection.md)
|
|
15
|
+
* [Data Transformations](../../data_transformations.md)
|
|
16
|
+
* [Value Converters](../../value_converters.md)
|
|
17
|
+
* [Bad Row Quarantine](../../bad_row_quarantine.md)
|
|
18
|
+
* [Instrumentation Hooks](../../instrumentation.md)
|
|
19
|
+
* [Examples](../../examples.md)
|
|
20
|
+
* [Real-World CSV Files](../../real_world_csv.md)
|
|
21
|
+
* [SmarterCSV over the Years](../../history.md)
|
|
22
|
+
* [**Release Notes**](./changes.md)
|
|
23
|
+
|
|
24
|
+
--------------
|
|
25
|
+
|
|
26
|
+
# SmarterCSV 1.16.0 — Changes
|
|
27
|
+
|
|
28
|
+
RSpec tests: **714 → 1,247** (+533 tests)
|
|
29
|
+
|
|
30
|
+
---
|
|
31
|
+
|
|
32
|
+
## Minor Breaking Change
|
|
33
|
+
|
|
34
|
+
New option **`quote_boundary:`**
|
|
35
|
+
* defaults to **`:standard`**: quotes are now only recognized as field delimiters at field boundaries;
|
|
36
|
+
mid-field quotes are treated as literal characters.
|
|
37
|
+
|
|
38
|
+
This aligns SmarterCSV with RFC 4180 and other CSV libraries. In practice, mid-field quotes
|
|
39
|
+
were already producing silently corrupt output in previous versions — so most users will see
|
|
40
|
+
behavior improve, not regress.
|
|
41
|
+
|
|
42
|
+
* Use `quote_boundary: :legacy` only in exceptional cases to restore previous behavior. See [Parsing Strategy](../../parsing_strategy.md).
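The boundary rule can be illustrated with a minimal sketch. `split_fields_standard` is a hypothetical helper written for this note, not SmarterCSV's parser; it assumes a single-line row, a literal `col_sep`, and RFC 4180 doubled-quote escaping inside quoted fields.

```ruby
# Hypothetical illustration of the :standard boundary rule (not library code).
# A quote opens a quoted field only when it is the first character of a field;
# a quote anywhere else in an unquoted field is kept as a literal character.
def split_fields_standard(line, col_sep: ",", quote: '"')
  fields = []
  i = 0
  n = line.length
  loop do
    if line[i] == quote
      # Quoted field: read up to the closing quote; "" inside is a literal quote.
      i += 1
      buf = +""
      while i < n
        if line[i] == quote
          if line[i + 1] == quote
            buf << quote
            i += 2
          else
            i += 1
            break
          end
        else
          buf << line[i]
          i += 1
        end
      end
      fields << buf
      # Anything between the closing quote and the next separator is ignored here.
      nxt = line.index(col_sep, i)
    else
      # Unquoted field: jump straight to the next separator, quotes included.
      nxt = line.index(col_sep, i)
      fields << line[i...(nxt || n)]
    end
    break if nxt.nil?
    i = nxt + col_sep.length
  end
  fields
end

mid_field = split_fields_standard('a,b"bb,c')
quoted    = split_fields_standard('"x,y",2')
```

Under these assumptions the mid-field quote in `b"bb` stays part of the field, while the quote at the start of `"x,y"` still delimits a quoted field.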
|
|
43
|
+
|
|
44
|
+
---
|
|
45
|
+
|
|
46
|
+
## Performance Improvements
|
|
47
|
+
|
|
48
|
+
### Net Benchmark Result (C-accelerated, Apple M1, Ruby 3.4.7)
|
|
49
|
+
|
|
50
|
+
| Comparison | Range |
|
|
51
|
+
|---|---|
|
|
52
|
+
| vs Ruby `CSV.read` † | **2×–8× faster** |
|
|
53
|
+
| vs Ruby `CSV.table` ‡ | **7×–129× faster** |
|
|
54
|
+
| vs SmarterCSV 1.14.4 (C-path) | **9×–65× faster** |
|
|
55
|
+
| vs SmarterCSV 1.15.2 (C-path) | **up to 2.4× faster** |
|
|
56
|
+
| vs SmarterCSV 1.15.2 (Ruby-path) | **up to 2× faster** |
|
|
57
|
+
|
|
58
|
+
† `CSV.read` returns raw arrays of arrays; hash construction, key normalization, and type conversion still need to happen afterwards, so this comparison understates the real cost difference.
|
|
59
|
+
|
|
60
|
+
‡ `CSV.table` is the closest Ruby equivalent to SmarterCSV — both return symbol-keyed hashes.
|
|
61
|
+
|
|
62
|
+

|
|
63
|
+
|
|
64
|
+

|
|
65
|
+
|
|
66
|
+
See [performance_notes.md](performance_notes.md) and [benchmarks.md](benchmarks.md).
|
|
67
|
+
|
|
68
|
+
### C Extension
|
|
69
|
+
|
|
70
|
+
- **ParseContext architecture**: All per-file parse options are now wrapped in a GC-managed
|
|
71
|
+
`TypedData` object (`parse_context_t`) built once after headers are loaded. Eliminates
|
|
72
|
+
~10 `rb_hash_aref` calls per row that previously read directly from the options hash on
|
|
73
|
+
every row.
|
|
74
|
+
- **Column-filter bitmap**: `_keep_bitmap` precomputed as a packed binary `String` — one
|
|
75
|
+
`memcpy`-style check per row replaces N `rb_ary_entry` calls. Loop invariants
|
|
76
|
+
`_keep_extra_cols` and `_early_exit_after` precomputed once; `_keep_cols=false` sentinel
|
|
77
|
+
skips bitmap logic entirely on files without column selection (one `!= Qfalse` test per row).
|
|
78
|
+
- **Section 4 fast-path split**: The C unquoted inner loop is split into two sub-paths —
|
|
79
|
+
plain unquoted vs. boundary-aware `:standard` mode — so the common case avoids all
|
|
80
|
+
quote-boundary state tracking. `__builtin_expect` hints applied to both guards.
|
|
81
|
+
- **Section 2 lazy lookups**: `quote_escaping` / `quote_boundary` reads moved from
|
|
82
|
+
unconditional Section 2 (every row) to Section 5 (quoted-field path only).
|
|
83
|
+
`only_headers` / `except_headers` / `strict` lookups guarded by `_keep_cols` nil-check.
|
|
84
|
+
Duplicate `row_sep` lookup removed.
|
|
85
|
+
- **Byte-level indexing**: All `line[i]` character lookups inside inner loops replaced with
|
|
86
|
+
`line.getbyte(i)` (returns an Integer directly, ~5–10 ns, zero allocation vs. ~30–50 ns
|
|
87
|
+
one-char String per call). Field extraction switched to `line.byteslice(start, len)`.
|
|
88
|
+
`col_sep_byte` and `quote_byte` precomputed as integers.
|
|
89
|
+
- **Skip-ahead in quoted fields**: `memchr` jump to next quote character instead of advancing
|
|
90
|
+
one byte at a time inside quoted fields.
|
|
91
|
+
- **Skip-ahead for unquoted fields in `:standard` mode**: Once a field is confirmed unquoted,
|
|
92
|
+
`String#index` jumps directly to the next `col_sep`, bypassing per-character state checks.
|
|
93
|
+
- **Compiler flag `-fno-semantic-interposition`**: Added to `extconf.rb` for GCC/Clang
|
|
94
|
+
(excluded from MSVC). Enables more aggressive LTO inlining and bypasses the PLT for
|
|
95
|
+
intra-library calls on Linux.
|
|
96
|
+
- **`cold`/`hot` function attributes + compiler hints**: Applied to rarely-executed paths and
|
|
97
|
+
hot inner loops respectively to guide branch predictor and instruction cache layout.
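The byte-level indexing bullet above can be sketched in plain Ruby. `split_on_byte` is a hypothetical name, and the sketch assumes a single-byte separator and no quoting; it only shows the `getbyte` / `byteslice` pattern, not the library's actual loop.

```ruby
# Hypothetical sketch: split a line by comparing bytes instead of one-char Strings.
def split_on_byte(line, col_sep = ",")
  col_sep_byte = col_sep.getbyte(0) # precomputed once as an Integer
  fields = []
  start = 0
  i = 0
  len = line.bytesize
  while i < len
    # getbyte returns an Integer, so no temporary String is created per position
    if line.getbyte(i) == col_sep_byte
      fields << line.byteslice(start, i - start)
      start = i + 1
    end
    i += 1
  end
  fields << line.byteslice(start, len - start)
  fields
end

ascii_fields = split_on_byte("id,name,city")
utf8_fields  = split_on_byte("é,b")
```

Because an ASCII separator byte never occurs inside a multi-byte UTF-8 sequence, slicing at separator offsets keeps each field a valid string.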
|
|
98
|
+
|
|
99
|
+
### Ruby Path
|
|
100
|
+
|
|
101
|
+
- **Unquoted fast path — direct hash construction**: `parse_line_to_hash_ruby` builds the
|
|
102
|
+
result hash directly from `String#split` for unquoted lines. Eliminates the intermediate
|
|
103
|
+
`Array` from `parse_csv_line_ruby` and a second full-row iteration. Uses integer-index
|
|
104
|
+
`while` loops instead of Ruby enumerators.
|
|
105
|
+
- **`byteindex` skip-ahead**: Inside quoted fields, `String#byteindex` (Ruby 3.2+) or inline
|
|
106
|
+
`getbyte` scan jumps to next quote or col_sep at C speed. Falls back correctly on
|
|
107
|
+
JRuby/TruffleRuby.
|
|
108
|
+
- **Empty field skipping inline**: `remove_empty_values` now filters empty fields inline
|
|
109
|
+
during hash building rather than post-processing. Combined with `strip_whitespace: true`
|
|
110
|
+
(default), catches both empty and whitespace-only fields without regex.
|
|
111
|
+
- **Quoted field extraction**: Content extracted directly with `byteslice` excluding
|
|
112
|
+
surrounding quotes; avoids double allocation. In-place `.strip!` on fresh byteslice avoids
|
|
113
|
+
a second allocation.
|
|
114
|
+
- **Backslash detection fast-path**: In `:auto` quote_escaping mode, when the line contains no
|
|
115
|
+
backslash character, skips the backslash-try dance and calls RFC 4180 mode directly.
|
|
116
|
+
- **Hot-path option caching**: `@hot_path_options`, `@quote_escaping_backslash`,
|
|
117
|
+
`@quote_escaping_double`, `@delete_nil_keys`, `@delete_empty_keys`, `@quote_char`, and
|
|
118
|
+
`@field_size_limit` precomputed as ivars once after headers are loaded — all per-row
|
|
119
|
+
option-hash lookups replaced by cheap ivar reads.
|
|
120
|
+
- **Multiline gate optimization**: `detect_multiline_strict` used as a cheap gate in the
|
|
121
|
+
stitch loop; avoids N-2 full re-parses per multiline row in the Ruby path.
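As a rough illustration of the direct hash construction and inline empty-field skipping described above, the following sketch builds the row hash in a single pass. `line_to_hash` is a hypothetical helper for unquoted lines only; it is not `parse_line_to_hash_ruby`, and the header list is assumed to be already known.

```ruby
# Hypothetical sketch: build the result hash straight from String#split,
# stripping whitespace and skipping empty fields in the same loop.
def line_to_hash(line, headers, col_sep: ",")
  values = line.split(col_sep, -1) # -1 keeps trailing empty fields
  hash = {}
  i = 0
  while i < headers.length # integer-index loop, no enumerator
    value = values[i]
    if value
      value.strip!                       # in place, no second String
      hash[headers[i]] = value unless value.empty?
    end
    i += 1
  end
  hash
end

row = line_to_hash("1, ,Bob", %i[id note name])
```

Here the whitespace-only second field is dropped without a separate post-processing pass or a regular expression.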
|
|
122
|
+
|
|
123
|
+
---
|
|
124
|
+
|
|
125
|
+
## New Features
|
|
126
|
+
|
|
127
|
+
### Reader
|
|
128
|
+
|
|
129
|
+
**New top-level API:**
|
|
130
|
+
|
|
131
|
+
- **`SmarterCSV.parse(csv_string, options = {})`**: Parse a CSV string directly without
|
|
132
|
+
wrapping in `StringIO`. Drop-in equivalent of `CSV.parse(str, headers: true,
|
|
133
|
+
header_converters: :symbol)` with numeric conversion included. See
|
|
134
|
+
[Migrating from Ruby CSV](../../migrating_from_csv.md).
|
|
135
|
+
- **`SmarterCSV.each(input, options = {}, &block)`**: Row-by-row enumerator yielding each
|
|
136
|
+
row as a `Hash`. Returns an `Enumerator` when called without a block.
|
|
137
|
+
- **`SmarterCSV.each_chunk(input, options = {}, &block)`**: Chunked enumerator yielding
|
|
138
|
+
`(Array<Hash>, chunk_index)`. Requires `chunk_size` in options. Returns an `Enumerator`
|
|
139
|
+
without a block.
|
|
140
|
+
|
|
141
|
+
**New `Reader` instance methods:**
|
|
142
|
+
|
|
143
|
+
- **`Reader#each { |hash| }`**: Yields each row as a `Hash`. `Reader` now includes
|
|
144
|
+
`Enumerable` (enables `map`, `select`, `lazy`, etc.).
|
|
145
|
+
- **`Reader#each_chunk { |chunk, index| }`**: Yields each chunk plus 0-based chunk index.
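The `Enumerable` pattern behind these methods can be sketched with a stand-in class. `RowSource` is hypothetical and holds rows in memory; it only shows how defining `each` (and returning an `Enumerator` when no block is given) makes `map`, `select`, and `lazy` available.

```ruby
# Hypothetical stand-in for a reader: defining #each and including Enumerable
# is enough to get map, select, lazy, each_slice and friends.
class RowSource
  include Enumerable

  def initialize(rows)
    @rows = rows
  end

  # Yields each row as a Hash; returns an Enumerator without a block.
  def each
    return enum_for(:each) unless block_given?

    @rows.each { |row| yield row }
    self
  end

  # Yields each chunk together with its 0-based chunk index.
  def each_chunk(chunk_size)
    return enum_for(:each_chunk, chunk_size) unless block_given?

    each_slice(chunk_size).with_index { |chunk, index| yield chunk, index }
  end
end

source   = RowSource.new([{ a: 1 }, { a: 2 }, { a: 3 }])
selected = source.select { |row| row[:a] > 1 }
chunks   = source.each_chunk(2).to_a
first    = source.lazy.map { |row| row[:a] * 10 }.first(1)
```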
|
|
146
|
+
|
|
147
|
+
**New options:**
|
|
148
|
+
|
|
149
|
+
- **`quote_boundary: :standard`** *(default — breaking change)*: Quotes are only recognized
|
|
150
|
+
as field delimiters at field boundaries; mid-field quotes are treated as literal characters.
|
|
151
|
+
Use `quote_boundary: :legacy` to restore previous behavior.
|
|
152
|
+
- **`quote_escaping: :auto`** *(default)*: Tries backslash interpretation first; automatically
|
|
153
|
+
downgrades to RFC 4180 when no backslash is present in the line. Also accepts `:backslash`
|
|
154
|
+
and `:double_quotes`.
|
|
155
|
+
- **`headers: { only: [...] }`**: Keep only the specified columns in each result hash.
|
|
156
|
+
Excluded columns are skipped in the C hot path — no string allocation, no conversion, no
|
|
157
|
+
hash insertion. See [Column Selection](../../column_selection.md).
|
|
158
|
+
- **`headers: { except: [...] }`**: Remove the specified columns from each result hash. Same
|
|
159
|
+
hot-path optimization. Cannot be combined with `headers: { only: }`.
|
|
160
|
+
- **`on_bad_row:`**: Controls behavior when a row raises a parse error. Values: `:raise`
|
|
161
|
+
(default), `:skip`, `:collect`, or a callable. With `:collect`, error records accumulate in
|
|
162
|
+
`reader.errors[:bad_rows]`. See [Bad Row Quarantine](../../bad_row_quarantine.md).
|
|
163
|
+
- **`bad_row_limit: N`**: Raises `SmarterCSV::TooManyBadRows` after N bad rows. Default: `nil`
|
|
164
|
+
(unlimited).
|
|
165
|
+
- **`collect_raw_lines: true`** *(default)*: Include the raw stitched line in bad-row error
|
|
166
|
+
records. Set to `false` for privacy or memory savings.
|
|
167
|
+
- **`field_size_limit: N`**: Maximum size of any extracted field in bytes. Raises
|
|
168
|
+
`SmarterCSV::FieldSizeLimitExceeded` if a field or accumulating multiline buffer exceeds
|
|
169
|
+
the limit. Prevents DoS from runaway quoted fields. See
|
|
170
|
+
[Bad Row Quarantine](../../bad_row_quarantine.md#limiting-field-size-field_size_limit).
|
|
171
|
+
- **`nil_values_matching: regex`**: Set matching values to `nil` via regular expression. With
|
|
172
|
+
`remove_empty_values: true` (default), nil-ified values are removed. With
|
|
173
|
+
`remove_empty_values: false`, the key is retained with a `nil` value. Replaces deprecated
|
|
174
|
+
`remove_values_matching:`.
|
|
175
|
+
- **`missing_headers: :auto`** *(default)*: Auto-generate names for extra columns using
|
|
176
|
+
`missing_header_prefix` (e.g. `column_7`, `column_8`). Use `:raise` to raise
|
|
177
|
+
`HeaderSizeMismatch` instead. Replaces deprecated `strict:`.
|
|
178
|
+
- **`verbose: :quiet / :normal / :debug`**: Symbol-based verbosity levels. `:quiet` suppresses
|
|
179
|
+
all output; `:normal` (default) shows behavioral warnings; `:debug` adds computed options and
|
|
180
|
+
per-row diagnostics to `$stderr`. Replaces deprecated `verbose: true/false`.
|
|
181
|
+
- New Instrumentation Hooks: See [Instrumentation Hooks](../../instrumentation.md).
|
|
182
|
+
- **`on_start: callable`**: Fires once before the first row with
|
|
183
|
+
`{ input:, file_size:, col_sep:, row_sep: }`.
|
|
184
|
+
- **`on_chunk: callable`**: Fires after each chunk (chunked mode only) with
|
|
185
|
+
`{ chunk_number:, rows_in_chunk:, total_rows_so_far: }`.
|
|
186
|
+
- **`on_complete: callable`**: Fires after the file is exhausted with
|
|
187
|
+
`{ total_rows:, total_chunks:, duration:, bad_rows: }`.
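The interaction between `nil_values_matching:` and `remove_empty_values:` described above can be sketched as follows. `apply_nil_values` is a hypothetical helper operating on an already-parsed row hash; it mirrors the documented behavior but is not the library's implementation.

```ruby
# Hypothetical sketch of the documented semantics:
# matching values become nil; nil and empty values are then either
# dropped (remove_empty_values: true) or kept with their key.
def apply_nil_values(row, nil_values_matching:, remove_empty_values: true)
  row.each_with_object({}) do |(key, value), out|
    value = nil if value.is_a?(String) && nil_values_matching.match?(value)
    empty = value.nil? || (value.respond_to?(:empty?) && value.empty?)
    next if remove_empty_values && empty

    out[key] = value
  end
end

parsed  = { name: "Bob", age: "NULL" }
removed = apply_nil_values(parsed, nil_values_matching: /\ANULL\z/)
kept    = apply_nil_values(parsed, nil_values_matching: /\ANULL\z/,
                                   remove_empty_values: false)
```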
|
|
188
|
+
|
|
189
|
+
|
|
190
|
+
**New exceptions:**
|
|
191
|
+
|
|
192
|
+
- **`SmarterCSV::FieldSizeLimitExceeded`**: Raised when `field_size_limit` is exceeded.
|
|
193
|
+
- **`SmarterCSV::TooManyBadRows`**: Raised when `bad_row_limit` is exceeded.
|
|
194
|
+
|
|
195
|
+
**Deprecations:**
|
|
196
|
+
|
|
197
|
+
- `only_headers:` → use `headers: { only: }`
|
|
198
|
+
- `except_headers:` → use `headers: { except: }`
|
|
199
|
+
- `remove_values_matching:` → use `nil_values_matching:`
|
|
200
|
+
- `strict: true` → use `missing_headers: :raise`
|
|
201
|
+
- `strict: false` → use `missing_headers: :auto`
|
|
202
|
+
- `verbose: true` → use `verbose: :debug`
|
|
203
|
+
- `verbose: false` → use `verbose: :normal`
|
|
204
|
+
|
|
205
|
+
### Writer
|
|
206
|
+
|
|
207
|
+
- **IO and StringIO support**: `SmarterCSV.generate` and `SmarterCSV::Writer.new` now accept
|
|
208
|
+
any `IO`-compatible object (responding to `#write`) in addition to a file path or
|
|
209
|
+
`Pathname`. The caller retains ownership of passed-in IO objects.
|
|
210
|
+
- **`SmarterCSV.generate` returns a String when called without a destination**: Omit the file
|
|
211
|
+
argument and the CSV is written to an internal buffer and returned as a `String`. Options
|
|
212
|
+
hash can be passed as the sole argument.
|
|
213
|
+
- **Streaming mode for known headers**: When `headers:` or `map_headers:` is provided at
|
|
214
|
+
construction time, the Writer skips the internal temp file entirely — the header line is
|
|
215
|
+
written immediately and each `<<` streams directly to the output file. No API change;
|
|
216
|
+
existing code benefits automatically. See [The Basic Write API](../../basic_write_api.md).
|
|
217
|
+
- **`encoding:` option**: Specifies the file encoding (e.g. `'UTF-8'`, `'ISO-8859-1'`).
|
|
218
|
+
Supports Ruby's `'external:internal'` transcoding notation. Only applies when writing to a
|
|
219
|
+
file path; ignored for IO objects.
|
|
220
|
+
- **`write_nil_value:` option** *(default: `''`)*: String written in place of `nil` field
|
|
221
|
+
values.
|
|
222
|
+
- **`write_empty_value:` option** *(default: `''`)*: String written in place of empty-string
|
|
223
|
+
field values, including missing keys.
|
|
224
|
+
- **`write_bom:` option** *(default: `false`)*: Prepends a UTF-8 BOM (`\xEF\xBB\xBF`) to the
|
|
225
|
+
output. Useful for Excel compatibility with non-ASCII content.
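To make the three `write_*` options concrete, here is a minimal sketch that writes to a `StringIO`. `sketch_generate` is a hypothetical function, not `SmarterCSV.generate`; it assumes fixed headers and performs no quoting or escaping.

```ruby
require "stringio"

# Hypothetical sketch of the write_bom / write_nil_value / write_empty_value
# semantics described above. No quoting or escaping is performed.
def sketch_generate(rows, headers:, io: StringIO.new(+""), write_bom: false,
                    write_nil_value: "", write_empty_value: "")
  io.write("\uFEFF") if write_bom # U+FEFF encodes to EF BB BF in UTF-8
  io.write(headers.join(",") + "\n")
  rows.each do |row|
    fields = headers.map do |header|
      value = row[header]
      if !row.key?(header) then write_empty_value   # missing key
      elsif value.nil?     then write_nil_value     # explicit nil
      elsif value.to_s.empty? then write_empty_value # empty string
      else value.to_s
      end
    end
    io.write(fields.join(",") + "\n")
  end
  io # the caller keeps ownership of the IO object; it is not closed here
end

rows = [{ a: 1, b: nil }, { a: "" }]
plain = sketch_generate(rows, headers: %i[a b],
                              write_nil_value: "NULL", write_empty_value: "-").string
with_bom = sketch_generate(rows, headers: %i[a b], write_bom: true).string
```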
|
|
226
|
+
|
|
227
|
+
---
|
|
228
|
+
|
|
229
|
+
## Bug Fixes
|
|
230
|
+
|
|
231
|
+
### Reader
|
|
232
|
+
|
|
233
|
+
- **Mid-field quotes no longer corrupt unquoted fields**: `quote_boundary: :standard` (now the
|
|
234
|
+
default) prevents a quote character mid-field (e.g. `b"bb`) from toggling quoted state. This
|
|
235
|
+
silently corrupted rows in 1.15.2 when data contained apostrophes or inch marks.
|
|
236
|
+
- **Unclosed-quote fallback in `:auto` mode**: When backslash mode encounters an unclosed quote
|
|
237
|
+
at EOL, the parser now tries RFC 4180 mode as a fallback before treating the row as multiline.
|
|
238
|
+
- **Empty headers bug fixed** ([#324](https://github.com/tilo/smarter_csv/issues/324),
|
|
239
|
+
[#312](https://github.com/tilo/smarter_csv/issues/312)): CSV files with empty or
|
|
240
|
+
whitespace-only header fields (e.g. `name,,`) now auto-generate column names using
|
|
241
|
+
`missing_header_prefix` (default: `column_1`, `column_2`, …).
|
|
242
|
+
- **All library output now goes to `$stderr`**: Behavioral warnings use `warn` (suppressible
|
|
243
|
+
via `-W0` or `verbose: :quiet`); debug diagnostics use `$stderr.puts`. Nothing is written to
|
|
244
|
+
`$stdout`.
|
|
245
|
+
- **`SmarterCSV.generate` raises `ArgumentError`** (not a blank `RuntimeError`) when called
|
|
246
|
+
without a block.
|
|
247
|
+
|
|
248
|
+
### Writer
|
|
249
|
+
|
|
250
|
+
- **Temp file no longer hardcoded to `/tmp`**: Fixes `Errno::ENOENT` on Windows.
|
|
251
|
+
- **Temp file properly cleaned up**: `Tempfile#close!` now used instead of `Tempfile#delete`,
|
|
252
|
+
ensuring the file is both closed and unlinked.
|
|
253
|
+
- **`StringIO` handling**: Writing to a `StringIO` no longer attempts to close it on
|
|
254
|
+
`finalize`.
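The temp file fixes rely on standard `Tempfile` behavior, which can be checked in isolation. This sketch uses only the Ruby standard library; the `"smarter_csv_sketch"` basename is an arbitrary choice for illustration.

```ruby
require "tempfile"
require "tmpdir"

# Tempfile.new without an explicit directory uses Dir.tmpdir,
# so nothing is hardcoded to /tmp.
tmp = Tempfile.new("smarter_csv_sketch")
path = tmp.path
in_tmpdir = File.dirname(File.realpath(path)) == File.realpath(Dir.tmpdir)

tmp.write("a,b\n1,2\n")

# close! both closes the handle and unlinks the file in one call.
tmp.close!
still_there = File.exist?(path)
```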
|
|
255
|
+
|
|
256
|
+
---
|
|
257
|
+
|
|
258
|
+
## Misc
|
|
259
|
+
|
|
260
|
+
- **`@mapped_keys` changed from `Array` to `Set`**: O(1) lookup per field instead of O(n)
|
|
261
|
+
scan on the `value_converters` key check.
|
|
262
|
+
- **`escape_csv_field` micro-optimizations**: `@escaped_quote_char` precomputed once in
|
|
263
|
+
`initialize`; redundant `.to_s` call removed; row separator appended with `<<` (mutating)
|
|
264
|
+
instead of `+` to save one string allocation per row.
|
|
265
|
+
- **`Reader` includes `Enumerable`**: Enables `map`, `select`, `reject`, `lazy`, and other
|
|
266
|
+
Enumerable methods on `Reader#each` results.
|
|
267
|
+
- **`DEFAULT_CHUNK_SIZE = 100`**: Constant added; warning emitted when `each_chunk` is called
|
|
268
|
+
without explicit `chunk_size`.
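Two of the micro-optimizations above come down to ordinary Ruby idioms, sketched here with made-up variable names: a `Set` gives hash-based membership tests, and `<<` appends to an existing `String` instead of allocating a new one as `+` does.

```ruby
require "set"

# Set membership is a hash lookup rather than a linear scan of an Array.
mapped_keys = Set[:name, :email]
needs_converter = mapped_keys.include?(:email)

# << mutates the receiver; + would build a new String for every row.
line = +"a,b,c" # unary plus returns an unfrozen String
before = line.object_id
line << "\n"
same_object = line.object_id == before
```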
|
|
269
|
+
|
|
270
|
+
---
|
|
271
|
+
|
|
272
|
+
PREVIOUS: [SmarterCSV over the Years](../../history.md) | UP: [README](../../../README.md)
|
|
@@ -0,0 +1,114 @@
|
|
|
1
|
+
# SmarterCSV 1.16.0 — Performance Notes
|
|
2
|
+
|
|
3
|
+
Measured on Apple M1 Pro, Ruby 3.4.7, best of two benchmark sessions (30 runs each).
|
|
4
|
+
See [benchmarks.md](benchmarks.md) for full tables.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## vs Ruby CSV
|
|
9
|
+
|
|
10
|
+
### vs CSV.read (raw tokenization only — no hashes, no post-processing)
|
|
11
|
+
|
|
12
|
+
`CSV.read` is the *fastest* Ruby CSV mode. It returns plain string arrays with no header
|
|
13
|
+
handling, no symbol keys, no numeric conversion. SmarterCSV/C delivers fully processed
|
|
14
|
+
hashes — and still beats it on every single file:
|
|
15
|
+
|
|
16
|
+
| Range | Files |
|
|
17
|
+
|--------------|--------------------------------------------------------------------|
|
|
18
|
+
| **8–9×** | PEOPLE_IMPORT_C (8.1×), uszips (8.6×) |
|
|
19
|
+
| **6–7×** | uscities (6.4×), worldcities (6.3×), embedded_sep (6.0×) |
|
|
20
|
+
| **4–5×** | PEOPLE_IMPORT_NC (4.8×), long_fields (5.5×), many_empty (5.2×), sample_10M (4.3×), utf8 (4.3×) |
|
|
21
|
+
| **3×** | heavy_quoting (3.1×), tab_sep (3.3×), whitespace (3.1×), embedded_newlines (2.8×) |
|
|
22
|
+
| **2–3×** | PEOPLE_IMPORT_B (2.9×), PEOPLE_IMPORT_NB (2.7×), sensor_data (2.2×), multi_char (2.4×) |
|
|
23
|
+
| **~1.7×** | wide_500_cols (1.7×) — most column-heavy file, hash overhead visible |
|
|
24
|
+
|
|
25
|
+
**Summary: 1.7×–8.6× faster than CSV.read, while returning fully processed hashes.**
|
|
26
|
+
|
|
27
|
+
### vs CSV.table (symbol keys + numeric conversion — nearest equivalent output)
|
|
28
|
+
|
|
29
|
+
`CSV.table` is the fairest apples-to-apples comparison: it also produces symbol-keyed
|
|
30
|
+
rows with type conversion applied. SmarterCSV/C is dramatically faster:
|
|
31
|
+
|
|
32
|
+
| Range | Files |
|
|
33
|
+
|----------------|-----------------------------------------------------------------|
|
|
34
|
+
| **100×+** | PEOPLE_IMPORT_C (129×) |
|
|
35
|
+
| **40–50×** | PEOPLE_IMPORT_NC (48×), many_empty (46×), wide_500_cols (41×) |
|
|
36
|
+
| **20–30×** | PEOPLE_IMPORT_B (24×), PEOPLE_IMPORT_NB (26×), uszips (28×), tab_sep (27×), whitespace (24×), sensor_data (24×), utf8 (23×), multi_char (20×), worldcities (20×), sample_10M (20×) |
|
|
37
|
+
| **15–20×** | uscities (21×), long_fields (16×), heavy_quoting (19×), embedded_sep (20×) |
|
|
38
|
+
| **7×** | embedded_newlines (7×) — multiline rows, overhead unavoidable |
|
|
39
|
+
|
|
40
|
+
**Summary: 7×–129× faster than CSV.table.**
|
|
41
|
+
|
|
42
|
+
---
|
|
43
|
+
|
|
44
|
+
## vs SmarterCSV 1.15.2
|
|
45
|
+
|
|
46
|
+
### C path
|
|
47
|
+
|
|
48
|
+
| Gain | Files |
|
|
49
|
+
|--------------|---------------------------------------------------------------------|
|
|
50
|
+
| **2.4×** | long_fields — biggest win; `memchr` skip-ahead in quoted fields |
|
|
51
|
+
| **1.5×** | heavy_quoting — same skip-ahead benefit |
|
|
52
|
+
| **1.4×** | tab_separated |
|
|
53
|
+
| **1.2–1.3×** | embedded_sep, utf8, PEOPLE_IMPORT_C/NC, worldcities, whitespace, multi_char |
|
|
54
|
+
| **1.1–1.2×** | PEOPLE_IMPORT_B/NB, uszips, sample_10M, wide_500_cols |
|
|
55
|
+
| **~1.0×** | sensor_data, embedded_newlines (within noise) |
|
|
56
|
+
|
|
57
|
+
15 of 19 files are measurably faster; 2 within noise; 2 files show a small regression
|
|
58
|
+
(PEOPLE_IMPORT_NB −7%, wide_500_cols −5%) attributable to the new `quote_boundary: :standard`
|
|
59
|
+
default adding one extra state check on the unquoted fast path.
|
|
60
|
+
|
|
61
|
+
### Ruby path
|
|
62
|
+
|
|
63
|
+
| Gain | Files |
|
|
64
|
+
|--------------|---------------------------------------------------------------------|
|
|
65
|
+
| **1.9×** | PEOPLE_IMPORT_C (117 cols) — direct hash construction bypasses intermediate Array |
|
|
66
|
+
| **1.5×** | PEOPLE_IMPORT_NC, multi_char_sep |
|
|
67
|
+
| **1.0–1.1×** | most other files |
|
|
68
|
+
|
|
69
|
+
The Ruby path gains are concentrated on wide/complex files where the direct-hash
|
|
70
|
+
construction optimization (Opt #11) has the most impact.
|
|
71
|
+
|
|
72
|
+
---
|
|
73
|
+
|
|
74
|
+
## vs SmarterCSV 1.14.4
|
|
75
|
+
|
|
76
|
+
C path is **9×–65× faster** across all 19 benchmark files:
|
|
77
|
+
|
|
78
|
+
- Long fields: **65×** (v1.15.0 introduced `memchr` skip-ahead)
|
|
79
|
+
- PEOPLE_IMPORT_C: **48×** (117 cols × 50k rows)
|
|
80
|
+
- PEOPLE_IMPORT_NC, multi_char_sep: **~21–24×**
|
|
81
|
+
- Typical real-world file: **10–20×**
|
|
82
|
+
- Minimum: **9.8×** (uscities, embedded_newlines)
|
|
83
|
+
|
|
84
|
+
---
|
|
85
|
+
|
|
86
|
+
## vs ZSV (C library, GC disabled)
|
|
87
|
+
|
|
88
|
+
ZSV is a dedicated C CSV library with GC disabled during measurement (working around a
|
|
89
|
+
bug in zsv-ruby 1.3.1 on Ruby 3.4.x). Despite this advantage:
|
|
90
|
+
|
|
91
|
+
**SmarterCSV/C beats ZSV+wrapper** (the fair comparison — both return processed hashes)
|
|
92
|
+
on 18 of 19 files, by **2–7×**. ZSV+wrapper is faster only on `embedded_newlines`
|
|
93
|
+
(1.5×), where ZSV's chunked I/O is particularly efficient.
|
|
94
|
+
|
|
95
|
+
**SmarterCSV/C vs ZSV.read** (raw arrays, GC disabled): ZSV.read is faster on most files
|
|
96
|
+
(2–12×), which is expected — it does far less work and has GC disabled. SmarterCSV/C
|
|
97
|
+
matches or beats ZSV.read on PEOPLE_IMPORT_C (the 117-column file) and PEOPLE_IMPORT_NC,
|
|
98
|
+
where our C hash-building overhead is proportionally small.
|
|
99
|
+
|
|
100
|
+
---
|
|
101
|
+
|
|
102
|
+
## column_selection speedup (`headers: { only: }`)
|
|
103
|
+
|
|
104
|
+
When using `headers: { only: [...] }` to select a subset of columns, excluded columns
|
|
105
|
+
are skipped entirely in the C hot path — no string allocation, no conversion, no hash
|
|
106
|
+
insertion. Benchmark on `wide_500_cols_20k.csv` (500 columns):
|
|
107
|
+
|
|
108
|
+
| Columns kept | Speedup vs no selection |
|
|
109
|
+
|---|---|
|
|
110
|
+
| 2 of 500 | ~16× faster |
|
|
111
|
+
| 10 of 500 | ~8× faster |
|
|
112
|
+
| 50 of 500 | ~3× faster |
|
|
113
|
+
|
|
114
|
+
This is additive on top of the baseline gains above.
|
data/docs/row_col_sep.md
CHANGED
|
@@ -2,6 +2,7 @@
|
|
|
2
2
|
### Contents
|
|
3
3
|
|
|
4
4
|
* [Introduction](./_introduction.md)
|
|
5
|
+
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
5
6
|
* [Parsing Strategy](./parsing_strategy.md)
|
|
6
7
|
* [The Basic Read API](./basic_read_api.md)
|
|
7
8
|
* [The Basic Write API](./basic_write_api.md)
|
|
@@ -10,10 +11,17 @@
|
|
|
10
11
|
* [**Row and Column Separators**](./row_col_sep.md)
|
|
11
12
|
* [Header Transformations](./header_transformations.md)
|
|
12
13
|
* [Header Validations](./header_validations.md)
|
|
14
|
+
* [Column Selection](./column_selection.md)
|
|
13
15
|
* [Data Transformations](./data_transformations.md)
|
|
14
16
|
* [Value Converters](./value_converters.md)
|
|
15
|
-
|
|
16
|
-
|
|
17
|
+
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
18
|
+
* [Instrumentation Hooks](./instrumentation.md)
|
|
19
|
+
* [Examples](./examples.md)
|
|
20
|
+
* [Real-World CSV Files](./real_world_csv.md)
|
|
21
|
+
* [SmarterCSV over the Years](./history.md)
|
|
22
|
+
* [Release Notes](./releases/1.16.0/changes.md)
|
|
23
|
+
|
|
24
|
+
--------------
|
|
17
25
|
|
|
18
26
|
# Row and Column Separators
|
|
19
27
|
|
|
@@ -52,7 +60,7 @@ This data format uses CTRL-A as the column separator, and CTRL-B as the record s
|
|
|
52
60
|
```ruby
|
|
53
61
|
filename = '/tmp/itunes_db_dump'
|
|
54
62
|
options = {
|
|
55
|
-
:col_sep => "\cA", :row_sep => "\cB
|
|
63
|
+
:col_sep => "\cA", :row_sep => "\cB", :comment_regexp => /^#/,
|
|
56
64
|
:chunk_size => 100 , :key_mapping => {export_date: nil, name: :genre},
|
|
57
65
|
}
|
|
58
66
|
n = SmarterCSV.process(filename, options) do |chunk|
|
|
@@ -93,7 +101,7 @@ In this example, we use `comment_regexp` to filter out and ignore any lines star
|
|
|
93
101
|
# Consider a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes!)
|
|
94
102
|
filename = '/tmp/strange_db_dump'
|
|
95
103
|
options = {
|
|
96
|
-
:col_sep => "\cA", :row_sep => "\cB
|
|
104
|
+
:col_sep => "\cA", :row_sep => "\cB", :comment_regexp => /^#/,
|
|
97
105
|
:chunk_size => 100 , :key_mapping => {:export_date => nil, :name => :genre},
|
|
98
106
|
}
|
|
99
107
|
n = SmarterCSV.process(filename, options) do |chunk|
|
|
@@ -103,4 +111,5 @@ In this example, we use `comment_regexp` to filter out and ignore any lines star
|
|
|
103
111
|
```
|
|
104
112
|
|
|
105
113
|
----------------
|
|
106
|
-
|
|
114
|
+
|
|
115
|
+
PREVIOUS: [Configuration Options](./options.md) | NEXT: [Header Transformations](./header_transformations.md) | UP: [README](../README.md)
|