smarter_csv 1.15.2 → 1.16.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (50)
  1. checksums.yaml +4 -4
  2. data/.rspec +2 -0
  3. data/.rubocop.yml +9 -0
  4. data/CHANGELOG.md +112 -1
  5. data/CONTRIBUTORS.md +4 -1
  6. data/Gemfile +1 -0
  7. data/README.md +129 -27
  8. data/docs/_introduction.md +45 -24
  9. data/docs/bad_row_quarantine.md +342 -0
  10. data/docs/basic_read_api.md +152 -9
  11. data/docs/basic_write_api.md +475 -59
  12. data/docs/batch_processing.md +162 -4
  13. data/docs/column_selection.md +184 -0
  14. data/docs/data_transformations.md +163 -29
  15. data/docs/examples.md +340 -46
  16. data/docs/header_transformations.md +94 -12
  17. data/docs/header_validations.md +57 -18
  18. data/docs/history.md +119 -0
  19. data/docs/instrumentation.md +166 -0
  20. data/docs/migrating_from_csv.md +565 -0
  21. data/docs/options.md +151 -87
  22. data/docs/parsing_strategy.md +64 -1
  23. data/docs/real_world_csv.md +263 -0
  24. data/docs/releases/1.16.0/benchmarks.md +223 -0
  25. data/docs/releases/1.16.0/changes.md +273 -0
  26. data/docs/releases/1.16.0/performance_notes.md +114 -0
  27. data/docs/row_col_sep.md +15 -5
  28. data/docs/ruby_csv_pitfalls.md +514 -0
  29. data/docs/value_converters.md +194 -57
  30. data/ext/smarter_csv/extconf.rb +3 -0
  31. data/ext/smarter_csv/smarter_csv.c +1017 -82
  32. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
  33. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
  34. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
  35. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
  36. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
  37. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
  38. data/lib/smarter_csv/errors.rb +8 -0
  39. data/lib/smarter_csv/file_io.rb +1 -1
  40. data/lib/smarter_csv/hash_transformations.rb +14 -13
  41. data/lib/smarter_csv/header_transformations.rb +21 -2
  42. data/lib/smarter_csv/headers.rb +2 -1
  43. data/lib/smarter_csv/options.rb +124 -7
  44. data/lib/smarter_csv/parser.rb +358 -74
  45. data/lib/smarter_csv/reader.rb +494 -46
  46. data/lib/smarter_csv/version.rb +1 -1
  47. data/lib/smarter_csv/writer.rb +71 -19
  48. data/lib/smarter_csv.rb +134 -13
  49. data/smarter_csv.gemspec +20 -10
  50. metadata +38 -80
@@ -2,6 +2,8 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
5
7
  * [Parsing Strategy](./parsing_strategy.md)
6
8
  * [The Basic Read API](./basic_read_api.md)
7
9
  * [The Basic Write API](./basic_write_api.md)
@@ -10,10 +12,17 @@
10
12
  * [Row and Column Separators](./row_col_sep.md)
11
13
  * [Header Transformations](./header_transformations.md)
12
14
  * [Header Validations](./header_validations.md)
15
+ * [Column Selection](./column_selection.md)
13
16
  * [Data Transformations](./data_transformations.md)
14
17
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
18
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Instrumentation Hooks](./instrumentation.md)
20
+ * [Examples](./examples.md)
21
+ * [Real-World CSV Files](./real_world_csv.md)
22
+ * [SmarterCSV over the Years](./history.md)
23
+ * [Release Notes](./releases/1.16.0/changes.md)
24
+
25
+ --------------
17
26
 
18
27
  # Batch Processing
19
28
 
@@ -64,7 +73,7 @@ The `process` method returns the number of chunks when called with a block.
64
73
  => 2
65
74
  ```
66
75
 
67
- ## Example 3: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
76
+ ## Example 3: ActiveRecord Bulk Insert in Chunks of 100 records with SmarterCSV:
68
77
  ```ruby
69
78
  # using chunks:
70
79
  filename = '/tmp/some.csv'
@@ -78,5 +87,154 @@ The `process` method returns the number of chunks when called with a block.
78
87
  => returns number of chunks we processed
79
88
  ```
80
89
 
90
+ ---
91
+
92
+ # Modern Batch API — `each_chunk`
93
+
94
+ `Reader#each_chunk` is the modern API for chunked batch processing. It yields `(Array<Hash>, chunk_index)` — the same shape as the `process` block — but returns an `Enumerator` when called without a block, enabling more flexible composition.
95
+
96
+ ## Configuration
97
+
98
+ Set `chunk_size` in options when constructing the Reader. `each_chunk` reads this value automatically:
99
+
100
+ ```ruby
101
+ reader = SmarterCSV::Reader.new('big.csv', chunk_size: 500)
102
+ reader.each_chunk do |chunk, index|
103
+ puts "Processing chunk #{index} (#{chunk.size} rows)"
104
+ MyModel.insert_all(chunk)
105
+ end
106
+ ```
107
+
108
+ If `chunk_size` is not set, `each_chunk` defaults to `SmarterCSV::Reader::DEFAULT_CHUNK_SIZE` (100) and emits a warning to STDERR:
109
+
110
+ ```
111
+ SmarterCSV: chunk_size not set, defaulting to 100. Set chunk_size explicitly to suppress this warning.
112
+ ```
113
+
114
+ Set `chunk_size` explicitly to suppress the warning and choose the right batch size for your use case.
115
+
116
+ ## Simplified form
117
+
118
+ ```ruby
119
+ SmarterCSV.each_chunk('big.csv', chunk_size: 500) do |chunk, index|
120
+ MyModel.insert_all(chunk)
121
+ end
122
+ ```
123
+
124
+ ## Returns an Enumerator when called without a block
125
+
126
+ ```ruby
127
+ reader = SmarterCSV::Reader.new('big.csv', chunk_size: 500)
128
+ reader.each_chunk.with_index do |chunk, index|
129
+ puts "Chunk #{index}: #{chunk.size} rows"
130
+ end
131
+ ```
132
+
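The block-or-Enumerator behaviour described above is a common Ruby idiom. The following is a minimal sketch of that idiom only, assuming a hypothetical `ChunkedRows` class over an in-memory array; it is not SmarterCSV's implementation and does not read a file.

```ruby
# Hypothetical stand-in for a chunked reader (not SmarterCSV's implementation).
class ChunkedRows
  def initialize(rows, chunk_size:)
    @rows = rows
    @chunk_size = chunk_size
  end

  # Yields (chunk, index); returns an Enumerator when no block is given.
  def each_chunk
    return enum_for(:each_chunk) unless block_given?

    @rows.each_slice(@chunk_size).with_index do |chunk, index|
      yield chunk, index
    end
  end
end

rows = (1..5).map { |i| { id: i } }
reader = ChunkedRows.new(rows, chunk_size: 2)

# The Enumerator form composes with Enumerable methods such as first and map.
first_two   = reader.each_chunk.first(2)
chunk_sizes = reader.each_chunk.map { |chunk, _index| chunk.size }
```

Because the method yields two values, Enumerable methods that collect results (such as `first` or `to_a`) return `[chunk, index]` pairs.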
133
+ ## Example: Sidekiq parallel import
134
+
135
+ ```ruby
136
+ reader = SmarterCSV::Reader.new('users.csv', chunk_size: 100)
137
+ reader.each_chunk do |chunk, index|
138
+ ImportWorker.perform_async(chunk)
139
+ end
140
+ ```
141
+
142
+ ## Example: Resque parallel import
143
+
144
+ ```ruby
145
+ reader = SmarterCSV::Reader.new('orders.csv', chunk_size: 200)
146
+ reader.each_chunk do |chunk, index|
147
+ Resque.enqueue(OrderImportJob, chunk)
148
+ end
149
+ ```
150
+
151
+ ## Example: ActiveRecord `insert_all` bulk insert
152
+
153
+ ```ruby
154
+ reader = SmarterCSV::Reader.new('products.csv', chunk_size: 500)
155
+ reader.each_chunk do |chunk, _index|
156
+ MyModel.insert_all(chunk)
157
+ end
158
+ ```
159
+
160
+ ## Example: Progress tracking
161
+
162
+ ```ruby
163
+ reader = SmarterCSV::Reader.new('big.csv', chunk_size: 1_000)
164
+ total = File.foreach('big.csv').count - 1 # subtract header row
165
+
166
+ reader.each_chunk do |chunk, index|
167
+ processed = [(index + 1) * 1_000, total].min
168
+ puts "#{processed}/#{total} rows processed"
169
+ MyModel.insert_all(chunk)
170
+ end
171
+ ```
172
+
173
+ ## Interaction with `on_bad_row`
174
+
175
+ `each_chunk` respects all `on_bad_row` options. Bad rows are excluded from chunks and counted or routed to your handler:
176
+
177
+ ```ruby
178
+ reader = SmarterCSV::Reader.new('data.csv',
179
+ chunk_size: 500,
180
+ on_bad_row: :collect,
181
+ )
182
+ reader.each_chunk do |chunk, index|
183
+ MyModel.insert_all(chunk)
184
+ end
185
+ puts "Bad rows: #{reader.errors[:bad_row_count]}"
186
+ reader.errors[:bad_rows].each { |rec| puts "Line #{rec[:csv_line_number]}: #{rec[:error_message]}" }
187
+ ```
188
+
189
+ See [Bad Row Quarantine](./bad_row_quarantine.md) for full details.
190
+
191
+ ## Example: DynamoDB batch write
192
+
193
+ DynamoDB's `batch_write_item` API accepts up to **25 items per request** — making
194
+ `chunk_size: 25` the natural fit. SmarterCSV symbol keys map directly to DynamoDB
195
+ attribute names after a simple `transform_keys(&:to_s)` call.
196
+
197
+ ```ruby
198
+ require 'aws-sdk-dynamodb'
199
+
200
+ client = Aws::DynamoDB::Client.new(region: 'us-east-1')
201
+
202
+ SmarterCSV::Reader.new('products.csv', chunk_size: 25).each_chunk do |chunk, _index|
203
+ client.batch_write_item(
204
+ request_items: {
205
+ 'ProductsTable' => chunk.map do |row|
206
+ { put_request: { item: row.transform_keys(&:to_s) } }
207
+ end
208
+ }
209
+ )
210
+ end
211
+ ```
212
+
213
+ ## Example: Reading a CSV from S3
214
+
215
+ SmarterCSV accepts any IO-like object, so you can stream directly from S3 without
216
+ writing a temp file:
217
+
218
+ ```ruby
219
+ require 'aws-sdk-s3'
220
+
221
+ s3 = Aws::S3::Client.new(region: 'us-east-1')
222
+ obj = s3.get_object(bucket: 'my-bucket', key: 'imports/products.csv')
223
+
224
+ data = SmarterCSV.process(obj.body)
225
+ MyModel.insert_all(data)
226
+ ```
227
+
228
+ For large files, combine with chunked processing:
229
+
230
+ ```ruby
231
+ obj = s3.get_object(bucket: 'my-bucket', key: 'imports/big.csv')
232
+
233
+ SmarterCSV::Reader.new(obj.body, chunk_size: 500).each_chunk do |chunk, _index|
234
+ MyModel.insert_all(chunk)
235
+ end
236
+ ```
237
+
81
238
  ----------------
82
- PREVIOUS: [The Basic Write API](./basic_write_api.md) | NEXT: [Configuration Options](./options.md)
239
+
240
+ PREVIOUS: [The Basic Write API](./basic_write_api.md) | NEXT: [Configuration Options](./options.md) | UP: [README](../README.md)
@@ -0,0 +1,184 @@
1
+
2
+ ### Contents
3
+
4
+ * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
7
+ * [Parsing Strategy](./parsing_strategy.md)
8
+ * [The Basic Read API](./basic_read_api.md)
9
+ * [The Basic Write API](./basic_write_api.md)
10
+ * [Batch Processing](./batch_processing.md)
11
+ * [Configuration Options](./options.md)
12
+ * [Row and Column Separators](./row_col_sep.md)
13
+ * [Header Transformations](./header_transformations.md)
14
+ * [Header Validations](./header_validations.md)
15
+ * [**Column Selection**](./column_selection.md)
16
+ * [Data Transformations](./data_transformations.md)
17
+ * [Value Converters](./value_converters.md)
18
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Instrumentation Hooks](./instrumentation.md)
20
+ * [Examples](./examples.md)
21
+ * [Real-World CSV Files](./real_world_csv.md)
22
+ * [SmarterCSV over the Years](./history.md)
23
+ * [Release Notes](./releases/1.16.0/changes.md)
24
+
25
+ --------------
26
+
27
+ # Column Selection
28
+
29
+ Wide CSV files often contain dozens or hundreds of columns, but a given application typically
30
+ only needs a handful of them. The `headers: { only: }` and `headers: { except: }` options let
31
+ you declare upfront which columns you want, so SmarterCSV skips allocation and hash insertion
32
+ for everything else — both in the Ruby path and in the C-accelerated hot path.
33
+
34
+ ## Options
35
+
36
+ | Option | Default | Description |
37
+ |--------|---------|-------------|
38
+ | `headers: { only: }` | `nil` | Keep only the listed columns in each result hash |
39
+ | `headers: { except: }` | `nil` | Remove the listed columns from each result hash |
40
+
41
+ You cannot use both options at the same time — doing so raises `SmarterCSV::ValidationError`.
42
+
43
+ ## Basic usage
44
+
45
+ ```ruby
46
+ # Keep only two columns out of a wide file
47
+ data = SmarterCSV.process('big.csv', headers: { only: [:id, :email] })
48
+ # => [{id: 1, email: "alice@example.com"}, ...]
49
+
50
+ # Keep everything except one noisy column
51
+ data = SmarterCSV.process('big.csv', headers: { except: [:internal_notes] })
52
+ ```
53
+
54
+ ## Input flexibility
55
+
56
+ Both options accept an Array of symbols or strings, or a single symbol or string — anything
57
+ that makes sense as a column name. All values are normalized to symbols internally.
58
+
59
+ ```ruby
60
+ headers: { only: :id } # single symbol — same as [:id]
61
+ headers: { only: 'id' } # single string — normalized to :id
62
+ headers: { only: [:id, :email] } # array of symbols
63
+ headers: { only: ['id', 'email'] } # array of strings — normalized to symbols
64
+ ```
65
+
66
+ ## Names refer to post-mapping keys
67
+
68
+ `headers: { only: }` and `headers: { except: }` use the **post-mapping** column name — the
69
+ symbol that actually appears in the result hash after `key_mapping:` has been applied. You
70
+ never need to know the original CSV header spelling.
71
+
72
+ ```ruby
73
+ # CSV has header "First Name"; key_mapping renames it to :given_name
74
+ data = SmarterCSV.process('contacts.csv',
75
+ key_mapping: { first_name: :given_name },
76
+ headers: { only: [:given_name] }, # post-mapping name
77
+ )
78
+ # => [{given_name: "Alice"}, ...]
79
+ ```
80
+
81
+ ## Interaction with `with_line_numbers:`
82
+
83
+ `:csv_line_number` is added to each hash **after** column selection runs, so it is always
84
+ present when `with_line_numbers: true` — even if it is not listed in `headers: { only: }`.
85
+
86
+ ```ruby
87
+ data = SmarterCSV.process('data.csv',
88
+ headers: { only: [:name] },
89
+ with_line_numbers: true,
90
+ )
91
+ data.each { |row| puts "#{row[:csv_line_number]}: #{row[:name]}" }
92
+ ```
93
+
94
+ ## Interaction with `strict:`
95
+
96
+ `strict: true` raises `SmarterCSV::HeaderSizeMismatch` when a data row contains more fields
97
+ than the header row. This check runs **before** column selection, so schema validation still
98
+ catches malformed rows even when `headers: { only: }` is active.
99
+
100
+ ```ruby
101
+ # Raises HeaderSizeMismatch on the row with extra fields, regardless of headers: { only: }
102
+ SmarterCSV.process('data.csv', headers: { only: [:name] }, strict: true)
103
+ ```
104
+
105
+ ## Extra columns without `strict:`
106
+
107
+ When `strict:` is false (the default) and a data row has more fields than the header,
108
+ the extra columns are silently dropped — they cannot be in the `headers: { only: }` set, so
109
+ the filter discards them naturally.
110
+
111
+ > **Important:** `missing_headers: :auto` (auto-generating names like `column_7`,
112
+ > `column_8` for extra data columns) does **not** work in combination with `headers: { only: }`.
113
+ > `headers: { only: }` is a **performance improvement** that causes the parser to stop scanning
114
+ > a row once all requested columns have been found — any extra columns beyond the header
115
+ > count are never visited, so no auto-names are generated for them. If you need to capture
116
+ > auto-named overflow columns, do not use `headers: { only: }` at the same time.
117
+
118
+ ## Unknown column names are silently ignored
119
+
120
+ Listing a column name that doesn't exist in the file is not an error. The column simply
121
+ never appears in any row hash.
122
+
123
+ ```ruby
124
+ # :nonexistent_column is not in the file — no error, just absent from results
125
+ data = SmarterCSV.process('data.csv', headers: { only: [:id, :nonexistent_column] })
126
+ ```
127
+
128
+ ## Performance
129
+
130
+ Both options are implemented in the C extension (when acceleration is enabled). Excluded
131
+ columns are skipped entirely inside the C parsing loop — no Ruby string is allocated, no
132
+ numeric conversion runs, and no `rb_hash_aset` call is made for fields the caller doesn't
133
+ need. This makes column selection a genuine performance option for wide CSV files, not just
134
+ a post-processing filter.
135
+
136
+ The Ruby fallback path applies the same filter via `hash.select!` / `hash.reject!` after
137
+ parsing, giving correct results on all platforms.
138
+
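As a rough illustration of the post-parse filter described above, the sketch below applies `only`/`except` semantics to a single row hash. The helper name `select_columns` and the use of `ArgumentError` are assumptions made for this example; the gem itself is documented as raising `SmarterCSV::ValidationError` when both options are given.

```ruby
# Illustrative helper, not part of SmarterCSV: filter one row hash by column name.
def select_columns(row, only: nil, except: nil)
  raise ArgumentError, 'use only: or except:, not both' if only && except

  # Accept a single name or an array, given as strings or symbols.
  normalize = ->(names) { Array(names).map(&:to_sym) }

  if only
    keep = normalize.call(only)
    row.select { |key, _| keep.include?(key) }
  elsif except
    drop = normalize.call(except)
    row.reject { |key, _| drop.include?(key) }
  else
    row
  end
end

row = { id: 1, email: 'alice@example.com', internal_notes: 'vip' }

only_result   = select_columns(row, only: %w[id email])
except_result = select_columns(row, except: :internal_notes)
unknown       = select_columns(row, only: [:id, :nonexistent_column])
```

Unknown names simply never match a key, which mirrors the "silently ignored" behaviour described earlier.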
139
+ ### `headers: { only: }` vs `headers: { except: }` — performance asymmetry
140
+
141
+ **`headers: { only: }` enables early exit.** Once every requested column has been parsed,
142
+ the parser stops scanning the current row entirely — the remaining fields are never visited.
143
+ For a 500-column file where you only need 5 columns near the start, this can be
144
+ **10–14× faster** than parsing all columns.
145
+
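The early-exit idea can be sketched in plain Ruby. This is a simplified model, assuming a naive comma-separated line with no quoting; the real parser lives in the C extension and handles far more cases. `scan_only` is a hypothetical name used only for this illustration.

```ruby
# Simplified model of early exit: stop scanning once every wanted column is found.
# Assumes plain comma-separated fields with no quoting or escaping.
def scan_only(line, headers, only)
  remaining = only.size
  row = {}
  pos = 0
  visited = 0

  headers.each do |key|
    break if remaining.zero? || pos > line.length

    stop = line.index(',', pos) || line.length
    visited += 1
    if only.include?(key)
      row[key] = line[pos...stop]
      remaining -= 1
    end
    pos = stop + 1
  end

  [row, visited]
end

headers = %i[id email name notes extra]
line = '1,a@example.com,Alice,hello,x'

row, visited = scan_only(line, headers, %i[id email])
```

Here only two of the five fields are visited. When the wanted columns sit near the end of the row, the scan still has to walk past everything before them.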
146
+ **`headers: { except: }` cannot have early exit.** To know which columns to *keep*, the
147
+ parser must scan every field in the row to the end. Skipping just a few columns out of many
148
+ saves very little work, so benchmark results for `headers: { except: }` are typically flat
149
+ (0.7×–1.0× vs full parse).
150
+
151
+ **Rule of thumb:**
152
+ - Use `headers: { only: }` when you want a small subset of a wide file — this is the fast path.
153
+ - Use `headers: { except: }` only when you want *almost everything* and excluding a known
154
+ noisy column is more convenient than listing all the ones you want.
155
+ - Avoid `headers: { except: }` as a performance tool on wide files — it provides no speed benefit.
156
+
157
+ ### `headers: { only: }` vs `remove_unmapped_keys:`
158
+
159
+ If you are already using `key_mapping:` to rename headers, the `remove_unmapped_keys: true`
160
+ option lets you implicitly drop everything not in the map — without listing each unwanted
161
+ column explicitly. This is a convenient alternative to `headers: { only: }` when renaming
162
+ and selecting go hand in hand:
163
+
164
+ ```ruby
165
+ # With key_mapping + remove_unmapped_keys: convenient when renaming
166
+ SmarterCSV.process('data.csv',
167
+ key_mapping: { col_a: :name, col_b: :email },
168
+ remove_unmapped_keys: true,
169
+ )
170
+
171
+ # With headers: { only: }: better for pure selection — C-path early exit applies
172
+ SmarterCSV.process('data.csv',
173
+ headers: { only: [:col_a, :col_b] },
174
+ )
175
+ ```
176
+
177
+ `headers: { only: }` is the faster choice for wide files since unneeded fields are skipped
178
+ inside the C parser before any Ruby objects are created. `remove_unmapped_keys:` is a
179
+ post-parse filter — all fields are parsed first, then the unwanted keys are deleted.
180
+ See [Header Transformations](./header_transformations.md#key-mapping) for more details.
181
+
182
+ ---
183
+
184
+ PREVIOUS: [Header Validations](./header_validations.md) | NEXT: [Data Transformations](./data_transformations.md) | UP: [README](../README.md)
@@ -2,6 +2,8 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
5
7
  * [Parsing Strategy](./parsing_strategy.md)
6
8
  * [The Basic Read API](./basic_read_api.md)
7
9
  * [The Basic Write API](./basic_write_api.md)
@@ -10,52 +12,184 @@
10
12
  * [Row and Column Separators](./row_col_sep.md)
11
13
  * [Header Transformations](./header_transformations.md)
12
14
  * [Header Validations](./header_validations.md)
15
+ * [Column Selection](./column_selection.md)
13
16
  * [**Data Transformations**](./data_transformations.md)
14
17
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
18
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Instrumentation Hooks](./instrumentation.md)
20
+ * [Examples](./examples.md)
21
+ * [Real-World CSV Files](./real_world_csv.md)
22
+ * [SmarterCSV over the Years](./history.md)
23
+ * [Release Notes](./releases/1.16.0/changes.md)
24
+
25
+ --------------
17
26
 
18
27
  # Data Transformations
19
28
 
20
- SmarterCSV automatically transforms the values in each colum in order to normalize the data.
21
- This behavior can be customized or disabled.
29
+ SmarterCSV automatically normalizes the values in each row. All transformations are configurable — most are enabled by default because they're the right behavior for the vast majority of CSV files.
30
+
31
+ ## Transformation Pipeline
32
+
33
+ Transformations run in this order for every row:
34
+
35
+ | Step | Option | Default | What it does |
36
+ |------|--------|---------|--------------|
37
+ | 1 | `strip_whitespace` | `true` | Strips leading/trailing whitespace from all values (and headers) at parse time |
38
+ | 2 | `nil_values_matching` | `nil` | Sets values matching the regexp to `nil` |
39
+ | 3 | `remove_empty_values` | `true` | Removes keys whose value is `nil` or blank |
40
+ | 4 | `remove_zero_values` | `false` | Removes keys whose value is numeric zero |
41
+ | 5 | `convert_values_to_numeric` | `true` | Converts numeric-looking strings to `Integer` or `Float` |
42
+ | 6 | `value_converters` | `nil` | Applies per-key custom converter lambdas or classes |
43
+ | 7 | `remove_empty_hashes` | `true` | Drops rows that are entirely empty after all transformations |
44
+
45
+ > Steps 2–6 run per field in order. `value_converters` receive the value **after** numeric conversion — guard against receiving `Integer`/`Float` if your converter expects a string.
46
+
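To make the ordering concrete, here is a minimal pure-Ruby sketch of the per-field steps applied to one row. It is an approximation under stated assumptions, not the gem's code: `transform_row` is a hypothetical helper, `remove_zero_values` is left out because it is off by default, and the numeric detection uses two simple regular expressions.

```ruby
# Hypothetical per-row pipeline mirroring the documented order of steps.
def transform_row(row, nil_matching: nil, converters: {})
  row.each_with_object({}) do |(key, raw), out|
    value = raw.is_a?(String) ? raw.strip : raw                       # strip_whitespace
    value = nil if nil_matching && value.is_a?(String) && value.match?(nil_matching) # nil_values_matching
    next if value.nil? || value == ''                                 # remove_empty_values

    value = case value                                                # convert_values_to_numeric
            when /\A[+-]?\d+\z/      then value.to_i
            when /\A[+-]?\d+\.\d+\z/ then value.to_f
            else value
            end

    value = converters[key].call(value) if converters.key?(key)       # value_converters
    out[key] = value
  end
end

row = { name: ' Alice ', score: ' 42', notes: 'NULL', zip: '' }
result = transform_row(
  row,
  nil_matching: /\ANULL\z/i,
  converters: { score: ->(v) { v * 2 } } # receives an Integer, not a String
)
```

The `score` converter doubles the value arithmetically, which only works because numeric conversion has already run by the time the converter is called.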
47
+ ---
22
48
 
23
- ## Remove Empty Values
24
- `remove_empty_values` is enabled by default
25
- It removes any values which are `nil` or would be empty strings.
49
+ ## `strip_whitespace`
26
50
 
27
- ## Convert Values to Numeric
28
- `convert_values_to_numeric` is enabled by default.
29
- SmarterCSV will convert strings containing Integers or Floats to the appropriate class.
51
+ **Default: `true`**
30
52
 
31
- Here is an example of using `convert_values_to_numeric` for numbers with leading zeros, e.g. ZIP codes:
53
+ Strips leading and trailing whitespace from all header names and all field values at parse time, before any other transformation runs.
32
54
 
55
+ ```ruby
56
+ # CSV with padded values:
57
+ # name, score
58
+ # Alice , 42
59
+ # Bob , 0
60
+
61
+ data = SmarterCSV.process(file)
62
+ # => [{name: "Alice", score: 42}, {name: "Bob", score: 0}]
63
+ # ↑ "Alice " stripped to "Alice", " 42" stripped to "42" then converted
64
+
65
+ data = SmarterCSV.process(file, strip_whitespace: false)
66
+ # => [{"name"=>"Alice ", " score"=>" 42"}, ...]
67
+ # ↑ whitespace preserved in both headers and values
33
68
  ```
34
- data = SmarterCSV.process('/tmp/zip.csv', convert_values_to_numeric: { except: [:zip] })
35
- => [{:zip=>"00480"}, {:zip=>"51903"}, {:zip=>"12354"}, {:zip=>"02343"}]
36
- ```
37
69
 
38
- This will return the column `:zip` as a string with all digits intact.
70
+ ---
71
+
72
+ ## `nil_values_matching`
73
+
74
+ **Default: `nil` (disabled)**
75
+
76
+ Set values matching the given regular expression to `nil`. Combined with the default `remove_empty_values: true`, matching values are removed from the result hash. With `remove_empty_values: false`, the key is retained with a `nil` value — useful when you need to distinguish "field was absent" from "field had a sentinel value".
77
+
78
+ ```ruby
79
+ # Treat common null sentinels as nil and remove them
80
+ data = SmarterCSV.process(file, nil_values_matching: /\A(NULL|N\/A|NA|#N\/A|\(null\))\z/i)
39
81
 
40
- ## Remove Zero Values
41
- `remove_zero_values` is disabled by default.
42
- When enabled, it removes key/value pairs which have a numeric value equal to zero.
82
+ # Nil-ify but retain the key (don't remove)
83
+ data = SmarterCSV.process(file,
84
+ nil_values_matching: /\A(NULL|N\/A)\z/i,
85
+ remove_empty_values: false)
86
+ # => [{name: "Alice", score: nil}] ← key retained with nil value
87
+
88
+ # Remove Excel error values
89
+ data = SmarterCSV.process(file, nil_values_matching: /\A(#VALUE!|#REF!|#DIV\/0!|NaN)\z/)
90
+ ```
91
+
92
+ > **Deprecated:** `remove_values_matching:` still works but emits a deprecation warning.
93
+ > Use `nil_values_matching:` instead.
94
+
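Because the pattern is anchored with `\A` and `\z`, it only matches when the whole value is a sentinel. A quick standalone check of the first regular expression from the example above, using made-up sample values:

```ruby
NULL_SENTINELS = /\A(NULL|N\/A|NA|#N\/A|\(null\))\z/i

samples = ['NULL', 'n/a', 'NA', '#N/A', '(null)', 'NaN', 'Alice', 'NULL value']

matched = samples.grep(NULL_SENTINELS)   # values that would become nil
kept    = samples.grep_v(NULL_SENTINELS) # values left untouched
```

Values such as `NaN` or `NULL value` are kept, since they merely contain a sentinel rather than consisting of one.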
95
+ ---
96
+
97
+ ## `remove_empty_values`
98
+
99
+ **Default: `true`**
100
+
101
+ Removes key/value pairs where the value is `nil` or an empty string after `strip_whitespace` and `nil_values_matching` have run. This is why SmarterCSV result hashes only contain keys with actual values — sparse CSV rows don't produce hashes cluttered with `nil` entries.
102
+
103
+ ```ruby
104
+ # CSV: name,score,notes
105
+ # Alice,42,
106
+ # Bob,,great player
107
+
108
+ data = SmarterCSV.process(file)
109
+ # => [{name: "Alice", score: 42}, {name: "Bob", notes: "great player"}]
110
+ # ↑ empty :notes and :score keys are dropped automatically
111
+
112
+ data = SmarterCSV.process(file, remove_empty_values: false)
113
+ # => [{name: "Alice", score: 42, notes: nil}, {name: "Bob", score: nil, notes: "great player"}]
114
+ ```
43
115
 
44
- ## Remove Values Matching
45
- `remove_values_matching` is disabled by default.
46
- When enabled, this can help removing key/value pairs from result hashes which would cause problems.
116
+ ---
47
117
 
48
- e.g.
49
- * `remove_values_matching: /^\$0\.0+$/` would remove $0.00
50
- * `remove_values_matching: /^#VALUE!$/` would remove errors from Excel spreadsheets
118
+ ## `remove_zero_values`
51
119
 
52
- ## Empty Hashes
120
+ **Default: `false`**
53
121
 
54
- It can happen that after all transformations, a row of the CSV file would produce a completely empty hash.
122
+ When enabled, removes key/value pairs where the value is numeric zero (`0`, `0.0`, `"0"`, `"0.0"`). Useful when zero and absent mean the same thing in your domain.
55
123
 
56
- By default SmarterCSV uses `remove_empty_hashes: true` to remove these empty hashes from the result.
124
+ ```ruby
125
+ # CSV: product,quantity,discount
126
+ # Widget,10,0
127
+ # Gadget,0,5
57
128
 
58
- This can be set to `false`, to keep these empty hashes in the results.
129
+ data = SmarterCSV.process(file, remove_zero_values: true)
130
+ # => [{product: "Widget", quantity: 10}, {product: "Gadget", discount: 5}]
131
+ # ↑ :discount=>0 and :quantity=>0 removed
132
+ ```
133
+
134
+ ---
135
+
136
+ ## `convert_values_to_numeric`
137
+
138
+ **Default: `true`**
139
+
140
+ Converts string values that look like integers or floats to the appropriate numeric type. This is one of the most common sources of silent data loss if not configured carefully — fields like ZIP codes, phone numbers, and account numbers with leading zeros will be silently corrupted if not excluded.
141
+
142
+ ```ruby
143
+ data = SmarterCSV.process(file)
144
+ # "42" => 42 (Integer)
145
+ # "3.14" => 3.14 (Float)
146
+ # "01234" => 1234 ← leading zero lost! exclude this column
147
+
148
+ # Exclude specific columns from numeric conversion
149
+ data = SmarterCSV.process(file,
150
+ convert_values_to_numeric: { except: [:zip, :phone, :account_number] })
151
+ # => [{zip: "01234", phone: "800-555-0100", amount: 99.99}]
152
+
153
+ # Only convert specific columns (all others stay as strings)
154
+ data = SmarterCSV.process(file,
155
+ convert_values_to_numeric: { only: [:quantity, :price] })
156
+ ```
157
+
158
+ ---
159
+
160
+ ## `remove_empty_hashes`
161
+
162
+ **Default: `true`**
163
+
164
+ After all per-field transformations, removes rows that have no remaining key/value pairs. This handles blank lines and rows where every field was empty or matched `nil_values_matching`.
165
+
166
+ ```ruby
167
+ # CSV with a blank line between records:
168
+ # name,score
169
+ # Alice,42
170
+ #
171
+ # Bob,99
172
+
173
+ data = SmarterCSV.process(file)
174
+ # => [{name: "Alice", score: 42}, {name: "Bob", score: 99}]
175
+ # ↑ blank line silently dropped
176
+
177
+ data = SmarterCSV.process(file, remove_empty_hashes: false)
178
+ # => [{name: "Alice", score: 42}, {}, {name: "Bob", score: 99}]
179
+ ```
180
+
181
+ ---
182
+
183
+ ## Custom Transformations — `value_converters`
184
+
185
+ For type conversions beyond numeric (dates, booleans, currency, etc.), use `value_converters`. They run last among the per-field transformations, after numeric conversion. See [Value Converters](./value_converters.md) for full documentation.
186
+
187
+ ```ruby
188
+ data = SmarterCSV.process(file, value_converters: {
189
+ date: ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil },
190
+ active: ->(v) { v&.match?(/\Atrue\z/i) },
191
+ })
192
+ ```
59
193
 
60
194
  -------------------
61
- PREVIOUS: [Header Validations](./header_validations.md) | NEXT: [Value Converters](./value_converters.md)
195
+ PREVIOUS: [Column Selection](./column_selection.md) | NEXT: [Value Converters](./value_converters.md) | UP: [README](../README.md)