smarter_csv 1.15.2 → 1.16.0

Files changed (48)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +9 -0
  3. data/CHANGELOG.md +68 -1
  4. data/CONTRIBUTORS.md +3 -1
  5. data/Gemfile +1 -0
  6. data/README.md +123 -27
  7. data/docs/_introduction.md +40 -24
  8. data/docs/bad_row_quarantine.md +285 -0
  9. data/docs/basic_read_api.md +151 -9
  10. data/docs/basic_write_api.md +474 -59
  11. data/docs/batch_processing.md +161 -4
  12. data/docs/column_selection.md +183 -0
  13. data/docs/data_transformations.md +162 -29
  14. data/docs/examples.md +339 -46
  15. data/docs/header_transformations.md +93 -12
  16. data/docs/header_validations.md +56 -18
  17. data/docs/history.md +117 -0
  18. data/docs/instrumentation.md +165 -0
  19. data/docs/migrating_from_csv.md +290 -0
  20. data/docs/options.md +150 -87
  21. data/docs/parsing_strategy.md +63 -1
  22. data/docs/real_world_csv.md +262 -0
  23. data/docs/releases/1.16.0/benchmarks.md +223 -0
  24. data/docs/releases/1.16.0/changes.md +272 -0
  25. data/docs/releases/1.16.0/performance_notes.md +114 -0
  26. data/docs/row_col_sep.md +14 -5
  27. data/docs/value_converters.md +193 -57
  28. data/ext/smarter_csv/extconf.rb +3 -0
  29. data/ext/smarter_csv/smarter_csv.c +1007 -71
  30. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
  31. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
  32. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
  33. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
  34. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
  35. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
  36. data/lib/smarter_csv/errors.rb +8 -0
  37. data/lib/smarter_csv/file_io.rb +1 -1
  38. data/lib/smarter_csv/hash_transformations.rb +14 -13
  39. data/lib/smarter_csv/header_transformations.rb +21 -2
  40. data/lib/smarter_csv/headers.rb +2 -1
  41. data/lib/smarter_csv/options.rb +124 -7
  42. data/lib/smarter_csv/parser.rb +362 -75
  43. data/lib/smarter_csv/reader.rb +494 -46
  44. data/lib/smarter_csv/version.rb +1 -1
  45. data/lib/smarter_csv/writer.rb +71 -19
  46. data/lib/smarter_csv.rb +95 -12
  47. data/smarter_csv.gemspec +20 -10
  48. metadata +37 -80
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
5
6
  * [Parsing Strategy](./parsing_strategy.md)
6
7
  * [The Basic Read API](./basic_read_api.md)
7
8
  * [The Basic Write API](./basic_write_api.md)
@@ -10,10 +11,17 @@
10
11
  * [Row and Column Separators](./row_col_sep.md)
11
12
  * [Header Transformations](./header_transformations.md)
12
13
  * [Header Validations](./header_validations.md)
14
+ * [Column Selection](./column_selection.md)
13
15
  * [Data Transformations](./data_transformations.md)
14
16
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
17
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
18
+ * [Instrumentation Hooks](./instrumentation.md)
19
+ * [Examples](./examples.md)
20
+ * [Real-World CSV Files](./real_world_csv.md)
21
+ * [SmarterCSV over the Years](./history.md)
22
+ * [Release Notes](./releases/1.16.0/changes.md)
23
+
24
+ --------------
17
25
 
18
26
  # Batch Processing
19
27
 
@@ -64,7 +72,7 @@ The `process` method returns the number of chunks when called with a block.
64
72
  => 2
65
73
  ```
66
74
 
67
- ## Example 3: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
75
+ ## Example 3: ActiveRecord Bulk Insert in Chunks of 100 records with SmarterCSV:
68
76
  ```ruby
69
77
  # using chunks:
70
78
  filename = '/tmp/some.csv'
@@ -78,5 +86,154 @@ The `process` method returns the number of chunks when called with a block.
78
86
  => returns number of chunks we processed
79
87
  ```
80
88
 
89
+ ---
90
+
91
+ # Modern Batch API — `each_chunk`
92
+
93
+ `Reader#each_chunk` is the modern API for chunked batch processing. It yields `(Array<Hash>, chunk_index)` — the same shape as the `process` block — but returns an `Enumerator` when called without a block, enabling more flexible composition.
94
+
95
+ ## Configuration
96
+
97
+ Set `chunk_size` in options when constructing the Reader. `each_chunk` reads this value automatically:
98
+
99
+ ```ruby
100
+ reader = SmarterCSV::Reader.new('big.csv', chunk_size: 500)
101
+ reader.each_chunk do |chunk, index|
102
+ puts "Processing chunk #{index} (#{chunk.size} rows)"
103
+ MyModel.insert_all(chunk)
104
+ end
105
+ ```
106
+
107
+ If `chunk_size` is not set, `each_chunk` defaults to `SmarterCSV::Reader::DEFAULT_CHUNK_SIZE` (100) and emits a warning to STDERR:
108
+
109
+ ```
110
+ SmarterCSV: chunk_size not set, defaulting to 100. Set chunk_size explicitly to suppress this warning.
111
+ ```
112
+
113
+ Set `chunk_size` explicitly to suppress the warning and choose the right batch size for your use case.
114
+
115
+ ## Simplified form
116
+
117
+ ```ruby
118
+ SmarterCSV.each_chunk('big.csv', chunk_size: 500) do |chunk, index|
119
+ MyModel.insert_all(chunk)
120
+ end
121
+ ```
122
+
123
+ ## Returns an Enumerator when called without a block
124
+
125
+ ```ruby
126
+ reader = SmarterCSV::Reader.new('big.csv', chunk_size: 500)
127
+ reader.each_chunk.with_index do |chunk, index|
128
+ puts "Chunk #{index}: #{chunk.size} rows"
129
+ end
130
+ ```
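The block-or-Enumerator behavior described above is a common Ruby idiom. The following is a minimal sketch of that idiom using only the standard library; `each_chunk_sketch` is a hypothetical helper written for illustration and is not SmarterCSV's implementation.

```ruby
require 'csv'
require 'stringio'

# Hypothetical stand-in for a chunked reader (an assumption, not the gem's code).
def each_chunk_sketch(io, chunk_size)
  # Without a block, return an Enumerator that replays this same call.
  return enum_for(:each_chunk_sketch, io, chunk_size) unless block_given?

  rows = CSV.new(io, headers: true, header_converters: :symbol).map(&:to_h)
  rows.each_slice(chunk_size).with_index do |chunk, index|
    yield chunk, index
  end
end

csv    = StringIO.new("id,name\n1,Alice\n2,Bob\n3,Carol\n")
enum   = each_chunk_sketch(csv, 2)   # no block given, so this is an Enumerator
chunks = enum.to_a                   # each element is a [chunk, index] pair
```

In this sketch every element produced by the Enumerator is a `[chunk, index]` pair, which is worth keeping in mind when chaining further enumerator methods onto it.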
131
+
132
+ ## Example: Sidekiq parallel import
133
+
134
+ ```ruby
135
+ reader = SmarterCSV::Reader.new('users.csv', chunk_size: 100)
136
+ reader.each_chunk do |chunk, index|
137
+ ImportWorker.perform_async(chunk)
138
+ end
139
+ ```
140
+
141
+ ## Example: Resque parallel import
142
+
143
+ ```ruby
144
+ reader = SmarterCSV::Reader.new('orders.csv', chunk_size: 200)
145
+ reader.each_chunk do |chunk, index|
146
+ Resque.enqueue(OrderImportJob, chunk)
147
+ end
148
+ ```
149
+
150
+ ## Example: ActiveRecord `insert_all` bulk insert
151
+
152
+ ```ruby
153
+ reader = SmarterCSV::Reader.new('products.csv', chunk_size: 500)
154
+ reader.each_chunk do |chunk, _index|
155
+ MyModel.insert_all(chunk)
156
+ end
157
+ ```
158
+
159
+ ## Example: Progress tracking
160
+
161
+ ```ruby
162
+ reader = SmarterCSV::Reader.new('big.csv', chunk_size: 1_000)
163
+ total = File.foreach('big.csv').count - 1 # subtract header row
164
+
165
+ reader.each_chunk do |chunk, index|
166
+ processed = [(index + 1) * 1_000, total].min
167
+ puts "#{processed}/#{total} rows processed"
168
+ MyModel.insert_all(chunk)
169
+ end
170
+ ```
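The progress arithmetic can be checked in isolation. This sketch assumes zero-based chunk indexes and a fixed chunk size; the row count of 2,500 is made up for illustration.

```ruby
# Assumptions: chunk indexes start at 0 and every chunk except the last is full.
chunk_size  = 1_000
total       = 2_500
chunk_count = (total.to_f / chunk_size).ceil # 3 chunks: 1000, 1000, 500 rows

progress = (0...chunk_count).map do |index|
  # Clamp with min so the final partial chunk does not overshoot the total.
  [(index + 1) * chunk_size, total].min
end
# progress is [1000, 2000, 2500]
```

Without the `min` clamp the last line of output would report 3000 of 2500 rows.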
171
+
172
+ ## Interaction with `on_bad_row`
173
+
174
+ `each_chunk` respects all `on_bad_row` options. Bad rows are excluded from chunks and counted or routed to your handler:
175
+
176
+ ```ruby
177
+ reader = SmarterCSV::Reader.new('data.csv',
178
+ chunk_size: 500,
179
+ on_bad_row: :collect,
180
+ )
181
+ reader.each_chunk do |chunk, index|
182
+ MyModel.insert_all(chunk)
183
+ end
184
+ puts "Bad rows: #{reader.errors[:bad_row_count]}"
185
+ reader.errors[:bad_rows].each { |rec| puts "Line #{rec[:csv_line_number]}: #{rec[:error_message]}" }
186
+ ```
187
+
188
+ See [Bad Row Quarantine](./bad_row_quarantine.md) for full details.
189
+
190
+ ## Example: DynamoDB batch write
191
+
192
+ DynamoDB's `batch_write_item` API accepts up to **25 items per request** — making
193
+ `chunk_size: 25` the natural fit. SmarterCSV symbol keys map directly to DynamoDB
194
+ attribute names after a simple `transform_keys(&:to_s)` call.
195
+
196
+ ```ruby
197
+ require 'aws-sdk-dynamodb'
198
+
199
+ client = Aws::DynamoDB::Client.new(region: 'us-east-1')
200
+
201
+ SmarterCSV::Reader.new('products.csv', chunk_size: 25).each_chunk do |chunk, _index|
202
+ client.batch_write_item(
203
+ request_items: {
204
+ 'ProductsTable' => chunk.map do |row|
205
+ { put_request: { item: row.transform_keys(&:to_s) } }
206
+ end
207
+ }
208
+ )
209
+ end
210
+ ```
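To see the request shape without calling AWS, the payload construction can be sketched with plain Ruby. The 60 sample rows and the use of `each_slice` in place of a chunked reader are assumptions made for illustration; no SDK call is made here.

```ruby
# Sample rows standing in for parsed CSV hashes (illustrative data only).
rows = (1..60).map { |i| { id: i, name: "item-#{i}" } }

# Build one request payload per group of at most 25 rows.
requests = rows.each_slice(25).map do |chunk|
  {
    request_items: {
      'ProductsTable' => chunk.map do |row|
        # Symbol keys become string attribute names.
        { put_request: { item: row.transform_keys(&:to_s) } }
      end
    }
  }
end

batch_sizes = requests.map { |r| r[:request_items]['ProductsTable'].size }
# batch_sizes is [25, 25, 10]
```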
211
+
212
+ ## Example: Reading a CSV from S3
213
+
214
+ SmarterCSV accepts any IO-like object, so you can stream directly from S3 without
215
+ writing a temp file:
216
+
217
+ ```ruby
218
+ require 'aws-sdk-s3'
219
+
220
+ s3 = Aws::S3::Client.new(region: 'us-east-1')
221
+ obj = s3.get_object(bucket: 'my-bucket', key: 'imports/products.csv')
222
+
223
+ data = SmarterCSV.process(obj.body)
224
+ MyModel.insert_all(data)
225
+ ```
226
+
227
+ For large files, combine with chunked processing:
228
+
229
+ ```ruby
230
+ obj = s3.get_object(bucket: 'my-bucket', key: 'imports/big.csv')
231
+
232
+ SmarterCSV::Reader.new(obj.body, chunk_size: 500).each_chunk do |chunk, _index|
233
+ MyModel.insert_all(chunk)
234
+ end
235
+ ```
236
+
81
237
  ----------------
82
- PREVIOUS: [The Basic Write API](./basic_write_api.md) | NEXT: [Configuration Options](./options.md)
238
+
239
+ PREVIOUS: [The Basic Write API](./basic_write_api.md) | NEXT: [Configuration Options](./options.md) | UP: [README](../README.md)
@@ -0,0 +1,183 @@
1
+
2
+ ### Contents
3
+
4
+ * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Parsing Strategy](./parsing_strategy.md)
7
+ * [The Basic Read API](./basic_read_api.md)
8
+ * [The Basic Write API](./basic_write_api.md)
9
+ * [Batch Processing](./batch_processing.md)
10
+ * [Configuration Options](./options.md)
11
+ * [Row and Column Separators](./row_col_sep.md)
12
+ * [Header Transformations](./header_transformations.md)
13
+ * [Header Validations](./header_validations.md)
14
+ * [**Column Selection**](./column_selection.md)
15
+ * [Data Transformations](./data_transformations.md)
16
+ * [Value Converters](./value_converters.md)
17
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
18
+ * [Instrumentation Hooks](./instrumentation.md)
19
+ * [Examples](./examples.md)
20
+ * [Real-World CSV Files](./real_world_csv.md)
21
+ * [SmarterCSV over the Years](./history.md)
22
+ * [Release Notes](./releases/1.16.0/changes.md)
23
+
24
+ --------------
25
+
26
+ # Column Selection
27
+
28
+ Wide CSV files often contain dozens or hundreds of columns, but a given application typically
29
+ only needs a handful of them. The `headers: { only: }` and `headers: { except: }` options let
30
+ you declare upfront which columns you want, so SmarterCSV skips allocation and hash insertion
31
+ for everything else — both in the Ruby path and in the C-accelerated hot path.
32
+
33
+ ## Options
34
+
35
+ | Option | Default | Description |
36
+ |--------|---------|-------------|
37
+ | `headers: { only: }` | `nil` | Keep only the listed columns in each result hash |
38
+ | `headers: { except: }` | `nil` | Remove the listed columns from each result hash |
39
+
40
+ You cannot use both options at the same time — doing so raises `SmarterCSV::ValidationError`.
41
+
42
+ ## Basic usage
43
+
44
+ ```ruby
45
+ # Keep only two columns out of a wide file
46
+ data = SmarterCSV.process('big.csv', headers: { only: [:id, :email] })
47
+ # => [{id: 1, email: "alice@example.com"}, ...]
48
+
49
+ # Keep everything except one noisy column
50
+ data = SmarterCSV.process('big.csv', headers: { except: [:internal_notes] })
51
+ ```
52
+
53
+ ## Input flexibility
54
+
55
+ Both options accept an Array of symbols or strings, or a single symbol or string — anything
56
+ that makes sense as a column name. All values are normalized to symbols internally.
57
+
58
+ ```ruby
59
+ headers: { only: :id } # single symbol — same as [:id]
60
+ headers: { only: 'id' } # single string — normalized to :id
61
+ headers: { only: [:id, :email] } # array of symbols
62
+ headers: { only: ['id', 'email'] } # array of strings — normalized to symbols
63
+ ```
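The normalization described above can be approximated in one line of Ruby. `normalize_columns` is a hypothetical helper name used for illustration; the gem's internal code may differ.

```ruby
# Hypothetical helper: accept a single name or a list, return an array of symbols.
def normalize_columns(value)
  Array(value).map(&:to_sym)
end

single_symbol = normalize_columns(:id)             # [:id]
single_string = normalize_columns('id')            # [:id]
string_array  = normalize_columns(['id', 'email']) # [:id, :email]
```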
64
+
65
+ ## Names refer to post-mapping keys
66
+
67
+ `headers: { only: }` and `headers: { except: }` use the **post-mapping** column name — the
68
+ symbol that actually appears in the result hash after `key_mapping:` has been applied. You
69
+ never need to know the original CSV header spelling.
70
+
71
+ ```ruby
72
+ # CSV has header "First Name"; key_mapping renames it to :given_name
73
+ data = SmarterCSV.process('contacts.csv',
74
+ key_mapping: { first_name: :given_name },
75
+ headers: { only: [:given_name] }, # post-mapping name
76
+ )
77
+ # => [{given_name: "Alice"}, ...]
78
+ ```
79
+
80
+ ## Interaction with `with_line_numbers:`
81
+
82
+ `:csv_line_number` is added to each hash **after** column selection runs, so it is always
83
+ present when `with_line_numbers: true` — even if it is not listed in `headers: { only: }`.
84
+
85
+ ```ruby
86
+ data = SmarterCSV.process('data.csv',
87
+ headers: { only: [:name] },
88
+ with_line_numbers: true,
89
+ )
90
+ data.each { |row| puts "#{row[:csv_line_number]}: #{row[:name]}" }
91
+ ```
92
+
93
+ ## Interaction with `strict:`
94
+
95
+ `strict: true` raises `SmarterCSV::HeaderSizeMismatch` when a data row contains more fields
96
+ than the header row. This check runs **before** column selection, so schema validation still
97
+ catches malformed rows even when `headers: { only: }` is active.
98
+
99
+ ```ruby
100
+ # Raises HeaderSizeMismatch on the row with extra fields, regardless of headers: { only: }
101
+ SmarterCSV.process('data.csv', headers: { only: [:name] }, strict: true)
102
+ ```
103
+
104
+ ## Extra columns without `strict:`
105
+
106
+ When `strict:` is false (the default) and a data row has more fields than the header,
107
+ the extra columns are silently dropped — they cannot be in the `headers: { only: }` set, so
108
+ the filter discards them naturally.
109
+
110
+ > **Important:** `missing_headers: :auto` (auto-generating names like `column_7`,
111
+ > `column_8` for extra data columns) does **not** work in combination with `headers: { only: }`.
112
+ > `headers: { only: }` is a **performance improvement** that causes the parser to stop scanning
113
+ > a row once all requested columns have been found — any extra columns beyond the header
114
+ > count are never visited, so no auto-names are generated for them. If you need to capture
115
+ > auto-named overflow columns, do not use `headers: { only: }` at the same time.
116
+
117
+ ## Unknown column names are silently ignored
118
+
119
+ Listing a column name that doesn't exist in the file is not an error. The column simply
120
+ never appears in any row hash.
121
+
122
+ ```ruby
123
+ # :nonexistent_column is not in the file — no error, just absent from results
124
+ data = SmarterCSV.process('data.csv', headers: { only: [:id, :nonexistent_column] })
125
+ ```
126
+
127
+ ## Performance
128
+
129
+ Both options are implemented in the C extension (when acceleration is enabled). Excluded
130
+ columns are skipped entirely inside the C parsing loop — no Ruby string is allocated, no
131
+ numeric conversion runs, and no `rb_hash_aset` call is made for fields the caller doesn't
132
+ need. This makes column selection a genuine performance option for wide CSV files, not just
133
+ a post-processing filter.
134
+
135
+ The Ruby fallback path applies the same filter via `hash.select!` / `hash.reject!` after
136
+ parsing, giving correct results on all platforms.
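As a rough illustration of the post-parse filter, here is how `select!` and `reject!` behave on an already-built row hash. The sample row and the use of `Set` for lookups are assumptions for this sketch.

```ruby
require 'set'

row = { id: 1, email: "alice@example.com", internal_notes: "n/a", age: 30 }

# Equivalent of an "only" filter: keep just the listed keys.
only     = Set[:id, :email]
only_row = row.dup
only_row.select! { |key, _value| only.include?(key) }

# Equivalent of an "except" filter: drop the listed keys.
except     = Set[:internal_notes]
except_row = row.dup
except_row.reject! { |key, _value| except.include?(key) }
```

Both calls mutate the hash in place, so every field has already been parsed and allocated by the time the filter runs.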
137
+
138
+ ### `headers: { only: }` vs `headers: { except: }` — performance asymmetry
139
+
140
+ **`headers: { only: }` enables early exit.** Once every requested column has been parsed,
141
+ the parser stops scanning the current row entirely — the remaining fields are never visited.
142
+ For a 500-column file where you only need 5 columns near the start, this can be
143
+ **10–14× faster** than parsing all columns.
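The early-exit idea can be sketched with a deliberately naive parser. `parse_only` is a hypothetical function that splits on commas with no quoting support; it is not the gem's C or Ruby parser and only shows why fields after the last requested column need not be visited.

```ruby
# Naive illustration only: no quote handling, comma separator assumed.
def parse_only(line, headers, wanted)
  result    = {}
  visited   = 0
  remaining = wanted.count { |name| headers.include?(name) }
  pos       = 0

  headers.each do |name|
    break if remaining.zero? # early exit: every requested column was found

    stop = line.index(',', pos) || line.length
    visited += 1
    if wanted.include?(name)
      result[name] = line[pos...stop]
      remaining -= 1
    end
    pos = stop + 1
  end

  [result, visited]
end

headers      = %i[id email c3 c4 c5 c6]
row, visited = parse_only("1,alice@example.com,x,y,z,w", headers, %i[id email])
```

Here only two of the six fields are scanned. An "except" style filter cannot stop early in the same way, because any later field might still need to be kept.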
144
+
145
+ **`headers: { except: }` cannot have early exit.** To know which columns to *keep*, the
146
+ parser must scan every field in the row to the end. Skipping just a few columns out of many
147
+ saves very little work, so benchmark results for `headers: { except: }` are typically flat
148
+ or slightly slower (0.7×–1.0× vs full parse).
149
+
150
+ **Rule of thumb:**
151
+ - Use `headers: { only: }` when you want a small subset of a wide file — this is the fast path.
152
+ - Use `headers: { except: }` only when you want *almost everything* and excluding a known
153
+ noisy column is more convenient than listing all the ones you want.
154
+ - Avoid `headers: { except: }` as a performance tool on wide files — it provides no speed benefit.
155
+
156
+ ### `headers: { only: }` vs `remove_unmapped_keys:`
157
+
158
+ If you are already using `key_mapping:` to rename headers, the `remove_unmapped_keys: true`
159
+ option lets you implicitly drop everything not in the map — without listing each unwanted
160
+ column explicitly. This is a convenient alternative to `headers: { only: }` when renaming
161
+ and selecting go hand in hand:
162
+
163
+ ```ruby
164
+ # With key_mapping + remove_unmapped_keys: convenient when renaming
165
+ SmarterCSV.process('data.csv',
166
+ key_mapping: { col_a: :name, col_b: :email },
167
+ remove_unmapped_keys: true,
168
+ )
169
+
170
+ # With headers: { only: }: better for pure selection — C-path early exit applies
171
+ SmarterCSV.process('data.csv',
172
+ headers: { only: [:col_a, :col_b] },
173
+ )
174
+ ```
175
+
176
+ `headers: { only: }` is the faster choice for wide files since unneeded fields are skipped
177
+ inside the C parser before any Ruby objects are created. `remove_unmapped_keys:` is a
178
+ post-parse filter — all fields are parsed first, then the unwanted keys are deleted.
179
+ See [Header Transformations](./header_transformations.md#key-mapping) for more details.
180
+
181
+ ---
182
+
183
+ PREVIOUS: [Header Validations](./header_validations.md) | NEXT: [Data Transformations](./data_transformations.md) | UP: [README](../README.md)
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
5
6
  * [Parsing Strategy](./parsing_strategy.md)
6
7
  * [The Basic Read API](./basic_read_api.md)
7
8
  * [The Basic Write API](./basic_write_api.md)
@@ -10,52 +11,184 @@
10
11
  * [Row and Column Separators](./row_col_sep.md)
11
12
  * [Header Transformations](./header_transformations.md)
12
13
  * [Header Validations](./header_validations.md)
14
+ * [Column Selection](./column_selection.md)
13
15
  * [**Data Transformations**](./data_transformations.md)
14
16
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
17
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
18
+ * [Instrumentation Hooks](./instrumentation.md)
19
+ * [Examples](./examples.md)
20
+ * [Real-World CSV Files](./real_world_csv.md)
21
+ * [SmarterCSV over the Years](./history.md)
22
+ * [Release Notes](./releases/1.16.0/changes.md)
23
+
24
+ --------------
17
25
 
18
26
  # Data Transformations
19
27
 
20
- SmarterCSV automatically transforms the values in each colum in order to normalize the data.
21
- This behavior can be customized or disabled.
28
+ SmarterCSV automatically normalizes the values in each row. All transformations are configurable — most are enabled by default because they're the right behavior for the vast majority of CSV files.
29
+
30
+ ## Transformation Pipeline
31
+
32
+ Transformations run in this order for every row:
33
+
34
+ | Step | Option | Default | What it does |
35
+ |------|--------|---------|--------------|
36
+ | 1 | `strip_whitespace` | `true` | Strips leading/trailing whitespace from all values (and headers) at parse time |
37
+ | 2 | `nil_values_matching` | `nil` | Sets values matching the regexp to `nil` |
38
+ | 3 | `remove_empty_values` | `true` | Removes keys whose value is `nil` or blank |
39
+ | 4 | `remove_zero_values` | `false` | Removes keys whose value is numeric zero |
40
+ | 5 | `convert_values_to_numeric` | `true` | Converts numeric-looking strings to `Integer` or `Float` |
41
+ | 6 | `value_converters` | `nil` | Applies per-key custom converter lambdas or classes |
42
+ | 7 | `remove_empty_hashes` | `true` | Drops rows that are entirely empty after all transformations |
43
+
44
+ > Steps 2–6 run per field in order. `value_converters` receive the value **after** numeric conversion — guard against receiving `Integer`/`Float` if your converter expects a string.
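The per-field ordering can be approximated in plain Ruby to make the note about converters concrete. `transform_value` is a hypothetical function that hard-codes the default behavior for empty values and uses simplified numeric patterns; it is a sketch of the ordering in the table, not the gem's code.

```ruby
# Returns :removed when the key would be dropped from the row hash.
def transform_value(raw, nil_matching: nil, remove_zero: false, converter: nil)
  value = raw.strip                                               # step 1
  value = nil if nil_matching && value.match?(nil_matching)       # step 2
  return :removed if value.nil? || value.empty?                   # step 3
  return :removed if remove_zero && value.match?(/\A0+(\.0+)?\z/) # step 4

  value =                                                         # step 5
    case value
    when /\A[+-]?\d+\z/      then Integer(value, 10)
    when /\A[+-]?\d+\.\d+\z/ then Float(value)
    else value
    end

  value = converter.call(value) if converter                      # step 6
  value
end

# The converter sees an Integer here, not the original string "7".
doubled = transform_value("7", converter: ->(v) { v.is_a?(Numeric) ? v * 2 : v })
```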
45
+
46
+ ---
22
47
 
23
- ## Remove Empty Values
24
- `remove_empty_values` is enabled by default
25
- It removes any values which are `nil` or would be empty strings.
48
+ ## `strip_whitespace`
26
49
 
27
- ## Convert Values to Numeric
28
- `convert_values_to_numeric` is enabled by default.
29
- SmarterCSV will convert strings containing Integers or Floats to the appropriate class.
50
+ **Default: `true`**
30
51
 
31
- Here is an example of using `convert_values_to_numeric` for numbers with leading zeros, e.g. ZIP codes:
52
+ Strips leading and trailing whitespace from all header names and all field values at parse time, before any other transformation runs.
32
53
 
54
+ ```ruby
55
+ # CSV with padded values:
56
+ # name, score
57
+ # Alice , 42
58
+ # Bob , 0
59
+
60
+ data = SmarterCSV.process(file)
61
+ # => [{name: "Alice", score: 42}, {name: "Bob", score: 0}]
62
+ # ↑ "Alice " stripped to "Alice", " 42" stripped to "42" then converted
63
+
64
+ data = SmarterCSV.process(file, strip_whitespace: false)
65
+ # => [{"name"=>"Alice ", " score"=>" 42"}, ...]
66
+ # ↑ whitespace preserved in both headers and values
33
67
  ```
34
- data = SmarterCSV.process('/tmp/zip.csv', convert_values_to_numeric: { except: [:zip] })
35
- => [{:zip=>"00480"}, {:zip=>"51903"}, {:zip=>"12354"}, {:zip=>"02343"}]
36
- ```
37
68
 
38
- This will return the column `:zip` as a string with all digits intact.
69
+ ---
70
+
71
+ ## `nil_values_matching`
72
+
73
+ **Default: `nil` (disabled)**
74
+
75
+ Set values matching the given regular expression to `nil`. Combined with the default `remove_empty_values: true`, matching values are removed from the result hash. With `remove_empty_values: false`, the key is retained with a `nil` value — useful when you need to distinguish "field was absent" from "field had a sentinel value".
76
+
77
+ ```ruby
78
+ # Treat common null sentinels as nil and remove them
79
+ data = SmarterCSV.process(file, nil_values_matching: /\A(NULL|N\/A|NA|#N\/A|\(null\))\z/i)
39
80
 
40
- ## Remove Zero Values
41
- `remove_zero_values` is disabled by default.
42
- When enabled, it removes key/value pairs which have a numeric value equal to zero.
81
+ # Nil-ify but retain the key (don't remove)
82
+ data = SmarterCSV.process(file,
83
+ nil_values_matching: /\A(NULL|N\/A)\z/i,
84
+ remove_empty_values: false)
85
+ # => [{name: "Alice", score: nil}] ← key retained with nil value
86
+
87
+ # Remove Excel error values
88
+ data = SmarterCSV.process(file, nil_values_matching: /\A(#VALUE!|#REF!|#DIV\/0!|NaN)\z/)
89
+ ```
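The anchors in the sentinel pattern matter: `\A` and `\z` ensure that only whole-field matches are affected. A quick standalone check of the first pattern above, with sample values invented for illustration:

```ruby
NULL_SENTINELS = /\A(NULL|N\/A|NA|#N\/A|\(null\))\z/i

values  = ["NULL", "n/a", "NA", "#N/A", "(null)", "NATO", "null value", ""]
matched = values.select { |v| v.match?(NULL_SENTINELS) }
# "NATO" and "null value" are left alone because the whole field must match.
```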
90
+
91
+ > **Deprecated:** `remove_values_matching:` still works but emits a deprecation warning.
92
+ > Use `nil_values_matching:` instead.
93
+
94
+ ---
95
+
96
+ ## `remove_empty_values`
97
+
98
+ **Default: `true`**
99
+
100
+ Removes key/value pairs where the value is `nil` or an empty string after `strip_whitespace` and `nil_values_matching` have run. This is why SmarterCSV result hashes only contain keys with actual values — sparse CSV rows don't produce hashes cluttered with `nil` entries.
101
+
102
+ ```ruby
103
+ # CSV: name,score,notes
104
+ # Alice,42,
105
+ # Bob,,great player
106
+
107
+ data = SmarterCSV.process(file)
108
+ # => [{name: "Alice", score: 42}, {name: "Bob", notes: "great player"}]
109
+ # ↑ empty :notes and :score keys are dropped automatically
110
+
111
+ data = SmarterCSV.process(file, remove_empty_values: false)
112
+ # => [{name: "Alice", score: 42, notes: nil}, {name: "Bob", score: nil, notes: "great player"}]
113
+ ```
43
114
 
44
- ## Remove Values Matching
45
- `remove_values_matching` is disabled by default.
46
- When enabled, this can help removing key/value pairs from result hashes which would cause problems.
115
+ ---
47
116
 
48
- e.g.
49
- * `remove_values_matching: /^\$0\.0+$/` would remove $0.00
50
- * `remove_values_matching: /^#VALUE!$/` would remove errors from Excel spreadsheets
117
+ ## `remove_zero_values`
51
118
 
52
- ## Empty Hashes
119
+ **Default: `false`**
53
120
 
54
- It can happen that after all transformations, a row of the CSV file would produce a completely empty hash.
121
+ When enabled, removes key/value pairs where the value is numeric zero (`0`, `0.0`, `"0"`, `"0.0"`). Useful when zero and absent mean the same thing in your domain.
55
122
 
56
- By default SmarterCSV uses `remove_empty_hashes: true` to remove these empty hashes from the result.
123
+ ```ruby
124
+ # CSV: product,quantity,discount
125
+ # Widget,10,0
126
+ # Gadget,0,5
57
127
 
58
- This can be set to `false`, to keep these empty hashes in the results.
128
+ data = SmarterCSV.process(file, remove_zero_values: true)
129
+ # => [{product: "Widget", quantity: 10}, {product: "Gadget", discount: 5}]
130
+ # ↑ :discount=>0 and :quantity=>0 removed
131
+ ```
132
+
133
+ ---
134
+
135
+ ## `convert_values_to_numeric`
136
+
137
+ **Default: `true`**
138
+
139
+ Converts string values that look like integers or floats to the appropriate numeric type. This is one of the most common sources of silent data loss if not configured carefully — fields like ZIP codes, phone numbers, and account numbers with leading zeros will be silently corrupted if not excluded.
140
+
141
+ ```ruby
142
+ data = SmarterCSV.process(file)
143
+ # "42" => 42 (Integer)
144
+ # "3.14" => 3.14 (Float)
145
+ # "01234" => 1234 ← leading zero lost! exclude this column
146
+
147
+ # Exclude specific columns from numeric conversion
148
+ data = SmarterCSV.process(file,
149
+ convert_values_to_numeric: { except: [:zip, :phone, :account_number] })
150
+ # => [{zip: "01234", phone: "800-555-0100", amount: 99.99}]
151
+
152
+ # Only convert specific columns (all others stay as strings)
153
+ data = SmarterCSV.process(file,
154
+ convert_values_to_numeric: { only: [:quantity, :price] })
155
+ ```
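The leading-zero problem is easy to reproduce with plain Ruby. `to_numeric` is a hypothetical, simplified converter and the `except` list mimics the option above; neither is the gem's actual implementation.

```ruby
# Simplified numeric detection (an assumption; real rules may be broader).
def to_numeric(value)
  case value
  when /\A[+-]?\d+\z/      then Integer(value, 10)
  when /\A[+-]?\d+\.\d+\z/ then Float(value)
  else value
  end
end

row    = { zip: "01234", phone: "800-555-0100", amount: "99.99" }
except = [:zip, :phone]

# Converting everything silently turns the ZIP code into 1234.
converted_all = row.transform_values { |v| to_numeric(v) }

# Skipping the excluded keys keeps the original strings intact.
converted_safe = row.to_h { |k, v| [k, except.include?(k) ? v : to_numeric(v)] }
```

Once `"01234"` has become `1234`, the original string cannot be recovered from the result, so columns with leading zeros are best excluded up front.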
156
+
157
+ ---
158
+
159
+ ## `remove_empty_hashes`
160
+
161
+ **Default: `true`**
162
+
163
+ After all per-field transformations, removes rows that have no remaining key/value pairs. This handles blank lines and rows where every field was empty or matched `nil_values_matching`.
164
+
165
+ ```ruby
166
+ # CSV with a blank line between records:
167
+ # name,score
168
+ # Alice,42
169
+ #
170
+ # Bob,99
171
+
172
+ data = SmarterCSV.process(file)
173
+ # => [{name: "Alice", score: 42}, {name: "Bob", score: 99}]
174
+ # ↑ blank line silently dropped
175
+
176
+ data = SmarterCSV.process(file, remove_empty_hashes: false)
177
+ # => [{name: "Alice", score: 42}, {}, {name: "Bob", score: 99}]
178
+ ```
179
+
180
+ ---
181
+
182
+ ## Custom Transformations — `value_converters`
183
+
184
+ For type conversions beyond numeric (dates, booleans, currency, etc.), use `value_converters`. They run last in the pipeline, after numeric conversion. See [Value Converters](./value_converters.md) for full documentation.
185
+
186
+ ```ruby
187
+ data = SmarterCSV.process(file, value_converters: {
188
+ date: ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil },
189
+ active: ->(v) { v&.match?(/\Atrue\z/i) },
190
+ })
191
+ ```
59
192
 
60
193
  -------------------
61
- PREVIOUS: [Header Validations](./header_validations.md) | NEXT: [Value Converters](./value_converters.md)
194
+ PREVIOUS: [Column Selection](./column_selection.md) | NEXT: [Value Converters](./value_converters.md) | UP: [README](../README.md)