smarter_csv 1.15.2 → 1.16.0
- checksums.yaml +4 -4
- data/.rubocop.yml +9 -0
- data/CHANGELOG.md +68 -1
- data/CONTRIBUTORS.md +3 -1
- data/Gemfile +1 -0
- data/README.md +123 -27
- data/docs/_introduction.md +40 -24
- data/docs/bad_row_quarantine.md +285 -0
- data/docs/basic_read_api.md +151 -9
- data/docs/basic_write_api.md +474 -59
- data/docs/batch_processing.md +161 -4
- data/docs/column_selection.md +183 -0
- data/docs/data_transformations.md +162 -29
- data/docs/examples.md +339 -46
- data/docs/header_transformations.md +93 -12
- data/docs/header_validations.md +56 -18
- data/docs/history.md +117 -0
- data/docs/instrumentation.md +165 -0
- data/docs/migrating_from_csv.md +290 -0
- data/docs/options.md +150 -87
- data/docs/parsing_strategy.md +63 -1
- data/docs/real_world_csv.md +262 -0
- data/docs/releases/1.16.0/benchmarks.md +223 -0
- data/docs/releases/1.16.0/changes.md +272 -0
- data/docs/releases/1.16.0/performance_notes.md +114 -0
- data/docs/row_col_sep.md +14 -5
- data/docs/value_converters.md +193 -57
- data/ext/smarter_csv/extconf.rb +3 -0
- data/ext/smarter_csv/smarter_csv.c +1007 -71
- data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
- data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
- data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
- data/lib/smarter_csv/errors.rb +8 -0
- data/lib/smarter_csv/file_io.rb +1 -1
- data/lib/smarter_csv/hash_transformations.rb +14 -13
- data/lib/smarter_csv/header_transformations.rb +21 -2
- data/lib/smarter_csv/headers.rb +2 -1
- data/lib/smarter_csv/options.rb +124 -7
- data/lib/smarter_csv/parser.rb +362 -75
- data/lib/smarter_csv/reader.rb +494 -46
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv/writer.rb +71 -19
- data/lib/smarter_csv.rb +95 -12
- data/smarter_csv.gemspec +20 -10
- metadata +37 -80
data/docs/batch_processing.md
CHANGED
|
@@ -2,6 +2,7 @@
|
|
|
2
2
|
### Contents
|
|
3
3
|
|
|
4
4
|
* [Introduction](./_introduction.md)
|
|
5
|
+
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
5
6
|
* [Parsing Strategy](./parsing_strategy.md)
|
|
6
7
|
* [The Basic Read API](./basic_read_api.md)
|
|
7
8
|
* [The Basic Write API](./basic_write_api.md)
|
|
@@ -10,10 +11,17 @@
|
|
|
10
11
|
* [Row and Column Separators](./row_col_sep.md)
|
|
11
12
|
* [Header Transformations](./header_transformations.md)
|
|
12
13
|
* [Header Validations](./header_validations.md)
|
|
14
|
+
* [Column Selection](./column_selection.md)
|
|
13
15
|
* [Data Transformations](./data_transformations.md)
|
|
14
16
|
* [Value Converters](./value_converters.md)
|
|
15
|
-
|
|
16
|
-
|
|
17
|
+
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
18
|
+
* [Instrumentation Hooks](./instrumentation.md)
|
|
19
|
+
* [Examples](./examples.md)
|
|
20
|
+
* [Real-World CSV Files](./real_world_csv.md)
|
|
21
|
+
* [SmarterCSV over the Years](./history.md)
|
|
22
|
+
* [Release Notes](./releases/1.16.0/changes.md)
|
|
23
|
+
|
|
24
|
+
--------------
|
|
17
25
|
|
|
18
26
|
# Batch Processing
|
|
19
27
|
|
|
@@ -64,7 +72,7 @@ The `process` method returns the number of chunks when called with a block.
|
|
|
64
72
|
=> 2
|
|
65
73
|
```
|
|
66
74
|
|
|
67
|
-
## Example 3:
|
|
75
|
+
## Example 3: ActiveRecord Bulk Insert in Chunks of 100 records with SmarterCSV:
|
|
68
76
|
```ruby
|
|
69
77
|
# using chunks:
|
|
70
78
|
filename = '/tmp/some.csv'
|
|
@@ -78,5 +86,154 @@ The `process` method returns the number of chunks when called with a block.
|
|
|
78
86
|
=> returns number of chunks we processed
|
|
79
87
|
```
|
|
80
88
|
|
|
89
|
+
---
|
|
90
|
+
|
|
91
|
+
# Modern Batch API — `each_chunk`
|
|
92
|
+
|
|
93
|
+
`Reader#each_chunk` is the modern API for chunked batch processing. It yields `(Array<Hash>, chunk_index)` — the same shape as the `process` block — but returns an `Enumerator` when called without a block, enabling more flexible composition.
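The "block or Enumerator" contract described above is a common Ruby idiom. The sketch below illustrates it with a hypothetical `ChunkedRows` class over an in-memory array; it is not SmarterCSV's implementation, only an assumption about the general shape of such a method.

```ruby
# Hypothetical stand-in for a chunking reader; not SmarterCSV's implementation.
class ChunkedRows
  def initialize(rows, chunk_size:)
    @rows = rows
    @chunk_size = chunk_size
  end

  # Yields (chunk, chunk_index); returns an Enumerator when no block is given.
  def each_chunk
    return enum_for(:each_chunk) unless block_given?

    @rows.each_slice(@chunk_size).with_index do |chunk, index|
      yield chunk, index
    end
  end
end

rows = (1..5).map { |i| { id: i } }
source = ChunkedRows.new(rows, chunk_size: 2)

# Block form: collect [index, size] for every chunk.
sizes = []
source.each_chunk { |chunk, index| sizes << [index, chunk.size] }

# Enumerator form: composes with Enumerable methods such as select.
full_chunks = source.each_chunk.select { |chunk, _index| chunk.size == 2 }
```

Because the Enumerator yields the same `(chunk, index)` pair, each element of `full_chunks` is a two-element array holding the chunk and its index.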
|
|
94
|
+
|
|
95
|
+
## Configuration
|
|
96
|
+
|
|
97
|
+
Set `chunk_size` in options when constructing the Reader. `each_chunk` reads this value automatically:
|
|
98
|
+
|
|
99
|
+
```ruby
|
|
100
|
+
reader = SmarterCSV::Reader.new('big.csv', chunk_size: 500)
|
|
101
|
+
reader.each_chunk do |chunk, index|
|
|
102
|
+
puts "Processing chunk #{index} (#{chunk.size} rows)"
|
|
103
|
+
MyModel.insert_all(chunk)
|
|
104
|
+
end
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
If `chunk_size` is not set, `each_chunk` defaults to `SmarterCSV::Reader::DEFAULT_CHUNK_SIZE` (100) and emits a warning to STDERR:
|
|
108
|
+
|
|
109
|
+
```
|
|
110
|
+
SmarterCSV: chunk_size not set, defaulting to 100. Set chunk_size explicitly to suppress this warning.
|
|
111
|
+
```
|
|
112
|
+
|
|
113
|
+
Set `chunk_size` explicitly to suppress the warning and choose the right batch size for your use case.
|
|
114
|
+
|
|
115
|
+
## Simplified form
|
|
116
|
+
|
|
117
|
+
```ruby
|
|
118
|
+
SmarterCSV.each_chunk('big.csv', chunk_size: 500) do |chunk, index|
|
|
119
|
+
MyModel.insert_all(chunk)
|
|
120
|
+
end
|
|
121
|
+
```
|
|
122
|
+
|
|
123
|
+
## Returns an Enumerator when called without a block
|
|
124
|
+
|
|
125
|
+
```ruby
|
|
126
|
+
reader = SmarterCSV::Reader.new('big.csv', chunk_size: 500)
|
|
127
|
+
reader.each_chunk.with_index do |chunk, index|
|
|
128
|
+
puts "Chunk #{index}: #{chunk.size} rows"
|
|
129
|
+
end
|
|
130
|
+
```
|
|
131
|
+
|
|
132
|
+
## Example: Sidekiq parallel import
|
|
133
|
+
|
|
134
|
+
```ruby
|
|
135
|
+
reader = SmarterCSV::Reader.new('users.csv', chunk_size: 100)
|
|
136
|
+
reader.each_chunk do |chunk, index|
|
|
137
|
+
ImportWorker.perform_async(chunk)
|
|
138
|
+
end
|
|
139
|
+
```
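Job queues such as Sidekiq generally serialize job arguments as JSON, so symbol keys do not survive the trip to the worker. The sketch below assumes a plain JSON round trip between enqueue and perform, and the sample rows are made up. It shows why converting keys to strings before enqueueing keeps both sides consistent.

```ruby
require 'json'

# Sample chunk in the shape each_chunk yields (symbol keys).
chunk = [
  { id: 1, email: 'alice@example.com' },
  { id: 2, email: 'bob@example.com' }
]

# After a JSON round trip, symbol keys come back as strings.
round_tripped = JSON.parse(JSON.generate(chunk))

# Converting keys up front makes the enqueued payload match what the worker sees.
payload = chunk.map { |row| row.transform_keys(&:to_s) }
```

With this in place, the worker can read `row['email']` without guessing which key type it will receive.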
|
|
140
|
+
|
|
141
|
+
## Example: Resque parallel import
|
|
142
|
+
|
|
143
|
+
```ruby
|
|
144
|
+
reader = SmarterCSV::Reader.new('orders.csv', chunk_size: 200)
|
|
145
|
+
reader.each_chunk do |chunk, index|
|
|
146
|
+
Resque.enqueue(OrderImportJob, chunk)
|
|
147
|
+
end
|
|
148
|
+
```
|
|
149
|
+
|
|
150
|
+
## Example: ActiveRecord `insert_all` bulk insert
|
|
151
|
+
|
|
152
|
+
```ruby
|
|
153
|
+
reader = SmarterCSV::Reader.new('products.csv', chunk_size: 500)
|
|
154
|
+
reader.each_chunk do |chunk, _index|
|
|
155
|
+
MyModel.insert_all(chunk)
|
|
156
|
+
end
|
|
157
|
+
```
|
|
158
|
+
|
|
159
|
+
## Example: Progress tracking
|
|
160
|
+
|
|
161
|
+
```ruby
|
|
162
|
+
reader = SmarterCSV::Reader.new('big.csv', chunk_size: 1_000)
|
|
163
|
+
total = File.foreach('big.csv').count - 1 # subtract header row
|
|
164
|
+
|
|
165
|
+
reader.each_chunk do |chunk, index|
|
|
166
|
+
processed = [(index + 1) * 1_000, total].min
|
|
167
|
+
puts "#{processed}/#{total} rows processed"
|
|
168
|
+
MyModel.insert_all(chunk)
|
|
169
|
+
end
|
|
170
|
+
```
|
|
171
|
+
|
|
172
|
+
## Interaction with `on_bad_row`
|
|
173
|
+
|
|
174
|
+
`each_chunk` respects all `on_bad_row` options. Bad rows are excluded from chunks and counted or routed to your handler:
|
|
175
|
+
|
|
176
|
+
```ruby
|
|
177
|
+
reader = SmarterCSV::Reader.new('data.csv',
|
|
178
|
+
chunk_size: 500,
|
|
179
|
+
on_bad_row: :collect,
|
|
180
|
+
)
|
|
181
|
+
reader.each_chunk do |chunk, index|
|
|
182
|
+
MyModel.insert_all(chunk)
|
|
183
|
+
end
|
|
184
|
+
puts "Bad rows: #{reader.errors[:bad_row_count]}"
|
|
185
|
+
reader.errors[:bad_rows].each { |rec| puts "Line #{rec[:csv_line_number]}: #{rec[:error_message]}" }
|
|
186
|
+
```
|
|
187
|
+
|
|
188
|
+
See [Bad Row Quarantine](./bad_row_quarantine.md) for full details.
|
|
189
|
+
|
|
190
|
+
## Example: DynamoDB batch write
|
|
191
|
+
|
|
192
|
+
DynamoDB's `batch_write_item` API accepts up to **25 items per request** — making
|
|
193
|
+
`chunk_size: 25` the natural fit. SmarterCSV symbol keys map directly to DynamoDB
|
|
194
|
+
attribute names after a simple `transform_keys(&:to_s)` call.
|
|
195
|
+
|
|
196
|
+
```ruby
|
|
197
|
+
require 'aws-sdk-dynamodb'
|
|
198
|
+
|
|
199
|
+
client = Aws::DynamoDB::Client.new(region: 'us-east-1')
|
|
200
|
+
|
|
201
|
+
SmarterCSV::Reader.new('products.csv', chunk_size: 25).each_chunk do |chunk, _index|
|
|
202
|
+
client.batch_write_item(
|
|
203
|
+
request_items: {
|
|
204
|
+
'ProductsTable' => chunk.map do |row|
|
|
205
|
+
{ put_request: { item: row.transform_keys(&:to_s) } }
|
|
206
|
+
end
|
|
207
|
+
}
|
|
208
|
+
)
|
|
209
|
+
end
|
|
210
|
+
```
|
|
211
|
+
|
|
212
|
+
## Example: Reading a CSV from S3
|
|
213
|
+
|
|
214
|
+
SmarterCSV accepts any IO-like object, so you can stream directly from S3 without
|
|
215
|
+
writing a temp file:
|
|
216
|
+
|
|
217
|
+
```ruby
|
|
218
|
+
require 'aws-sdk-s3'
|
|
219
|
+
|
|
220
|
+
s3 = Aws::S3::Client.new(region: 'us-east-1')
|
|
221
|
+
obj = s3.get_object(bucket: 'my-bucket', key: 'imports/products.csv')
|
|
222
|
+
|
|
223
|
+
data = SmarterCSV.process(obj.body)
|
|
224
|
+
MyModel.insert_all(data)
|
|
225
|
+
```
|
|
226
|
+
|
|
227
|
+
For large files, combine with chunked processing:
|
|
228
|
+
|
|
229
|
+
```ruby
|
|
230
|
+
obj = s3.get_object(bucket: 'my-bucket', key: 'imports/big.csv')
|
|
231
|
+
|
|
232
|
+
SmarterCSV::Reader.new(obj.body, chunk_size: 500).each_chunk do |chunk, _index|
|
|
233
|
+
MyModel.insert_all(chunk)
|
|
234
|
+
end
|
|
235
|
+
```
|
|
236
|
+
|
|
81
237
|
----------------
|
|
82
|
-
|
|
238
|
+
|
|
239
|
+
PREVIOUS: [The Basic Write API](./basic_write_api.md) | NEXT: [Configuration Options](./options.md) | UP: [README](../README.md)
|
|
data/docs/column_selection.md
ADDED
|
@@ -0,0 +1,183 @@
|
|
|
1
|
+
|
|
2
|
+
### Contents
|
|
3
|
+
|
|
4
|
+
* [Introduction](./_introduction.md)
|
|
5
|
+
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
6
|
+
* [Parsing Strategy](./parsing_strategy.md)
|
|
7
|
+
* [The Basic Read API](./basic_read_api.md)
|
|
8
|
+
* [The Basic Write API](./basic_write_api.md)
|
|
9
|
+
* [Batch Processing](././batch_processing.md)
|
|
10
|
+
* [Configuration Options](./options.md)
|
|
11
|
+
* [Row and Column Separators](./row_col_sep.md)
|
|
12
|
+
* [Header Transformations](./header_transformations.md)
|
|
13
|
+
* [Header Validations](./header_validations.md)
|
|
14
|
+
* [**Column Selection**](./column_selection.md)
|
|
15
|
+
* [Data Transformations](./data_transformations.md)
|
|
16
|
+
* [Value Converters](./value_converters.md)
|
|
17
|
+
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
18
|
+
* [Instrumentation Hooks](./instrumentation.md)
|
|
19
|
+
* [Examples](./examples.md)
|
|
20
|
+
* [Real-World CSV Files](./real_world_csv.md)
|
|
21
|
+
* [SmarterCSV over the Years](./history.md)
|
|
22
|
+
* [Release Notes](./releases/1.16.0/changes.md)
|
|
23
|
+
|
|
24
|
+
--------------
|
|
25
|
+
|
|
26
|
+
# Column Selection
|
|
27
|
+
|
|
28
|
+
Wide CSV files often contain dozens or hundreds of columns, but a given application typically
|
|
29
|
+
only needs a handful of them. The `headers: { only: }` and `headers: { except: }` options let
|
|
30
|
+
you declare upfront which columns you want, so SmarterCSV skips allocation and hash insertion
|
|
31
|
+
for everything else — both in the Ruby path and in the C-accelerated hot path.
|
|
32
|
+
|
|
33
|
+
## Options
|
|
34
|
+
|
|
35
|
+
| Option | Default | Description |
|
|
36
|
+
|--------|---------|-------------|
|
|
37
|
+
| `headers: { only: }` | `nil` | Keep only the listed columns in each result hash |
|
|
38
|
+
| `headers: { except: }` | `nil` | Remove the listed columns from each result hash |
|
|
39
|
+
|
|
40
|
+
You cannot use both options at the same time — doing so raises `SmarterCSV::ValidationError`.
|
|
41
|
+
|
|
42
|
+
## Basic usage
|
|
43
|
+
|
|
44
|
+
```ruby
|
|
45
|
+
# Keep only two columns out of a wide file
|
|
46
|
+
data = SmarterCSV.process('big.csv', headers: { only: [:id, :email] })
|
|
47
|
+
# => [{id: 1, email: "alice@example.com"}, ...]
|
|
48
|
+
|
|
49
|
+
# Keep everything except one noisy column
|
|
50
|
+
data = SmarterCSV.process('big.csv', headers: { except: [:internal_notes] })
|
|
51
|
+
```
|
|
52
|
+
|
|
53
|
+
## Input flexibility
|
|
54
|
+
|
|
55
|
+
Both options accept an Array of symbols or strings, or a single symbol or string — anything
|
|
56
|
+
that makes sense as a column name. All values are normalized to symbols internally.
|
|
57
|
+
|
|
58
|
+
```ruby
|
|
59
|
+
headers: { only: :id } # single symbol — same as [:id]
|
|
60
|
+
headers: { only: 'id' } # single string — normalized to :id
|
|
61
|
+
headers: { only: [:id, :email] } # array of symbols
|
|
62
|
+
headers: { only: ['id', 'email'] } # array of strings — normalized to symbols
|
|
63
|
+
```
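The normalization described above can be pictured as follows. This is a minimal sketch using a hypothetical helper named `normalize_column_list`; it is not part of SmarterCSV's API and only illustrates how the accepted input forms collapse to an array of symbols.

```ruby
# Hypothetical helper, not part of SmarterCSV: accepts a single symbol or
# string, or an array of either, and returns an array of symbols.
def normalize_column_list(value)
  Array(value).map { |name| name.to_s.to_sym }
end

single_symbol = normalize_column_list(:id)
single_string = normalize_column_list('id')
mixed_array   = normalize_column_list([:id, 'email'])
```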
|
|
64
|
+
|
|
65
|
+
## Names refer to post-mapping keys
|
|
66
|
+
|
|
67
|
+
`headers: { only: }` and `headers: { except: }` use the **post-mapping** column name — the
|
|
68
|
+
symbol that actually appears in the result hash after `key_mapping:` has been applied. You
|
|
69
|
+
never need to know the original CSV header spelling.
|
|
70
|
+
|
|
71
|
+
```ruby
|
|
72
|
+
# CSV has header "First Name"; key_mapping renames it to :given_name
|
|
73
|
+
data = SmarterCSV.process('contacts.csv',
|
|
74
|
+
key_mapping: { first_name: :given_name },
|
|
75
|
+
headers: { only: [:given_name] }, # post-mapping name
|
|
76
|
+
)
|
|
77
|
+
# => [{given_name: "Alice"}, ...]
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
## Interaction with `with_line_numbers:`
|
|
81
|
+
|
|
82
|
+
`:csv_line_number` is added to each hash **after** column selection runs, so it is always
|
|
83
|
+
present when `with_line_numbers: true` — even if it is not listed in `headers: { only: }`.
|
|
84
|
+
|
|
85
|
+
```ruby
|
|
86
|
+
data = SmarterCSV.process('data.csv',
|
|
87
|
+
headers: { only: [:name] },
|
|
88
|
+
with_line_numbers: true,
|
|
89
|
+
)
|
|
90
|
+
data.each { |row| puts "#{row[:csv_line_number]}: #{row[:name]}" }
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
## Interaction with `strict:`
|
|
94
|
+
|
|
95
|
+
`strict: true` raises `SmarterCSV::HeaderSizeMismatch` when a data row contains more fields
|
|
96
|
+
than the header row. This check runs **before** column selection, so schema validation still
|
|
97
|
+
catches malformed rows even when `headers: { only: }` is active.
|
|
98
|
+
|
|
99
|
+
```ruby
|
|
100
|
+
# Raises HeaderSizeMismatch on the row with extra fields, regardless of headers: { only: }
|
|
101
|
+
SmarterCSV.process('data.csv', headers: { only: [:name] }, strict: true)
|
|
102
|
+
```
|
|
103
|
+
|
|
104
|
+
## Extra columns without `strict:`
|
|
105
|
+
|
|
106
|
+
When `strict:` is false (the default) and a data row has more fields than the header,
|
|
107
|
+
the extra columns are silently dropped — they cannot be in the `headers: { only: }` set, so
|
|
108
|
+
the filter discards them naturally.
|
|
109
|
+
|
|
110
|
+
> **Important:** `missing_headers: :auto` (auto-generating names like `column_7`,
|
|
111
|
+
> `column_8` for extra data columns) does **not** work in combination with `headers: { only: }`.
|
|
112
|
+
> `headers: { only: }` is a **performance improvement** that causes the parser to stop scanning
|
|
113
|
+
> a row once all requested columns have been found — any extra columns beyond the header
|
|
114
|
+
> count are never visited, so no auto-names are generated for them. If you need to capture
|
|
115
|
+
> auto-named overflow columns, do not use `headers: { only: }` at the same time.
|
|
116
|
+
|
|
117
|
+
## Unknown column names are silently ignored
|
|
118
|
+
|
|
119
|
+
Listing a column name that doesn't exist in the file is not an error. The column simply
|
|
120
|
+
never appears in any row hash.
|
|
121
|
+
|
|
122
|
+
```ruby
|
|
123
|
+
# :nonexistent_column is not in the file — no error, just absent from results
|
|
124
|
+
data = SmarterCSV.process('data.csv', headers: { only: [:id, :nonexistent_column] })
|
|
125
|
+
```
|
|
126
|
+
|
|
127
|
+
## Performance
|
|
128
|
+
|
|
129
|
+
Both options are implemented in the C extension (when acceleration is enabled). Excluded
|
|
130
|
+
columns are skipped entirely inside the C parsing loop — no Ruby string is allocated, no
|
|
131
|
+
numeric conversion runs, and no `rb_hash_aset` call is made for fields the caller doesn't
|
|
132
|
+
need. This makes column selection a genuine performance option for wide CSV files, not just
|
|
133
|
+
a post-processing filter.
|
|
134
|
+
|
|
135
|
+
The Ruby fallback path applies the same filter via `hash.select!` / `hash.reject!` after
|
|
136
|
+
parsing, giving correct results on all platforms.
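As a rough picture of that post-parse filter, the sketch below applies `select!` and `reject!` to one already-parsed row. The row contents and the use of `Set` for lookups are assumptions made for illustration, not details taken from the library.

```ruby
require 'set'

row = { id: 1, email: 'alice@example.com', internal_notes: 'vip' }

# only: keep just the listed keys.
only = Set[:id, :email]
kept = row.dup
kept.select! { |key, _value| only.include?(key) }

# except: drop the listed keys, keep everything else.
except = Set[:internal_notes]
trimmed = row.dup
trimmed.reject! { |key, _value| except.include?(key) }
```

Both calls mutate the hash in place, which is why the sketch works on copies of `row`.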
|
|
137
|
+
|
|
138
|
+
### `headers: { only: }` vs `headers: { except: }` — performance asymmetry
|
|
139
|
+
|
|
140
|
+
**`headers: { only: }` enables early exit.** Once every requested column has been parsed,
|
|
141
|
+
the parser stops scanning the current row entirely — the remaining fields are never visited.
|
|
142
|
+
For a 500-column file where you only need 5 columns near the start, this can be
|
|
143
|
+
**10–14× faster** than parsing all columns.
|
|
144
|
+
|
|
145
|
+
**`headers: { except: }` cannot have early exit.** To know which columns to *keep*, the
|
|
146
|
+
parser must scan every field in the row to the end. Skipping just a few columns out of many
|
|
147
|
+
saves very little work, so benchmark results for `headers: { except: }` are typically flat
|
|
148
|
+
(0.7×–1.0× vs full parse).
|
|
149
|
+
|
|
150
|
+
**Rule of thumb:**
|
|
151
|
+
- Use `headers: { only: }` when you want a small subset of a wide file — this is the fast path.
|
|
152
|
+
- Use `headers: { except: }` only when you want *almost everything* and excluding a known
|
|
153
|
+
noisy column is more convenient than listing all the ones you want.
|
|
154
|
+
- Avoid `headers: { except: }` as a performance tool on wide files — it provides no speed benefit.
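The asymmetry can be illustrated with a toy scanner that counts how many fields it visits. This is a simplified sketch in plain Ruby: it splits on commas only, ignores quoting, and the method names `pick_only` and `drop_except` are hypothetical. It does not reflect the C implementation.

```ruby
# only: stop scanning as soon as every wanted column has been found.
def pick_only(line, headers, wanted)
  result = {}
  visited = 0
  pos = 0
  while pos <= line.length && result.size < wanted.size
    stop = line.index(',', pos) || line.length
    key = headers[visited]
    result[key] = line[pos...stop] if wanted.include?(key)
    visited += 1
    pos = stop + 1
  end
  [result, visited]
end

# except: every field must be visited to decide whether to keep it.
def drop_except(line, headers, unwanted)
  result = {}
  visited = 0
  pos = 0
  while pos <= line.length
    stop = line.index(',', pos) || line.length
    key = headers[visited]
    result[key] = line[pos...stop] unless unwanted.include?(key)
    visited += 1
    pos = stop + 1
  end
  [result, visited]
end

headers = %i[id email c3 c4 c5 c6]
line = '1,alice@example.com,x,y,z,w'

only_result, only_visited = pick_only(line, headers, %i[id email])
except_result, except_visited = drop_except(line, headers, %i[c6])
```

Here `pick_only` touches two fields and stops, while `drop_except` walks all six even though it discards only one.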
|
|
155
|
+
|
|
156
|
+
### `headers: { only: }` vs `remove_unmapped_keys:`
|
|
157
|
+
|
|
158
|
+
If you are already using `key_mapping:` to rename headers, the `remove_unmapped_keys: true`
|
|
159
|
+
option lets you implicitly drop everything not in the map — without listing each unwanted
|
|
160
|
+
column explicitly. This is a convenient alternative to `headers: { only: }` when renaming
|
|
161
|
+
and selecting go hand in hand:
|
|
162
|
+
|
|
163
|
+
```ruby
|
|
164
|
+
# With key_mapping + remove_unmapped_keys: convenient when renaming
|
|
165
|
+
SmarterCSV.process('data.csv',
|
|
166
|
+
key_mapping: { col_a: :name, col_b: :email },
|
|
167
|
+
remove_unmapped_keys: true,
|
|
168
|
+
)
|
|
169
|
+
|
|
170
|
+
# With headers: { only: }: better for pure selection — C-path early exit applies
|
|
171
|
+
SmarterCSV.process('data.csv',
|
|
172
|
+
headers: { only: [:col_a, :col_b] },
|
|
173
|
+
)
|
|
174
|
+
```
|
|
175
|
+
|
|
176
|
+
`headers: { only: }` is the faster choice for wide files since unneeded fields are skipped
|
|
177
|
+
inside the C parser before any Ruby objects are created. `remove_unmapped_keys:` is a
|
|
178
|
+
post-parse filter — all fields are parsed first, then the unwanted keys are deleted.
|
|
179
|
+
See [Header Transformations](./header_transformations.md#key-mapping) for more details.
|
|
180
|
+
|
|
181
|
+
---
|
|
182
|
+
|
|
183
|
+
PREVIOUS: [Header Validations](./header_validations.md) | NEXT: [Data Transformations](./data_transformations.md) | UP: [README](../README.md)
|
|
data/docs/data_transformations.md
CHANGED
|
@@ -2,6 +2,7 @@
|
|
|
2
2
|
### Contents
|
|
3
3
|
|
|
4
4
|
* [Introduction](./_introduction.md)
|
|
5
|
+
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
5
6
|
* [Parsing Strategy](./parsing_strategy.md)
|
|
6
7
|
* [The Basic Read API](./basic_read_api.md)
|
|
7
8
|
* [The Basic Write API](./basic_write_api.md)
|
|
@@ -10,52 +11,184 @@
|
|
|
10
11
|
* [Row and Column Separators](./row_col_sep.md)
|
|
11
12
|
* [Header Transformations](./header_transformations.md)
|
|
12
13
|
* [Header Validations](./header_validations.md)
|
|
14
|
+
* [Column Selection](./column_selection.md)
|
|
13
15
|
* [**Data Transformations**](./data_transformations.md)
|
|
14
16
|
* [Value Converters](./value_converters.md)
|
|
15
|
-
|
|
16
|
-
|
|
17
|
+
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
18
|
+
* [Instrumentation Hooks](./instrumentation.md)
|
|
19
|
+
* [Examples](./examples.md)
|
|
20
|
+
* [Real-World CSV Files](./real_world_csv.md)
|
|
21
|
+
* [SmarterCSV over the Years](./history.md)
|
|
22
|
+
* [Release Notes](./releases/1.16.0/changes.md)
|
|
23
|
+
|
|
24
|
+
--------------
|
|
17
25
|
|
|
18
26
|
# Data Transformations
|
|
19
27
|
|
|
20
|
-
SmarterCSV automatically
|
|
21
|
-
|
|
28
|
+
SmarterCSV automatically normalizes the values in each row. All transformations are configurable — most are enabled by default because they're the right behavior for the vast majority of CSV files.
|
|
29
|
+
|
|
30
|
+
## Transformation Pipeline
|
|
31
|
+
|
|
32
|
+
Transformations run in this order for every row:
|
|
33
|
+
|
|
34
|
+
| Step | Option | Default | What it does |
|
|
35
|
+
|------|--------|---------|--------------|
|
|
36
|
+
| 1 | `strip_whitespace` | `true` | Strips leading/trailing whitespace from all values (and headers) at parse time |
|
|
37
|
+
| 2 | `nil_values_matching` | `nil` | Sets values matching the regexp to `nil` |
|
|
38
|
+
| 3 | `remove_empty_values` | `true` | Removes keys whose value is `nil` or blank |
|
|
39
|
+
| 4 | `remove_zero_values` | `false` | Removes keys whose value is numeric zero |
|
|
40
|
+
| 5 | `convert_values_to_numeric` | `true` | Converts numeric-looking strings to `Integer` or `Float` |
|
|
41
|
+
| 6 | `value_converters` | `nil` | Applies per-key custom converter lambdas or classes |
|
|
42
|
+
| 7 | `remove_empty_hashes` | `true` | Drops rows that are entirely empty after all transformations |
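
To make the ordering concrete, here is a minimal plain-Ruby sketch of such a pipeline. The method `transform_row`, its regular expressions, and the sample rows are all assumptions for illustration; only the option names and the step order come from the table, and the code is not SmarterCSV's implementation.

```ruby
# Hypothetical per-row pipeline that mirrors the step order in the table.
def transform_row(row, nil_values_matching: nil, remove_zero_values: false, value_converters: {})
  result = {}
  row.each do |key, raw|
    value = raw&.strip                                          # 1. strip_whitespace
    value = nil if value && nil_values_matching&.match?(value)  # 2. nil_values_matching
    next if value.nil? || value.empty?                          # 3. remove_empty_values
    next if remove_zero_values && value.match?(/\A0+(\.0+)?\z/) # 4. remove_zero_values
    value = Integer(value, 10) if value.match?(/\A[+-]?\d+\z/)  # 5. convert_values_to_numeric
    value = Float(value) if value.is_a?(String) && value.match?(/\A[+-]?\d+\.\d+\z/)
    converter = value_converters[key]
    value = converter.call(value) if converter                  # 6. value_converters
    result[key] = value
  end
  result
end

rows = [
  { name: ' Alice ', score: ' 42', notes: 'N/A' },
  { name: '', score: '', notes: '' },
  { name: 'Bob', score: '0', notes: 'ok' }
]

# The converter receives an Integer because numeric conversion already ran.
converters = { score: ->(v) { v * 2 } }

result = rows.map do |row|
  transform_row(row,
                nil_values_matching: /\A(NULL|N\/A)\z/i,
                remove_zero_values: true,
                value_converters: converters)
end.reject(&:empty?) # 7. remove_empty_hashes
```

The second sample row becomes an empty hash and is dropped in the final step, and Bob's zero score is removed before numeric conversion ever sees it.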
|
|
43
|
+
|
|
44
|
+
> Steps 2–6 run per field in order. `value_converters` receive the value **after** numeric conversion — guard against receiving `Integer`/`Float` if your converter expects a string.
|
|
45
|
+
|
|
46
|
+
---
|
|
22
47
|
|
|
23
|
-
##
|
|
24
|
-
`remove_empty_values` is enabled by default
|
|
25
|
-
It removes any values which are `nil` or would be empty strings.
|
|
48
|
+
## `strip_whitespace`
|
|
26
49
|
|
|
27
|
-
|
|
28
|
-
`convert_values_to_numeric` is enabled by default.
|
|
29
|
-
SmarterCSV will convert strings containing Integers or Floats to the appropriate class.
|
|
50
|
+
**Default: `true`**
|
|
30
51
|
|
|
31
|
-
|
|
52
|
+
Strips leading and trailing whitespace from all header names and all field values at parse time, before any other transformation runs.
|
|
32
53
|
|
|
54
|
+
```ruby
|
|
55
|
+
# CSV with padded values:
|
|
56
|
+
# name, score
|
|
57
|
+
# Alice , 42
|
|
58
|
+
# Bob , 0
|
|
59
|
+
|
|
60
|
+
data = SmarterCSV.process(file)
|
|
61
|
+
# => [{name: "Alice", score: 42}, {name: "Bob", score: 0}]
|
|
62
|
+
# ↑ "Alice " stripped to "Alice", " 42" stripped to "42" then converted
|
|
63
|
+
|
|
64
|
+
data = SmarterCSV.process(file, strip_whitespace: false)
|
|
65
|
+
# => [{"name"=>"Alice ", " score"=>" 42"}, ...]
|
|
66
|
+
# ↑ whitespace preserved in both headers and values
|
|
33
67
|
```
|
|
34
|
-
data = SmarterCSV.process('/tmp/zip.csv', convert_values_to_numeric: { except: [:zip] })
|
|
35
|
-
=> [{:zip=>"00480"}, {:zip=>"51903"}, {:zip=>"12354"}, {:zip=>"02343"}]
|
|
36
|
-
```
|
|
37
68
|
|
|
38
|
-
|
|
69
|
+
---
|
|
70
|
+
|
|
71
|
+
## `nil_values_matching`
|
|
72
|
+
|
|
73
|
+
**Default: `nil` (disabled)**
|
|
74
|
+
|
|
75
|
+
Set values matching the given regular expression to `nil`. Combined with the default `remove_empty_values: true`, matching values are removed from the result hash. With `remove_empty_values: false`, the key is retained with a `nil` value — useful when you need to distinguish "field was absent" from "field had a sentinel value".
|
|
76
|
+
|
|
77
|
+
```ruby
|
|
78
|
+
# Treat common null sentinels as nil and remove them
|
|
79
|
+
data = SmarterCSV.process(file, nil_values_matching: /\A(NULL|N\/A|NA|#N\/A|\(null\))\z/i)
|
|
39
80
|
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
81
|
+
# Nil-ify but retain the key (don't remove)
|
|
82
|
+
data = SmarterCSV.process(file,
|
|
83
|
+
nil_values_matching: /\A(NULL|N\/A)\z/i,
|
|
84
|
+
remove_empty_values: false)
|
|
85
|
+
# => [{name: "Alice", score: nil}] ← key retained with nil value
|
|
86
|
+
|
|
87
|
+
# Remove Excel error values
|
|
88
|
+
data = SmarterCSV.process(file, nil_values_matching: /\A(#VALUE!|#REF!|#DIV\/0!|NaN)\z/)
|
|
89
|
+
```
|
|
90
|
+
|
|
91
|
+
> **Deprecated:** `remove_values_matching:` still works but emits a deprecation warning.
|
|
92
|
+
> Use `nil_values_matching:` instead.
|
|
93
|
+
|
|
94
|
+
---
|
|
95
|
+
|
|
96
|
+
## `remove_empty_values`
|
|
97
|
+
|
|
98
|
+
**Default: `true`**
|
|
99
|
+
|
|
100
|
+
Removes key/value pairs where the value is `nil` or an empty string after `strip_whitespace` and `nil_values_matching` have run. This is why SmarterCSV result hashes only contain keys with actual values — sparse CSV rows don't produce hashes cluttered with `nil` entries.
|
|
101
|
+
|
|
102
|
+
```ruby
|
|
103
|
+
# CSV: name,score,notes
|
|
104
|
+
# Alice,42,
|
|
105
|
+
# Bob,,great player
|
|
106
|
+
|
|
107
|
+
data = SmarterCSV.process(file)
|
|
108
|
+
# => [{name: "Alice", score: 42}, {name: "Bob", notes: "great player"}]
|
|
109
|
+
# ↑ empty :notes and :score keys are dropped automatically
|
|
110
|
+
|
|
111
|
+
data = SmarterCSV.process(file, remove_empty_values: false)
|
|
112
|
+
# => [{name: "Alice", score: 42, notes: nil}, {name: "Bob", score: nil, notes: "great player"}]
|
|
113
|
+
```
|
|
43
114
|
|
|
44
|
-
|
|
45
|
-
`remove_values_matching` is disabled by default.
|
|
46
|
-
When enabled, this can help removing key/value pairs from result hashes which would cause problems.
|
|
115
|
+
---
|
|
47
116
|
|
|
48
|
-
|
|
49
|
-
* `remove_values_matching: /^\$0\.0+$/` would remove $0.00
|
|
50
|
-
* `remove_values_matching: /^#VALUE!$/` would remove errors from Excel spreadsheets
|
|
117
|
+
## `remove_zero_values`
|
|
51
118
|
|
|
52
|
-
|
|
119
|
+
**Default: `false`**
|
|
53
120
|
|
|
54
|
-
|
|
121
|
+
When enabled, removes key/value pairs where the value is numeric zero (`0`, `0.0`, `"0"`, `"0.0"`). Useful when zero and absent mean the same thing in your domain.
|
|
55
122
|
|
|
56
|
-
|
|
123
|
+
```ruby
|
|
124
|
+
# CSV: product,quantity,discount
|
|
125
|
+
# Widget,10,0
|
|
126
|
+
# Gadget,0,5
|
|
57
127
|
|
|
58
|
-
|
|
128
|
+
data = SmarterCSV.process(file, remove_zero_values: true)
|
|
129
|
+
# => [{product: "Widget", quantity: 10}, {product: "Gadget", discount: 5}]
|
|
130
|
+
# ↑ :discount=>0 and :quantity=>0 removed
|
|
131
|
+
```
|
|
132
|
+
|
|
133
|
+
---
|
|
134
|
+
|
|
135
|
+
## `convert_values_to_numeric`
|
|
136
|
+
|
|
137
|
+
**Default: `true`**
|
|
138
|
+
|
|
139
|
+
Converts string values that look like integers or floats to the appropriate numeric type. Left unconfigured, this is a common source of silent data loss: ZIP codes, phone numbers, and account numbers lose their leading zeros unless you exclude those columns.
|
|
140
|
+
|
|
141
|
+
```ruby
|
|
142
|
+
data = SmarterCSV.process(file)
|
|
143
|
+
# "42" => 42 (Integer)
|
|
144
|
+
# "3.14" => 3.14 (Float)
|
|
145
|
+
# "01234" => 1234 ← leading zero lost! exclude this column
|
|
146
|
+
|
|
147
|
+
# Exclude specific columns from numeric conversion
|
|
148
|
+
data = SmarterCSV.process(file,
|
|
149
|
+
convert_values_to_numeric: { except: [:zip, :phone, :account_number] })
|
|
150
|
+
# => [{zip: "01234", phone: "800-555-0100", amount: 99.99}]
|
|
151
|
+
|
|
152
|
+
# Only convert specific columns (all others stay as strings)
|
|
153
|
+
data = SmarterCSV.process(file,
|
|
154
|
+
convert_values_to_numeric: { only: [:quantity, :price] })
|
|
155
|
+
```
|
|
156
|
+
|
|
157
|
+
---
|
|
158
|
+
|
|
159
|
+
## `remove_empty_hashes`
|
|
160
|
+
|
|
161
|
+
**Default: `true`**
|
|
162
|
+
|
|
163
|
+
After all per-field transformations, removes rows that have no remaining key/value pairs. This handles blank lines and rows where every field was empty or matched `nil_values_matching`.
|
|
164
|
+
|
|
165
|
+
```ruby
|
|
166
|
+
# CSV with a blank line between records:
|
|
167
|
+
# name,score
|
|
168
|
+
# Alice,42
|
|
169
|
+
#
|
|
170
|
+
# Bob,99
|
|
171
|
+
|
|
172
|
+
data = SmarterCSV.process(file)
|
|
173
|
+
# => [{name: "Alice", score: 42}, {name: "Bob", score: 99}]
|
|
174
|
+
# ↑ blank line silently dropped
|
|
175
|
+
|
|
176
|
+
data = SmarterCSV.process(file, remove_empty_hashes: false)
|
|
177
|
+
# => [{name: "Alice", score: 42}, {}, {name: "Bob", score: 99}]
|
|
178
|
+
```
|
|
179
|
+
|
|
180
|
+
---
|
|
181
|
+
|
|
182
|
+
## Custom Transformations — `value_converters`
|
|
183
|
+
|
|
184
|
+
For type conversions beyond numeric (dates, booleans, currency, etc.), use `value_converters`. They run last in the pipeline, after numeric conversion. See [Value Converters](./value_converters.md) for full documentation.
|
|
185
|
+
|
|
186
|
+
```ruby
|
|
187
|
+
data = SmarterCSV.process(file, value_converters: {
|
|
188
|
+
date: ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil },
|
|
189
|
+
active: ->(v) { v&.match?(/\Atrue\z/i) },
|
|
190
|
+
})
|
|
191
|
+
```
|
|
59
192
|
|
|
60
193
|
-------------------
|
|
61
|
-
PREVIOUS: [
|
|
194
|
+
PREVIOUS: [Column Selection](./column_selection.md) | NEXT: [Value Converters](./value_converters.md) | UP: [README](../README.md)
|