smarter_csv 1.16.4 → 1.17.0
- checksums.yaml +4 -4
- data/.rubocop.yml +10 -1
- data/CHANGELOG.md +54 -0
- data/Gemfile +10 -5
- data/README.md +98 -14
- data/TO_DO.md +109 -0
- data/docs/_introduction.md +1 -0
- data/docs/bad_row_quarantine.md +2 -1
- data/docs/basic_read_api.md +6 -1
- data/docs/basic_write_api.md +30 -0
- data/docs/batch_processing.md +25 -0
- data/docs/column_selection.md +1 -0
- data/docs/data_transformations.md +1 -0
- data/docs/examples.md +126 -0
- data/docs/header_transformations.md +23 -0
- data/docs/header_validations.md +1 -0
- data/docs/history.md +1 -0
- data/docs/instrumentation.md +2 -1
- data/docs/migrating_from_csv.md +1 -0
- data/docs/options.md +20 -18
- data/docs/parsing_strategy.md +1 -0
- data/docs/real_world_csv.md +51 -1
- data/docs/releases/1.16.0/performance_notes.md +15 -15
- data/docs/releases/1.17.0/benchmarks.md +121 -0
- data/docs/releases/1.17.0/changes.md +161 -0
- data/docs/releases/1.17.0/performance_notes.md +126 -0
- data/docs/row_col_sep.md +21 -1
- data/docs/ruby_csv_pitfalls.md +1 -0
- data/docs/value_converters.md +24 -0
- data/docs/warnings.md +141 -0
- data/ext/smarter_csv/smarter_csv.c +98 -32
- data/images/SmarterCSV_1.17.0_vs_RubyCSV_3.3.5_speedup.svg +106 -0
- data/images/SmarterCSV_1.17.0_vs_previous_C-speedup.svg +181 -0
- data/images/SmarterCSV_1.17.0_vs_previous_Rb-speedup.svg +179 -0
- data/lib/smarter_csv/auto_detection.rb +215 -30
- data/lib/smarter_csv/file_io.rb +2 -2
- data/lib/smarter_csv/hash_transformations.rb +29 -13
- data/lib/smarter_csv/parser.rb +42 -33
- data/lib/smarter_csv/peekable_io.rb +453 -0
- data/lib/smarter_csv/reader.rb +119 -23
- data/lib/smarter_csv/reader_options.rb +61 -1
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +40 -12
- metadata +12 -5
- data/TO_DO_v2.md +0 -14
- data/ext/smarter_csv/Makefile +0 -270
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 702bd7049e83c0beb85f0ca11a122e6f1659eddef6afec66eaf1c37c5b30f43f
+data.tar.gz: dd1915694d041c9b631324de7408f46fc8f426f9e1c60136c35a8f1e754d4590
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: fa00d07c21cffa711a43ecb4622ad3a09b667f1c1965ad26bee864ada6a3c168076ec04550781c75c4f0acbb28fcab60001278f459cf3065cceaef6820764e30
+data.tar.gz: eab835a356e5343e20a5cc0784ffd9aafa8ab631256d412ec570a0192060b8f3e9c6f619db36e59494364f6c51cda16a9c434c2565ef5a5b6e23f4813a7eaaef
data/.rubocop.yml
CHANGED
@@ -13,6 +13,12 @@ Layout/SpaceInsideHashLiteralBraces:
 Layout/SpaceAroundOperators:
 Enabled: false
 
+Lint/ConstantDefinitionInBlock:
+Enabled: false
+
+Lint/UnderscorePrefixedVariableName:
+Enabled: false
+
 Metrics/AbcSize:
 Enabled: false
 
@@ -37,6 +43,9 @@ Metrics/ModuleLength:
 Metrics/PerceivedComplexity:
 Enabled: false
 
+Naming/MethodParameterName:
+Enabled: false
+
 Naming/PredicateName:
 Enabled: false
 
@@ -156,7 +165,7 @@ Style/SymbolArray:
 Style/SymbolProc: # old Ruby versions can't do this
 Enabled: false
 
-Style/TernaryParentheses:
+Style/TernaryParentheses: # parentheses are good!
 Enabled: false
 
 Style/TrailingCommaInArrayLiteral:
data/CHANGELOG.md
CHANGED
|
@@ -1,6 +1,60 @@
|
|
|
1
1
|
|
|
2
2
|
# SmarterCSV 1.x Change Log
|
|
3
3
|
|
|
4
|
+
## 1.17.0 (NOT RELEASED)
|
|
5
|
+
|
|
6
|
+
RSpec tests: **1,434 → 2,210** (+776 tests)
|
|
7
|
+
|
|
8
|
+
### New Features
|
|
9
|
+
|
|
10
|
+
* **Streaming IO support** — SmarterCSV now works with non-seekable IO sources such as pipes, STDIN, and Zlib streams.
|
|
11
|
+
A rewindable peek buffer transparently captures the first bytes of the stream so that `row_sep` and `col_sep` auto-detection can replay them without requiring the underlying source to support `rewind` or `seek`.
|
|
12
|
+
|
|
13
|
+
* **Structured warnings** — auto-detection and configuration warnings are now collected on the Reader as a deduped histogram:
|
|
14
|
+
|
|
15
|
+
```ruby
|
|
16
|
+
reader = SmarterCSV::Reader.new('data.csv')
|
|
17
|
+
reader.process
|
|
18
|
+
reader.warnings # => [{ type:, code:, severity:, message:, count: }, ...]
|
|
19
|
+
```
|
|
20
|
+
|
|
21
|
+
Repeated warnings of the same `(type, code)` are deduped — `count` tracks occurrences. Available codes today: `:chunk_size_default`, `:header_a_method`, `:utf8_missing_binary_mode`, `:no_clear_row_sep`, `:no_row_sep_found`.
|
|
22
|
+
|
|
23
|
+
* **Class-level `SmarterCSV.warnings`** accessor — mirrors `SmarterCSV.errors`. Per-thread, cleared at the start of each `.process` / `.parse` / `.each` / `.each_chunk` call. Safe under Puma/Sidekiq.
|
|
24
|
+
|
|
25
|
+
* **Rails.logger routing** — when `Rails.logger` is present, warnings are routed through it at the severity declared at the call site (`:debug` / `:info` / `:warn` / `:error` / `:fatal`); otherwise `Kernel#warn` is used as a fallback. Detection is cached at construct time, no per-call overhead.
|
|
26
|
+
|
|
27
|
+
### Improvements
|
|
28
|
+
|
|
29
|
+
* Improved auto-detection of `row_sep` and `col_sep` — giving more accurate results on files with comment headers.
|
|
30
|
+
|
|
31
|
+
* Larger scan window for accurate row separator detection on files with wide headers or long first lines.
|
|
32
|
+
|
|
33
|
+
* `guess_line_ending` now scans the input in chunks up to a 64KB hard cap, returning as soon as one separator has a clear majority. Near-tie chunk-boundary artifacts no longer cause spurious warnings; only true ties at the hard cap fall back to `"\n"` and emit a `:no_clear_row_sep` warning at `:error` severity (silent miss-parse risk).
|
|
34
|
+
|
|
35
|
+
### New / Changed Options
|
|
36
|
+
|
|
37
|
+
* **`buffer_size` is now a public option** — peek buffer chunk size for non-seekable inputs (pipes, gzip readers, HTTP/S3 bodies). Default `16_384`. Out-of-range values warn and clamp to the supported range rather than raising.
|
|
38
|
+
|
|
39
|
+
* **`auto_row_sep_chars` default changed to `4096`** (was `500` in 1.16.x). Sized to cover wide-header CSVs in a single read. Bump it higher if your files have very wide headers or long comment preambles.
|
|
40
|
+
|
|
41
|
+
### Bug Fixes
|
|
42
|
+
|
|
43
|
+
* **Files ending in a lone `\r`** are now correctly detected as `\r`-terminated instead of falling through to a "no clear row separator" warning.
|
|
44
|
+
|
|
45
|
+
* **`remove_empty_values` now treats Unicode whitespace as empty** — a field containing only whitespace, including characters like non-breaking space (U+00A0) or ideographic space (U+3000), is now dropped, the same way Ruby's `String#blank?` behaves. Previously only ASCII whitespace counted (and only Rails apps got the Unicode behavior, via `blank?` — an inconsistency that's now gone). Behavior is identical with or without the C extension.
|
|
46
|
+
|
|
47
|
+
* **`remove_zero_values` now also removes signed zeros** — `+0`, `-0`, `-0.0`, `+0.00`, etc. are recognized as zero and dropped, just like `0` and `0.0`. (Only applies when `remove_zero_values: true`, which is off by default.)
|
|
48
|
+
|
|
49
|
+
### Performance
|
|
50
|
+
|
|
51
|
+
Measured against 1.16.4 (Apple M4, Ruby 3.4.7):
|
|
52
|
+
|
|
53
|
+
* **C-accelerated path (the default):** quote-heavy, large-field, and wide CSVs parse meaningfully faster — roughly **7–22% faster** (city/address-style files ~10–12%; long-field and wide files the most). CSVs with very short lines and many tiny fields are up to ~3% slower — a side effect of the larger default auto-detection scan window (see `auto_row_sep_chars`); set it back to a smaller value if that matters for your workload. Net: solid wins where there's real per-row work, a small cost on the trivially-cheap cases.
|
|
54
|
+
* **Ruby fallback path (`acceleration: false`):** faster on nearly every file — typically **3–20% faster** than 1.16.4, with the biggest gains on wide and many-small-field CSVs.
|
|
55
|
+
|
|
56
|
+
Per-file breakdown: [`docs/releases/1.17.0/performance_notes.md`](docs/releases/1.17.0/performance_notes.md).
|
|
57
|
+
|
|
4
58
|
## 1.16.4 (2026-04-21) — Bug Fixes
|
|
5
59
|
|
|
6
60
|
RSpec tests: **1,434 → 1,467** (+33 tests)
|
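The "Streaming IO support" entry in the CHANGELOG diff above describes a rewindable peek buffer that lets separator auto-detection look at the first bytes of a stream and then replay them. The sketch below illustrates that general idea only. `PeekBuffer` is a hypothetical class written for this note; it is not the gem's `peekable_io.rb`, whose contents are not shown in this part of the diff.

```ruby
require 'stringio'

# Hypothetical illustration, NOT SmarterCSV's actual implementation.
# Wraps any IO that responds to #read(n) and lets a caller inspect the
# leading bytes without consuming them, so the source never needs
# #rewind or #seek.
class PeekBuffer
  def initialize(io)
    @io = io
    @buffer = String.new # bytes already pulled from @io but not yet consumed
  end

  # Return up to n leading bytes without consuming them.
  def peek(n)
    missing = n - @buffer.bytesize
    if missing > 0
      chunk = @io.read(missing) # nil at end of stream
      @buffer << chunk if chunk
    end
    @buffer.byteslice(0, n)
  end

  # Consume up to n bytes: buffered bytes are replayed first, then the source.
  def read(n)
    out = @buffer.byteslice(0, n)
    @buffer = @buffer.byteslice(out.bytesize, @buffer.bytesize)
    if out.bytesize < n
      rest = @io.read(n - out.bytesize)
      out << rest if rest
    end
    out
  end
end

stream = PeekBuffer.new(StringIO.new("a;b\r\n1;2\r\n"))
head = stream.peek(5)                      # inspect the header line for detection
row_sep = head.include?("\r\n") ? "\r\n" : "\n"
data = stream.read(8)                      # the peeked bytes are replayed here
```

A parser built on such a wrapper can run its detection pass on `peek` and then read the stream from the start as usual, which is why a pipe or gzip reader would not need to support `rewind`.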
data/Gemfile
CHANGED
@@ -5,12 +5,17 @@ source 'https://rubygems.org'
 # Specify your gem's dependencies in smarter_csv.gemspec
 gemspec
 
-
-gem "rake
+group :development do
+gem "rake"
+gem "rake-compiler"
+gem "ostruct" # silences rake's stdlib-deprecation warning during dev
+gem "rubocop"
+end
 
-
-gem
-gem "
+group :development, :test do
+gem "awesome_print"
+gem "pry" # required in spec_helper.rb; also useful in dev console
+end
 
 group :test do
 gem "rspec"
data/README.md
CHANGED
|
@@ -14,9 +14,13 @@
|
|
|
14
14
|
|
|
15
15
|
Beyond raw speed, SmarterCSV is designed to provide a significantly more convenient and developer-friendly interface than traditional CSV libraries. Instead of returning raw arrays that require substantial post-processing, SmarterCSV produces Rails-ready hashes for each row, making the data immediately usable with ActiveRecord, Sidekiq pipelines, parallel processing, and JSON-based workflows such as S3.
|
|
16
16
|
|
|
17
|
+
In a Rails app, warnings auto-route through `Rails.logger` and instrumentation hooks compose with `ActiveSupport::Notifications` — no setup required. Outside Rails, warnings fall back to `$stderr` and the same APIs work without any framework dependency.
|
|
18
|
+
|
|
17
19
|
The library includes intelligent defaults, automatic detection of column and row separators, and flexible header/value transformations. These features eliminate much of the boilerplate typically required when working with CSV data and help keep ingestion code concise and maintainable.
|
|
18
20
|
|
|
19
|
-
For large files, SmarterCSV supports both chunked processing (arrays of hashes) and streaming via Enumerable APIs, enabling efficient batch jobs and low-memory pipelines.
|
|
21
|
+
For large files, SmarterCSV supports both chunked processing (arrays of hashes) and streaming via Enumerable APIs, enabling efficient batch jobs and low-memory pipelines.
|
|
22
|
+
As of 1.17.0, SmarterCSV also accepts **non-seekable streaming inputs** — pipes, `STDIN`, `Zlib::GzipReader`, and HTTP responses — with no need to materialize the file on disk first.
|
|
23
|
+
The C acceleration further optimizes the full ingestion path — including parsing, hash construction, and conversions — so performance gains reflect real-world workloads, not just tokenizer benchmarks.
|
|
20
24
|
|
|
21
25
|
The interface is intentionally designed to robustly handle messy real-world CSV while keeping application code clean. Developers can easily map headers, skip unwanted rows, quarantine problematic data, and transform values on the fly without building custom post-processing pipelines. See [Real-World CSV Files](docs/real_world_csv.md) for a comprehensive guide to production CSV patterns.
|
|
22
26
|
|
|
@@ -33,22 +37,33 @@ SmarterCSV is designed for **real-world CSV processing**, returning fully usable
|
|
|
33
37
|
|
|
34
38
|
For a fair comparison, `CSV.table` is the closest Ruby CSV equivalent to SmarterCSV.
|
|
35
39
|
|
|
36
|
-
| Comparison (SmarterCSV 1.
|
|
40
|
+
| Comparison (SmarterCSV 1.17.0, C-accelerated) | Range |
|
|
37
41
|
|-------------------------------------------------|-------------------------|
|
|
38
|
-
| vs SmarterCSV 1.15.2 (with C acceleration) | up to 2.
|
|
39
|
-
| vs SmarterCSV 1.14.4 (with C acceleration) | 9×–
|
|
40
|
-
| vs SmarterCSV 1.14.4 (Ruby path) |
|
|
41
|
-
| vs CSV.read (arrays of arrays) | 1.7
|
|
42
|
-
| vs CSV.table (arrays of hashes) |
|
|
43
|
-
| vs ZSV (arrays of hashes, equiv. output)
|
|
42
|
+
| vs SmarterCSV 1.15.2 (with C acceleration) | up to 2.8× faster |
|
|
43
|
+
| vs SmarterCSV 1.14.4 (with C acceleration) | 9×–82× faster |
|
|
44
|
+
| vs SmarterCSV 1.14.4 (Ruby path) | 2.4×–19.8× faster |
|
|
45
|
+
| vs CSV.read (arrays of arrays) | 1.3×–7.9× faster |
|
|
46
|
+
| vs CSV.table (arrays of hashes) | 4.9×–132× faster |
|
|
47
|
+
| vs ZSV 1.3.0 (arrays of hashes, equiv. output) | 1.1×–6.6× faster † |
|
|
48
|
+
|
|
49
|
+
† SmarterCSV faster on 15 of 16 files. ZSV raw arrays (no hashes, no conversions) are 2×–14× faster — but that omits the post-processing work needed to produce usable output. ZSV row carried over from the 1.16.0 benchmark; not re-measured for 1.17.0.
|
|
50
|
+
|
|
51
|
+
_Benchmarks: 19 CSV files (20k–240k rows), Ruby 3.4.7, Apple M4._
|
|
44
52
|
|
|
45
|
-
|
|
53
|
+
> ⁉️ **Why do these numbers look a touch lower than the 1.16.0 charts?**
|
|
54
|
+
> TL;DR: because we use a different statistical method.
|
|
55
|
+
>
|
|
56
|
+
> Earlier versions of these benchmarks reported the best-of-N sample (the absolute `min` / fastest run) for each measurement. A single lucky run — empty caches lining up, no scheduler interrupts — could shave up to ~10% off and become the headline number. I think that would be misleading.
|
|
57
|
+
> Because of that, we've switched to the 10th-percentile (`p10`) of multiple runs of 40 samples, which discards roughly the four luckiest runs and reports a time much closer to what you'll actually observe in production. On noisier fixtures `p10` is ~5–10% above `min`; on quiet ones it's within 1%. The relative ordering between versions and adapters is unchanged; the absolute speedup figures are simply more honest.
|
|
46
58
|
|
|
47
|
-
|
|
59
|
+
### SmarterCSV vs Ruby CSV
|
|
60
|
+

|
|
48
61
|
|
|
49
|
-
|
|
62
|
+
### SmarterCSV C Path
|
|
63
|
+

|
|
50
64
|
|
|
51
|
-
|
|
65
|
+
### SmarterCSV Ruby Path
|
|
66
|
+

|
|
52
67
|
|
|
53
68
|
See [SmarterCSV 1.15.2: Faster Than Raw CSV Arrays](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e) and [PR #319](https://github.com/tilo/smarter_csv/pull/319) for more details.
|
|
54
69
|
|
|
@@ -61,7 +76,7 @@ It's a one-line change:
|
|
|
61
76
|
# Before
|
|
62
77
|
rows = CSV.table('data.csv').map(&:to_h)
|
|
63
78
|
|
|
64
|
-
# After — up to
|
|
79
|
+
# After — up to 132× faster, same symbol keys
|
|
65
80
|
rows = SmarterCSV.process('data.csv')
|
|
66
81
|
```
|
|
67
82
|
|
|
@@ -124,6 +139,23 @@ strip_whitespace → nil_values_matching → remove_empty_values → remove_zero
|
|
|
124
139
|
|
|
125
140
|
Each step is individually configurable. See [Data Transformations](docs/data_transformations.md) and [Value Converters](docs/value_converters.md) for details.
|
|
126
141
|
|
|
142
|
+
### Value Converters
|
|
143
|
+
|
|
144
|
+
Per-column lambdas convert raw strings into typed values — dates, currency, booleans:
|
|
145
|
+
|
|
146
|
+
```ruby
|
|
147
|
+
require 'date'
|
|
148
|
+
|
|
149
|
+
data = SmarterCSV.process('orders.csv',
|
|
150
|
+
value_converters: {
|
|
151
|
+
dob: ->(v) { v && Date.strptime(v, '%m/%d/%Y') },
|
|
152
|
+
price: ->(v) { v&.delete('$,')&.to_f },
|
|
153
|
+
active: ->(v) { v&.match?(/\Atrue\z/i) },
|
|
154
|
+
})
|
|
155
|
+
```
|
|
156
|
+
|
|
157
|
+
See [Value Converters](docs/value_converters.md).
|
|
158
|
+
|
|
127
159
|
### Batch Processing:
|
|
128
160
|
|
|
129
161
|
Processing large CSV files in chunks minimizes memory usage and enables powerful workflows:
|
|
@@ -147,6 +179,8 @@ SmarterCSV.process(filename, chunk_size: 100) do |chunk|
|
|
|
147
179
|
end
|
|
148
180
|
```
|
|
149
181
|
|
|
182
|
+
See [Batch Processing](docs/batch_processing.md) for chunk sizing, `each_chunk`, and parallel-worker patterns.
|
|
183
|
+
|
|
150
184
|
### Modern Enumerator API:
|
|
151
185
|
|
|
152
186
|
`Reader#each` is the modern, idiomatic way to process rows — `Reader` includes `Enumerable`, so all standard Ruby methods work:
|
|
@@ -166,6 +200,29 @@ first_ten = reader.lazy.select { |h| h[:active] }.first(10)
|
|
|
166
200
|
reader.each_slice(500) { |batch| MyModel.insert_all(batch) }
|
|
167
201
|
```
|
|
168
202
|
|
|
203
|
+
See [The Basic Read API](docs/basic_read_api.md) for the full `Reader` interface.
|
|
204
|
+
|
|
205
|
+
### Streaming / Non-Seekable Inputs (1.17.0+):
|
|
206
|
+
|
|
207
|
+
SmarterCSV reads directly from any IO — no need to materialize the file on disk first. Auto-detection works on streaming inputs without rewinding; the first chunk is buffered transparently.
|
|
208
|
+
|
|
209
|
+
```ruby
|
|
210
|
+
# Gzipped CSV — stream-decompressed, never written to disk
|
|
211
|
+
require 'zlib'
|
|
212
|
+
Zlib::GzipReader.open('huge.csv.gz') do |io|
|
|
213
|
+
SmarterCSV.process(io) { |row| MyModel.upsert(row.first) }
|
|
214
|
+
end
|
|
215
|
+
|
|
216
|
+
# STDIN / pipes
|
|
217
|
+
SmarterCSV.process($stdin) { |row, _| ... }
|
|
218
|
+
|
|
219
|
+
# HTTP response body
|
|
220
|
+
require 'open-uri'
|
|
221
|
+
URI.open('https://example.com/data.csv') { |io| SmarterCSV.process(io) }
|
|
222
|
+
```
|
|
223
|
+
|
|
224
|
+
See [Row and Column Separators](docs/row_col_sep.md) for how `:auto` detection works on non-seekable streams, and [Configuration Options](docs/options.md) for `buffer_size` (the peek-buffer chunk size).
|
|
225
|
+
|
|
169
226
|
### Bad Row Handling:
|
|
170
227
|
|
|
171
228
|
SmarterCSV can quarantine malformed rows instead of crashing the entire import:
|
|
@@ -182,7 +239,33 @@ end
|
|
|
182
239
|
|
|
183
240
|
See [Bad Row Quarantine](docs/bad_row_quarantine.md) for full details including `bad_row_limit` and `field_size_limit`.
|
|
184
241
|
|
|
185
|
-
|
|
242
|
+
### Header Validation:
|
|
243
|
+
|
|
244
|
+
Raise early if the file is missing required columns, before any data row is processed:
|
|
245
|
+
|
|
246
|
+
```ruby
|
|
247
|
+
begin
|
|
248
|
+
SmarterCSV.process('transactions.csv',
|
|
249
|
+
required_keys: [:account_id, :amount, :currency])
|
|
250
|
+
rescue SmarterCSV::MissingKeys => e
|
|
251
|
+
abort "CSV missing columns: #{e.keys.join(', ')}"
|
|
252
|
+
end
|
|
253
|
+
```
|
|
254
|
+
|
|
255
|
+
See [Header Validations](docs/header_validations.md).
|
|
256
|
+
|
|
257
|
+
### Writing CSV:
|
|
258
|
+
|
|
259
|
+
```ruby
|
|
260
|
+
SmarterCSV.generate('output.csv') do |csv|
|
|
261
|
+
csv << { name: 'Alice', age: 30, city: 'New York' }
|
|
262
|
+
csv << { name: 'Bob', age: 25, city: 'Chicago' }
|
|
263
|
+
end
|
|
264
|
+
```
|
|
265
|
+
|
|
266
|
+
Hashes (not arrays) make column-shift bugs impossible — adding a column never silently misaligns existing rows. See [The Basic Write API](docs/basic_write_api.md) for header renaming, value converters, and ordered output.
|
|
267
|
+
|
|
268
|
+
See [18 Examples](docs/examples.md) for more, including encoding and preamble handling, key mapping, instrumentation hooks, and resumable Rails ActiveJob imports.
|
|
186
269
|
|
|
187
270
|
## Requirements
|
|
188
271
|
|
|
@@ -223,6 +306,7 @@ Or install it yourself as:
|
|
|
223
306
|
* [Data Transformations](docs/data_transformations.md)
|
|
224
307
|
* [Value Converters](docs/value_converters.md)
|
|
225
308
|
* [Bad Row Quarantine](docs/bad_row_quarantine.md)
|
|
309
|
+
* [Warnings](docs/warnings.md)
|
|
226
310
|
* [Instrumentation Hooks](docs/instrumentation.md)
|
|
227
311
|
* [Examples](docs/examples.md)
|
|
228
312
|
* [Real-World CSV Files](docs/real_world_csv.md)
|
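The README diff above explains that the benchmarks moved from reporting the fastest sample (`min`) to the 10th percentile (`p10`) of 40 samples. The sketch below shows how the two statistics differ on one synthetic set of timings. The nearest-rank percentile method and the sample values are assumptions made for illustration; the diff does not show how the project's benchmark harness computes `p10`.

```ruby
# Nearest-rank percentile over a list of timings.
# Assumption: nearest-rank is used here for simplicity; the real
# benchmark scripts may interpolate or aggregate differently.
def percentile(samples, pct)
  sorted = samples.sort
  rank = (pct * sorted.size + 99) / 100 # integer ceiling of pct% of the sample count
  sorted[[rank, 1].max - 1]             # convert the 1-based rank to an array index
end

# 40 synthetic timings in milliseconds: 1010, 1020, ..., 1400
timings = (1..40).map { |i| 1000 + i * 10 }
timings[0] = 900 # one unusually fast "lucky" run

best = timings.min             # what a best-of-N report would show
p10 = percentile(timings, 10)  # ignores the few fastest runs

puts "min: #{best} ms, p10: #{p10} ms"
```

With 40 samples the nearest-rank `p10` is the fourth-fastest run, so a single lucky outlier no longer becomes the headline number.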
data/TO_DO.md
ADDED
|
@@ -0,0 +1,109 @@
|
|
|
1
|
+
# SmarterCSV v2.0 TO DO List
|
|
2
|
+
|
|
3
|
+
DONE:
|
|
4
|
+
[X] Don't call rewind on filehandle
|
|
5
|
+
[X] use Procs for validations and transformations [issue #118](https://github.com/tilo/smarter_csv/issues/118)
|
|
6
|
+
[X] skip file opening, allow reading from CSV string, e.g. reading from S3 file [issue #120](https://github.com/tilo/smarter_csv/issues/120). Or stream large file from S3 (linked in the issue)
|
|
7
|
+
[X] [2.0 BUG] convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
|
|
8
|
+
[X] add enumerable to speed up parallel processing [issue #66](https://github.com/tilo/smarter_csv/issues/66), [issue #32](https://github.com/tilo/smarter_csv/issues/32)
|
|
9
|
+
[X] Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
|
|
10
|
+
[X] Collect all Errors, before surfacing them. Avoid throwing an exception on the first error [issue #133](https://github.com/tilo/smarter_csv/issues/133)
|
|
11
|
+
|
|
12
|
+
|
|
13
|
+
Partially Done:
|
|
14
|
+
[ ] make @errors and @warnings work [issue #118](https://github.com/tilo/smarter_csv/issues/118)
|
|
15
|
+
|
|
16
|
+
Still TO DO:
|
|
17
|
+
[ ] Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)
|
|
18
|
+
|
|
19
|
+
Arguably by design (e.g. exclude these columns from conversion and have them returned as a string)
|
|
20
|
+
[ ] [2.0 BUG] :convert_values_to_numeric_unless_leading_zeros drops leading zeros [issue #151](https://github.com/tilo/smarter_csv/issues/151)
|
|
21
|
+
|
|
22
|
+
|
|
23
|
+
## Numeric conversion: align the Ruby fallback path with the C path (permissive)
|
|
24
|
+
|
|
25
|
+
Context: `convert_values_to_numeric` runs in two places that currently DISAGREE on edge cases:
|
|
26
|
+
- C path (`acceleration: true`, the default): `ext/smarter_csv/smarter_csv.c#try_numeric_conversion`
|
|
27
|
+
uses `strtol`/`strtod` (base 10; float branch only entered when the field contains a `.`).
|
|
28
|
+
- Ruby fallback (`acceleration: false`): `lib/smarter_csv/hash_transformations.rb` uses the
|
|
29
|
+
strict regex `NUMERIC_REGEX = /\A[+-]?\d+(?:\.\d+)?\z/` plus `to_i` / `to_f`.
|
|
30
|
+
|
|
31
|
+
Divergence (verified empirically):
|
|
32
|
+
| value | C path | Ruby fallback |
|
|
33
|
+
|-----------|------------------|-------------------|
|
|
34
|
+
| ".5" | 0.5 (Float) | ".5" (String) |
|
|
35
|
+
| "3." | 3.0 (Float) | "3." (String) |
|
|
36
|
+
| "1.5e3" | 1500.0 (Float) | "1.5e3" (String) |
|
|
37
|
+
| "1.0e10" | 10000000000.0 | "1.0e10" (String) |
|
|
38
|
+
|
|
39
|
+
Decision: the C path's permissive behavior (corner cases + scientific notation) is the intended
|
|
40
|
+
contract. Fix = make the Ruby fallback match the C path. Do NOT tighten the C path.
|
|
41
|
+
|
|
42
|
+
Ruby-side changes (in `hash_transformations.rb`):
|
|
43
|
+
1. Swap NUMERIC_REGEX for a permissive one:
|
|
44
|
+
/\A[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?\z/
|
|
45
|
+
matches 1, 1., 1.5, .5, 1e3, 1.5e3, -3.14e-2, etc.; still rejects ".", "e3", "1.2.3",
|
|
46
|
+
"1_000", "0x1F".
|
|
47
|
+
2. Add `DOT_BYTE = '.'.ord` (46) and include it in the first-byte fast-reject's allowed set
|
|
48
|
+
(the C pre-check already allows a leading `.`; without this, ".5" gets rejected on byte 0).
|
|
49
|
+
3. Int-vs-float decision: `(v.include?('.') || v.include?('e') || v.include?('E')) ? v.to_f : v.to_i`
|
|
50
|
+
(currently only checks for `.`).
|
|
51
|
+
|
|
52
|
+
Stays a string on BOTH paths (no change needed, but worth characterization tests — there are
|
|
53
|
+
currently NONE):
|
|
54
|
+
- "010" => 10 (NOT octal 8 — both paths use base-10 conversion: String#to_i / strtol(.,10).
|
|
55
|
+
A switch to Kernel#Integer() would break this. Lock it down with a test.)
|
|
56
|
+
- "0x1F", "0b101", "0o17" => string (radix prefixes not honored by base-10 conversion)
|
|
57
|
+
- "1_000" => string (underscores)
|
|
58
|
+
- "1,200.00", "1.300,00" => string (thousands sep / decimal comma — strtod stops at the
|
|
59
|
+
separator → not fully consumed; regex rejects. This is the only safe behavior; "1,200" is
|
|
60
|
+
genuinely ambiguous. Locale-specific number formats are the caller's job via value_converters.)
|
|
61
|
+
|
|
62
|
+
NOT doing: locale sniffing (read LC_NUMERIC at init and adjust the regexes). Rejected because
|
|
63
|
+
the machine locale tells you nothing about the file's number format, it breaks reproducibility
|
|
64
|
+
(same code + same file → different results on a US vs EU box), and `,` can't be both col_sep and
|
|
65
|
+
decimal separator anyway. Note `strtod` IS locale-sensitive (LC_NUMERIC) but it's dormant — Ruby
|
|
66
|
+
runs in the C/POSIX locale; don't deliberately activate it.
|
|
67
|
+
|
|
68
|
+
When done: parity tests (`[true, false].each`) for the now-consistent set (.5, 3., 1.5e3, 1e3)
|
|
69
|
+
plus characterization tests for the stays-a-string set above; CHANGELOG line noting the Ruby
|
|
70
|
+
fallback's numeric conversion now accepts scientific notation and bare-dot forms, matching the
|
|
71
|
+
accelerated path. Behavior change affects `acceleration: false` users only — and aligns them with
|
|
72
|
+
the default.
|
|
73
|
+
|
|
74
|
+
|
|
75
|
+
## Warn once when the C extension didn't load on a platform that supports it
|
|
76
|
+
|
|
77
|
+
Context: `acceleration: true` is the default. When the C extension fails to build / isn't loaded,
|
|
78
|
+
SmarterCSV silently falls back to the Ruby parser — graceful degradation by design (so the gem
|
|
79
|
+
keeps working for users with broken toolchains, JRuby, TruffleRuby, etc.). Today there is no
|
|
80
|
+
signal to the user that they're not getting the C path; their CSV parsing is just slower than
|
|
81
|
+
they might have expected.
|
|
82
|
+
|
|
83
|
+
Idea: emit a one-time warning when:
|
|
84
|
+
* the C extension is NOT loaded — `!SmarterCSV::Parser.respond_to?(:parse_csv_line_c)`, AND
|
|
85
|
+
* the platform is one where it *should* be available — `RUBY_ENGINE == 'ruby'` (MRI / CRuby).
|
|
86
|
+
JRuby and TruffleRuby don't load CRuby C extensions natively; nothing for the user to do.
|
|
87
|
+
|
|
88
|
+
Where to fire:
|
|
89
|
+
* NOT at `require 'smarter_csv'` time — Rails.logger typically isn't set up yet, so any
|
|
90
|
+
"route through the warnings system" code would just fall through to `Kernel#warn` anyway,
|
|
91
|
+
and the warning would land in stderr instead of the Rails log where ops would see it.
|
|
92
|
+
* At first `Reader.new` / `SmarterCSV.process` call — Rails has booted, the existing
|
|
93
|
+
routing-through-Rails.logger-or-Kernel#warn infra works, and the existing deduped warnings
|
|
94
|
+
histogram means it fires once per process regardless of how many parse calls.
|
|
95
|
+
|
|
96
|
+
Implementation sketch:
|
|
97
|
+
* Add a new warning code (e.g. `:c_extension_unavailable`) alongside the existing ones
|
|
98
|
+
(`:chunk_size_default`, `:header_a_method`, `:utf8_missing_binary_mode`, ...).
|
|
99
|
+
* Severity `:warn`. Suppressible via the existing `verbose: :quiet`.
|
|
100
|
+
* Message points at the fix — e.g. "C acceleration extension not loaded on this Ruby; using
|
|
101
|
+
Ruby parser. To enable acceleration, reinstall with `gem pristine smarter_csv` and check
|
|
102
|
+
the build log." Plus a link/pointer to a troubleshooting section in the docs.
|
|
103
|
+
|
|
104
|
+
Bonus: add a public predicate `SmarterCSV.acceleration_available?` returning
|
|
105
|
+
`Parser.respond_to?(:parse_csv_line_c)`. Zero noise, useful for scripts / CI / future spec
|
|
106
|
+
files that want to branch on the environment fact rather than guess.
|
|
107
|
+
|
|
108
|
+
NOT doing: a banner at `require` time (every Rails app would print it at boot, too noisy);
|
|
109
|
+
warning when `acceleration: false` was explicitly chosen (the user knows what they're doing).
|
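The numeric-conversion section of the TO_DO.md diff above contrasts the current strict regex with a proposed permissive one. The sketch below exercises both patterns exactly as quoted there. The constant names `STRICT_NUMERIC` and `PERMISSIVE_NUMERIC` and the helper `to_numeric` are hypothetical names chosen for this note; they are not the gem's `hash_transformations.rb` code.

```ruby
# Patterns copied from the TO_DO text; the names are illustrative only.
STRICT_NUMERIC     = /\A[+-]?\d+(?:\.\d+)?\z/
PERMISSIVE_NUMERIC = /\A[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?\z/

# Convert a string to Integer or Float when it matches the given pattern,
# otherwise return the string unchanged. The int-vs-float decision follows
# step 3 of the TO_DO: a dot or an exponent marker means Float.
def to_numeric(value, pattern)
  return value unless value.match?(pattern)

  value.match?(/[.eE]/) ? value.to_f : value.to_i
end

samples = [".5", "3.", "1.5e3", "010", "0x1F", "1_000", "1,200.00", "."]
samples.each do |s|
  strict     = to_numeric(s, STRICT_NUMERIC)
  permissive = to_numeric(s, PERMISSIVE_NUMERIC)
  puts format("%-10s strict=%-12s permissive=%s", s.inspect, strict.inspect, permissive.inspect)
end
```

Under these assumptions, `".5"`, `"3."`, and `"1.5e3"` become Floats only with the permissive pattern, `"010"` converts to `10` with either pattern, and the radix-prefixed, underscored, and comma-separated forms stay strings with both.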
data/docs/_introduction.md
CHANGED
@@ -16,6 +16,7 @@
 * [Data Transformations](./data_transformations.md)
 * [Value Converters](./value_converters.md)
 * [Bad Row Quarantine](./bad_row_quarantine.md)
+* [Warnings](./warnings.md)
 * [Instrumentation Hooks](./instrumentation.md)
 * [Examples](./examples.md)
 * [Real-World CSV Files](./real_world_csv.md)
data/docs/bad_row_quarantine.md
CHANGED
@@ -16,6 +16,7 @@
 * [Data Transformations](./data_transformations.md)
 * [Value Converters](./value_converters.md)
 * [**Bad Row Quarantine**](./bad_row_quarantine.md)
+* [Warnings](./warnings.md)
 * [Instrumentation Hooks](./instrumentation.md)
 * [Examples](./examples.md)
 * [Real-World CSV Files](./real_world_csv.md)
@@ -339,4 +340,4 @@ Normal rows (where the entire line fits within the limit) bypass per-field check
 
 --------------------
 
-PREVIOUS: [Value Converters](./value_converters.md) | NEXT: [
+PREVIOUS: [Value Converters](./value_converters.md) | NEXT: [Warnings](./warnings.md) | UP: [README](../README.md)
data/docs/basic_read_api.md
CHANGED
|
@@ -123,8 +123,9 @@ reader.each do |hash|
|
|
|
123
123
|
MyModel.upsert(hash)
|
|
124
124
|
end
|
|
125
125
|
|
|
126
|
-
puts reader.headers
|
|
126
|
+
puts reader.headers # accessible after processing
|
|
127
127
|
puts reader.errors.inspect
|
|
128
|
+
puts reader.warnings # see [Warnings](./warnings.md)
|
|
128
129
|
```
|
|
129
130
|
|
|
130
131
|
### Returns an Enumerator when called without a block
|
|
@@ -185,6 +186,10 @@ reader.each { |hash| MyModel.upsert(hash) }
|
|
|
185
186
|
reader.errors[:bad_rows].each { |rec| puts "Bad row: #{rec[:error_message]}" }
|
|
186
187
|
```
|
|
187
188
|
|
|
189
|
+
### Read-Transform-Write Pipelines
|
|
190
|
+
|
|
191
|
+
Composing `SmarterCSV.each` with `SmarterCSV.generate` is the idiomatic replacement for Ruby's `CSV.filter` — read CSV, mutate each row, write the result. See [Examples → Filtering and Transforming a CSV File](./examples.md#example-19-filtering-and-transforming-a-csv-file) for the full set of patterns (file → file, STDIN → STDOUT, gzip → gzip, header renaming).
|
|
192
|
+
|
|
188
193
|
---
|
|
189
194
|
|
|
190
195
|
## Value Transformation Pipeline
|
data/docs/basic_write_api.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
|
@@ -188,6 +189,31 @@ File.open('output.csv', 'w') do |f|
|
|
|
188
189
|
end
|
|
189
190
|
```
|
|
190
191
|
|
|
192
|
+
**Write to STDOUT (e.g. piping to another process):**
|
|
193
|
+
|
|
194
|
+
```ruby
|
|
195
|
+
SmarterCSV.generate($stdout) do |csv|
|
|
196
|
+
records.each { |r| csv << r }
|
|
197
|
+
end
|
|
198
|
+
```
|
|
199
|
+
|
|
200
|
+
Useful in CLI scripts: `ruby export.rb | gzip > out.csv.gz`.
|
|
201
|
+
|
|
202
|
+
**Stream a CSV upload to S3 — never written to disk:**
|
|
203
|
+
|
|
204
|
+
```ruby
|
|
205
|
+
require 'aws-sdk-s3'
|
|
206
|
+
|
|
207
|
+
obj = Aws::S3::Object.new(bucket_name: 'exports', key: 'reports/daily.csv')
|
|
208
|
+
obj.upload_stream do |stream|
|
|
209
|
+
SmarterCSV.generate(stream) do |csv|
|
|
210
|
+
Order.find_each { |o| csv << o.attributes }
|
|
211
|
+
end
|
|
212
|
+
end
|
|
213
|
+
```
|
|
214
|
+
|
|
215
|
+
`upload_stream` performs a multipart upload, so the CSV is sent to S3 incrementally as it's generated — memory usage stays flat regardless of result size.
|
|
216
|
+
|
|
191
217
|
### Full Interface
|
|
192
218
|
|
|
193
219
|
The full interface gives you direct access to the `Writer` instance, which is useful when you
|
|
@@ -616,6 +642,10 @@ end
|
|
|
616
642
|
> **Note:** `write_headers: false` only suppresses the header line. All other
|
|
617
643
|
> options (`col_sep:`, `row_sep:`, `value_converters:`, etc.) apply as normal.
|
|
618
644
|
|
|
645
|
+
## Read-Transform-Write Pipelines
|
|
646
|
+
|
|
647
|
+
Pairing `SmarterCSV.generate` with `SmarterCSV.each` on the read side is the idiomatic replacement for Ruby's `CSV.filter`. See [Examples → Filtering and Transforming a CSV File](./examples.md#example-19-filtering-and-transforming-a-csv-file) for the full set of patterns, including streaming gzip → gzip pipelines.
|
|
648
|
+
|
|
619
649
|
## More Examples
|
|
620
650
|
|
|
621
651
|
Check out the [RSpec tests](../spec/smarter_csv/writer_spec.rb) for more examples.
|
data/docs/batch_processing.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
|
@@ -210,6 +211,30 @@ SmarterCSV::Reader.new('products.csv', chunk_size: 25).each_chunk do |chunk, _in
|
|
|
210
211
|
end
|
|
211
212
|
```
|
|
212
213
|
|
|
214
|
+
## Example: Resumable Import (Plain Ruby)
|
|
215
|
+
|
|
216
|
+
Track the chunk cursor in a JSON state file so an interrupted import can resume where it left off — no Rails / ActiveJob required:
|
|
217
|
+
|
|
218
|
+
```ruby
|
|
219
|
+
require 'json'
|
|
220
|
+
|
|
221
|
+
STATE_FILE = '/var/run/import.state.json'
|
|
222
|
+
|
|
223
|
+
state = File.exist?(STATE_FILE) ? JSON.parse(File.read(STATE_FILE)) : { 'cursor' => 0 }
|
|
224
|
+
|
|
225
|
+
SmarterCSV.process('import.csv', chunk_size: 500) do |chunk, chunk_index|
|
|
226
|
+
next if chunk_index < state['cursor'] # skip already-processed chunks on resume
|
|
227
|
+
|
|
228
|
+
MyModel.import!(chunk)
|
|
229
|
+
state['cursor'] = chunk_index + 1
|
|
230
|
+
File.write(STATE_FILE, JSON.dump(state))
|
|
231
|
+
end
|
|
232
|
+
|
|
233
|
+
File.delete(STATE_FILE) # done — clear the cursor
|
|
234
|
+
```
|
|
235
|
+
|
|
236
|
+
If the process is killed at chunk 7, the next run skips chunks 0–6 quickly via `next` and resumes at chunk 7. For Rails 8.1+ projects, see [Examples → Resumable CSV Import with Rails ActiveJob](./examples.md#example-12-resumable-csv-import-with-rails-activejob-rails-81) for the framework-native version.
|
|
237
|
+
|
|
213
238
|
## Example: Reading a CSV from S3
|
|
214
239
|
|
|
215
240
|
SmarterCSV accepts any IO-like object, so you can stream directly from S3 without
|
data/docs/column_selection.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/data_transformations.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [**Data Transformations**](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|