smarter_csv 1.16.4 → 1.17.0.pre5
- checksums.yaml +4 -4
- data/.rubocop.yml +7 -1
- data/CHANGELOG.md +33 -0
- data/README.md +4 -1
- data/TO_DO_v2.md +17 -11
- data/docs/_introduction.md +1 -0
- data/docs/bad_row_quarantine.md +2 -1
- data/docs/basic_read_api.md +2 -1
- data/docs/basic_write_api.md +1 -0
- data/docs/batch_processing.md +1 -0
- data/docs/column_selection.md +1 -0
- data/docs/data_transformations.md +1 -0
- data/docs/examples.md +1 -0
- data/docs/header_transformations.md +1 -0
- data/docs/header_validations.md +1 -0
- data/docs/history.md +1 -0
- data/docs/instrumentation.md +2 -1
- data/docs/migrating_from_csv.md +1 -0
- data/docs/options.md +4 -3
- data/docs/parsing_strategy.md +1 -0
- data/docs/real_world_csv.md +6 -1
- data/docs/row_col_sep.md +2 -1
- data/docs/ruby_csv_pitfalls.md +1 -0
- data/docs/value_converters.md +24 -0
- data/docs/warnings.md +119 -0
- data/lib/smarter_csv/auto_detection.rb +73 -32
- data/lib/smarter_csv/file_io.rb +2 -2
- data/lib/smarter_csv/peekable_io.rb +432 -0
- data/lib/smarter_csv/reader.rb +121 -19
- data/lib/smarter_csv/reader_options.rb +14 -1
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +39 -11
- metadata +4 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 5d2154634f98b9df235995b9c6368e6208027d31da8d6d80ad09a526dd51fbf0
|
|
4
|
+
data.tar.gz: 62dd06196aef83b0e2c7dd6391ce9004fe95ba17310eec5f8ae9f2a74c2008a8
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: fdac413102754c859247b876f8a2130ff7dcf700c3531e4c58e72c36e0e53081c0ad7a93654d343428ed7c1c91dd0a35aa90e94877df72dab2b0aa5d9ae0bf65
|
|
7
|
+
data.tar.gz: c800d6e4c807ff2c502ac9c7f819de5b6eef582c587b505e1efe2b5ee284a941f809bdb7166d638ac3ee5595763116c72ccfbe76d30b83f83793a53b1f728a6c
|
data/.rubocop.yml
CHANGED
|
@@ -13,6 +13,9 @@ Layout/SpaceInsideHashLiteralBraces:
|
|
|
13
13
|
Layout/SpaceAroundOperators:
|
|
14
14
|
Enabled: false
|
|
15
15
|
|
|
16
|
+
Lint/UnderscorePrefixedVariableName:
|
|
17
|
+
Enabled: false
|
|
18
|
+
|
|
16
19
|
Metrics/AbcSize:
|
|
17
20
|
Enabled: false
|
|
18
21
|
|
|
@@ -37,6 +40,9 @@ Metrics/ModuleLength:
|
|
|
37
40
|
Metrics/PerceivedComplexity:
|
|
38
41
|
Enabled: false
|
|
39
42
|
|
|
43
|
+
Naming/MethodParameterName:
|
|
44
|
+
Enabled: false
|
|
45
|
+
|
|
40
46
|
Naming/PredicateName:
|
|
41
47
|
Enabled: false
|
|
42
48
|
|
|
@@ -156,7 +162,7 @@ Style/SymbolArray:
|
|
|
156
162
|
Style/SymbolProc: # old Ruby versions can't do this
|
|
157
163
|
Enabled: false
|
|
158
164
|
|
|
159
|
-
Style/TernaryParentheses:
|
|
165
|
+
Style/TernaryParentheses: # parentheses are good!
|
|
160
166
|
Enabled: false
|
|
161
167
|
|
|
162
168
|
Style/TrailingCommaInArrayLiteral:
|
data/CHANGELOG.md
CHANGED
|
@@ -1,6 +1,39 @@
|
|
|
1
1
|
|
|
2
2
|
# SmarterCSV 1.x Change Log
|
|
3
3
|
|
|
4
|
+
## 1.17.0.pre5 (2026-04-28)
|
|
5
|
+
|
|
6
|
+
RSpec tests: **1,434 → 1,905** (+471 tests)
|
|
7
|
+
|
|
8
|
+
### New Features
|
|
9
|
+
|
|
10
|
+
* **Streaming IO support** — SmarterCSV now works with non-seekable IO sources such as pipes, STDIN, and Zlib streams.
|
|
11
|
+
A rewindable peek buffer transparently captures the first bytes of the stream so that `row_sep` and `col_sep` auto-detection can replay them without requiring the underlying source to support `rewind` or `seek`.
|
|
12
|
+
|
|
13
|
+
* **Structured warnings** — auto-detection and configuration warnings are now collected on the Reader as a deduped histogram:
|
|
14
|
+
|
|
15
|
+
```ruby
|
|
16
|
+
reader = SmarterCSV::Reader.new('data.csv')
|
|
17
|
+
reader.process
|
|
18
|
+
reader.warnings # => [{ type:, code:, severity:, message:, count: }, ...]
|
|
19
|
+
```
|
|
20
|
+
|
|
21
|
+
Repeated warnings of the same `(type, code)` are deduped — `count` tracks occurrences. Available codes today: `:chunk_size_default`, `:header_a_method`, `:utf8_missing_binary_mode`, `:no_clear_row_sep`, `:no_row_sep_found`.
|
|
22
|
+
|
|
23
|
+
* **Class-level `SmarterCSV.warnings`** accessor — mirrors `SmarterCSV.errors`. Per-thread, cleared at the start of each `.process` / `.parse` / `.each` / `.each_chunk` call. Safe under Puma/Sidekiq.
|
|
24
|
+
|
|
25
|
+
* **Rails.logger routing** — when `Rails.logger` is present, warnings are routed through it at the severity declared at the call site (`:debug` / `:info` / `:warn` / `:error` / `:fatal`); otherwise `Kernel#warn` is used as a fallback. Detection is cached at construct time, no per-call overhead.
|
|
26
|
+
|
|
27
|
+
### Improvements
|
|
28
|
+
|
|
29
|
+
* Improved auto-detection of `row_sep` and `col_sep` — giving more accurate results on files with comment headers.
|
|
30
|
+
|
|
31
|
+
* Default value for `auto_row_sep_chars` changed from `500` to `8192`, providing a larger scan window for accurate row separator detection on files with wide headers or long first lines.
|
|
32
|
+
Values below `8192` (and `nil` / `0`) are now rejected and fall back to the default `8192` with a warning message.
|
|
33
|
+
This is a change from previous versions, where `nil` / `0` were documented as "scan whole file".
|
|
34
|
+
|
|
35
|
+
* `guess_line_ending` now scans the input in chunks up to a 64KB hard cap, returning as soon as one separator has a clear majority. Near-tie chunk-boundary artifacts no longer cause spurious warnings; only true ties at the hard cap fall back to `"\n"` and emit a `:no_clear_row_sep` warning at `:error` severity (risk of a silent mis-parse).
|
|
36
|
+
|
|
4
37
|
## 1.16.4 (2026-04-21) — Bug Fixes
|
|
5
38
|
|
|
6
39
|
RSpec tests: **1,434 → 1,467** (+33 tests)
|
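The structured-warnings entry in the 1.17.0.pre5 changelog above describes a histogram deduped by `(type, code)`. The following is a minimal sketch of that idea using only the Ruby standard library. The class name `WarningCollector` and its methods are hypothetical and are not taken from the gem; the sketch only assumes the record shape shown in the changelog (`type`, `code`, `severity`, `message`, `count`) and the documented rule that the first message is kept.

```ruby
# Hypothetical sketch of a deduped warning histogram; not SmarterCSV's code.
class WarningCollector
  def initialize
    # Keyed by [type, code]. Ruby hashes preserve insertion order,
    # so records come back in the order they were first seen.
    @records = {}
  end

  # Record one occurrence. For a repeated (type, code) pair only the
  # counter changes; the first message is kept.
  def add(type:, code:, severity:, message:)
    key = [type, code]
    if @records.key?(key)
      @records[key][:count] += 1
    else
      @records[key] = { type: type, code: code, severity: severity,
                        message: message, count: 1 }
    end
    self
  end

  # Return copies so callers cannot mutate the collector's state.
  def to_a
    @records.values.map(&:dup)
  end
end

collector = WarningCollector.new
collector.add(type: :config, code: :chunk_size_default, severity: :warn,
              message: "chunk_size not set, defaulting to 100")
collector.add(type: :config, code: :chunk_size_default, severity: :warn,
              message: "a later message that is not stored")
collector.add(type: :row_sep, code: :no_row_sep_found, severity: :error,
              message: "no known row separator found")

collector.to_a.each do |w|
  puts "[#{w[:type]}/#{w[:code]}] #{w[:message]} (count: #{w[:count]})"
end
```

Keying on the pair rather than on the message text keeps the collection small even when a warning fires once per chunk or per row.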
data/README.md
CHANGED
|
@@ -16,7 +16,9 @@
|
|
|
16
16
|
|
|
17
17
|
The library includes intelligent defaults, automatic detection of column and row separators, and flexible header/value transformations. These features eliminate much of the boilerplate typically required when working with CSV data and help keep ingestion code concise and maintainable.
|
|
18
18
|
|
|
19
|
-
For large files, SmarterCSV supports both chunked processing (arrays of hashes) and streaming via Enumerable APIs, enabling efficient batch jobs and low-memory pipelines.
|
|
19
|
+
For large files, SmarterCSV supports both chunked processing (arrays of hashes) and streaming via Enumerable APIs, enabling efficient batch jobs and low-memory pipelines.
|
|
20
|
+
As of 1.17.0, SmarterCSV also accepts **non-seekable streaming inputs** — pipes, `STDIN`, `Zlib::GzipReader`, and HTTP responses — with no need to materialize the file on disk first.
|
|
21
|
+
The C acceleration further optimizes the full ingestion path — including parsing, hash construction, and conversions — so performance gains reflect real-world workloads, not just tokenizer benchmarks.
|
|
20
22
|
|
|
21
23
|
The interface is intentionally designed to robustly handle messy real-world CSV while keeping application code clean. Developers can easily map headers, skip unwanted rows, quarantine problematic data, and transform values on the fly without building custom post-processing pipelines. See [Real-World CSV Files](docs/real_world_csv.md) for a comprehensive guide to production CSV patterns.
|
|
22
24
|
|
|
@@ -223,6 +225,7 @@ Or install it yourself as:
|
|
|
223
225
|
* [Data Transformations](docs/data_transformations.md)
|
|
224
226
|
* [Value Converters](docs/value_converters.md)
|
|
225
227
|
* [Bad Row Quarantine](docs/bad_row_quarantine.md)
|
|
228
|
+
* [Warnings](docs/warnings.md)
|
|
226
229
|
* [Instrumentation Hooks](docs/instrumentation.md)
|
|
227
230
|
* [Examples](docs/examples.md)
|
|
228
231
|
* [Real-World CSV Files](docs/real_world_csv.md)
|
data/TO_DO_v2.md
CHANGED
|
@@ -1,14 +1,20 @@
|
|
|
1
1
|
# SmarterCSV v2.0 TO DO List
|
|
2
2
|
|
|
3
|
-
|
|
4
|
-
|
|
5
|
-
|
|
6
|
-
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
* [2.0 BUG] convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
|
|
12
|
-
* Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
|
|
13
|
-
* Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)
|
|
3
|
+
DONE:
|
|
4
|
+
[X] Don't call rewind on filehandle
|
|
5
|
+
[X] use Procs for validations and transformations [issue #118](https://github.com/tilo/smarter_csv/issues/118)
|
|
6
|
+
[X] skip file opening, allow reading from CSV string, e.g. reading from S3 file [issue #120](https://github.com/tilo/smarter_csv/issues/120). Or stream large file from S3 (linked in the issue)
|
|
7
|
+
[X] [2.0 BUG] convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
|
|
8
|
+
[X] add enumerable to speed up parallel processing [issue #66](https://github.com/tilo/smarter_csv/issues/66), [issue #32](https://github.com/tilo/smarter_csv/issues/32)
|
|
9
|
+
[X] Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
|
|
10
|
+
[X] Collect all Errors, before surfacing them. Avoid throwing an exception on the first error [issue #133](https://github.com/tilo/smarter_csv/issues/133)
|
|
14
11
|
|
|
12
|
+
|
|
13
|
+
Partially Done:
|
|
14
|
+
[ ] make @errors and @warnings work [issue #118](https://github.com/tilo/smarter_csv/issues/118)
|
|
15
|
+
|
|
16
|
+
Still TO DO:
|
|
17
|
+
[ ] Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)
|
|
18
|
+
|
|
19
|
+
Arguably by design (e.g. exclude these columns from conversion and have them returned as a string)
|
|
20
|
+
[ ] [2.0 BUG] :convert_values_to_numeric_unless_leading_zeros drops leading zeros [issue #151](https://github.com/tilo/smarter_csv/issues/151)
|
data/docs/_introduction.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/bad_row_quarantine.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [**Bad Row Quarantine**](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
|
@@ -339,4 +340,4 @@ Normal rows (where the entire line fits within the limit) bypass per-field check
|
|
|
339
340
|
|
|
340
341
|
--------------------
|
|
341
342
|
|
|
342
|
-
PREVIOUS: [Value Converters](./value_converters.md) | NEXT: [
|
|
343
|
+
PREVIOUS: [Value Converters](./value_converters.md) | NEXT: [Warnings](./warnings.md) | UP: [README](../README.md)
|
data/docs/basic_read_api.md
CHANGED
|
@@ -123,8 +123,9 @@ reader.each do |hash|
|
|
|
123
123
|
MyModel.upsert(hash)
|
|
124
124
|
end
|
|
125
125
|
|
|
126
|
-
puts reader.headers
|
|
126
|
+
puts reader.headers # accessible after processing
|
|
127
127
|
puts reader.errors.inspect
|
|
128
|
+
puts reader.warnings # see [Warnings](./warnings.md)
|
|
128
129
|
```
|
|
129
130
|
|
|
130
131
|
### Returns an Enumerator when called without a block
|
data/docs/basic_write_api.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/batch_processing.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/column_selection.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/data_transformations.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [**Data Transformations**](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/examples.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [**Examples**](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/header_transformations.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/header_validations.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/history.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/instrumentation.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [**Instrumentation Hooks**](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
|
@@ -163,4 +164,4 @@ SmarterCSV.process(file, on_start: ON_START, on_complete: ON_COMPLETE)
|
|
|
163
164
|
```
|
|
164
165
|
|
|
165
166
|
--------------------
|
|
166
|
-
PREVIOUS: [
|
|
167
|
+
PREVIOUS: [Warnings](./warnings.md) | NEXT: [Examples](./examples.md) | UP: [README](../README.md)
|
data/docs/migrating_from_csv.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/options.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
|
@@ -71,8 +72,8 @@
|
|
|
71
72
|
| Option | Default | Explanation |
|
|
72
73
|
|--------|---------|-------------|
|
|
73
74
|
| `:col_sep` | `:auto` | Column separator. `:auto` detects from file content (previous default was `','`). |
|
|
74
|
-
| `:row_sep` | `:auto` | Row / record separator. `:auto` detects from file content
|
|
75
|
-
| `:auto_row_sep_chars` | `
|
|
75
|
+
| `:row_sep` | `:auto` | Row / record separator. `:auto` detects from file content by scanning in chunks of `auto_row_sep_chars` bytes, up to a 64KB hard cap. |
|
|
76
|
+
| `:auto_row_sep_chars` | `8192` | Chunk size used while scanning for `:row_sep => :auto`. Detection stops as soon as one separator has a clear majority, with a 64KB hard cap. Must be an Integer ≥ 8192; smaller values, `nil`, or `0` are rejected and fall back to the default with a warning. |
|
|
76
77
|
|
|
77
78
|
### Quoting
|
|
78
79
|
|
|
@@ -142,7 +143,7 @@ See [Bad Row Quarantine](./bad_row_quarantine.md) for full details.
|
|
|
142
143
|
| Option | Default | Explanation |
|
|
143
144
|
|--------|---------|-------------|
|
|
144
145
|
| `:with_line_numbers` | `false` | Add `:csv_line_number` to each result hash. |
|
|
145
|
-
| `:verbose` | `:normal` | Controls warning and diagnostic output. Accepted values:<br>• `:quiet` — suppress all warnings and notices (recommended for production)<br>• `:normal` — show behavioral warnings, e.g. auto-configuration notices **(default)**<br>• `:debug` — `:normal` + print computed options and per-row diagnostics to stderr<br>`nil` is silently treated as `:normal`. Passing `true` or `false` still works but is deprecated — see below. |
|
|
146
|
+
| `:verbose` | `:normal` | Controls warning and diagnostic output. Accepted values:<br>• `:quiet` — suppress all warnings and notices (recommended for production)<br>• `:normal` — show behavioral warnings, e.g. auto-configuration notices **(default)**<br>• `:debug` — `:normal` + print computed options and per-row diagnostics to stderr<br>`nil` is silently treated as `:normal`. Passing `true` or `false` still works but is deprecated — see below. See [Warnings](./warnings.md) for the structured warning collection. |
|
|
146
147
|
|
|
147
148
|
### Instrumentation Hooks
|
|
148
149
|
|
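The updated `:auto_row_sep_chars` row in the options.md hunk above states a validation rule: an Integer of at least 8192 is accepted, and anything else (smaller values, `nil`, `0`) falls back to the default with a warning. A minimal sketch of that rule follows. The method name `normalize_auto_row_sep_chars` and the message wording are hypothetical; only the threshold and the fallback behavior come from the diff.

```ruby
DEFAULT_AUTO_ROW_SEP_CHARS = 8192

# Hypothetical sketch, not the gem's code. Returns [effective_value, warning],
# where warning is nil when the input was accepted as-is.
def normalize_auto_row_sep_chars(value)
  if value.is_a?(Integer) && value >= DEFAULT_AUTO_ROW_SEP_CHARS
    return [value, nil]
  end

  [DEFAULT_AUTO_ROW_SEP_CHARS,
   "auto_row_sep_chars #{value.inspect} rejected, " \
   "using default #{DEFAULT_AUTO_ROW_SEP_CHARS}"]
end

[16_384, 500, 0, nil].each do |candidate|
  value, warning = normalize_auto_row_sep_chars(candidate)
  puts "#{candidate.inspect} -> #{value}#{warning ? " (#{warning})" : ''}"
end
```

Code that previously passed `nil` or `0` to request a whole-file scan would, under this rule, get the default chunk size instead.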
data/docs/parsing_strategy.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/real_world_csv.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [**Real-World CSV Files**](./real_world_csv.md)
|
|
@@ -186,10 +187,14 @@ Numeric conversion is one of the most common sources of data loss. SmarterCSV co
|
|
|
186
187
|
|
|
187
188
|
### I/O Patterns
|
|
188
189
|
|
|
190
|
+
SmarterCSV accepts any IO-compatible source — file paths, open `File` handles, `StringIO`, and **non-seekable streams** like pipes, `STDIN`, and `Zlib::GzipReader`. Auto-detection of `row_sep` / `col_sep` works on streaming sources too: SmarterCSV captures the first bytes in an internal peek buffer and replays them, so the underlying source never needs to support `rewind` or `seek`. (Streaming IO support landed in 1.17.0.)
|
|
191
|
+
|
|
189
192
|
| Source | Issue | Status | Notes |
|
|
190
193
|
|--------|-------|--------|-------|
|
|
191
|
-
| Gzipped CSV (`.csv.gz`) | Compressed
|
|
194
|
+
| Gzipped CSV (`.csv.gz`) | Compressed, non-seekable stream | 🔘 | `SmarterCSV.process(Zlib::GzipReader.open(path))` — no need to decompress to disk first. |
|
|
192
195
|
| HTTP streaming | Parsing from a live HTTP response | 🔘 | Pass any IO-compatible object that responds to `#gets`. |
|
|
196
|
+
| `STDIN` / shell pipes | Non-seekable input | 🔘 | `cat data.csv \| ruby -rsmarter_csv -e 'SmarterCSV.process(STDIN) { \|h\| ... }'` |
|
|
197
|
+
| `IO.popen` output | Non-seekable subprocess stream | 🔘 | `IO.popen('zcat data.csv.gz') { \|io\| SmarterCSV.process(io) }` |
|
|
193
198
|
|
|
194
199
|
†: Legacy Apple DB Dump and older UNIX data dumps use ASCII control characters as delimiters:
|
|
195
200
|
|
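The I/O Patterns paragraph added in the real_world_csv.md hunk above says the first bytes of a stream are captured in a peek buffer and replayed, so the source never needs `rewind` or `seek`. The sketch below illustrates that technique with the standard library only. `PeekBufferedIO` and its methods are hypothetical names and do not describe the gem's `peekable_io.rb`; the sketch also assumes ASCII separators.

```ruby
# Hypothetical sketch of a peek buffer over a non-seekable source.
class PeekBufferedIO
  CHUNK = 8192

  def initialize(io)
    @io = io
    @buffer = String.new # binary; bytes pulled from the source, not yet consumed
  end

  # Return up to n bytes from the front of the stream without consuming them.
  def peek(n)
    pull while @buffer.bytesize < n && !@eof
    @buffer.byteslice(0, n)
  end

  # Return the next line including sep, or nil at end of stream.
  # Peeked bytes are served first, so nothing is lost by peeking.
  def gets(sep = "\n")
    idx = nil
    until (idx = @buffer.index(sep)) || @eof
      pull
    end
    return nil if @buffer.empty?

    cut = idx ? idx + sep.bytesize : @buffer.bytesize
    line = @buffer.byteslice(0, cut)
    @buffer = @buffer.byteslice(cut, @buffer.bytesize - cut)
    line
  end

  private

  # Append one chunk from the source; remember when the source is exhausted.
  def pull
    chunk = @io.read(CHUNK)
    if chunk.nil? || chunk.empty?
      @eof = true
    else
      @buffer << chunk.b
    end
  end
end

# A pipe is non-seekable, which makes it a reasonable stand-in for STDIN.
reader, writer = IO.pipe
writer.write("a;b\r\n1;2\r\n")
writer.close

io = PeekBufferedIO.new(reader)
head = io.peek(64)                               # inspect the head for detection
row_sep = head.include?("\r\n") ? "\r\n" : "\n"  # toy detection, for the demo only
while (line = io.gets(row_sep))
  puts line.inspect
end
```

Because `gets` always reads from the same buffer that `peek` fills, a separator split across two reads is still found once both halves are buffered.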
data/docs/row_col_sep.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
|
@@ -30,7 +31,7 @@
|
|
|
30
31
|
|
|
31
32
|
Convenient defaults allow automatic detection of the column and row separators: `row_sep: :auto`, `col_sep: :auto`. This makes it easier to process any CSV files without having to examine the line endings or column separators, e.g. when users upload CSV files to your service and you have no control over the incoming files.
|
|
32
33
|
|
|
33
|
-
|
|
34
|
+
The setting `:auto_row_sep_chars` controls the chunk size used while scanning for the row separator (default is 8192). Detection reads in chunks of this size and stops as soon as one separator has a clear majority, with a 64KB hard cap. Values below 8192 (and `nil` / `0`) are rejected and fall back to the default with a warning. Of course you can also set the `:row_sep` manually.
|
|
34
35
|
|
|
35
36
|
|
|
36
37
|
## Column Separator `col_sep`
|
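The row_col_sep.md hunk above describes row separator detection as a chunked scan that stops once one separator has a clear majority, with a 64KB hard cap. A minimal sketch of such a scan follows. It is not the gem's `guess_line_ending`: the method name is hypothetical, "clear majority" is assumed here to mean strictly more hits than all other candidates combined, and this sketch also reports a tie when the input ends before the cap.

```ruby
require "stringio"

ROW_SEP_HARD_CAP = 64 * 1024

# Hypothetical sketch. Returns [separator, warning_code]; warning_code is nil
# when detection was unambiguous.
def guess_line_ending_sketch(io, chunk_size = 8192)
  counts = Hash.new(0)
  scanned = 0
  carry = ""
  clear_winner = lambda do
    best, hits = counts.max_by { |_, n| n }
    best if best && hits > counts.values.sum - hits
  end

  while scanned < ROW_SEP_HARD_CAP &&
        (chunk = io.read([chunk_size, ROW_SEP_HARD_CAP - scanned].min))
    scanned += chunk.bytesize
    text = carry + chunk
    # Hold back a trailing "\r" so a "\r\n" split across two chunks is
    # counted once, as "\r\n", when the next chunk arrives.
    carry = text.end_with?("\r") ? text.slice!(-1) : ""
    text.scan(/\r\n|\n|\r/) { |sep| counts[sep] += 1 }
    winner = clear_winner.call
    return [winner, nil] if winner
  end

  counts["\r"] += 1 unless carry.empty?
  return ["\n", :no_row_sep_found] if counts.empty?

  winner = clear_winner.call
  winner ? [winner, nil] : ["\n", :no_clear_row_sep]
end

p guess_line_ending_sketch(StringIO.new("a,b\r\n1,2\r\n"))
p guess_line_ending_sketch(StringIO.new("a\nb\rc"))
p guess_line_ending_sketch(StringIO.new("no separators here"))
```

The warning codes reuse the names from the changelog (`:no_clear_row_sep`, `:no_row_sep_found`) purely for illustration.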
data/docs/ruby_csv_pitfalls.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/value_converters.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [**Value Converters**](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
|
@@ -113,6 +114,29 @@ def self.convert(value)
|
|
|
113
114
|
end
|
|
114
115
|
```
|
|
115
116
|
|
|
117
|
+
## Handling Numeric Inputs
|
|
118
|
+
|
|
119
|
+
Converters run **after** `convert_values_to_numeric`, so a field that looks like a
|
|
120
|
+
number (e.g. `"42"`, `"3.14"`) will already be an `Integer` or `Float` by the time
|
|
121
|
+
your converter sees it. If your converter expects a string, guard against this:
|
|
122
|
+
|
|
123
|
+
```ruby
|
|
124
|
+
# Safe: passes already-numeric values through unchanged
|
|
125
|
+
dollar = ->(v) { v.is_a?(String) ? v.sub('$', '').to_f : v }
|
|
126
|
+
|
|
127
|
+
# Unsafe: raises NoMethodError on Integer/Float (no #sub)
|
|
128
|
+
dollar = ->(v) { v.sub('$', '').to_f }
|
|
129
|
+
```
|
|
130
|
+
|
|
131
|
+
Alternatively, exclude the column from numeric conversion so the converter always
|
|
132
|
+
receives a string:
|
|
133
|
+
|
|
134
|
+
```ruby
|
|
135
|
+
SmarterCSV.process(file,
|
|
136
|
+
convert_values_to_numeric: { except: [:price] },
|
|
137
|
+
value_converters: { price: ->(v) { v&.sub('$', '')&.to_f } })
|
|
138
|
+
```
|
|
139
|
+
|
|
116
140
|
## Class-Based Converters
|
|
117
141
|
|
|
118
142
|
For converters you want to reuse across the codebase or test independently, define a class
|
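The "Handling Numeric Inputs" section added in the value_converters.md hunk above depends on an ordering: numeric conversion runs first, then the converter. The sketch below simulates that ordering with two stand-in stages so the failure mode can be reproduced without the gem. `to_numeric_sketch` and `convert_row_sketch` are hypothetical helpers and their regexes only approximate what a numeric conversion step might accept.

```ruby
# Hypothetical stand-in for the numeric conversion stage.
def to_numeric_sketch(value)
  case value
  when /\A-?\d+\z/      then value.to_i
  when /\A-?\d+\.\d+\z/ then value.to_f
  else value
  end
end

# Hypothetical stand-in for the pipeline: numeric conversion first,
# then the per-column converter, if one is registered.
def convert_row_sketch(row, converters)
  row.to_h do |key, raw|
    value = to_numeric_sketch(raw)
    converter = converters[key]
    [key, converter ? converter.call(value) : value]
  end
end

guarded   = ->(v) { v.is_a?(String) ? v.sub('$', '').to_f : v }
unguarded = ->(v) { v.sub('$', '').to_f }

row = { price: "$4.50", qty: "42" }

# "$4.50" is not numeric-looking, so the converter still receives a String;
# "42" has already become an Integer by the time the converter sees it.
p convert_row_sketch(row, price: guarded, qty: guarded)

begin
  convert_row_sketch(row, qty: unguarded)
rescue NoMethodError => e
  puts "unguarded converter failed: #{e.class}"
end
```

The guard matters only for columns whose raw values can look numeric; a column that always carries a currency symbol would reach the converter as a String either way.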
data/docs/warnings.md
ADDED
|
@@ -0,0 +1,119 @@
|
|
|
1
|
+
|
|
2
|
+
### Contents
|
|
3
|
+
|
|
4
|
+
* [Introduction](./_introduction.md)
|
|
5
|
+
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
6
|
+
* [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
|
|
7
|
+
* [Parsing Strategy](./parsing_strategy.md)
|
|
8
|
+
* [The Basic Read API](./basic_read_api.md)
|
|
9
|
+
* [The Basic Write API](./basic_write_api.md)
|
|
10
|
+
* [Batch Processing](./batch_processing.md)
|
|
11
|
+
* [Configuration Options](./options.md)
|
|
12
|
+
* [Row and Column Separators](./row_col_sep.md)
|
|
13
|
+
* [Header Transformations](./header_transformations.md)
|
|
14
|
+
* [Header Validations](./header_validations.md)
|
|
15
|
+
* [Column Selection](./column_selection.md)
|
|
16
|
+
* [Data Transformations](./data_transformations.md)
|
|
17
|
+
* [Value Converters](./value_converters.md)
|
|
18
|
+
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [**Warnings**](./warnings.md)
|
|
20
|
+
* [Instrumentation Hooks](./instrumentation.md)
|
|
21
|
+
* [Examples](./examples.md)
|
|
22
|
+
* [Real-World CSV Files](./real_world_csv.md)
|
|
23
|
+
* [SmarterCSV over the Years](./history.md)
|
|
24
|
+
* [Release Notes](./releases/1.16.0/changes.md)
|
|
25
|
+
|
|
26
|
+
--------------
|
|
27
|
+
|
|
28
|
+
# Warnings
|
|
29
|
+
|
|
30
|
+
SmarterCSV records auto-detection and configuration warnings into a structured
|
|
31
|
+
collection on the Reader, in addition to emitting them to a log sink. This lets
|
|
32
|
+
you inspect warnings programmatically (e.g. surface them in dashboards, fail
|
|
33
|
+
deploys on unexpected codes) without parsing stderr text.
|
|
34
|
+
|
|
35
|
+
## Accessing warnings
|
|
36
|
+
|
|
37
|
+
### Via the Reader API
|
|
38
|
+
|
|
39
|
+
```ruby
|
|
40
|
+
reader = SmarterCSV::Reader.new('data.csv')
|
|
41
|
+
reader.process
|
|
42
|
+
|
|
43
|
+
reader.warnings
|
|
44
|
+
# => [
|
|
45
|
+
# { type: :config, code: :chunk_size_default, severity: :warn,
|
|
46
|
+
# message: "chunk_size not set, defaulting to 100. ...", count: 1 },
|
|
47
|
+
# ...
|
|
48
|
+
# ]
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
### Via the class-level API (`SmarterCSV.warnings`)
|
|
52
|
+
|
|
53
|
+
Mirrors `SmarterCSV.errors`. Returns the warnings from the most recent call to
|
|
54
|
+
`process`, `parse`, `each`, or `each_chunk` on the current thread. Cleared at
|
|
55
|
+
the start of each new call.
|
|
56
|
+
|
|
57
|
+
```ruby
|
|
58
|
+
SmarterCSV.process('data.csv')
|
|
59
|
+
SmarterCSV.warnings.each do |w|
|
|
60
|
+
logger.warn("[#{w[:type]}/#{w[:code]}] #{w[:message]} (×#{w[:count]})")
|
|
61
|
+
end
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
> **Note:** `SmarterCSV.warnings` is per-thread (uses `Thread.current`). It is
|
|
65
|
+
> safe in multi-threaded environments (Puma, Sidekiq), but **not fiber-safe**.
|
|
66
|
+
> If you process CSV files concurrently in fibers (e.g. with `Async`, `Falcon`,
|
|
67
|
+
> or manual `Fiber` scheduling), use `SmarterCSV::Reader` directly so warnings
|
|
68
|
+
> are scoped to the reader instance.
|
|
69
|
+
|
|
70
|
+
## Warning record shape
|
|
71
|
+
|
|
72
|
+
| Field | Description |
|
|
73
|
+
|---|---|
|
|
74
|
+
| `type` | Coarse semantic grouping. Currently: `:config`, `:deprecation`, `:encoding`, `:row_sep`. |
|
|
75
|
+
| `code` | Unique identifier for the specific warning. |
|
|
76
|
+
| `severity` | Log level: `:debug` / `:info` / `:warn` / `:error` / `:fatal`. |
|
|
77
|
+
| `message` | Human-readable description. |
|
|
78
|
+
| `count` | Number of times this `(type, code)` was triggered during the run. |
|
|
79
|
+
|
|
80
|
+
Repeated warnings of the same `(type, code)` are deduped — `count` tracks
|
|
81
|
+
occurrences. The `message` is the first one emitted.
|
|
82
|
+
|
|
83
|
+
## Available codes
|
|
84
|
+
|
|
85
|
+
| Code | Type | Severity | Triggered when |
|
|
86
|
+
|---|---|---|---|
|
|
87
|
+
| `:chunk_size_default` | `:config` | `:warn` | `each_chunk` is called without `chunk_size:` and the default of `100` is used. |
|
|
88
|
+
| `:header_a_method` | `:deprecation` | `:warn` | The deprecated `Reader#headerA` accessor is called. |
|
|
89
|
+
| `:utf8_missing_binary_mode` | `:encoding` | `:warn` | UTF-8 input is being processed but the IO was not opened with `"b:utf-8"`. |
|
|
90
|
+
| `:no_clear_row_sep` | `:row_sep` | `:error` | Auto-detection found a true tie between separators after scanning 64KB. Falls back to `"\n"`, which risks a silent mis-parse. |
|
|
91
|
+
| `:no_row_sep_found` | `:row_sep` | `:error` | No known row separator was found in the first 64KB. Falls back to `"\n"`. Likely an exotic separator like `\u2028`. |
|
|
92
|
+
|
|
93
|
+
## Log sink routing
|
|
94
|
+
|
|
95
|
+
The sink used when a warning is emitted is selected once, at Reader construction time:
|
|
96
|
+
|
|
97
|
+
* **Rails.logger present** — the warning is routed through `Rails.logger` at
|
|
98
|
+
the declared `severity`. `Rails.logger.warn(...)`, `Rails.logger.error(...)`,
|
|
99
|
+
etc.
|
|
100
|
+
* **No Rails.logger** — falls back to `Kernel#warn` (writes to `$stderr`).
|
|
101
|
+
|
|
102
|
+
Detection is one-shot at construct time, so there is no per-call overhead.
|
|
103
|
+
|
|
104
|
+
## Suppressing warnings
|
|
105
|
+
|
|
106
|
+
Pass `verbose: :quiet` to suppress both the recording and the log emission of
|
|
107
|
+
all warnings. Currently this affects every code listed above.
|
|
108
|
+
|
|
109
|
+
```ruby
|
|
110
|
+
SmarterCSV.process('data.csv', verbose: :quiet)
|
|
111
|
+
SmarterCSV.warnings # => []
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
> ⚠️ Suppressing `:row_sep` warnings hides a genuine risk of silent mis-parsing on
|
|
115
|
+
> ambiguous files. Prefer passing `row_sep:` explicitly over silencing.
|
|
116
|
+
|
|
117
|
+
----------------
|
|
118
|
+
|
|
119
|
+
PREVIOUS: [Bad Row Quarantine](./bad_row_quarantine.md) | NEXT: [Instrumentation Hooks](./instrumentation.md) | UP: [README](../README.md)
|
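The "Log sink routing" section in the warnings.md hunk above describes choosing the sink once at construction time and then emitting at the declared severity. A minimal sketch of that pattern follows. `build_warning_sink` is a hypothetical helper, and the logger is passed in explicitly; how the gem itself detects `Rails.logger` is not shown in this diff excerpt, so that part is left out.

```ruby
require "logger"
require "stringio"

# Hypothetical sketch: decide on the sink once and return a callable, so each
# later emit is a plain call with no repeated detection.
def build_warning_sink(logger = nil)
  if logger
    # Severity maps directly onto the Logger method of the same name
    # (:debug, :info, :warn, :error, :fatal).
    ->(severity, message) { logger.public_send(severity, message) }
  else
    # Fallback: Kernel#warn writes to $stderr and ignores severity.
    ->(_severity, message) { Kernel.warn(message) }
  end
end

log_output = StringIO.new
sink = build_warning_sink(Logger.new(log_output))
sink.call(:error, "no clear row separator, falling back to \\n")
sink.call(:warn, "chunk_size not set, defaulting to 100")
puts log_output.string

fallback = build_warning_sink
fallback.call(:warn, "written to stderr via Kernel#warn")
```

In a Rails process the explicit argument could be `Rails.logger`; that wiring is an assumption of this sketch, not a statement about the gem's internals.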