smarter_csv 1.16.3 → 1.17.0.pre5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: b40fb76fef88599d7449691806af6f76a131ffcd41e2ee145d5c87f3554a2006
4
- data.tar.gz: edb27057973c0a88524579f450dc8ed3ffadbd983de7155a84ff159495c31233
3
+ metadata.gz: 5d2154634f98b9df235995b9c6368e6208027d31da8d6d80ad09a526dd51fbf0
4
+ data.tar.gz: 62dd06196aef83b0e2c7dd6391ce9004fe95ba17310eec5f8ae9f2a74c2008a8
5
5
  SHA512:
6
- metadata.gz: 5b9e2a17ae14a5d7b3dfddd854148f38c8892d67544b51fecba888aa11c540096057460880564b89251194fea513c3e2807c500d93f896826db885daf64271fe
7
- data.tar.gz: 90b5aa10c6bc36cbc97662de879deaf61c2417a61e4cb5171924aac991a37c762588ff6413a924bfd083ca10c1fa1739049561f3c6e9de6ff196bdba85c7878f
6
+ metadata.gz: fdac413102754c859247b876f8a2130ff7dcf700c3531e4c58e72c36e0e53081c0ad7a93654d343428ed7c1c91dd0a35aa90e94877df72dab2b0aa5d9ae0bf65
7
+ data.tar.gz: c800d6e4c807ff2c502ac9c7f819de5b6eef582c587b505e1efe2b5ee284a941f809bdb7166d638ac3ee5595763116c72ccfbe76d30b83f83793a53b1f728a6c
data/.rubocop.yml CHANGED
@@ -13,6 +13,9 @@ Layout/SpaceInsideHashLiteralBraces:
13
13
  Layout/SpaceAroundOperators:
14
14
  Enabled: false
15
15
 
16
+ Lint/UnderscorePrefixedVariableName:
17
+ Enabled: false
18
+
16
19
  Metrics/AbcSize:
17
20
  Enabled: false
18
21
 
@@ -37,6 +40,9 @@ Metrics/ModuleLength:
37
40
  Metrics/PerceivedComplexity:
38
41
  Enabled: false
39
42
 
43
+ Naming/MethodParameterName:
44
+ Enabled: false
45
+
40
46
  Naming/PredicateName:
41
47
  Enabled: false
42
48
 
@@ -121,6 +127,9 @@ Style/PercentLiteralDelimiters:
121
127
  Style/RegexpLiteral:
122
128
  Enabled: false
123
129
 
130
+ Style/RescueModifier:
131
+ Enabled: false
132
+
124
133
  Style/SafeNavigation:
125
134
  Enabled: false
126
135
 
@@ -153,6 +162,9 @@ Style/SymbolArray:
153
162
  Style/SymbolProc: # old Ruby versions can't do this
154
163
  Enabled: false
155
164
 
165
+ Style/TernaryParentheses: # parentheses are good!
166
+ Enabled: false
167
+
156
168
  Style/TrailingCommaInArrayLiteral:
157
169
  Enabled: false
158
170
  EnforcedStyleForMultiline: consistent_comma
data/CHANGELOG.md CHANGED
@@ -1,6 +1,51 @@
1
1
 
2
2
  # SmarterCSV 1.x Change Log
3
3
 
4
+ ## 1.17.0.pre5 (2026-04-28)
5
+
6
+ RSpec tests: **1,434 → 1,905** (+471 tests)
7
+
8
+ ### New Features
9
+
10
+ * **Streaming IO support** — SmarterCSV now works with non-seekable IO sources such as pipes, STDIN, and Zlib streams.
11
+ A rewindable peek buffer transparently captures the first bytes of the stream so that `row_sep` and `col_sep` auto-detection can replay them without requiring the underlying source to support `rewind` or `seek`.
12
+
13
+ * **Structured warnings** — auto-detection and configuration warnings are now collected on the Reader as a deduped histogram:
14
+
15
+ ```ruby
16
+ reader = SmarterCSV::Reader.new('data.csv')
17
+ reader.process
18
+ reader.warnings # => [{ type:, code:, severity:, message:, count: }, ...]
19
+ ```
20
+
21
+ Repeated warnings of the same `(type, code)` are deduped — `count` tracks occurrences. Available codes today: `:chunk_size_default`, `:header_a_method`, `:utf8_missing_binary_mode`, `:no_clear_row_sep`, `:no_row_sep_found`.
22
+
23
+ * **Class-level `SmarterCSV.warnings`** accessor — mirrors `SmarterCSV.errors`. Per-thread, cleared at the start of each `.process` / `.parse` / `.each` / `.each_chunk` call. Safe under Puma/Sidekiq.
24
+
25
+ * **Rails.logger routing** — when `Rails.logger` is present, warnings are routed through it at the severity declared at the call site (`:debug` / `:info` / `:warn` / `:error` / `:fatal`); otherwise `Kernel#warn` is used as a fallback. Detection is cached at construct time, no per-call overhead.
26
+
27
+ ### Improvements
28
+
29
+ * Improved auto-detection of `row_sep` and `col_sep` — giving more accurate results on files with comment headers.
30
+
31
+ * Default value for `auto_row_sep_chars` changed from `500` to `8192`, providing a larger scan window for accurate row separator detection on files with wide headers or long first lines.
32
+ Values below `8192` (and `nil` / `0`) are now rejected and fall back to the default `8192` with a warning message.
33
+ This changes the previous behavior, in which `nil` / `0` were documented as "scan whole file".
34
+
35
+ * `guess_line_ending` now scans the input in chunks up to a 64KB hard cap, returning as soon as one separator has a clear majority. Near-tie chunk-boundary artifacts no longer cause spurious warnings; only true ties at the hard cap fall back to `"\n"` and emit a `:no_clear_row_sep` warning at `:error` severity (risk of a silent misparse).
36
+
37
+ ## 1.16.4 (2026-04-21) — Bug Fixes
38
+
39
+ RSpec tests: **1,434 → 1,467** (+33 tests)
40
+
41
+ ### Bug Fixes
42
+
43
+ * Fixed bug in `SmarterCSV.errors` that could lose collected records when processing raises mid-stream,
44
+ e.g. when `bad_row_limit:` was exceeded (`TooManyBadRows`), or when a user's block raised through `.process` / `.each` / `.each_chunk`.
45
+
46
+ * Fixed `enforce_utf8_encoding` incorrectly replacing all non-ASCII bytes when the input string was tagged as `ASCII-8BIT` (binary).
47
+ The encoding is now relabeled to UTF-8 before transcoding, so only genuinely invalid byte sequences are replaced.
48
+
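The relabel-then-replace fix described for `enforce_utf8_encoding` can be illustrated with plain Ruby string methods. This is a minimal sketch of the general technique, not the gem's actual code; the sample bytes and the variable names are made up for illustration.

```ruby
# Bytes for "café" in UTF-8 plus one invalid byte (0xFF), tagged as binary.
raw = "caf\xC3\xA9 \xFF".b # encoding: ASCII-8BIT

# Transcoding from binary treats every byte >= 0x80 as undefined,
# so the valid two-byte "é" is lost along with the bad byte.
naive = raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')

# Relabeling first keeps valid UTF-8 sequences; scrub replaces only 0xFF.
fixed = raw.dup.force_encoding('UTF-8').scrub('?')

puts naive # caf?? ?
puts fixed # café ?
```

The difference is that `force_encoding` only changes the label on the bytes, while `encode` converts them, so relabeling has to happen before any replacement of invalid sequences.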
4
49
  ## 1.16.3 (2026-04-14) — New Feature
5
50
 
6
51
  RSpec tests: **1,425 → 1,434** (+9 tests)
data/README.md CHANGED
@@ -16,7 +16,9 @@
16
16
 
17
17
  The library includes intelligent defaults, automatic detection of column and row separators, and flexible header/value transformations. These features eliminate much of the boilerplate typically required when working with CSV data and help keep ingestion code concise and maintainable.
18
18
 
19
- For large files, SmarterCSV supports both chunked processing (arrays of hashes) and streaming via Enumerable APIs, enabling efficient batch jobs and low-memory pipelines. The C acceleration further optimizes the full ingestion path — including parsing, hash construction, and conversions — so performance gains reflect real-world workloads, not just tokenizer benchmarks.
19
+ For large files, SmarterCSV supports both chunked processing (arrays of hashes) and streaming via Enumerable APIs, enabling efficient batch jobs and low-memory pipelines.
20
+ As of 1.17.0, SmarterCSV also accepts **non-seekable streaming inputs** — pipes, `STDIN`, `Zlib::GzipReader`, and HTTP responses — with no need to materialize the file on disk first.
21
+ The C acceleration further optimizes the full ingestion path — including parsing, hash construction, and conversions — so performance gains reflect real-world workloads, not just tokenizer benchmarks.
20
22
 
21
23
  The interface is intentionally designed to robustly handle messy real-world CSV while keeping application code clean. Developers can easily map headers, skip unwanted rows, quarantine problematic data, and transform values on the fly without building custom post-processing pipelines. See [Real-World CSV Files](docs/real_world_csv.md) for a comprehensive guide to production CSV patterns.
22
24
 
@@ -223,6 +225,7 @@ Or install it yourself as:
223
225
  * [Data Transformations](docs/data_transformations.md)
224
226
  * [Value Converters](docs/value_converters.md)
225
227
  * [Bad Row Quarantine](docs/bad_row_quarantine.md)
228
+ * [Warnings](docs/warnings.md)
226
229
  * [Instrumentation Hooks](docs/instrumentation.md)
227
230
  * [Examples](docs/examples.md)
228
231
  * [Real-World CSV Files](docs/real_world_csv.md)
data/TO_DO_v2.md CHANGED
@@ -1,14 +1,20 @@
1
1
  # SmarterCSV v2.0 TO DO List
2
2
 
3
- * add enumerable to speed up parallel processing [issue #66](https://github.com/tilo/smarter_csv/issues/66), [issue #32](https://github.com/tilo/smarter_csv/issues/32)
4
- * use Procs for validations and transformations [issue #118](https://github.com/tilo/smarter_csv/issues/118)
5
- * make @errors and @warnings work [issue #118](https://github.com/tilo/smarter_csv/issues/118)
6
- * skip file opening, allow reading from CSV string, e.g. reading from S3 file [issue #120](https://github.com/tilo/smarter_csv/issues/120).
7
- Or stream large file from S3 (linked in the issue)
8
- * Collect all Errors, before surfacing them. Avoid throwing an exception on the first error [issue #133](https://github.com/tilo/smarter_csv/issues/133)
9
- * Don't call rewind on filehandle
10
- * [2.0 BUG] :convert_values_to_numeric_unless_leading_zeros drops leading zeros [issue #151](https://github.com/tilo/smarter_csv/issues/151)
11
- * [2.0 BUG] convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
12
- * Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
13
- * Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)
3
+ DONE:
4
+ [X] Don't call rewind on filehandle
5
+ [X] use Procs for validations and transformations [issue #118](https://github.com/tilo/smarter_csv/issues/118)
6
+ [X] skip file opening, allow reading from CSV string, e.g. reading from S3 file [issue #120](https://github.com/tilo/smarter_csv/issues/120). Or stream large file from S3 (linked in the issue)
7
+ [X] [2.0 BUG] convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
8
+ [X] add enumerable to speed up parallel processing [issue #66](https://github.com/tilo/smarter_csv/issues/66), [issue #32](https://github.com/tilo/smarter_csv/issues/32)
9
+ [X] Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
10
+ [X] Collect all Errors, before surfacing them. Avoid throwing an exception on the first error [issue #133](https://github.com/tilo/smarter_csv/issues/133)
14
11
 
12
+
13
+ Partially Done:
14
+ [ ] make @errors and @warnings work [issue #118](https://github.com/tilo/smarter_csv/issues/118)
15
+
16
+ Still TO DO:
17
+ [ ] Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)
18
+
19
+ Arguably by design (e.g. exclude these columns from conversion and have them returned as a string)
20
+ [ ] [2.0 BUG] :convert_values_to_numeric_unless_leading_zeros drops leading zeros [issue #151](https://github.com/tilo/smarter_csv/issues/151)
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [**Bad Row Quarantine**](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -339,4 +340,4 @@ Normal rows (where the entire line fits within the limit) bypass per-field check
339
340
 
340
341
  --------------------
341
342
 
342
- PREVIOUS: [Value Converters](./value_converters.md) | NEXT: [Instrumentation Hooks](./instrumentation.md) | UP: [README](../README.md)
343
+ PREVIOUS: [Value Converters](./value_converters.md) | NEXT: [Warnings](./warnings.md) | UP: [README](../README.md)
@@ -123,8 +123,9 @@ reader.each do |hash|
123
123
  MyModel.upsert(hash)
124
124
  end
125
125
 
126
- puts reader.headers # accessible after processing
126
+ puts reader.headers # accessible after processing
127
127
  puts reader.errors.inspect
128
+ puts reader.warnings # see [Warnings](./warnings.md)
128
129
  ```
129
130
 
130
131
  ### Returns an Enumerator when called without a block
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -16,6 +16,7 @@
16
16
  * [**Data Transformations**](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
data/docs/examples.md CHANGED
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [**Examples**](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
data/docs/history.md CHANGED
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [**Instrumentation Hooks**](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -163,4 +164,4 @@ SmarterCSV.process(file, on_start: ON_START, on_complete: ON_COMPLETE)
163
164
  ```
164
165
 
165
166
  --------------------
166
- PREVIOUS: [Bad Row Quarantine](./bad_row_quarantine.md) | NEXT: [Examples](./examples.md) | UP: [README](../README.md)
167
+ PREVIOUS: [Warnings](./warnings.md) | NEXT: [Examples](./examples.md) | UP: [README](../README.md)
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
data/docs/options.md CHANGED
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -71,8 +72,8 @@
71
72
  | Option | Default | Explanation |
72
73
  |--------|---------|-------------|
73
74
  | `:col_sep` | `:auto` | Column separator. `:auto` detects from file content (previous default was `','`). |
74
- | `:row_sep` | `:auto` | Row / record separator. `:auto` detects from file content. Manual detection reads the whole file first (slow on large files). |
75
- | `:auto_row_sep_chars` | `500` | How many characters to analyze when using `:row_sep => :auto`. `nil` or `0` means whole file. |
75
+ | `:row_sep` | `:auto` | Row / record separator. `:auto` detects from file content by scanning in chunks of `auto_row_sep_chars` bytes, up to a 64KB hard cap. |
76
+ | `:auto_row_sep_chars` | `8192` | Chunk size used while scanning for `:row_sep => :auto`. Detection stops as soon as one separator has a clear majority, with a 64KB hard cap. Must be an Integer ≥ 8192; smaller values, `nil`, or `0` are rejected and fall back to the default with a warning. |
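The fallback rule for `:auto_row_sep_chars` in the table above can be sketched as a small validation helper. The helper name, the constant, and the warning text are hypothetical; only the threshold of 8192 and the "reject and fall back with a warning" behavior come from the documentation.

```ruby
DEFAULT_AUTO_ROW_SEP_CHARS = 8192 # documented default and minimum

# Hypothetical helper: returns a usable chunk size and records a warning
# message when the given value is rejected.
def normalize_auto_row_sep_chars(value, warnings: [])
  return value if value.is_a?(Integer) && value >= DEFAULT_AUTO_ROW_SEP_CHARS

  warnings << "auto_row_sep_chars=#{value.inspect} rejected, using #{DEFAULT_AUTO_ROW_SEP_CHARS}"
  DEFAULT_AUTO_ROW_SEP_CHARS
end

notes = []
old_default = normalize_auto_row_sep_chars(500, warnings: notes)    # previous default is now too small
whole_file  = normalize_auto_row_sep_chars(nil, warnings: notes)    # "scan whole file" no longer applies
larger      = normalize_auto_row_sep_chars(16_384, warnings: notes) # accepted as given
```

Code that still passes the old default of `500`, or `nil` / `0`, would therefore keep running under this rule but with the larger chunk size.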
76
77
 
77
78
  ### Quoting
78
79
 
@@ -142,7 +143,7 @@ See [Bad Row Quarantine](./bad_row_quarantine.md) for full details.
142
143
  | Option | Default | Explanation |
143
144
  |--------|---------|-------------|
144
145
  | `:with_line_numbers` | `false` | Add `:csv_line_number` to each result hash. |
145
- | `:verbose` | `:normal` | Controls warning and diagnostic output. Accepted values:<br>• `:quiet` — suppress all warnings and notices (recommended for production)<br>• `:normal` — show behavioral warnings, e.g. auto-configuration notices **(default)**<br>• `:debug` — `:normal` + print computed options and per-row diagnostics to stderr<br>`nil` is silently treated as `:normal`. Passing `true` or `false` still works but is deprecated — see below. |
146
+ | `:verbose` | `:normal` | Controls warning and diagnostic output. Accepted values:<br>• `:quiet` — suppress all warnings and notices (recommended for production)<br>• `:normal` — show behavioral warnings, e.g. auto-configuration notices **(default)**<br>• `:debug` — `:normal` + print computed options and per-row diagnostics to stderr<br>`nil` is silently treated as `:normal`. Passing `true` or `false` still works but is deprecated — see below. See [Warnings](./warnings.md) for the structured warning collection. |
146
147
 
147
148
  ### Instrumentation Hooks
148
149
 
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [**Real-World CSV Files**](./real_world_csv.md)
@@ -186,10 +187,14 @@ Numeric conversion is one of the most common sources of data loss. SmarterCSV co
186
187
 
187
188
  ### I/O Patterns
188
189
 
190
+ SmarterCSV accepts any IO-compatible source — file paths, open `File` handles, `StringIO`, and **non-seekable streams** like pipes, `STDIN`, and `Zlib::GzipReader`. Auto-detection of `row_sep` / `col_sep` works on streaming sources too: SmarterCSV captures the first bytes in an internal peek buffer and replays them, so the underlying source never needs to support `rewind` or `seek`. (Streaming IO support landed in 1.17.0.)
191
+
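The peek-buffer idea in the paragraph above can be sketched in plain Ruby. This is an illustration under stated assumptions, not SmarterCSV's implementation: `PeekableIO`, its `peek_size:` parameter, and the toy separator detection are hypothetical names invented for the example.

```ruby
# Hypothetical wrapper, not SmarterCSV's internal class: captures the first
# bytes of a non-seekable IO so they can be inspected and then replayed.
class PeekableIO
  def initialize(io, peek_size: 8192)
    @io = io
    @peeked = (io.read(peek_size) || '').freeze # what detection may inspect
    @pending = @peeked.dup                      # bytes still to be replayed
  end

  # Returns the captured prefix without consuming it.
  def peek
    @peeked
  end

  # Reads one line: buffered bytes are served first, then the live stream.
  # Encoding handling is omitted; strings from the buffer are binary.
  def gets(sep = "\n")
    return @io.gets(sep) if @pending.empty?

    idx = @pending.index(sep)
    return @pending.slice!(0, idx + sep.bytesize) if idx

    # The line straddles the buffer boundary: finish it byte by byte so a
    # separator split across the boundary is still recognized.
    line = @pending.slice!(0, @pending.bytesize)
    until line.end_with?(sep)
      byte = @io.read(1)
      break if byte.nil?

      line << byte
    end
    line
  end
end

reader, writer = IO.pipe # pipes support neither rewind nor seek
writer.write("a,b\r\n1,2\r\n3,4\r\n")
writer.close

source = PeekableIO.new(reader, peek_size: 4)
row_sep = source.peek.include?("\r") ? "\r\n" : "\n" # toy detection only
lines = []
while (line = source.gets(row_sep))
  lines << line
end
```

The deliberately tiny `peek_size` forces the first line to straddle the buffer boundary, which is the case a replay buffer has to get right.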
189
192
  | Source | Issue | Status | Notes |
190
193
  |--------|-------|--------|-------|
191
- | Gzipped CSV (`.csv.gz`) | Compressed file | 🔘 | Decompress and pass the resulting IO object: `SmarterCSV.process(Zlib::GzipReader.open(path))`. |
194
+ | Gzipped CSV (`.csv.gz`) | Compressed, non-seekable stream | 🔘 | `SmarterCSV.process(Zlib::GzipReader.open(path))` — no need to decompress to disk first. |
192
195
  | HTTP streaming | Parsing from a live HTTP response | 🔘 | Pass any IO-compatible object that responds to `#gets`. |
196
+ | `STDIN` / shell pipes | Non-seekable input | 🔘 | `cat data.csv \| ruby -rsmarter_csv -e 'SmarterCSV.process(STDIN) { \|h\| ... }'` |
197
+ | `IO.popen` output | Non-seekable subprocess stream | 🔘 | `IO.popen('zcat data.csv.gz') { \|io\| SmarterCSV.process(io) }` |
193
198
 
194
199
  †: Legacy Apple DB Dump and older UNIX data dumps use ASCII control characters as delimiters:
195
200
 
data/docs/row_col_sep.md CHANGED
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -30,7 +31,7 @@
30
31
 
31
32
  Convenient defaults allow automatic detection of the column and row separators: `row_sep: :auto`, `col_sep: :auto`. This makes it easier to process any CSV files without having to examine the line endings or column separators, e.g. when users upload CSV files to your service and you have no control over the incoming files.
32
33
 
33
- You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters); `nil` or `0` will check the whole file). Of course you can also set the `:row_sep` manually.
34
+ The setting `:auto_row_sep_chars` controls the chunk size used while scanning for the row separator (default is 8192). Detection reads in chunks of this size and stops as soon as one separator has a clear majority, with a 64KB hard cap. Values below 8192 (and `nil` / `0`) are rejected and fall back to the default with a warning. Of course you can also set the `:row_sep` manually.
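The chunked scan described above can be sketched as follows. This is not the gem's `guess_line_ending`; the function name, the return shape, and especially the "clear majority" rule (leader at least twice the runner-up) are assumptions made for the example, and the sketch treats end of input like reaching the cap.

```ruby
require 'stringio'

# Hypothetical sketch: returns [row_sep, warning_code_or_nil].
def guess_row_sep(io, chunk_size: 8192, hard_cap: 64 * 1024)
  seen = String.new # binary accumulator for everything read so far
  counts = { "\r\n" => 0, "\n" => 0, "\r" => 0 }

  while seen.bytesize < hard_cap
    chunk = io.read([chunk_size, hard_cap - seen.bytesize].min)
    break if chunk.nil? || chunk.empty?

    seen << chunk
    # A trailing "\r" may be the first half of "\r\n"; skip it for now.
    text = seen.end_with?("\r") ? seen[0...-1] : seen
    crlf = text.scan("\r\n").size
    counts = { "\r\n" => crlf, "\n" => text.count("\n") - crlf, "\r" => text.count("\r") - crlf }

    (best, top), (_, runner_up) = counts.sort_by { |_, n| -n }
    return [best, nil] if top.positive? && top >= 2 * runner_up # assumed rule
  end

  code = counts.values.all?(&:zero?) ? :no_row_sep_found : :no_clear_row_sep
  ["\n", code]
end

crlf_result = guess_row_sep(StringIO.new("a,b\r\n1,2\r\n"))
tie_result  = guess_row_sep(StringIO.new("a\nb\rc\nd\re"))
none_result = guess_row_sep(StringIO.new('abc'))
```

Recounting the accumulated text on every pass is wasteful in general, but the 64KB cap bounds the cost, and it avoids miscounting a `"\r\n"` that is split across two chunks.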
34
35
 
35
36
 
36
37
  ## Column Separator `col_sep`
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [**Value Converters**](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -113,6 +114,29 @@ def self.convert(value)
113
114
  end
114
115
  ```
115
116
 
117
+ ## Handling Numeric Inputs
118
+
119
+ Converters run **after** `convert_values_to_numeric`, so a field that looks like a
120
+ number (e.g. `"42"`, `"3.14"`) will already be an `Integer` or `Float` by the time
121
+ your converter sees it. If your converter expects a string, guard against this:
122
+
123
+ ```ruby
124
+ # Safe: passes already-numeric values through unchanged
125
+ dollar = ->(v) { v.is_a?(String) ? v.sub('$', '').to_f : v }
126
+
127
+ # Unsafe: raises NoMethodError on Integer/Float (no #sub)
128
+ dollar = ->(v) { v.sub('$', '').to_f }
129
+ ```
130
+
131
+ Alternatively, exclude the column from numeric conversion so the converter always
132
+ receives a string:
133
+
134
+ ```ruby
135
+ SmarterCSV.process(file,
136
+ convert_values_to_numeric: { except: [:price] },
137
+ value_converters: { price: ->(v) { v&.sub('$', '')&.to_f } })
138
+ ```
139
+
116
140
  ## Class-Based Converters
117
141
 
118
142
  For converters you want to reuse across the codebase or test independently, define a class
data/docs/warnings.md ADDED
@@ -0,0 +1,119 @@
1
+
2
+ ### Contents
3
+
4
+ * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
7
+ * [Parsing Strategy](./parsing_strategy.md)
8
+ * [The Basic Read API](./basic_read_api.md)
9
+ * [The Basic Write API](./basic_write_api.md)
10
+ * [Batch Processing](./batch_processing.md)
11
+ * [Configuration Options](./options.md)
12
+ * [Row and Column Separators](./row_col_sep.md)
13
+ * [Header Transformations](./header_transformations.md)
14
+ * [Header Validations](./header_validations.md)
15
+ * [Column Selection](./column_selection.md)
16
+ * [Data Transformations](./data_transformations.md)
17
+ * [Value Converters](./value_converters.md)
18
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [**Warnings**](./warnings.md)
20
+ * [Instrumentation Hooks](./instrumentation.md)
21
+ * [Examples](./examples.md)
22
+ * [Real-World CSV Files](./real_world_csv.md)
23
+ * [SmarterCSV over the Years](./history.md)
24
+ * [Release Notes](./releases/1.16.0/changes.md)
25
+
26
+ --------------
27
+
28
+ # Warnings
29
+
30
+ SmarterCSV records auto-detection and configuration warnings into a structured
31
+ collection on the Reader, in addition to emitting them to a log sink. This lets
32
+ you inspect warnings programmatically (e.g. surface them in dashboards, fail
33
+ deploys on unexpected codes) without parsing stderr text.
34
+
35
+ ## Accessing warnings
36
+
37
+ ### Via the Reader API
38
+
39
+ ```ruby
40
+ reader = SmarterCSV::Reader.new('data.csv')
41
+ reader.process
42
+
43
+ reader.warnings
44
+ # => [
45
+ # { type: :config, code: :chunk_size_default, severity: :warn,
46
+ # message: "chunk_size not set, defaulting to 100. ...", count: 1 },
47
+ # ...
48
+ # ]
49
+ ```
50
+
51
+ ### Via the class-level API (`SmarterCSV.warnings`)
52
+
53
+ Mirrors `SmarterCSV.errors`. Returns the warnings from the most recent call to
54
+ `process`, `parse`, `each`, or `each_chunk` on the current thread. Cleared at
55
+ the start of each new call.
56
+
57
+ ```ruby
58
+ SmarterCSV.process('data.csv')
59
+ SmarterCSV.warnings.each do |w|
60
+ logger.warn("[#{w[:type]}/#{w[:code]}] #{w[:message]} (×#{w[:count]})")
61
+ end
62
+ ```
63
+
64
+ > **Note:** `SmarterCSV.warnings` is per-thread (uses `Thread.current`). It is
65
+ > safe in multi-threaded environments (Puma, Sidekiq), but **not fiber-safe**.
66
+ > If you process CSV files concurrently in fibers (e.g. with `Async`, `Falcon`,
67
+ > or manual `Fiber` scheduling), use `SmarterCSV::Reader` directly so warnings
68
+ > are scoped to the reader instance.
69
+
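The per-thread scoping in the note above can be illustrated with a small thread-local store. The source only says the accessor uses `Thread.current`; the use of `thread_variable_get` / `thread_variable_set`, the module name, and the key are assumptions for this sketch.

```ruby
# Hypothetical store, not SmarterCSV's code: one warnings array per thread.
module WarningStore
  KEY = :example_csv_warnings # made-up key name

  def self.reset!
    Thread.current.thread_variable_set(KEY, [])
  end

  def self.add(warning)
    reset! unless Thread.current.thread_variable_get(KEY)
    Thread.current.thread_variable_get(KEY) << warning
  end

  def self.warnings
    Thread.current.thread_variable_get(KEY) || []
  end
end

# Each thread sees only what it recorded itself.
per_thread = %w[a b].map do |name|
  Thread.new do
    WarningStore.reset!
    WarningStore.add({ code: :"from_#{name}" })
    WarningStore.warnings
  end
end.map(&:value)

# Fibers on the same thread share that storage, so one can clear the other's.
WarningStore.reset!
WarningStore.add({ code: :outer })
Fiber.new { WarningStore.reset! }.resume
after_fiber = WarningStore.warnings
```

Under this assumption, a reset performed inside a fiber wipes the warnings recorded by the code that started it, which is the kind of interference a reader-scoped collection avoids.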
70
+ ## Warning record shape
71
+
72
+ | Field | Description |
73
+ |---|---|
74
+ | `type` | Coarse semantic grouping. Currently: `:config`, `:deprecation`, `:encoding`, `:row_sep`. |
75
+ | `code` | Unique identifier for the specific warning. |
76
+ | `severity` | Log level: `:debug` / `:info` / `:warn` / `:error` / `:fatal`. |
77
+ | `message` | Human-readable description. |
78
+ | `count` | Number of times this `(type, code)` was triggered during the run. |
79
+
80
+ Repeated warnings of the same `(type, code)` are deduped — `count` tracks
81
+ occurrences. The `message` is the first one emitted.
82
+
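The dedup behavior can be sketched as a hash keyed by `(type, code)`. The `WarningCollector` class below is hypothetical and only mirrors the record shape documented above; the messages are invented sample text.

```ruby
# Hypothetical collector, not SmarterCSV's internal class.
class WarningCollector
  def initialize
    @by_key = {} # insertion-ordered: [type, code] => record
  end

  # The first message for a key is kept; later ones only bump the count.
  def record(type:, code:, severity:, message:)
    entry = (@by_key[[type, code]] ||=
      { type: type, code: code, severity: severity, message: message, count: 0 })
    entry[:count] += 1
    entry
  end

  # Returns copies so callers cannot mutate the internal records.
  def to_a
    @by_key.values.map(&:dup)
  end
end

collector = WarningCollector.new
collector.record(type: :row_sep, code: :no_clear_row_sep, severity: :error, message: 'first tie')
collector.record(type: :row_sep, code: :no_clear_row_sep, severity: :error, message: 'second tie')
collector.record(type: :config, code: :chunk_size_default, severity: :warn, message: 'default used')
histogram = collector.to_a
```

Keying on the pair rather than on the message text keeps the collection small even when a warning fires once per row.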
83
+ ## Available codes
84
+
85
+ | Code | Type | Severity | Triggered when |
86
+ |---|---|---|---|
87
+ | `:chunk_size_default` | `:config` | `:warn` | `each_chunk` is called without `chunk_size:` and the default of `100` is used. |
88
+ | `:header_a_method` | `:deprecation` | `:warn` | The deprecated `Reader#headerA` accessor is called. |
89
+ | `:utf8_missing_binary_mode` | `:encoding` | `:warn` | UTF-8 input is being processed but the IO was not opened with `"b:utf-8"`. |
90
+ | `:no_clear_row_sep` | `:row_sep` | `:error` | Auto-detection found a true tie between separators after scanning 64KB. Falls back to `"\n"`, which risks a silent misparse. |
91
+ | `:no_row_sep_found` | `:row_sep` | `:error` | No known row separator was found in the first 64KB. Falls back to `"\n"`. Likely an exotic separator like `\u2028`. |
92
+
93
+ ## Log sink routing
94
+
95
+ The sink that warnings are emitted to is selected once, at Reader construction time:
96
+
97
+ * **Rails.logger present** — the warning is routed through `Rails.logger` at
98
+ the declared `severity`. `Rails.logger.warn(...)`, `Rails.logger.error(...)`,
99
+ etc.
100
+ * **No Rails.logger** — falls back to `Kernel#warn` (writes to `$stderr`).
101
+
102
+ Detection happens once, at construction time, so there is no per-call overhead.
103
+
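One-shot sink selection can be sketched with the standard library `Logger` standing in for `Rails.logger`. The `build_warning_sink` helper and its optional `logger` argument are hypothetical; they exist only so the example runs without Rails.

```ruby
require 'logger'
require 'stringio'

SEVERITIES = %i[debug info warn error fatal].freeze

# Hypothetical helper: decides the sink once and returns a callable that
# takes (severity, message). Unknown severities are logged at :warn.
def build_warning_sink(logger = nil)
  logger ||= Rails.logger if defined?(Rails) && Rails.respond_to?(:logger)

  if logger
    lambda do |severity, message|
      logger.public_send(SEVERITIES.include?(severity) ? severity : :warn, message)
    end
  else
    ->(_severity, message) { Kernel.warn(message) } # writes to $stderr
  end
end

log_output = StringIO.new
sink = build_warning_sink(Logger.new(log_output))
sink.call(:error, 'row_sep: tie between separators')
```

Because the branch is taken when the callable is built, each later call pays only for the logging itself.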
104
+ ## Suppressing warnings
105
+
106
+ Pass `verbose: :quiet` to suppress both the recording and the log emission of
107
+ all warnings. Currently this affects every code listed above.
108
+
109
+ ```ruby
110
+ SmarterCSV.process('data.csv', verbose: :quiet)
111
+ SmarterCSV.warnings # => []
112
+ ```
113
+
114
+ > ⚠️ Suppressing `:row_sep` warnings hides a genuine risk of silent misparses on
115
+ > ambiguous files. Prefer passing `row_sep:` explicitly over silencing.
116
+
117
+ ----------------
118
+
119
+ PREVIOUS: [Bad Row Quarantine](./bad_row_quarantine.md) | NEXT: [Instrumentation Hooks](./instrumentation.md) | UP: [README](../README.md)