RubyGems - smarter_csv - Versions diffs - 1.15.2 → 1.16.1 - Mend

smarter_csv 1.15.2 → 1.16.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (50)

checksums.yaml +4 -4
data/.rspec +2 -0
data/.rubocop.yml +9 -0
data/CHANGELOG.md +112 -1
data/CONTRIBUTORS.md +4 -1
data/Gemfile +1 -0
data/README.md +129 -27
data/docs/_introduction.md +45 -24
data/docs/bad_row_quarantine.md +342 -0
data/docs/basic_read_api.md +152 -9
data/docs/basic_write_api.md +475 -59
data/docs/batch_processing.md +162 -4
data/docs/column_selection.md +184 -0
data/docs/data_transformations.md +163 -29
data/docs/examples.md +340 -46
data/docs/header_transformations.md +94 -12
data/docs/header_validations.md +57 -18
data/docs/history.md +119 -0
data/docs/instrumentation.md +166 -0
data/docs/migrating_from_csv.md +565 -0
data/docs/options.md +151 -87
data/docs/parsing_strategy.md +64 -1
data/docs/real_world_csv.md +263 -0
data/docs/releases/1.16.0/benchmarks.md +223 -0
data/docs/releases/1.16.0/changes.md +273 -0
data/docs/releases/1.16.0/performance_notes.md +114 -0
data/docs/row_col_sep.md +15 -5
data/docs/ruby_csv_pitfalls.md +514 -0
data/docs/value_converters.md +194 -57
data/ext/smarter_csv/extconf.rb +3 -0
data/ext/smarter_csv/smarter_csv.c +1017 -82
data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
data/lib/smarter_csv/errors.rb +8 -0
data/lib/smarter_csv/file_io.rb +1 -1
data/lib/smarter_csv/hash_transformations.rb +14 -13
data/lib/smarter_csv/header_transformations.rb +21 -2
data/lib/smarter_csv/headers.rb +2 -1
data/lib/smarter_csv/options.rb +124 -7
data/lib/smarter_csv/parser.rb +358 -74
data/lib/smarter_csv/reader.rb +494 -46
data/lib/smarter_csv/version.rb +1 -1
data/lib/smarter_csv/writer.rb +71 -19
data/lib/smarter_csv.rb +134 -13
data/smarter_csv.gemspec +20 -10
metadata +38 -80

data/docs/bad_row_quarantine.md ADDED

@@ -0,0 +1,342 @@
+### Contents
+  * [Introduction](./_introduction.md)
+  * [Migrating from Ruby CSV](./migrating_from_csv.md)
+  * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
+  * [Parsing Strategy](./parsing_strategy.md)
+  * [The Basic Read API](./basic_read_api.md)
+  * [The Basic Write API](./basic_write_api.md)
+  * [Batch Processing](./batch_processing.md)
+  * [Configuration Options](./options.md)
+  * [Row and Column Separators](./row_col_sep.md)
+  * [Header Transformations](./header_transformations.md)
+  * [Header Validations](./header_validations.md)
+  * [Column Selection](./column_selection.md)
+  * [Data Transformations](./data_transformations.md)
+  * [Value Converters](./value_converters.md)
+  * [**Bad Row Quarantine**](./bad_row_quarantine.md)
+  * [Instrumentation Hooks](./instrumentation.md)
+  * [Examples](./examples.md)
+  * [Real-World CSV Files](./real_world_csv.md)
+  * [SmarterCSV over the Years](./history.md)
+  * [Release Notes](./releases/1.16.0/changes.md)
+--------------
+# Bad Row Quarantine
+Real-world CSV files are often malformed. By default, SmarterCSV raises an exception on the
+first bad row it encounters. The `on_bad_row` option lets you keep processing and handle bad
+rows in whatever way suits your application.
+## What counts as a bad row
+- Malformed CSV (unclosed quoted fields, unterminated multiline rows)
+- A field that exceeds `field_size_limit` (see [Limiting field size](#limiting-field-size-field_size_limit))
+- Extra columns when running in `strict: true` mode
+- Any `SmarterCSV::Error` or `EOFError` raised during row parsing
+## Options
+| Option | Default | Description |
+|--------|---------|-------------|
+| `on_bad_row` | `:raise` | How to handle a bad row: `:raise`, `:skip`, `:collect`, or a callable |
+| `collect_raw_lines` | `true` | Include `raw_logical_line` in the error record |
+| `bad_row_limit` | `nil` | Raise `SmarterCSV::TooManyBadRows` after this many bad rows |
+## Modes
+### `:raise` (default)
+Current behavior — the exception propagates and processing stops:
+```ruby
+SmarterCSV.process('data.csv')
+# => raises SmarterCSV::MalformedCSV on the first bad row
+```
+The `on_bad_row` option controls what happens when a bad row is encountered:
+* `on_bad_row: :raise` (default) fails fast.
+* `on_bad_row: :collect` quarantines bad rows; error records are available via `SmarterCSV.errors` or `reader.errors`.
+* `on_bad_row: ->(rec) { ... }` calls your lambda once per bad row; this works with both `SmarterCSV.process` and `SmarterCSV::Reader`.
+* `on_bad_row: :skip` discards bad rows silently; the count is available via `SmarterCSV.errors` or `reader.errors`.
+### `:collect`
+Continue processing and store a structured error record for each bad row.
+Error records are available via `SmarterCSV.errors[:bad_rows]` (class-level API)
+or `reader.errors[:bad_rows]` (Reader API).
+```ruby
+# Class-level API — use SmarterCSV.errors after the call
+good_rows = SmarterCSV.process('data.csv', on_bad_row: :collect)
+good_rows.each { |row| MyModel.create!(row) }
+SmarterCSV.errors[:bad_rows].each do |rec|
+  Rails.logger.warn "Bad row at line #{rec[:csv_line_number]}: #{rec[:error_message]}"
+  Rails.logger.warn "Raw content: #{rec[:raw_logical_line]}"
+end
+```
+```ruby
+# Reader API — use when you also need access to headers or other reader state
+reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
+result = reader.process
+result.each { |row| MyModel.create!(row) }
+reader.errors[:bad_rows].each do |rec|
+  Rails.logger.warn "Bad row at line #{rec[:csv_line_number]}: #{rec[:error_message]}"
+  Rails.logger.warn "Raw content: #{rec[:raw_logical_line]}"
+end
+```
+### Callable (lambda / proc)
+Pass any object that responds to `#call`. It is invoked once per bad row with the
+error record hash, then processing continues. Because the lambda receives errors
+inline, **this works with both `SmarterCSV.process` and `SmarterCSV::Reader`** —
+you do not need a `Reader` instance to handle bad rows.
+```ruby
+# Works with SmarterCSV.process — no Reader instance needed
+bad_rows = []
+good_rows = SmarterCSV.process('data.csv',
+  on_bad_row: ->(rec) { bad_rows << rec })
+```
+```ruby
+# Log to a dead-letter file
+quarantine = File.open('quarantine.csv', 'w')
+SmarterCSV.process('data.csv',
+  on_bad_row: ->(rec) { quarantine.puts(rec[:raw_logical_line]) })
+quarantine.close
+```
+```ruby
+# Send to a monitoring system
+SmarterCSV.process('data.csv',
+  on_bad_row: ->(rec) { Metrics.increment('csv.bad_rows', tags: { error: rec[:error_class].name }) })
+```
+### `:skip`
+Silently skip bad rows and continue. The count of skipped rows is available via
+`SmarterCSV.errors[:bad_row_count]` (class-level API) or `reader.errors[:bad_row_count]`
+(Reader API). No error records are stored.
+```ruby
+# Class-level API — use SmarterCSV.errors after the call
+SmarterCSV.process('data.csv', on_bad_row: :skip)
+puts "Skipped: #{SmarterCSV.errors[:bad_row_count] || 0} bad rows"
+```
+```ruby
+# Reader API — access reader.errors directly
+reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :skip)
+result = reader.process
+puts "Processed: #{result.size} good rows"
+puts "Skipped:   #{reader.errors[:bad_row_count] || 0} bad rows"
+```
+## Error record structure
+Each error record is a Hash:
+```ruby
+{
+  csv_line_number:     3,                               # logical row (counting header as row 1)
+  file_line_number:    3,                               # physical file line where the row started
+  file_lines_consumed: 1,                               # physical lines spanned (>1 for multiline)
+  error_class:         SmarterCSV::HeaderSizeMismatch,  # exception class object
+  error_message:       "extra columns detected ...",    # exception message string
+  raw_logical_line:    "Jane,25,Boston,EXTRA_DATA\n",   # present when collect_raw_lines: true (default)
+}
+```
+### `collect_raw_lines`
+`collect_raw_lines: true` (default) — `raw_logical_line` is always included in the error
+record. Set to `false` if you want to reduce memory usage and don't need the raw content:
+```ruby
+reader = SmarterCSV::Reader.new('data.csv',
+  on_bad_row: :collect,
+  collect_raw_lines: false,
+)
+```
+For multiline rows (quoted fields spanning several physical lines), `raw_logical_line` contains
+the fully stitched content — it may include embedded newline characters. The
+`file_lines_consumed` field tells you how many physical lines were read.
+## Limiting bad rows with `bad_row_limit`
+To abort processing after too many failures, set `bad_row_limit`. This works with `:skip`,
+`:collect`, and callable modes:
+```ruby
+reader = SmarterCSV::Reader.new('data.csv',
+  on_bad_row: :collect,
+  bad_row_limit: 10,
+)
+begin
+  result = reader.process
+rescue SmarterCSV::TooManyBadRows => e
+  puts "Aborting: #{e.message}"
+  puts "Collected so far: #{reader.errors[:bad_rows].size} bad rows"
+end
+```
+## Accessing errors
+There are two ways to access bad row data after processing:
+### Via `SmarterCSV.errors` (class-level API)
+`SmarterCSV.errors` returns the errors from the most recent call to `process`, `parse`,
+`each`, or `each_chunk` on the current thread. It is cleared at the start of each new call.
+```ruby
+SmarterCSV.process('data.csv', on_bad_row: :skip)
+puts SmarterCSV.errors[:bad_row_count]   # => 3
+SmarterCSV.process('data.csv', on_bad_row: :collect)
+puts SmarterCSV.errors[:bad_row_count]   # => 3
+puts SmarterCSV.errors[:bad_rows].size   # => 3
+```
+> **Note:** `SmarterCSV.errors` only surfaces errors from the **most recent run on the
+> current thread**. In a multi-threaded environment (Puma, Sidekiq), each thread maintains
+> its own error state independently. If you call `SmarterCSV.process` twice in the same
+> thread, the second call's errors replace the first's. For long-running or complex
+> pipelines where you need to aggregate errors across multiple files, use the Reader API.
+>
+> ⚠️ **Fibers:** `SmarterCSV.errors` uses `Thread.current` for storage, which is **shared
+> across all fibers running in the same thread**. If you process CSV files concurrently
+> in fibers (e.g. with `Async`, `Falcon`, or manual `Fiber` scheduling), `SmarterCSV.errors`
+> may return stale or wrong results. **Use `SmarterCSV::Reader` directly** — errors are
+> scoped to the reader instance and are always correct regardless of fiber context.
+### Via `reader.errors` (Reader API)
+For full control — including access to headers, raw headers, and errors from a specific
+call — use `SmarterCSV::Reader` directly:
+| Attribute | Description |
+|-----------|-------------|
+| `reader.errors[:bad_row_count]` | Total bad rows encountered (all modes) |
+| `reader.errors[:bad_rows]` | Array of error records (`:collect` mode only) |
+```ruby
+reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
+reader.process
+puts reader.errors[:bad_row_count]
+puts reader.headers.inspect
+```
+## Chunked processing
+Bad row quarantine works seamlessly with `chunk_size`. Skipped rows are simply not added to the
+current chunk — chunk sizes remain consistent:
+```ruby
+reader = SmarterCSV::Reader.new('large_file.csv',
+  chunk_size: 500,
+  on_bad_row: :collect,
+)
+reader.process do |chunk, index|
+  MyModel.import(chunk)
+end
+puts "Bad rows: #{reader.errors[:bad_row_count]}"
+```
+## Limiting field size: `field_size_limit`
+Real-world CSV files sometimes contain unexpectedly large fields — either intentionally
+(a DoS attempt) or accidentally (a forgotten closing quote, a JSON blob in a cell, a notes
+field that ran away). Without a limit, SmarterCSV will happily stitch together physical lines
+until it either finds the closing quote or reaches end-of-file, potentially consuming hundreds
+of megabytes.
+`field_size_limit` sets a hard cap (in bytes) on the size of any individual extracted field.
+The default is `nil` (no limit). When a field exceeds the limit a
+`SmarterCSV::FieldSizeLimitExceeded` exception is raised — and because it inherits from
+`SmarterCSV::Error`, the `on_bad_row` option handles it exactly like any other parse error.
+### The three cases it prevents
+**1. Huge inline field** — a single-line field containing a large payload (e.g. a JSON blob,
+a base64-encoded file, or a runaway notes column):
+```csv
+id,payload
+1,"{... 500 KB of JSON ...}"
+```
+**2. Quoted field spanning many embedded newlines** — a legitimate multiline field in a
+poorly exported file that happens to be enormous:
+```csv
+ticket_id,notes
+42,"Customer wrote:
+... (thousands of lines of chat history) ...
+"
+```
+**3. Never-closing quoted field** — a missing closing quote causes the parser to stitch every
+subsequent physical line into one logical row until EOF:
+```csv
+id,comment
+1,"this quote never closes
+2,this entire row is now inside the field
+3,and this one too ...
+```
+Without `field_size_limit`, case 3 reads the entire rest of the file into memory. With the
+limit set, the stitch loop raises `FieldSizeLimitExceeded` as soon as the accumulating buffer
+crosses the threshold.
+### Usage
+```ruby
+# Raise immediately on any oversized field (default on_bad_row: :raise)
+SmarterCSV.process('data.csv', field_size_limit: 1_000_000)  # 1 MB per field
+# Skip oversized rows and continue
+SmarterCSV.process('data.csv', field_size_limit: 1_000_000, on_bad_row: :skip)
+# Collect oversized rows for inspection
+reader = SmarterCSV::Reader.new('data.csv',
+  field_size_limit: 1_000_000,
+  on_bad_row: :collect,
+)
+result = reader.process
+reader.errors[:bad_rows].each do |rec|
+  Rails.logger.warn "Oversized field on row #{rec[:csv_line_number]}: #{rec[:error_message]}"
+end
+```
+### What "bytes" means here
+The limit is checked against `String#bytesize` (raw byte count), not character count.
+For ASCII content they are identical. For multi-byte UTF-8 content (e.g. CJK characters)
+bytesize is larger than the character count — so the limit is a memory cap, not a
+character cap, which is what matters for DoS protection.
+### Performance
+`field_size_limit` is zero-overhead when not set (the default `nil` short-circuits all
+checks). When set, a single integer comparison is performed per logical row; the per-field
+scan only runs when the raw line is large enough to potentially contain an oversized field.
+Normal rows (where the entire line fits within the limit) bypass per-field checking entirely.
+--------------------
+PREVIOUS: [Value Converters](./value_converters.md) | NEXT: [Instrumentation Hooks](./instrumentation.md) | UP: [README](../README.md)
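
Note on the `field_size_limit` section above: the byte-versus-character distinction can be checked with plain Ruby. The following is a minimal sketch using only core `String` methods. The helper `oversized?` is hypothetical and is not part of SmarterCSV; it only mirrors the idea of a `String#bytesize`-based check.

```ruby
# Hypothetical helper (not SmarterCSV API): true when a field's raw byte
# count exceeds the limit, mirroring a String#bytesize-based check.
def oversized?(field, limit)
  field.bytesize > limit
end

ascii = "hello"                 # 5 characters, 5 bytes
cjk   = "\u65E5\u672C\u8A9E"    # three CJK characters, 3 bytes each in UTF-8

puts "#{ascii.length} chars / #{ascii.bytesize} bytes"
puts "#{cjk.length} chars / #{cjk.bytesize} bytes"
puts oversized?(ascii, 5)       # false: exactly at a 5-byte limit
puts oversized?(cjk, 5)         # true: only 3 characters, but 9 bytes
```

A limit expressed in bytes therefore bounds memory use directly, whereas a character-based limit would admit fields up to several times larger for multi-byte content.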

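Note on the Fibers warning in `bad_row_quarantine.md`: Ruby has two kinds of per-thread storage, and they behave differently under fibers. `Thread#[]` is fiber-local, while `Thread#thread_variable_get` / `thread_variable_set` is shared by all fibers on the thread. The warning describes the shared behaviour. Which mechanism SmarterCSV uses is not shown in this diff, so the sketch below only demonstrates the Ruby semantics, with made-up key names.

```ruby
# Fiber-local storage: Thread#[] / Thread#[]= (each fiber sees its own value).
Thread.current[:demo_fiber_local] = :set_in_root_fiber

# Thread-local storage: shared by every fiber running on the same thread.
Thread.current.thread_variable_set(:demo_thread_local, :set_in_root_fiber)

# Read both values from inside a new fiber on the same thread.
seen_in_fiber = Fiber.new do
  [
    Thread.current[:demo_fiber_local],
    Thread.current.thread_variable_get(:demo_thread_local)
  ]
end.resume

p seen_in_fiber   # => [nil, :set_in_root_fiber]
```

State that is visible across fibers can be overwritten by another fiber between a `process` call and the later read of its errors, which is why keeping errors on a `Reader` instance avoids the problem.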
data/docs/basic_read_api.md CHANGED

@@ -2,6 +2,8 @@
 ### Contents
   * [Introduction](./_introduction.md)
+  * [Migrating from Ruby CSV](./migrating_from_csv.md)
+  * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
   * [Parsing Strategy](./parsing_strategy.md)
   * [**The Basic Read API**](./basic_read_api.md)
   * [The Basic Write API](./basic_write_api.md)
@@ -10,10 +12,17 @@
   * [Row and Column Separators](./row_col_sep.md)
   * [Header Transformations](./header_transformations.md)
   * [Header Validations](./header_validations.md)
+  * [Column Selection](./column_selection.md)
   * [Data Transformations](./data_transformations.md)
   * [Value Converters](./value_converters.md)
---------------
+  * [Bad Row Quarantine](./bad_row_quarantine.md)
+  * [Instrumentation Hooks](./instrumentation.md)
+  * [Examples](./examples.md)
+  * [Real-World CSV Files](./real_world_csv.md)
+  * [SmarterCSV over the Years](./history.md)
+  * [Release Notes](./releases/1.16.0/changes.md)
+--------------
 # SmarterCSV Basic API
@@ -22,7 +31,7 @@ Let's explore the basic APIs for reading and writing CSV files. There is a simpl
 ## Reading CSV
 SmarterCSV has convenient defaults for automatically detecting row and column separators based on the given data. This provides more robust parsing of input files when you have no control over the data, e.g. when users upload CSV files.
-Learn more about this [in this section](docs/examples/row_col_sep.md).
+Learn more about this [in this section](./row_col_sep.md).
 ### Simplified Interface
@@ -32,11 +41,23 @@ The simplified call to read CSV files is:
          array_of_hashes = SmarterCSV.process(file_or_input, options)
       ```
+To parse a CSV **string** directly (no file needed), use `SmarterCSV.parse`:
+      ```
+         array_of_hashes = SmarterCSV.parse(csv_string, options)
+      ```
+This is equivalent to `SmarterCSV.process(StringIO.new(csv_string), options)` and is the
+idiomatic replacement for `CSV.parse(str, headers: true, header_converters: :symbol)`.
+See [Migrating from Ruby CSV](./migrating_from_csv.md) for a full comparison.
 It can also be used with a block. The block always receives an array of hashes and an optional chunk index:
       ```
          SmarterCSV.process(file_or_input, options) do |array_of_hashes|
-           # without chunk_size, each yield conatins a one-element array (one row)
+           # without chunk_size, each yield contains a one-element array (one row)
          end
       ```
@@ -81,11 +102,133 @@ It can also be used with a block. The block always receives an array of hashes a
 This allows you access to the internal state of the `reader` instance after processing.
+## Modern Enumerator API — `each`
+`Reader#each` is the modern, idiomatic way to read CSV rows one at a time. It always yields a single `Hash` per row, and because `Reader` includes `Enumerable`, every standard Ruby enumerable method works out of the box.
+### Simplified form
+```ruby
+SmarterCSV.each('data.csv', options) do |hash|
+  MyModel.upsert(hash)
+end
+```
+### Full form (recommended — retains reader state after processing)
+```ruby
+reader = SmarterCSV::Reader.new('data.csv', options)
+reader.each do |hash|
+  MyModel.upsert(hash)
+end
+puts reader.headers       # accessible after processing
+puts reader.errors.inspect
+```
+### Returns an Enumerator when called without a block
+```ruby
+enum = SmarterCSV.each('data.csv', options)
+enum.to_a   # => [{ name: "Alice", ... }, { name: "Bob", ... }, ...]
+```
+### Enumerable methods work directly
+Because `Reader` includes `Enumerable`, all standard Ruby enumerable methods work:
+```ruby
+reader = SmarterCSV::Reader.new('data.csv', options)
+# Filter rows
+us_users = reader.select { |h| h[:country] == 'US' }
+# Transform
+names = reader.map { |h| h[:name] }
+# Count good rows
+reader.count
+# Row index (0-based count of successfully parsed rows, excluding bad rows)
+reader.each_with_index do |hash, i|
+  puts "Row #{i}: #{hash[:name]}"
+end
+# Free chunking via Enumerable — no chunk_size needed
+reader.each_slice(100) do |batch|
+  MyModel.insert_all(batch)
+end
+```
+### Lazy evaluation
+`lazy` lets you stop early without reading the entire file:
+```ruby
+# Read only the first 10 rows matching a condition
+reader = SmarterCSV::Reader.new('big.csv', options)
+result = reader.lazy.select { |h| h[:status] == 'active' }.first(10)
+```
+### `each` ignores `chunk_size`
+If `chunk_size` is set in options, `each` ignores it and always yields individual `Hash` objects. Use [`each_chunk`](./batch_processing.md) for chunked batch processing.
+### Interaction with `on_bad_row`
+`each` respects all `on_bad_row` options. Bad rows are skipped (or routed to your handler) and never yielded:
+```ruby
+reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
+reader.each { |hash| MyModel.upsert(hash) }
+reader.errors[:bad_rows].each { |rec| puts "Bad row: #{rec[:error_message]}" }
+```
+---
+## Value Transformation Pipeline
+After each row is parsed, SmarterCSV applies transformations to field values in this order:
+| Step | Option | Default | Description |
+|------|--------|---------|-------------|
+| 1 | `strip_whitespace` | `true` | Strips leading/trailing whitespace from all values (and headers) at parse time |
+| 2 | `nil_values_matching` | `nil` | Sets values matching the regexp to `nil` |
+| 3 | `remove_empty_values` | `true` | Removes keys whose value is `nil` or blank |
+| 4 | `remove_zero_values` | `false` | Removes keys whose value is numeric zero |
+| 5 | `convert_values_to_numeric` | `true` | Converts numeric-looking strings to `Integer` or `Float` |
+| 6 | `value_converters` | `nil` | Applies per-key custom converter lambdas or classes |
+| 7 | `remove_empty_hashes` | `true` | Drops rows that are entirely empty after all transformations |
+> Steps 2–6 run per field, in that order, for every key/value pair in the row.
+> `value_converters` receive the value **after** numeric conversion — guard against `Integer`/`Float` input if needed.
+See [Data Transformations](./data_transformations.md) and [Value Converters](./value_converters.md) for details.
+---
+## Header Transformation Pipeline
+Before any data rows are processed, the header line passes through these steps:
+```
+comment_regexp → strip_chars_from_headers → split on col_sep → strip quote_char
+    → strip_whitespace → [gsub spaces/dashes→_ → downcase_header]
+    → disambiguate_headers → symbolize → key_mapping
+```
+`user_provided_headers` bypasses the file header and all transformation steps — your array is used as-is.
+See [Header Transformations](./header_transformations.md) for the full step-by-step table and options.
+---
 ## Rescue from Exceptions
 While SmarterCSV uses sensible defaults to process the most common CSV files, it will raise exceptions if it can not auto-detect `col_sep`, `row_sep`, or if it encounters other problems. Therefore please rescue from `SmarterCSV::Error`, and handle outliers according to your requirements.
-If you encounter unusual CSV files, please follow the tips in the Troubleshooting section below. You can use the options below to accomodate for unusual formats.
+If you encounter unusual CSV files, please follow the tips in the Troubleshooting section below. You can use the options below to accommodate unusual formats.
 ## Troubleshooting
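
Note on the "Modern Enumerator API" text in the hunk above: an `each` that yields one `Hash` per row and returns an `Enumerator` when called without a block is enough for `select`, `map`, `each_slice`, and `lazy` to work. The sketch below shows this without SmarterCSV. It assumes a hypothetical `TinyRowReader` class and a naive comma split that ignores quoting; it is not the gem's implementation.

```ruby
require 'stringio'

# Hypothetical stand-in for a row reader; NOT SmarterCSV's implementation.
class TinyRowReader
  include Enumerable

  def initialize(io)
    @io = io
  end

  # Yields one Hash per data row; returns an Enumerator when no block is given.
  def each
    return enum_for(:each) unless block_given?

    @io.rewind
    headers = @io.gets.chomp.split(',').map(&:to_sym)
    @io.each_line do |line|
      # Naive split: no quoting or escaping support, for illustration only.
      yield headers.zip(line.chomp.split(',')).to_h
    end
  end
end

reader = TinyRowReader.new(StringIO.new("name,country\nAlice,US\nBob,DE\nCarol,US\n"))

us_names = reader.select { |h| h[:country] == 'US' }.map { |h| h[:name] }
first_de = reader.lazy.select { |h| h[:country] == 'DE' }.first(1) # stops at first match
batches  = reader.each_slice(2).to_a                               # chunking via Enumerable

p us_names
p first_de
p batches.map(&:size)
```

Because everything is derived from `each`, each enumerable call here re-reads the input from the start; a reader over a large file would pay that cost per call.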
@@ -102,9 +245,8 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
 ## Assumptions / Limitations
-* the escape character is `\`, as on UNIX and Windows systems.
-* quote charcters around fields are balanced, e.g. valid: `"field"`, invalid: `"field\"`
-  e.g. an escaped `quote_char` does not denote the end of a field.
+* By default, quote escaping uses `:auto` mode — SmarterCSV tries backslash-escape (`\"`) first and falls back to RFC 4180 doubled-quotes (`""`). Use `quote_escaping: :double_quotes` or `:backslash` to fix the mode explicitly. See [Parsing Strategy](./parsing_strategy.md).
+* Quote characters around fields are expected to be balanced, e.g. valid: `"field"`, invalid: `"field\"`  — an escaped `quote_char` does not denote the end of a field.
 ## NOTES about File Encodings:
@@ -125,4 +267,5 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
 ```
 ----------------
-PREVIOUS: [Parsing Strategy](./parsing_strategy.md) | NEXT: [The Basic Write API](./basic_write_api.md)
+PREVIOUS: [Parsing Strategy](./parsing_strategy.md) | NEXT: [The Basic Write API](./basic_write_api.md) | UP: [README](../README.md)
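
Note on the Value Transformation Pipeline in `basic_read_api.md`: the docs state that `value_converters` receive values after numeric conversion, so a converter may be handed an `Integer` or `Float` instead of a `String`. The sketch below shows one way to guard a converter lambda against that. The `to_cents` converter and the price format are hypothetical, and the lambda is called directly here rather than through SmarterCSV.

```ruby
# Hypothetical converter: normalizes a price to integer cents.
# It accepts Integer/Float because numeric-looking strings may already
# have been converted before the converter runs.
to_cents = lambda do |value|
  case value
  when Integer then value * 100
  when Float   then (value * 100).round
  when String  then (Float(value.delete('$,')) * 100).round # raises on non-numeric text
  else value   # nil or anything unexpected passes through unchanged
  end
end

p to_cents.call(12)            # => 1200
p to_cents.call(12.5)          # => 1250
p to_cents.call('$1,234.50')   # => 123450
```

Per the pipeline table, a lambda like this would be supplied per key through the `value_converters` option; the exact wiring is described in the Value Converters document rather than here.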