smarter_csv 1.15.2 → 1.16.0

Files changed (48)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +9 -0
  3. data/CHANGELOG.md +68 -1
  4. data/CONTRIBUTORS.md +3 -1
  5. data/Gemfile +1 -0
  6. data/README.md +123 -27
  7. data/docs/_introduction.md +40 -24
  8. data/docs/bad_row_quarantine.md +285 -0
  9. data/docs/basic_read_api.md +151 -9
  10. data/docs/basic_write_api.md +474 -59
  11. data/docs/batch_processing.md +161 -4
  12. data/docs/column_selection.md +183 -0
  13. data/docs/data_transformations.md +162 -29
  14. data/docs/examples.md +339 -46
  15. data/docs/header_transformations.md +93 -12
  16. data/docs/header_validations.md +56 -18
  17. data/docs/history.md +117 -0
  18. data/docs/instrumentation.md +165 -0
  19. data/docs/migrating_from_csv.md +290 -0
  20. data/docs/options.md +150 -87
  21. data/docs/parsing_strategy.md +63 -1
  22. data/docs/real_world_csv.md +262 -0
  23. data/docs/releases/1.16.0/benchmarks.md +223 -0
  24. data/docs/releases/1.16.0/changes.md +272 -0
  25. data/docs/releases/1.16.0/performance_notes.md +114 -0
  26. data/docs/row_col_sep.md +14 -5
  27. data/docs/value_converters.md +193 -57
  28. data/ext/smarter_csv/extconf.rb +3 -0
  29. data/ext/smarter_csv/smarter_csv.c +1007 -71
  30. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
  31. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
  32. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
  33. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
  34. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
  35. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
  36. data/lib/smarter_csv/errors.rb +8 -0
  37. data/lib/smarter_csv/file_io.rb +1 -1
  38. data/lib/smarter_csv/hash_transformations.rb +14 -13
  39. data/lib/smarter_csv/header_transformations.rb +21 -2
  40. data/lib/smarter_csv/headers.rb +2 -1
  41. data/lib/smarter_csv/options.rb +124 -7
  42. data/lib/smarter_csv/parser.rb +362 -75
  43. data/lib/smarter_csv/reader.rb +494 -46
  44. data/lib/smarter_csv/version.rb +1 -1
  45. data/lib/smarter_csv/writer.rb +71 -19
  46. data/lib/smarter_csv.rb +95 -12
  47. data/smarter_csv.gemspec +20 -10
  48. metadata +37 -80
data/docs/options.md CHANGED
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
5
6
  * [Parsing Strategy](./parsing_strategy.md)
6
7
  * [The Basic Read API](./basic_read_api.md)
7
8
  * [The Basic Write API](./basic_write_api.md)
@@ -10,8 +11,15 @@
10
11
  * [Row and Column Separators](./row_col_sep.md)
11
12
  * [Header Transformations](./header_transformations.md)
12
13
  * [Header Validations](./header_validations.md)
14
+ * [Column Selection](./column_selection.md)
13
15
  * [Data Transformations](./data_transformations.md)
14
16
  * [Value Converters](./value_converters.md)
17
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
18
+ * [Instrumentation Hooks](./instrumentation.md)
19
+ * [Examples](./examples.md)
20
+ * [Real-World CSV Files](./real_world_csv.md)
21
+ * [SmarterCSV over the Years](./history.md)
22
+ * [Release Notes](./releases/1.16.0/changes.md)
15
23
 
16
24
  --------------
17
25
 
@@ -19,96 +27,151 @@
19
27
 
20
28
  ## CSV Writing
21
29
 
22
- | Option | Default | Explanation |
23
- ---------------------------------------------------------------------------------------------------------------------------------
24
- | :row_sep | $/ | Separates rows; Defaults to your OS row separator. `/n` on UNIX, `/r/n` oon Windows |
25
- | :col_sep | "," | Separates each value in a row |
26
- | :quote_char | '"' | To quote CSV fields. |
27
- | :force_quotes | false | Forces each individual value to be quoted |
28
- | :headers | [] | You can provide the specific list of keys from the input you'd like to be used as headers in the CSV file |
29
- | | | ⚠️ This disables automatic header detection! |
30
- | :map_headers | {} | Similar to `headers`, but also maps each desired key to a user-specified value that is uesd as the header. |
31
- | | | ⚠️ This disables automatic header detection! |
32
- | :value_converters | nil | allows to define lambdas to programmatically modify values |
33
- | | | * either for specific `key` names |
34
- | | | * or using `_all` for all fields |
35
- | :header_converter | nil | allows to define one lambda to programmatically modify the headers |
36
- | :discover_headers | true | Automatically detects all keys in the input before writing the header |
37
- | | | Do not manually set this to `false`. ⚠️ |
38
- | | | But you can set this to `true` when using `map_headers` option. |
39
- | :disable_auto_quoting | false | To manually disable auto-quoting of special characters. ⚠️ Be careful with this! |
40
- | :quote_headers | false | To force quoting all headers (only needed in rare cases) |
30
+ | Option | Default | Explanation |
31
+ |--------|---------|-------------|
32
+ | `:row_sep` | `$/` | Separates rows. Defaults to your OS row separator: `\n` on UNIX, `\r\n` on Windows. |
33
+ | `:col_sep` | `","` | Separates each value in a row. |
34
+ | `:quote_char` | `'"'` | Character used to quote CSV fields. |
35
+ | `:force_quotes` | `false` | Forces each individual value to be quoted. |
36
+ | `:headers` | `[]` | List of keys from the input to use as headers in the CSV file. ⚠️ Disables automatic header detection! |
37
+ | `:map_headers` | `{}` | Like `:headers`, but also maps each key to a user-specified header value. ⚠️ Disables automatic header detection! |
38
+ | `:value_converters` | `nil` | Lambdas to programmatically modify values either for specific key names, or using `_all` for all fields. |
39
+ | `:header_converter` | `nil` | One lambda to programmatically modify the headers. |
40
+ | `:discover_headers` | `true` | Automatically detects all keys in the input before writing the header. Do not set to `false` manually. ⚠️ |
41
+ | `:disable_auto_quoting` | `false` | Manually disables auto-quoting of special characters. ⚠️ Use with care! |
42
+ | `:quote_headers` | `false` | Force quoting all headers (only needed in rare cases). |
43
+ | `:encoding` | `nil` | File encoding passed to `File.open` when writing to a path (e.g. `'UTF-8'`, `'ISO-8859-1'`). Supports Ruby's `'external:internal'` transcoding notation (e.g. `'ISO-8859-1:UTF-8'`) to automatically transcode UTF-8 strings into the target encoding. `nil` uses the system default. Ignored when an IO object is passed directly. |
44
+ | `:write_nil_value` | `''` | String written in place of `nil` field values. E.g. `write_nil_value: 'N/A'`. |
45
+ | `:write_empty_value` | `''` | String written in place of empty-string field values, including missing keys. E.g. `write_empty_value: 'EMPTY'`. |
46
+ | `:write_bom` | `false` | Prepends a UTF-8 BOM (`\xEF\xBB\xBF`) to the output. Use with `encoding: 'UTF-8'` for Excel compatibility. |
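The three new writer options in the table above (`:write_nil_value`, `:write_empty_value`, `:write_bom`) can be pictured with a small stand-alone model. This is only a sketch of the documented behavior, not SmarterCSV's writer: `render_value` and `render_csv` are hypothetical names, and quoting, `:col_sep`, and `:row_sep` handling are deliberately left out.

```ruby
# Hypothetical model of three writer options; not SmarterCSV's implementation.
# Quoting and configurable separators are intentionally omitted.
def render_value(row, key, nil_value, empty_value)
  return empty_value unless row.key?(key) # a missing key is treated like an empty string

  value = row[key]
  return nil_value if value.nil?
  return empty_value if value == ""

  value.to_s
end

def render_csv(rows, headers, write_nil_value: "", write_empty_value: "", write_bom: false)
  lines = [headers.join(",")]
  rows.each do |row|
    lines << headers.map { |key| render_value(row, key, write_nil_value, write_empty_value) }.join(",")
  end
  body = lines.join("\n") + "\n"
  # The UTF-8 byte order mark is the three bytes EF BB BF at the very start of the output.
  write_bom ? "\xEF\xBB\xBF" + body : body
end

rows = [{ name: "a", qty: nil }, { name: "" }]
output = render_csv(rows, [:name, :qty],
                    write_nil_value: "N/A", write_empty_value: "EMPTY", write_bom: true)
puts output
```

In this model the second row has no `:qty` key at all, so it receives the `write_empty_value` placeholder rather than the `write_nil_value` one, which is how the table describes missing keys.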
41
47
 
42
48
 
43
49
  ## CSV Reading
44
50
 
45
- | Option | Default | Explanation |
46
- ---------------------------------------------------------------------------------------------------------------------------------
47
- | :chunk_size | nil | if set, determines the desired chunk-size (defaults to nil, no chunk processing) |
48
- | | | |
49
- | :file_encoding | utf-8 | Set the file encoding eg.: 'windows-1252' or 'iso-8859-1' |
50
- | :invalid_byte_sequence | '' | what to replace invalid byte sequences with |
51
- | :force_utf8 | false | force UTF-8 encoding of all lines (including headers) in the CSV file |
52
- | :skip_lines | nil | how many lines to skip before the first line or header line is processed |
53
- | :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ |
54
- ---------------------------------------------------------------------------------------------------------------------------------
55
- | :col_sep | :auto | column separator (default was ',') |
56
- | :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
57
- | | | This can also be set to :auto, but will process the whole cvs file first (slow!) |
58
- | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
59
- | :quote_char | '"' | quotation character |
60
- | :quote_escaping | :auto | How quotes are escaped inside quoted fields. See [Parsing Strategy](./parsing_strategy.md). |
61
- | | | `:auto` (default): tries backslash-escape first, falls back to RFC 4180. |
62
- | | | `:double_quotes` (RFC 4180): only `""` escapes a quote. Backslash is literal. |
63
- | | | `:backslash` (MySQL/Unix): `\"` also escapes a quote. |
64
- ---------------------------------------------------------------------------------------------------------------------------------
65
- | :headers_in_file | true(1) | Whether or not the file contains headers as the first line. |
66
- | | | (1): if `user_provided_headers` is given, the default is `false`, |
67
- | | | unless you specify it to be explicitly `true`. |
68
- | | | This prevents losing the first line of data, which is otherwise assumed to be a header. |
69
- | :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
70
- | | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
71
- | :user_provided_headers | nil | *careful with that axe!* |
72
- | | | user provided Array of header strings or symbols, to define |
73
- | | | what headers should be used, overriding any in-file headers. |
74
- | | | You can not combine the :user_provided_headers and :key_mapping options |
75
- | :remove_empty_hashes | true | remove / ignore any hashes which don't have any key/value pairs or all empty values |
76
- | :verbose | false | print out line number while processing (to track down problems in input files) |
77
- | :with_line_numbers | false | add :csv_line_number to each data hash |
78
- | :missing_header_prefix | column_ | can be set to a string of your liking |
79
- | :strict | false | When set to `true`, extra columns will raise MalformedCSV exception |
80
- ---------------------------------------------------------------------------------------------------------------------------------
81
-
82
- Additional 1.x Options which may be replaced in 2.0
83
-
84
- There have been a lot of 1-offs and feature creep around these options, and going forward we'll strive to have a simpler, but more flexible way to address these features.
85
-
86
-
87
- | Option | Default | Explanation |
88
- ---------------------------------------------------------------------------------------------------------------------------------
89
- | :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
90
- | :silence_missing_keys | false | ignore missing keys in `key_mapping` |
91
- | | | if set to true: makes all mapped keys optional |
92
- | | | if given an array, makes only the keys listed in it optional |
93
- | :required_keys | nil | An array. Specify the required names AFTER header transformation. |
94
- | :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
95
- | | | or an exception is raised No validation if nil is given. |
96
- | :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? |
97
- | :downcase_header | true | downcase all column headers |
98
- | :strings_as_keys | false | use strings instead of symbols as the keys in the result hashes |
99
- | :strip_whitespace | true | remove whitespace before/after values and headers |
100
- | :keep_original_headers | false | keep the original headers from the CSV-file as-is. |
101
- | | | Disables other flags manipulating the header fields. |
102
- | :strip_chars_from_headers | nil | RegExp to remove extraneous characters from the header line (e.g. if headers are quoted) |
103
- ---------------------------------------------------------------------------------------------------------------------------------
104
- | :value_converters | nil | supply a hash of :header => KlassName; the class needs to implement self.convert(val)|
105
- | :remove_empty_values | true | remove values which have nil or empty strings as values |
106
- | :remove_zero_values | false | remove values which have a numeric value equal to zero / 0 |
107
- | :remove_values_matching | nil | removes key/value pairs if value matches given regular expressions. e.g.: |
108
- | | | /^\$0\.0+$/ to match $0.00 , or /^#VALUE!$/ to match errors in Excel spreadsheets |
109
- | :convert_values_to_numeric | true | converts strings containing Integers or Floats to the appropriate class |
110
- | | | also accepts either {:except => [:key1,:key2]} or {:only => :key3} |
111
- ---------------------------------------------------------------------------------------------------------------------------------
51
+ ### File Input & Encoding
52
+
53
+ | Option | Default | Explanation |
54
+ |--------|---------|-------------|
55
+ | `:file_encoding` | `utf-8` | Set the file encoding, e.g. `'windows-1252'` or `'iso-8859-1'`. |
56
+ | `:invalid_byte_sequence` | `''` | What to replace invalid byte sequences with. |
57
+ | `:force_utf8` | `false` | Force UTF-8 encoding of all lines (including headers) in the CSV file. |
58
+
59
+ ### File Layout
60
+
61
+ | Option | Default | Explanation |
62
+ |--------|---------|-------------|
63
+ | `:skip_lines` | `nil` | How many lines to skip before the first line or header line is processed. |
64
+ | `:comment_regexp` | `nil` | Regular expression to ignore comment lines (e.g. `/\A#/`). See NOTE on CSV header. |
65
+ | `:chunk_size` | `nil` | If set, data is yielded in chunks of this many rows instead of all at once. Use with `SmarterCSV.each_chunk` for memory-efficient batch processing. |
66
+
67
+ ### Separators
68
+
69
+ | Option | Default | Explanation |
70
+ |--------|---------|-------------|
71
+ | `:col_sep` | `:auto` | Column separator. `:auto` detects from file content (previous default was `','`). |
72
+ | `:row_sep` | `:auto` | Row / record separator. `:auto` detects from file content. Manual detection reads the whole file first (slow on large files). |
73
+ | `:auto_row_sep_chars` | `500` | How many characters to analyze when using `:row_sep => :auto`. `nil` or `0` means whole file. |
74
+
75
+ ### Quoting
76
+
77
+ See [Parsing Strategy](./parsing_strategy.md) for full details on quote handling.
78
+
79
+ | Option | Default | Explanation |
80
+ |--------|---------|-------------|
81
+ | `:quote_char` | `'"'` | Quotation character. Must be a single byte. |
82
+ | `:quote_escaping` | `:auto` | How quotes are escaped inside quoted fields. `:auto` (default): tries backslash-escape first, falls back to RFC 4180. `:double_quotes` (RFC 4180): only `""` escapes a quote; backslash is literal. `:backslash` (MySQL/Unix): `\"` also escapes a quote. |
83
+ | `:quote_boundary` | `:standard` | Where quote characters are recognized as field delimiters. `:standard` (default): a quote only opens a field at a field boundary (first character of the field); mid-field quotes are literal. `:legacy`: any quote toggles quoted state regardless of position (old behavior). |
84
+
85
+ ### Headers
86
+
87
+ | Option | Default | Explanation |
88
+ |--------|---------|-------------|
89
+ | `:headers_in_file` | `true` ¹ | Whether the file contains headers as the first line. ¹ If `user_provided_headers` is given, default becomes `false` unless explicitly set to `true`. |
90
+ | `:user_provided_headers` | `nil` | *Careful!* User-provided Array of header strings or symbols, overriding any in-file headers. Cannot be combined with `:key_mapping`. |
91
+ | `:duplicate_header_suffix` | `''` | Appends a number to duplicated headers, separated by this suffix. Set to `nil` to raise `DuplicateHeaders` error instead (previous behavior). |
92
+ | `:downcase_header` | `true` | Downcase all column headers. |
93
+ | `:strings_as_keys` | `false` | Use strings instead of symbols as keys in the result hashes. |
94
+ | `:keep_original_headers` | `false` | Keep the original headers from the CSV file as-is. Disables other flags that manipulate header fields. |
95
+ | `:strip_chars_from_headers` | `nil` | RegExp to remove extraneous characters from the header line (e.g. if headers are quoted). |
96
+ | `:missing_header_prefix` | `column_` | Prefix for auto-generated column names when extra columns are found. |
97
+ | `:missing_headers` | `:auto` | Behavior when a data row has more columns than the header row. `:auto` (default): auto-name extra columns using `missing_header_prefix`. `:raise`: raise `HeaderSizeMismatch` on the first row with extra columns. |
98
+
99
+ ### Header Mapping & Validation
100
+
101
+ | Option | Default | Explanation |
102
+ |--------|---------|-------------|
103
+ | `:key_mapping` | `nil` | A hash mapping CSV headers to keys in the result hash. |
104
+ | `:silence_missing_keys` | `false` | Ignore missing keys in `key_mapping`. `true` makes all mapped keys optional; an Array makes only the listed keys optional. |
105
+ | `:remove_unmapped_keys` | `false` | When using `key_mapping`, remove columns that have no mapping. |
106
+ | `:required_keys` | `nil` | Array of key names (after header transformation) that must be present. Raises an exception if any required key is missing. No validation if `nil`. |
107
+
108
+ ### Column Selection
109
+
110
+ | Option | Default | Explanation |
111
+ |--------|---------|-------------|
112
+ | `headers: { only: }` | `nil` | Keep only the listed columns in each result hash. See [Column Selection](./column_selection.md). Accepts a symbol, string, or array of either (normalized to symbols). Uses post-mapping names (after `key_mapping:` is applied). Cannot be combined with `headers: { except: }`. |
113
+ | `headers: { except: }` | `nil` | Remove the listed columns from each result hash. See [Column Selection](./column_selection.md). Accepts a symbol, string, or array of either (normalized to symbols). Uses post-mapping names (after `key_mapping:` is applied). Cannot be combined with `headers: { only: }`. |
114
+
115
+ ### Value Transformations
116
+
117
+ | Option | Default | Explanation |
118
+ |--------|---------|-------------|
119
+ | `:strip_whitespace` | `true` | Remove whitespace before/after values and headers. |
120
+ | `:convert_values_to_numeric` | `true` | Convert strings containing integers or floats to the appropriate numeric type. Accepts `{except: [:key1, :key2]}` or `{only: :key3}` to limit which columns. |
121
+ | `:value_converters` | `nil` | Hash of `:header => ClassName`; each class must implement `self.convert(value)`. See [Value Converters](./value_converters.md). |
122
+ | `:remove_empty_values` | `true` | Remove key/value pairs where the value is `nil` or an empty string. |
123
+ | `:remove_zero_values` | `false` | Remove key/value pairs where the numeric value equals zero. |
124
+ | `:nil_values_matching` | `nil` | Set matching values to `nil`. Accepts a regular expression matched against the string representation of each value (e.g. `/\ANAN\z/` for NaN, `/\A#VALUE!\z/` for Excel errors). With `remove_empty_values: true` (default), nil-ified values are then removed. With `remove_empty_values: false`, the key is retained with a `nil` value. |
125
+ | `:remove_empty_hashes` | `true` | Remove result hashes that have no key/value pairs or all-empty values. |
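The interaction between `:nil_values_matching` and `:remove_empty_values` described above is easy to misread, so here is a minimal model of it for a single result hash. The helper name `apply_nil_values_matching` is hypothetical and the code is a sketch of the documented behavior only, not the gem's implementation.

```ruby
# Hypothetical model of the documented behavior for one result hash.
def apply_nil_values_matching(row, pattern, remove_empty_values: true)
  # The pattern is matched against the string representation of each value.
  result = row.transform_values { |value| pattern.match?(value.to_s) ? nil : value }
  return result unless remove_empty_values

  # With remove_empty_values: true, the nil-ified pairs are dropped afterwards.
  result.reject { |_key, value| value.nil? || value == "" }
end

row = { name: "widget", price: "NAN", note: "#VALUE!" }
pattern = /\A(NAN|#VALUE!)\z/

removed  = apply_nil_values_matching(row, pattern)
retained = apply_nil_values_matching(row, pattern, remove_empty_values: false)
p removed
p retained
```

The first call keeps only `:name`; the second keeps all three keys, with `nil` for the two matched values.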
126
+
127
+ ### Error Handling
128
+
129
+ See [Bad Row Quarantine](./bad_row_quarantine.md) for full details.
130
+
131
+ | Option | Default | Explanation |
132
+ |--------|---------|-------------|
133
+ | `:on_bad_row` | `:raise` | Behavior when a row raises a parse error. `:raise` (default): re-raise, stopping processing. `:skip`: skip the bad row and continue. `:collect`: skip and append an error record to `reader.errors[:bad_rows]`. callable: called with the error record per bad row; processing continues. |
134
+ | `:collect_raw_lines` | `true` | When collecting bad rows, include the raw stitched line in the error record. |
135
+ | `:bad_row_limit` | `nil` | If set, raises `SmarterCSV::TooManyBadRows` after this many bad rows. |
136
+ | `:field_size_limit` | `nil` | Maximum size of any extracted field in bytes. `nil` means no limit. Raises `SmarterCSV::FieldSizeLimitExceeded` (handled by `on_bad_row`) if a field or accumulating multiline buffer exceeds this size. Prevents DoS from runaway quoted fields or huge inline payloads. See [Bad Row Quarantine](./bad_row_quarantine.md#limiting-field-size-field_size_limit). |
137
+
138
+ ### Output & Diagnostics
139
+
140
+ | Option | Default | Explanation |
141
+ |--------|---------|-------------|
142
+ | `:with_line_numbers` | `false` | Add `:csv_line_number` to each result hash. |
143
+ | `:verbose` | `:normal` | Controls warning and diagnostic output. Accepted values:<br>• `:quiet` — suppress all warnings and notices (recommended for production)<br>• `:normal` — show behavioral warnings, e.g. auto-configuration notices **(default)**<br>• `:debug` — `:normal` + print computed options and per-row diagnostics to stderr<br>`nil` is silently treated as `:normal`. Passing `true` or `false` still works but is deprecated — see below. |
144
+
145
+ ### Instrumentation Hooks
146
+
147
+ See [Instrumentation Hooks](./instrumentation.md) for full details and payload reference.
148
+
149
+ | Option | Default | Explanation |
150
+ |--------|---------|-------------|
151
+ | `:on_start` | `nil` | Callable invoked once before the first row is parsed. Receives a payload hash with `:input`, `:file_size`, `:col_sep`, `:row_sep`. |
152
+ | `:on_chunk` | `nil` | Callable invoked after each chunk is parsed (only when `chunk_size` is set). Receives `:chunk_number`, `:rows_in_chunk`, `:total_rows_so_far`. |
153
+ | `:on_complete` | `nil` | Callable invoked once after the entire file is exhausted. Receives `:total_rows`, `:total_chunks`, `:duration`, `:bad_rows`. |
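To make the payload shapes in the table above concrete, the sketch below defines three callables and invokes them by hand. The payload keys are taken from the table; the sample values and the manual `call` invocations are made up for illustration, and the commented-out `SmarterCSV.process` call is an assumption about how the options would be passed rather than verified usage.

```ruby
# Callables shaped like the documented hooks. Payload keys come from the
# options table; all sample values below are invented.
events = []

on_start    = ->(payload) { events << "start: #{payload[:input]} (#{payload[:file_size]} bytes)" }
on_chunk    = ->(payload) { events << "chunk #{payload[:chunk_number]}: #{payload[:total_rows_so_far]} rows so far" }
on_complete = ->(payload) { events << "done: #{payload[:total_rows]} rows in #{payload[:total_chunks]} chunk(s)" }

# With the gem installed, the callables would presumably be passed as options:
#   SmarterCSV.process("file.csv", chunk_size: 100,
#                      on_start: on_start, on_chunk: on_chunk, on_complete: on_complete)
# Here they are called directly to show what each one receives.
on_start.call({ input: "file.csv", file_size: 2048, col_sep: ",", row_sep: "\n" })
on_chunk.call({ chunk_number: 1, rows_in_chunk: 100, total_rows_so_far: 100 })
on_complete.call({ total_rows: 100, total_chunks: 1, duration: 0.01, bad_rows: 0 })

puts events
```

Keeping each hook as a small lambda that only reads from the payload makes it straightforward to forward the same data to a logger or metrics client.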
154
+
155
+ ### Performance
156
+
157
+ | Option | Default | Explanation |
158
+ |--------|---------|-------------|
159
+ | `:acceleration` | `true` | Use the C extension for parsing (MRI Ruby only). Set to `false` to force the pure-Ruby fallback (always used on JRuby/TruffleRuby). |
160
+
161
+ ---
162
+
163
+ ## Deprecated Options
164
+
165
+ These options are still accepted but emit a deprecation warning. They will be removed in a future version.
166
+
167
+ | Option | Default | Replacement |
168
+ |--------|---------|-------------|
169
+ | `:strict` | `false` | Use `missing_headers: :raise` instead of `strict: true`, or `missing_headers: :auto` instead of `strict: false`. |
170
+ | `:required_headers` | `nil` | Renamed to `:required_keys`. Use `required_keys:` instead. |
171
+ | `:remove_values_matching` | `nil` | Renamed to `:nil_values_matching`. Use `nil_values_matching:` instead. |
172
+ | `verbose: true` | — | Use `verbose: :debug` instead. |
173
+ | `verbose: false` | — | Use `verbose: :normal` (or omit — it is the default) instead. |
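The replacements in the Deprecated Options table are mechanical, so an options hash can be rewritten before it is handed to the reader. The helper below is a hypothetical convenience (`migrate_deprecated_options` is not part of SmarterCSV); it is a sketch that encodes only the mappings listed in the table.

```ruby
# Hypothetical helper, not part of SmarterCSV: rewrites deprecated reader
# options to the replacements listed in the Deprecated Options table.
def migrate_deprecated_options(options)
  migrated = options.dup

  # strict: true  -> missing_headers: :raise
  # strict: false -> missing_headers: :auto
  if migrated.key?(:strict)
    strict = migrated.delete(:strict)
    migrated[:missing_headers] ||= strict ? :raise : :auto
  end

  # Pure renames; an explicitly given new-style key wins over the old one.
  { required_headers: :required_keys,
    remove_values_matching: :nil_values_matching }.each do |old_key, new_key|
    next unless migrated.key?(old_key)

    value = migrated.delete(old_key)
    migrated[new_key] = value unless migrated.key?(new_key)
  end

  # verbose: true -> :debug, verbose: false -> :normal
  case migrated[:verbose]
  when true  then migrated[:verbose] = :debug
  when false then migrated[:verbose] = :normal
  end

  migrated
end

legacy = { strict: true, required_headers: [:id], verbose: false }
p migrate_deprecated_options(legacy)
```

The input hash is duplicated rather than mutated, so the caller's original options stay intact.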
112
174
 
113
175
  -------------
114
- PREVIOUS: [Batch Processing](./batch_processing.md) | NEXT: [Row and Column Separators](./row_col_sep.md)
176
+
177
+ PREVIOUS: [Batch Processing](./batch_processing.md) | NEXT: [Row and Column Separators](./row_col_sep.md) | UP: [README](../README.md)
data/docs/parsing_strategy.md CHANGED
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
5
6
  * [**Parsing Strategy**](./parsing_strategy.md)
6
7
  * [The Basic Read API](./basic_read_api.md)
7
8
  * [The Basic Write API](./basic_write_api.md)
@@ -10,8 +11,15 @@
10
11
  * [Row and Column Separators](./row_col_sep.md)
11
12
  * [Header Transformations](./header_transformations.md)
12
13
  * [Header Validations](./header_validations.md)
14
+ * [Column Selection](./column_selection.md)
13
15
  * [Data Transformations](./data_transformations.md)
14
16
  * [Value Converters](./value_converters.md)
17
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
18
+ * [Instrumentation Hooks](./instrumentation.md)
19
+ * [Examples](./examples.md)
20
+ * [Real-World CSV Files](./real_world_csv.md)
21
+ * [SmarterCSV over the Years](./history.md)
22
+ * [Release Notes](./releases/1.16.0/changes.md)
15
23
 
16
24
  --------------
17
25
 
@@ -95,5 +103,59 @@ SmarterCSV.process("file.csv", quote_escaping: :backslash)
95
103
 
96
104
  **Note:** In `:backslash` mode, a field like `"abc\"` will raise `MalformedCSV` because the closing quote is escaped, leaving the field unclosed.
97
105
 
106
+ ## Quote Boundary: The `quote_boundary` Option
107
+
108
+ Real-world CSV files sometimes contain quote characters in the middle of an unquoted field — for example, a measurement like `6'2"`, a product name like `Intel Core i5 "Raptor Lake"`, or a field with an apostrophe in a poorly-exported file. Under a naive quote parser, any `"` would toggle quoted state, causing the field to be misread and subsequent fields to be garbled.
109
+
110
+ The `quote_boundary` option controls where SmarterCSV recognizes a quote as a field delimiter.
111
+
112
+ ### `:standard` (default)
113
+
114
+ In `:standard` mode, two rules apply:
115
+
116
+ - **Rule 1 — Opening**: a quote only opens a quoted field when it appears at the very start of the field (immediately after the column separator, or at the start of a line). A quote encountered after any other content is treated as a literal character.
117
+ - **Rule 2 — Closing**: a quote only closes a quoted field when it is immediately followed by a column separator, a row separator, or end of input. A quote in any other position inside a quoted field is treated as content (enabling RFC 4180 `""` doubled-quote escaping).
118
+
119
+ ```ruby
120
+ # Mid-field quote is a literal character — no state change
121
+ csv = "product,size\nCore i5 \"Raptor Lake\",medium\n"
122
+ SmarterCSV.process(StringIO.new(csv))
123
+ # => [{product: 'Core i5 "Raptor Lake"', size: "medium"}]
124
+
125
+ # Quote at field start opens quoted mode normally
126
+ csv = "first,second\n\"hello, world\",other\n"
127
+ SmarterCSV.process(StringIO.new(csv))
128
+ # => [{first: "hello, world", second: "other"}]
129
+
130
+ # RFC 4180 doubled quotes work inside a properly opened quoted field
131
+ csv = "name\n\"She said \"\"hello\"\"\"\n"
132
+ SmarterCSV.process(StringIO.new(csv))
133
+ # => [{name: 'She said "hello"'}]
134
+ ```
135
+
136
+ `:standard` is the default because treating mid-field quotes as literals matches how most modern CSV parsers (including Ruby's built-in `CSV` library in strict mode) handle malformed-but-common real-world data.
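Rule 1 and Rule 2 can be sketched as a tiny single-line field splitter. This is an illustration under simplifying assumptions, not SmarterCSV's parser: `split_fields` is a hypothetical name, only one line is handled (so no row separators or multiline fields), and backslash escaping is ignored.

```ruby
# Illustrative splitter for ONE line; not SmarterCSV's parser.
def split_fields(line, col_sep: ",", quote: '"', quote_boundary: :standard)
  fields = []
  field = +""
  in_quotes = false
  at_field_start = true
  i = 0

  while i < line.length
    ch = line[i]
    nxt = line[i + 1]

    if in_quotes
      if ch == quote && nxt == quote
        field << quote   # doubled quote inside a quoted field is one literal quote
        i += 1           # skip the second quote of the pair
      elsif ch == quote && (quote_boundary == :legacy || nxt.nil? || nxt == col_sep)
        in_quotes = false # Rule 2: a closing quote must precede a separator or end of input
      else
        field << ch       # any other character, including a stray quote, is content
      end
    elsif ch == quote && (at_field_start || quote_boundary == :legacy)
      in_quotes = true    # Rule 1: a quote opens a field only at the field start
    elsif ch == col_sep
      fields << field
      field = +""
      at_field_start = true
      i += 1
      next
    else
      field << ch         # mid-field quotes land here in :standard mode
    end

    at_field_start = false
    i += 1
  end

  raise ArgumentError, "unclosed quoted field" if in_quotes

  fields << field
end

p split_fields('Core i5 "Raptor Lake",medium')
p split_fields('"hello, world",other')
p split_fields('"She said ""hello"""')
```

Passing `quote_boundary: :legacy` to the same sketch makes every quote toggle quoted state, so a line such as `6'2",tall` ends inside an open quote and raises, which mirrors the unclosed-field failure described for `:legacy` mode.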
137
+
138
+ ### `:legacy`
139
+
140
+ In `:legacy` mode, any quote character toggles quoted state regardless of its position in the field. This was the only behavior available before SmarterCSV 1.16.0.
141
+
142
+ Use `:legacy` only if you have files that were specifically produced to rely on mid-field quote toggling, and you cannot change the source. Note that a mid-field quote with an odd total count will result in an unclosed field and a `MalformedCSV` error under `:legacy` mode.
143
+
144
+ ```ruby
145
+ SmarterCSV.process("file.csv", quote_boundary: :legacy)
146
+ ```
147
+
148
+ ### Interaction with `quote_escaping`
149
+
150
+ Both options apply simultaneously. `quote_boundary` governs *where* a quote is recognized as a delimiter; `quote_escaping` governs *how* a literal quote is represented *inside* a quoted field. They are independent:
151
+
152
+ | `quote_boundary` | `quote_escaping` | Effect |
153
+ |---|---|---|
154
+ | `:standard` | `:auto` (default) | Standard field boundaries + auto-detect escaping style |
155
+ | `:standard` | `:double_quotes` | Standard field boundaries + RFC 4180 only |
156
+ | `:standard` | `:backslash` | Standard field boundaries + backslash escaping |
157
+ | `:legacy` | `:auto` | Old toggle behavior + auto-detect escaping style |
158
+
98
159
  --------------
99
- PREVIOUS: [Introduction](./_introduction.md) | NEXT: [The Basic Read API](./basic_read_api.md)
160
+
161
+ PREVIOUS: [Migrating from Ruby CSV](./migrating_from_csv.md) | NEXT: [The Basic Read API](./basic_read_api.md) | UP: [README](../README.md)