smart_csv_import 0.1.0

Files changed (48)
  1. checksums.yaml +7 -0
  2. data/LICENSE.adoc +134 -0
  3. data/README.md +534 -0
  4. data/app/jobs/smart_csv_import/import_job.rb +22 -0
  5. data/app/models/smart_csv_import/import.rb +36 -0
  6. data/app/models/smart_csv_import/import_row_error.rb +17 -0
  7. data/lib/generators/smart_csv_import/import/import_generator.rb +49 -0
  8. data/lib/generators/smart_csv_import/import/templates/import_form.rb.tt +32 -0
  9. data/lib/generators/smart_csv_import/import/templates/import_form_spec.rb.tt +38 -0
  10. data/lib/generators/smart_csv_import/install/install_generator.rb +34 -0
  11. data/lib/generators/smart_csv_import/install/templates/create_smart_csv_import_import_row_errors.rb.tt +18 -0
  12. data/lib/generators/smart_csv_import/install/templates/create_smart_csv_import_imports.rb.tt +23 -0
  13. data/lib/generators/smart_csv_import/install/templates/initializer.rb.tt +51 -0
  14. data/lib/generators/smart_csv_import/scaffold/scaffold_generator.rb +56 -0
  15. data/lib/generators/smart_csv_import/scaffold/templates/controller.rb.tt +33 -0
  16. data/lib/generators/smart_csv_import/scaffold/templates/new.html.erb.tt +12 -0
  17. data/lib/generators/smart_csv_import/scaffold/templates/show.html.erb.tt +59 -0
  18. data/lib/smart_csv_import/configuration.rb +77 -0
  19. data/lib/smart_csv_import/cosine_similarity.rb +15 -0
  20. data/lib/smart_csv_import/engine.rb +12 -0
  21. data/lib/smart_csv_import/failed_row_exporter.rb +78 -0
  22. data/lib/smart_csv_import/file_storage.rb +34 -0
  23. data/lib/smart_csv_import/header_normalizer.rb +76 -0
  24. data/lib/smart_csv_import/logging.rb +37 -0
  25. data/lib/smart_csv_import/match_result.rb +36 -0
  26. data/lib/smart_csv_import/matchable.rb +76 -0
  27. data/lib/smart_csv_import/matcher.rb +198 -0
  28. data/lib/smart_csv_import/normalizers/boolean_converter.rb +26 -0
  29. data/lib/smart_csv_import/normalizers/date_converter.rb +28 -0
  30. data/lib/smart_csv_import/notifications.rb +16 -0
  31. data/lib/smart_csv_import/processor/csv_preflight_analyzer.rb +74 -0
  32. data/lib/smart_csv_import/processor/import_result_builder.rb +97 -0
  33. data/lib/smart_csv_import/processor/mapping_review_policy.rb +90 -0
  34. data/lib/smart_csv_import/processor/nil_cell_counter.rb +19 -0
  35. data/lib/smart_csv_import/processor/null_progress_callback.rb +11 -0
  36. data/lib/smart_csv_import/processor/row_processor.rb +70 -0
  37. data/lib/smart_csv_import/processor.rb +294 -0
  38. data/lib/smart_csv_import/result.rb +101 -0
  39. data/lib/smart_csv_import/stability_report.rb +104 -0
  40. data/lib/smart_csv_import/strategies/llm.rb +106 -0
  41. data/lib/smart_csv_import/strategies/lookup.rb +41 -0
  42. data/lib/smart_csv_import/strategies/vector.rb +155 -0
  43. data/lib/smart_csv_import/strategy.rb +9 -0
  44. data/lib/smart_csv_import/strategy_failure.rb +13 -0
  45. data/lib/smart_csv_import/version.rb +5 -0
  46. data/lib/smart_csv_import.rb +79 -0
  47. data/smart_csv_import.gemspec +35 -0
  48. metadata +216 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: f663c2506a13b2fd5f495969fe22337c6a5a6408ecf8dc5c9f12805e73c15c65
4
+ data.tar.gz: fe62bc86016e471136c30362659fc28aecdaa8d690a5a772cfcf0c880021efe0
5
+ SHA512:
6
+ metadata.gz: 013d1838ac0a41c353c4f63423a2e891e99509eb71197f956250dafff745ff0e91d6a00402b161acb547c1ed13c87e2e35e236d97fae0fd4b8a3cc30001ff722
7
+ data.tar.gz: 62f20e4e4fe40ebff5f43af35e528fda2e6d8197183521072c95aa3de491248f1d0fbbd5c44a3beb716776d8ec8bf2f95391c9ddba4e49e4d29a5a091713b181
data/LICENSE.adoc ADDED
@@ -0,0 +1,134 @@
1
+ = Hippocratic License
2
+
3
+ Version: 2.1.0.
4
+
5
+ Purpose. The purpose of this License is for the Licensor named above to
6
+ permit the Licensee (as defined below) broad permission, if consistent
7
+ with Human Rights Laws and Human Rights Principles (as each is defined
8
+ below), to use and work with the Software (as defined below) within the
9
+ full scope of Licensor’s copyright and patent rights, if any, in the
10
+ Software, while ensuring attribution and protecting the Licensor from
11
+ liability.
12
+
13
+ Permission and Conditions. The Licensor grants permission by this
14
+ license ("License"), free of charge, to the extent of Licensor’s
15
+ rights under applicable copyright and patent law, to any person or
16
+ entity (the "Licensee") obtaining a copy of this software and
17
+ associated documentation files (the "Software"), to do everything with
18
+ the Software that would otherwise infringe (i) the Licensor’s copyright
19
+ in the Software or (ii) any patent claims to the Software that the
20
+ Licensor can license or becomes able to license, subject to all of the
21
+ following terms and conditions:
22
+
23
+ * Acceptance. This License is automatically offered to every person and
24
+ entity subject to its terms and conditions. Licensee accepts this
25
+ License and agrees to its terms and conditions by taking any action with
26
+ the Software that, absent this License, would infringe any intellectual
27
+ property right held by Licensor.
28
+ * Notice. Licensee must ensure that everyone who gets a copy of any part
29
+ of this Software from Licensee, with or without changes, also receives
30
+ the License and the above copyright notice (and if included by the
31
+ Licensor, patent, trademark and attribution notice). Licensee must cause
32
+ any modified versions of the Software to carry prominent notices stating
33
+ that Licensee changed the Software. For clarity, although Licensee is
34
+ free to create modifications of the Software and distribute only the
35
+ modified portion created by Licensee with additional or different terms,
36
+ the portion of the Software not modified must be distributed pursuant to
37
+ this License. If anyone notifies Licensee in writing that Licensee has
38
+ not complied with this Notice section, Licensee can keep this License by
39
+ taking all practical steps to comply within 30 days after the notice. If
40
+ Licensee does not do so, Licensee’s License (and all rights licensed
41
+ hereunder) shall end immediately.
42
+ * Compliance with Human Rights Principles and Human Rights Laws.
43
+ [arabic]
44
+ . Human Rights Principles.
45
+ [loweralpha]
46
+ .. Licensee is advised to consult the articles of the United Nations
47
+ Universal Declaration of Human Rights and the United Nations Global
48
+ Compact that define recognized principles of international human rights
49
+ (the "Human Rights Principles"). Licensee shall use the Software in a
50
+ manner consistent with Human Rights Principles.
51
+ .. Unless the Licensor and Licensee agree otherwise, any dispute,
52
+ controversy, or claim arising out of or relating to (i) Section 1(a)
53
+ regarding Human Rights Principles, including the breach of Section 1(a),
54
+ termination of this License for breach of the Human Rights Principles,
55
+ or invalidity of Section 1(a) or (ii) a determination of whether any Law
56
+ is consistent or in conflict with Human Rights Principles pursuant to
57
+ Section 2, below, shall be settled by arbitration in accordance with the
58
+ Hague Rules on Business and Human Rights Arbitration (the "Rules");
59
+ provided, however, that Licensee may elect not to participate in such
60
+ arbitration, in which event this License (and all rights licensed
61
+ hereunder) shall end immediately. The number of arbitrators shall be one
62
+ unless the Rules require otherwise.
63
+ +
64
+ Unless both the Licensor and Licensee agree to the contrary: (1) All
65
+ documents and information concerning the arbitration shall be public and
66
+ may be disclosed by any party; (2) The repository referred to under
67
+ Article 43 of the Rules shall make available to the public in a timely
68
+ manner all documents concerning the arbitration which are communicated
69
+ to it, including all submissions of the parties, all evidence admitted
70
+ into the record of the proceedings, all transcripts or other recordings
71
+ of hearings and all orders, decisions and awards of the arbitral
72
+ tribunal, subject only to the arbitral tribunal’s powers to take such
73
+ measures as may be necessary to safeguard the integrity of the arbitral
74
+ process pursuant to Articles 18, 33, 41 and 42 of the Rules; and (3)
75
+ Article 26(6) of the Rules shall not apply.
76
+ . Human Rights Laws. The Software shall not be used by any person or
77
+ entity for any systems, activities, or other uses that violate any Human
78
+ Rights Laws. "Human Rights Laws" means any applicable laws,
79
+ regulations, or rules (collectively, "Laws") that protect human,
80
+ civil, labor, privacy, political, environmental, security, economic, due
81
+ process, or similar rights; provided, however, that such Laws are
82
+ consistent and not in conflict with Human Rights Principles (a dispute
83
+ over the consistency or a conflict between Laws and Human Rights
84
+ Principles shall be determined by arbitration as stated above). Where
85
+ the Human Rights Laws of more than one jurisdiction are applicable or in
86
+ conflict with respect to the use of the Software, the Human Rights Laws
87
+ that are most protective of the individuals or groups harmed shall
88
+ apply.
89
+ . Indemnity. Licensee shall hold harmless and indemnify Licensor (and
90
+ any other contributor) against all losses, damages, liabilities,
91
+ deficiencies, claims, actions, judgments, settlements, interest, awards,
92
+ penalties, fines, costs, or expenses of whatever kind, including
93
+ Licensor’s reasonable attorneys’ fees, arising out of or relating to
94
+ Licensee’s use of the Software in violation of Human Rights Laws or
95
+ Human Rights Principles.
96
+ * Failure to Comply. Any failure of Licensee to act according to the
97
+ terms and conditions of this License is both a breach of the License and
98
+ an infringement of the intellectual property rights of the Licensor
99
+ (subject to exceptions under Laws, e.g., fair use). In the event of a
100
+ breach or infringement, the terms and conditions of this License may be
101
+ enforced by Licensor under the Laws of any jurisdiction to which
102
+ Licensee is subject. Licensee also agrees that the Licensor may enforce
103
+ the terms and conditions of this License against Licensee through
104
+ specific performance (or similar remedy under Laws) to the extent
105
+ permitted by Laws. For clarity, except in the event of a breach of this
106
+ License, infringement, or as otherwise stated in this License, Licensor
107
+ may not terminate this License with Licensee.
108
+ * Enforceability and Interpretation. If any term or provision of this
109
+ License is determined to be invalid, illegal, or unenforceable by a
110
+ court of competent jurisdiction, then such invalidity, illegality, or
111
+ unenforceability shall not affect any other term or provision of this
112
+ License or invalidate or render unenforceable such term or provision in
113
+ any other jurisdiction; provided, however, subject to a court
114
+ modification pursuant to the immediately following sentence, if any term
115
+ or provision of this License pertaining to Human Rights Laws or Human
116
+ Rights Principles is deemed invalid, illegal, or unenforceable against
117
+ Licensee by a court of competent jurisdiction, all rights in the
118
+ Software granted to Licensee shall be deemed null and void as between
119
+ Licensor and Licensee. Upon a determination that any term or provision
120
+ is invalid, illegal, or unenforceable, to the extent permitted by Laws,
121
+ the court may modify this License to affect the original purpose that
122
+ the Software be used in compliance with Human Rights Principles and
123
+ Human Rights Laws as closely as possible. The language in this License
124
+ shall be interpreted as to its fair meaning and not strictly for or
125
+ against any party.
126
+ * Disclaimer. TO THE FULL EXTENT ALLOWED BY LAW, THIS SOFTWARE COMES
127
+ "AS IS," WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, AND LICENSOR AND
128
+ ANY OTHER CONTRIBUTOR SHALL NOT BE LIABLE TO ANYONE FOR ANY DAMAGES OR
129
+ OTHER LIABILITY ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE
130
+ OR THIS LICENSE, UNDER ANY KIND OF LEGAL CLAIM.
131
+
132
+ This Hippocratic License is an link:https://ethicalsource.dev[Ethical Source license] and is offered
133
+ for use by licensors and licensees at their own risk, on an "AS IS" basis, and with no warranties
134
+ express or implied, to the maximum extent permitted by Laws.
data/README.md ADDED
@@ -0,0 +1,534 @@
1
+ # SmartCsvImport
2
+
3
+ [![CI](https://github.com/Nroulston/smart_csv_import/actions/workflows/ci.yml/badge.svg)](https://github.com/Nroulston/smart_csv_import/actions/workflows/ci.yml)
4
+ [![Gem Version](https://badge.fury.io/rb/smart_csv_import.svg)](https://badge.fury.io/rb/smart_csv_import)
5
+ [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE.adoc)
6
+
7
+ A Rails Engine for CSV importing with AI-powered header matching. Drop in a form object, describe your fields in plain English, and SmartCsvImport automatically maps whatever column names show up in the CSV — messy, abbreviated, or domain-specific — to your fields. No brittle column-name checks to maintain.
8
+
9
+ Matching uses a two-tier strategy: **vector similarity** (fast, cached embeddings) handles the common case, and **LLM fallback** handles ambiguous headers that need cross-field reasoning. Row data is never sent to an AI service — only headers and field descriptions.
10
+
11
+ ---
12
+
13
+ ## Table of Contents
14
+
15
+ - [Quickstart](#quickstart)
16
+ - [Reviewing Failed Rows](#reviewing-failed-rows)
17
+ - [Form Object DSL](#form-object-dsl)
18
+ - [Configuration](#configuration)
19
+ - [Matching Strategies](#matching-strategies)
20
+ - [Processing Modes](#processing-modes)
21
+ - [Advanced](#advanced)
22
+ - [Contributing](#contributing)
23
+
24
+ ---
25
+
26
+ ## Quickstart
27
+
28
+ ### 1. Configure an AI provider
29
+
30
+ SmartCsvImport delegates all AI calls to [ruby_llm](https://github.com/crmne/ruby_llm). Configure it with your provider credentials before using vector or LLM matching strategies.
31
+
32
+ ```ruby
33
+ # config/initializers/ruby_llm.rb
34
+ RubyLLM.configure do |config|
35
+ # OpenAI
36
+ config.openai_api_key = ENV["OPENAI_API_KEY"]
37
+
38
+ # — or Anthropic —
39
+ config.anthropic_api_key = ENV["ANTHROPIC_API_KEY"]
40
+
41
+ # — or Google (Gemini embedding is free tier, good for getting started) —
42
+ config.gemini_api_key = ENV["GEMINI_API_KEY"]
43
+ end
44
+ ```
45
+
46
+ > **Free tier option:** The default models (`gemini-embedding-001` for embeddings, `claude-haiku-4-5-20251001` for LLM fallback) work well for development and small workloads. A Google AI Studio key gives you free Gemini embeddings. See [Model Selection](#model-selection) for upgrading.
47
+
48
+ ### 2. Install the gem
49
+
50
+ ```ruby
51
+ # Gemfile
52
+ gem "smart_csv_import"
53
+ ```
54
+
55
+ ```bash
56
+ bundle install
57
+ ```
58
+
59
+ ### 3. Run the install generator
60
+
61
+ ```bash
62
+ rails generate smart_csv_import:install
63
+ ```
64
+
65
+ This creates:
66
+
67
+ - `config/initializers/smart_csv_import.rb` — configuration with all options commented out at their defaults
68
+ - `db/migrate/YYYYMMDDHHMMSS_create_smart_csv_import_imports.rb` — import tracking table
69
+ - `tmp/smart_csv_import/` and `tmp/smart_csv_import/embeddings_cache/` — storage directories
70
+
71
+ ```bash
72
+ rails db:migrate
73
+ ```
74
+
75
+ ### 4. Generate an import form
76
+
77
+ ```bash
78
+ rails generate smart_csv_import:import Employee first_name last_name email
79
+ ```
80
+
81
+ This creates `app/forms/employee_import_row.rb`:
82
+
83
+ ```ruby
84
+ class EmployeeImportRow
85
+ include ActiveModel::Validations
86
+ include SmartCsvImport::Matchable
87
+
88
+ attr_accessor :first_name, :last_name, :email
89
+
90
+ # The description is what the AI matches against — be specific.
91
+ # "Employee first/given name" beats "first name" for ambiguous CSVs.
92
+ csv_field :first_name, description: "Employee first/given name", required: true
93
+ csv_field :last_name, description: "Employee last/family name", required: true
94
+ csv_field :email, description: "Employee email address"
95
+
96
+ validates :first_name, :last_name, presence: true
97
+
98
+ def save
99
+ Employee.create!(first_name: first_name, last_name: last_name, email: email)
100
+ true
101
+ rescue ActiveRecord::RecordInvalid
102
+ false
103
+ end
104
+ end
105
+ ```
106
+
107
+ Edit the `save` method to persist however your app needs. SmartCsvImport calls `save` once per row.
108
+
109
+ ### 5. Process a CSV
110
+
111
+ ```ruby
112
+ result = SmartCsvImport.process("path/to/employees.csv", form_class: EmployeeImportRow)
113
+
114
+ result.completed? # => true
115
+ result.imported # => 98
116
+ result.failed # => 2
117
+ result.total # => 100
118
+ result.errors # => [#<RowError row=42 column=:email messages=["is invalid"]>]
119
+ result.warnings # => [#<RowWarning message="Column 'Nickname' was not mapped">]
120
+ result.header_mappings # => {"First Name" => "first_name", "Surname" => "last_name", ...}
121
+ ```
122
+
123
+ That's it. SmartCsvImport reads the CSV headers, maps them to your fields, and calls `save` on a form object for each row.
124
+
125
+ ---
126
+
127
+ ## Reviewing Failed Rows
128
+
129
+ Every row that fails — whether due to CSV parsing (malformed quotes, encoding issues) or form validation — is captured as structured data so you can build whatever review experience you want: a paginated admin UI, a corrective re-export, a retry workflow.
130
+
131
+ ### In-memory: the `Result` object (sync callers)
132
+
133
+ ```ruby
134
+ result = SmartCsvImport.process("employees.csv", form_class: EmployeeImportRow)
135
+
136
+ result.errors # => [#<RowError row=42 column=:email messages=["is invalid"]>, ...]
137
+ result.parse_errors # => [#<ParseError line_number=87 raw_line='"Bob,Jones...' error_message="Unclosed quoted field">, ...]
138
+ ```
139
+
140
+ - `RowError(row, column, messages)` — one per field that failed validation. A single CSV row that fails on multiple columns produces multiple `RowError` records.
141
+ - `ParseError(line_number, raw_line, error_message)` — one per CSV row that couldn't be parsed at all.
142
+
143
+ ### Persistent: the `Import#row_errors` association (async, later, or both)
144
+
145
+ Every failure is persisted to the `smart_csv_import_import_row_errors` table, giving you an ActiveRecord association to query at any time — including after an async `ImportJob` has completed and the in-memory `Result` is gone.
146
+
147
+ ```ruby
148
+ import = SmartCsvImport::Import.find(import_id)
149
+
150
+ import.row_errors.count # => 7
151
+ import.row_errors.validation_errors.count # => 5
152
+ import.row_errors.parse_errors.count # => 2
153
+
154
+ # Pagination, filtering, grouping — all plain ActiveRecord:
155
+ import.row_errors.validation_errors.where(column_name: "email").limit(50).offset(100)
156
+ import.row_errors.group(:error_type, :column_name).count
157
+ ```
158
+
159
+ Each `SmartCsvImport::ImportRowError` record:
160
+
161
+ | Field | Type | Populated for | Description |
162
+ |---|---|---|---|
163
+ | `import_id` | integer | both | FK to `smart_csv_import_imports` (cascade delete) |
164
+ | `row_number` | integer | both | Physical CSV line number (1-indexed, headers on line 1) |
165
+ | `error_type` | string | both | `"validation"` or `"parse"` |
166
+ | `column_name` | string | validation | Form field that failed (e.g. `"email"`) |
167
+ | `messages` | json (array) | validation | Error messages from the form object (e.g. `["is invalid"]`) |
168
+ | `raw_line` | text | parse | Literal CSV row text that failed to parse |
169
+ | `error_message` | text | parse | Parser's description of the failure |
170
+
171
+ ### Downloadable "fix and re-upload" CSV
172
+
173
+ Re-export failed rows to a CSV with the original headers plus an `_error` column — drop it in front of a user, let them fix the rows, and re-upload.
174
+
175
+ ```ruby
176
+ # From a sync Result:
177
+ output_path = SmartCsvImport::FailedRowExporter.new(
178
+ result: result,
179
+ csv_path: "path/to/original.csv"
180
+ ).call
181
+
182
+ # Or from a persisted Import (e.g. after an async job):
183
+ output_path = SmartCsvImport::FailedRowExporter.new(
184
+ import: SmartCsvImport::Import.find(import_id),
185
+ csv_path: "path/to/original.csv"
186
+ ).call
187
+ # => "tmp/smart_csv_import/failed_rows/20260423142530_failed.csv"
188
+ ```
189
+
190
+ The exporter writes only validation failures — parse errors keep their `raw_line` intact on the `row_errors` record, which is usually more useful for inspecting malformed CSV than re-exporting.
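The "original headers plus an `_error` column" shape can be sketched with Ruby's standard `csv` library. This is an illustration only, not the gem's `FailedRowExporter`: the helper name `failed_rows_csv` and the `errors_by_line` hash are hypothetical, and the sketch assumes no field contains an embedded newline, so the physical line number equals the data row index plus one header line.

```ruby
require "csv"

# Hypothetical helper (not part of smart_csv_import): returns CSV text that
# contains only the failed rows, with an extra "_error" column appended.
# errors_by_line maps a physical line number (headers on line 1) to messages.
def failed_rows_csv(original_csv, errors_by_line)
  table = CSV.parse(original_csv, headers: true)

  CSV.generate do |out|
    out << table.headers + ["_error"]
    table.each.with_index(2) do |row, line_number| # first data row is line 2
      messages = errors_by_line[line_number]
      next if messages.nil?

      out << row.fields + [messages.join("; ")]
    end
  end
end

original = "First Name,Email\nAda,ada@example.com\nBob,not-an-email\n"
puts failed_rows_csv(original, { 3 => ["email is invalid"] })
```

Only the row on line 3 is written back, which keeps the corrective file as small as the set of failures.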
191
+
192
+ ---
193
+
194
+ ## Form Object DSL
195
+
196
+ Include `SmartCsvImport::Matchable` in any class with `attr_accessor` declarations and use `csv_field` to register fields for matching.
197
+
198
+ ### `csv_field`
199
+
200
+ ```ruby
201
+ csv_field :field_name, description: "...", required: false
202
+ ```
203
+
204
+ | Parameter | Required | Description |
205
+ |---|---|---|
206
+ | `name` | yes | Must match an `attr_accessor` on the class |
207
+ | `description:` | yes | Plain-English description — this is what the AI matches CSV headers against |
208
+ | `required:` | no | When `true`, a failed match transitions the import to `mapping_review` instead of processing rows (see [Import Tracking](#import-tracking)) |
209
+
210
+ **Write good descriptions.** The description is the only signal the AI has. Vague descriptions produce weaker matches.
211
+
212
+ ```ruby
213
+ # Weaker — too generic
214
+ csv_field :amount, description: "amount"
215
+
216
+ # Stronger — specific and unambiguous
217
+ csv_field :amount, description: "Invoice total amount in dollars"
218
+ ```
219
+
220
+ ### `csv_source` and `csv_context`
221
+
222
+ Optional class-level hints that give the LLM domain knowledge for disambiguating headers it can't resolve from descriptions alone.
223
+
224
+ ```ruby
225
+ class EmployeeImportRow
226
+ include SmartCsvImport::Matchable
227
+
228
+ # Where the CSV comes from
229
+ csv_source "ADP Workforce payroll export"
230
+
231
+ # The business domain of your app
232
+ csv_context "HR platform for staffing agencies"
233
+
234
+ csv_field :mobile_phone, description: "Employee mobile phone number"
235
+ # ...
236
+ end
237
+ ```
238
+
239
+ Without context, the LLM sees "Cell" and has no way to know if it means a mobile number, a prison cell, or a biological cell. With `csv_source` and `csv_context`, it can reason correctly.
240
+
241
+ ---
242
+
243
+ ## Configuration
244
+
245
+ ```ruby
246
+ # config/initializers/smart_csv_import.rb
247
+ SmartCsvImport.configure do |config|
248
+ config.confidence_threshold = 0.80
249
+ config.batch_size = 500
250
+ config.storage_path = "tmp/smart_csv_import"
251
+ config.default_strategy = :vector
252
+ config.llm_model = "claude-haiku-4-5-20251001"
253
+ config.embedding_model = "gemini-embedding-001"
254
+ config.value_hint_rows = 5
255
+ end
256
+ ```
257
+
258
+ | Option | Default | Description |
259
+ |---|---|---|
260
+ | `confidence_threshold` | `0.80` | Minimum confidence score required to accept a vector or LLM match (for the vector strategy, this is based on cosine similarity). Headers scoring below this threshold fall through to the next strategy tier or become unmatched. |
261
+ | `batch_size` | `500` | How often (in rows) the import record is updated with progress counts during processing. |
262
+ | `storage_path` | `"tmp/smart_csv_import"` | Root directory for stored CSV files and the embedding cache. |
263
+ | `default_strategy` | `:vector` | Which strategy tier to start with when no custom strategy is set on the form class. |
264
+ | `llm_model` | `"claude-haiku-4-5-20251001"` | The LLM used for fallback matching. Any model supported by ruby_llm can be used. |
265
+ | `embedding_model` | `"gemini-embedding-001"` | The embedding model used by the vector strategy. |
266
+ | `value_hint_rows` | `5` | Number of sample rows inspected to apply value-based confidence adjustments (e.g. boosting confidence when cell values look like dates or emails). |
267
+
268
+ ### Model selection
269
+
270
+ Better models produce better matching accuracy on ambiguous or domain-specific headers. The defaults are suitable for development and light workloads. To upgrade:
271
+
272
+ ```ruby
273
+ config.embedding_model = "text-embedding-3-large" # OpenAI — higher dimensionality
274
+ config.llm_model = "claude-sonnet-4-6" # Anthropic — stronger reasoning
275
+ ```
276
+
277
+ Any model listed in the [ruby_llm documentation](https://github.com/crmne/ruby_llm) works — no other changes needed.
278
+
279
+ ---
280
+
281
+ ## Matching Strategies
282
+
283
+ ### How the fallback chain works
284
+
285
+ For each unmatched header, SmartCsvImport tries strategies in order and accepts the first result that meets the confidence threshold:
286
+
287
+ ```
288
+ CSV headers
289
+
290
+
291
+ Custom strategy (if set on form class)
292
+ │ unmatched or below threshold
293
+
294
+ Vector strategy (embedding cosine similarity)
295
+ │ unmatched or below threshold
296
+
297
+ LLM strategy (structured prompt)
298
+ │ unmatched
299
+
300
+ UnmatchedResult → warning on the result object
301
+ ```
302
+
303
+ Once a header is matched, it does not pass to the next tier. A header that passes through all three tiers without a match becomes an `UnmatchedResult` and generates a warning; it does not cause the import to fail.
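The fallback behaviour described above can be sketched in plain Ruby. This is a simplified model, not the gem's `Matcher`: `run_chain` and the lambda-based strategies are hypothetical stand-ins, and each strategy is assumed to return `{ header => [field, confidence] }` for the headers it was given.

```ruby
# Hypothetical model of a strategy chain. Each strategy is a callable that
# receives the still-unmatched headers and returns
# { header => [field, confidence] }.
def run_chain(headers, strategies, threshold: 0.80)
  matched = {}
  remaining = headers.dup

  strategies.each do |strategy|
    break if remaining.empty?

    strategy.call(remaining).each do |header, (field, confidence)|
      next if confidence < threshold # below threshold: leave for the next tier

      matched[header] = field
    end
    remaining -= matched.keys # matched headers never reach later tiers
  end

  { matched: matched, unmatched: remaining }
end

# Toy strategies standing in for a lookup table and a vector tier.
lookup = lambda do |headers|
  headers.include?("EMP_ID") ? { "EMP_ID" => [:employee_id, 1.0] } : {}
end
vector = lambda do |headers|
  headers.to_h { |h| [h, h == "First Name" ? [:first_name, 0.97] : [:email, 0.41]] }
end

result = run_chain(["EMP_ID", "First Name", "Nickname"], [lookup, vector])
puts result.inspect
```

In this toy run `"Nickname"` scores below the threshold in every tier, so it ends up in `:unmatched`, which corresponds to the warning case rather than a failed import.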
304
+
305
+ ### Vector strategy
306
+
307
+ Computes embeddings for your field descriptions and the incoming CSV headers, then accepts the highest-scoring mutual match above the confidence threshold.
308
+
309
+ Field embeddings are cached to disk (keyed by your field definitions) so the API is only called once per unique set of fields — subsequent imports of the same type are fast.
310
+
311
+ Only headers and field descriptions are sent to the embedding API. Row data is never transmitted.
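As a rough illustration of "highest-scoring mutual match above the confidence threshold", the sketch below uses toy three-dimensional vectors in place of real embeddings. It is one reading of that sentence, not the code in `strategies/vector.rb`: `cosine_similarity` and `mutual_best_matches` are hypothetical helpers, and "mutual" is assumed to mean that a header and a field each rank the other highest.

```ruby
# Cosine similarity of two equal-length numeric vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Hypothetical helper: keep header/field pairs that are each other's best
# match and whose similarity meets the threshold.
def mutual_best_matches(header_vecs, field_vecs, threshold: 0.80)
  best_field = header_vecs.to_h do |header, hv|
    [header, field_vecs.max_by { |_, fv| cosine_similarity(hv, fv) }.first]
  end
  best_header = field_vecs.to_h do |field, fv|
    [field, header_vecs.max_by { |_, hv| cosine_similarity(hv, fv) }.first]
  end

  best_field.select do |header, field|
    best_header[field] == header &&
      cosine_similarity(header_vecs[header], field_vecs[field]) >= threshold
  end
end

# Toy vectors; real embeddings have hundreds or thousands of dimensions.
headers = {
  "Surname"  => [0.9, 0.1, 0.0],
  "E-mail"   => [0.0, 0.2, 0.98],
  "Nickname" => [0.6, 0.6, 0.1]
}
fields = { last_name: [1.0, 0.0, 0.0], email: [0.0, 0.0, 1.0] }

matches = mutual_best_matches(headers, fields)
puts matches.inspect
```

Here `"Nickname"` leans toward `last_name` but loses to `"Surname"` and also falls below 0.80, so in the real chain it would be handed to the next tier.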
312
+
313
+ ### LLM strategy
314
+
315
+ Fires for headers that the vector strategy couldn't match confidently. Sends all remaining headers and all field definitions together in a single prompt, letting the LLM reason across the full set:
316
+
317
+ > "Cell" next to `first_name`, `last_name`, `email` → clearly a phone number.
318
+ > "Cell" in isolation → ambiguous.
319
+
320
+ Cross-field context is what makes this effective. Only headers and descriptions are sent — never row data.
321
+
322
+ ### Lookup strategy
323
+
324
+ Zero AI. For systems with fixed, known column names.
325
+
326
+ ```ruby
327
+ class HrSystemMapping < SmartCsvImport::Strategies::Lookup
328
+ mappings(
329
+ "EMP_ID" => :employee_id,
330
+ "FNAME" => :first_name,
331
+ "LNAME" => :last_name,
332
+ "DOB" => :date_of_birth
333
+ )
334
+ end
335
+
336
+ class EmployeeImportRow
337
+ include SmartCsvImport::Matchable
338
+
339
+ self.matching_strategy = HrSystemMapping.new
340
+ # ...
341
+ end
342
+ ```
343
+
344
+ Because a custom strategy runs first in the chain, matches from the Lookup table skip vector and LLM entirely.
345
+
346
+ ### Custom strategy
347
+
348
+ Subclass `SmartCsvImport::Strategy` and implement `match`:
349
+
350
+ ```ruby
351
+ class MyStrategy < SmartCsvImport::Strategy
352
+ def match(csv_headers:, form_class:, sample_rows: [])
353
+ csv_headers.each_with_object({}) do |header, results|
354
+ next unless header.downcase == "emp_id"
355
+
356
+ results[header] = SmartCsvImport::MatchResult.matched(
357
+ target_field: :employee_id,
358
+ confidence: 1.0,
359
+ strategy_name: "my_strategy"
360
+ )
361
+ end
362
+ end
363
+ end
364
+ ```
365
+
366
+ Return only the headers your strategy can confidently match. Headers you omit fall through to the next tier.
367
+
368
+ ---
369
+
370
+ ## Processing Modes
371
+
372
+ ### Synchronous (default)
373
+
374
+ Blocks until all rows are processed. Returns a result object.
375
+
376
+ ```ruby
377
+ result = SmartCsvImport.process("file.csv", form_class: MyImportRow)
378
+ result.completed? # => true
379
+ ```
380
+
381
+ Use for small files, scripts, or rake tasks.
382
+
383
+ ### Asynchronous
384
+
385
+ Enqueues a background job and returns immediately.
386
+
387
+ ```ruby
388
+ result = SmartCsvImport.process("file.csv", form_class: MyImportRow, mode: :async)
389
+ result.queued? # => true
390
+ result.import_id # => 42
391
+ ```
392
+
393
+ The job runs on the `:smart_csv_import` queue. Requires a queue backend — [Sidekiq](https://github.com/sidekiq/sidekiq), [GoodJob](https://github.com/bensheldon/good_job), or any Active Job adapter. Use for user-facing uploads where you don't want to block a web request.
394
+
395
+ ### Dry run
396
+
397
+ Validates every row without persisting anything.
398
+
399
+ ```ruby
400
+ result = SmartCsvImport.process("file.csv", form_class: MyImportRow, dry_run: true)
401
+ result.dry_run? # => true
402
+ result.imported # => 95 (would succeed)
403
+ result.failed # => 5 (would fail, with errors)
404
+ ```
405
+
406
+ Use to preview results before committing an import.
407
+
408
+ ---
409
+
410
+ ## Advanced
411
+
412
+ ### Header matching only
413
+
414
+ Inspect the raw mapping decisions without processing any rows:
415
+
416
+ ```ruby
417
+ mappings = SmartCsvImport.match_headers("file.csv", form_class: MyImportRow)
418
+ # => {
419
+ # "First Name" => #<MatchResult target_field=:first_name confidence=0.97 strategy="vector">,
420
+ # "Cell" => #<MatchResult target_field=:mobile_phone confidence=0.91 strategy="llm">,
421
+ # "Nickname" => #<UnmatchedResult csv_header="Nickname" attempted_strategies=["vector", "llm"]>
422
+ # }
423
+ ```
424
+
425
+ Useful for building a review UI before committing large imports.
426
+
427
+ ### Import tracking
428
+
429
+ Every `SmartCsvImport.process` call creates a `SmartCsvImport::Import` record:
430
+
431
+ | Status | Meaning |
432
+ |---|---|
433
+ | `pending` | Created, not yet started |
434
+ | `processing` | Actively running |
435
+ | `completed` | All rows processed successfully |
436
+ | `partial_failure` | Some rows failed validation |
437
+ | `failed` | Processing stopped due to a database error |
438
+ | `mapping_review` | A `required:` field could not be matched — no rows were processed |
439
+
440
+ The record also stores the header mappings used, row counts, and a SHA-256 hash of the file. Duplicate file detection compares this hash before processing — a warning is added to the result if a match is found.
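Hash-based duplicate detection of this kind can be sketched with Ruby's standard `digest` library. The helper `duplicate_upload?` and the in-memory `seen` list are hypothetical; the gem stores the hash on the `Import` record, and how it queries for earlier matches is not shown in this README.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper: true when the file's SHA-256 digest is already known.
def duplicate_upload?(path, known_hashes)
  known_hashes.include?(Digest::SHA256.file(path).hexdigest)
end

Tempfile.create("employees") do |file|
  file.write("First Name,Email\nAda,ada@example.com\n")
  file.flush

  seen = []
  first_time = duplicate_upload?(file.path, seen) # nothing recorded yet
  seen << Digest::SHA256.file(file.path).hexdigest
  second_time = duplicate_upload?(file.path, seen) # same bytes, same digest

  puts "first: #{first_time}, second: #{second_time}"
end
```

Because the digest covers the file's bytes, a re-saved file with different line endings or column order would count as a new file under this scheme.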
441
+
442
+ ### Stability analysis
443
+
444
+ After running several imports of the same type, check which header mappings have solidified:
445
+
446
+ ```ruby
447
+ report = SmartCsvImport::StabilityReport.new(import_type: "EmployeeImportRow")
448
+ analysis = report.analyze
449
+
450
+ analysis.imports_analyzed # => 20
451
+ analysis.stable_fields # => fields consistent >= 90% of the time
452
+ analysis.unstable_fields # => fields with varying resolutions
453
+
454
+ puts report.summary
455
+ # Stability report for EmployeeImportRow (20 imports analyzed):
456
+ # Stable fields (3):
457
+ # - First Name -> first_name (100.0% consistent)
458
+ # - Last Name -> last_name (100.0% consistent)
459
+ # - Email -> email (95.0% consistent)
460
+ ```
461
+
462
+ Fields stable at >= 90% are good candidates for promotion to a [Lookup strategy](#lookup-strategy). Doing so eliminates AI calls for those fields entirely on future imports.
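The consistency percentage can be pictured as a vote count over past mappings. The sketch below is an assumption about how such a figure might be computed, not the logic in `stability_report.rb`: `stable_mappings` is a hypothetical helper, and it divides by the total number of imports, so an import that lacks a header counts against that header.

```ruby
# Hypothetical helper: for each CSV header, find the field it resolved to most
# often and keep it when that share of all imports meets min_consistency.
def stable_mappings(mappings_per_import, min_consistency: 0.90)
  votes = Hash.new { |hash, header| hash[header] = Hash.new(0) }
  mappings_per_import.each do |mapping|
    mapping.each { |header, field| votes[header][field] += 1 }
  end

  votes.each_with_object({}) do |(header, counts), stable|
    field, count = counts.max_by { |_, n| n }
    ratio = count.to_f / mappings_per_import.size
    stable[header] = [field, ratio] if ratio >= min_consistency
  end
end

# Twenty toy imports: "Email" resolves differently once, "Cell" flips often.
history = Array.new(20) do |i|
  {
    "First Name" => "first_name",
    "Email"      => (i.zero? ? "work_email" : "email"),
    "Cell"       => (i.even? ? "mobile_phone" : "office_phone")
  }
end

stable = stable_mappings(history)
stable.each do |header, (field, ratio)|
  puts format("%s -> %s (%.1f%% consistent)", header, field, ratio * 100)
end
```

Headers that survive this filter are the ones worth copying into a Lookup table; `"Cell"` does not, because it resolves to two different fields equally often.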
463
+
464
+ ### Normalizers
465
+
466
+ Built-in converters for common CSV data types. Use them in your `save` method:
467
+
468
+ ```ruby
469
+ SmartCsvImport::Normalizers::DateConverter.call("03/15/2024") # => #<Date: 2024-03-15>
470
+ SmartCsvImport::Normalizers::DateConverter.call("2024-03-15") # => #<Date: 2024-03-15>
471
+ SmartCsvImport::Normalizers::BooleanConverter.call("yes") # => true
472
+ SmartCsvImport::Normalizers::BooleanConverter.call("0") # => false
473
+
474
+ # In your form object:
475
+ def save
476
+ Employee.create!(
477
+ name: name,
478
+ hired_on: SmartCsvImport::Normalizers::DateConverter.call(hired_on),
479
+ active: SmartCsvImport::Normalizers::BooleanConverter.call(active)
480
+ )
481
+ true
482
+ rescue ActiveRecord::RecordInvalid
483
+ false
484
+ end
485
+ ```
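For readers curious what converters like these typically do, here is a minimal sketch built on the standard `date` library. It is not the gem's implementation: `to_boolean`, `to_date`, the accepted tokens, and the two date formats are all assumptions chosen to match the examples shown above.

```ruby
require "date"

TRUE_VALUES = %w[true yes y 1].freeze
FALSE_VALUES = %w[false no n 0].freeze

# Hypothetical converter: maps common truthy/falsy tokens, nil when unknown.
def to_boolean(value)
  token = value.to_s.strip.downcase
  return true if TRUE_VALUES.include?(token)
  return false if FALSE_VALUES.include?(token)

  nil
end

# Assumed formats: ISO 8601 first, then US-style month/day/year.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y"].freeze

# Hypothetical converter: tries each format in turn, nil when none fits.
def to_date(value)
  DATE_FORMATS.each do |format|
    begin
      return Date.strptime(value.to_s.strip, format)
    rescue ArgumentError
      next # this format did not fit; try the next one
    end
  end
  nil
end

puts to_date("03/15/2024").iso8601
puts to_boolean("yes").inspect
```

Returning `nil` for unrecognised input leaves the decision to the form object's validations instead of raising in the middle of a row.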
486
+
487
+ ---
488
+
489
+ ## Contributing
490
+
491
+ ### Architecture overview
492
+
493
+ ```
494
+ SmartCsvImport.process(file, form_class:)
495
+ └── Processor
496
+ ├── FileStorage stores file, computes hash, checks duplicates
497
+ ├── Matcher runs strategy chain, returns header → MatchResult map
498
+ │ ├── Strategies::Vector cosine similarity on embeddings (cached)
499
+ │ └── Strategies::Llm structured LLM prompt
500
+ └── (per row) form_class.new(attrs).save
501
+ ```
502
+
503
+ Key files:
504
+
505
+ | File | Purpose |
506
+ |---|---|
507
+ | `lib/smart_csv_import.rb` | Public API: `.process`, `.match_headers`, `.configure` |
508
+ | `lib/smart_csv_import/processor.rb` | Orchestrates matching + row processing + result building |
509
+ | `lib/smart_csv_import/matcher.rb` | Runs the strategy chain, applies value hints |
510
+ | `lib/smart_csv_import/matchable.rb` | `csv_field` DSL mixed into form objects |
511
+ | `lib/smart_csv_import/strategies/` | Vector, LLM, Lookup, and base Strategy class |
512
+ | `lib/smart_csv_import/result.rb` | Result value objects returned from `.process` |
513
+
514
+ The `Configuration` object is a global singleton accessed via `SmartCsvImport.configuration`. The `Engine` class hooks it into Rails' initializer and migration loading.
515
+
516
+ ### Getting started
517
+
518
+ ```bash
519
+ git clone https://github.com/Nroulston/smart_csv_import
520
+ cd smart_csv_import
521
+ bin/setup
522
+ bin/rake # runs the full test suite
523
+ bin/console # IRB with all gem code loaded
524
+ ```
525
+
526
+ ### Design decisions
527
+
528
+ Before sending a PR that touches the matching strategies, read [`ROADMAP.md`](ROADMAP.md). It documents two approaches that were evaluated and explicitly rejected (HyDE, asymmetric embedding augmentation) with the reasoning — understanding why they were ruled out will save you from rediscovering the same dead ends.
529
+
530
+ ---
531
+
532
+ ## License
533
+
534
+ [MIT](LICENSE.adoc)