smart_csv_import 0.1.0
- checksums.yaml +7 -0
- data/LICENSE.adoc +134 -0
- data/README.md +534 -0
- data/app/jobs/smart_csv_import/import_job.rb +22 -0
- data/app/models/smart_csv_import/import.rb +36 -0
- data/app/models/smart_csv_import/import_row_error.rb +17 -0
- data/lib/generators/smart_csv_import/import/import_generator.rb +49 -0
- data/lib/generators/smart_csv_import/import/templates/import_form.rb.tt +32 -0
- data/lib/generators/smart_csv_import/import/templates/import_form_spec.rb.tt +38 -0
- data/lib/generators/smart_csv_import/install/install_generator.rb +34 -0
- data/lib/generators/smart_csv_import/install/templates/create_smart_csv_import_import_row_errors.rb.tt +18 -0
- data/lib/generators/smart_csv_import/install/templates/create_smart_csv_import_imports.rb.tt +23 -0
- data/lib/generators/smart_csv_import/install/templates/initializer.rb.tt +51 -0
- data/lib/generators/smart_csv_import/scaffold/scaffold_generator.rb +56 -0
- data/lib/generators/smart_csv_import/scaffold/templates/controller.rb.tt +33 -0
- data/lib/generators/smart_csv_import/scaffold/templates/new.html.erb.tt +12 -0
- data/lib/generators/smart_csv_import/scaffold/templates/show.html.erb.tt +59 -0
- data/lib/smart_csv_import/configuration.rb +77 -0
- data/lib/smart_csv_import/cosine_similarity.rb +15 -0
- data/lib/smart_csv_import/engine.rb +12 -0
- data/lib/smart_csv_import/failed_row_exporter.rb +78 -0
- data/lib/smart_csv_import/file_storage.rb +34 -0
- data/lib/smart_csv_import/header_normalizer.rb +76 -0
- data/lib/smart_csv_import/logging.rb +37 -0
- data/lib/smart_csv_import/match_result.rb +36 -0
- data/lib/smart_csv_import/matchable.rb +76 -0
- data/lib/smart_csv_import/matcher.rb +198 -0
- data/lib/smart_csv_import/normalizers/boolean_converter.rb +26 -0
- data/lib/smart_csv_import/normalizers/date_converter.rb +28 -0
- data/lib/smart_csv_import/notifications.rb +16 -0
- data/lib/smart_csv_import/processor/csv_preflight_analyzer.rb +74 -0
- data/lib/smart_csv_import/processor/import_result_builder.rb +97 -0
- data/lib/smart_csv_import/processor/mapping_review_policy.rb +90 -0
- data/lib/smart_csv_import/processor/nil_cell_counter.rb +19 -0
- data/lib/smart_csv_import/processor/null_progress_callback.rb +11 -0
- data/lib/smart_csv_import/processor/row_processor.rb +70 -0
- data/lib/smart_csv_import/processor.rb +294 -0
- data/lib/smart_csv_import/result.rb +101 -0
- data/lib/smart_csv_import/stability_report.rb +104 -0
- data/lib/smart_csv_import/strategies/llm.rb +106 -0
- data/lib/smart_csv_import/strategies/lookup.rb +41 -0
- data/lib/smart_csv_import/strategies/vector.rb +155 -0
- data/lib/smart_csv_import/strategy.rb +9 -0
- data/lib/smart_csv_import/strategy_failure.rb +13 -0
- data/lib/smart_csv_import/version.rb +5 -0
- data/lib/smart_csv_import.rb +79 -0
- data/smart_csv_import.gemspec +35 -0
- metadata +216 -0
checksums.yaml
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
1
|
+
---
|
|
2
|
+
SHA256:
|
|
3
|
+
metadata.gz: f663c2506a13b2fd5f495969fe22337c6a5a6408ecf8dc5c9f12805e73c15c65
|
|
4
|
+
data.tar.gz: fe62bc86016e471136c30362659fc28aecdaa8d690a5a772cfcf0c880021efe0
|
|
5
|
+
SHA512:
|
|
6
|
+
metadata.gz: 013d1838ac0a41c353c4f63423a2e891e99509eb71197f956250dafff745ff0e91d6a00402b161acb547c1ed13c87e2e35e236d97fae0fd4b8a3cc30001ff722
|
|
7
|
+
data.tar.gz: 62f20e4e4fe40ebff5f43af35e528fda2e6d8197183521072c95aa3de491248f1d0fbbd5c44a3beb716776d8ec8bf2f95391c9ddba4e49e4d29a5a091713b181
|
data/LICENSE.adoc
ADDED
|
@@ -0,0 +1,134 @@
|
|
|
1
|
+
= Hippocratic License
|
|
2
|
+
|
|
3
|
+
Version: 2.1.0.
|
|
4
|
+
|
|
5
|
+
Purpose. The purpose of this License is for the Licensor named above to
|
|
6
|
+
permit the Licensee (as defined below) broad permission, if consistent
|
|
7
|
+
with Human Rights Laws and Human Rights Principles (as each is defined
|
|
8
|
+
below), to use and work with the Software (as defined below) within the
|
|
9
|
+
full scope of Licensor’s copyright and patent rights, if any, in the
|
|
10
|
+
Software, while ensuring attribution and protecting the Licensor from
|
|
11
|
+
liability.
|
|
12
|
+
|
|
13
|
+
Permission and Conditions. The Licensor grants permission by this
|
|
14
|
+
license ("License"), free of charge, to the extent of Licensor’s
|
|
15
|
+
rights under applicable copyright and patent law, to any person or
|
|
16
|
+
entity (the "Licensee") obtaining a copy of this software and
|
|
17
|
+
associated documentation files (the "Software"), to do everything with
|
|
18
|
+
the Software that would otherwise infringe (i) the Licensor’s copyright
|
|
19
|
+
in the Software or (ii) any patent claims to the Software that the
|
|
20
|
+
Licensor can license or becomes able to license, subject to all of the
|
|
21
|
+
following terms and conditions:
|
|
22
|
+
|
|
23
|
+
* Acceptance. This License is automatically offered to every person and
|
|
24
|
+
entity subject to its terms and conditions. Licensee accepts this
|
|
25
|
+
License and agrees to its terms and conditions by taking any action with
|
|
26
|
+
the Software that, absent this License, would infringe any intellectual
|
|
27
|
+
property right held by Licensor.
|
|
28
|
+
* Notice. Licensee must ensure that everyone who gets a copy of any part
|
|
29
|
+
of this Software from Licensee, with or without changes, also receives
|
|
30
|
+
the License and the above copyright notice (and if included by the
|
|
31
|
+
Licensor, patent, trademark and attribution notice). Licensee must cause
|
|
32
|
+
any modified versions of the Software to carry prominent notices stating
|
|
33
|
+
that Licensee changed the Software. For clarity, although Licensee is
|
|
34
|
+
free to create modifications of the Software and distribute only the
|
|
35
|
+
modified portion created by Licensee with additional or different terms,
|
|
36
|
+
the portion of the Software not modified must be distributed pursuant to
|
|
37
|
+
this License. If anyone notifies Licensee in writing that Licensee has
|
|
38
|
+
not complied with this Notice section, Licensee can keep this License by
|
|
39
|
+
taking all practical steps to comply within 30 days after the notice. If
|
|
40
|
+
Licensee does not do so, Licensee’s License (and all rights licensed
|
|
41
|
+
hereunder) shall end immediately.
|
|
42
|
+
* Compliance with Human Rights Principles and Human Rights Laws.
|
|
43
|
+
[arabic]
|
|
44
|
+
. Human Rights Principles.
|
|
45
|
+
[loweralpha]
|
|
46
|
+
.. Licensee is advised to consult the articles of the United Nations
|
|
47
|
+
Universal Declaration of Human Rights and the United Nations Global
|
|
48
|
+
Compact that define recognized principles of international human rights
|
|
49
|
+
(the "Human Rights Principles"). Licensee shall use the Software in a
|
|
50
|
+
manner consistent with Human Rights Principles.
|
|
51
|
+
.. Unless the Licensor and Licensee agree otherwise, any dispute,
|
|
52
|
+
controversy, or claim arising out of or relating to (i) Section 1(a)
|
|
53
|
+
regarding Human Rights Principles, including the breach of Section 1(a),
|
|
54
|
+
termination of this License for breach of the Human Rights Principles,
|
|
55
|
+
or invalidity of Section 1(a) or (ii) a determination of whether any Law
|
|
56
|
+
is consistent or in conflict with Human Rights Principles pursuant to
|
|
57
|
+
Section 2, below, shall be settled by arbitration in accordance with the
|
|
58
|
+
Hague Rules on Business and Human Rights Arbitration (the "Rules");
|
|
59
|
+
provided, however, that Licensee may elect not to participate in such
|
|
60
|
+
arbitration, in which event this License (and all rights licensed
|
|
61
|
+
hereunder) shall end immediately. The number of arbitrators shall be one
|
|
62
|
+
unless the Rules require otherwise.
|
|
63
|
+
+
|
|
64
|
+
Unless both the Licensor and Licensee agree to the contrary: (1) All
|
|
65
|
+
documents and information concerning the arbitration shall be public and
|
|
66
|
+
may be disclosed by any party; (2) The repository referred to under
|
|
67
|
+
Article 43 of the Rules shall make available to the public in a timely
|
|
68
|
+
manner all documents concerning the arbitration which are communicated
|
|
69
|
+
to it, including all submissions of the parties, all evidence admitted
|
|
70
|
+
into the record of the proceedings, all transcripts or other recordings
|
|
71
|
+
of hearings and all orders, decisions and awards of the arbitral
|
|
72
|
+
tribunal, subject only to the arbitral tribunal’s powers to take such
|
|
73
|
+
measures as may be necessary to safeguard the integrity of the arbitral
|
|
74
|
+
process pursuant to Articles 18, 33, 41 and 42 of the Rules; and (3)
|
|
75
|
+
Article 26(6) of the Rules shall not apply.
|
|
76
|
+
. Human Rights Laws. The Software shall not be used by any person or
|
|
77
|
+
entity for any systems, activities, or other uses that violate any Human
|
|
78
|
+
Rights Laws. "Human Rights Laws" means any applicable laws,
|
|
79
|
+
regulations, or rules (collectively, "Laws") that protect human,
|
|
80
|
+
civil, labor, privacy, political, environmental, security, economic, due
|
|
81
|
+
process, or similar rights; provided, however, that such Laws are
|
|
82
|
+
consistent and not in conflict with Human Rights Principles (a dispute
|
|
83
|
+
over the consistency or a conflict between Laws and Human Rights
|
|
84
|
+
Principles shall be determined by arbitration as stated above). Where
|
|
85
|
+
the Human Rights Laws of more than one jurisdiction are applicable or in
|
|
86
|
+
conflict with respect to the use of the Software, the Human Rights Laws
|
|
87
|
+
that are most protective of the individuals or groups harmed shall
|
|
88
|
+
apply.
|
|
89
|
+
. Indemnity. Licensee shall hold harmless and indemnify Licensor (and
|
|
90
|
+
any other contributor) against all losses, damages, liabilities,
|
|
91
|
+
deficiencies, claims, actions, judgments, settlements, interest, awards,
|
|
92
|
+
penalties, fines, costs, or expenses of whatever kind, including
|
|
93
|
+
Licensor’s reasonable attorneys’ fees, arising out of or relating to
|
|
94
|
+
Licensee’s use of the Software in violation of Human Rights Laws or
|
|
95
|
+
Human Rights Principles.
|
|
96
|
+
* Failure to Comply. Any failure of Licensee to act according to the
|
|
97
|
+
terms and conditions of this License is both a breach of the License and
|
|
98
|
+
an infringement of the intellectual property rights of the Licensor
|
|
99
|
+
(subject to exceptions under Laws, e.g., fair use). In the event of a
|
|
100
|
+
breach or infringement, the terms and conditions of this License may be
|
|
101
|
+
enforced by Licensor under the Laws of any jurisdiction to which
|
|
102
|
+
Licensee is subject. Licensee also agrees that the Licensor may enforce
|
|
103
|
+
the terms and conditions of this License against Licensee through
|
|
104
|
+
specific performance (or similar remedy under Laws) to the extent
|
|
105
|
+
permitted by Laws. For clarity, except in the event of a breach of this
|
|
106
|
+
License, infringement, or as otherwise stated in this License, Licensor
|
|
107
|
+
may not terminate this License with Licensee.
|
|
108
|
+
* Enforceability and Interpretation. If any term or provision of this
|
|
109
|
+
License is determined to be invalid, illegal, or unenforceable by a
|
|
110
|
+
court of competent jurisdiction, then such invalidity, illegality, or
|
|
111
|
+
unenforceability shall not affect any other term or provision of this
|
|
112
|
+
License or invalidate or render unenforceable such term or provision in
|
|
113
|
+
any other jurisdiction; provided, however, subject to a court
|
|
114
|
+
modification pursuant to the immediately following sentence, if any term
|
|
115
|
+
or provision of this License pertaining to Human Rights Laws or Human
|
|
116
|
+
Rights Principles is deemed invalid, illegal, or unenforceable against
|
|
117
|
+
Licensee by a court of competent jurisdiction, all rights in the
|
|
118
|
+
Software granted to Licensee shall be deemed null and void as between
|
|
119
|
+
Licensor and Licensee. Upon a determination that any term or provision
|
|
120
|
+
is invalid, illegal, or unenforceable, to the extent permitted by Laws,
|
|
121
|
+
the court may modify this License to affect the original purpose that
|
|
122
|
+
the Software be used in compliance with Human Rights Principles and
|
|
123
|
+
Human Rights Laws as closely as possible. The language in this License
|
|
124
|
+
shall be interpreted as to its fair meaning and not strictly for or
|
|
125
|
+
against any party.
|
|
126
|
+
* Disclaimer. TO THE FULL EXTENT ALLOWED BY LAW, THIS SOFTWARE COMES
|
|
127
|
+
"AS IS," WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, AND LICENSOR AND
|
|
128
|
+
ANY OTHER CONTRIBUTOR SHALL NOT BE LIABLE TO ANYONE FOR ANY DAMAGES OR
|
|
129
|
+
OTHER LIABILITY ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE
|
|
130
|
+
OR THIS LICENSE, UNDER ANY KIND OF LEGAL CLAIM.
|
|
131
|
+
|
|
132
|
+
This Hippocratic License is an link:https://ethicalsource.dev[Ethical Source license] and is offered
|
|
133
|
+
for use by licensors and licensees at their own risk, on an "AS IS" basis, and with no warranties
|
|
134
|
+
express or implied, to the maximum extent permitted by Laws.
|
data/README.md
ADDED
|
@@ -0,0 +1,534 @@
|
|
|
1
|
+
# SmartCsvImport
|
|
2
|
+
|
|
3
|
+
[CI](https://github.com/Nroulston/smart_csv_import/actions/workflows/ci.yml)
|
|
4
|
+
[Gem Version](https://badge.fury.io/rb/smart_csv_import)
|
|
5
|
+
[License](LICENSE.adoc)
|
|
6
|
+
|
|
7
|
+
A Rails Engine for CSV importing with AI-powered header matching. Drop in a form object, describe your fields in plain English, and SmartCsvImport automatically maps whatever column names show up in the CSV — messy, abbreviated, or domain-specific — to your fields. No brittle column-name checks to maintain.
|
|
8
|
+
|
|
9
|
+
Matching uses a two-tier strategy: **vector similarity** (fast, cached embeddings) handles the common case, and **LLM fallback** handles ambiguous headers that need cross-field reasoning. Row data is never sent to an AI service — only headers and field descriptions.
|
|
10
|
+
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
## Table of Contents
|
|
14
|
+
|
|
15
|
+
- [Quickstart](#quickstart)
|
|
16
|
+
- [Reviewing Failed Rows](#reviewing-failed-rows)
|
|
17
|
+
- [Form Object DSL](#form-object-dsl)
|
|
18
|
+
- [Configuration](#configuration)
|
|
19
|
+
- [Matching Strategies](#matching-strategies)
|
|
20
|
+
- [Processing Modes](#processing-modes)
|
|
21
|
+
- [Advanced](#advanced)
|
|
22
|
+
- [Contributing](#contributing)
|
|
23
|
+
|
|
24
|
+
---
|
|
25
|
+
|
|
26
|
+
## Quickstart
|
|
27
|
+
|
|
28
|
+
### 1. Configure an AI provider
|
|
29
|
+
|
|
30
|
+
SmartCsvImport delegates all AI calls to [ruby_llm](https://github.com/crmne/ruby_llm). Configure it with your provider credentials before using vector or LLM matching strategies.
|
|
31
|
+
|
|
32
|
+
```ruby
|
|
33
|
+
# config/initializers/ruby_llm.rb
|
|
34
|
+
RubyLLM.configure do |config|
|
|
35
|
+
# OpenAI
|
|
36
|
+
config.openai_api_key = ENV["OPENAI_API_KEY"]
|
|
37
|
+
|
|
38
|
+
# — or Anthropic —
|
|
39
|
+
config.anthropic_api_key = ENV["ANTHROPIC_API_KEY"]
|
|
40
|
+
|
|
41
|
+
# — or Google (Gemini embedding is free tier, good for getting started) —
|
|
42
|
+
config.gemini_api_key = ENV["GEMINI_API_KEY"]
|
|
43
|
+
end
|
|
44
|
+
```
|
|
45
|
+
|
|
46
|
+
> **Free tier option:** The default models (`gemini-embedding-001` for embeddings, `claude-haiku-4-5-20251001` for LLM fallback) work well for development and small workloads. A Google AI Studio key gives you free Gemini embeddings. See [Model Selection](#model-selection) for upgrading.
|
|
47
|
+
|
|
48
|
+
### 2. Install the gem
|
|
49
|
+
|
|
50
|
+
```ruby
|
|
51
|
+
# Gemfile
|
|
52
|
+
gem "smart_csv_import"
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
```bash
|
|
56
|
+
bundle install
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
### 3. Run the install generator
|
|
60
|
+
|
|
61
|
+
```bash
|
|
62
|
+
rails generate smart_csv_import:install
|
|
63
|
+
```
|
|
64
|
+
|
|
65
|
+
This creates:
|
|
66
|
+
|
|
67
|
+
- `config/initializers/smart_csv_import.rb` — configuration with all options commented out at their defaults
|
|
68
|
+
- `db/migrate/YYYYMMDDHHMMSS_create_smart_csv_import_imports.rb` and `db/migrate/YYYYMMDDHHMMSS_create_smart_csv_import_import_row_errors.rb`: import tracking and row-error tables
|
|
69
|
+
- `tmp/smart_csv_import/` and `tmp/smart_csv_import/embeddings_cache/` — storage directories
|
|
70
|
+
|
|
71
|
+
```bash
|
|
72
|
+
rails db:migrate
|
|
73
|
+
```
|
|
74
|
+
|
|
75
|
+
### 4. Generate an import form
|
|
76
|
+
|
|
77
|
+
```bash
|
|
78
|
+
rails generate smart_csv_import:import Employee first_name last_name email
|
|
79
|
+
```
|
|
80
|
+
|
|
81
|
+
This creates `app/forms/employee_import_row.rb`:
|
|
82
|
+
|
|
83
|
+
```ruby
|
|
84
|
+
class EmployeeImportRow
|
|
85
|
+
include ActiveModel::Validations
|
|
86
|
+
include SmartCsvImport::Matchable
|
|
87
|
+
|
|
88
|
+
attr_accessor :first_name, :last_name, :email
|
|
89
|
+
|
|
90
|
+
# The description is what the AI matches against — be specific.
|
|
91
|
+
# "Employee first/given name" beats "first name" for ambiguous CSVs.
|
|
92
|
+
csv_field :first_name, description: "Employee first/given name", required: true
|
|
93
|
+
csv_field :last_name, description: "Employee last/family name", required: true
|
|
94
|
+
csv_field :email, description: "Employee email address"
|
|
95
|
+
|
|
96
|
+
validates :first_name, :last_name, presence: true
|
|
97
|
+
|
|
98
|
+
def save
|
|
99
|
+
Employee.create!(first_name: first_name, last_name: last_name, email: email)
|
|
100
|
+
true
|
|
101
|
+
rescue ActiveRecord::RecordInvalid
|
|
102
|
+
false
|
|
103
|
+
end
|
|
104
|
+
end
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
Edit the `save` method to persist however your app needs. SmartCsvImport calls `save` once per row.
|
|
108
|
+
|
|
109
|
+
### 5. Process a CSV
|
|
110
|
+
|
|
111
|
+
```ruby
|
|
112
|
+
result = SmartCsvImport.process("path/to/employees.csv", form_class: EmployeeImportRow)
|
|
113
|
+
|
|
114
|
+
result.completed? # => true
|
|
115
|
+
result.imported # => 98
|
|
116
|
+
result.failed # => 2
|
|
117
|
+
result.total # => 100
|
|
118
|
+
result.errors # => [#<RowError row=42 column=:email messages=["is invalid"]>]
|
|
119
|
+
result.warnings # => [#<RowWarning message="Column 'Nickname' was not mapped">]
|
|
120
|
+
result.header_mappings # => {"First Name" => "first_name", "Surname" => "last_name", ...}
|
|
121
|
+
```
|
|
122
|
+
|
|
123
|
+
That's it. SmartCsvImport reads the CSV headers, maps them to your fields, and calls `save` on a form object for each row.
|
|
124
|
+
|
|
125
|
+
---
|
|
126
|
+
|
|
127
|
+
## Reviewing Failed Rows
|
|
128
|
+
|
|
129
|
+
Every row that fails — whether due to CSV parsing (malformed quotes, encoding issues) or form validation — is captured as structured data so you can build whatever review experience you want: a paginated admin UI, a corrective re-export, a retry workflow.
|
|
130
|
+
|
|
131
|
+
### In-memory: the `Result` object (sync callers)
|
|
132
|
+
|
|
133
|
+
```ruby
|
|
134
|
+
result = SmartCsvImport.process("employees.csv", form_class: EmployeeImportRow)
|
|
135
|
+
|
|
136
|
+
result.errors # => [#<RowError row=42 column=:email messages=["is invalid"]>, ...]
|
|
137
|
+
result.parse_errors # => [#<ParseError line_number=87 raw_line='"Bob,Jones...' error_message="Unclosed quoted field">, ...]
|
|
138
|
+
```
|
|
139
|
+
|
|
140
|
+
- `RowError(row, column, messages)` — one per field that failed validation. A single CSV row that fails on multiple columns produces multiple `RowError` records.
|
|
141
|
+
- `ParseError(line_number, raw_line, error_message)` — one per CSV row that couldn't be parsed at all.
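Because one CSV row can produce several `RowError` records, a review screen will usually want them grouped per row. The following is a minimal sketch of that grouping, not part of the gem: it uses a stand-in `Struct` (the `row_error` variable and the sample data are hypothetical) that carries only the three attributes listed above.

```ruby
# Stand-in for the gem's RowError; only the three documented attributes are assumed.
row_error = Struct.new(:row, :column, :messages, keyword_init: true)

errors = [
  row_error.new(row: 42, column: :email, messages: ["is invalid"]),
  row_error.new(row: 42, column: :first_name, messages: ["can't be blank"]),
  row_error.new(row: 57, column: :email, messages: ["is invalid"])
]

# Collapse per-field errors into one entry per CSV row:
# { row_number => { column => messages } }
errors_by_row = errors.group_by(&:row).transform_values do |row_errors|
  row_errors.to_h { |error| [error.column, error.messages] }
end

errors_by_row.each do |row, columns|
  summary = columns.map { |column, messages| "#{column} #{messages.join(', ')}" }.join("; ")
  puts "Row #{row}: #{summary}"
end
```

With a real `Result`, the same `group_by` call should apply to `result.errors`, assuming each element responds to `row`, `column`, and `messages` as described above.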
|
|
142
|
+
|
|
143
|
+
### Persistent: the `Import#row_errors` association (async, later, or both)
|
|
144
|
+
|
|
145
|
+
Every failure is persisted to the `smart_csv_import_import_row_errors` table, giving you an ActiveRecord association to query at any time — including after an async `ImportJob` has completed and the in-memory `Result` is gone.
|
|
146
|
+
|
|
147
|
+
```ruby
|
|
148
|
+
import = SmartCsvImport::Import.find(import_id)
|
|
149
|
+
|
|
150
|
+
import.row_errors.count # => 7
|
|
151
|
+
import.row_errors.validation_errors.count # => 5
|
|
152
|
+
import.row_errors.parse_errors.count # => 2
|
|
153
|
+
|
|
154
|
+
# Pagination, filtering, grouping — all plain ActiveRecord:
|
|
155
|
+
import.row_errors.validation_errors.where(column_name: "email").limit(50).offset(100)
|
|
156
|
+
import.row_errors.group(:error_type, :column_name).count
|
|
157
|
+
```
|
|
158
|
+
|
|
159
|
+
Each `SmartCsvImport::ImportRowError` record:
|
|
160
|
+
|
|
161
|
+
| Field | Type | Populated for | Description |
|
|
162
|
+
|---|---|---|---|
|
|
163
|
+
| `import_id` | integer | both | FK to `smart_csv_import_imports` (cascade delete) |
|
|
164
|
+
| `row_number` | integer | both | Physical CSV line number (1-indexed, headers on line 1) |
|
|
165
|
+
| `error_type` | string | both | `"validation"` or `"parse"` |
|
|
166
|
+
| `column_name` | string | validation | Form field that failed (e.g. `"email"`) |
|
|
167
|
+
| `messages` | json (array) | validation | Error messages from the form object (e.g. `["is invalid"]`) |
|
|
168
|
+
| `raw_line` | text | parse | Literal CSV row text that failed to parse |
|
|
169
|
+
| `error_message` | text | parse | Parser's description of the failure |
|
|
170
|
+
|
|
171
|
+
### Downloadable "fix and re-upload" CSV
|
|
172
|
+
|
|
173
|
+
Re-export failed rows to a CSV with the original headers plus an `_error` column — drop it in front of a user, let them fix the rows, and re-upload.
|
|
174
|
+
|
|
175
|
+
```ruby
|
|
176
|
+
# From a sync Result:
|
|
177
|
+
output_path = SmartCsvImport::FailedRowExporter.new(
|
|
178
|
+
result: result,
|
|
179
|
+
csv_path: "path/to/original.csv"
|
|
180
|
+
).call
|
|
181
|
+
|
|
182
|
+
# Or from a persisted Import (e.g. after an async job):
|
|
183
|
+
output_path = SmartCsvImport::FailedRowExporter.new(
|
|
184
|
+
import: SmartCsvImport::Import.find(import_id),
|
|
185
|
+
csv_path: "path/to/original.csv"
|
|
186
|
+
).call
|
|
187
|
+
# => "tmp/smart_csv_import/failed_rows/20260423142530_failed.csv"
|
|
188
|
+
```
|
|
189
|
+
|
|
190
|
+
The exporter writes only validation failures — parse errors keep their `raw_line` intact on the `row_errors` record, which is usually more useful for inspecting malformed CSV than re-exporting.
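To make the "original headers plus `_error`" shape concrete, here is a rough sketch built only on Ruby's standard `csv` library. It is not the gem's `FailedRowExporter`; the sample data, the `errors_by_line` hash, and the way messages are joined are all assumptions chosen for illustration.

```ruby
require "csv"

# Hypothetical original upload.
original_csv = <<~DATA
  First Name,Surname,Email
  Ada,Lovelace,ada@example.com
  Bob,Jones,not-an-email
DATA

# Hypothetical validation failures keyed by physical line number
# (1-indexed, headers on line 1).
errors_by_line = { 3 => ["email is invalid"] }

table = CSV.parse(original_csv, headers: true)

failed_csv = CSV.generate do |out|
  out << (table.headers + ["_error"])
  table.each_with_index do |row, index|
    # Data starts on line 2; this simple offset assumes no multi-line fields.
    line_number = index + 2
    messages = errors_by_line[line_number]
    next unless messages

    out << (row.fields + [messages.join("; ")])
  end
end

puts failed_csv
```

Only the failing row is written, with its original values untouched and the reason appended, so a user can correct the file and upload it again.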
|
|
191
|
+
|
|
192
|
+
---
|
|
193
|
+
|
|
194
|
+
## Form Object DSL
|
|
195
|
+
|
|
196
|
+
Include `SmartCsvImport::Matchable` in any class with `attr_accessor` declarations and use `csv_field` to register fields for matching.
|
|
197
|
+
|
|
198
|
+
### `csv_field`
|
|
199
|
+
|
|
200
|
+
```ruby
|
|
201
|
+
csv_field :field_name, description: "...", required: false
|
|
202
|
+
```
|
|
203
|
+
|
|
204
|
+
| Parameter | Required | Description |
|
|
205
|
+
|---|---|---|
|
|
206
|
+
| `name` | yes | Must match an `attr_accessor` on the class |
|
|
207
|
+
| `description:` | yes | Plain-English description — this is what the AI matches CSV headers against |
|
|
208
|
+
| `required:` | no | When `true`, a failed match transitions the import to `mapping_review` instead of processing rows (see [Import Tracking](#import-tracking)) |
|
|
209
|
+
|
|
210
|
+
**Write good descriptions.** The description is the only signal the AI has. Vague descriptions produce weaker matches.
|
|
211
|
+
|
|
212
|
+
```ruby
|
|
213
|
+
# Weaker — too generic
|
|
214
|
+
csv_field :amount, description: "amount"
|
|
215
|
+
|
|
216
|
+
# Stronger — specific and unambiguous
|
|
217
|
+
csv_field :amount, description: "Invoice total amount in dollars"
|
|
218
|
+
```
|
|
219
|
+
|
|
220
|
+
### `csv_source` and `csv_context`
|
|
221
|
+
|
|
222
|
+
Optional class-level hints that give the LLM domain knowledge for disambiguating headers it can't resolve from descriptions alone.
|
|
223
|
+
|
|
224
|
+
```ruby
|
|
225
|
+
class EmployeeImportRow
|
|
226
|
+
include SmartCsvImport::Matchable
|
|
227
|
+
|
|
228
|
+
# Where the CSV comes from
|
|
229
|
+
csv_source "ADP Workforce payroll export"
|
|
230
|
+
|
|
231
|
+
# The business domain of your app
|
|
232
|
+
csv_context "HR platform for staffing agencies"
|
|
233
|
+
|
|
234
|
+
csv_field :mobile_phone, description: "Employee mobile phone number"
|
|
235
|
+
# ...
|
|
236
|
+
end
|
|
237
|
+
```
|
|
238
|
+
|
|
239
|
+
Without context, the LLM sees "Cell" and has no way to know if it means a mobile number, a prison cell, or a biological cell. With `csv_source` and `csv_context`, it can reason correctly.
|
|
240
|
+
|
|
241
|
+
---
|
|
242
|
+
|
|
243
|
+
## Configuration
|
|
244
|
+
|
|
245
|
+
```ruby
|
|
246
|
+
# config/initializers/smart_csv_import.rb
|
|
247
|
+
SmartCsvImport.configure do |config|
|
|
248
|
+
config.confidence_threshold = 0.80
|
|
249
|
+
config.batch_size = 500
|
|
250
|
+
config.storage_path = "tmp/smart_csv_import"
|
|
251
|
+
config.default_strategy = :vector
|
|
252
|
+
config.llm_model = "claude-haiku-4-5-20251001"
|
|
253
|
+
config.embedding_model = "gemini-embedding-001"
|
|
254
|
+
config.value_hint_rows = 5
|
|
255
|
+
end
|
|
256
|
+
```
|
|
257
|
+
|
|
258
|
+
| Option | Default | Description |
|
|
259
|
+
|---|---|---|
|
|
260
|
+
| `confidence_threshold` | `0.80` | Minimum confidence score (cosine similarity for the vector strategy) required to accept a vector or LLM match. Headers below this threshold fall through to the next strategy tier or become unmatched. |
|
|
261
|
+
| `batch_size` | `500` | How often (in rows) the import record is updated with progress counts during processing. |
|
|
262
|
+
| `storage_path` | `"tmp/smart_csv_import"` | Root directory for stored CSV files and the embedding cache. |
|
|
263
|
+
| `default_strategy` | `:vector` | Which strategy tier to start with when no custom strategy is set on the form class. |
|
|
264
|
+
| `llm_model` | `"claude-haiku-4-5-20251001"` | The LLM used for fallback matching. Any model supported by ruby_llm can be used. |
|
|
265
|
+
| `embedding_model` | `"gemini-embedding-001"` | The embedding model used by the vector strategy. |
|
|
266
|
+
| `value_hint_rows` | `5` | Number of sample rows inspected to apply value-based confidence adjustments (e.g. boosting confidence when cell values look like dates or emails). |
|
|
267
|
+
|
|
268
|
+
### Model selection
|
|
269
|
+
|
|
270
|
+
Stronger models tend to match ambiguous or domain-specific headers more accurately. The defaults are suitable for development and light workloads. To upgrade:
|
|
271
|
+
|
|
272
|
+
```ruby
|
|
273
|
+
config.embedding_model = "text-embedding-3-large" # OpenAI — higher dimensionality
|
|
274
|
+
config.llm_model = "claude-sonnet-4-6" # Anthropic — stronger reasoning
|
|
275
|
+
```
|
|
276
|
+
|
|
277
|
+
Any model listed in the [ruby_llm documentation](https://github.com/crmne/ruby_llm) works, provided the API key for its provider is set in `RubyLLM.configure`. No other changes are needed.
|
|
278
|
+
|
|
279
|
+
---
|
|
280
|
+
|
|
281
|
+
## Matching Strategies
|
|
282
|
+
|
|
283
|
+
### How the fallback chain works
|
|
284
|
+
|
|
285
|
+
For each unmatched header, SmartCsvImport tries strategies in order and accepts the first result that meets the confidence threshold:
|
|
286
|
+
|
|
287
|
+
```
|
|
288
|
+
CSV headers
|
|
289
|
+
│
|
|
290
|
+
▼
|
|
291
|
+
Custom strategy (if set on form class)
|
|
292
|
+
│ unmatched or below threshold
|
|
293
|
+
▼
|
|
294
|
+
Vector strategy (embedding cosine similarity)
|
|
295
|
+
│ unmatched or below threshold
|
|
296
|
+
▼
|
|
297
|
+
LLM strategy (structured prompt)
|
|
298
|
+
│ unmatched
|
|
299
|
+
▼
|
|
300
|
+
UnmatchedResult → warning on the result object
|
|
301
|
+
```
|
|
302
|
+
|
|
303
|
+
Once a header is matched, it does not pass to the next tier. A header that passes through all three tiers without a match becomes an `UnmatchedResult` and generates a warning; it does not cause the import to fail.
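The chain above can be pictured with a small, self-contained sketch. It is a simplification and not the gem's code: each tier here is a hypothetical lambda that answers for one header at a time, whereas the real strategies receive the full header list. The tier contents and the `resolve` helper are made up for illustration.

```ruby
threshold = 0.80

# Each hypothetical tier returns [field, confidence] for a header, or nil when it has no guess.
tiers = {
  "lookup" => ->(header) { { "EMP_ID" => [:employee_id, 1.0] }[header] },
  "vector" => ->(header) { { "First Name" => [:first_name, 0.97], "Cell" => [:mobile_phone, 0.55] }[header] },
  "llm"    => ->(header) { { "Cell" => [:mobile_phone, 0.91] }[header] }
}

# Try tiers in order and accept the first result that meets the threshold.
def resolve(header, tiers, threshold)
  tiers.each do |name, tier|
    field, confidence = tier.call(header)
    next if field.nil? || confidence < threshold

    return { field: field, confidence: confidence, strategy: name }
  end
  # No tier produced a confident match: report a warning instead of failing.
  { field: nil, warning: "Column '#{header}' was not mapped" }
end

["EMP_ID", "First Name", "Cell", "Nickname"].each do |header|
  puts "#{header}: #{resolve(header, tiers, threshold)}"
end
```

In this sketch "Cell" scores 0.55 in the vector tier, which is below the threshold, so it falls through and is accepted from the LLM tier, while "Nickname" ends up unmatched with a warning.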
|
|
304
|
+
|
|
305
|
+
### Vector strategy
|
|
306
|
+
|
|
307
|
+
Computes embeddings for your field descriptions and the incoming CSV headers, then accepts the highest-scoring mutual match above the confidence threshold.
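As an illustration of the idea, the sketch below computes cosine similarity over toy three-dimensional vectors and keeps only mutual best matches above the threshold. It is not the gem's implementation: the vectors are invented, and "mutual" is interpreted here as "the header's best field also has that header as its best header", which is an assumption.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |y| y * y })
  norm.zero? ? 0.0 : dot / norm
end

# Toy "embeddings"; real ones would come from the configured embedding model.
header_vectors = {
  "First Name" => [0.9, 0.1, 0.0],
  "Surname"    => [0.1, 0.9, 0.0],
  "Nickname"   => [0.6, 0.0, 0.8]
}
field_vectors = {
  first_name: [1.0, 0.0, 0.0],
  last_name:  [0.0, 1.0, 0.0]
}
threshold = 0.80

# scores[header][field] = similarity between that header and that field description.
scores = header_vectors.to_h do |header, header_vector|
  [header, field_vectors.transform_values { |field_vector| cosine_similarity(header_vector, field_vector) }]
end

matches = {}
header_vectors.each_key do |header|
  best_field, score = scores[header].max_by { |_, s| s }
  # Mutual check: this header must also be the best-scoring header for that field.
  best_header = header_vectors.keys.max_by { |h| scores[h][best_field] }
  matches[header] = best_field if best_header == header && score >= threshold
end

puts matches.inspect
```

"Nickname" is closest to `first_name`, but "First Name" scores higher for that field and the similarity is below the threshold, so "Nickname" stays unmatched and would move on to the next tier.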
|
|
308
|
+
|
|
309
|
+
Field embeddings are cached to disk (keyed by your field definitions) so the API is only called once per unique set of fields — subsequent imports of the same type are fast.
|
|
310
|
+
|
|
311
|
+
Only headers and field descriptions are sent to the embedding API. Row data is never transmitted.
|
|
312
|
+
|
|
313
|
+
### LLM strategy
|
|
314
|
+
|
|
315
|
+
Fires for headers that the vector strategy couldn't match confidently. Sends all remaining headers and all field definitions together in a single prompt, letting the LLM reason across the full set:
|
|
316
|
+
|
|
317
|
+
> "Cell" next to `first_name`, `last_name`, `email` → clearly a phone number.
|
|
318
|
+
> "Cell" in isolation → ambiguous.
|
|
319
|
+
|
|
320
|
+
Cross-field context is what makes this effective. Only headers and descriptions are sent — never row data.
|
|
321
|
+
|
|
322
|
+
### Lookup strategy
|
|
323
|
+
|
|
324
|
+
Zero AI. For systems with fixed, known column names.
|
|
325
|
+
|
|
326
|
+
```ruby
|
|
327
|
+
class HrSystemMapping < SmartCsvImport::Strategies::Lookup
|
|
328
|
+
mappings(
|
|
329
|
+
"EMP_ID" => :employee_id,
|
|
330
|
+
"FNAME" => :first_name,
|
|
331
|
+
"LNAME" => :last_name,
|
|
332
|
+
"DOB" => :date_of_birth
|
|
333
|
+
)
|
|
334
|
+
end
|
|
335
|
+
|
|
336
|
+
class EmployeeImportRow
|
|
337
|
+
include SmartCsvImport::Matchable
|
|
338
|
+
|
|
339
|
+
self.matching_strategy = HrSystemMapping.new
|
|
340
|
+
# ...
|
|
341
|
+
end
|
|
342
|
+
```
|
|
343
|
+
|
|
344
|
+
Because a custom strategy runs first in the chain, matches from the Lookup table skip vector and LLM entirely.
|
|
345
|
+
|
|
346
|
+
### Custom strategy
|
|
347
|
+
|
|
348
|
+
Subclass `SmartCsvImport::Strategy` and implement `match`:
|
|
349
|
+
|
|
350
|
+
```ruby
|
|
351
|
+
class MyStrategy < SmartCsvImport::Strategy
|
|
352
|
+
def match(csv_headers:, form_class:, sample_rows: [])
|
|
353
|
+
csv_headers.each_with_object({}) do |header, results|
|
|
354
|
+
next unless header.downcase == "emp_id"
|
|
355
|
+
|
|
356
|
+
results[header] = SmartCsvImport::MatchResult.matched(
|
|
357
|
+
target_field: :employee_id,
|
|
358
|
+
confidence: 1.0,
|
|
359
|
+
strategy_name: "my_strategy"
|
|
360
|
+
)
|
|
361
|
+
end
|
|
362
|
+
end
|
|
363
|
+
end
|
|
364
|
+
```
|
|
365
|
+
|
|
366
|
+
Return only the headers your strategy can confidently match. Headers you omit fall through to the next tier.
|
|
367
|
+
|
|
368
|
+
---
|
|
369
|
+
|
|
370
|
+
## Processing Modes
|
|
371
|
+
|
|
372
|
+
### Synchronous (default)
|
|
373
|
+
|
|
374
|
+
Blocks until all rows are processed. Returns a result object.
|
|
375
|
+
|
|
376
|
+
```ruby
|
|
377
|
+
result = SmartCsvImport.process("file.csv", form_class: MyImportRow)
|
|
378
|
+
result.completed? # => true
|
|
379
|
+
```
|
|
380
|
+
|
|
381
|
+
Use for small files, scripts, or rake tasks.
|
|
382
|
+
|
|
383
|
+
### Asynchronous
|
|
384
|
+
|
|
385
|
+
Enqueues a background job and returns immediately.
|
|
386
|
+
|
|
387
|
+
```ruby
|
|
388
|
+
result = SmartCsvImport.process("file.csv", form_class: MyImportRow, mode: :async)
|
|
389
|
+
result.queued? # => true
|
|
390
|
+
result.import_id # => 42
|
|
391
|
+
```
|
|
392
|
+
|
|
393
|
+
The job runs on the `:smart_csv_import` queue. Requires a queue backend — [Sidekiq](https://github.com/sidekiq/sidekiq), [GoodJob](https://github.com/bensheldon/good_job), or any Active Job adapter. Use for user-facing uploads where you don't want to block a web request.
### Dry run

Validates every row without persisting anything.

```ruby
result = SmartCsvImport.process("file.csv", form_class: MyImportRow, dry_run: true)
result.dry_run?  # => true
result.imported  # => 95 (would succeed)
result.failed    # => 5 (would fail, with errors)
```

Use to preview results before committing an import.

---

## Advanced

### Header matching only

Inspect the raw mapping decisions without processing any rows:

```ruby
mappings = SmartCsvImport.match_headers("file.csv", form_class: MyImportRow)
# => {
#   "First Name" => #<MatchResult target_field=:first_name confidence=0.97 strategy="vector">,
#   "Cell"       => #<MatchResult target_field=:mobile_phone confidence=0.91 strategy="llm">,
#   "Nickname"   => #<UnmatchedResult csv_header="Nickname" attempted_strategies=["vector", "llm"]>
# }
```

Useful for building a review UI before committing large imports.

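
One way to feed such a review UI is to bucket the mappings by confidence. The sketch below uses stand-in structs shaped like the inspect output above rather than the gem's real classes, and the `0.95` cut-off and the `review_buckets` helper are assumptions made for illustration.

```ruby
# Stand-ins that mirror the shape of the inspect output shown above. The real
# classes live in the gem and may expose different or additional methods.
Matched   = Struct.new(:target_field, :confidence, :strategy, keyword_init: true)
Unmatched = Struct.new(:csv_header, :attempted_strategies, keyword_init: true)

REVIEW_THRESHOLD = 0.95 # hypothetical cut-off, not a gem default

# Sorts CSV headers into three lists a reviewer can act on.
def review_buckets(mappings, threshold: REVIEW_THRESHOLD)
  buckets = { accepted: [], needs_review: [], unmatched: [] }
  mappings.each do |header, result|
    key =
      if result.is_a?(Unmatched)
        :unmatched
      elsif result.confidence >= threshold
        :accepted
      else
        :needs_review
      end
    buckets[key] << header
  end
  buckets
end

mappings = {
  "First Name" => Matched.new(target_field: :first_name, confidence: 0.97, strategy: "vector"),
  "Cell"       => Matched.new(target_field: :mobile_phone, confidence: 0.91, strategy: "llm"),
  "Nickname"   => Unmatched.new(csv_header: "Nickname", attempted_strategies: %w[vector llm])
}

buckets = review_buckets(mappings)
p buckets
```

With this sample data, "First Name" is accepted, "Cell" is flagged for review, and "Nickname" is reported as unmatched.
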
### Import tracking

Every `SmartCsvImport.process` call creates a `SmartCsvImport::Import` record:

| Status | Meaning |
|---|---|
| `pending` | Created, not yet started |
| `processing` | Actively running |
| `completed` | All rows processed successfully |
| `partial_failure` | Some rows failed validation |
| `failed` | Processing stopped due to a database error |
| `mapping_review` | A `required:` field could not be matched; no rows were processed |

The record also stores the header mappings used, row counts, and a SHA-256 hash of the file. Duplicate-file detection compares this hash before processing and adds a warning to the result if a match is found.

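
The hash comparison can be illustrated with Ruby's standard `Digest` library. This is a sketch of the idea only, not the gem's `FileStorage` code: `file_fingerprint`, `duplicate_warning`, and the in-memory `seen` set (standing in for hashes stored on earlier import records) are hypothetical.

```ruby
require "digest"
require "set"
require "tempfile"

# Hex SHA-256 of a file's contents. Digest reads the file in chunks,
# so large files are not loaded into memory at once.
def file_fingerprint(path)
  Digest::SHA256.file(path).hexdigest
end

# Returns a warning string for a repeated file, or nil for a new one.
# `seen` stands in for the hashes stored on earlier import records.
def duplicate_warning(path, seen)
  fingerprint = file_fingerprint(path)
  return "duplicate of a previously imported file" if seen.include?(fingerprint)

  seen << fingerprint
  nil
end

# Two temp files with identical bytes but different names.
csv_body = "name,email\nAda,ada@example.com\n"
first = Tempfile.new(["employees", ".csv"])
first.write(csv_body)
first.flush
copy = Tempfile.new(["employees_copy", ".csv"])
copy.write(csv_body)
copy.flush

seen = Set.new
first_warning = duplicate_warning(first.path, seen)
copy_warning = duplicate_warning(copy.path, seen)
puts first_warning.inspect
puts copy_warning.inspect
```

Because the fingerprint depends only on the bytes, a renamed copy of the same file is still detected, while a file that differs by a single character is not.
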
### Stability analysis

After running several imports of the same type, check which header mappings have solidified:

```ruby
report = SmartCsvImport::StabilityReport.new(import_type: "EmployeeImportRow")
analysis = report.analyze

analysis.imports_analyzed  # => 20
analysis.stable_fields     # => fields consistent >= 90% of the time
analysis.unstable_fields   # => fields with varying resolutions

puts report.summary
# Stability report for EmployeeImportRow (20 imports analyzed):
# Stable fields (3):
#   - First Name -> first_name (100.0% consistent)
#   - Last Name -> last_name (100.0% consistent)
#   - Email -> email (95.0% consistent)
```

Fields stable at >= 90% are good candidates for promotion to a [Lookup strategy](#lookup-strategy). Doing so eliminates AI calls for those fields entirely on future imports.

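
To make the 90% figure concrete, here is a small sketch of one plausible way to compute it. It assumes that "consistent" means the share of imports in which a header resolved to its most common target; `StabilityReport` may define it differently. The `history` data and the `consistency` helper are made up for illustration.

```ruby
# Each element is one import's header => target-field mapping (made-up data).
history = [
  { "Email" => :email, "Cell" => :mobile_phone },
  { "Email" => :email, "Cell" => :home_phone },
  { "Email" => :email, "Cell" => :mobile_phone },
  { "Email" => :email, "Cell" => :mobile_phone }
]

STABLE_AT = 0.9 # threshold quoted in the text above

# For every header: its most common target and the share of imports
# that resolved to that target.
def consistency(history)
  targets_by_header = Hash.new { |hash, key| hash[key] = [] }
  history.each do |mapping|
    mapping.each { |header, target| targets_by_header[header] << target }
  end

  targets_by_header.transform_values do |targets|
    target, count = targets.tally.max_by { |_target, n| n }
    { target: target, ratio: count.fdiv(targets.size) }
  end
end

report = consistency(history)
stable, unstable = report.partition { |_header, stats| stats[:ratio] >= STABLE_AT }.map(&:to_h)

stable.each   { |header, stats| puts "stable:   #{header} -> #{stats[:target]}" }
unstable.each { |header, stats| puts "unstable: #{header} -> #{stats[:target]} (#{(stats[:ratio] * 100).round(1)}%)" }
```

Here "Email" resolves to `:email` in all four imports and counts as stable, while "Cell" resolves to `:mobile_phone` in three of four (75%) and stays below the threshold.
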
### Normalizers

Built-in converters for common CSV data types. Use them in your `save` method:

```ruby
SmartCsvImport::Normalizers::DateConverter.call("03/15/2024")  # => #<Date: 2024-03-15>
SmartCsvImport::Normalizers::DateConverter.call("2024-03-15")  # => #<Date: 2024-03-15>
SmartCsvImport::Normalizers::BooleanConverter.call("yes")      # => true
SmartCsvImport::Normalizers::BooleanConverter.call("0")        # => false

# In your form object:
def save
  Employee.create!(
    name: name,
    hired_on: SmartCsvImport::Normalizers::DateConverter.call(hired_on),
    active: SmartCsvImport::Normalizers::BooleanConverter.call(active)
  )
  true
rescue ActiveRecord::RecordInvalid
  false
end
```

---

## Contributing

### Architecture overview

```
SmartCsvImport.process(file, form_class:)
  └── Processor
        ├── FileStorage                stores file, computes hash, checks duplicates
        ├── Matcher                    runs strategy chain, returns header → MatchResult map
        │     ├── Strategies::Vector   cosine similarity on embeddings (cached)
        │     └── Strategies::Llm      structured LLM prompt
        └── (per row)                  form_class.new(attrs).save
```

Key files:

| File | Purpose |
|---|---|
| `lib/smart_csv_import.rb` | Public API: `.process`, `.match_headers`, `.configure` |
| `lib/smart_csv_import/processor.rb` | Orchestrates matching + row processing + result building |
| `lib/smart_csv_import/matcher.rb` | Runs the strategy chain, applies value hints |
| `lib/smart_csv_import/matchable.rb` | `csv_field` DSL mixed into form objects |
| `lib/smart_csv_import/strategies/` | Vector, LLM, Lookup, and base Strategy class |
| `lib/smart_csv_import/result.rb` | Result value objects returned from `.process` |

The `Configuration` object is a global singleton accessed via `SmartCsvImport.configuration`. The `Engine` class hooks it into Rails' initializer and migration loading.

### Getting started

```bash
git clone https://github.com/Nroulston/smart_csv_import
cd smart_csv_import
bin/setup
bin/rake     # runs the full test suite
bin/console  # IRB with all gem code loaded
```

### Design decisions

Before sending a PR that touches the matching strategies, read [`ROADMAP.md`](ROADMAP.md). It documents two approaches that were evaluated and explicitly rejected (HyDE, asymmetric embedding augmentation), along with the reasoning. Understanding why they were ruled out will save you from rediscovering the same dead ends.

---

## License

[MIT](LICENSE.adoc)