data_porter 0.1.0
- checksums.yaml +7 -0
- data/.claude/commands/blog-status.md +10 -0
- data/.claude/commands/blog.md +109 -0
- data/.claude/commands/task-done.md +27 -0
- data/.claude/commands/tm/add-dependency.md +58 -0
- data/.claude/commands/tm/add-subtask.md +79 -0
- data/.claude/commands/tm/add-task.md +81 -0
- data/.claude/commands/tm/analyze-complexity.md +124 -0
- data/.claude/commands/tm/analyze-project.md +100 -0
- data/.claude/commands/tm/auto-implement-tasks.md +100 -0
- data/.claude/commands/tm/command-pipeline.md +80 -0
- data/.claude/commands/tm/complexity-report.md +120 -0
- data/.claude/commands/tm/convert-task-to-subtask.md +74 -0
- data/.claude/commands/tm/expand-all-tasks.md +52 -0
- data/.claude/commands/tm/expand-task.md +52 -0
- data/.claude/commands/tm/fix-dependencies.md +82 -0
- data/.claude/commands/tm/help.md +101 -0
- data/.claude/commands/tm/init-project-quick.md +49 -0
- data/.claude/commands/tm/init-project.md +53 -0
- data/.claude/commands/tm/install-taskmaster.md +118 -0
- data/.claude/commands/tm/learn.md +106 -0
- data/.claude/commands/tm/list-tasks-by-status.md +42 -0
- data/.claude/commands/tm/list-tasks-with-subtasks.md +30 -0
- data/.claude/commands/tm/list-tasks.md +46 -0
- data/.claude/commands/tm/next-task.md +69 -0
- data/.claude/commands/tm/parse-prd-with-research.md +51 -0
- data/.claude/commands/tm/parse-prd.md +52 -0
- data/.claude/commands/tm/project-status.md +67 -0
- data/.claude/commands/tm/quick-install-taskmaster.md +23 -0
- data/.claude/commands/tm/remove-all-subtasks.md +94 -0
- data/.claude/commands/tm/remove-dependency.md +65 -0
- data/.claude/commands/tm/remove-subtask.md +87 -0
- data/.claude/commands/tm/remove-subtasks.md +89 -0
- data/.claude/commands/tm/remove-task.md +110 -0
- data/.claude/commands/tm/setup-models.md +52 -0
- data/.claude/commands/tm/show-task.md +85 -0
- data/.claude/commands/tm/smart-workflow.md +58 -0
- data/.claude/commands/tm/sync-readme.md +120 -0
- data/.claude/commands/tm/tm-main.md +147 -0
- data/.claude/commands/tm/to-cancelled.md +58 -0
- data/.claude/commands/tm/to-deferred.md +50 -0
- data/.claude/commands/tm/to-done.md +47 -0
- data/.claude/commands/tm/to-in-progress.md +39 -0
- data/.claude/commands/tm/to-pending.md +35 -0
- data/.claude/commands/tm/to-review.md +43 -0
- data/.claude/commands/tm/update-single-task.md +122 -0
- data/.claude/commands/tm/update-task.md +75 -0
- data/.claude/commands/tm/update-tasks-from-id.md +111 -0
- data/.claude/commands/tm/validate-dependencies.md +72 -0
- data/.claude/commands/tm/view-models.md +52 -0
- data/.env.example +12 -0
- data/.mcp.json +24 -0
- data/.taskmaster/CLAUDE.md +435 -0
- data/.taskmaster/config.json +44 -0
- data/.taskmaster/docs/prd.txt +2044 -0
- data/.taskmaster/state.json +6 -0
- data/.taskmaster/tasks/task_001.md +19 -0
- data/.taskmaster/tasks/task_002.md +19 -0
- data/.taskmaster/tasks/task_003.md +19 -0
- data/.taskmaster/tasks/task_004.md +19 -0
- data/.taskmaster/tasks/task_005.md +19 -0
- data/.taskmaster/tasks/task_006.md +19 -0
- data/.taskmaster/tasks/task_007.md +19 -0
- data/.taskmaster/tasks/task_008.md +19 -0
- data/.taskmaster/tasks/task_009.md +19 -0
- data/.taskmaster/tasks/task_010.md +19 -0
- data/.taskmaster/tasks/task_011.md +19 -0
- data/.taskmaster/tasks/task_012.md +19 -0
- data/.taskmaster/tasks/task_013.md +19 -0
- data/.taskmaster/tasks/task_014.md +19 -0
- data/.taskmaster/tasks/task_015.md +19 -0
- data/.taskmaster/tasks/task_016.md +19 -0
- data/.taskmaster/tasks/task_017.md +19 -0
- data/.taskmaster/tasks/task_018.md +19 -0
- data/.taskmaster/tasks/task_019.md +19 -0
- data/.taskmaster/tasks/task_020.md +19 -0
- data/.taskmaster/tasks/tasks.json +299 -0
- data/.taskmaster/templates/example_prd.txt +47 -0
- data/.taskmaster/templates/example_prd_rpg.txt +511 -0
- data/CHANGELOG.md +29 -0
- data/CLAUDE.md +65 -0
- data/CODE_OF_CONDUCT.md +10 -0
- data/CONTRIBUTING.md +49 -0
- data/LICENSE +21 -0
- data/README.md +463 -0
- data/Rakefile +12 -0
- data/app/assets/stylesheets/data_porter/application.css +646 -0
- data/app/channels/data_porter/import_channel.rb +10 -0
- data/app/controllers/data_porter/imports_controller.rb +68 -0
- data/app/javascript/data_porter/progress_controller.js +33 -0
- data/app/jobs/data_porter/dry_run_job.rb +12 -0
- data/app/jobs/data_porter/import_job.rb +12 -0
- data/app/jobs/data_porter/parse_job.rb +12 -0
- data/app/models/data_porter/data_import.rb +49 -0
- data/app/views/data_porter/imports/index.html.erb +142 -0
- data/app/views/data_porter/imports/new.html.erb +88 -0
- data/app/views/data_porter/imports/show.html.erb +49 -0
- data/config/database.yml +3 -0
- data/config/routes.rb +12 -0
- data/docs/SPEC.md +2012 -0
- data/docs/UI.md +32 -0
- data/docs/blog/001-why-build-a-data-import-engine.md +166 -0
- data/docs/blog/002-scaffolding-a-rails-engine.md +188 -0
- data/docs/blog/003-configuration-dsl.md +222 -0
- data/docs/blog/004-store-model-jsonb.md +237 -0
- data/docs/blog/005-target-dsl.md +284 -0
- data/docs/blog/006-parsing-csv-sources.md +300 -0
- data/docs/blog/007-orchestrator.md +247 -0
- data/docs/blog/008-actioncable-stimulus.md +376 -0
- data/docs/blog/009-phlex-ui-components.md +446 -0
- data/docs/blog/010-controllers-routing.md +374 -0
- data/docs/blog/011-generators.md +364 -0
- data/docs/blog/012-json-api-sources.md +323 -0
- data/docs/blog/013-testing-rails-engine.md +618 -0
- data/docs/blog/014-dry-run.md +307 -0
- data/docs/blog/015-publishing-retro.md +264 -0
- data/docs/blog/016-erb-view-templates.md +431 -0
- data/docs/blog/017-showcase-final-retro.md +220 -0
- data/docs/blog/BACKLOG.md +8 -0
- data/docs/blog/SERIES.md +154 -0
- data/docs/screenshots/index-with-previewing.jpg +0 -0
- data/docs/screenshots/index.jpg +0 -0
- data/docs/screenshots/modal-new-import.jpg +0 -0
- data/docs/screenshots/preview.jpg +0 -0
- data/lib/data_porter/broadcaster.rb +29 -0
- data/lib/data_porter/components/base.rb +10 -0
- data/lib/data_porter/components/failure_alert.rb +20 -0
- data/lib/data_porter/components/preview_table.rb +54 -0
- data/lib/data_porter/components/progress_bar.rb +33 -0
- data/lib/data_porter/components/results_summary.rb +19 -0
- data/lib/data_porter/components/status_badge.rb +16 -0
- data/lib/data_porter/components/summary_cards.rb +30 -0
- data/lib/data_porter/components.rb +14 -0
- data/lib/data_porter/configuration.rb +25 -0
- data/lib/data_porter/dsl/api_config.rb +25 -0
- data/lib/data_porter/dsl/column.rb +17 -0
- data/lib/data_porter/engine.rb +15 -0
- data/lib/data_porter/orchestrator.rb +141 -0
- data/lib/data_porter/record_validator.rb +32 -0
- data/lib/data_porter/registry.rb +33 -0
- data/lib/data_porter/sources/api.rb +49 -0
- data/lib/data_porter/sources/base.rb +35 -0
- data/lib/data_porter/sources/csv.rb +43 -0
- data/lib/data_porter/sources/json.rb +45 -0
- data/lib/data_porter/sources.rb +20 -0
- data/lib/data_porter/store_models/error.rb +13 -0
- data/lib/data_porter/store_models/import_record.rb +52 -0
- data/lib/data_porter/store_models/report.rb +21 -0
- data/lib/data_porter/target.rb +89 -0
- data/lib/data_porter/type_validator.rb +46 -0
- data/lib/data_porter/version.rb +5 -0
- data/lib/data_porter.rb +32 -0
- data/lib/generators/data_porter/install/install_generator.rb +33 -0
- data/lib/generators/data_porter/install/templates/create_data_porter_imports.rb.erb +21 -0
- data/lib/generators/data_porter/install/templates/initializer.rb +30 -0
- data/lib/generators/data_porter/target/target_generator.rb +44 -0
- data/lib/generators/data_porter/target/templates/target.rb.tt +20 -0
- data/sig/data_porter.rbs +4 -0
- metadata +274 -0
@@ -0,0 +1,300 @@

---
title: "Building DataPorter #6 -- Parsing CSV Data with Sources"
series: "Building DataPorter - A Data Import Engine for Rails"
part: 6
tags: [ruby, rails, rails-engine, gem-development, csv, active-storage, source-pattern]
published: false
---

# Parsing CSV Data with Sources

> How to model an import record in the database, parse CSV files through a pluggable Source layer, and map headers to target columns -- the first end-to-end flow.

## Context

This is part 6 of the series where we build **DataPorter**, a mountable Rails engine for data import workflows. In [part 5](#), we designed the Target DSL and Registry -- the layer that describes *what* an import looks like: its columns, mappings, and persistence logic.

Now we need the other half: the code that represents an import *in progress* and the code that reads raw data from a file. By the end of this article, we will have a `DataImport` ActiveRecord model to track state, a `Source` abstraction for parsing, and a concrete CSV source that maps headers to target columns. This is where data first flows through the engine.

## The problem

A target declaration tells the engine what columns to expect, but it says nothing about where data comes from or how to track the import lifecycle. We need a database-backed model that records who started the import, what state it is in, which records were parsed, and what errors occurred. We also need a layer that can read a CSV file (today) and a JSON payload or API response (later) through the same interface. Without this separation, the parsing logic would live in the controller or the target itself, coupling the format to the business rules.

## What we're building

Here is the end-to-end flow we are wiring together:

```ruby
# 1. Create an import record
import = DataPorter::DataImport.create!(
  target_key: "guests",
  source_type: "csv",
  user: current_user
)

# 2. Resolve the source and parse the file
source_class = DataPorter::Sources.resolve(import.source_type)
source = source_class.new(import, content: csv_string)
rows = source.fetch
# => [{ first_name: "Alice", last_name: "Smith", email: "alice@example.com" }, ...]

# 3. The import knows its target
import.target_class
# => GuestTarget
```

Three objects, three concerns: `DataImport` tracks state, `Sources.resolve` picks the parser, and the source turns raw bytes into mapped hashes. The Orchestrator (part 7) will coordinate these pieces, but each works independently.

## Implementation

### Step 1 -- The DataImport model and migration

The `DataImport` model is the central record for every import. It needs to track which target is being imported, what source format the data arrives in, what state the import has reached, and who initiated it.

The migration creates a single table with JSONB columns for records and report data (the StoreModel types from part 4):

```ruby
# lib/generators/data_porter/install/templates/create_data_porter_imports.rb.erb
create_table :data_porter_imports do |t|
  t.string :target_key, null: false
  t.string :source_type, null: false, default: "csv"
  t.integer :status, null: false, default: 0
  t.jsonb :records, null: false, default: []
  t.jsonb :report, null: false, default: {}
  t.jsonb :config, null: false, default: {}

  t.references :user, polymorphic: true, null: false

  t.timestamps
end
```

The `user` reference is polymorphic (`user_type` + `user_id`), so the engine works regardless of whether the host app calls its user model `User`, `AdminUser`, or `Account`. The `config` JSONB column stores source-specific options like CSV delimiters or API authentication parameters -- things that vary per import, not per target.

The model itself is compact:

```ruby
# app/models/data_porter/data_import.rb
class DataImport < ActiveRecord::Base
  self.table_name = "data_porter_imports"

  belongs_to :user, polymorphic: true

  enum :status, {
    pending: 0, parsing: 1, previewing: 2,
    importing: 3, completed: 4, failed: 5
  }

  attribute :records, StoreModels::ImportRecord.to_array_type, default: -> { [] }
  attribute :report, StoreModels::Report.to_type, default: -> { StoreModels::Report.new }
  attribute :config, :json, default: -> { {} }

  validates :target_key, presence: true
  validates :source_type, presence: true, inclusion: { in: %w[csv json api] }
end
```

The `status` enum defines the import lifecycle as a linear state machine: `pending -> parsing -> previewing -> importing -> completed` (or `failed` at any point). Integer-backed enums keep the database column small and indexable. The `records` and `report` attributes use the StoreModel types we built in part 4 -- they serialize structured data into the JSONB columns while providing typed Ruby objects in memory.
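
The linearity is worth making visible. As an illustration only (the gem itself relies on the enum plus the Orchestrator's call order, not an explicit transition table), the allowed moves could be sketched as:

```ruby
# Illustrative sketch, not part of DataPorter: each state either advances
# one step in the lifecycle or drops to :failed; terminal states go nowhere.
TRANSITIONS = {
  pending:    [:parsing, :failed],
  parsing:    [:previewing, :failed],
  previewing: [:importing, :failed],
  importing:  [:completed, :failed],
  completed:  [],
  failed:     []
}.freeze

def can_transition?(from, to)
  TRANSITIONS.fetch(from).include?(to)
end

can_transition?(:pending, :parsing)      # => true
can_transition?(:previewing, :completed) # => false (must go through :importing)
```

If the lifecycle ever grows guards or branches, this is the point where a state machine gem would earn its dependency.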

Two convenience methods bridge the model to the rest of the engine:

```ruby
# app/models/data_porter/data_import.rb
def target_class
  Registry.find(target_key)
end

def importable_records
  records.select(&:importable?)
end
```

`target_class` delegates to the Registry so any code holding a `DataImport` can reach the target's column definitions, mappings, and hooks. `importable_records` filters parsed records down to those that passed validation -- the subset the Orchestrator will actually persist.

### Step 2 -- The Source base class

Sources are responsible for one thing: turning raw input into an array of hashes where keys are target column names. The base class defines the interface and the shared mapping logic:

```ruby
# lib/data_porter/sources/base.rb
module DataPorter
  module Sources
    class Base
      def initialize(data_import, **)
        @data_import = data_import
        @target_class = data_import.target_class
      end

      def fetch
        raise NotImplementedError
      end
    end
  end
end
```

Every source receives the `DataImport` record at construction, which gives it access to the target class (for column mappings) and the config hash (for source-specific options). The `**` double splat lets subclasses accept extra keyword arguments without the base class needing to know about them.

The mapping logic lives in the base class because it is shared across all sources that deal with key-value rows:

```ruby
# lib/data_porter/sources/base.rb (private methods)
def apply_csv_mapping(row)
  mappings = @target_class._csv_mappings
  return auto_map(row) if mappings.nil? || mappings.empty?

  explicit_map(row, mappings)
end

def auto_map(row)
  row.to_h.transform_keys { |k| k.parameterize(separator: "_").to_sym }
end

def explicit_map(row, mappings)
  mappings.each_with_object({}) do |(header, column), hash|
    hash[column] = row[header]
  end
end
```

There are two mapping strategies. When a target defines `csv_mapping`, explicit mapping applies: only the declared header-to-column pairs are extracted, and anything else in the row is silently dropped. When no mapping is defined, auto-mapping kicks in: every header is parameterized into a snake_case symbol (`"First Name"` becomes `:first_name`). This lets simple imports work with zero configuration while giving complex imports full control over which columns matter.
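
The two strategies are easy to exercise in isolation. Here is a plain-Ruby sketch of the same logic -- note that the real `auto_map` uses ActiveSupport's `String#parameterize`, which we approximate with a regex so the example runs without Rails:

```ruby
# Plain-Ruby approximation of the two strategies above.
# The downcase/gsub stands in for ActiveSupport's String#parameterize.
def auto_map(row)
  row.to_h.transform_keys { |k| k.strip.downcase.gsub(/[^a-z0-9]+/, "_").to_sym }
end

def explicit_map(row, mappings)
  mappings.each_with_object({}) { |(header, column), hash| hash[column] = row[header] }
end

row = { "First Name" => "Alice", "Last Name" => "Smith", "Notes" => "VIP" }

auto_map(row)
# => { first_name: "Alice", last_name: "Smith", notes: "VIP" }

explicit_map(row, { "First Name" => :first_name })
# => { first_name: "Alice" }  -- the unmapped "Notes" column is dropped
```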

### Step 3 -- The CSV source and source resolution

The CSV source implements `fetch` by parsing content through Ruby's standard library `CSV` class:

```ruby
# lib/data_porter/sources/csv.rb
class Csv < Base
  def initialize(data_import, content: nil)
    super(data_import)
    @content = content
  end

  def fetch
    rows = []
    ::CSV.parse(csv_content, **csv_options) do |row|
      rows << apply_csv_mapping(row)
    end
    rows
  end

  private

  def csv_content
    @content || download_file
  end

  def download_file
    @data_import.file.download
  end

  def csv_options
    { headers: true }.merge(extra_options)
  end

  def extra_options
    config = @data_import.config
    return {} unless config.is_a?(Hash)

    config.symbolize_keys.slice(:col_sep, :encoding)
  end
end
```

The `content:` keyword argument enables two usage modes. In production, `content` is nil and the source downloads the file from ActiveStorage via `@data_import.file.download`. In tests, you pass a CSV string directly, avoiding the need for file attachments or storage mocks. The `csv_options` method merges `headers: true` (so `CSV.parse` yields `CSV::Row` objects with named access) with any per-import overrides from the `config` column -- currently `col_sep` for semicolon-delimited files and `encoding` for non-UTF-8 data.
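
To see what those options buy, here is the standard library in isolation, parsing a semicolon-delimited file the way a `config` of `{ "col_sep" => ";" }` would:

```ruby
require "csv"

# Standard-library behavior behind csv_options: headers: true makes
# CSV.parse yield CSV::Row objects with named access, and col_sep
# switches the delimiter for semicolon-separated files.
content = "Prenom;Nom\nAlice;Smith\n"
rows = CSV.parse(content, headers: true, col_sep: ";")

rows.first["Prenom"] # => "Alice"
rows.first.to_h      # => { "Prenom" => "Alice", "Nom" => "Smith" }
```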

Source resolution ties it together:

```ruby
# lib/data_porter/sources.rb
module DataPorter
  module Sources
    REGISTRY = {
      csv: Csv
    }.freeze

    def self.resolve(type)
      REGISTRY.fetch(type.to_sym) { raise Error, "Unknown source type: #{type}" }
    end
  end
end
```

This is intentionally simpler than the Target Registry. Sources are engine-internal (the gem ships them), so a frozen hash with a `resolve` method is sufficient. When we add JSON and API sources in part 12, they get one line each in the registry.
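
The pattern here is just `Hash#fetch` with a block default: the block runs only on a miss, which is why `resolve` can raise a descriptive error lazily. A self-contained sketch (the string value and `ArgumentError` are stand-ins for the real source class and `DataPorter::Error`):

```ruby
# Hash#fetch with a block default -- the block is evaluated only when
# the key is absent, so the happy path pays no cost for the error message.
REGISTRY = { csv: "CsvSource" }.freeze

def resolve(type)
  REGISTRY.fetch(type.to_sym) { raise ArgumentError, "Unknown source type: #{type}" }
end

resolve("csv")                   # => "CsvSource"
resolve("xml") rescue $!.message # => "Unknown source type: xml"
```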

## Decisions & tradeoffs

| Decision | We chose | Over | Because |
|----------|----------|------|---------|
| User association | Polymorphic `belongs_to :user` | A configurable foreign key or no association | Polymorphic works with any user model name without configuration; the engine does not need to know the host app's auth setup |
| State tracking | Integer-backed `enum` | A state machine gem (AASM, Statesman) | Six linear states do not need transition guards or history tracking yet; a gem would add a dependency for no immediate benefit |
| Auto-mapping fallback | `parameterize` + `to_sym` on headers | Requiring explicit mapping for all imports | Auto-mapping lets simple CSVs work with zero target configuration; explicit mapping is there when headers don't match column names |
| CSV content injection | `content:` keyword on initialize | Always reading from ActiveStorage | Injecting content makes tests fast and storage-independent; production code passes `nil` and falls through to `download_file` |
| Source-specific config | JSONB `config` column on DataImport | Separate columns for each option | A single JSONB column absorbs any source's options (col_sep, encoding, API headers) without schema changes |

## Testing it

DataImport specs verify validations, the status enum, and StoreModel integration:

```ruby
# spec/data_porter/data_import_spec.rb
it "validates source_type inclusion" do
  import = described_class.new(target_key: "guests", source_type: "xml")

  expect(import).not_to be_valid
  expect(import.errors[:source_type]).to include("is not included in the list")
end

it "saves and reloads with records" do
  import = described_class.create!(
    target_key: "guests", source_type: "csv",
    user_type: "User", user_id: 1
  )
  record = DataPorter::StoreModels::ImportRecord.new(line_number: 1, data: { name: "Alice" })
  import.update!(records: [record])

  reloaded = described_class.find(import.id)
  expect(reloaded.records.first.data).to eq({ "name" => "Alice" })
end
```

CSV source specs exercise both mapping modes and the config override:

```ruby
# spec/data_porter/sources/csv_spec.rb
it "parses CSV content and applies mapping" do
  csv_content = "Prenom,Nom,Email\nAlice,Smith,alice@example.com\n"
  source = described_class.new(data_import, content: csv_content)

  rows = source.fetch

  expect(rows.size).to eq(1)
  expect(rows.first).to eq(
    first_name: "Alice", last_name: "Smith", email: "alice@example.com"
  )
end

it "auto-maps when no csv_mapping defined" do
  csv_content = "First Name,Last Name\nAlice,Smith\n"
  source = described_class.new(data_import, content: csv_content)

  expect(source.fetch.first).to eq(first_name: "Alice", last_name: "Smith")
end
```

All source specs pass CSV strings directly through the `content:` parameter, so they run without ActiveStorage, without file fixtures, and without I/O.

## Recap

- The **DataImport model** is the database-backed record for every import, tracking target key, source type, status, parsed records (via StoreModel), and the initiating user (via polymorphic association).
- The **migration** uses JSONB columns for records, report, and config, keeping the schema stable as features evolve.
- The **Source base class** defines the `fetch` interface and shared column-mapping logic with two strategies: explicit mapping from the target's `csv_mapping` block, or automatic parameterize-based mapping when none is defined.
- The **CSV source** parses content via Ruby's `CSV` library, supports ActiveStorage file download in production and string injection in tests, and respects per-import config options like custom delimiters.

## Next up

We have targets that describe imports and sources that parse raw data into mapped hashes. But right now, nothing coordinates the flow: reading the file, building ImportRecord objects, running validations, transitioning the status, and persisting results. In part 7, we build the **Orchestrator** -- the class that ties `DataImport`, `Target`, and `Source` together into the complete parse-then-import workflow. That is where state transitions, per-record error handling, and ActiveJob integration come in.

---

*This is part 6 of the series "Building DataPorter - A Data Import Engine for Rails". [Previous: Designing a Target DSL](#) | [Next: The Orchestrator](#)*

@@ -0,0 +1,247 @@

---
title: "Building DataPorter #7 -- The Orchestrator: Coordinating the Import Workflow"
series: "Building DataPorter - A Data Import Engine for Rails"
part: 7
tags: [ruby, rails, rails-engine, gem-development, orchestrator, activejob, error-handling]
published: false
---

# The Orchestrator: Coordinating the Import Workflow

> How to coordinate parsing, validation, and persistence through a single class that manages state transitions, handles errors per record, and integrates with ActiveJob.

## Context

This is part 7 of the series where we build **DataPorter**, a mountable Rails engine for data import workflows. In [part 6](#), we built the DataImport model to track import state and the Source layer to parse CSV files into mapped hashes.

We now have all the individual pieces: targets describe imports, sources parse files, the RecordValidator checks column constraints, and DataImport tracks state. But nothing ties them together. In this article, we build the Orchestrator -- the class that coordinates the full parse-then-import workflow.

## The problem

Right now, if you wanted to run an import, you would need to manually resolve the source, call `fetch`, iterate over the rows, build ImportRecord objects, run validations, update the status, and handle failures. That is a lot of procedural logic. If it lived in a controller action, it would be untestable, unreusable, and impossible to run in the background. If it lived in the model, DataImport would become a god object. We need a dedicated coordination layer that knows the *order* of operations but delegates the *details* to the objects that own them.

## What we're building

Here is the Orchestrator in action -- two method calls that drive the entire workflow:

```ruby
# In a controller or job
orchestrator = DataPorter::Orchestrator.new(data_import, content: csv_string)

# Step 1: Parse the file, validate records, generate a preview
orchestrator.parse!
data_import.status  # => "previewing"
data_import.records # => [ImportRecord, ImportRecord, ...]

# Step 2: After the user reviews the preview, persist the records
orchestrator.import!
data_import.status                # => "completed"
data_import.report.imported_count # => 42
```

Two methods, two phases. `parse!` turns raw data into validated records and stops at `previewing` so the user can review. `import!` takes the importable records and persists them through the target's `persist` method. If anything goes catastrophically wrong, the import transitions to `failed` with an error report.

## Implementation

### Step 1 -- The Orchestrator skeleton and parse phase

The Orchestrator is a plain Ruby object. It receives a DataImport and optional content (for testing), then delegates to the pieces we already built:

```ruby
# lib/data_porter/orchestrator.rb
class Orchestrator
  def initialize(data_import, content: nil)
    @data_import = data_import
    @target = data_import.target_class.new
    @source_options = { content: content }.compact
  end

  def parse!
    @data_import.parsing!
    records = build_records
    @data_import.update!(records: records, status: :previewing)
    build_report
  rescue StandardError => e
    handle_failure(e)
  end
end
```

The constructor instantiates the target (so we can call its hooks) and compacts the source options (so `content: nil` does not override ActiveStorage downloads). The `parse!` method follows a strict sequence: transition to `parsing`, build and validate records, save them with a `previewing` status, then generate a summary report. The `rescue` at the method level catches any failure -- a malformed CSV, a missing file, an unexpected source error -- and transitions the import to `failed` with the error message preserved in the report.

Notice that `parsing!` is called *before* the work starts. This is intentional: if the job crashes between the status transition and the `update!`, the import is left in `parsing` rather than `pending`, signaling to the user that something went wrong mid-process rather than silently sitting idle.

### Step 2 -- Building and validating records

The `build_records` method is where the Source, Target, and RecordValidator converge:

```ruby
# lib/data_porter/orchestrator.rb
def build_records
  source = @data_import.source_class.new(@data_import, **@source_options)
  raw_rows = source.fetch
  columns = @target.class._columns || []
  validator = RecordValidator.new(columns)

  raw_rows.each_with_index.map do |row, index|
    build_record(row, index, columns, validator)
  end
end

def build_record(row, index, columns, validator)
  record = StoreModels::ImportRecord.new(
    line_number: index + 1,
    data: extract_data(row, columns)
  )
  record = @target.transform(record)
  @target.validate(record)
  validator.validate(record)
  record.determine_status!
  record
end
```

Each row goes through a four-step pipeline: extract the data for declared columns, let the target transform it (e.g. normalizing phone numbers), let the target run custom validations, then run the generic column-level validations (required fields, type checks). Finally, `determine_status!` sets each record to `complete`, `partial`, or `missing` based on whether errors were added.

The RecordValidator handles the generic constraints we defined in the column DSL:

```ruby
# lib/data_porter/record_validator.rb
class RecordValidator
  def initialize(columns)
    @columns = columns
  end

  def validate(record)
    @columns.each do |col|
      value = record.data[col.name]
      validate_required(record, col, value)
      validate_type(record, col, value)
    end
  end
end
```

This separation matters: the target owns business-rule validations ("email must be unique in the system"), while the RecordValidator owns structural validations ("email must look like an email"). Neither knows about the other.
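
The structural half is simple enough to sketch standalone. Assuming a minimal `Column` struct (the gem's real column DSL from part 5 carries more options, so the names here are illustrative), required and type checks reduce to:

```ruby
# Standalone sketch of structural validation. Column is a stand-in for the
# gem's column DSL objects; only :name, :required, and :type are modeled.
Column = Struct.new(:name, :required, :type)

def structural_errors(data, columns)
  columns.each_with_object([]) do |col, errors|
    value = data[col.name]
    errors << "#{col.name} is required" if col.required && (value.nil? || value == "")
    if col.type == :integer && !value.nil? && !value.to_s.match?(/\A-?\d+\z/)
      errors << "#{col.name} must be an integer"
    end
  end
end

columns = [Column.new(:email, true, :string), Column.new(:age, false, :integer)]

structural_errors({ email: "a@example.com", age: "42" }, columns)
# => []

structural_errors({ email: nil, age: "abc" }, columns)
# => ["email is required", "age must be an integer"]
```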
|
|
126
|
+
|
|
127
|
+
### Step 3 -- The import phase and per-record error handling
|
|
128
|
+
|
|
129
|
+
Once the user reviews the preview and confirms, `import!` persists the records:
|
|
130
|
+
|
|
131
|
+
```ruby
# lib/data_porter/orchestrator.rb
def import!
  @data_import.importing!
  results = import_records
  update_import_report(results)
  @target.after_import(results, context: build_context)
rescue StandardError => e
  handle_failure(e)
end

def persist_record(record, context, results)
  @target.persist(record, context: context)
  results[:created] += 1
rescue StandardError => e
  record.add_error(e.message)
  @target.on_error(record, e, context: context)
  results[:errored] += 1
end
```

The critical design decision here is the error boundary. Each record is persisted individually, and if `persist` raises -- a uniqueness violation, a foreign key constraint, a custom validation from the host app -- the error is captured *on that record* and the import continues. The import does not wrap everything in a single transaction. This means a 10,000-row file with 3 bad records will successfully import 9,997 records rather than rolling back the entire batch.

The `on_error` hook lets the target react to failures (logging, notifying, skipping related records), while `after_import` runs once after all records are processed, receiving the results hash for summary work like sending a confirmation email.

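`import_records` is not shown above; its essential property -- the rescue sits inside the loop, not around it -- can be demonstrated in isolation (`persist_each` and its block interface are illustrative names, not the engine's API):

```ruby
# Per-record error boundary in miniature: a raising record increments
# :errored and the iteration continues, so one bad row never aborts the batch.
def persist_each(records, results)
  records.each do |record|
    yield record
    results[:created] += 1
  rescue StandardError
    results[:errored] += 1
  end
  results
end
```

Run against four records where one raises, the counts come out `{ created: 3, errored: 1 }` instead of an aborted batch.
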
### Step 4 -- ActiveJob integration

The Orchestrator is designed to be called from anywhere, but its primary consumer is a pair of ActiveJob classes:

```ruby
# app/jobs/data_porter/parse_job.rb
module DataPorter
  class ParseJob < ActiveJob::Base
    queue_as { DataPorter.configuration.queue_name }

    def perform(import_id)
      data_import = DataImport.find(import_id)
      Orchestrator.new(data_import).parse!
    end
  end
end

# app/jobs/data_porter/import_job.rb
module DataPorter
  class ImportJob < ActiveJob::Base
    queue_as { DataPorter.configuration.queue_name }

    def perform(import_id)
      data_import = DataImport.find(import_id)
      Orchestrator.new(data_import).import!
    end
  end
end
```

Each job is a one-liner: find the import, delegate to the Orchestrator. The queue name comes from the engine's configuration, so the host app controls which queue processes imports. Because the Orchestrator already handles failures internally (transitioning to `failed` and recording the error), the jobs do not need their own error handling -- a crash at the ActiveJob level means something truly unexpected happened, and the adapter's retry mechanism takes over.

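The queue name both jobs read would typically be set once in a host-app initializer. Assuming the engine exposes a standard `configure` block (only `DataPorter.configuration.queue_name` appears in this post, so the block form is a guess), it might look like:

```ruby
# config/initializers/data_porter.rb -- hypothetical host-app setup
# routing all DataPorter jobs to a dedicated :imports queue.
DataPorter.configure do |config|
  config.queue_name = :imports
end
```
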
## Decisions & tradeoffs

| Decision | We chose | Over | Because |
|----------|----------|------|---------|
| Coordination layer | Dedicated Orchestrator class | Controller-level logic or model callbacks | Keeps controllers thin, models focused on data, and the workflow independently testable |
| Transaction boundaries | Per-record persist (no wrapping transaction) | Single transaction around all records | A failed record should not roll back thousands of successful ones; partial success is more useful than total failure |
| Error recovery | Capture error on the record, continue importing | Halt on first error | Users expect to see which rows failed and why, not just "import failed"; the report becomes actionable |
| Two-phase workflow | Separate `parse!` and `import!` methods | A single `run!` method | The preview step between parse and import lets users catch problems before data hits the database |
| Job design | Thin jobs delegating to Orchestrator | Logic inside the job classes | The Orchestrator is testable without ActiveJob; jobs are just the async trigger |

## Testing it

The Orchestrator specs exercise both phases end-to-end using an anonymous target class and injected CSV content:

```ruby
# spec/data_porter/orchestrator_spec.rb
describe "#parse!" do
  it "transitions to previewing" do
    orchestrator = described_class.new(data_import, content: csv_content)

    orchestrator.parse!

    expect(data_import.reload.status).to eq("previewing")
  end

  it "validates required fields" do
    csv = "First Name,Last Name,Email\n,Smith,alice@example.com\n"
    orchestrator = described_class.new(data_import, content: csv)

    orchestrator.parse!

    record = data_import.reload.records.first
    expect(record.status).to eq("missing")
  end
end

describe "#import!" do
  it "handles per-record errors" do
    # Target that always raises on persist
    orchestrator = described_class.new(import)
    orchestrator.import!

    expect(import.reload.status).to eq("completed")
    expect(import.report.errored_count).to eq(1)
  end
end
```

The key pattern: even when every `persist` call raises, the import still reaches `completed` -- not `failed`. The `failed` status is reserved for catastrophic errors (the source cannot be read, the target cannot be resolved). Per-record errors are expected operational noise, tracked in the report.

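The comment in the `#import!` spec hints at an always-raising target; such a double can be tiny (the real target interface comes from earlier parts of the series -- this stand-in only needs `persist` to raise):

```ruby
# Anonymous target double whose persist always fails, forcing every
# record through the rescue branch in persist_record. Illustrative only.
failing_target = Class.new do
  def persist(record, context:)
    raise "email already taken"
  end
end.new
```
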
## Recap

- The **Orchestrator** is a plain Ruby class that coordinates the parse-validate-persist workflow, keeping controllers thin and models focused.
- The **two-phase design** (`parse!` then `import!`) creates a natural preview checkpoint where users can review data before it touches the database.
- **Per-record error handling** means a single bad row never takes down the entire import; errors are captured on individual records and surfaced in the report.
- **ActiveJob integration** is a thin wrapper: two one-liner jobs that delegate to the Orchestrator, using the engine's configured queue name.

## Next up

The import now runs in the background, but the user has no way to know what is happening. They click "Import" and stare at a static page. In part 8, we build a **real-time progress system** using ActionCable and Stimulus -- a Broadcaster service that pushes status updates and record counts to the browser as the Orchestrator processes each row. No more refreshing to check if it is done.

---

*This is part 7 of the series "Building DataPorter - A Data Import Engine for Rails". [Previous: Parsing CSV Data with Sources](#) | [Next: Real-time Progress with ActionCable & Stimulus](#)*