@caracal-lynx/sluice 0.1.1 → 0.1.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +1822 -1822
- package/LICENCE-FAQ.md +74 -74
- package/LICENSE +92 -92
- package/PLUGINS.md +294 -0
- package/README.md +681 -573
- package/dist/multi-source-runner.js +16 -16
- package/dist/runner.js +10 -10
- package/package.json +98 -92
- package/dist/adapters/source/rest.types.d.ts +0 -15
- package/dist/adapters/source/rest.types.d.ts.map +0 -1
- package/dist/adapters/source/rest.types.js +0 -6
- package/dist/adapters/source/rest.types.js.map +0 -1
- package/dist/merge/strategies/registry.d.ts +0 -8
- package/dist/merge/strategies/registry.d.ts.map +0 -1
- package/dist/merge/strategies/registry.js +0 -19
- package/dist/merge/strategies/registry.js.map +0 -1
package/CLAUDE.md
CHANGED
@@ -1,1822 +1,1822 @@
# Sluice — CLAUDE.md
# Project specification for Claude Code
# Sluice: config-driven ETL toolkit for ERP data migrations
# npm package: @caracal-lynx/sluice
# Owner: Michael Scott, Caracal Lynx Ltd. (SC826823)
# Last updated: 2026-04-20

---
## Project overview

**Sluice** is a config-driven ETL toolkit for ERP data migrations, developed and
maintained by Caracal Lynx Ltd. The engine is written once; each client
engagement is delivered as a folder of YAML pipeline configs. There is no UI, no
server, and no cloud dependency — just the `sluice` CLI and a set of TypeScript
modules that can be imported by other tools (e.g. n8n custom nodes, GitHub Actions).

*Clean data flows through.*

**Known clients and targets:**

| Client | Source(s) | Target ERP | Adapter |
|---|---|---|---|
| Acme Corp | MSSQL legacy DB | IFS ERP | `ifs` |
| Style Co | MSSQL / CSV exports | BlueCherry ERP | `bluecherry` |

**Primary use cases:**
- Extract data from legacy SQL databases, CSV/Excel exports, and REST APIs
- Validate data quality against a configurable rule set
- Transform field mappings, apply lookups, cleanse values, evaluate expressions
- Load output to BC via REST API, IFS via CSV import, BlueCherry via CSV import,
  or generic CSV/JSON for any other target
- Run from the command line on a developer laptop (Windows, PowerShell 7)
- Run unattended in GitHub Actions CI

**Non-goals:**
- No web UI or dashboard
- No streaming / real-time ingestion
- No data warehouse or lake — DuckDB is used only as a local staging store
- No multi-tenant SaaS — this is a consultant's toolkit, not a product

**Related docs:**
- [README.md](README.md) — install, quick-start, composite rules (Tier 1)
- [PLUGINS.md](PLUGINS.md) — Tier 2 (file) and Tier 3 (npm) plugin author guide
- [docs/architecture-diagrams.md](docs/architecture-diagrams.md) — Mermaid diagrams of
  the single- and multi-source pipeline flow

---
## Repository structure

```
sluice/
├── CLAUDE.md                  ← you are here
├── PLUGINS.md                 ← Tier 2 / Tier 3 plugin author guide
├── README.md
├── package.json
├── tsconfig.json
├── tsconfig.test.json
├── .env.example
├── .gitignore
├── eslint.config.js
├── .prettierrc
├── .github/workflows/ci.yml
├── docs/
│   └── architecture-diagrams.md
├── examples/                  ← sample pipelines (not run by tests)
│
├── src/
│   ├── index.ts               ← public API barrel (re-exports from all modules)
│   ├── cli.ts                 ← commander CLI entry point
│   ├── runner.ts              ← PipelineRunner (single-source)
│   ├── multi-source-runner.ts ← MultiSourcePipelineRunner (extends PipelineRunner)
│   │
│   ├── config/
│   │   ├── index.ts           ← re-exports schema + types
│   │   ├── schema.ts          ← Zod schema (PipelineSchema + sub-schemas)
│   │   ├── loader.ts          ← YAML load + ${ENV_VAR} interp + composite-rule expansion + parse
│   │   └── types.ts           ← re-exports of all inferred Zod types + guards
│   │
│   ├── adapters/
│   │   ├── source/
│   │   │   ├── index.ts       ← barrel (self-registers built-ins on import)
│   │   │   ├── registry.ts    ← SourceAdapterRegistry
│   │   │   ├── types.ts       ← SourceAdapter + ExtractResult
│   │   │   ├── mssql.ts
│   │   │   ├── pg.ts
│   │   │   ├── csv.ts
│   │   │   ├── xlsx.ts
│   │   │   └── rest.ts
│   │   └── target/
│   │       ├── index.ts       ← barrel (self-registers built-ins on import)
│   │       ├── registry.ts    ← TargetAdapterRegistry
│   │       ├── types.ts       ← TargetAdapter + LoadResult
│   │       ├── bc.ts          ← Business Central REST (+ BcTokenManager)
│   │       ├── ifs.ts         ← IFS ERP CSV import
│   │       ├── bluecherry.ts  ← BlueCherry ERP CSV import
│   │       ├── csv.ts         ← generic CSV
│   │       └── pg.ts
│   │
│   ├── staging/
│   │   ├── index.ts           ← barrel
│   │   ├── store.ts           ← DuckDB wrapper (the only file that imports `@duckdb/node-api`)
│   │   └── schema.ts          ← ColumnMeta, quoteIdent, buildCreateTableSql
│   │
│   ├── dq/
│   │   ├── index.ts           ← barrel
│   │   ├── engine.ts          ← DQEngine
│   │   ├── reporter.ts        ← writeRejectionCsv, writeSummaryJson
│   │   ├── types.ts           ← DQSummary, ViolationCounts
│   │   └── rules/
│   │       ├── index.ts       ← BUILT_IN_RULES map (id → Rule instance)
│   │       ├── types.ts       ← Rule = RulePlugin, RuleViolation (re-exported from plugins)
│   │       ├── notNull.ts
│   │       ├── unique.ts
│   │       ├── pattern.ts
│   │       ├── email.ts
│   │       ├── ukPostcode.ts
│   │       ├── maxLength.ts
│   │       ├── minMax.ts
│   │       └── allowedValues.ts
│   │
│   ├── transform/
│   │   ├── index.ts
│   │   ├── engine.ts          ← TransformEngine (built-in types + custom plugins)
│   │   ├── lookup.ts
│   │   ├── cleanse.ts
│   │   ├── expression.ts      ← expr-eval + `js:` vm sandbox
│   │   └── types.ts           ← TransformResult
│   │
│   ├── merge/                 ← multi-source merge engine + strategies
│   │   ├── index.ts           ← MergeStrategyRegistry (pre-registers all built-ins)
│   │   ├── engine.ts          ← MergeEngine
│   │   ├── sql-builder.ts     ← shared JOIN + coalesce SQL helpers
│   │   ├── conflict-log.ts    ← conflict CSV writer
│   │   ├── types.ts           ← MergeStrategyPlugin, MergeSourceMeta, MergeResult
│   │   └── strategies/
│   │       ├── index.ts
│   │       ├── coalesce.ts
│   │       ├── priority-override.ts
│   │       ├── union.ts
│   │       └── intersect.ts
│   │
│   ├── plugins/               ← Tier 2 / Tier 3 plugin system
│   │   ├── index.ts           ← barrel
│   │   ├── types.ts           ← RulePlugin, TransformPlugin, PluginPackage
│   │   ├── registry.ts        ← RuleRegistry, TransformRegistry (custom plugin holders)
│   │   └── loader.ts          ← loadPlugins (file-based), loadNpmPlugins (sluice.config.yaml)
│   │
│   ├── enrich/                ← Phase 4a public surface (types only)
│   │   └── types.ts           ← EnrichPlugin, EnrichResult, EnrichOptions, EnrichSummary,
│   │                            EnrichPhaseFactory (implementation lives in private
│   │                            @caracal-lynx/sluice-enrich package)
│   │
│   └── utils/
│       ├── index.ts
│       ├── logger.ts          ← pino singleton
│       ├── env.ts             ← loadEnv + requireEnv
│       └── errors.ts
│
├── tests/
│   ├── fixtures/
│   │   ├── acme-corp-customers.pipeline.yaml
│   │   ├── style-co-styles.pipeline.yaml
│   │   ├── style-co-products-merged.pipeline.yaml  ← multi-source
│   │   ├── multi-source-no-merge.pipeline.yaml     ← negative-path multi-source
│   │   ├── shared-rules.yaml                       ← composite rule library
│   │   └── plugins/                                ← test plugin fixtures (Tier 2 files)
│   │
│   ├── unit/
│   │   ├── cli.test.ts
│   │   ├── runner.test.ts
│   │   ├── adapters/
│   │   │   ├── source/        ← csv, mssql, pg, rest, xlsx
│   │   │   └── target/        ← bc, bluecherry, ifs, pg
│   │   ├── config/            ← loader, schema, multi-source, composite-expansion
│   │   ├── dq/                ← engine, reporter, rules
│   │   ├── merge/             ← engine, registry, strategies
│   │   ├── plugins/           ← loader, registry, composite-expansion
│   │   ├── staging/           ← store
│   │   └── transform/         ← cleanse, expression, engine, custom
│   │
│   └── integration/
│       ├── cli-check.test.ts
│       ├── cli-commands.test.ts
│       ├── cli-plugins.test.ts
│       ├── csv-to-csv-mvp.test.ts
│       ├── dq-integration.test.ts
│       ├── style-co-styles-mini.test.ts
│       ├── merge-strategies.test.ts
│       ├── multi-source-runner.test.ts
│       └── runner-plugin-wiring.test.ts
│
└── clients/                   ← gitignored in this repo; each client
    ├── acme-corp/               gets their own private repo
    │   ├── .env
    │   ├── customers.pipeline.yaml
    │   ├── items.pipeline.yaml
    │   ├── vendors.pipeline.yaml
    │   └── lookups/
    └── style-co/
        ├── .env
        ├── styles.pipeline.yaml
        ├── vendors.pipeline.yaml
        ├── purchase-orders.pipeline.yaml
        └── lookups/
```

---
## Technology stack

| Concern | Package | Notes |
|---|---|---|
| Language | TypeScript 5.x | `strict: true`, `exactOptionalPropertyTypes: true` |
| Runtime | Node.js 24 LTS | No Bun, no Deno — must run in GitHub Actions |
| Config parsing | `js-yaml` | YAML 1.2 only |
| Config validation | `zod` v3 | All config types inferred from Zod |
| SQL Server | `mssql` | Trusted + SQL auth both supported |
| PostgreSQL | `pg` + `@types/pg` | |
| CSV | `csv-parse` + `csv-stringify` | Streaming |
| Excel | `xlsx` (SheetJS) | Read-only |
| HTTP | `axios` + `axios-retry` | 3 retries, exponential backoff |
| Dates | `dayjs` | All date parsing and formatting |
| Staging | `@duckdb/node-api` | Embedded; no server. Replaces deprecated `duckdb` package — ABI-stable (no `npm rebuild` after Node ABI bumps). |
| CLI | `commander` v12 | |
| Logging | `pino` | JSON; `pino-pretty` in dev |
| Testing | `vitest` | No Jest |
| Env vars | `dotenv` | Loaded once at CLI entry |
| Linting | `eslint` + `@typescript-eslint` | |
| Formatting | `prettier` | 2-space, single quotes, trailing commas |
| Expressions | `expr-eval` | Safe expression parser; no eval() |

---
## TypeScript conventions

- **All config types come from Zod inference.** Do not write manual `type` or
  `interface` declarations for anything that maps to pipeline config.
  Use `z.infer<typeof SomeSchema>`.
- **No `any`.** Use `unknown` and narrow explicitly.
- **No `eval()` or `Function()`** anywhere. See expression evaluator section.
- **Async throughout.** All I/O must be `async/await`. No callbacks.
- **Error handling:** throw typed errors from `src/utils/errors.ts`. Never throw
  raw strings. Catch at the `PipelineRunner` boundary.
- **Barrel exports:** each directory has an `index.ts`. Do not import from internal
  files across module boundaries.
- **No circular imports.** Dependency direction:
  `cli` → `runner` / `multi-source-runner` → `adapters`, `staging`, `dq`,
  `transform`, `merge`, `plugins`, `config`, `enrich`. `plugins/` is imported by
  `runner`, `dq`, `transform`, and `merge`; it must not import any of them.
  `enrich/` is type-only (no runtime imports of other modules in this repo —
  the implementation lives in the private `@caracal-lynx/sluice-enrich`).
  Utils are imported by everyone.
- **Path aliases:** `@/` → `src/` in tsconfig.
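The typed-error convention above can be illustrated with a minimal sketch. The class and helper names here are hypothetical; the real error types live in `src/utils/errors.ts` and may carry different fields.

```typescript
// Hypothetical sketch of a typed pipeline error; not the actual
// src/utils/errors.ts definitions.
class PipelineError extends Error {
  constructor(
    message: string,
    readonly phase: 'extract' | 'dq' | 'transform' | 'load',
  ) {
    super(message);
    this.name = 'PipelineError';
  }
}

// Callers narrow with instanceof instead of matching on message strings.
function describeError(err: unknown): string {
  return err instanceof PipelineError
    ? `${err.phase} failed: ${err.message}`
    : 'unknown error';
}
```

Because the error carries a typed `phase`, the `PipelineRunner` boundary can log and report failures without string parsing.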
---
## ═══════════════════════════════════════════════════════════
## YAML PIPELINE CONFIG SPECIFICATION
## ═══════════════════════════════════════════════════════════

Every pipeline is a single YAML file. One file = one migrated entity
(e.g. customers, items, vendors, styles, purchase orders).

### Top-level structure

```yaml
pipeline: { ... }   # identity and metadata
source: { ... }     # where to read from
enrich: { ... }     # OPTIONAL — Phase 4a; external API lookups (private)
dq: { ... }         # data quality rules
transform: { ... }  # field mappings and lookups
target: { ... }     # where to write to
run: { ... }        # execution options (all fields optional; all have defaults)
```
> **Phase 4a — Enrich Phase (private):** the `enrich:` block, when present, runs after Extract (and after Merge for multi-source pipelines) and before DQ. The framework that drives it lives in the **private** `@caracal-lynx/sluice-enrich` package — the open-source core only ships the Zod schema, the public `EnrichPlugin` interface (`src/enrich/types.ts`), and the `registerEnrichPhase()` injection hook on `PipelineRunner`. With `sluice-enrich` not installed, an `enrich:` block is parsed and validated but the phase is skipped with a `WARN` log. See [docs/PHASE-04-enrich-phase.md](docs/PHASE-04-enrich-phase.md) for the full spec.

---
### `pipeline` section

```yaml
pipeline:
  name: acme-corp-customers  # REQUIRED. Slug: lowercase, hyphens only.
                             # Used in output filenames and log messages.
  client: acme-corp          # REQUIRED. Client identifier.
  version: "1.0"             # REQUIRED. Quote to ensure string type.
  entity: CustomerInfo       # REQUIRED. Logical entity name (used in
                             # load reports and target adapter metadata).
  description: >             # Optional. Human-readable description.
    Customer master migration —
    legacy SQL to IFS ERP
```
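The slug constraint on `name` ("lowercase, hyphens only") can be captured by a single regex. A minimal sketch with a hypothetical helper name; the actual check is part of the Zod config schema:

```typescript
// Hypothetical sketch: lowercase alphanumeric segments separated by single
// hyphens, matching the `name` comment above. The real validation lives in
// src/config/schema.ts as a Zod refinement.
const PIPELINE_NAME = /^[a-z0-9]+(-[a-z0-9]+)*$/;

function isValidPipelineName(name: string): boolean {
  return PIPELINE_NAME.test(name);
}
```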
---
### `source` section

Exactly one of `query`, `file`, or `endpoint` must be present.

```yaml
source:
  adapter: mssql                   # REQUIRED. One of: mssql | pg | csv | xlsx | rest

  # ── SQL adapters (mssql, pg) ──────────────────────────────
  connection: ${SOURCE_MSSQL}      # Connection string from .env.
                                   # mssql: mssql://user:pass@host/database
                                   # Or a JSON string for trusted/advanced config.
  query: |
    SELECT c.CUST_CODE, c.CUST_NAME, c.POST_CODE
    FROM dbo.Customers c
    WHERE c.Active = 1

  # ── CSV adapter ───────────────────────────────────────────
  file: ./data/customers.csv       # Path or glob (./data/export-*.csv).
  delimiter: ","                   # Default: ","
  encoding: utf-8                  # Default: utf-8

  # ── XLSX adapter ──────────────────────────────────────────
  file: ./data/customers.xlsx
  sheet: "Customer Export"         # Sheet name or 0-based index. Default: 0.

  # ── REST adapter ──────────────────────────────────────────
  endpoint: ${API_BASE}/customers  # Full URL. ${ENV_VAR} resolved at runtime.
  headers:                         # Optional. Added to every request.
    Authorization: Bearer ${API_TOKEN}
    Accept: application/json
  pagination:                      # Optional. Omit for single-page responses.
    type: offset                   # offset | cursor | page
    pageSize: 100
    pageParam: skip                # Query param name for the offset/page value.
    totalField: data.total         # Dot-path to total count in response body.
    dataField: data.items          # Dot-path to the records array.
    cursorField: nextCursor        # For cursor pagination: field in response body.
    cursorParam: cursor            # For cursor pagination: query param name.
```
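The `${ENV_VAR}` placeholders above are substituted by the config loader before Zod parsing. A minimal sketch of that interpolation, using a hypothetical helper name (the real code is in `src/config/loader.ts` and may differ in error handling):

```typescript
// Hypothetical sketch of ${ENV_VAR} interpolation, not the actual loader.
// Missing variables fail fast rather than silently emitting an empty string.
function interpolateEnv(
  raw: string,
  env: Record<string, string | undefined> = process.env,
): string {
  return raw.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match, name: string) => {
    const value = env[name];
    if (value === undefined) {
      throw new Error(`Missing environment variable: ${name}`);
    }
    return value;
  });
}
```

Failing fast on a missing variable keeps a typo in `.env` from producing a half-formed connection string or endpoint URL.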
---
### `dq` section

```yaml
dq:
  stopOnCritical: true  # Default: true. Halt pipeline if any critical rule fails.
  rejectionFile: ./output/acme-corp-customers-rejected.csv
                        # Default: ./output/{pipeline.name}-rejected.csv

  rules:
    - field: FIELD_NAME  # Source column name (pre-transform).
      checks:

        # notNull — fails if null, undefined, empty string, or whitespace-only
        - type: notNull
          severity: critical

        # unique — fails if value appears more than once across the full dataset
        - type: unique
          severity: critical

        # pattern — ECMAScript regex, tested with new RegExp(value)
        - type: pattern
          value: "^[A-Z0-9]{3,10}$"
          severity: warning
          message: "Must be 3-10 uppercase alphanumeric characters"
          # message is optional; overrides default.

        # email — RFC 5322-ish email validation
        - type: email
          severity: warning

        # ukPostcode — all current UK postcode formats; strips spaces before testing
        - type: ukPostcode
          severity: warning

        # maxLength — maximum string length (integer)
        - type: maxLength
          value: 100
          severity: warning

        # min / max — numeric comparison; coerces value to float
        - type: min
          value: 0
          severity: critical
        - type: max
          value: 500000
          severity: warning

        # allowedValues — case-sensitive array of permitted string values
        - type: allowedValues
          value: [GB, IE, US, DE, FR]
          severity: warning

# Severity:
#   critical   row is rejected; pipeline halts if stopOnCritical: true
#   warning    row is flagged in rejection report but NOT removed from output
#   info       recorded in summary JSON only
```
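The severity semantics above can be sketched in a few lines. The types and helpers here are hypothetical stand-ins; the real `Rule`/`RuleViolation` interfaces live under `src/dq/rules/`:

```typescript
// Hypothetical types mirroring the severity table above; not the real
// src/dq/rules/types.ts definitions.
type Severity = 'critical' | 'warning' | 'info';

interface Violation {
  field: string;
  severity: Severity;
  message: string;
}

// notNull as documented: null, undefined, '' and whitespace-only all fail.
function checkNotNull(
  field: string,
  value: unknown,
  severity: Severity,
): Violation | null {
  const isEmpty =
    value === null ||
    value === undefined ||
    (typeof value === 'string' && value.trim() === '');
  return isEmpty ? { field, severity, message: `${field} must not be null` } : null;
}

// Only critical violations reject the row; warnings merely flag it.
function isRejected(violations: Violation[]): boolean {
  return violations.some((v) => v.severity === 'critical');
}
```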
---
### `transform` section

```yaml
transform:

  # ── Lookup tables ─────────────────────────────────────────
  # Loaded once at start of transform phase, cached in memory.
  lookups:
    - name: currencyMap        # Referenced by field mappings.
      source:                  # Any source adapter works here.
        adapter: csv
        file: ./lookups/currency-codes.csv
      key: legacyCode          # Column to match against source value.
      value: isoCode           # Column to return as resolved value.

    - name: acctMgrMap
      source:
        adapter: mssql
        connection: ${SOURCE_MSSQL}
        query: "SELECT STAFF_ID as key, IFS_USER_ID as value FROM dbo.Staff"
      key: key
      value: value

  # ── Field mappings ────────────────────────────────────────
  fields:

    # type: string
    - from: CUST_CODE
      to: CustomerNo
      type: string
      max: 20                  # Optional. Truncate after cleanse.

    - from: CUST_NAME
      to: Name
      type: string
      max: 100
      cleanse: trim|titleCase  # Pipe-separated cleanse ops. See table below.

    # type: number — coerce to integer; throws if NaN
    - from: QTY
      to: Quantity
      type: number

    # type: decimal — fixed precision; stored as string in staging
    - from: CREDIT_LIMIT
      to: CreditLimit
      type: decimal
      precision: 2             # Default: 2

    # type: boolean
    # Truthy: '1','true','yes','y','t' (case-insensitive). All else false.
    - from: IS_ACTIVE
      to: Active
      type: boolean

    # type: date — parse source date, output as dateFormat (default ISO)
    - from: START_DATE
      to: StartDate
      type: date
      format: DD/MM/YYYY       # Optional source parse format (dayjs tokens).

    # type: lookup — resolve via a named lookup table
    - from: CURRENCY
      to: CurrencyCode
      type: lookup
      lookup: currencyMap      # Must match a lookup name above.
      default: GBP             # Emitted when lookup key not found.
      optional: false          # Default: false. true = null on miss (no error).

    # type: concat — join multiple source fields
    - from: [ADDR1, ADDR2]     # Array of source field names.
      to: Address1
      type: concat
      separator: ", "          # Default: " "
      cleanse: trim|nullIfEmpty

    # type: constant — emit a fixed value regardless of source data
    - to: CustomerGroup
      type: constant
      value: DOMESTIC

    # type: expression — evaluate against source row
    - to: SearchName
      type: expression
      value: "row.CUST_NAME.toUpperCase().substring(0, 20)"
      # For logic beyond expr-eval, prefix with js:
      # value: "js: row.PRICE * (1 - row.DISCOUNT / 100)"

# Common optional field properties:
#   optional: true   null result does not cause a pipeline error
#   default: <val>   fallback value if source is null/empty
#   max: <n>         truncate string to n chars AFTER cleanse
```
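The `js:` escape hatch above runs in a Node `vm` sandbox rather than `eval()` (which the conventions forbid); unprefixed expressions go through expr-eval and are omitted here. A rough sketch of the `js:` branch with a hypothetical helper name:

```typescript
import vm from 'node:vm';

// Hypothetical sketch of the `js:` branch of the expression evaluator
// (the real code is in src/transform/expression.ts). The sandbox context
// exposes only `row` — no require, no process, no globals.
function evaluateJsExpression(expr: string, row: Record<string, unknown>): unknown {
  const body = expr.slice('js:'.length).trim();
  // Fresh context per call; timeout guards against runaway expressions.
  return vm.runInNewContext(body, { row }, { timeout: 100 });
}
```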
#### Cleanse operations reference

Applied left-to-right in the pipe chain. Defined in `src/transform/cleanse.ts`.

| Op | Example input | Example output |
|---|---|---|
| `trim` | `" hello "` | `"hello"` |
| `uppercase` | `"hello"` | `"HELLO"` |
| `lowercase` | `"HELLO"` | `"hello"` |
| `titleCase` | `"john smith"` | `"John Smith"` |
| `stripNonAlpha` | `"AB-12!"` | `"AB"` |
| `stripNonNumeric` | `"AB-12!"` | `"12"` |
| `stripWhitespace` | `"h e l l o"` | `"hello"` |
| `padStart:6:0` | `"42"` | `"000042"` |
| `truncate:20` | 21-char string | 20-char string |
| `nullIfEmpty` | `""` | `null` |
| `normaliseQuotes` | `"it\u2019s"` | `"it's"` |
| `normaliseUnicode` | `"caf\u00e9"` | `"cafe"` (NFD→ASCII) |
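How a pipe-separated chain like `trim|titleCase` gets applied can be sketched as below. This is a hypothetical `applyCleanse` covering only a few of the ops above; the real implementations live in `src/transform/cleanse.ts`:

```typescript
// Hypothetical sketch of a cleanse chain. Ops run left-to-right; an op may
// take colon-separated args, e.g. padStart:6:0. Only a subset of ops shown.
type CleanseOp = (value: string | null, args: string[]) => string | null;

const OPS: Record<string, CleanseOp> = {
  trim: (v) => (v === null ? null : v.trim()),
  uppercase: (v) => (v === null ? null : v.toUpperCase()),
  titleCase: (v) =>
    v === null ? null : v.toLowerCase().replace(/\b[a-z]/g, (c) => c.toUpperCase()),
  padStart: (v, [len, fill]) => (v === null ? null : v.padStart(Number(len), fill)),
  nullIfEmpty: (v) => (v === '' ? null : v),
};

function applyCleanse(value: string | null, chain: string): string | null {
  return chain.split('|').reduce((acc, step) => {
    const [name, ...args] = step.split(':');
    const op = OPS[name];
    if (!op) throw new Error(`Unknown cleanse op: ${name}`);
    return op(acc, args);
  }, value);
}
```

Note how `trim|nullIfEmpty` composes: trimming a whitespace-only value first makes it empty, so `nullIfEmpty` then turns it into `null`.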
---
### `target` section

```yaml
target:
  adapter: ifs             # REQUIRED. One of:
                           # bc | ifs | bluecherry | csv | pg | rest

  # ── IFS adapter ───────────────────────────────────────────
  adapter: ifs
  output: ./output/acme-corp-customers-ifs.csv
  entity: CustomerInfo     # IFS entity name (used in import log).
  includeHeader: false     # Default: false (standard IFS import format).
  columnOrder:             # Optional. Forces specific column ordering.
    - CustomerNo           # Must match transform 'to' field names.
    - Name
    - Address1
  dateFormat: YYYY-MM-DD   # Default: YYYY-MM-DD
  delimiter: ","           # Default: ","
  encoding: utf-8          # Default: utf-8

  # ── BlueCherry adapter ────────────────────────────────────
  adapter: bluecherry
  entity: Style            # REQUIRED. One of: Style | Vendor |
                           # PurchaseOrder | PODetail | Season | ColourSize
  output: ./output/style-co-styles-bc.csv
  template: default        # Optional. 'default' uses built-in required
                           # columns. Or path to a header-only template CSV
                           # whose first row defines column order.
  includeHeader: true      # Default: true (BlueCherry expects headers).
  dateFormat: MM/DD/YYYY   # Default: MM/DD/YYYY (BlueCherry is US-origin).
  delimiter: ","
  encoding: utf-8
  nullValue: ""            # How nulls are rendered. Default: ""

  # ── Business Central REST adapter ─────────────────────────
  adapter: bc
  baseUrl: ${BC_BASE_URL}
  company: ${BC_COMPANY}
  entity: customers        # OData entity name (lowercase, plural).
  apiVersion: v2.0         # Default: v2.0
  onConflict: fail         # fail | upsert. Default: fail.
  batchEndpoint: true      # Use OData $batch. Default: true.

  # ── Generic CSV adapter ───────────────────────────────────
  adapter: csv
  output: ./output/data.csv
  includeHeader: true
  delimiter: ","
  encoding: utf-8
  nullValue: ""

  # ── PostgreSQL adapter ────────────────────────────────────
  adapter: pg
  connection: ${TARGET_PG}
  table: customers
  schema: public           # Default: public
  onConflict: fail         # fail | upsert | ignore
  upsertKey: [customer_no] # REQUIRED if onConflict: upsert
```
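How `columnOrder` and `nullValue` interact when a row is written can be sketched as follows. The helper name is hypothetical; the real CSV adapters delegate the actual serialisation to `csv-stringify`:

```typescript
// Hypothetical sketch: project a transformed row onto the configured
// columnOrder, rendering null/undefined cells as nullValue. Columns the
// transform did not populate also come out as nullValue.
function projectRow(
  row: Record<string, unknown>,
  columnOrder: string[],
  nullValue = '',
): string[] {
  return columnOrder.map((col) => {
    const value = row[col];
    return value === null || value === undefined ? nullValue : String(value);
  });
}
```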
---
### `run` section

All fields optional. Shown with defaults.

```yaml
run:
  mode: full                   # full | incremental | validate-only
  batchSize: 500               # Rows per DuckDB insert batch.
  onError: continue            # continue | stop
  logLevel: info               # debug | info | warn | error
  dryRun: false                # true: DQ + transform, no output written.
  outputDir: ./output          # Base directory for all output files.
  stagingDb: ""                # DuckDB path. Default: {outputDir}/{name}.duckdb
                               # Set ':memory:' to force in-memory mode.
  incrementalField: UPDATED_AT # Source field for incremental mode.
  incrementalSince: ""         # ISO datetime. If empty, reads from state file.
```
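The `stagingDb` default above resolves as in this small sketch (hypothetical helper name):

```typescript
// Hypothetical sketch of the stagingDb default: an empty string falls back
// to {outputDir}/{name}.duckdb, and ':memory:' forces DuckDB in-memory mode.
function resolveStagingDb(stagingDb: string, outputDir: string, name: string): string {
  if (stagingDb === ':memory:') return ':memory:';
  return stagingDb === '' ? `${outputDir}/${name}.duckdb` : stagingDb;
}
```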
---
### Full example — Acme Corp customers (MSSQL → IFS)

```yaml
pipeline:
  name: acme-corp-customers
  client: acme-corp
  version: "1.0"
  entity: CustomerInfo
  description: Customer master — legacy Sage SQL to IFS ERP

source:
  adapter: mssql
  connection: ${SOURCE_MSSQL}
  query: |
    SELECT
      c.CUST_CODE, c.CUST_NAME, c.ADDR1, c.ADDR2,
      c.POST_CODE, c.COUNTRY, c.EMAIL, c.TEL,
      c.CREDIT_LIMIT, c.CURRENCY, c.ACCT_MGR_ID
    FROM dbo.Customers c
    WHERE c.Active = 1 AND c.DELETED = 0

dq:
  stopOnCritical: true
  rejectionFile: ./output/acme-corp-customers-rejected.csv
  rules:
    - field: CUST_CODE
      checks:
        - { type: notNull, severity: critical }
        - { type: unique, severity: critical }
        - { type: pattern, value: "^[A-Z0-9]{3,10}$", severity: warning }
    - field: CUST_NAME
      checks:
        - { type: notNull, severity: critical }
        - { type: maxLength, value: 100, severity: warning }
    - field: POST_CODE
      checks:
        - { type: ukPostcode, severity: warning }
    - field: EMAIL
      checks:
        - { type: email, severity: warning }
    - field: CREDIT_LIMIT
      checks:
        - { type: min, value: 0, severity: critical }
        - { type: max, value: 500000, severity: warning }
    - field: COUNTRY
      checks:
        - { type: allowedValues, value: [GB, IE, US, DE, FR], severity: warning }

transform:
  lookups:
    - name: currencyMap
      source: { adapter: csv, file: ./lookups/currency-codes.csv }
      key: legacyCode
      value: isoCode
    - name: acctMgrMap
      source:
        adapter: mssql
        connection: ${SOURCE_MSSQL}
        query: "SELECT STAFF_ID as key, IFS_USER_ID as value FROM dbo.Staff"
      key: key
      value: value
  fields:
    - { from: CUST_CODE, to: CustomerNo, type: string, max: 20 }
    - { from: CUST_NAME, to: Name, type: string, max: 100, cleanse: trim|titleCase }
    - { from: [ADDR1, ADDR2], to: Address1, type: concat, separator: ", ", cleanse: trim }
    - { from: POST_CODE, to: ZipCode, type: string, cleanse: trim|uppercase }
    - { from: COUNTRY, to: Country, type: string, default: GB }
    - { from: CURRENCY, to: CurrencyCode, type: lookup, lookup: currencyMap, default: GBP }
    - { from: ACCT_MGR_ID, to: SalesmanCode, type: lookup, lookup: acctMgrMap, optional: true }
    - { from: CREDIT_LIMIT, to: CreditLimit, type: decimal, precision: 2 }
    - { from: EMAIL, to: Email, type: string, cleanse: trim|lowercase }
    - { to: CustomerGroup, type: constant, value: DOMESTIC }
    - { to: SearchName, type: expression, value: "row.CUST_NAME.toUpperCase().substring(0, 20)" }

target:
  adapter: ifs
  entity: CustomerInfo
  output: ./output/acme-corp-customers-ifs.csv
  includeHeader: false
  columnOrder: [CustomerNo, Name, Address1, ZipCode, Country, CurrencyCode,
                SalesmanCode, CreditLimit, Email, CustomerGroup, SearchName]

run:
  mode: full
  batchSize: 500
  logLevel: info
  dryRun: false
```
---
### Full example — Style Co styles (CSV → BlueCherry)

```yaml
pipeline:
  name: style-co-styles
  client: style-co
  version: "1.0"
  entity: Style
  description: Style master migration from legacy CSV exports to BlueCherry ERP

source:
  adapter: csv
  file: ./data/styles-export.csv
  encoding: utf-8

dq:
  stopOnCritical: true
  rejectionFile: ./output/style-co-styles-rejected.csv
  rules:
    - field: STYLE_NO
      checks:
        - { type: notNull, severity: critical }
        - { type: unique, severity: critical }
        - { type: maxLength, value: 20, severity: warning }
    - field: STYLE_DESC
      checks:
        - { type: notNull, severity: critical }
        - { type: maxLength, value: 255, severity: warning }
    - field: DIVISION
      checks:
        - { type: notNull, severity: critical }
        - { type: allowedValues, value: [WOMENS, MENS, ACCESSORIES], severity: warning }
    - field: SEASON_CODE
      checks:
        - { type: notNull, severity: warning }
        - { type: pattern, value: "^(SS|AW)[0-9]{2}$", severity: warning }
    - field: COST_PRICE
      checks:
        - { type: min, value: 0, severity: critical }
        - { type: max, value: 9999.99, severity: warning }
    - field: RETAIL_PRICE
      checks:
        - { type: min, value: 0, severity: critical }

transform:
  lookups:
    - name: divisionMap
      source: { adapter: csv, file: ./lookups/division-codes.csv }
      key: legacyCode
      value: bcCode
    - name: vendorMap
      source: { adapter: csv, file: ./lookups/vendor-codes.csv }
      key: legacyVendorCode
      value: bcVendorNo
  fields:
    - { from: STYLE_NO, to: StyleNo, type: string, max: 20, cleanse: trim|uppercase }
    - { from: STYLE_DESC, to: StyleDesc, type: string, max: 255, cleanse: trim|normaliseUnicode }
    - { from: DIVISION, to: Division, type: lookup, lookup: divisionMap }
    - { from: SEASON_CODE, to: Season, type: string, max: 10 }
    - { from: VENDOR_CODE, to: VendorNo, type: lookup, lookup: vendorMap, optional: true }
    - { from: COST_PRICE, to: CostPrice, type: decimal, precision: 2 }
    - { from: RETAIL_PRICE, to: RetailPrice, type: decimal, precision: 2 }
    - { from: WEIGHT_KG, to: Weight, type: decimal, precision: 3, default: "0.000" }
    - { from: COUNTRY_ORIG, to: CountryOrigin, type: string, default: GB }
    - { from: FIBRE_CONTENT, to: FibreContent, type: string, max: 200, cleanse: trim }
    - { to: ActiveFlag, type: constant, value: "Y" }
    - { to: CreatedDate, type: expression, value: "js: new Date().toLocaleDateString('en-US')" }

target:
  adapter: bluecherry
  entity: Style
  output: ./output/style-co-styles-bc.csv
  includeHeader: true
  dateFormat: MM/DD/YYYY
  nullValue: ""

run:
  mode: full
  batchSize: 200
  logLevel: info
  dryRun: false
```

---

## ═══════════════════════════════════════════════════════════
## MULTI-SOURCE PIPELINES (Phase 3)
## ═══════════════════════════════════════════════════════════

A multi-source pipeline replaces the single `source:` block with a top-level
`sources:` array (min 2 entries) plus a `merge:` block. The rest of the YAML
(`pipeline`, `dq`, `transform`, `target`, `run`) is unchanged. `PipelineSchema`
requires *either* `source` (single) *or* both `sources` + `merge` (multi) —
never both — and the CLI auto-routes multi-source configs to
`MultiSourcePipelineRunner` (see `src/cli.ts:createRunnerForPipeline`).

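The routing decision can be sketched as a standalone check. This is a minimal sketch: the real type guards are exported from `src/config/schema.ts`, `createRunnerForPipeline` lives in `src/cli.ts`, and the trimmed `PipelineLike` shape here is illustrative only.

```typescript
// Minimal sketch of the single- vs multi-source routing decision.
// PipelineLike is trimmed to the fields the guard actually inspects.
type PipelineLike = {
  source?: unknown;
  sources?: unknown[];
  merge?: unknown;
};

function isMultiSource(p: PipelineLike): boolean {
  // Multi-source mode requires both `sources` (min 2) and `merge`.
  return Array.isArray(p.sources) && p.sources.length >= 2 && p.merge !== undefined;
}

function runnerNameFor(p: PipelineLike): string {
  // Multi-source configs get MultiSourcePipelineRunner, everything
  // else the base PipelineRunner.
  return isMultiSource(p) ? 'MultiSourcePipelineRunner' : 'PipelineRunner';
}
```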
### Top-level layout

```yaml
pipeline: { ... }
sources: [ { ... }, { ... } ]   # REQUIRED in multi-source mode; min 2 entries
merge: { ... }                  # REQUIRED when `sources` is present
dq: { ... }
transform: { ... }
target: { ... }
run: { ... }
```

### `sources` entries

Each entry is a `SourceConfig` with three extra multi-source-only fields:

```yaml
sources:
  - id: sql-server              # REQUIRED. Lowercase alphanumeric + hyphens only;
                                # must be unique across the array; used as the
                                # staging table suffix (stg_raw_sql-server).
    priority: 1                 # REQUIRED. Positive integer. Lower number =
                                # higher precedence in coalesce / priority-override.
    adapter: mssql
    connection: ${SOURCE_2_MSSQL}
    query: |
      SELECT STYLE_NO, STYLE_DESC, COST_PRICE FROM dbo.Styles WHERE Active = 1

  - id: excel
    priority: 2
    adapter: xlsx
    file: ./data/product-data.xlsx
    sheet: "Products"
    rename:                     # Optional. { 'old column': 'new column' }.
      Style Number: STYLE_NO    # Applied in-place after extract, before DQ and
      Description: STYLE_DESC   # merge. Intended for CSV/XLSX sources where
      Fibre: FIBRE_CONTENT      # column headers are fixed; SQL/REST sources
                                # should rename in the query or field selection.
                                # Unknown keys are logged as warnings, not errors.
```

### `merge` block

```yaml
merge:
  key: STYLE_NO                 # REQUIRED. Single column name or array of
                                # columns (composite key). Must exist in every
                                # source after `rename` is applied.

  strategy: coalesce            # Default: coalesce. One of:
                                #   coalesce          first non-null value wins
                                #                     (priority-ordered; whitespace
                                #                     treated as blank)
                                #   priority-override highest-priority source
                                #                     wins (even if null/blank)
                                #   union             all rows from all sources
                                #                     (dedupe by key)
                                #   intersect         only rows present in ALL
                                #                     sources

  onUnmatched: include          # Default: include. One of:
                                #   include (default) keep unmatched rows
                                #   exclude           drop them
                                #   warn              keep and log a warning
                                #   error             fail the pipeline
                                # Ignored by `intersect`, which always excludes.

  fieldStrategies:              # Optional. Per-field overrides of the
                                # top-level strategy.
    - field: FIBRE_CONTENT
      source: excel             # Force this field to always come from the
                                # named source, ignoring priority.
    - field: COST_PRICE
      strategy: priority-override   # Override just this field's strategy.

  conflictLog: ./output/style-co-products-conflicts.csv
                                # Optional. CSV of (key, field, winning_source,
                                # winning_value, source_values). Only written
                                # when at least one conflict is detected.

  incrementalSource: sql-server # REQUIRED when `run.mode: incremental`.
                                # Must match one of the source `id` values.
                                # The named source is filtered by
                                # `run.incrementalField` / state-file lastRunAt;
                                # other sources run full each time.
```

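As an illustration of the `coalesce` semantics, the same rule reduced to in-memory rows (the real strategy is implemented as SQL against the DuckDB staging tables; `coalesceField` is a hypothetical helper, not toolkit code):

```typescript
// Illustrative sketch of coalesce: rows must be passed priority-ordered
// (priority 1 first); the first non-null, non-blank value wins, with
// whitespace-only values treated as blank.
type Row = Record<string, string | null>;

function coalesceField(field: string, priorityOrderedRows: Row[]): string | null {
  for (const row of priorityOrderedRows) {
    const v = row[field];
    if (v !== null && v !== undefined && v.trim() !== '') return v;
  }
  return null; // no source supplied a usable value
}
```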
### Multi-source DQ rules

`dq.rules[].sourceId` (optional) scopes a rule to a specific pre-merge source
table. Rules without `sourceId` run post-merge against `stg_merged`:

```yaml
dq:
  stopOnCritical: true
  rules:
    - field: STYLE_NO           # Pre-merge: runs against stg_raw_sql-server only.
      sourceId: sql-server
      checks:
        - { type: notNull, severity: critical }
        - { type: unique, severity: critical }

    - field: STYLE_DESC         # Post-merge: runs against stg_merged.
      checks:
        - { type: notNull, severity: critical }
        - { type: maxLength, value: 255, severity: warning }
```

Per-source rejection files are auto-named by appending `-{sourceId}` to the
configured `rejectionFile` stem. Rows failing a critical pre-merge rule are
filtered out of that source's staging table *before* the merge phase.

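The naming rule can be sketched as follows (the helper name is hypothetical; the real logic lives in the multi-source runner):

```typescript
import * as path from 'node:path';

// Sketch of the per-source rejection-file naming rule: append
// '-{sourceId}' to the configured rejectionFile stem, keeping
// the extension.
function perSourceRejectionFile(rejectionFile: string, sourceId: string): string {
  const ext = path.extname(rejectionFile);                            // e.g. '.csv'
  const stem = rejectionFile.slice(0, rejectionFile.length - ext.length);
  return `${stem}-${sourceId}${ext}`;
}
```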
### Full example

See [tests/fixtures/style-co-products-merged.pipeline.yaml](tests/fixtures/style-co-products-merged.pipeline.yaml)
for a complete, tested multi-source pipeline (MSSQL + REST + XLSX → BlueCherry
with `coalesce` + `fieldStrategies` + `incrementalSource`).

### Invocation

```bash
sluice check tests/fixtures/style-co-products-merged.pipeline.yaml
sluice run tests/fixtures/style-co-products-merged.pipeline.yaml
sluice merge list-strategies
sluice merge info coalesce
```

---

## ═══════════════════════════════════════════════════════════
## ZOD SCHEMA (src/config/schema.ts)
## ═══════════════════════════════════════════════════════════

Reproduce this schema exactly. Do not invent additional fields or rename enums.

```typescript
import { z } from 'zod';

const Severity = z.enum(['critical', 'warning', 'info']);
const SourceAd = z.enum(['mssql', 'pg', 'csv', 'xlsx', 'rest']);
const TargetAd = z.enum(['bc', 'ifs', 'bluecherry', 'csv', 'pg', 'rest']);
const CleanseOps = z.string().regex(/^[a-zA-Z|:0-9]+$/);

const PaginationSchema = z.object({
  type: z.enum(['offset', 'cursor', 'page']),
  pageSize: z.number().int().positive().default(100),
  pageParam: z.string().optional(),
  totalField: z.string().optional(),
  dataField: z.string().optional(),
  cursorField: z.string().optional(),
  cursorParam: z.string().optional(),
});

export const SourceSchema = z.object({
  adapter: SourceAd,
  connection: z.string().optional(),
  query: z.string().optional(),
  file: z.string().optional(),
  endpoint: z.string().optional(),
  headers: z.record(z.string()).optional(),
  delimiter: z.string().default(','),
  encoding: z.string().default('utf-8'),
  sheet: z.union([z.string(), z.number()]).optional(),
  pagination: PaginationSchema.optional(),
}).refine(s => s.query || s.file || s.endpoint,
  { message: 'source must have query, file, or endpoint' });

const CheckType = z.enum([
  'notNull', 'unique', 'pattern', 'email', 'ukPostcode',
  'maxLength', 'min', 'max', 'allowedValues',
]);

const CheckSchema = z.object({
  type: CheckType,
  value: z.union([z.string(), z.number(), z.array(z.string())]).optional(),
  severity: Severity,
  message: z.string().optional(),
});

const DqRuleSchema = z.object({
  field: z.string(),
  checks: z.array(CheckSchema).min(1),
});

export const DqSchema = z.object({
  stopOnCritical: z.boolean().default(true),
  rejectionFile: z.string().optional(),
  rules: z.array(DqRuleSchema).default([]),
});

const LookupSchema = z.object({
  name: z.string(),
  source: SourceSchema,
  key: z.string(),
  value: z.string(),
});

const FieldType = z.enum([
  'string', 'number', 'decimal', 'boolean', 'date',
  'lookup', 'concat', 'constant', 'expression',
]);

const FieldMappingSchema = z.object({
  from: z.union([z.string(), z.array(z.string())]).optional(),
  to: z.string(),
  type: FieldType,
  max: z.number().optional(),
  precision: z.number().optional(),
  format: z.string().optional(),
  cleanse: CleanseOps.optional(),
  lookup: z.string().optional(),
  separator: z.string().optional(),
  value: z.union([z.string(), z.number(), z.boolean()]).optional(),
  default: z.union([z.string(), z.number(), z.boolean(), z.null()]).optional(),
  optional: z.boolean().default(false),
});

export const TransformSchema = z.object({
  lookups: z.array(LookupSchema).default([]),
  fields: z.array(FieldMappingSchema).min(1),
});

export const TargetSchema = z.object({
  adapter: TargetAd,
  output: z.string().optional(),
  entity: z.string().optional(),
  includeHeader: z.boolean().optional(),
  columnOrder: z.array(z.string()).optional(),
  dateFormat: z.string().optional(),
  delimiter: z.string().default(','),
  encoding: z.string().default('utf-8'),
  nullValue: z.string().default(''),
  template: z.string().optional(),
  // BC REST
  baseUrl: z.string().optional(),
  company: z.string().optional(),
  apiVersion: z.string().default('v2.0'),
  onConflict: z.enum(['fail', 'upsert', 'ignore']).default('fail'),
  upsertKey: z.array(z.string()).optional(),
  batchEndpoint: z.boolean().default(true),
  // PostgreSQL
  connection: z.string().optional(),
  table: z.string().optional(),
  schema: z.string().default('public'),
});

export const RunSchema = z.object({
  mode: z.enum(['full', 'incremental', 'validate-only']).default('full'),
  batchSize: z.number().int().positive().default(500),
  onError: z.enum(['continue', 'stop']).default('continue'),
  logLevel: z.enum(['debug', 'info', 'warn', 'error']).default('info'),
  dryRun: z.boolean().default(false),
  outputDir: z.string().default('./output'),
  stagingDb: z.string().default(''),
  // Phase 4a — enrich tuning (consumed by @caracal-lynx/sluice-enrich)
  enrichConcurrency: z.number().int().positive().default(5),
  enrichTimeoutMs: z.number().int().positive().default(5000),
  enrichMaxRetries: z.number().int().min(0).max(5).default(3),
  incrementalField: z.string().optional(),
  incrementalSince: z.string().optional(),
});

export const PipelineSchema = z.object({
  pipeline: z.object({
    name: z.string().regex(/^[a-z0-9-]+$/),
    client: z.string(),
    version: z.string(),
    entity: z.string(),
    description: z.string().optional(),
  }),
  source: SourceSchema,
  enrich: EnrichSchema.optional(), // Phase 4a — runs between Extract/Merge and DQ
  dq: DqSchema,
  transform: TransformSchema,
  target: TargetSchema,
  run: RunSchema.default({}),
});

// Inferred types — use these everywhere; do not write manual interfaces.
export type Pipeline = z.infer<typeof PipelineSchema>;
export type SourceConfig = z.infer<typeof SourceSchema>;
export type TargetConfig = z.infer<typeof TargetSchema>;
export type RunConfig = z.infer<typeof RunSchema>;
export type FieldMapping = z.infer<typeof FieldMappingSchema>;
export type DqRule = z.infer<typeof DqRuleSchema>;
export type Lookup = z.infer<typeof LookupSchema>;
```

### Phase 2 schema additions (already in `src/config/schema.ts`)

The following are forward-looking additions that extend the canonical schema above.
They are live in the codebase and tested. Do not remove them.

- **`DqSchema.rulesFile`** (`z.string().optional()`) — path to a composite rule
  library YAML file. `ConfigLoader` expands composite rule references into
  built-in check types before Zod validation, so the pipeline runner only sees
  standard checks.
- **`FieldType` includes `'custom'`** — delegates to a `TransformPlugin` via
  `customOp`. Requires `customOp` to be set (enforced by a `.refine()`).
- **`FieldMappingSchema.customOp`** (`z.string().optional()`) — plugin ID for
  `type: custom` fields.
- **`FieldMappingSchema.options`** (`z.record(z.unknown()).optional()`) — arbitrary
  per-plugin config passed through to the transform plugin.
- **`FieldMappingSchema` refinement** — field types in `TYPES_REQUIRING_FROM`
  (`string`, `number`, `decimal`, `boolean`, `date`, `lookup`, `concat`) must
  declare `from`. Only `constant`, `expression`, and `custom` may omit it.
- **`TargetSchema` refinement** — when `onConflict: 'upsert'`, a non-empty
  `upsertKey` is required (checked at config-parse time).
- **`ToolkitConfigSchema`** — schema for `sluice.config.yaml` (toolkit-level
  plugin loading). Consumed by `PipelineRunner.loadAllPlugins()` via
  `plugins/loader.ts → loadNpmPlugins()` at the start of every run.
- **`CompositeRuleSchema` / `CompositeRuleLibrarySchema`** — schemas for the
  shared rule library YAML files referenced by `dq.rulesFile`.

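The `TYPES_REQUIRING_FROM` refinement reduces to a membership check. A minimal standalone sketch (not the actual `.refine()` body):

```typescript
// Field types that must declare `from`; only constant, expression,
// and custom may omit it.
const TYPES_REQUIRING_FROM = new Set([
  'string', 'number', 'decimal', 'boolean', 'date', 'lookup', 'concat',
]);

function fieldMappingHasRequiredFrom(m: { type: string; from?: unknown }): boolean {
  // Valid when the type doesn't need `from`, or `from` is present.
  return !TYPES_REQUIRING_FROM.has(m.type) || m.from !== undefined;
}
```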
### Phase 3 schema additions (multi-source merge)

- **`DqRuleSchema.sourceId`** (`z.string().optional()`) — scopes a rule to a
  named pre-merge source; omitted for post-merge rules.
- **`PipelineSchema.source`** — now `optional()`; mutually exclusive with
  `sources` (enforced by `.refine()`).
- **`PipelineSchema.sources`** (`z.array(MultiSourceEntrySchema).min(2).optional()`)
  — the multi-source array. Refinement also checks unique source ids and
  (in incremental mode) that `merge.incrementalSource` matches a source id.
- **`PipelineSchema.merge`** (`MergeSchema.optional()`) — per-pipeline merge
  config. Defaults: `strategy: 'coalesce'`, `onUnmatched: 'include'`.
- **`MergeSchema`** — `key`, `strategy`, `onUnmatched`, `fieldStrategies[]`,
  `conflictLog`, `incrementalSource`.
- **`MergeFieldStrategySchema`** — per-field override: `field`, optional
  `strategy`, optional `source` (at least one required).
- **`MultiSourceEntrySchema`** — extends `SourceBaseSchema` with `id`,
  `priority`, and optional `rename`.
- **`isSingleSource(p)` / `isMultiSource(p)`** — exported type guards that
  narrow `Pipeline` to the single- or multi-source shape.

---

## ═══════════════════════════════════════════════════════════
## PLUGIN INTERFACES
## ═══════════════════════════════════════════════════════════

### SourceAdapter (src/adapters/source/types.ts)

```typescript
export interface SourceAdapter {
  readonly id: string;
  connect(config: SourceConfig): Promise<void>;
  extract(
    config: SourceConfig,
    store: StagingStore,
    runConfig: RunConfig,
    onProgress: (rows: number) => void,
    targetTable?: string  // defaults to 'stg_raw'; set per-source in
                          // multi-source pipelines
  ): Promise<ExtractResult>;
  disconnect(): Promise<void>;
}

export interface ExtractResult {
  rowsExtracted: number;
  tableName: string;      // caller-supplied; 'stg_raw' for single-source,
                          // 'stg_raw_{sourceId}' for each source in a
                          // multi-source pipeline
  columns: ColumnMeta[];
}

export interface ColumnMeta {
  name: string;
  duckDbType: string;     // VARCHAR | BIGINT | DOUBLE | BOOLEAN | TIMESTAMP
}
```

### TargetAdapter (src/adapters/target/types.ts)

```typescript
export interface TargetAdapter {
  readonly id: string;
  connect(config: TargetConfig): Promise<void>;
  load(
    config: TargetConfig,
    store: StagingStore,
    runConfig: RunConfig,
    onProgress: (rows: number) => void
  ): Promise<LoadResult>;
  disconnect(): Promise<void>;
}

export interface LoadResult {
  rowsLoaded: number;
  rowsFailed: number;
  outputPath?: string;    // set for file-based targets
}
```

### DQ Rule (src/dq/rules/types.ts)

```typescript
export interface Rule {
  readonly id: string;
  validate(
    value: unknown,
    config: CheckConfig,
    rowIndex: number,
    field: string
  ): RuleViolation | null;
}

export interface RuleViolation {
  field: string;
  rowIndex: number;
  value: unknown;
  rule: string;
  severity: 'critical' | 'warning' | 'info';
  message: string;
}
```

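A built-in check such as `maxLength` would implement this interface roughly as follows (a sketch: `CheckConfig` is narrowed here to the fields this rule reads, and the message text is illustrative):

```typescript
// Narrowed stand-ins for the real types in src/dq/rules/types.ts.
interface CheckConfig {
  value?: string | number | string[];
  severity: 'critical' | 'warning' | 'info';
  message?: string;
}
interface RuleViolation {
  field: string;
  rowIndex: number;
  value: unknown;
  rule: string;
  severity: CheckConfig['severity'];
  message: string;
}

// Example maxLength rule: null/undefined values pass (notNull owns those).
const maxLengthRule = {
  id: 'maxLength',
  validate(value: unknown, config: CheckConfig, rowIndex: number, field: string): RuleViolation | null {
    if (value === null || value === undefined) return null;
    const limit = Number(config.value);
    const text = String(value);
    if (text.length <= limit) return null;
    return {
      field, rowIndex, value, rule: 'maxLength',
      severity: config.severity,
      message: config.message ?? `${field} exceeds max length ${limit} (got ${text.length})`,
    };
  },
};
```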
### MergeStrategyPlugin (src/merge/types.ts)

```typescript
export interface MergeSourceMeta {
  id: string;
  priority: number;
  tableName: string;              // e.g. 'stg_raw_sql-server'
}

export interface MergeResult {
  rowsMerged: number;
  conflicts: number;              // fields where two non-null values disagreed
  unmatched: number;              // records present in only one source
  tableName: 'stg_merged';
}

export interface MergeStrategyPlugin {
  readonly id: string;            // matches MergeSchema.strategy value
  readonly description?: string;  // shown by `sluice merge list-strategies`

  merge(
    store: StagingStore,
    sources: MergeSourceMeta[],   // priority-ordered (priority 1 first)
    config: MergeConfig,
  ): Promise<MergeResult>;
}
```

Built-in strategies: `coalesce`, `priority-override`, `union`, `intersect`
(all pre-registered in `MergeStrategyRegistry`; live in
`src/merge/strategies/*.ts`). Custom strategies can be dropped into a
`plugins/` folder as `*.merge.ts` files exporting `const mergeStrategy`.

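A drop-in `plugins/*.merge.ts` file would look roughly like this skeleton. The `newest-wins` id is hypothetical, and `StagingStore`/`MergeConfig` are stubbed as `unknown` so the sketch stands alone; a real implementation would emit SQL against the staging tables.

```typescript
// Stand-in for the real MergeResult from src/merge/types.ts.
interface MergeResult {
  rowsMerged: number;
  conflicts: number;
  unmatched: number;
  tableName: 'stg_merged';
}

// Skeleton of a custom strategy export, matching the contract the
// plugin loader expects: a `mergeStrategy` const with id + merge().
export const mergeStrategy = {
  id: 'newest-wins',                                   // hypothetical strategy id
  description: 'Latest timestamp wins per key (illustrative)',
  async merge(
    _store: unknown,
    sources: { id: string; priority: number; tableName: string }[],
    _config: unknown,
  ): Promise<MergeResult> {
    // Real work would join the per-source staging tables into
    // 'stg_merged'; this stub only demonstrates the contract shape.
    void sources;
    return { rowsMerged: 0, conflicts: 0, unmatched: 0, tableName: 'stg_merged' };
  },
};
```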
---

## ═══════════════════════════════════════════════════════════
## ADAPTER IMPLEMENTATION NOTES
## ═══════════════════════════════════════════════════════════

### mssql source

- Stream results: `request.stream = true` + `RecordSet` events.
- SQL Server → DuckDB type map: `varchar/nvarchar/char → VARCHAR`,
  `int/bigint → BIGINT`, `decimal/numeric/money → DOUBLE`,
  `bit → BOOLEAN`, `datetime/date → TIMESTAMP`, `float/real → DOUBLE`.
- Trusted connection: detect `trustedConnection: true` in JSON connection config.

### csv source

- `csv-parse` options: `{ columns: true, skip_empty_lines: true, bom: true }`.
  `bom: true` strips the UTF-8 BOM common in Excel-generated CSVs.
- All columns inferred as `VARCHAR` in DuckDB.
- Support glob patterns: concatenate all matching files into a single staging table.

### xlsx source

- SheetJS: convert to CSV via `xlsx.utils.sheet_to_csv`, then pipe through csv-parse.
- Log a warning if workbook has more than one sheet and `source.sheet` is unset.

### rest source

- `axios-retry`: 3 retries, exponential backoff, retry on 429 and 5xx.
- Flatten nested JSON using `__` separator (`address.postCode` → `address__postCode`).
- All three pagination types must be supported: offset, page, cursor.

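The `__` flattening rule can be sketched as follows (a standalone sketch, not the adapter's actual helper):

```typescript
// Flatten nested objects into a single level, joining key paths with
// '__' (address.postCode → address__postCode). Arrays and scalars are
// kept as-is at their flattened key.
function flatten(obj: Record<string, unknown>, prefix = ''): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    const name = prefix ? `${prefix}__${key}` : key;
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      Object.assign(out, flatten(value as Record<string, unknown>, name));
    } else {
      out[name] = value;
    }
  }
  return out;
}
```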
### IFS target

- UTF-8 CSV via `csv-stringify`.
- `includeHeader` defaults to `false` for this adapter.
- Apply `target.columnOrder` if specified.
- Format date columns using `dayjs` with `target.dateFormat` (default `YYYY-MM-DD`).

### BlueCherry target (src/adapters/target/bluecherry.ts)

BlueCherry ERP (CGS — Computer Generated Solutions) uses fixed-format CSV for
bulk import. Each entity type has a required column set. The adapter validates
required columns at `connect()` time, before any data is read.

**Required columns per entity:**

```typescript
const REQUIRED_COLUMNS: Record<string, string[]> = {
  Style: [
    'StyleNo', 'StyleDesc', 'Division', 'Season',
    'CostPrice', 'RetailPrice', 'ActiveFlag',
  ],
  Vendor: [
    'VendorNo', 'VendorName', 'Country', 'CurrencyCode',
  ],
  PurchaseOrder: [
    'PONumber', 'VendorNo', 'Season', 'OrderDate', 'DeliveryDate',
  ],
  PODetail: [
    'PONumber', 'StyleNo', 'ColourCode', 'SizeCode', 'Quantity', 'CostPrice',
  ],
  Season: [
    'SeasonCode', 'SeasonDesc', 'StartDate', 'EndDate',
  ],
  ColourSize: [
    'StyleNo', 'ColourCode', 'ColourDesc', 'SizeCode', 'SizeDesc',
  ],
};
```

**Behaviour:**
- `includeHeader` defaults to `true`.
- Default `dateFormat` is `MM/DD/YYYY` (BlueCherry is US-origin software).
- Any column whose name ends with `Date` (case-insensitive) is automatically
  formatted using `target.dateFormat` via `dayjs`.
- `nullValue` (default `""`) is used for all null/undefined fields.
- At `connect()`:
  1. Verify `target.entity` is in `REQUIRED_COLUMNS`. Throw `ConfigError` if not.
  2. Query `store.columnNames('stg_transformed')` and verify all required columns
     are present. Throw `ConfigError` listing any missing columns.
  3. If `target.template` is a file path, read its header row and use it as the
     definitive column order for the output. If `target.template === 'default'`,
     use the required columns list as column order, with any additional columns
     from `stg_transformed` appended.

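The required-column check in step 2 reduces to a set difference. A minimal sketch, assuming the staging column names have already been read from the store (the helper name is illustrative):

```typescript
// Required columns for the Style entity, as listed above.
const REQUIRED_STYLE_COLUMNS = [
  'StyleNo', 'StyleDesc', 'Division', 'Season',
  'CostPrice', 'RetailPrice', 'ActiveFlag',
];

// Returns the required columns not present in the staging table, so
// the adapter can throw a ConfigError listing them.
function missingColumns(required: string[], present: string[]): string[] {
  const have = new Set(present);
  return required.filter(c => !have.has(c));
}
```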
**Note on BlueCherry column names:** The column names in `REQUIRED_COLUMNS` are
internal conventions for this toolkit. Verify them against the actual BlueCherry
import documentation before running a live migration. The `template` feature exists
precisely to override these if the client's BlueCherry instance uses different names.

### Business Central REST target

- OAuth2 client credentials: `POST https://login.microsoftonline.com/{tenantId}/oauth2/v2.0/token`
- Cache token in memory; refresh 60 seconds before expiry.
- OData `$batch`: `POST {baseUrl}/api/{version}/companies({company})/$batch`
  with `Content-Type: multipart/mixed; boundary=batch_{uuid}`.
  Maximum 100 operations per batch request.
- HTTP 409 with `onConflict: upsert` → issue PATCH to individual entity URL.
- HTTP 4xx (non-409): log error, increment `rowsFailed`, continue if
  `run.onError: continue`.

---

## ═══════════════════════════════════════════════════════════
## PIPELINE RUNNER — EXECUTION ORDER
## ═══════════════════════════════════════════════════════════

**Important:** `ConfigLoader.load()` interpolates `${ENV_VAR}` tokens from
`process.env` but does **not** call `loadEnv()` / `dotenv.config()` itself.
The CLI entry point must call `loadEnv()` before invoking the loader. This keeps
`ConfigLoader` side-effect-free and testable (tests stub `process.env` directly).

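The interpolation contract can be sketched as a pure function of an env map, mirroring the side-effect-free design (the `interpolateEnv` name is hypothetical):

```typescript
// Replace ${ENV_VAR} tokens from a provided env map; unknown variables
// are left untouched so failures surface downstream rather than
// silently producing empty strings.
function interpolateEnv(text: string, env: Record<string, string | undefined>): string {
  return text.replace(/\$\{([A-Z0-9_]+)\}/g, (match, name: string) => env[name] ?? match);
}
```

Tests can pass a stub map instead of touching `process.env`, which is the point of keeping `dotenv.config()` out of the loader.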
```
|
|
1350
|
-
1. Load + validate config ConfigLoader.load(yamlPath)
|
|
1351
|
-
2. Resolve output directory create if not exists
|
|
1352
|
-
3. Open DuckDB staging store StagingStore.open(dbPath)
|
|
1353
|
-
4. Connect source adapter
|
|
1354
|
-
5. Extract → 'stg_raw' log: rows extracted
|
|
1355
|
-
5a. Disconnect source adapter always in finally
|
|
1356
|
-
5b. Phase 4a Enrich (optional) runs only when:
|
|
1357
|
-
- `enrich:` block configured
|
|
1358
|
-
- --no-enrich NOT set
|
|
1359
|
-
- mode != validate-only and not dryRun
|
|
1360
|
-
- @caracal-lynx/sluice-enrich is installed
|
|
1361
|
-
and has called registerEnrichPhase()
|
|
1362
|
-
Otherwise skipped (WARN log if last bullet
|
|
1363
|
-
fails). Writes new columns to 'stg_raw'.
|
|
1364
|
-
6. Run DQ rules against 'stg_raw'
|
|
1365
|
-
a. Collect all RuleViolations
|
|
1366
|
-
b. Write rejection CSV
|
|
1367
|
-
c. Write summary JSON
|
|
1368
|
-
d. Log DQ summary (info)
|
|
1369
|
-
e. If stopOnCritical AND criticalCount > 0 → throw PipelineDQError
|
|
1370
|
-
7. Resolve all lookups LookupResolver.loadAll()
|
|
1371
|
-
8. Transform 'stg_raw' → 'stg_transformed' (batch by batchSize)
|
|
1372
|
-
9. If dryRun === true → STOP (log summary, exit 0)
|
|
1373
|
-
10. If mode === 'validate-only' → STOP (log summary, exit 0)
|
|
1374
|
-
11. Connect target adapter
|
|
1375
|
-
12. Load 'stg_transformed' → target
|
|
1376
|
-
12a.Disconnect target adapter always in finally
|
|
1377
|
-
13. Close DuckDB staging store always in finally
|
|
1378
|
-
14. Write run state file {outputDir}/{name}-state.json
|
|
1379
|
-
15. Log final summary (info)
|
|
1380
|
-
```
|
|
1381
|
-
|
|
1382
|
-
**Run state file** `{outputDir}/{name}-state.json`:
|
|
1383
|
-
```json
|
|
1384
|
-
{
|
|
1385
|
-
"pipeline": "acme-corp-customers",
|
|
1386
|
-
"lastRunAt": "2026-04-15T09:30:00.000Z",
|
|
1387
|
-
"lastMode": "full",
|
|
1388
|
-
"rowsExtracted": 1842,
|
|
1389
|
-
"rowsLoaded": 1801,
|
|
1390
|
-
"criticalViolations": 0,
|
|
1391
|
-
"warnings": 41,
|
|
1392
|
-
"incrementalSince": ""
|
|
1393
|
-
}
|
|
1394
|
-
```
|
|
1395
|
-
|
|
1396
|
-
Used by `mode: incremental` to auto-determine the `since` timestamp.

### Multi-source execution order (`MultiSourcePipelineRunner`)

For a pipeline with `sources` + `merge`, the CLI selects
`MultiSourcePipelineRunner` (a subclass of `PipelineRunner` that overrides
`run()`, `profile()`, and `writeStateFile()` and reuses the protected
`runExtract`, `runDQ`, `runTransform`, `runLoad` phase methods).

```
 1. Load + validate config          ConfigLoader.load(yamlPath)
 2. Load plugins                    files + sluice.config.yaml (Tier 2/3)
 3. Resolve output dir, open DuckDB staging store
 4. For each source (priority-ordered):
    a. runExtract → 'stg_raw_{sourceId}'
    b. If source.rename is set      StagingStore.renameColumns(...)
    c. If mode: incremental AND source.id === merge.incrementalSource:
       apply TRY_CAST(... AS TIMESTAMP) >= since filter
    d. Filter dq.rules by sourceId; runDQ against 'stg_raw_{sourceId}'
       (writes per-source rejection CSV, stops on critical)
    e. Rewrite 'stg_raw_{sourceId}' to only the accepted rows
 5. MergeEngine.run(store, sources, merge)
    → creates 'stg_merge_joined', 'stg_merged', 'stg_merge_conflicts'
    → writes conflictLog CSV if configured
 5a. Phase 4a Enrich (optional)     runs once against 'stg_merged' if
                                    `enrich:` block is present and the four
                                    gating conditions hold (see single-source
                                    step 5b above). Single post-merge pass —
                                    never per-source.
 6. runDQ on the post-merge rules (no sourceId) against 'stg_merged'
 7. Filter rejected rows; runTransform against the filtered merge result
 8. If dryRun OR validate-only → STOP
 9. runLoad → target adapter reads 'stg_transformed'
10. writeStateFile → per-source lastRunAt block + top-level summary
11. Close DuckDB
```
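
To make step 5 concrete, here is a toy, in-memory illustration of a priority-ordered coalesce merge — earlier sources win, later sources only fill gaps. The real `MergeEngine` does this as DuckDB SQL (see `sql-builder.ts`), not in JavaScript, so treat this purely as a sketch of the strategy's semantics:

```typescript
// Toy coalesce merge: sources are priority-ordered; the first non-null value
// per key+field wins. Illustrative only — the real engine works in DuckDB SQL.
type Row = Record<string, unknown>;

function coalesceMerge(key: string, bySource: Row[][]): Row[] {
  const out = new Map<unknown, Row>();
  for (const rows of bySource) {          // bySource is priority-ordered
    for (const row of rows) {
      const k = row[key];
      const acc = out.get(k) ?? {};
      for (const [field, value] of Object.entries(row)) {
        // keep the first non-null value seen for this field
        if (acc[field] == null && value != null) acc[field] = value;
      }
      out.set(k, acc);
    }
  }
  return [...out.values()];
}
```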

**Multi-source state file** adds a `sources` block keyed by source id:

```json
{
  "pipeline": "style-co-products-merged",
  "lastRunAt": "2026-04-19T09:30:00.000Z",
  "lastMode": "incremental",
  "rowsMerged": 3201,
  "rowsLoaded": 3188,
  "criticalViolations": 0,
  "warnings": 14,
  "incrementalSince": "",
  "sources": {
    "sql-server": {
      "lastRunAt": "2026-04-19T09:30:00.000Z",
      "rowsExtracted": 2910,
      "incrementalSince": "2026-04-18T22:00:00.000Z"
    },
    "excel": { "lastRunAt": "...", "rowsExtracted": 412, "incrementalSince": "" }
  }
}
```
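
Reading a per-source `since` out of that block could look like the following. The helper name and the fallback-to-`lastRunAt` behaviour are assumptions for illustration, not the runner's actual API:

```typescript
// Hypothetical per-source `since` lookup against the multi-source state file.
interface SourceState { lastRunAt: string; rowsExtracted: number; incrementalSince: string }
interface MultiSourceState { lastMode: string; sources: Record<string, SourceState> }

function sinceForSource(state: MultiSourceState, sourceId: string): string | undefined {
  const s: SourceState | undefined = state.sources[sourceId];
  if (!s) return undefined;                  // unknown source: full extract
  return s.incrementalSince || s.lastRunAt;  // explicit since wins over lastRunAt
}
```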

---

## ═══════════════════════════════════════════════════════════
## DUCKDB STAGING STORE (src/staging/store.ts)
## ═══════════════════════════════════════════════════════════

```typescript
class StagingStore {
  constructor(private dbPath: string) {}  // ':memory:' for dryRun/tests

  async open(): Promise<void>
  async close(): Promise<void>
  async createTable(name: string, columns: ColumnMeta[]): Promise<void>
  async insertBatch(table: string, rows: Record<string, unknown>[]): Promise<void>
  async query<T>(sql: string, params?: unknown[]): Promise<T[]>
  async tableExists(name: string): Promise<boolean>
  async dropTable(name: string): Promise<void>
  async rowCount(table: string): Promise<number>
  async columnNames(table: string): Promise<string[]>
  async exportToCsv(
    table: string,
    outputPath: string,
    options?: { delimiter?: string; header?: boolean; encoding?: string }
  ): Promise<void>
  async renameColumns(                 // Phase 3: used by MultiSourcePipelineRunner
    tableName: string,                 // after a per-source extract. Implemented as
    renames: Record<string, string>    // CREATE OR REPLACE TABLE ... AS SELECT ...
  ): Promise<void>                     // Unknown keys log a warning, not an error.
}
```

Default DuckDB path: `{outputDir}/{pipelineName}.duckdb`
Use `':memory:'` when `dryRun: true` or `stagingDb: ':memory:'`.
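
A minimal sketch of the `CREATE OR REPLACE TABLE ... AS SELECT` rewrite that `renameColumns` describes. The helper below is illustrative, not the shipped implementation — in particular, the real code quotes identifiers via `quoteIdent` from `src/staging/schema.ts` and logs a warning for unknown rename keys:

```typescript
// Illustrative SQL generation for a column-rename rewrite in DuckDB.
function buildRenameSql(
  table: string,
  columns: string[],
  renames: Record<string, string>,
): string {
  const select = columns
    .map((c) => (renames[c] ? `"${c}" AS "${renames[c]}"` : `"${c}"`))
    .join(', ');
  return `CREATE OR REPLACE TABLE "${table}" AS SELECT ${select} FROM "${table}"`;
}
```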

---

## ═══════════════════════════════════════════════════════════
## TRANSFORM ENGINE (src/transform/engine.ts)
## ═══════════════════════════════════════════════════════════

### Field type behaviours

| type | behaviour |
|---|---|
| `string` | `String(value)`, cleanse ops, then truncate to `max` |
| `number` | `Math.round(Number(value))`. Throw `TransformError` if NaN. |
| `decimal` | `parseFloat(value).toFixed(precision)` stored as string |
| `boolean` | `['1','true','yes','y','t'].includes(String(v).toLowerCase())` |
| `date` | Parse with `dayjs(value, format)`; output as `target.dateFormat` or ISO |
| `lookup` | `LookupResolver.resolve(lookupName, value)` |
| `concat` | Join `from[]` with `separator`, then cleanse |
| `constant` | Emit `value` verbatim |
| `expression` | `ExpressionEvaluator.evaluate(expression, row)` |
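
The `number`, `decimal`, and `boolean` rows translate directly into code. A sketch, using a plain `Error` where the engine would throw `TransformError`:

```typescript
// Coercion rules from the field-type table (sketch; the engine throws
// TransformError where this uses a plain Error).
const TRUTHY = ['1', 'true', 'yes', 'y', 't'];

function toBoolean(v: unknown): boolean {
  return TRUTHY.includes(String(v).toLowerCase());
}

function toNumber(v: unknown): number {
  const n = Math.round(Number(v));
  if (Number.isNaN(n)) throw new Error(`TransformError: not a number: ${String(v)}`);
  return n;
}

function toDecimal(v: unknown, precision: number): string {
  return parseFloat(String(v)).toFixed(precision); // stored as string
}
```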

### Expression evaluator (src/transform/expression.ts)

**Must not use `eval()` or `new Function()`.**

1. Expression does NOT start with `js:` → use `expr-eval` Parser.
   Provide `row` as a variable containing all source field values.
2. Expression starts with `js:` → strip prefix, execute via
   `vm.runInNewContext(code, { row, Date, Math, JSON, String, Number, Boolean })`.
   Log a `warn` whenever the `js:` path is taken.
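
The `js:` path (rule 2) can be sketched with Node's built-in `vm` module, using the exact sandbox object listed above. This sketch omits the `expr-eval` path and the `warn` log to stay dependency-free; the function name is illustrative:

```typescript
import vm from 'node:vm';

// Sketch of the `js:` escape hatch: strip the prefix, run in a vm sandbox
// that exposes only `row` plus a few safe globals.
function evaluateJsExpression(expression: string, row: Record<string, unknown>): unknown {
  const code = expression.slice('js:'.length);
  return vm.runInNewContext(code, { row, Date, Math, JSON, String, Number, Boolean });
}
```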

---

## ═══════════════════════════════════════════════════════════
## DQ REPORTER OUTPUT (src/dq/reporter.ts)
## ═══════════════════════════════════════════════════════════

**Rejection CSV** columns: `row_index`, `field`, `value`, `rule`, `severity`, `message`

**Summary JSON** (`{outputDir}/{name}-dq-summary.json`):
```json
{
  "pipeline": "acme-corp-customers",
  "runAt": "2026-04-15T09:30:00Z",
  "rowsChecked": 1842,
  "rowsPassed": 1801,
  "rowsRejected": 41,
  "violations": { "critical": 0, "warning": 38, "info": 3 },
  "byField": {
    "POST_CODE": { "critical": 0, "warning": 22 },
    "EMAIL": { "critical": 0, "warning": 16 }
  }
}
```
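
Rolling violations up into the `violations` and `byField` counters is a simple aggregation. A sketch under assumed shapes (the real reporter types live in `src/dq/types.ts` and may differ):

```typescript
// Illustrative roll-up of rule violations into summary counters.
type Severity = 'critical' | 'warning' | 'info';
interface Violation { field: string; severity: Severity }

function summarise(violations: Violation[]) {
  const totals: Record<Severity, number> = { critical: 0, warning: 0, info: 0 };
  const byField: Record<string, Record<Severity, number>> = {};
  for (const v of violations) {
    totals[v.severity]++;
    byField[v.field] ??= { critical: 0, warning: 0, info: 0 };
    byField[v.field][v.severity]++;
  }
  return { totals, byField };
}
```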

---

## ═══════════════════════════════════════════════════════════
## ERROR TYPES (src/utils/errors.ts)
## ═══════════════════════════════════════════════════════════

```typescript
export class PipelineError extends Error {
  constructor(message: string, public readonly cause?: unknown) {
    super(message);
    this.name = this.constructor.name;
    if (Error.captureStackTrace) {
      Error.captureStackTrace(this, this.constructor);
    }
  }
}
export class ConfigError extends PipelineError {}
export class SourceError extends PipelineError {}
export class StagingError extends PipelineError {}
export class DQError extends PipelineError {}
export class PipelineDQError extends DQError {
  constructor(
    public readonly criticalCount: number,
    public readonly reportPath: string,
  ) {
    super(`Pipeline halted: ${criticalCount} critical DQ violations. See ${reportPath}`);
  }
}
export class TransformError extends PipelineError {}
export class ExpressionError extends TransformError {}
export class LoadError extends PipelineError {}
export class EnrichError extends PipelineError {}  // Phase 4a — exit code 4
```

All error subclasses inherit `this.name = this.constructor.name` from
`PipelineError`, so `err.name` reflects the actual class (e.g. `"ConfigError"`,
`"PipelineDQError"`). `Error.captureStackTrace` (V8-only) trims the constructor
frame from stack traces for cleaner output.
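
The class skeletons below follow the spec; the `exitCodeFor` mapper is a hypothetical illustration of how the CLI's documented exit codes (1 pipeline, 2 DQ critical, 3 config, 4 enrich) line up with this hierarchy — it is not the CLI's actual code:

```typescript
// Error hierarchy per the spec, plus an illustrative exit-code mapper.
class PipelineError extends Error {
  constructor(message: string, public readonly cause?: unknown) {
    super(message);
    this.name = this.constructor.name;      // e.g. "PipelineDQError"
    if (Error.captureStackTrace) Error.captureStackTrace(this, this.constructor);
  }
}
class ConfigError extends PipelineError {}
class DQError extends PipelineError {}
class PipelineDQError extends DQError {
  constructor(public readonly criticalCount: number, public readonly reportPath: string) {
    super(`Pipeline halted: ${criticalCount} critical DQ violations. See ${reportPath}`);
  }
}
class EnrichError extends PipelineError {}

// Hypothetical: most-specific classes checked first.
function exitCodeFor(err: unknown): number {
  if (err instanceof EnrichError) return 4;
  if (err instanceof ConfigError) return 3;
  if (err instanceof PipelineDQError) return 2;
  return 1;
}
```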

---

## ═══════════════════════════════════════════════════════════
## CLI (src/cli.ts)
## ═══════════════════════════════════════════════════════════

```
sluice run <pipeline.yaml>        Full pipeline run (auto-detects single vs multi-source)
sluice validate <pipeline.yaml>   DQ + transform only; no load
sluice profile <pipeline.yaml>    Extract + column profiling; no DQ
sluice check <pipeline.yaml>      Config validation only; no execution
sluice plugins                    List all loaded rule/transform/merge plugins
sluice merge list-strategies      List all registered merge strategies
sluice merge info <strategy>      Show details about a specific merge strategy

Global options:
  --log-level <level>   debug | info | warn | error
  --env <file>          Path to .env file (default: ./.env)
  --output <dir>        Override outputDir
  --plugins <dir...>    Additional plugin directory/directories to load
  --dry-run             Force dryRun: true
  --silent              Suppress the progress bar on stdout (logs still go to stderr)

`sluice run` options:
  --no-enrich           Skip the Phase 4a enrich phase even if `enrich:` is configured.
                        (validate / profile / check do not run enrich at all, by design.)
```

**Progress feedback:** `sluice run`, `sluice validate`, and `sluice profile`
render a phase-by-phase progress bar to stdout via
`src/utils/progress.ts → ProgressReporter`, with per-phase emoji icons
(🔎 extract · 🛡️ DQ · 🔀 merge · 🌐 enrich · 🔧 transform · 📤 load), an ETA for
determinate phases, and a coloured ✅/⚠️/❌ run-summary line. The bar
degrades gracefully:
- `--silent` → no stdout output at all
- `--log-level debug` → bar disabled; per-row debug lines are used instead
- `process.stdout.isTTY === false` → plain-ASCII lines (one per phase),
  no emojis, no ANSI escapes — log-file friendly
- `NO_COLOR` env var → ANSI colour dropped (handled by `picocolors`)

**Exit codes:** `0` success · `1` pipeline error · `2` DQ critical violations · `3` config error · `4` enrich error (Phase 4a)

---

## ═══════════════════════════════════════════════════════════
## LOGGING (src/utils/logger.ts)
## ═══════════════════════════════════════════════════════════

Single `pino` instance. All log records (every level) go to **stderr**; stdout
is reserved exclusively for the progress bar and final summary rendered by
`ProgressReporter`. This mirrors how git, cargo, and npm split streams.

No `console.log` in `src/`. Operators who want logs in a file can run
`sluice run p.yaml 2>run.log` — the bar stays visible on the terminal while
every pino record is captured to the file. Use `--log-level error` to narrow
the file to errors only.

| Level | Used for |
|---|---|
| `debug` | Per-row progress, SQL queries, lookup cache hits |
| `info` | Phase transitions, row counts, file paths, run summary |
| `warn` | DQ warnings, missing optional lookups, `js:` expression usage |
| `error` | All caught errors before re-throw |

Dev: `npx sluice run pipeline.yaml 2>&1 | npx pino-pretty` (logs are on
stderr, so redirect them into the pipe)

---

## ═══════════════════════════════════════════════════════════
## ENVIRONMENT VARIABLES (.env.example)
## ═══════════════════════════════════════════════════════════

```bash
# ── Acme Corp — source ────────────────────────────────────
SOURCE_MSSQL=mssql://user:password@serverlegacy.example.local/LegacyDB

# ── Acme Corp — IFS target ────────────────────────────────
IFS_IMPORT_PATH=C:\IFS\Import

# ── Business Central target (any client using the `bc` adapter) ──
BC_BASE_URL=https://api.businesscentral.dynamics.com/v2.0
BC_TENANT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
BC_CLIENT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
BC_CLIENT_SECRET=your-client-secret
BC_COMPANY=Example Company Ltd

# ── Style Co — source ───────────────────────────────────
SOURCE_2_MSSQL=mssql://user:password@serverlegacy2.example.local/LegacyDB

# ── Style Co — BlueCherry (file-based; no API creds) ───
BC_IMPORT_PATH=C:\BlueCherry\Import

# ── Runtime ───────────────────────────────────────────────────
NODE_ENV=development
LOG_LEVEL=info
```

---

## ═══════════════════════════════════════════════════════════
## TESTING
## ═══════════════════════════════════════════════════════════

- **Vitest only.** No Jest.
- Unit tests: mock all I/O with `vi.mock`.
- Integration tests: real DuckDB (`:memory:`) + CSV fixtures.
- No tests against live SQL Server, BC, IFS, or BlueCherry.
- Target: 80% line coverage across `src/dq/` and `src/transform/`.
- Both full example pipelines in this file must parse cleanly in the config tests.

**Required test cases:**

Config loader: `${ENV_VAR}` resolution · missing var → `ConfigError` ·
invalid YAML → `ZodError` · minimal pipeline with all defaults · both example
pipelines in this spec parse cleanly.

DQ engine: `notNull` on null/empty/whitespace · `unique` with duplicates ·
`ukPostcode` valid and invalid formats · `allowedValues` case sensitivity ·
`stopOnCritical` throws `PipelineDQError` · reporter writes correct CSV and JSON.
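
For the `ukPostcode` cases, a simplified shape check like the one below captures the valid/invalid split those tests need. This regex is an assumption for illustration — the shipped `ukPostcode.ts` rule may use a stricter pattern:

```typescript
// Simplified UK postcode shape check (outward + inward code), case-insensitive.
const UK_POSTCODE = /^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$/i;

function isUkPostcode(value: string): boolean {
  return UK_POSTCODE.test(value.trim());
}
```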

Transform engine: `concat` with separator · `lookup` miss + `optional: true` → null ·
`lookup` miss + `optional: false` → `TransformError` · `expression` basic eval ·
`expression` with `js:` prefix · `cleanse: trim|titleCase` · `cleanse: padStart:6:0` ·
`cleanse: normaliseUnicode` · `type: date` with `format: DD/MM/YYYY` ·
`type: boolean` all truthy/falsy variants.

BlueCherry adapter: missing required column → `ConfigError` at `connect()` ·
date columns formatted with `target.dateFormat` · header row present ·
`nullValue` respected · `template` CSV used as column order.

Staging store: insert/query round-trip all DuckDB types · `exportToCsv` delimiter
and header options · `:memory:` mode works correctly.

---

## ═══════════════════════════════════════════════════════════
## BUILD, SCRIPTS, CI
## ═══════════════════════════════════════════════════════════

**package.json scripts:**
```json
{
  "name": "@caracal-lynx/sluice",
  "scripts": {
    "build": "tsc -p tsconfig.json",
    "dev": "tsx watch src/cli.ts",
    "lint": "eslint src tests",
    "format": "prettier --write src tests",
    "test": "vitest run",
    "test:watch": "vitest",
    "test:cov": "vitest run --coverage",
    "sluice": "tsx src/cli.ts"
  },
  "bin": { "sluice": "dist/cli.js" }
}
```

Use `tsx` (not `ts-node`) for development execution — handles tsconfig path aliases
on Windows without extra configuration.

**GitHub Actions** (`.github/workflows/ci.yml`):
```yaml
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '24', cache: 'npm' }
      - run: npm ci
      - run: npm run lint
      - run: npm run build
      - run: npm run test:cov
      - uses: actions/upload-artifact@v4
        with: { name: coverage, path: coverage/ }
```

---

## ═══════════════════════════════════════════════════════════
## WINDOWS / POWERSHELL NOTES
## ═══════════════════════════════════════════════════════════

- All file paths: `path.join()` / `path.resolve()`. Never string concat with `/`.
- `.env` uses LF line endings (set in `.gitattributes`).
- DuckDB npm package includes the `win32-x64` native binary automatically.
- Do not write Windows-only shell commands in CI (CI runs ubuntu-latest).
- Developer shell: PowerShell 7 on Windows Terminal.

---

## ═══════════════════════════════════════════════════════════
## WHAT NOT TO DO
## ═══════════════════════════════════════════════════════════

- Do not use `ts-node` — use `tsx`.
- Do not use `jest` — use `vitest`.
- Do not use `console.log` in `src/` — use the pino logger.
- Do not write manual TypeScript interfaces for config types — use `z.infer<>`.
- Do not use `eval()` or `new Function()` — use `expr-eval` or `vm.runInNewContext`.
- Do not hard-code connection strings, credentials, or client-specific values.
- Do not import from `@duckdb/node-api` directly outside `src/staging/store.ts`.
- Do not create `StagingStore` instances outside `PipelineRunner`.
- Do not add UI, REST server, or dashboard code.
- Do not add adapter-specific logic to `PipelineRunner`.
- Do not invent new top-level YAML keys — the schema is fixed.
- Do not add cleanse ops without adding them to the reference table in this file.
- Do not add BlueCherry entity types to `REQUIRED_COLUMNS` without verifying
  column names against actual BlueCherry import documentation first.
- Do not use `dayjs` plugins without importing them explicitly at the call site.

---

## ═══════════════════════════════════════════════════════════
## SUGGESTED BUILD ORDER FOR CLAUDE CODE
## ═══════════════════════════════════════════════════════════

Work phase by phase. Do not start the next phase until the current phase passes
`npm run build` and `npm test` without errors. Ask before proceeding if anything
in this spec is ambiguous.

1. **Scaffold** — `package.json`, `tsconfig.json`, `src/utils/`, `src/config/`.
   Verify both example pipelines parse cleanly.
2. **Staging store** — `src/staging/`. Unit tests with `:memory:`.
3. **Source adapters** — `csv` first, then `mssql`, `pg`, `xlsx`, `rest`.
   Mock all external connections in tests.
4. **DQ engine** — `src/dq/` including all rules and reporter.
5. **Transform engine** — `src/transform/` — all types, cleanse ops, expression eval.
6. **Target adapters** — `csv` → `ifs` → `bluecherry` → `bc` (BC is most complex;
   mock OAuth2 token endpoint in tests).
7. **PipelineRunner** — wire all phases; integration test both fixture pipelines.
8. **CLI** — all commands and exit codes.
9. **CI** — `.github/workflows/ci.yml`.

---

*This file is the authoritative specification for Sluice. If anything in the
codebase contradicts this file, the codebase is wrong. Update this file whenever
the architecture evolves — then tell Claude Code to re-read it before continuing.*

# Sluice — CLAUDE.md
# Project specification for Claude Code
# Sluice: config-driven ETL toolkit for ERP data migrations
# npm package: @caracal-lynx/sluice
# Owner: Michael Scott, Caracal Lynx Ltd. (SC826823)
# Last updated: 2026-04-20

---

## Project overview

**Sluice** is a config-driven ETL toolkit for ERP data migrations, developed and
maintained by Caracal Lynx Ltd. The engine is written once; each client
engagement is delivered as a folder of YAML pipeline configs. There is no UI, no
server, and no cloud dependency — just the `sluice` CLI and a set of TypeScript
modules that can be imported by other tools (e.g. n8n custom nodes, GitHub Actions).

*Clean data flows through.*

**Known clients and targets:**

| Client | Source(s) | Target ERP | Adapter |
|---|---|---|---|
| Acme Corp | MSSQL legacy DB | IFS ERP | `ifs` |
| Style Co | MSSQL / CSV exports | BlueCherry ERP | `bluecherry` |

**Primary use cases:**
- Extract data from legacy SQL databases, CSV/Excel exports, and REST APIs
- Validate data quality against a configurable rule set
- Transform field mappings, apply lookups, cleanse values, evaluate expressions
- Load output to BC via REST API, IFS via CSV import, BlueCherry via CSV import,
  or generic CSV/JSON for any other target
- Run from the command line on a developer laptop (Windows, PowerShell 7)
- Run unattended in GitHub Actions CI

**Non-goals:**
- No web UI or dashboard
- No streaming / real-time ingestion
- No data warehouse or lake — DuckDB is used only as a local staging store
- No multi-tenant SaaS — this is a consultant's toolkit, not a product

**Related docs:**
- [README.md](README.md) — install, quick-start, composite rules (Tier 1)
- [PLUGINS.md](PLUGINS.md) — Tier 2 (file) and Tier 3 (npm) plugin author guide
- [docs/architecture-diagrams.md](docs/architecture-diagrams.md) — Mermaid diagrams of
  the single- and multi-source pipeline flow

---

## Repository structure

```
sluice/
├── CLAUDE.md                    ← you are here
├── PLUGINS.md                   ← Tier 2 / Tier 3 plugin author guide
├── README.md
├── package.json
├── tsconfig.json
├── tsconfig.test.json
├── .env.example
├── .gitignore
├── eslint.config.js
├── .prettierrc
├── .github/workflows/ci.yml
├── docs/
│   └── architecture-diagrams.md
├── examples/                    ← sample pipelines (not run by tests)
│
├── src/
│   ├── index.ts                 ← public API barrel (re-exports from all modules)
│   ├── cli.ts                   ← commander CLI entry point
│   ├── runner.ts                ← PipelineRunner (single-source)
│   ├── multi-source-runner.ts   ← MultiSourcePipelineRunner (extends PipelineRunner)
│   │
│   ├── config/
│   │   ├── index.ts             ← re-exports schema + types
│   │   ├── schema.ts            ← Zod schema (PipelineSchema + sub-schemas)
│   │   ├── loader.ts            ← YAML load + ${ENV_VAR} interp + composite-rule expansion + parse
│   │   └── types.ts             ← re-exports of all inferred Zod types + guards
│   │
│   ├── adapters/
│   │   ├── source/
│   │   │   ├── index.ts         ← barrel (self-registers built-ins on import)
│   │   │   ├── registry.ts      ← SourceAdapterRegistry
│   │   │   ├── types.ts         ← SourceAdapter + ExtractResult
│   │   │   ├── mssql.ts
│   │   │   ├── pg.ts
│   │   │   ├── csv.ts
│   │   │   ├── xlsx.ts
│   │   │   └── rest.ts
│   │   └── target/
│   │       ├── index.ts         ← barrel (self-registers built-ins on import)
│   │       ├── registry.ts      ← TargetAdapterRegistry
│   │       ├── types.ts         ← TargetAdapter + LoadResult
│   │       ├── bc.ts            ← Business Central REST (+ BcTokenManager)
│   │       ├── ifs.ts           ← IFS ERP CSV import
│   │       ├── bluecherry.ts    ← BlueCherry ERP CSV import
│   │       ├── csv.ts           ← generic CSV
│   │       └── pg.ts
│   │
│   ├── staging/
│   │   ├── index.ts             ← barrel
│   │   ├── store.ts             ← DuckDB wrapper (the only file that imports `@duckdb/node-api`)
│   │   └── schema.ts            ← ColumnMeta, quoteIdent, buildCreateTableSql
│   │
│   ├── dq/
│   │   ├── index.ts             ← barrel
│   │   ├── engine.ts            ← DQEngine
│   │   ├── reporter.ts          ← writeRejectionCsv, writeSummaryJson
│   │   ├── types.ts             ← DQSummary, ViolationCounts
│   │   └── rules/
│   │       ├── index.ts         ← BUILT_IN_RULES map (id → Rule instance)
│   │       ├── types.ts         ← Rule = RulePlugin, RuleViolation (re-exported from plugins)
│   │       ├── notNull.ts
│   │       ├── unique.ts
│   │       ├── pattern.ts
│   │       ├── email.ts
│   │       ├── ukPostcode.ts
│   │       ├── maxLength.ts
│   │       ├── minMax.ts
│   │       └── allowedValues.ts
│   │
│   ├── transform/
│   │   ├── index.ts
│   │   ├── engine.ts            ← TransformEngine (built-in types + custom plugins)
│   │   ├── lookup.ts
│   │   ├── cleanse.ts
│   │   ├── expression.ts        ← expr-eval + `js:` vm sandbox
│   │   └── types.ts             ← TransformResult
│   │
│   ├── merge/                   ← multi-source merge engine + strategies
│   │   ├── index.ts             ← MergeStrategyRegistry (pre-registers all built-ins)
│   │   ├── engine.ts            ← MergeEngine
│   │   ├── sql-builder.ts       ← shared JOIN + coalesce SQL helpers
│   │   ├── conflict-log.ts      ← conflict CSV writer
│   │   ├── types.ts             ← MergeStrategyPlugin, MergeSourceMeta, MergeResult
│   │   └── strategies/
│   │       ├── index.ts
│   │       ├── coalesce.ts
│   │       ├── priority-override.ts
│   │       ├── union.ts
│   │       └── intersect.ts
│   │
│   ├── plugins/                 ← Tier 2 / Tier 3 plugin system
│   │   ├── index.ts             ← barrel
│   │   ├── types.ts             ← RulePlugin, TransformPlugin, PluginPackage
│   │   ├── registry.ts          ← RuleRegistry, TransformRegistry (custom plugin holders)
│   │   └── loader.ts            ← loadPlugins (file-based), loadNpmPlugins (sluice.config.yaml)
│   │
│   ├── enrich/                  ← Phase 4a public surface (types only)
│   │   └── types.ts             ← EnrichPlugin, EnrichResult, EnrichOptions, EnrichSummary,
│   │                              EnrichPhaseFactory (implementation lives in private
│   │                              @caracal-lynx/sluice-enrich package)
│   │
│   └── utils/
│       ├── index.ts
│       ├── logger.ts            ← pino singleton
│       ├── env.ts               ← loadEnv + requireEnv
│       └── errors.ts
│
├── tests/
│   ├── fixtures/
│   │   ├── acme-corp-customers.pipeline.yaml
│   │   ├── style-co-styles.pipeline.yaml
│   │   ├── style-co-products-merged.pipeline.yaml  ← multi-source
│   │   ├── multi-source-no-merge.pipeline.yaml     ← negative-path multi-source
│   │   ├── shared-rules.yaml                       ← composite rule library
│   │   └── plugins/                                ← test plugin fixtures (Tier 2 files)
│   │
│   ├── unit/
│   │   ├── cli.test.ts
│   │   ├── runner.test.ts
│   │   ├── adapters/
│   │   │   ├── source/          ← csv, mssql, pg, rest, xlsx
│   │   │   └── target/          ← bc, bluecherry, ifs, pg
│   │   ├── config/              ← loader, schema, multi-source, composite-expansion
│   │   ├── dq/                  ← engine, reporter, rules
│   │   ├── merge/               ← engine, registry, strategies
│   │   ├── plugins/             ← loader, registry, composite-expansion
│   │   ├── staging/             ← store
│   │   └── transform/           ← cleanse, expression, engine, custom
│   │
│   └── integration/
│       ├── cli-check.test.ts
│       ├── cli-commands.test.ts
│       ├── cli-plugins.test.ts
│       ├── csv-to-csv-mvp.test.ts
│       ├── dq-integration.test.ts
│       ├── style-co-styles-mini.test.ts
│       ├── merge-strategies.test.ts
│       ├── multi-source-runner.test.ts
│       └── runner-plugin-wiring.test.ts
│
└── clients/                     ← gitignored in this repo; each client
    ├── acme-corp/                 gets their own private repo
    │   ├── .env
    │   ├── customers.pipeline.yaml
    │   ├── items.pipeline.yaml
    │   ├── vendors.pipeline.yaml
    │   └── lookups/
    └── style-co/
        ├── .env
        ├── styles.pipeline.yaml
        ├── vendors.pipeline.yaml
        ├── purchase-orders.pipeline.yaml
        └── lookups/
```

---

## Technology stack

| Concern | Package | Notes |
|---|---|---|
| Language | TypeScript 5.x | `strict: true`, `exactOptionalPropertyTypes: true` |
| Runtime | Node.js 24 LTS | No Bun, no Deno — must run in GitHub Actions |
| Config parsing | `js-yaml` | YAML 1.2 only |
| Config validation | `zod` v3 | All config types inferred from Zod |
| SQL Server | `mssql` | Trusted + SQL auth both supported |
| PostgreSQL | `pg` + `@types/pg` | |
| CSV | `csv-parse` + `csv-stringify` | Streaming |
| Excel | `xlsx` (SheetJS) | Read-only |
| HTTP | `axios` + `axios-retry` | 3 retries, exponential backoff |
| Dates | `dayjs` | All date parsing and formatting |
| Staging | `@duckdb/node-api` | Embedded; no server. Replaces deprecated `duckdb` package — ABI-stable (no `npm rebuild` after Node ABI bumps). |
| CLI | `commander` v12 | |
| Logging | `pino` | JSON; `pino-pretty` in dev |
| Testing | `vitest` | No Jest |
| Env vars | `dotenv` | Loaded once at CLI entry |
| Linting | `eslint` + `@typescript-eslint` | |
| Formatting | `prettier` | 2-space, single quotes, trailing commas |
| Expressions | `expr-eval` | Safe expression parser; no eval() |
|
|
233
|
+
|
|
234
|
+
---
|
|
235
|
+
|
|
236
|
+
## TypeScript conventions

- **All config types come from Zod inference.** Do not write manual `type` or
  `interface` declarations for anything that maps to pipeline config.
  Use `z.infer<typeof SomeSchema>`.
- **No `any`.** Use `unknown` and narrow explicitly.
- **No `eval()` or `Function()`** anywhere. See expression evaluator section.
- **Async throughout.** All I/O must be `async/await`. No callbacks.
- **Error handling:** throw typed errors from `src/utils/errors.ts`. Never throw
  raw strings. Catch at the `PipelineRunner` boundary.
- **Barrel exports:** each directory has an `index.ts`. Do not import from internal
  files across module boundaries.
- **No circular imports.** Dependency direction:
  `cli` → `runner` / `multi-source-runner` → `adapters`, `staging`, `dq`,
  `transform`, `merge`, `plugins`, `config`, `enrich`. `plugins/` is imported by
  `runner`, `dq`, `transform`, and `merge`; it must not import any of them.
  `enrich/` is type-only (no runtime imports of other modules in this repo —
  the implementation lives in the private `@caracal-lynx/sluice-enrich`).
  Utils are imported by everyone.
- **Path aliases:** `@/` → `src/` in tsconfig.
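The error-handling and `unknown`-narrowing conventions above can be sketched together. This is an illustrative example only — `SluiceError`, `ConfigError`, and `describeFailure` are hypothetical names; the real error classes live in `src/utils/errors.ts`:

```typescript
// Hypothetical sketch of the typed-error convention: throw typed errors,
// never raw strings, and narrow `unknown` explicitly instead of using `any`.
class SluiceError extends Error {
  constructor(message: string, readonly code: string) {
    super(message);
    this.name = new.target.name;
  }
}

class ConfigError extends SluiceError {
  constructor(message: string) {
    super(message, 'CONFIG');
  }
}

// A catch block at the runner boundary receives `unknown` and narrows it:
function describeFailure(err: unknown): string {
  if (err instanceof SluiceError) return `${err.code}: ${err.message}`;
  if (err instanceof Error) return err.message;
  return String(err);
}
```

The point of the pattern is that the `PipelineRunner` boundary can log a structured code for every failure without ever touching `any`.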
---

## ═══════════════════════════════════════════════════════════
## YAML PIPELINE CONFIG SPECIFICATION
## ═══════════════════════════════════════════════════════════

Every pipeline is a single YAML file. One file = one migrated entity
(e.g. customers, items, vendors, styles, purchase orders).

### Top-level structure

```yaml
pipeline: { ... }    # identity and metadata
source: { ... }      # where to read from
enrich: { ... }      # OPTIONAL — Phase 4a; external API lookups (private)
dq: { ... }          # data quality rules
transform: { ... }   # field mappings and lookups
target: { ... }      # where to write to
run: { ... }         # execution options (all fields optional; all have defaults)
```

> **Phase 4a — Enrich Phase (private):** the `enrich:` block, when present, runs after Extract (and after Merge for multi-source pipelines) and before DQ. The framework that drives it lives in the **private** `@caracal-lynx/sluice-enrich` package — the open-source core only ships the Zod schema, the public `EnrichPlugin` interface (`src/enrich/types.ts`), and the `registerEnrichPhase()` injection hook on `PipelineRunner`. With `sluice-enrich` not installed, an `enrich:` block is parsed and validated but the phase is skipped with a `WARN` log. See [docs/PHASE-04-enrich-phase.md](docs/PHASE-04-enrich-phase.md) for the full spec.

---

### `pipeline` section

```yaml
pipeline:
  name: acme-corp-customers    # REQUIRED. Slug: lowercase, hyphens only.
                               # Used in output filenames and log messages.
  client: acme-corp            # REQUIRED. Client identifier.
  version: "1.0"               # REQUIRED. Quote to ensure string type.
  entity: CustomerInfo         # REQUIRED. Logical entity name (used in
                               # load reports and target adapter metadata).
  description: >               # Optional. Human-readable description.
    Customer master migration —
    legacy SQL to IFS ERP
```

---

### `source` section

Exactly one of `query`, `file`, or `endpoint` must be present.

```yaml
source:
  adapter: mssql               # REQUIRED. One of: mssql | pg | csv | xlsx | rest

  # ── SQL adapters (mssql, pg) ──────────────────────────────
  connection: ${SOURCE_MSSQL}  # Connection string from .env.
                               #   mssql: mssql://user:pass@host/database
                               #   Or a JSON string for trusted/advanced config.
  query: |
    SELECT c.CUST_CODE, c.CUST_NAME, c.POST_CODE
    FROM dbo.Customers c
    WHERE c.Active = 1

  # ── CSV adapter ───────────────────────────────────────────
  file: ./data/customers.csv   # Path or glob (./data/export-*.csv).
  delimiter: ","               # Default: ","
  encoding: utf-8              # Default: utf-8

  # ── XLSX adapter ──────────────────────────────────────────
  file: ./data/customers.xlsx
  sheet: "Customer Export"     # Sheet name or 0-based index. Default: 0.

  # ── REST adapter ──────────────────────────────────────────
  endpoint: ${API_BASE}/customers  # Full URL. ${ENV_VAR} resolved at runtime.
  headers:                     # Optional. Added to every request.
    Authorization: Bearer ${API_TOKEN}
    Accept: application/json
  pagination:                  # Optional. Omit for single-page responses.
    type: offset               # offset | cursor | page
    pageSize: 100
    pageParam: skip            # Query param name for the offset/page value.
    totalField: data.total     # Dot-path to total count in response body.
    dataField: data.items      # Dot-path to the records array.
    cursorField: nextCursor    # For cursor pagination: field in response body.
    cursorParam: cursor        # For cursor pagination: query param name.
```

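The offset pagination settings above combine as follows: the adapter requests `?{pageParam}={offset}`, appends the array at `dataField`, and stops once the offset passes `totalField`. A minimal sketch, with an in-memory `fetchPage` standing in for the real HTTP call (the function names here are illustrative, not the adapter's actual API):

```typescript
// Shape of one paginated response: totalField = data.total, dataField = data.items.
type Page = { data: { total: number; items: string[] } };

// Stand-in for the HTTP request; slices a fixed dataset by offset.
function makeFetcher(all: string[], pageSize: number) {
  return (skip: number): Page => ({
    data: { total: all.length, items: all.slice(skip, skip + pageSize) },
  });
}

function extractAll(fetchPage: (skip: number) => Page, pageSize: number): string[] {
  const records: string[] = [];
  let skip = 0;
  for (;;) {
    const page = fetchPage(skip);        // GET ...?skip=<offset>
    records.push(...page.data.items);    // collect the dataField array
    skip += pageSize;
    if (skip >= page.data.total) break;  // compare against totalField
  }
  return records;
}
```

Note the loop terminates on the first request when the dataset is empty (`0 >= total`), so a missing or zero total cannot loop forever.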
---

### `dq` section

```yaml
dq:
  stopOnCritical: true         # Default: true. Halt pipeline if any critical rule fails.
  rejectionFile: ./output/acme-corp-customers-rejected.csv
                               # Default: ./output/{pipeline.name}-rejected.csv

  rules:
    - field: FIELD_NAME        # Source column name (pre-transform).
      checks:

        # notNull — fails if null, undefined, empty string, or whitespace-only
        - type: notNull
          severity: critical

        # unique — fails if value appears more than once across the full dataset
        - type: unique
          severity: critical

        # pattern — ECMAScript regex, tested with new RegExp(value)
        - type: pattern
          value: "^[A-Z0-9]{3,10}$"
          severity: warning
          message: "Must be 3-10 uppercase alphanumeric characters"
                               # message is optional; overrides default.

        # email — RFC 5322-ish email validation
        - type: email
          severity: warning

        # ukPostcode — all current UK postcode formats; strips spaces before testing
        - type: ukPostcode
          severity: warning

        # maxLength — maximum string length (integer)
        - type: maxLength
          value: 100
          severity: warning

        # min / max — numeric comparison; coerces value to float
        - type: min
          value: 0
          severity: critical
        - type: max
          value: 500000
          severity: warning

        # allowedValues — case-sensitive array of permitted string values
        - type: allowedValues
          value: [GB, IE, US, DE, FR]
          severity: warning

  # Severity:
  #   critical   row is rejected; pipeline halts if stopOnCritical: true
  #   warning    row is flagged in rejection report but NOT removed from output
  #   info       recorded in summary JSON only
```
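The severity semantics above (critical rejects the row, warning only flags it) can be sketched in a few lines. This is illustrative only, not the engine's real check code — the check functions mirror the `notNull` and `allowedValues` rules:

```typescript
type Severity = 'critical' | 'warning' | 'info';
type Failure = { field: string; severity: Severity };

// notNull fails on null, undefined, empty string, or whitespace-only.
function notNull(v: unknown): boolean {
  return v !== null && v !== undefined && String(v).trim() !== '';
}

function runChecks(row: Record<string, unknown>): Failure[] {
  const failures: Failure[] = [];
  if (!notNull(row.CUST_CODE)) failures.push({ field: 'CUST_CODE', severity: 'critical' });
  const allowed = ['GB', 'IE', 'US', 'DE', 'FR'];  // case-sensitive allowedValues
  if (row.COUNTRY !== undefined && !allowed.includes(String(row.COUNTRY))) {
    failures.push({ field: 'COUNTRY', severity: 'warning' });
  }
  return failures;
}

// Only critical failures remove the row from the output.
const isRejected = (failures: Failure[]) => failures.some(f => f.severity === 'critical');
```

A row with a bad country but a valid key is flagged in the rejection report yet still flows through to the target.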
---

### `transform` section

```yaml
transform:

  # ── Lookup tables ─────────────────────────────────────────
  # Loaded once at start of transform phase, cached in memory.
  lookups:
    - name: currencyMap        # Referenced by field mappings.
      source:                  # Any source adapter works here.
        adapter: csv
        file: ./lookups/currency-codes.csv
      key: legacyCode          # Column to match against source value.
      value: isoCode           # Column to return as resolved value.

    - name: acctMgrMap
      source:
        adapter: mssql
        connection: ${SOURCE_MSSQL}
        query: "SELECT STAFF_ID as key, IFS_USER_ID as value FROM dbo.Staff"
      key: key
      value: value

  # ── Field mappings ────────────────────────────────────────
  fields:

    # type: string
    - from: CUST_CODE
      to: CustomerNo
      type: string
      max: 20                  # Optional. Truncate after cleanse.

    - from: CUST_NAME
      to: Name
      type: string
      max: 100
      cleanse: trim|titleCase  # Pipe-separated cleanse ops. See table below.

    # type: number — coerce to integer; throws if NaN
    - from: QTY
      to: Quantity
      type: number

    # type: decimal — fixed precision; stored as string in staging
    - from: CREDIT_LIMIT
      to: CreditLimit
      type: decimal
      precision: 2             # Default: 2

    # type: boolean
    #   Truthy: '1','true','yes','y','t' (case-insensitive). All else false.
    - from: IS_ACTIVE
      to: Active
      type: boolean

    # type: date — parse source date, output as dateFormat (default ISO)
    - from: START_DATE
      to: StartDate
      type: date
      format: DD/MM/YYYY       # Optional source parse format (dayjs tokens).

    # type: lookup — resolve via a named lookup table
    - from: CURRENCY
      to: CurrencyCode
      type: lookup
      lookup: currencyMap      # Must match a lookup name above.
      default: GBP             # Emitted when lookup key not found.
      optional: false          # Default: false. true = null on miss (no error).

    # type: concat — join multiple source fields
    - from: [ADDR1, ADDR2]     # Array of source field names.
      to: Address1
      type: concat
      separator: ", "          # Default: " "
      cleanse: trim|nullIfEmpty

    # type: constant — emit a fixed value regardless of source data
    - to: CustomerGroup
      type: constant
      value: DOMESTIC

    # type: expression — evaluate against source row
    - to: SearchName
      type: expression
      value: "row.CUST_NAME.toUpperCase().substring(0, 20)"
      # For logic beyond expr-eval, prefix with js:
      #   value: "js: row.PRICE * (1 - row.DISCOUNT / 100)"

    # Common optional field properties:
    #   optional: true    null result does not cause a pipeline error
    #   default: <val>    fallback value if source is null/empty
    #   max: <n>          truncate string to n chars AFTER cleanse
```

#### Cleanse operations reference

Applied left-to-right in the pipe chain. Defined in `src/transform/cleanse.ts`.

| Op | Example input | Example output |
|---|---|---|
| `trim` | `" hello "` | `"hello"` |
| `uppercase` | `"hello"` | `"HELLO"` |
| `lowercase` | `"HELLO"` | `"hello"` |
| `titleCase` | `"john smith"` | `"John Smith"` |
| `stripNonAlpha` | `"AB-12!"` | `"AB"` |
| `stripNonNumeric` | `"AB-12!"` | `"12"` |
| `stripWhitespace` | `"h e l l o"` | `"hello"` |
| `padStart:6:0` | `"42"` | `"000042"` |
| `truncate:20` | 21-char string | 20-char string |
| `nullIfEmpty` | `""` | `null` |
| `normaliseQuotes` | `"it\u2019s"` | `"it's"` |
| `normaliseUnicode` | `"caf\u00e9"` | `"cafe"` (NFD→ASCII) |
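The pipe chain above, including parameterised ops like `padStart:6:0`, can be interpreted as a left-to-right fold. A minimal sketch — the names here are illustrative and the real implementations live in `src/transform/cleanse.ts`:

```typescript
type CleanseFn = (v: string, ...args: string[]) => string | null;

// A subset of the ops table, each matching its documented example.
const ops: Record<string, CleanseFn> = {
  trim: v => v.trim(),
  uppercase: v => v.toUpperCase(),
  titleCase: v => v.toLowerCase().replace(/\b\w/g, c => c.toUpperCase()),
  padStart: (v, len, fill) => v.padStart(Number(len), fill),
  truncate: (v, len) => v.slice(0, Number(len)),
  nullIfEmpty: v => (v === '' ? null : v),
};

// "trim|titleCase" → trim first, then titleCase; "padStart:6:0" → op + args.
function applyCleanse(chain: string, input: string): string | null {
  let value: string | null = input;
  for (const step of chain.split('|')) {
    if (value === null) break;               // a null short-circuits the chain
    const [name, ...args] = step.split(':'); // colon-separated op arguments
    value = ops[name](value, ...args);
  }
  return value;
}
```

Order matters: `trim|nullIfEmpty` turns a whitespace-only value into `null`, whereas `nullIfEmpty|trim` would not.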
---

### `target` section

```yaml
target:
  adapter: ifs                 # REQUIRED. One of:
                               #   bc | ifs | bluecherry | csv | pg | rest

  # ── IFS adapter ───────────────────────────────────────────
  adapter: ifs
  output: ./output/acme-corp-customers-ifs.csv
  entity: CustomerInfo         # IFS entity name (used in import log).
  includeHeader: false         # Default: false (standard IFS import format).
  columnOrder:                 # Optional. Forces specific column ordering.
    - CustomerNo               # Must match transform 'to' field names.
    - Name
    - Address1
  dateFormat: YYYY-MM-DD       # Default: YYYY-MM-DD
  delimiter: ","               # Default: ","
  encoding: utf-8              # Default: utf-8

  # ── BlueCherry adapter ────────────────────────────────────
  adapter: bluecherry
  entity: Style                # REQUIRED. One of: Style | Vendor |
                               #   PurchaseOrder | PODetail | Season | ColourSize
  output: ./output/style-co-styles-bc.csv
  template: default            # Optional. 'default' uses built-in required
                               # columns. Or path to a header-only template CSV
                               # whose first row defines column order.
  includeHeader: true          # Default: true (BlueCherry expects headers).
  dateFormat: MM/DD/YYYY       # Default: MM/DD/YYYY (BlueCherry is US-origin).
  delimiter: ","
  encoding: utf-8
  nullValue: ""                # How nulls are rendered. Default: ""

  # ── Business Central REST adapter ─────────────────────────
  adapter: bc
  baseUrl: ${BC_BASE_URL}
  company: ${BC_COMPANY}
  entity: customers            # OData entity name (lowercase, plural).
  apiVersion: v2.0             # Default: v2.0
  onConflict: fail             # fail | upsert. Default: fail.
  batchEndpoint: true          # Use OData $batch. Default: true.

  # ── Generic CSV adapter ───────────────────────────────────
  adapter: csv
  output: ./output/data.csv
  includeHeader: true
  delimiter: ","
  encoding: utf-8
  nullValue: ""

  # ── PostgreSQL adapter ────────────────────────────────────
  adapter: pg
  connection: ${TARGET_PG}
  table: customers
  schema: public               # Default: public
  onConflict: fail             # fail | upsert | ignore
  upsertKey: [customer_no]     # REQUIRED if onConflict: upsert
```

---

### `run` section

All fields optional. Shown with defaults.

```yaml
run:
  mode: full                   # full | incremental | validate-only
  batchSize: 500               # Rows per DuckDB insert batch.
  onError: continue            # continue | stop
  logLevel: info               # debug | info | warn | error
  dryRun: false                # true: DQ + transform, no output written.
  outputDir: ./output          # Base directory for all output files.
  stagingDb: ""                # DuckDB path. Default: {outputDir}/{name}.duckdb
                               # Set ':memory:' to force in-memory mode.
  incrementalField: UPDATED_AT # Source field for incremental mode.
  incrementalSince: ""         # ISO datetime. If empty, reads from state file.
```
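The `incrementalSince` precedence implied above (explicit config value wins, otherwise the state file's last-run timestamp, otherwise a full load) can be sketched as follows. The state-file shape here is an assumption for illustration only:

```typescript
// Hypothetical state-file shape; only the watermark matters for this sketch.
type State = { lastRunAt?: string };

function resolveSince(configSince: string, state: State): string | null {
  if (configSince !== '') return configSince;  // explicit run.incrementalSince wins
  if (state.lastRunAt) return state.lastRunAt; // otherwise the previous run's watermark
  return null;                                 // no watermark yet → full extract
}
```

A `null` result signals the first run, where incremental mode has nothing to filter against.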
---

### Full example — Acme Corp customers (MSSQL → IFS)

```yaml
pipeline:
  name: acme-corp-customers
  client: acme-corp
  version: "1.0"
  entity: CustomerInfo
  description: Customer master — legacy Sage SQL to IFS ERP

source:
  adapter: mssql
  connection: ${SOURCE_MSSQL}
  query: |
    SELECT
      c.CUST_CODE, c.CUST_NAME, c.ADDR1, c.ADDR2,
      c.POST_CODE, c.COUNTRY, c.EMAIL, c.TEL,
      c.CREDIT_LIMIT, c.CURRENCY, c.ACCT_MGR_ID
    FROM dbo.Customers c
    WHERE c.Active = 1 AND c.DELETED = 0

dq:
  stopOnCritical: true
  rejectionFile: ./output/acme-corp-customers-rejected.csv
  rules:
    - field: CUST_CODE
      checks:
        - { type: notNull, severity: critical }
        - { type: unique, severity: critical }
        - { type: pattern, value: "^[A-Z0-9]{3,10}$", severity: warning }
    - field: CUST_NAME
      checks:
        - { type: notNull, severity: critical }
        - { type: maxLength, value: 100, severity: warning }
    - field: POST_CODE
      checks:
        - { type: ukPostcode, severity: warning }
    - field: EMAIL
      checks:
        - { type: email, severity: warning }
    - field: CREDIT_LIMIT
      checks:
        - { type: min, value: 0, severity: critical }
        - { type: max, value: 500000, severity: warning }
    - field: COUNTRY
      checks:
        - { type: allowedValues, value: [GB, IE, US, DE, FR], severity: warning }

transform:
  lookups:
    - name: currencyMap
      source: { adapter: csv, file: ./lookups/currency-codes.csv }
      key: legacyCode
      value: isoCode
    - name: acctMgrMap
      source:
        adapter: mssql
        connection: ${SOURCE_MSSQL}
        query: "SELECT STAFF_ID as key, IFS_USER_ID as value FROM dbo.Staff"
      key: key
      value: value
  fields:
    - { from: CUST_CODE, to: CustomerNo, type: string, max: 20 }
    - { from: CUST_NAME, to: Name, type: string, max: 100, cleanse: trim|titleCase }
    - { from: [ADDR1, ADDR2], to: Address1, type: concat, separator: ", ", cleanse: trim }
    - { from: POST_CODE, to: ZipCode, type: string, cleanse: trim|uppercase }
    - { from: COUNTRY, to: Country, type: string, default: GB }
    - { from: CURRENCY, to: CurrencyCode, type: lookup, lookup: currencyMap, default: GBP }
    - { from: ACCT_MGR_ID, to: SalesmanCode, type: lookup, lookup: acctMgrMap, optional: true }
    - { from: CREDIT_LIMIT, to: CreditLimit, type: decimal, precision: 2 }
    - { from: EMAIL, to: Email, type: string, cleanse: trim|lowercase }
    - { to: CustomerGroup, type: constant, value: DOMESTIC }
    - { to: SearchName, type: expression, value: "row.CUST_NAME.toUpperCase().substring(0, 20)" }

target:
  adapter: ifs
  entity: CustomerInfo
  output: ./output/acme-corp-customers-ifs.csv
  includeHeader: false
  columnOrder: [CustomerNo, Name, Address1, ZipCode, Country, CurrencyCode,
                SalesmanCode, CreditLimit, Email, CustomerGroup, SearchName]

run:
  mode: full
  batchSize: 500
  logLevel: info
  dryRun: false
```

---

### Full example — Style Co styles (CSV → BlueCherry)

```yaml
pipeline:
  name: style-co-styles
  client: style-co
  version: "1.0"
  entity: Style
  description: Style master migration from legacy CSV exports to BlueCherry ERP

source:
  adapter: csv
  file: ./data/styles-export.csv
  encoding: utf-8

dq:
  stopOnCritical: true
  rejectionFile: ./output/style-co-styles-rejected.csv
  rules:
    - field: STYLE_NO
      checks:
        - { type: notNull, severity: critical }
        - { type: unique, severity: critical }
        - { type: maxLength, value: 20, severity: warning }
    - field: STYLE_DESC
      checks:
        - { type: notNull, severity: critical }
        - { type: maxLength, value: 255, severity: warning }
    - field: DIVISION
      checks:
        - { type: notNull, severity: critical }
        - { type: allowedValues, value: [WOMENS, MENS, ACCESSORIES], severity: warning }
    - field: SEASON_CODE
      checks:
        - { type: notNull, severity: warning }
        - { type: pattern, value: "^(SS|AW)[0-9]{2}$", severity: warning }
    - field: COST_PRICE
      checks:
        - { type: min, value: 0, severity: critical }
        - { type: max, value: 9999.99, severity: warning }
    - field: RETAIL_PRICE
      checks:
        - { type: min, value: 0, severity: critical }

transform:
  lookups:
    - name: divisionMap
      source: { adapter: csv, file: ./lookups/division-codes.csv }
      key: legacyCode
      value: bcCode
    - name: vendorMap
      source: { adapter: csv, file: ./lookups/vendor-codes.csv }
      key: legacyVendorCode
      value: bcVendorNo
  fields:
    - { from: STYLE_NO, to: StyleNo, type: string, max: 20, cleanse: trim|uppercase }
    - { from: STYLE_DESC, to: StyleDesc, type: string, max: 255, cleanse: trim|normaliseUnicode }
    - { from: DIVISION, to: Division, type: lookup, lookup: divisionMap }
    - { from: SEASON_CODE, to: Season, type: string, max: 10 }
    - { from: VENDOR_CODE, to: VendorNo, type: lookup, lookup: vendorMap, optional: true }
    - { from: COST_PRICE, to: CostPrice, type: decimal, precision: 2 }
    - { from: RETAIL_PRICE, to: RetailPrice, type: decimal, precision: 2 }
    - { from: WEIGHT_KG, to: Weight, type: decimal, precision: 3, default: "0.000" }
    - { from: COUNTRY_ORIG, to: CountryOrigin, type: string, default: GB }
    - { from: FIBRE_CONTENT, to: FibreContent, type: string, max: 200, cleanse: trim }
    - { to: ActiveFlag, type: constant, value: "Y" }
    - { to: CreatedDate, type: expression, value: "js: new Date().toLocaleDateString('en-US')" }

target:
  adapter: bluecherry
  entity: Style
  output: ./output/style-co-styles-bc.csv
  includeHeader: true
  dateFormat: MM/DD/YYYY
  nullValue: ""

run:
  mode: full
  batchSize: 200
  logLevel: info
  dryRun: false
```

---

## ═══════════════════════════════════════════════════════════
## MULTI-SOURCE PIPELINES (Phase 3)
## ═══════════════════════════════════════════════════════════

A multi-source pipeline replaces the single `source:` block with a top-level
`sources:` array (min 2 entries) plus a `merge:` block. The rest of the YAML
(`pipeline`, `dq`, `transform`, `target`, `run`) is unchanged. `PipelineSchema`
requires *either* `source` (single) *or* both `sources` + `merge` (multi) —
never both — and the CLI auto-routes multi-source configs to
`MultiSourcePipelineRunner` (see `src/cli.ts:createRunnerForPipeline`).

### Top-level layout

```yaml
pipeline: { ... }
sources: [ { ... }, { ... } ]   # REQUIRED in multi-source mode; min 2 entries
merge: { ... }                  # REQUIRED when `sources` is present
dq: { ... }
transform: { ... }
target: { ... }
run: { ... }
```
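The mutually-exclusive routing rule above can be sketched as a type guard. This is an illustration of the rule, not the actual `createRunnerForPipeline` code, and the pared-down shapes cover only the routing-relevant fields:

```typescript
// Either `source` alone, or `sources` + `merge` together — never both.
type SingleSource = { source: object; sources?: undefined };
type MultiSource = { sources: object[]; merge: object; source?: undefined };
type AnyPipeline = SingleSource | MultiSource;

function isMultiSource(p: AnyPipeline): p is MultiSource {
  return Array.isArray((p as MultiSource).sources);
}

// Mirrors the auto-routing described in the text above.
function runnerFor(p: AnyPipeline): string {
  return isMultiSource(p) ? 'MultiSourcePipelineRunner' : 'PipelineRunner';
}
```

Because the union makes the other branch's discriminating field `undefined`, a config carrying both `source` and `sources` fails to type-check at all.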
### `sources` entries

Each entry is a `SourceConfig` with three extra multi-source-only fields:

```yaml
sources:
  - id: sql-server             # REQUIRED. Lowercase alphanumeric + hyphens only;
                               # must be unique across the array; used as the
                               # staging table suffix (stg_raw_sql-server).
    priority: 1                # REQUIRED. Positive integer. Lower priority =
                               # higher precedence in coalesce / priority-override.
    adapter: mssql
    connection: ${SOURCE_2_MSSQL}
    query: |
      SELECT STYLE_NO, STYLE_DESC, COST_PRICE FROM dbo.Styles WHERE Active = 1

  - id: excel
    priority: 2
    adapter: xlsx
    file: ./data/product-data.xlsx
    sheet: "Products"
    rename:                    # Optional. { 'old column': 'new column' }.
      Style Number: STYLE_NO   # Applied in-place after extract, before DQ and
      Description: STYLE_DESC  # merge. Intended for CSV/XLSX sources where
      Fibre: FIBRE_CONTENT     # column headers are fixed; SQL/REST sources
                               # should rename in the query or field selection.
                               # Unknown keys are logged as warnings, not errors.
```

### `merge` block

```yaml
merge:
  key: STYLE_NO                # REQUIRED. Single column name or array of
                               # columns (composite key). Must exist in every
                               # source after `rename` is applied.

  strategy: coalesce           # Default: coalesce. One of:
                               #   coalesce           first non-null value wins
                               #                      (priority-ordered; whitespace
                               #                      treated as blank)
                               #   priority-override  highest-priority source
                               #                      wins (even if null/blank)
                               #   union              all rows from all sources
                               #                      (dedupe by key)
                               #   intersect          only rows present in ALL
                               #                      sources

  onUnmatched: include         # Default: include. One of:
                               #   include (default)  keep unmatched rows
                               #   exclude            drop them
                               #   warn               keep and log a warning
                               #   error              fail the pipeline
                               # Ignored by `intersect`, which always excludes.

  fieldStrategies:             # Optional. Per-field overrides of the
                               # top-level strategy.
    - field: FIBRE_CONTENT
      source: excel            # Force this field to always come from the
                               # named source, ignoring priority.
    - field: COST_PRICE
      strategy: priority-override  # Override just this field's strategy.

  conflictLog: ./output/style-co-products-conflicts.csv
                               # Optional. CSV of (key, field, winning_source,
                               # winning_value, source_values). Only written
                               # when at least one conflict is detected.

  incrementalSource: sql-server  # REQUIRED when `run.mode: incremental`.
                               # Must match one of the source `id` values.
                               # The named source is filtered by
                               # `run.incrementalField` / state-file lastRunAt;
                               # other sources run full each time.
```
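The difference between `coalesce` and `priority-override` comes down to how blanks are treated during field resolution. A minimal sketch, where each candidate is a (priority, value) pair from one source and lower priority numbers win (illustrative only, not the merge engine's real code):

```typescript
type Candidate = { priority: number; value: string | null };

const byPriority = (cands: Candidate[]) =>
  [...cands].sort((a, b) => a.priority - b.priority);

// coalesce: first non-null, non-blank value in priority order wins.
function coalesce(cands: Candidate[]): string | null {
  for (const c of byPriority(cands)) {
    if (c.value !== null && c.value.trim() !== '') return c.value;
  }
  return null;
}

// priority-override: the highest-priority source wins even if null/blank.
function priorityOverride(cands: Candidate[]): string | null {
  return byPriority(cands)[0].value;
}
```

With a blank value in the priority-1 source, `coalesce` falls through to priority 2 while `priority-override` keeps the blank — exactly the distinction a `fieldStrategies` entry exploits.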
### Multi-source DQ rules

`dq.rules[].sourceId` (optional) scopes a rule to a specific pre-merge source
table. Rules without `sourceId` run post-merge against `stg_merged`:

```yaml
dq:
  stopOnCritical: true
  rules:
    - field: STYLE_NO          # Pre-merge: runs against stg_raw_sql-server only.
      sourceId: sql-server
      checks:
        - { type: notNull, severity: critical }
        - { type: unique, severity: critical }

    - field: STYLE_DESC        # Post-merge: runs against stg_merged.
      checks:
        - { type: notNull, severity: critical }
        - { type: maxLength, value: 255, severity: warning }
```

Per-source rejection files are auto-named by appending `-{sourceId}` to the
configured `rejectionFile` stem. Rows failing a critical pre-merge rule are
filtered out of that source's staging table *before* the merge phase.

### Full example

See [tests/fixtures/style-co-products-merged.pipeline.yaml](tests/fixtures/style-co-products-merged.pipeline.yaml)
for a complete, tested multi-source pipeline (MSSQL + REST + XLSX → BlueCherry
with `coalesce` + `fieldStrategies` + `incrementalSource`).

### Invocation

```bash
sluice check tests/fixtures/style-co-products-merged.pipeline.yaml
sluice run tests/fixtures/style-co-products-merged.pipeline.yaml
sluice merge list-strategies
sluice merge info coalesce
```

---

## ═══════════════════════════════════════════════════════════
## ZOD SCHEMA (src/config/schema.ts)
## ═══════════════════════════════════════════════════════════

Reproduce this schema exactly. Do not invent additional fields or rename enums.

```typescript
import { z } from 'zod';

const Severity = z.enum(['critical', 'warning', 'info']);
const SourceAd = z.enum(['mssql', 'pg', 'csv', 'xlsx', 'rest']);
const TargetAd = z.enum(['bc', 'ifs', 'bluecherry', 'csv', 'pg', 'rest']);
const CleanseOps = z.string().regex(/^[a-zA-Z|:0-9]+$/);

const PaginationSchema = z.object({
  type: z.enum(['offset', 'cursor', 'page']),
  pageSize: z.number().int().positive().default(100),
  pageParam: z.string().optional(),
  totalField: z.string().optional(),
  dataField: z.string().optional(),
  cursorField: z.string().optional(),
  cursorParam: z.string().optional(),
});

export const SourceSchema = z.object({
  adapter: SourceAd,
  connection: z.string().optional(),
  query: z.string().optional(),
  file: z.string().optional(),
  endpoint: z.string().optional(),
  headers: z.record(z.string()).optional(),
  delimiter: z.string().default(','),
  encoding: z.string().default('utf-8'),
  sheet: z.union([z.string(), z.number()]).optional(),
  pagination: PaginationSchema.optional(),
}).refine(s => s.query || s.file || s.endpoint,
  { message: 'source must have query, file, or endpoint' });

const CheckType = z.enum([
  'notNull', 'unique', 'pattern', 'email', 'ukPostcode',
  'maxLength', 'min', 'max', 'allowedValues',
]);

const CheckSchema = z.object({
  type: CheckType,
  value: z.union([z.string(), z.number(), z.array(z.string())]).optional(),
  severity: Severity,
  message: z.string().optional(),
});

const DqRuleSchema = z.object({
  field: z.string(),
  checks: z.array(CheckSchema).min(1),
});

export const DqSchema = z.object({
  stopOnCritical: z.boolean().default(true),
  rejectionFile: z.string().optional(),
  rules: z.array(DqRuleSchema).default([]),
});

const LookupSchema = z.object({
  name: z.string(),
  source: SourceSchema,
  key: z.string(),
  value: z.string(),
});

const FieldType = z.enum([
  'string', 'number', 'decimal', 'boolean', 'date',
  'lookup', 'concat', 'constant', 'expression',
]);

const FieldMappingSchema = z.object({
  from: z.union([z.string(), z.array(z.string())]).optional(),
  to: z.string(),
  type: FieldType,
  max: z.number().optional(),
  precision: z.number().optional(),
  format: z.string().optional(),
  cleanse: CleanseOps.optional(),
  lookup: z.string().optional(),
  separator: z.string().optional(),
  value: z.union([z.string(), z.number(), z.boolean()]).optional(),
  default: z.union([z.string(), z.number(), z.boolean(), z.null()]).optional(),
  optional: z.boolean().default(false),
});

export const TransformSchema = z.object({
  lookups: z.array(LookupSchema).default([]),
  fields: z.array(FieldMappingSchema).min(1),
});

export const TargetSchema = z.object({
  adapter: TargetAd,
  output: z.string().optional(),
  entity: z.string().optional(),
  includeHeader: z.boolean().optional(),
  columnOrder: z.array(z.string()).optional(),
|
|
1015
|
+
dateFormat: z.string().optional(),
|
|
1016
|
+
delimiter: z.string().default(','),
|
|
1017
|
+
encoding: z.string().default('utf-8'),
|
|
1018
|
+
nullValue: z.string().default(''),
|
|
1019
|
+
template: z.string().optional(),
|
|
1020
|
+
// BC REST
|
|
1021
|
+
baseUrl: z.string().optional(),
|
|
1022
|
+
company: z.string().optional(),
|
|
1023
|
+
apiVersion: z.string().default('v2.0'),
|
|
1024
|
+
onConflict: z.enum(['fail', 'upsert', 'ignore']).default('fail'),
|
|
1025
|
+
upsertKey: z.array(z.string()).optional(),
|
|
1026
|
+
batchEndpoint: z.boolean().default(true),
|
|
1027
|
+
// PostgreSQL
|
|
1028
|
+
connection: z.string().optional(),
|
|
1029
|
+
table: z.string().optional(),
|
|
1030
|
+
schema: z.string().default('public'),
|
|
1031
|
+
});
|
|
1032
|
+
|
|
1033
|
+
export const RunSchema = z.object({
|
|
1034
|
+
mode: z.enum(['full', 'incremental', 'validate-only']).default('full'),
|
|
1035
|
+
batchSize: z.number().int().positive().default(500),
|
|
1036
|
+
onError: z.enum(['continue', 'stop']).default('continue'),
|
|
1037
|
+
logLevel: z.enum(['debug', 'info', 'warn', 'error']).default('info'),
|
|
1038
|
+
dryRun: z.boolean().default(false),
|
|
1039
|
+
outputDir: z.string().default('./output'),
|
|
1040
|
+
stagingDb: z.string().default(''),
|
|
1041
|
+
// Phase 4a — enrich tuning (consumed by @caracal-lynx/sluice-enrich)
|
|
1042
|
+
enrichConcurrency: z.number().int().positive().default(5),
|
|
1043
|
+
enrichTimeoutMs: z.number().int().positive().default(5000),
|
|
1044
|
+
enrichMaxRetries: z.number().int().min(0).max(5).default(3),
|
|
1045
|
+
incrementalField: z.string().optional(),
|
|
1046
|
+
incrementalSince: z.string().optional(),
|
|
1047
|
+
});
|
|
1048
|
+
|
|
1049
|
+
export const PipelineSchema = z.object({
|
|
1050
|
+
pipeline: z.object({
|
|
1051
|
+
name: z.string().regex(/^[a-z0-9-]+$/),
|
|
1052
|
+
client: z.string(),
|
|
1053
|
+
version: z.string(),
|
|
1054
|
+
entity: z.string(),
|
|
1055
|
+
description: z.string().optional(),
|
|
1056
|
+
}),
|
|
1057
|
+
source: SourceSchema,
|
|
1058
|
+
enrich: EnrichSchema.optional(), // Phase 4a — runs between Extract/Merge and DQ
|
|
1059
|
+
dq: DqSchema,
|
|
1060
|
+
transform: TransformSchema,
|
|
1061
|
+
target: TargetSchema,
|
|
1062
|
+
run: RunSchema.default({}),
|
|
1063
|
+
});
|
|
1064
|
+
|
|
1065
|
+
// Inferred types — use these everywhere; do not write manual interfaces.
|
|
1066
|
+
export type Pipeline = z.infer<typeof PipelineSchema>;
|
|
1067
|
+
export type SourceConfig = z.infer<typeof SourceSchema>;
|
|
1068
|
+
export type TargetConfig = z.infer<typeof TargetSchema>;
|
|
1069
|
+
export type RunConfig = z.infer<typeof RunSchema>;
|
|
1070
|
+
export type FieldMapping = z.infer<typeof FieldMappingSchema>;
|
|
1071
|
+
export type DqRule = z.infer<typeof DqRuleSchema>;
|
|
1072
|
+
export type Lookup = z.infer<typeof LookupSchema>;
|
|
1073
|
+
```
|
|
1074
|
+
|
|
1075
|
+
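The `CleanseOps` pattern admits pipe-separated op names with optional `:`-suffixed arguments (e.g. `trim|upper|truncate:30`). A minimal sketch of parsing such a spec — the op names here are illustrative examples, not the toolkit's canonical list, which lives in the transform engine:

```typescript
// Parse a cleanse spec like "trim|upper|truncate:30" into structured ops.
// Op names are hypothetical; the real built-in list is defined elsewhere.
interface CleanseOp { name: string; arg?: string }

const CLEANSE_SPEC = /^[a-zA-Z|:0-9]+$/; // mirrors the CleanseOps Zod regex

function parseCleanse(spec: string): CleanseOp[] {
  if (!CLEANSE_SPEC.test(spec)) {
    throw new Error(`invalid cleanse spec: ${spec}`);
  }
  return spec.split('|').map((part) => {
    const [name, arg] = part.split(':');
    return arg === undefined ? { name } : { name, arg };
  });
}

console.log(parseCleanse('trim|upper|truncate:30'));
// → [ { name: 'trim' }, { name: 'upper' }, { name: 'truncate', arg: '30' } ]
```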
### Phase 2 schema additions (already in `src/config/schema.ts`)

The following are forward-looking additions that extend the canonical schema above.
They are live in the codebase and tested. Do not remove them.

- **`DqSchema.rulesFile`** (`z.string().optional()`) — path to a composite rule
  library YAML file. `ConfigLoader` expands composite rule references into
  built-in check types before Zod validation, so the pipeline runner only sees
  standard checks.
- **`FieldType` includes `'custom'`** — delegates to a `TransformPlugin` via
  `customOp`. Requires `customOp` to be set (enforced by a `.refine()`).
- **`FieldMappingSchema.customOp`** (`z.string().optional()`) — plugin ID for
  `type: custom` fields.
- **`FieldMappingSchema.options`** (`z.record(z.unknown()).optional()`) — arbitrary
  per-plugin config passed through to the transform plugin.
- **`FieldMappingSchema` refinement** — field types in `TYPES_REQUIRING_FROM`
  (`string`, `number`, `decimal`, `boolean`, `date`, `lookup`, `concat`) must
  declare `from`. Only `constant`, `expression`, and `custom` may omit it.
- **`TargetSchema` refinement** — when `onConflict: 'upsert'`, a non-empty
  `upsertKey` is required (checked at config-parse time).
- **`ToolkitConfigSchema`** — schema for `sluice.config.yaml` (toolkit-level
  plugin loading). Consumed by `PipelineRunner.loadAllPlugins()` via
  `plugins/loader.ts → loadNpmPlugins()` at the start of every run.
- **`CompositeRuleSchema` / `CompositeRuleLibrarySchema`** — schemas for the
  shared rule library YAML files referenced by `dq.rulesFile`.

### Phase 3 schema additions (multi-source merge)

- **`DqRuleSchema.sourceId`** (`z.string().optional()`) — scopes a rule to a
  named pre-merge source; omitted for post-merge rules.
- **`PipelineSchema.source`** — now `optional()`; mutually exclusive with
  `sources` (enforced by `.refine()`).
- **`PipelineSchema.sources`** (`z.array(MultiSourceEntrySchema).min(2).optional()`)
  — the multi-source array. Refinement also checks unique source ids and
  (in incremental mode) that `merge.incrementalSource` matches a source id.
- **`PipelineSchema.merge`** (`MergeSchema.optional()`) — per-pipeline merge
  config. Defaults: `strategy: 'coalesce'`, `onUnmatched: 'include'`.
- **`MergeSchema`** — `key`, `strategy`, `onUnmatched`, `fieldStrategies[]`,
  `conflictLog`, `incrementalSource`.
- **`MergeFieldStrategySchema`** — per-field override: `field`, optional
  `strategy`, optional `source` (at least one required).
- **`MultiSourceEntrySchema`** — extends `SourceBaseSchema` with `id`,
  `priority`, and optional `rename`.
- **`isSingleSource(p)` / `isMultiSource(p)`** — exported type guards that
  narrow `Pipeline` to the single- or multi-source shape.
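
The two type guards can be sketched against a simplified pipeline shape — the real `Pipeline` type is inferred from Zod, so the interface below is an illustrative stand-in, not the actual type:

```typescript
// Simplified stand-in for the inferred Pipeline type.
interface PipelineLike {
  source?: unknown;
  sources?: unknown[];
}

// A multi-source pipeline has a `sources` array with at least two entries
// (MultiSourceEntrySchema enforces .min(2)).
function isMultiSource(p: PipelineLike): boolean {
  return Array.isArray(p.sources) && p.sources.length >= 2;
}

// A single-source pipeline has `source` set and no qualifying `sources` array.
function isSingleSource(p: PipelineLike): boolean {
  return p.source !== undefined && !isMultiSource(p);
}

console.log(isMultiSource({ sources: [{}, {}] })); // → true
console.log(isSingleSource({ source: {} }));       // → true
```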

---

## ═══════════════════════════════════════════════════════════
## PLUGIN INTERFACES
## ═══════════════════════════════════════════════════════════

### SourceAdapter (src/adapters/source/types.ts)

```typescript
export interface SourceAdapter {
  readonly id: string;
  connect(config: SourceConfig): Promise<void>;
  extract(
    config: SourceConfig,
    store: StagingStore,
    runConfig: RunConfig,
    onProgress: (rows: number) => void,
    targetTable?: string // defaults to 'stg_raw'; set per-source in
                         // multi-source pipelines
  ): Promise<ExtractResult>;
  disconnect(): Promise<void>;
}

export interface ExtractResult {
  rowsExtracted: number;
  tableName: string;   // caller-supplied; 'stg_raw' for single-source,
                       // 'stg_raw_{sourceId}' for each source in a
                       // multi-source pipeline
  columns: ColumnMeta[];
}

export interface ColumnMeta {
  name: string;
  duckDbType: string;  // VARCHAR | BIGINT | DOUBLE | BOOLEAN | TIMESTAMP
}
```

### TargetAdapter (src/adapters/target/types.ts)

```typescript
export interface TargetAdapter {
  readonly id: string;
  connect(config: TargetConfig): Promise<void>;
  load(
    config: TargetConfig,
    store: StagingStore,
    runConfig: RunConfig,
    onProgress: (rows: number) => void
  ): Promise<LoadResult>;
  disconnect(): Promise<void>;
}

export interface LoadResult {
  rowsLoaded: number;
  rowsFailed: number;
  outputPath?: string; // set for file-based targets
}
```

### DQ Rule (src/dq/rules/types.ts)

```typescript
export interface Rule {
  readonly id: string;
  validate(
    value: unknown,
    config: CheckConfig,
    rowIndex: number,
    field: string
  ): RuleViolation | null;
}

export interface RuleViolation {
  field: string;
  rowIndex: number;
  value: unknown;
  rule: string;
  severity: 'critical' | 'warning' | 'info';
  message: string;
}
```

### MergeStrategyPlugin (src/merge/types.ts)

```typescript
export interface MergeSourceMeta {
  id: string;
  priority: number;
  tableName: string; // e.g. 'stg_raw_sql-server'
}

export interface MergeResult {
  rowsMerged: number;
  conflicts: number; // fields where two non-null values disagreed
  unmatched: number; // records present in only one source
  tableName: 'stg_merged';
}

export interface MergeStrategyPlugin {
  readonly id: string;           // matches MergeSchema.strategy value
  readonly description?: string; // shown by `sluice merge list-strategies`

  merge(
    store: StagingStore,
    sources: MergeSourceMeta[],  // priority-ordered (priority 1 first)
    config: MergeConfig,
  ): Promise<MergeResult>;
}
```

Built-in strategies: `coalesce`, `priority-override`, `union`, `intersect`
(all pre-registered in `MergeStrategyRegistry`; live in
`src/merge/strategies/*.ts`). Custom strategies can be dropped into a
`plugins/` folder as `*.merge.ts` files exporting `const mergeStrategy`.

---

## ═══════════════════════════════════════════════════════════
## ADAPTER IMPLEMENTATION NOTES
## ═══════════════════════════════════════════════════════════

### mssql source

- Stream results: `request.stream = true` + `recordset`/`row` events.
- SQL Server → DuckDB type map: `varchar/nvarchar/char → VARCHAR`,
  `int/bigint → BIGINT`, `decimal/numeric/money → DOUBLE`,
  `bit → BOOLEAN`, `datetime/date → TIMESTAMP`, `float/real → DOUBLE`.
- Trusted connection: detect `trustedConnection: true` in JSON connection config.
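
The type map above is naturally a lookup table. A sketch (the `VARCHAR` fallback for unmapped SQL Server types is an assumption, not stated above):

```typescript
// SQL Server → DuckDB type map from the bullets above.
const TYPE_MAP: Record<string, string> = {
  varchar: 'VARCHAR', nvarchar: 'VARCHAR', char: 'VARCHAR',
  int: 'BIGINT', bigint: 'BIGINT',
  decimal: 'DOUBLE', numeric: 'DOUBLE', money: 'DOUBLE',
  bit: 'BOOLEAN',
  datetime: 'TIMESTAMP', date: 'TIMESTAMP',
  float: 'DOUBLE', real: 'DOUBLE',
};

// Fallback to VARCHAR for anything unmapped — an assumption of this sketch.
const toDuckDbType = (sqlServerType: string): string =>
  TYPE_MAP[sqlServerType.toLowerCase()] ?? 'VARCHAR';

console.log(toDuckDbType('NVARCHAR'), toDuckDbType('money'), toDuckDbType('xml'));
// → VARCHAR DOUBLE VARCHAR
```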
### csv source

- `csv-parse` options: `{ columns: true, skip_empty_lines: true, bom: true }`.
  `bom: true` strips the UTF-8 BOM common in Excel-generated CSVs.
- All columns inferred as `VARCHAR` in DuckDB.
- Support glob patterns: concatenate all matching files into a single staging table.

### xlsx source

- SheetJS: convert to CSV via `xlsx.utils.sheet_to_csv`, then pipe through csv-parse.
- Log a warning if workbook has more than one sheet and `source.sheet` is unset.

### rest source

- `axios-retry`: 3 retries, exponential backoff, retry on 429 and 5xx.
- Flatten nested JSON using `__` separator (`address.postCode` → `address__postCode`).
- All three pagination types must be supported: offset, page, cursor.
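
The `__` flattening rule can be sketched as a small recursive walk. Treating arrays as JSON strings is an assumption of this sketch, not something the spec above mandates:

```typescript
// Flatten nested JSON with the '__' separator described above.
// Arrays are JSON-stringified here for simplicity (an assumption).
function flatten(
  obj: Record<string, unknown>,
  prefix = '',
  out: Record<string, unknown> = {},
): Record<string, unknown> {
  for (const [key, value] of Object.entries(obj)) {
    const name = prefix ? `${prefix}__${key}` : key;
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      flatten(value as Record<string, unknown>, name, out);
    } else {
      out[name] = Array.isArray(value) ? JSON.stringify(value) : value;
    }
  }
  return out;
}

console.log(flatten({ id: 1, address: { postCode: 'G1 1AA' } }));
// → { id: 1, address__postCode: 'G1 1AA' }
```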
### IFS target

- UTF-8 CSV via `csv-stringify`.
- `includeHeader` defaults to `false` for this adapter.
- Apply `target.columnOrder` if specified.
- Format date columns using `dayjs` with `target.dateFormat` (default `YYYY-MM-DD`).

### BlueCherry target (src/adapters/target/bluecherry.ts)

BlueCherry ERP (CGS — Computer Generated Solutions) uses fixed-format CSV for
bulk import. Each entity type has a required column set. The adapter validates
required columns at `connect()` time, before any data is read.

**Required columns per entity:**

```typescript
const REQUIRED_COLUMNS: Record<string, string[]> = {
  Style: [
    'StyleNo', 'StyleDesc', 'Division', 'Season',
    'CostPrice', 'RetailPrice', 'ActiveFlag',
  ],
  Vendor: [
    'VendorNo', 'VendorName', 'Country', 'CurrencyCode',
  ],
  PurchaseOrder: [
    'PONumber', 'VendorNo', 'Season', 'OrderDate', 'DeliveryDate',
  ],
  PODetail: [
    'PONumber', 'StyleNo', 'ColourCode', 'SizeCode', 'Quantity', 'CostPrice',
  ],
  Season: [
    'SeasonCode', 'SeasonDesc', 'StartDate', 'EndDate',
  ],
  ColourSize: [
    'StyleNo', 'ColourCode', 'ColourDesc', 'SizeCode', 'SizeDesc',
  ],
};
```

**Behaviour:**
- `includeHeader` defaults to `true`.
- Default `dateFormat` is `MM/DD/YYYY` (BlueCherry is US-origin software).
- Any column whose name ends with `Date` (case-insensitive) is automatically
  formatted using `target.dateFormat` via `dayjs`.
- `nullValue` (default `""`) is used for all null/undefined fields.
- At `connect()`:
  1. Verify `target.entity` is in `REQUIRED_COLUMNS`. Throw `ConfigError` if not.
  2. Query `store.columnNames('stg_transformed')` and verify all required columns
     are present. Throw `ConfigError` listing any missing columns.
  3. If `target.template` is a file path, read its header row and use it as the
     definitive column order for the output. If `target.template === 'default'`,
     use the required columns list as column order, with any additional columns
     from `stg_transformed` appended.

**Note on BlueCherry column names:** The column names in `REQUIRED_COLUMNS` are
internal conventions for this toolkit. Verify them against the actual BlueCherry
import documentation before running a live migration. The `template` feature exists
precisely to override these if the client's BlueCherry instance uses different names.
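
The two `connect()`-time checks above can be sketched in isolation. `REQUIRED_COLUMNS` here is a trimmed excerpt, and `missingColumns` is an illustrative helper name, not the adapter's actual internals:

```typescript
// Trimmed excerpt of the entity → required columns map.
const REQUIRED_COLUMNS: Record<string, string[]> = {
  Vendor: ['VendorNo', 'VendorName', 'Country', 'CurrencyCode'],
};

// Step 1 + 2 of connect(): unknown entity throws; otherwise report gaps.
function missingColumns(entity: string, staged: string[]): string[] {
  const required = REQUIRED_COLUMNS[entity];
  if (!required) throw new Error(`unknown BlueCherry entity: ${entity}`);
  return required.filter((c) => !staged.includes(c));
}

// Any column ending in "Date" (case-insensitive) gets dateFormat applied.
const isDateColumn = (name: string): boolean => /date$/i.test(name);

console.log(missingColumns('Vendor', ['VendorNo', 'VendorName', 'Country']));
// → [ 'CurrencyCode' ]
console.log(isDateColumn('DeliveryDate'), isDateColumn('Updated'));
// → true false
```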
### Business Central REST target

- OAuth2 client credentials: `POST https://login.microsoftonline.com/{tenantId}/oauth2/v2.0/token`
- Cache token in memory; refresh 60 seconds before expiry.
- OData `$batch`: `POST {baseUrl}/api/{version}/companies({company})/$batch`
  with `Content-Type: multipart/mixed; boundary=batch_{uuid}`.
  Maximum 100 operations per batch request.
- HTTP 409 with `onConflict: upsert` → issue PATCH to individual entity URL.
- HTTP 4xx (non-409): log error, increment `rowsFailed`, continue if
  `run.onError: continue`.
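
The token-cache rule (refresh 60 seconds before expiry) reduces to one comparison. Field names here are illustrative, not the adapter's actual internals:

```typescript
// In-memory token cache check: refresh 60s before expiry.
interface CachedToken {
  accessToken: string;
  expiresAt: number; // epoch ms
}

const REFRESH_MARGIN_MS = 60_000;

function needsRefresh(token: CachedToken | null, nowMs: number): boolean {
  return token === null || nowMs >= token.expiresAt - REFRESH_MARGIN_MS;
}

const token = { accessToken: 'abc', expiresAt: 1_000_000 };
console.log(needsRefresh(token, 1_000_000 - 61_000)); // → false (61s left)
console.log(needsRefresh(token, 1_000_000 - 30_000)); // → true  (30s left)
```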
---

## ═══════════════════════════════════════════════════════════
## PIPELINE RUNNER — EXECUTION ORDER
## ═══════════════════════════════════════════════════════════

**Important:** `ConfigLoader.load()` interpolates `${ENV_VAR}` tokens from
`process.env` but does **not** call `loadEnv()` / `dotenv.config()` itself.
The CLI entry point must call `loadEnv()` before invoking the loader. This keeps
`ConfigLoader` side-effect-free and testable (tests stub `process.env` directly).

```
 1.  Load + validate config        ConfigLoader.load(yamlPath)
 2.  Resolve output directory      create if not exists
 3.  Open DuckDB staging store     StagingStore.open(dbPath)
 4.  Connect source adapter
 5.  Extract → 'stg_raw'           log: rows extracted
 5a. Disconnect source adapter     always in finally
 5b. Phase 4a Enrich (optional)    runs only when:
                                   - `enrich:` block configured
                                   - --no-enrich NOT set
                                   - mode != validate-only and not dryRun
                                   - @caracal-lynx/sluice-enrich is installed
                                     and has called registerEnrichPhase()
                                   Otherwise skipped (WARN log if last bullet
                                   fails). Writes new columns to 'stg_raw'.
 6.  Run DQ rules against 'stg_raw'
     a. Collect all RuleViolations
     b. Write rejection CSV
     c. Write summary JSON
     d. Log DQ summary (info)
     e. If stopOnCritical AND criticalCount > 0 → throw PipelineDQError
 7.  Resolve all lookups           LookupResolver.loadAll()
 8.  Transform 'stg_raw' → 'stg_transformed'   (batch by batchSize)
 9.  If dryRun === true → STOP     (log summary, exit 0)
 10. If mode === 'validate-only' → STOP        (log summary, exit 0)
 11. Connect target adapter
 12. Load 'stg_transformed' → target
 12a. Disconnect target adapter    always in finally
 13. Close DuckDB staging store    always in finally
 14. Write run state file          {outputDir}/{name}-state.json
 15. Log final summary (info)
```

**Run state file** `{outputDir}/{name}-state.json`:
```json
{
  "pipeline": "acme-corp-customers",
  "lastRunAt": "2026-04-15T09:30:00.000Z",
  "lastMode": "full",
  "rowsExtracted": 1842,
  "rowsLoaded": 1801,
  "criticalViolations": 0,
  "warnings": 41,
  "incrementalSince": ""
}
```

Used by `mode: incremental` to auto-determine the `since` timestamp.

### Multi-source execution order (`MultiSourcePipelineRunner`)

For a pipeline with `sources` + `merge`, the CLI selects
`MultiSourcePipelineRunner` (a subclass of `PipelineRunner` that overrides
`run()`, `profile()`, and `writeStateFile()` and reuses the protected
`runExtract`, `runDQ`, `runTransform`, `runLoad` phase methods).

```
 1.  Load + validate config        ConfigLoader.load(yamlPath)
 2.  Load plugins                  files + sluice.config.yaml (Tier 2/3)
 3.  Resolve output dir, open DuckDB staging store
 4.  For each source (priority-ordered):
     a. runExtract → 'stg_raw_{sourceId}'
     b. If source.rename is set    StagingStore.renameColumns(...)
     c. If mode: incremental AND source.id === merge.incrementalSource:
        apply TRY_CAST(... AS TIMESTAMP) >= since filter
     d. Filter dq.rules by sourceId; runDQ against 'stg_raw_{sourceId}'
        (writes per-source rejection CSV, stops on critical)
     e. Rewrite 'stg_raw_{sourceId}' to only the accepted rows
 5.  MergeEngine.run(store, sources, merge)
     → creates 'stg_merge_joined', 'stg_merged', 'stg_merge_conflicts'
     → writes conflictLog CSV if configured
 5a. Phase 4a Enrich (optional)    runs once against 'stg_merged' if
                                   `enrich:` block is present and the four
                                   gating conditions hold (see single-source
                                   step 5b above). Single post-merge pass —
                                   never per-source.
 6.  runDQ on the post-merge rules (no sourceId) against 'stg_merged'
 7.  Filter rejected rows; runTransform against the filtered merge result
 8.  If dryRun OR validate-only → STOP
 9.  runLoad → target adapter reads 'stg_transformed'
 10. writeStateFile → per-source lastRunAt block + top-level summary
 11. Close DuckDB
```
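
Step 5 hands merging to a strategy such as `coalesce`, which for each column takes the first non-null value across sources in priority order. A sketch of how that strategy might build its per-column SQL — table and column names are illustrative; the real implementation lives in `src/merge/strategies/coalesce.ts`:

```typescript
// Build a COALESCE projection for one column across priority-ordered
// per-source staging tables (priority 1 first).
function coalesceSelect(column: string, tables: string[]): string {
  const refs = tables.map((t) => `"${t}"."${column}"`);
  return `COALESCE(${refs.join(', ')}) AS "${column}"`;
}

console.log(coalesceSelect('email', ['stg_raw_sql-server', 'stg_raw_excel']));
// → COALESCE("stg_raw_sql-server"."email", "stg_raw_excel"."email") AS "email"
```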

**Multi-source state file** adds a `sources` block keyed by source id:

```json
{
  "pipeline": "style-co-products-merged",
  "lastRunAt": "2026-04-19T09:30:00.000Z",
  "lastMode": "incremental",
  "rowsMerged": 3201,
  "rowsLoaded": 3188,
  "criticalViolations": 0,
  "warnings": 14,
  "incrementalSince": "",
  "sources": {
    "sql-server": {
      "lastRunAt": "2026-04-19T09:30:00.000Z",
      "rowsExtracted": 2910,
      "incrementalSince": "2026-04-18T22:00:00.000Z"
    },
    "excel": { "lastRunAt": "...", "rowsExtracted": 412, "incrementalSince": "" }
  }
}
```

---

## ═══════════════════════════════════════════════════════════
## DUCKDB STAGING STORE (src/staging/store.ts)
## ═══════════════════════════════════════════════════════════

```typescript
class StagingStore {
  constructor(private dbPath: string) {} // ':memory:' for dryRun/tests

  async open(): Promise<void>
  async close(): Promise<void>
  async createTable(name: string, columns: ColumnMeta[]): Promise<void>
  async insertBatch(table: string, rows: Record<string, unknown>[]): Promise<void>
  async query<T>(sql: string, params?: unknown[]): Promise<T[]>
  async tableExists(name: string): Promise<boolean>
  async dropTable(name: string): Promise<void>
  async rowCount(table: string): Promise<number>
  async columnNames(table: string): Promise<string[]>
  async exportToCsv(
    table: string,
    outputPath: string,
    options?: { delimiter?: string; header?: boolean; encoding?: string }
  ): Promise<void>
  async renameColumns(                // Phase 3: used by MultiSourcePipelineRunner
    tableName: string,                // after a per-source extract. Implemented as
    renames: Record<string, string>   // CREATE OR REPLACE TABLE ... AS SELECT ...
  ): Promise<void>                    // Unknown keys log a warning, not an error.
}
```

Default DuckDB path: `{outputDir}/{pipelineName}.duckdb`
Use `':memory:'` when `dryRun: true` or `stagingDb: ':memory:'`.
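
The path-selection rule above can be sketched as one function. The precedence shown (explicit `stagingDb` over the default path) is an assumption about how the runner resolves it:

```typescript
// Resolve the DuckDB staging path from run config, per the rules above.
interface RunPathConfig {
  stagingDb: string; // '' means "use the default path"
  dryRun: boolean;
  outputDir: string;
}

function stagingDbPath(run: RunPathConfig, pipelineName: string): string {
  if (run.dryRun || run.stagingDb === ':memory:') return ':memory:';
  // Assumption: an explicit stagingDb wins over the default location.
  return run.stagingDb || `${run.outputDir}/${pipelineName}.duckdb`;
}

console.log(stagingDbPath({ stagingDb: '', dryRun: false, outputDir: './output' }, 'acme-corp-customers'));
// → ./output/acme-corp-customers.duckdb
```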
---

## ═══════════════════════════════════════════════════════════
## TRANSFORM ENGINE (src/transform/engine.ts)
## ═══════════════════════════════════════════════════════════

### Field type behaviours

| type | behaviour |
|---|---|
| `string` | `String(value)`, cleanse ops, then truncate to `max` |
| `number` | `Math.round(Number(value))`. Throw `TransformError` if NaN. |
| `decimal` | `parseFloat(value).toFixed(precision)` stored as string |
| `boolean` | `['1','true','yes','y','t'].includes(String(v).toLowerCase())` |
| `date` | Parse with `dayjs(value, format)`; output as `target.dateFormat` or ISO |
| `lookup` | `LookupResolver.resolve(lookupName, value)` |
| `concat` | Join `from[]` with `separator`, then cleanse |
| `constant` | Emit `value` verbatim |
| `expression` | `ExpressionEvaluator.evaluate(expression, row)` |
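
The `boolean` row of the table is exactly this coercion:

```typescript
// `boolean` field-type coercion, verbatim from the table above.
const TRUTHY = ['1', 'true', 'yes', 'y', 't'];

const toBoolean = (v: unknown): boolean =>
  TRUTHY.includes(String(v).toLowerCase());

console.log(toBoolean('Yes'), toBoolean(0), toBoolean('T'));
// → true false true
```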
### Expression evaluator (src/transform/expression.ts)

**Must not use `eval()` or `new Function()`.**

1. Expression does NOT start with `js:` → use `expr-eval` Parser.
   Provide `row` as a variable containing all source field values.
2. Expression starts with `js:` → strip prefix, execute via
   `vm.runInNewContext(code, { row, Date, Math, JSON, String, Number, Boolean })`.
   Log a `warn` whenever the `js:` path is taken.
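
The `js:` path from step 2 can be sketched with node's `vm` module. This omits the required `warn` log and any timeout handling a real implementation would add:

```typescript
// Sketch of the `js:` escape hatch per step 2 above. The sandbox exposes
// only the listed globals; warn-logging and timeouts are omitted here.
import vm from 'node:vm';

function evaluateJs(expression: string, row: Record<string, unknown>): unknown {
  const code = expression.slice('js:'.length);
  return vm.runInNewContext(code, { row, Date, Math, JSON, String, Number, Boolean });
}

console.log(evaluateJs('js:row.qty * 2', { qty: 21 })); // → 42
```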
---

---

## ═══════════════════════════════════════════════════════════
## DQ REPORTER OUTPUT (src/dq/reporter.ts)
## ═══════════════════════════════════════════════════════════

**Rejection CSV** columns: `row_index`, `field`, `value`, `rule`, `severity`, `message`

**Summary JSON** (`{outputDir}/{name}-dq-summary.json`):
```json
{
  "pipeline": "acme-corp-customers",
  "runAt": "2026-04-15T09:30:00Z",
  "rowsChecked": 1842,
  "rowsPassed": 1801,
  "rowsRejected": 41,
  "violations": { "critical": 0, "warning": 38, "info": 3 },
  "byField": {
    "POST_CODE": { "critical": 0, "warning": 22 },
    "EMAIL": { "critical": 0, "warning": 16 }
  }
}
```
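
The `byField` block is a per-field severity tally over the collected violations. A sketch, assuming only the two `RuleViolation` fields it needs:

```typescript
// Aggregate RuleViolations into the byField shape shown above.
interface ViolationLike {
  field: string;
  severity: 'critical' | 'warning' | 'info';
}

function byField(violations: ViolationLike[]): Record<string, Record<string, number>> {
  const out: Record<string, Record<string, number>> = {};
  for (const v of violations) {
    out[v.field] ??= { critical: 0, warning: 0, info: 0 };
    out[v.field][v.severity] += 1;
  }
  return out;
}

console.log(byField([
  { field: 'EMAIL', severity: 'warning' },
  { field: 'EMAIL', severity: 'warning' },
  { field: 'POST_CODE', severity: 'critical' },
]));
```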

---

## ═══════════════════════════════════════════════════════════
## ERROR TYPES (src/utils/errors.ts)
## ═══════════════════════════════════════════════════════════

```typescript
export class PipelineError extends Error {
  constructor(message: string, public readonly cause?: unknown) {
    super(message);
    this.name = this.constructor.name;
    if (Error.captureStackTrace) {
      Error.captureStackTrace(this, this.constructor);
    }
  }
}
export class ConfigError extends PipelineError {}
export class SourceError extends PipelineError {}
export class StagingError extends PipelineError {}
export class DQError extends PipelineError {}
export class PipelineDQError extends DQError {
  constructor(
    public readonly criticalCount: number,
    public readonly reportPath: string,
  ) {
    super(`Pipeline halted: ${criticalCount} critical DQ violations. See ${reportPath}`);
  }
}
export class TransformError extends PipelineError {}
export class ExpressionError extends TransformError {}
export class LoadError extends PipelineError {}
export class EnrichError extends PipelineError {} // Phase 4a — exit code 4
```

All error subclasses inherit `this.name = this.constructor.name` from
`PipelineError`, so `err.name` reflects the actual class (e.g. `"ConfigError"`,
`"PipelineDQError"`). `Error.captureStackTrace` (V8-only) trims the constructor
frame from stack traces for cleaner output.
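
The name-assignment pattern can be checked in isolation (stubs here omit `cause` and stack trimming):

```typescript
// Minimal reproduction of the name pattern described above.
class PipelineError extends Error {
  constructor(message: string) {
    super(message);
    this.name = this.constructor.name; // subclasses inherit this
  }
}
class ConfigError extends PipelineError {}

console.log(new ConfigError('bad yaml').name); // → ConfigError
```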
---

## ═══════════════════════════════════════════════════════════
## CLI (src/cli.ts)
## ═══════════════════════════════════════════════════════════

```
sluice run <pipeline.yaml>        Full pipeline run (auto-detects single vs multi-source)
sluice validate <pipeline.yaml>   DQ + transform only; no load
sluice profile <pipeline.yaml>    Extract + column profiling; no DQ
sluice check <pipeline.yaml>      Config validation only; no execution
sluice plugins                    List all loaded rule/transform/merge plugins
sluice merge list-strategies      List all registered merge strategies
sluice merge info <strategy>      Show details about a specific merge strategy

Global options:
  --log-level <level>   debug | info | warn | error
  --env <file>          Path to .env file (default: ./.env)
  --output <dir>        Override outputDir
  --plugins <dir...>    Additional plugin directory/directories to load
  --dry-run             Force dryRun: true
  --silent              Suppress the progress bar on stdout (logs still go to stderr)

`sluice run` options:
  --no-enrich           Skip the Phase 4a enrich phase even if `enrich:` is configured.
                        (validate / profile / check do not run enrich at all, by design.)
```

**Progress feedback:** `sluice run`, `sluice validate`, and `sluice profile`
render a phase-by-phase progress bar to stdout via
`src/utils/progress.ts → ProgressReporter`, with per-phase emoji icons
(🔎 extract · 🛡️ DQ · 🔀 merge · 🌐 enrich · 🔧 transform · 📤 load), an ETA for
determinate phases, and a coloured ✅/⚠️/❌ run-summary line. The bar
degrades gracefully:
- `--silent` → no stdout output at all
- `--log-level debug` → bar disabled; per-row debug lines are used instead
- `process.stdout.isTTY` is false → plain-ASCII lines (one per phase),
  no emojis, no ANSI escapes — log-file friendly
- `NO_COLOR` env var → ANSI colour dropped (handled by `picocolors`)

**Exit codes:** `0` success · `1` pipeline error · `2` DQ critical violations · `3` config error · `4` enrich error (Phase 4a)
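
The failure-to-exit-code mapping can be sketched as follows. `ConfigError` and `PipelineDQError` are class names used elsewhere in this spec; `EnrichError` is a hypothetical name for the Phase 4a failure, used here only for illustration — this is a sketch, not the shipped CLI code.

```typescript
// Illustrative only — not the shipped CLI code.
class ConfigError extends Error {}
class PipelineDQError extends Error {}
class EnrichError extends Error {} // hypothetical name for a Phase 4a failure

function exitCodeFor(err: unknown): number {
  if (err instanceof PipelineDQError) return 2; // DQ critical violations
  if (err instanceof ConfigError) return 3;     // config error
  if (err instanceof EnrichError) return 4;     // enrich error (Phase 4a)
  return 1;                                     // any other pipeline error
}
```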

---

## ═══════════════════════════════════════════════════════════
## LOGGING (src/utils/logger.ts)
## ═══════════════════════════════════════════════════════════

Single `pino` instance. All log records (every level) go to **stderr**; stdout
is reserved exclusively for the progress bar and final summary rendered by
`ProgressReporter`. This mirrors how git, cargo, and npm split streams.

No `console.log` in `src/`. Operators who want logs in a file can run
`sluice run p.yaml 2>run.log` — the bar stays visible on the terminal while
every pino record is captured to the file. Use `--log-level error` to narrow
the file to errors only.

| Level | Used for |
|---|---|
| `debug` | Per-row progress, SQL queries, lookup cache hits |
| `info` | Phase transitions, row counts, file paths, run summary |
| `warn` | DQ warnings, missing optional lookups, `js:` expression usage |
| `error` | All caught errors before re-throw |

Dev: `npx sluice run pipeline.yaml 2>&1 | npx pino-pretty`
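
The stderr-only contract is easy to honour even without pino. A dependency-free sketch of the idea (this is not the real `src/utils/logger.ts`, which uses pino):

```typescript
// Sketch only: pino does this for real in src/utils/logger.ts.
type Level = 'debug' | 'info' | 'warn' | 'error';
const LEVELS: Record<Level, number> = { debug: 20, info: 30, warn: 40, error: 50 };

// The sink is injectable so tests can capture records; the default writes
// newline-delimited JSON to stderr, leaving stdout free for ProgressReporter.
function makeLogger(
  minLevel: Level,
  sink: (line: string) => void = (line) => { process.stderr.write(line); },
) {
  return (level: Level, msg: string): void => {
    if (LEVELS[level] < LEVELS[minLevel]) return;
    sink(JSON.stringify({ level: LEVELS[level], msg }) + '\n');
  };
}
```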

---

## ═══════════════════════════════════════════════════════════
## ENVIRONMENT VARIABLES (.env.example)
## ═══════════════════════════════════════════════════════════

```bash
# ── Acme Corp — source ────────────────────────────────────
SOURCE_MSSQL=mssql://user:password@serverlegacy.example.local/LegacyDB

# ── Acme Corp — IFS target ────────────────────────────────
IFS_IMPORT_PATH=C:\IFS\Import

# ── Business Central target (any client using the `bc` adapter) ──
BC_BASE_URL=https://api.businesscentral.dynamics.com/v2.0
BC_TENANT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
BC_CLIENT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
BC_CLIENT_SECRET=your-client-secret
BC_COMPANY=Example Company Ltd

# ── Style Co — source ─────────────────────────────────────
SOURCE_2_MSSQL=mssql://user:password@serverlegacy2.example.local/LegacyDB

# ── Style Co — BlueCherry (file-based; no API creds) ──────
BC_IMPORT_PATH=C:\BlueCherry\Import

# ── Runtime ───────────────────────────────────────────────
NODE_ENV=development
LOG_LEVEL=info
```

---

## ═══════════════════════════════════════════════════════════
## TESTING
## ═══════════════════════════════════════════════════════════

- **Vitest only.** No Jest.
- Unit tests: mock all I/O with `vi.mock`.
- Integration tests: real DuckDB (`:memory:`) + CSV fixtures.
- No tests against live SQL Server, BC, IFS, or BlueCherry.
- Target: 80% line coverage across `src/dq/` and `src/transform/`.
- Both full example pipelines in this file must parse cleanly in the config tests.

**Required test cases:**

Config loader: `${ENV_VAR}` resolution · missing var → `ConfigError` ·
invalid YAML → `ZodError` · minimal pipeline with all defaults · both example
pipelines in this spec parse cleanly.
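
The `${ENV_VAR}` behaviour under test can be sketched like this (a plausible implementation, not necessarily the one in `src/config/`; only the `ConfigError` name is taken from this spec):

```typescript
class ConfigError extends Error {} // name taken from this spec

// Replace every ${VAR} in a config string with its value from the environment,
// throwing ConfigError when a referenced variable is unset.
function resolveEnvVars(
  value: string,
  env: Record<string, string | undefined> = process.env,
): string {
  return value.replace(/\$\{([A-Z0-9_]+)\}/g, (_match, name: string) => {
    const resolved = env[name];
    if (resolved === undefined) {
      throw new ConfigError(`Environment variable ${name} is not set`);
    }
    return resolved;
  });
}
```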

DQ engine: `notNull` on null/empty/whitespace · `unique` with duplicates ·
`ukPostcode` valid and invalid formats · `allowedValues` case sensitivity ·
`stopOnCritical` throws `PipelineDQError` · reporter writes correct CSV and JSON.
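
For the `ukPostcode` fixtures, a simplified validator is enough to anchor the valid/invalid cases (the real rule may use a stricter pattern and different case handling; this regex is an approximation):

```typescript
// Approximate UK postcode shape: outward code + optional space + inward code.
// Not the full official grammar — good enough for designing fixtures.
// Case-insensitivity here is an assumption, not confirmed by the spec.
const UK_POSTCODE = /^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$/i;

function isUkPostcode(value: string): boolean {
  return UK_POSTCODE.test(value.trim());
}
```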

Transform engine: `concat` with separator · `lookup` miss + `optional: true` → null ·
`lookup` miss + `optional: false` → `TransformError` · `expression` basic eval ·
`expression` with `js:` prefix · `cleanse: trim|titleCase` · `cleanse: padStart:6:0` ·
`cleanse: normaliseUnicode` · `type: date` with `format: DD/MM/YYYY` ·
`type: boolean` all truthy/falsy variants.
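
The colon-argument cleanse syntax (`padStart:6:0`) implies a tiny parser along these lines — a sketch assuming ops are colon-delimited and that `normaliseUnicode` means NFC normalisation; the real engine lives in `src/transform/`:

```typescript
// Apply one cleanse op given in "name[:arg[:arg]]" form. Sketch only.
function applyCleanse(op: string, value: string): string {
  const [name, ...args] = op.split(':');
  switch (name) {
    case 'trim':
      return value.trim();
    case 'titleCase':
      return value.toLowerCase().replace(/\b[a-z]/g, (c) => c.toUpperCase());
    case 'padStart':
      return value.padStart(Number(args[0]), args[1] ?? ' ');
    case 'normaliseUnicode':
      return value.normalize('NFC'); // assumption: NFC, not NFKC
    default:
      throw new Error(`Unknown cleanse op: ${name}`);
  }
}
```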

BlueCherry adapter: missing required column → `ConfigError` at `connect()` ·
date columns formatted with `target.dateFormat` · header row present ·
`nullValue` respected · `template` CSV used as column order.

Staging store: insert/query round-trip all DuckDB types · `exportToCsv` delimiter
and header options · `:memory:` mode works correctly.

---

## ═══════════════════════════════════════════════════════════
## BUILD, SCRIPTS, CI
## ═══════════════════════════════════════════════════════════

**package.json scripts:**
```json
{
  "name": "@caracal-lynx/sluice",
  "scripts": {
    "build": "tsc -p tsconfig.json",
    "dev": "tsx watch src/cli.ts",
    "lint": "eslint src tests",
    "format": "prettier --write src tests",
    "test": "vitest run",
    "test:watch": "vitest",
    "test:cov": "vitest run --coverage",
    "sluice": "tsx src/cli.ts"
  },
  "bin": { "sluice": "dist/cli.js" }
}
```

Use `tsx` (not `ts-node`) for development execution — handles tsconfig path aliases
on Windows without extra configuration.

**GitHub Actions** (`.github/workflows/ci.yml`):
```yaml
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '24', cache: 'npm' }
      - run: npm ci
      - run: npm run lint
      - run: npm run build
      - run: npm run test:cov
      - uses: actions/upload-artifact@v4
        with: { name: coverage, path: coverage/ }
```

---

## ═══════════════════════════════════════════════════════════
## WINDOWS / POWERSHELL NOTES
## ═══════════════════════════════════════════════════════════

- All file paths: `path.join()` / `path.resolve()`. Never string concat with `/`.
- `.env` uses LF line endings (set in `.gitattributes`).
- The DuckDB npm package includes the `win32-x64` native binary automatically.
- Do not write Windows-only shell commands in CI (CI runs ubuntu-latest).
- Developer shell: PowerShell 7 on Windows Terminal.
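
The path rule is worth seeing concretely: `path.win32` and `path.posix` make both platforms' behaviour reproducible from any OS (illustration only; the file names are made up):

```typescript
import path from 'node:path';

// path.join picks the right separator for the platform; string concat with '/'
// produces mixed separators on Windows.
const importFile = path.win32.join('C:\\IFS', 'Import', 'customers.csv');
// → C:\IFS\Import\customers.csv

const reportFile = path.posix.join('out', 'reports', 'dq.json');
// → out/reports/dq.json
```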

---

## ═══════════════════════════════════════════════════════════
## WHAT NOT TO DO
## ═══════════════════════════════════════════════════════════

- Do not use `ts-node` — use `tsx`.
- Do not use `jest` — use `vitest`.
- Do not use `console.log` in `src/` — use the pino logger.
- Do not write manual TypeScript interfaces for config types — use `z.infer<>`.
- Do not use `eval()` or `new Function()` — use `expr-eval` or `vm.runInNewContext`.
- Do not hard-code connection strings, credentials, or client-specific values.
- Do not import from `@duckdb/node-api` directly outside `src/staging/store.ts`.
- Do not create `StagingStore` instances outside `PipelineRunner`.
- Do not add UI, REST server, or dashboard code.
- Do not add adapter-specific logic to `PipelineRunner`.
- Do not invent new top-level YAML keys — the schema is fixed.
- Do not add cleanse ops without adding them to the reference table in this file.
- Do not add BlueCherry entity types to `REQUIRED_COLUMNS` without verifying
  column names against actual BlueCherry import documentation first.
- Do not use `dayjs` plugins without importing them explicitly at the call site.

---

## ═══════════════════════════════════════════════════════════
## SUGGESTED BUILD ORDER FOR CLAUDE CODE
## ═══════════════════════════════════════════════════════════

Work phase by phase. Do not start the next phase until the current phase passes
`npm run build` and `npm test` without errors. Ask before proceeding if anything
in this spec is ambiguous.

1. **Scaffold** — `package.json`, `tsconfig.json`, `src/utils/`, `src/config/`.
   Verify both example pipelines parse cleanly.
2. **Staging store** — `src/staging/`. Unit tests with `:memory:`.
3. **Source adapters** — `csv` first, then `mssql`, `pg`, `xlsx`, `rest`.
   Mock all external connections in tests.
4. **DQ engine** — `src/dq/` including all rules and reporter.
5. **Transform engine** — `src/transform/` — all types, cleanse ops, expression eval.
6. **Target adapters** — `csv` → `ifs` → `bluecherry` → `bc` (BC is most complex;
   mock OAuth2 token endpoint in tests).
7. **PipelineRunner** — wire all phases; integration test both fixture pipelines.
8. **CLI** — all commands and exit codes.
9. **CI** — `.github/workflows/ci.yml`.

---

*This file is the authoritative specification for Sluice. If anything in the
codebase contradicts this file, the codebase is wrong. Update this file whenever
the architecture evolves — then tell Claude Code to re-read it before continuing.*