@caracal-lynx/sluice 0.1.1 → 0.1.3

package/CLAUDE.md CHANGED
# Sluice — CLAUDE.md
# Project specification for Claude Code
# Sluice: config-driven ETL toolkit for ERP data migrations
# npm package: @caracal-lynx/sluice
# Owner: Michael Scott, Caracal Lynx Ltd. (SC826823)
# Last updated: 2026-04-20

---

## Project overview

**Sluice** is a config-driven ETL toolkit for ERP data migrations, developed and
maintained by Caracal Lynx Ltd. The engine is written once; each client
engagement is delivered as a folder of YAML pipeline configs. There is no UI, no
server, and no cloud dependency — just the `sluice` CLI and a set of TypeScript
modules that can be imported by other tools (e.g. n8n custom nodes, GitHub Actions).

*Clean data flows through.*

**Known clients and targets:**

| Client | Source(s) | Target ERP | Adapter |
|---|---|---|---|
| Acme Corp | MSSQL legacy DB | IFS ERP | `ifs` |
| Style Co | MSSQL / CSV exports | BlueCherry ERP | `bluecherry` |

**Primary use cases:**
- Extract data from legacy SQL databases, CSV/Excel exports, and REST APIs
- Validate data quality against a configurable rule set
- Transform field mappings, apply lookups, cleanse values, evaluate expressions
- Load output to Business Central (BC) via REST API, IFS via CSV import,
  BlueCherry via CSV import, or generic CSV/JSON for any other target
- Run from the command line on a developer laptop (Windows, PowerShell 7)
- Run unattended in GitHub Actions CI

**Non-goals:**
- No web UI or dashboard
- No streaming / real-time ingestion
- No data warehouse or lake — DuckDB is used only as a local staging store
- No multi-tenant SaaS — this is a consultant's toolkit, not a product

**Related docs:**
- [README.md](README.md) — install, quick-start, composite rules (Tier 1)
- [PLUGINS.md](PLUGINS.md) — Tier 2 (file) and Tier 3 (npm) plugin author guide
- [docs/architecture-diagrams.md](docs/architecture-diagrams.md) — Mermaid diagrams of
  the single- and multi-source pipeline flow

---

## Repository structure

```
sluice/
├── CLAUDE.md ← you are here
├── PLUGINS.md ← Tier 2 / Tier 3 plugin author guide
├── README.md
├── package.json
├── tsconfig.json
├── tsconfig.test.json
├── .env.example
├── .gitignore
├── eslint.config.js
├── .prettierrc
├── .github/workflows/ci.yml
├── docs/
│ └── architecture-diagrams.md
├── examples/ ← sample pipelines (not run by tests)
│
├── src/
│ ├── index.ts ← public API barrel (re-exports from all modules)
│ ├── cli.ts ← commander CLI entry point
│ ├── runner.ts ← PipelineRunner (single-source)
│ ├── multi-source-runner.ts ← MultiSourcePipelineRunner (extends PipelineRunner)
│ │
│ ├── config/
│ │ ├── index.ts ← re-exports schema + types
│ │ ├── schema.ts ← Zod schema (PipelineSchema + sub-schemas)
│ │ ├── loader.ts ← YAML load + ${ENV_VAR} interp + composite-rule expansion + parse
│ │ └── types.ts ← re-exports of all inferred Zod types + guards
│ │
│ ├── adapters/
│ │ ├── source/
│ │ │ ├── index.ts ← barrel (self-registers built-ins on import)
│ │ │ ├── registry.ts ← SourceAdapterRegistry
│ │ │ ├── types.ts ← SourceAdapter + ExtractResult
│ │ │ ├── mssql.ts
│ │ │ ├── pg.ts
│ │ │ ├── csv.ts
│ │ │ ├── xlsx.ts
│ │ │ └── rest.ts
│ │ └── target/
│ │     ├── index.ts ← barrel (self-registers built-ins on import)
│ │     ├── registry.ts ← TargetAdapterRegistry
│ │     ├── types.ts ← TargetAdapter + LoadResult
│ │     ├── bc.ts ← Business Central REST (+ BcTokenManager)
│ │     ├── ifs.ts ← IFS ERP CSV import
│ │     ├── bluecherry.ts ← BlueCherry ERP CSV import
│ │     ├── csv.ts ← generic CSV
│ │     └── pg.ts
│ │
│ ├── staging/
│ │ ├── index.ts ← barrel
│ │ ├── store.ts ← DuckDB wrapper (the only file that imports `@duckdb/node-api`)
│ │ └── schema.ts ← ColumnMeta, quoteIdent, buildCreateTableSql
│ │
│ ├── dq/
│ │ ├── index.ts ← barrel
│ │ ├── engine.ts ← DQEngine
│ │ ├── reporter.ts ← writeRejectionCsv, writeSummaryJson
│ │ ├── types.ts ← DQSummary, ViolationCounts
│ │ └── rules/
│ │     ├── index.ts ← BUILT_IN_RULES map (id → Rule instance)
│ │     ├── types.ts ← Rule = RulePlugin, RuleViolation (re-exported from plugins)
│ │     ├── notNull.ts
│ │     ├── unique.ts
│ │     ├── pattern.ts
│ │     ├── email.ts
│ │     ├── ukPostcode.ts
│ │     ├── maxLength.ts
│ │     ├── minMax.ts
│ │     └── allowedValues.ts
│ │
│ ├── transform/
│ │ ├── index.ts
│ │ ├── engine.ts ← TransformEngine (built-in types + custom plugins)
│ │ ├── lookup.ts
│ │ ├── cleanse.ts
│ │ ├── expression.ts ← expr-eval + `js:` vm sandbox
│ │ └── types.ts ← TransformResult
│ │
│ ├── merge/ ← multi-source merge engine + strategies
│ │ ├── index.ts ← MergeStrategyRegistry (pre-registers all built-ins)
│ │ ├── engine.ts ← MergeEngine
│ │ ├── sql-builder.ts ← shared JOIN + coalesce SQL helpers
│ │ ├── conflict-log.ts ← conflict CSV writer
│ │ ├── types.ts ← MergeStrategyPlugin, MergeSourceMeta, MergeResult
│ │ └── strategies/
│ │     ├── index.ts
│ │     ├── coalesce.ts
│ │     ├── priority-override.ts
│ │     ├── union.ts
│ │     └── intersect.ts
│ │
│ ├── plugins/ ← Tier 2 / Tier 3 plugin system
│ │ ├── index.ts ← barrel
│ │ ├── types.ts ← RulePlugin, TransformPlugin, PluginPackage
│ │ ├── registry.ts ← RuleRegistry, TransformRegistry (custom plugin holders)
│ │ └── loader.ts ← loadPlugins (file-based), loadNpmPlugins (sluice.config.yaml)
│ │
│ ├── enrich/ ← Phase 4a public surface (types only)
│ │ └── types.ts ← EnrichPlugin, EnrichResult, EnrichOptions, EnrichSummary,
│ │                EnrichPhaseFactory (implementation lives in private
│ │                @caracal-lynx/sluice-enrich package)
│ │
│ └── utils/
│     ├── index.ts
│     ├── logger.ts ← pino singleton
│     ├── env.ts ← loadEnv + requireEnv
│     └── errors.ts
│
├── tests/
│ ├── fixtures/
│ │ ├── acme-corp-customers.pipeline.yaml
│ │ ├── style-co-styles.pipeline.yaml
│ │ ├── style-co-products-merged.pipeline.yaml ← multi-source
│ │ ├── multi-source-no-merge.pipeline.yaml ← negative-path multi-source
│ │ ├── shared-rules.yaml ← composite rule library
│ │ └── plugins/ ← test plugin fixtures (Tier 2 files)
│ │
│ ├── unit/
│ │ ├── cli.test.ts
│ │ ├── runner.test.ts
│ │ ├── adapters/
│ │ │ ├── source/ ← csv, mssql, pg, rest, xlsx
│ │ │ └── target/ ← bc, bluecherry, ifs, pg
│ │ ├── config/ ← loader, schema, multi-source, composite-expansion
│ │ ├── dq/ ← engine, reporter, rules
│ │ ├── merge/ ← engine, registry, strategies
│ │ ├── plugins/ ← loader, registry, composite-expansion
│ │ ├── staging/ ← store
│ │ └── transform/ ← cleanse, expression, engine, custom
│ │
│ └── integration/
│     ├── cli-check.test.ts
│     ├── cli-commands.test.ts
│     ├── cli-plugins.test.ts
│     ├── csv-to-csv-mvp.test.ts
│     ├── dq-integration.test.ts
│     ├── style-co-styles-mini.test.ts
│     ├── merge-strategies.test.ts
│     ├── multi-source-runner.test.ts
│     └── runner-plugin-wiring.test.ts
│
└── clients/ ← gitignored in this repo; each client
    ├── acme-corp/   gets their own private repo
    │ ├── .env
    │ ├── customers.pipeline.yaml
    │ ├── items.pipeline.yaml
    │ ├── vendors.pipeline.yaml
    │ └── lookups/
    └── style-co/
      ├── .env
      ├── styles.pipeline.yaml
      ├── vendors.pipeline.yaml
      ├── purchase-orders.pipeline.yaml
      └── lookups/
```

---

## Technology stack

| Concern | Package | Notes |
|---|---|---|
| Language | TypeScript 5.x | `strict: true`, `exactOptionalPropertyTypes: true` |
| Runtime | Node.js 24 LTS | No Bun, no Deno — must run in GitHub Actions |
| Config parsing | `js-yaml` | YAML 1.2 only |
| Config validation | `zod` v3 | All config types inferred from Zod |
| SQL Server | `mssql` | Trusted + SQL auth both supported |
| PostgreSQL | `pg` + `@types/pg` | |
| CSV | `csv-parse` + `csv-stringify` | Streaming |
| Excel | `xlsx` (SheetJS) | Read-only |
| HTTP | `axios` + `axios-retry` | 3 retries, exponential backoff |
| Dates | `dayjs` | All date parsing and formatting |
| Staging | `@duckdb/node-api` | Embedded; no server. Replaces deprecated `duckdb` package — ABI-stable (no `npm rebuild` after Node ABI bumps). |
| CLI | `commander` v12 | |
| Logging | `pino` | JSON; `pino-pretty` in dev |
| Testing | `vitest` | No Jest |
| Env vars | `dotenv` | Loaded once at CLI entry |
| Linting | `eslint` + `@typescript-eslint` | |
| Formatting | `prettier` | 2-space, single quotes, trailing commas |
| Expressions | `expr-eval` | Safe expression parser; no eval() |

---

## TypeScript conventions

- **All config types come from Zod inference.** Do not write manual `type` or
  `interface` declarations for anything that maps to pipeline config.
  Use `z.infer<typeof SomeSchema>`.
- **No `any`.** Use `unknown` and narrow explicitly.
- **No `eval()` or `Function()`** anywhere. See expression evaluator section.
- **Async throughout.** All I/O must be `async/await`. No callbacks.
- **Error handling:** throw typed errors from `src/utils/errors.ts`. Never throw
  raw strings. Catch at the `PipelineRunner` boundary.
- **Barrel exports:** each directory has an `index.ts`. Do not import from internal
  files across module boundaries.
- **No circular imports.** Dependency direction:
  `cli` → `runner` / `multi-source-runner` → `adapters`, `staging`, `dq`,
  `transform`, `merge`, `plugins`, `config`, `enrich`. `plugins/` is imported by
  `runner`, `dq`, `transform`, and `merge`; it must not import any of them.
  `enrich/` is type-only (no runtime imports of other modules in this repo —
  the implementation lives in the private `@caracal-lynx/sluice-enrich`).
  Utils are imported by everyone.
- **Path aliases:** `@/` → `src/` in tsconfig.
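
The typed-error convention above can be sketched as a small class hierarchy. The class names and the `phase` property here are illustrative assumptions, not the actual contents of `src/utils/errors.ts`:

```typescript
// Hypothetical sketch of a typed error hierarchy; the real
// src/utils/errors.ts may define different classes and fields.
export class SluiceError extends Error {
  constructor(message: string, readonly phase: string) {
    super(message);
    this.name = new.target.name; // subclass name survives in logs
  }
}

export class ConfigError extends SluiceError {
  constructor(message: string) {
    super(message, 'config');
  }
}

// Usage: throw typed errors, catch at the runner boundary.
function loadPipeline(path: string): never {
  throw new ConfigError(`pipeline file not found: ${path}`);
}

try {
  loadPipeline('./missing.yaml');
} catch (err) {
  if (err instanceof SluiceError) {
    console.error(`[${err.phase}] ${err.message}`);
  }
}
```

Catching on the base class at the `PipelineRunner` boundary lets each phase throw its own subclass while the runner handles them uniformly.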

---

## ═══════════════════════════════════════════════════════════
## YAML PIPELINE CONFIG SPECIFICATION
## ═══════════════════════════════════════════════════════════

Every pipeline is a single YAML file. One file = one migrated entity
(e.g. customers, items, vendors, styles, purchase orders).

### Top-level structure

```yaml
pipeline: { ... }    # identity and metadata
source: { ... }      # where to read from
enrich: { ... }      # OPTIONAL — Phase 4a; external API lookups (private)
dq: { ... }          # data quality rules
transform: { ... }   # field mappings and lookups
target: { ... }      # where to write to
run: { ... }         # execution options (all fields optional; all have defaults)
```

> **Phase 4a — Enrich Phase (private):** the `enrich:` block, when present, runs after Extract (and after Merge for multi-source pipelines) and before DQ. The framework that drives it lives in the **private** `@caracal-lynx/sluice-enrich` package — the open-source core only ships the Zod schema, the public `EnrichPlugin` interface (`src/enrich/types.ts`), and the `registerEnrichPhase()` injection hook on `PipelineRunner`. With `sluice-enrich` not installed, an `enrich:` block is parsed and validated but the phase is skipped with a `WARN` log. See [docs/PHASE-04-enrich-phase.md](docs/PHASE-04-enrich-phase.md) for the full spec.

---

### `pipeline` section

```yaml
pipeline:
  name: acme-corp-customers   # REQUIRED. Slug: lowercase, hyphens only.
                              # Used in output filenames and log messages.
  client: acme-corp           # REQUIRED. Client identifier.
  version: "1.0"              # REQUIRED. Quote to ensure string type.
  entity: CustomerInfo        # REQUIRED. Logical entity name (used in
                              # load reports and target adapter metadata).
  description: >              # Optional. Human-readable description.
    Customer master migration —
    legacy SQL to IFS ERP
```

---

### `source` section

Exactly one of `query`, `file`, or `endpoint` must be present.

```yaml
source:
  adapter: mssql              # REQUIRED. One of: mssql | pg | csv | xlsx | rest

  # ── SQL adapters (mssql, pg) ──────────────────────────────
  connection: ${SOURCE_MSSQL} # Connection string from .env.
                              # mssql: mssql://user:pass@host/database
                              # Or a JSON string for trusted/advanced config.
  query: |
    SELECT c.CUST_CODE, c.CUST_NAME, c.POST_CODE
    FROM dbo.Customers c
    WHERE c.Active = 1

  # ── CSV adapter ───────────────────────────────────────────
  file: ./data/customers.csv  # Path or glob (./data/export-*.csv).
  delimiter: ","              # Default: ","
  encoding: utf-8             # Default: utf-8

  # ── XLSX adapter ──────────────────────────────────────────
  file: ./data/customers.xlsx
  sheet: "Customer Export"    # Sheet name or 0-based index. Default: 0.

  # ── REST adapter ──────────────────────────────────────────
  endpoint: ${API_BASE}/customers  # Full URL. ${ENV_VAR} resolved at runtime.
  headers:                    # Optional. Added to every request.
    Authorization: Bearer ${API_TOKEN}
    Accept: application/json
  pagination:                 # Optional. Omit for single-page responses.
    type: offset              # offset | cursor | page
    pageSize: 100
    pageParam: skip           # Query param name for the offset/page value.
    totalField: data.total    # Dot-path to total count in response body.
    dataField: data.items     # Dot-path to the records array.
    cursorField: nextCursor   # For cursor pagination: field in response body.
    cursorParam: cursor       # For cursor pagination: query param name.
```

---

### `dq` section

```yaml
dq:
  stopOnCritical: true        # Default: true. Halt pipeline if any critical rule fails.
  rejectionFile: ./output/acme-corp-customers-rejected.csv
                              # Default: ./output/{pipeline.name}-rejected.csv

  rules:
    - field: FIELD_NAME       # Source column name (pre-transform).
      checks:

        # notNull — fails if null, undefined, empty string, or whitespace-only
        - type: notNull
          severity: critical

        # unique — fails if value appears more than once across the full dataset
        - type: unique
          severity: critical

        # pattern — ECMAScript regex, tested with new RegExp(value)
        - type: pattern
          value: "^[A-Z0-9]{3,10}$"
          severity: warning
          message: "Must be 3-10 uppercase alphanumeric characters"
          # message is optional; overrides the default.

        # email — RFC 5322-ish email validation
        - type: email
          severity: warning

        # ukPostcode — all current UK postcode formats; strips spaces before testing
        - type: ukPostcode
          severity: warning

        # maxLength — maximum string length (integer)
        - type: maxLength
          value: 100
          severity: warning

        # min / max — numeric comparison; coerces value to float
        - type: min
          value: 0
          severity: critical
        - type: max
          value: 500000
          severity: warning

        # allowedValues — case-sensitive array of permitted string values
        - type: allowedValues
          value: [GB, IE, US, DE, FR]
          severity: warning

  # Severity:
  #   critical  row is rejected; pipeline halts if stopOnCritical: true
  #   warning   row is flagged in rejection report but NOT removed from output
  #   info      recorded in summary JSON only
```

---

### `transform` section

```yaml
transform:

  # ── Lookup tables ─────────────────────────────────────────
  # Loaded once at start of transform phase, cached in memory.
  lookups:
    - name: currencyMap       # Referenced by field mappings.
      source:                 # Any source adapter works here.
        adapter: csv
        file: ./lookups/currency-codes.csv
      key: legacyCode         # Column to match against source value.
      value: isoCode          # Column to return as resolved value.

    - name: acctMgrMap
      source:
        adapter: mssql
        connection: ${SOURCE_MSSQL}
        query: "SELECT STAFF_ID as key, IFS_USER_ID as value FROM dbo.Staff"
      key: key
      value: value

  # ── Field mappings ────────────────────────────────────────
  fields:

    # type: string
    - from: CUST_CODE
      to: CustomerNo
      type: string
      max: 20                 # Optional. Truncate after cleanse.

    - from: CUST_NAME
      to: Name
      type: string
      max: 100
      cleanse: trim|titleCase # Pipe-separated cleanse ops. See table below.

    # type: number — coerce to integer; throws if NaN
    - from: QTY
      to: Quantity
      type: number

    # type: decimal — fixed precision; stored as string in staging
    - from: CREDIT_LIMIT
      to: CreditLimit
      type: decimal
      precision: 2            # Default: 2

    # type: boolean
    # Truthy: '1','true','yes','y','t' (case-insensitive). All else false.
    - from: IS_ACTIVE
      to: Active
      type: boolean

    # type: date — parse source date, output as dateFormat (default ISO)
    - from: START_DATE
      to: StartDate
      type: date
      format: DD/MM/YYYY      # Optional source parse format (dayjs tokens).

    # type: lookup — resolve via a named lookup table
    - from: CURRENCY
      to: CurrencyCode
      type: lookup
      lookup: currencyMap     # Must match a lookup name above.
      default: GBP            # Emitted when lookup key not found.
      optional: false         # Default: false. true = null on miss (no error).

    # type: concat — join multiple source fields
    - from: [ADDR1, ADDR2]    # Array of source field names.
      to: Address1
      type: concat
      separator: ", "         # Default: " "
      cleanse: trim|nullIfEmpty

    # type: constant — emit a fixed value regardless of source data
    - to: CustomerGroup
      type: constant
      value: DOMESTIC

    # type: expression — evaluate against source row
    - to: SearchName
      type: expression
      value: "row.CUST_NAME.toUpperCase().substring(0, 20)"
      # For logic beyond expr-eval, prefix with js:
      # value: "js: row.PRICE * (1 - row.DISCOUNT / 100)"

  # Common optional field properties:
  #   optional: true   null result does not cause a pipeline error
  #   default: <val>   fallback value if source is null/empty
  #   max: <n>         truncate string to n chars AFTER cleanse
```

#### Cleanse operations reference

Applied left-to-right in the pipe chain. Defined in `src/transform/cleanse.ts`.

| Op | Example input | Example output |
|---|---|---|
| `trim` | `" hello "` | `"hello"` |
| `uppercase` | `"hello"` | `"HELLO"` |
| `lowercase` | `"HELLO"` | `"hello"` |
| `titleCase` | `"john smith"` | `"John Smith"` |
| `stripNonAlpha` | `"AB-12!"` | `"AB"` |
| `stripNonNumeric` | `"AB-12!"` | `"12"` |
| `stripWhitespace` | `"h e l l o"` | `"hello"` |
| `padStart:6:0` | `"42"` | `"000042"` |
| `truncate:20` | 21-char string | 20-char string |
| `nullIfEmpty` | `""` | `null` |
| `normaliseQuotes` | `"it\u2019s"` | `"it's"` |
| `normaliseUnicode` | `"caf\u00e9"` | `"cafe"` (NFD→ASCII) |
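
The left-to-right pipe chain can be sketched as a reducer over an op map. This is a toy subset for illustration — the real `cleanse.ts` also handles parameterised ops like `padStart:6:0` and `truncate:20`:

```typescript
// Toy subset of the cleanse op table; parameterised ops omitted.
const OPS: Record<string, (s: string) => string> = {
  trim: (s) => s.trim(),
  uppercase: (s) => s.toUpperCase(),
  lowercase: (s) => s.toLowerCase(),
  stripNonNumeric: (s) => s.replace(/[^0-9]/g, ''),
  stripWhitespace: (s) => s.replace(/\s+/g, ''),
};

// Apply a pipe-separated chain like "trim|uppercase" left to right.
function cleanse(value: string, chain: string): string {
  return chain.split('|').reduce((acc, op) => {
    const fn = OPS[op];
    if (!fn) throw new Error(`unknown cleanse op: ${op}`);
    return fn(acc);
  }, value);
}
```

Order matters: `trim|stripNonNumeric` applied to `"  AB-12! "` first trims, then strips, yielding `"12"`.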

---

### `target` section

```yaml
target:
  adapter: ifs                # REQUIRED. One of:
                              # bc | ifs | bluecherry | csv | pg | rest

  # ── IFS adapter ───────────────────────────────────────────
  adapter: ifs
  output: ./output/acme-corp-customers-ifs.csv
  entity: CustomerInfo        # IFS entity name (used in import log).
  includeHeader: false        # Default: false (standard IFS import format).
  columnOrder:                # Optional. Forces specific column ordering.
    - CustomerNo              # Must match transform 'to' field names.
    - Name
    - Address1
  dateFormat: YYYY-MM-DD      # Default: YYYY-MM-DD
  delimiter: ","              # Default: ","
  encoding: utf-8             # Default: utf-8

  # ── BlueCherry adapter ────────────────────────────────────
  adapter: bluecherry
  entity: Style               # REQUIRED. One of: Style | Vendor |
                              # PurchaseOrder | PODetail | Season | ColourSize
  output: ./output/style-co-styles-bc.csv
  template: default           # Optional. 'default' uses built-in required
                              # columns. Or path to a header-only template CSV
                              # whose first row defines column order.
  includeHeader: true         # Default: true (BlueCherry expects headers).
  dateFormat: MM/DD/YYYY      # Default: MM/DD/YYYY (BlueCherry is US-origin).
  delimiter: ","
  encoding: utf-8
  nullValue: ""               # How nulls are rendered. Default: ""

  # ── Business Central REST adapter ─────────────────────────
  adapter: bc
  baseUrl: ${BC_BASE_URL}
  company: ${BC_COMPANY}
  entity: customers           # OData entity name (lowercase, plural).
  apiVersion: v2.0            # Default: v2.0
  onConflict: fail            # fail | upsert. Default: fail.
  batchEndpoint: true         # Use OData $batch. Default: true.

  # ── Generic CSV adapter ───────────────────────────────────
  adapter: csv
  output: ./output/data.csv
  includeHeader: true
  delimiter: ","
  encoding: utf-8
  nullValue: ""

  # ── PostgreSQL adapter ────────────────────────────────────
  adapter: pg
  connection: ${TARGET_PG}
  table: customers
  schema: public              # Default: public
  onConflict: fail            # fail | upsert | ignore
  upsertKey: [customer_no]    # REQUIRED if onConflict: upsert
```

---

### `run` section

All fields optional. Shown with defaults.

```yaml
run:
  mode: full                  # full | incremental | validate-only
  batchSize: 500              # Rows per DuckDB insert batch.
  onError: continue           # continue | stop
  logLevel: info              # debug | info | warn | error
  dryRun: false               # true: DQ + transform, no output written.
  outputDir: ./output         # Base directory for all output files.
  stagingDb: ""               # DuckDB path. Default: {outputDir}/{name}.duckdb
                              # Set ':memory:' to force in-memory mode.
  incrementalField: UPDATED_AT  # Source field for incremental mode.
  incrementalSince: ""        # ISO datetime. If empty, reads from state file.
```

---

### Full example — Acme Corp customers (MSSQL → IFS)

```yaml
pipeline:
  name: acme-corp-customers
  client: acme-corp
  version: "1.0"
  entity: CustomerInfo
  description: Customer master — legacy Sage SQL to IFS ERP

source:
  adapter: mssql
  connection: ${SOURCE_MSSQL}
  query: |
    SELECT
      c.CUST_CODE, c.CUST_NAME, c.ADDR1, c.ADDR2,
      c.POST_CODE, c.COUNTRY, c.EMAIL, c.TEL,
      c.CREDIT_LIMIT, c.CURRENCY, c.ACCT_MGR_ID
    FROM dbo.Customers c
    WHERE c.Active = 1 AND c.DELETED = 0

dq:
  stopOnCritical: true
  rejectionFile: ./output/acme-corp-customers-rejected.csv
  rules:
    - field: CUST_CODE
      checks:
        - { type: notNull, severity: critical }
        - { type: unique, severity: critical }
        - { type: pattern, value: "^[A-Z0-9]{3,10}$", severity: warning }
    - field: CUST_NAME
      checks:
        - { type: notNull, severity: critical }
        - { type: maxLength, value: 100, severity: warning }
    - field: POST_CODE
      checks:
        - { type: ukPostcode, severity: warning }
    - field: EMAIL
      checks:
        - { type: email, severity: warning }
    - field: CREDIT_LIMIT
      checks:
        - { type: min, value: 0, severity: critical }
        - { type: max, value: 500000, severity: warning }
    - field: COUNTRY
      checks:
        - { type: allowedValues, value: [GB, IE, US, DE, FR], severity: warning }

transform:
  lookups:
    - name: currencyMap
      source: { adapter: csv, file: ./lookups/currency-codes.csv }
      key: legacyCode
      value: isoCode
    - name: acctMgrMap
      source:
        adapter: mssql
        connection: ${SOURCE_MSSQL}
        query: "SELECT STAFF_ID as key, IFS_USER_ID as value FROM dbo.Staff"
      key: key
      value: value
  fields:
    - { from: CUST_CODE, to: CustomerNo, type: string, max: 20 }
    - { from: CUST_NAME, to: Name, type: string, max: 100, cleanse: trim|titleCase }
    - { from: [ADDR1, ADDR2], to: Address1, type: concat, separator: ", ", cleanse: trim }
    - { from: POST_CODE, to: ZipCode, type: string, cleanse: trim|uppercase }
    - { from: COUNTRY, to: Country, type: string, default: GB }
    - { from: CURRENCY, to: CurrencyCode, type: lookup, lookup: currencyMap, default: GBP }
    - { from: ACCT_MGR_ID, to: SalesmanCode, type: lookup, lookup: acctMgrMap, optional: true }
    - { from: CREDIT_LIMIT, to: CreditLimit, type: decimal, precision: 2 }
    - { from: EMAIL, to: Email, type: string, cleanse: trim|lowercase }
    - { to: CustomerGroup, type: constant, value: DOMESTIC }
    - { to: SearchName, type: expression, value: "row.CUST_NAME.toUpperCase().substring(0, 20)" }

target:
  adapter: ifs
  entity: CustomerInfo
  output: ./output/acme-corp-customers-ifs.csv
  includeHeader: false
  columnOrder: [CustomerNo, Name, Address1, ZipCode, Country, CurrencyCode,
                SalesmanCode, CreditLimit, Email, CustomerGroup, SearchName]

run:
  mode: full
  batchSize: 500
  logLevel: info
  dryRun: false
```

---

### Full example — Style Co styles (CSV → BlueCherry)

```yaml
pipeline:
  name: style-co-styles
  client: style-co
  version: "1.0"
  entity: Style
  description: Style master migration from legacy CSV exports to BlueCherry ERP

source:
  adapter: csv
  file: ./data/styles-export.csv
  encoding: utf-8

dq:
  stopOnCritical: true
  rejectionFile: ./output/style-co-styles-rejected.csv
  rules:
    - field: STYLE_NO
      checks:
        - { type: notNull, severity: critical }
        - { type: unique, severity: critical }
        - { type: maxLength, value: 20, severity: warning }
    - field: STYLE_DESC
      checks:
        - { type: notNull, severity: critical }
        - { type: maxLength, value: 255, severity: warning }
    - field: DIVISION
      checks:
        - { type: notNull, severity: critical }
        - { type: allowedValues, value: [WOMENS, MENS, ACCESSORIES], severity: warning }
    - field: SEASON_CODE
      checks:
        - { type: notNull, severity: warning }
        - { type: pattern, value: "^(SS|AW)[0-9]{2}$", severity: warning }
    - field: COST_PRICE
      checks:
        - { type: min, value: 0, severity: critical }
        - { type: max, value: 9999.99, severity: warning }
    - field: RETAIL_PRICE
      checks:
        - { type: min, value: 0, severity: critical }

transform:
  lookups:
    - name: divisionMap
      source: { adapter: csv, file: ./lookups/division-codes.csv }
      key: legacyCode
      value: bcCode
    - name: vendorMap
      source: { adapter: csv, file: ./lookups/vendor-codes.csv }
      key: legacyVendorCode
      value: bcVendorNo
  fields:
    - { from: STYLE_NO, to: StyleNo, type: string, max: 20, cleanse: trim|uppercase }
    - { from: STYLE_DESC, to: StyleDesc, type: string, max: 255, cleanse: trim|normaliseUnicode }
    - { from: DIVISION, to: Division, type: lookup, lookup: divisionMap }
    - { from: SEASON_CODE, to: Season, type: string, max: 10 }
    - { from: VENDOR_CODE, to: VendorNo, type: lookup, lookup: vendorMap, optional: true }
    - { from: COST_PRICE, to: CostPrice, type: decimal, precision: 2 }
    - { from: RETAIL_PRICE, to: RetailPrice, type: decimal, precision: 2 }
    - { from: WEIGHT_KG, to: Weight, type: decimal, precision: 3, default: "0.000" }
    - { from: COUNTRY_ORIG, to: CountryOrigin, type: string, default: GB }
    - { from: FIBRE_CONTENT, to: FibreContent, type: string, max: 200, cleanse: trim }
    - { to: ActiveFlag, type: constant, value: "Y" }
    - { to: CreatedDate, type: expression, value: "js: new Date().toLocaleDateString('en-US')" }

target:
  adapter: bluecherry
  entity: Style
  output: ./output/style-co-styles-bc.csv
  includeHeader: true
  dateFormat: MM/DD/YYYY
  nullValue: ""

run:
  mode: full
  batchSize: 200
  logLevel: info
  dryRun: false
```

---

## ═══════════════════════════════════════════════════════════
## MULTI-SOURCE PIPELINES (Phase 3)
## ═══════════════════════════════════════════════════════════

A multi-source pipeline replaces the single `source:` block with a top-level
`sources:` array (min 2 entries) plus a `merge:` block. The rest of the YAML
(`pipeline`, `dq`, `transform`, `target`, `run`) is unchanged. `PipelineSchema`
requires *either* `source` (single) *or* both `sources` + `merge` (multi) —
never both — and the CLI auto-routes multi-source configs to
`MultiSourcePipelineRunner` (see `src/cli.ts:createRunnerForPipeline`).

### Top-level layout

```yaml
pipeline: { ... }
sources: [ { ... }, { ... } ]   # REQUIRED in multi-source mode; min 2 entries
merge: { ... }                  # REQUIRED when `sources` is present
dq: { ... }
transform: { ... }
target: { ... }
run: { ... }
```

### `sources` entries

Each entry is a `SourceConfig` with three extra multi-source-only fields:

```yaml
sources:
  - id: sql-server            # REQUIRED. Lowercase alphanumeric + hyphens only;
                              # must be unique across the array; used as the
                              # staging table suffix (stg_raw_sql-server).
    priority: 1               # REQUIRED. Positive integer. Lower number =
                              # higher precedence in coalesce / priority-override.
    adapter: mssql
    connection: ${SOURCE_2_MSSQL}
    query: |
      SELECT STYLE_NO, STYLE_DESC, COST_PRICE FROM dbo.Styles WHERE Active = 1

  - id: excel
    priority: 2
    adapter: xlsx
    file: ./data/product-data.xlsx
    sheet: "Products"
    rename:                   # Optional. { 'old column': 'new column' }.
      Style Number: STYLE_NO  # Applied in-place after extract, before DQ and
      Description: STYLE_DESC # merge. Intended for CSV/XLSX sources where
      Fibre: FIBRE_CONTENT    # column headers are fixed; SQL/REST sources
                              # should rename in the query or field selection.
                              # Unknown keys are logged as warnings, not errors.
```
827
-
828
- ### `merge` block
829
-
830
- ```yaml
831
- merge:
832
- key: STYLE_NO # REQUIRED. Single column name or array of
833
- # columns (composite key). Must exist in every
834
- # source after `rename` is applied.
835
-
836
- strategy: coalesce # Default: coalesce. One of:
837
- # coalesce first non-null value wins
838
- # (priority-ordered; whitespace
839
- # treated as blank)
840
- # priority-override highest-priority source
841
- # wins (even if null/blank)
842
- # union all rows from all sources
843
- # (dedupe by key)
844
- # intersect only rows present in ALL
845
- # sources
846
-
847
- onUnmatched: include # Default: include. One of:
848
- # include (default) keep unmatched rows
849
- # exclude drop them
850
- # warn keep and log a warning
851
- # error fail the pipeline
852
- # Ignored by `intersect`, which always excludes.
853
-
854
- fieldStrategies: # Optional. Per-field overrides of the
855
- # top-level strategy.
856
- - field: FIBRE_CONTENT
857
- source: excel # Force this field to always come from the
858
- # named source, ignoring priority.
859
- - field: COST_PRICE
860
- strategy: priority-override # Override just this field's strategy.
861
-
862
- conflictLog: ./output/style-co-products-conflicts.csv
863
- # Optional. CSV of (key, field, winning_source,
864
- # winning_value, source_values). Only written
865
- # when at least one conflict is detected.
866
-
867
- incrementalSource: sql-server # REQUIRED when `run.mode: incremental`.
868
- # Must match one of the source `id` values.
869
- # The named source is filtered by
870
- # `run.incrementalField` / state-file lastRunAt;
871
- # other sources run full each time.
872
- ```
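The coalesce semantics described above (priority-ordered, first non-null wins, whitespace treated as blank) can be sketched per field. `coalesceField` is a hypothetical helper for illustration, not the engine's actual code:

```typescript
// Sketch of the per-field coalesce rule: visit sources in priority order
// (priority 1 first) and return the first value that is neither
// null/undefined nor blank after trimming.
type SourceValue = { sourceId: string; priority: number; value: unknown };

function coalesceField(values: SourceValue[]): unknown {
  const ordered = [...values].sort((a, b) => a.priority - b.priority);
  for (const v of ordered) {
    if (v.value === null || v.value === undefined) continue;
    if (typeof v.value === "string" && v.value.trim() === "") continue; // whitespace = blank
    return v.value;
  }
  return null; // every source was null or blank
}
```

With `priority-override`, by contrast, the lowest-numbered source would win even when its value is null or blank.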

### Multi-source DQ rules

`dq.rules[].sourceId` (optional) scopes a rule to a specific pre-merge source
table. Rules without `sourceId` run post-merge against `stg_merged`:

```yaml
dq:
  stopOnCritical: true
  rules:
    - field: STYLE_NO           # Pre-merge: runs against stg_raw_sql-server only.
      sourceId: sql-server
      checks:
        - { type: notNull, severity: critical }
        - { type: unique, severity: critical }

    - field: STYLE_DESC         # Post-merge: runs against stg_merged.
      checks:
        - { type: notNull, severity: critical }
        - { type: maxLength, value: 255, severity: warning }
```

Per-source rejection files are auto-named by appending `-{sourceId}` to the
configured `rejectionFile` stem. Rows failing a critical pre-merge rule are
filtered out of that source's staging table *before* the merge phase.
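The `-{sourceId}` naming convention can be sketched as a small helper; `perSourceRejectionFile` is a hypothetical name, not the runner's real function:

```typescript
// Sketch of per-source rejection-file naming: insert "-{sourceId}"
// between the configured rejectionFile stem and its extension.
import path from "node:path";

function perSourceRejectionFile(rejectionFile: string, sourceId: string): string {
  const ext = path.extname(rejectionFile); // e.g. ".csv"
  const stem = rejectionFile.slice(0, rejectionFile.length - ext.length);
  return `${stem}-${sourceId}${ext}`;
}
```

So a configured `./output/rejects.csv` yields `./output/rejects-sql-server.csv` for the `sql-server` source.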

### Full example

See [tests/fixtures/style-co-products-merged.pipeline.yaml](tests/fixtures/style-co-products-merged.pipeline.yaml)
for a complete, tested multi-source pipeline (MSSQL + REST + XLSX → BlueCherry
with `coalesce` + `fieldStrategies` + `incrementalSource`).

### Invocation

```bash
sluice check tests/fixtures/style-co-products-merged.pipeline.yaml
sluice run tests/fixtures/style-co-products-merged.pipeline.yaml
sluice merge list-strategies
sluice merge info coalesce
```

---

## ═══════════════════════════════════════════════════════════
## ZOD SCHEMA (src/config/schema.ts)
## ═══════════════════════════════════════════════════════════

Reproduce this schema exactly. Do not invent additional fields or rename enums.

```typescript
import { z } from 'zod';

const Severity = z.enum(['critical', 'warning', 'info']);
const SourceAd = z.enum(['mssql', 'pg', 'csv', 'xlsx', 'rest']);
const TargetAd = z.enum(['bc', 'ifs', 'bluecherry', 'csv', 'pg', 'rest']);
const CleanseOps = z.string().regex(/^[a-zA-Z|:0-9]+$/);

const PaginationSchema = z.object({
  type: z.enum(['offset', 'cursor', 'page']),
  pageSize: z.number().int().positive().default(100),
  pageParam: z.string().optional(),
  totalField: z.string().optional(),
  dataField: z.string().optional(),
  cursorField: z.string().optional(),
  cursorParam: z.string().optional(),
});

export const SourceSchema = z.object({
  adapter: SourceAd,
  connection: z.string().optional(),
  query: z.string().optional(),
  file: z.string().optional(),
  endpoint: z.string().optional(),
  headers: z.record(z.string()).optional(),
  delimiter: z.string().default(','),
  encoding: z.string().default('utf-8'),
  sheet: z.union([z.string(), z.number()]).optional(),
  pagination: PaginationSchema.optional(),
}).refine(s => s.query || s.file || s.endpoint,
  { message: 'source must have query, file, or endpoint' });

const CheckType = z.enum([
  'notNull', 'unique', 'pattern', 'email', 'ukPostcode',
  'maxLength', 'min', 'max', 'allowedValues',
]);

const CheckSchema = z.object({
  type: CheckType,
  value: z.union([z.string(), z.number(), z.array(z.string())]).optional(),
  severity: Severity,
  message: z.string().optional(),
});

const DqRuleSchema = z.object({
  field: z.string(),
  checks: z.array(CheckSchema).min(1),
});

export const DqSchema = z.object({
  stopOnCritical: z.boolean().default(true),
  rejectionFile: z.string().optional(),
  rules: z.array(DqRuleSchema).default([]),
});

const LookupSchema = z.object({
  name: z.string(),
  source: SourceSchema,
  key: z.string(),
  value: z.string(),
});

const FieldType = z.enum([
  'string', 'number', 'decimal', 'boolean', 'date',
  'lookup', 'concat', 'constant', 'expression',
]);

const FieldMappingSchema = z.object({
  from: z.union([z.string(), z.array(z.string())]).optional(),
  to: z.string(),
  type: FieldType,
  max: z.number().optional(),
  precision: z.number().optional(),
  format: z.string().optional(),
  cleanse: CleanseOps.optional(),
  lookup: z.string().optional(),
  separator: z.string().optional(),
  value: z.union([z.string(), z.number(), z.boolean()]).optional(),
  default: z.union([z.string(), z.number(), z.boolean(), z.null()]).optional(),
  optional: z.boolean().default(false),
});

export const TransformSchema = z.object({
  lookups: z.array(LookupSchema).default([]),
  fields: z.array(FieldMappingSchema).min(1),
});

export const TargetSchema = z.object({
  adapter: TargetAd,
  output: z.string().optional(),
  entity: z.string().optional(),
  includeHeader: z.boolean().optional(),
  columnOrder: z.array(z.string()).optional(),
  dateFormat: z.string().optional(),
  delimiter: z.string().default(','),
  encoding: z.string().default('utf-8'),
  nullValue: z.string().default(''),
  template: z.string().optional(),
  // BC REST
  baseUrl: z.string().optional(),
  company: z.string().optional(),
  apiVersion: z.string().default('v2.0'),
  onConflict: z.enum(['fail', 'upsert', 'ignore']).default('fail'),
  upsertKey: z.array(z.string()).optional(),
  batchEndpoint: z.boolean().default(true),
  // PostgreSQL
  connection: z.string().optional(),
  table: z.string().optional(),
  schema: z.string().default('public'),
});

export const RunSchema = z.object({
  mode: z.enum(['full', 'incremental', 'validate-only']).default('full'),
  batchSize: z.number().int().positive().default(500),
  onError: z.enum(['continue', 'stop']).default('continue'),
  logLevel: z.enum(['debug', 'info', 'warn', 'error']).default('info'),
  dryRun: z.boolean().default(false),
  outputDir: z.string().default('./output'),
  stagingDb: z.string().default(''),
  // Phase 4a — enrich tuning (consumed by @caracal-lynx/sluice-enrich)
  enrichConcurrency: z.number().int().positive().default(5),
  enrichTimeoutMs: z.number().int().positive().default(5000),
  enrichMaxRetries: z.number().int().min(0).max(5).default(3),
  incrementalField: z.string().optional(),
  incrementalSince: z.string().optional(),
});

export const PipelineSchema = z.object({
  pipeline: z.object({
    name: z.string().regex(/^[a-z0-9-]+$/),
    client: z.string(),
    version: z.string(),
    entity: z.string(),
    description: z.string().optional(),
  }),
  source: SourceSchema,
  enrich: EnrichSchema.optional(), // Phase 4a — runs between Extract/Merge and DQ
  dq: DqSchema,
  transform: TransformSchema,
  target: TargetSchema,
  run: RunSchema.default({}),
});

// Inferred types — use these everywhere; do not write manual interfaces.
export type Pipeline = z.infer<typeof PipelineSchema>;
export type SourceConfig = z.infer<typeof SourceSchema>;
export type TargetConfig = z.infer<typeof TargetSchema>;
export type RunConfig = z.infer<typeof RunSchema>;
export type FieldMapping = z.infer<typeof FieldMappingSchema>;
export type DqRule = z.infer<typeof DqRuleSchema>;
export type Lookup = z.infer<typeof LookupSchema>;
```

### Phase 2 schema additions (already in `src/config/schema.ts`)

The following are forward-looking additions that extend the canonical schema above.
They are live in the codebase and tested. Do not remove them.

- **`DqSchema.rulesFile`** (`z.string().optional()`) — path to a composite rule
  library YAML file. `ConfigLoader` expands composite rule references into
  built-in check types before Zod validation, so the pipeline runner only sees
  standard checks.
- **`FieldType` includes `'custom'`** — delegates to a `TransformPlugin` via
  `customOp`. Requires `customOp` to be set (enforced by a `.refine()`).
- **`FieldMappingSchema.customOp`** (`z.string().optional()`) — plugin ID for
  `type: custom` fields.
- **`FieldMappingSchema.options`** (`z.record(z.unknown()).optional()`) — arbitrary
  per-plugin config passed through to the transform plugin.
- **`FieldMappingSchema` refinement** — field types in `TYPES_REQUIRING_FROM`
  (`string`, `number`, `decimal`, `boolean`, `date`, `lookup`, `concat`) must
  declare `from`. Only `constant`, `expression`, and `custom` may omit it.
- **`TargetSchema` refinement** — when `onConflict: 'upsert'`, a non-empty
  `upsertKey` is required (checked at config-parse time).
- **`ToolkitConfigSchema`** — schema for `sluice.config.yaml` (toolkit-level
  plugin loading). Consumed by `PipelineRunner.loadAllPlugins()` via
  `plugins/loader.ts → loadNpmPlugins()` at the start of every run.
- **`CompositeRuleSchema` / `CompositeRuleLibrarySchema`** — schemas for the
  shared rule library YAML files referenced by `dq.rulesFile`.

### Phase 3 schema additions (multi-source merge)

- **`DqRuleSchema.sourceId`** (`z.string().optional()`) — scopes a rule to a
  named pre-merge source; omitted for post-merge rules.
- **`PipelineSchema.source`** — now `optional()`; mutually exclusive with
  `sources` (enforced by `.refine()`).
- **`PipelineSchema.sources`** (`z.array(MultiSourceEntrySchema).min(2).optional()`)
  — the multi-source array. Refinement also checks unique source ids and
  (in incremental mode) that `merge.incrementalSource` matches a source id.
- **`PipelineSchema.merge`** (`MergeSchema.optional()`) — per-pipeline merge
  config. Defaults: `strategy: 'coalesce'`, `onUnmatched: 'include'`.
- **`MergeSchema`** — `key`, `strategy`, `onUnmatched`, `fieldStrategies[]`,
  `conflictLog`, `incrementalSource`.
- **`MergeFieldStrategySchema`** — per-field override: `field`, optional
  `strategy`, optional `source` (at least one required).
- **`MultiSourceEntrySchema`** — extends `SourceBaseSchema` with `id`,
  `priority`, and optional `rename`.
- **`isSingleSource(p)` / `isMultiSource(p)`** — exported type guards that
  narrow `Pipeline` to the single- or multi-source shape.
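The single/multi exclusivity rule and the exported type guards can be sketched without Zod. The shapes below are simplified stand-ins for the inferred `Pipeline` type, for illustration only:

```typescript
// Simplified stand-in shapes showing the single-vs-multi exclusivity
// check and the narrowing type guards described above.
interface SinglePipeline { source: object; sources?: undefined; merge?: undefined }
interface MultiPipeline { source?: undefined; sources: object[]; merge: object }
type PipelineShape = SinglePipeline | MultiPipeline;

function isSingleSource(p: PipelineShape): p is SinglePipeline {
  return p.source !== undefined;
}
function isMultiSource(p: PipelineShape): p is MultiPipeline {
  return p.sources !== undefined && p.merge !== undefined;
}

// Mirrors the .refine(): exactly one of `source` / `sources`, and
// `sources` always travels with `merge`.
function isValidShape(p: { source?: unknown; sources?: unknown; merge?: unknown }): boolean {
  const single = p.source !== undefined;
  const multi = p.sources !== undefined;
  if (single === multi) return false;               // never both, never neither
  if (multi && p.merge === undefined) return false; // `sources` requires `merge`
  return true;
}
```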

---

## ═══════════════════════════════════════════════════════════
## PLUGIN INTERFACES
## ═══════════════════════════════════════════════════════════

### SourceAdapter (src/adapters/source/types.ts)

```typescript
export interface SourceAdapter {
  readonly id: string;
  connect(config: SourceConfig): Promise<void>;
  extract(
    config: SourceConfig,
    store: StagingStore,
    runConfig: RunConfig,
    onProgress: (rows: number) => void,
    targetTable?: string // defaults to 'stg_raw'; set per-source in
                         // multi-source pipelines
  ): Promise<ExtractResult>;
  disconnect(): Promise<void>;
}

export interface ExtractResult {
  rowsExtracted: number;
  tableName: string; // caller-supplied; 'stg_raw' for single-source,
                     // 'stg_raw_{sourceId}' for each source in a
                     // multi-source pipeline
  columns: ColumnMeta[];
}

export interface ColumnMeta {
  name: string;
  duckDbType: string; // VARCHAR | BIGINT | DOUBLE | BOOLEAN | TIMESTAMP
}
```

### TargetAdapter (src/adapters/target/types.ts)

```typescript
export interface TargetAdapter {
  readonly id: string;
  connect(config: TargetConfig): Promise<void>;
  load(
    config: TargetConfig,
    store: StagingStore,
    runConfig: RunConfig,
    onProgress: (rows: number) => void
  ): Promise<LoadResult>;
  disconnect(): Promise<void>;
}

export interface LoadResult {
  rowsLoaded: number;
  rowsFailed: number;
  outputPath?: string; // set for file-based targets
}
```

### DQ Rule (src/dq/rules/types.ts)

```typescript
export interface Rule {
  readonly id: string;
  validate(
    value: unknown,
    config: CheckConfig,
    rowIndex: number,
    field: string
  ): RuleViolation | null;
}

export interface RuleViolation {
  field: string;
  rowIndex: number;
  value: unknown;
  rule: string;
  severity: 'critical' | 'warning' | 'info';
  message: string;
}
```
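A built-in check such as `notNull` is a small, pure object against this interface. The sketch below inlines minimal copies of the types so it runs standalone; whether the real rule treats whitespace-only strings as blank is an assumption here:

```typescript
// Minimal inlined copies of the interfaces above; CheckConfig is
// reduced to the two fields this sketch needs.
type Severity = "critical" | "warning" | "info";
interface CheckConfig { severity: Severity; message?: string }
interface RuleViolation {
  field: string; rowIndex: number; value: unknown;
  rule: string; severity: Severity; message: string;
}

const notNullRule = {
  id: "notNull",
  validate(value: unknown, config: CheckConfig, rowIndex: number, field: string): RuleViolation | null {
    const blank = value === null || value === undefined ||
      (typeof value === "string" && value.trim() === ""); // assumption: blank counts as null
    if (!blank) return null;
    return {
      field, rowIndex, value, rule: "notNull",
      severity: config.severity,
      message: config.message ?? `${field} must not be null or blank`,
    };
  },
};
```

Returning `null` for a passing value keeps the hot validation loop allocation-free.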

### MergeStrategyPlugin (src/merge/types.ts)

```typescript
export interface MergeSourceMeta {
  id: string;
  priority: number;
  tableName: string; // e.g. 'stg_raw_sql-server'
}

export interface MergeResult {
  rowsMerged: number;
  conflicts: number; // fields where two non-null values disagreed
  unmatched: number; // records present in only one source
  tableName: 'stg_merged';
}

export interface MergeStrategyPlugin {
  readonly id: string;           // matches MergeSchema.strategy value
  readonly description?: string; // shown by `sluice merge list-strategies`

  merge(
    store: StagingStore,
    sources: MergeSourceMeta[], // priority-ordered (priority 1 first)
    config: MergeConfig,
  ): Promise<MergeResult>;
}
```

Built-in strategies: `coalesce`, `priority-override`, `union`, `intersect`
(all pre-registered in `MergeStrategyRegistry`; live in
`src/merge/strategies/*.ts`). Custom strategies can be dropped into a
`plugins/` folder as `*.merge.ts` files exporting `const mergeStrategy`.

---

## ═══════════════════════════════════════════════════════════
## ADAPTER IMPLEMENTATION NOTES
## ═══════════════════════════════════════════════════════════

### mssql source

- Stream results: set `request.stream = true`, then consume the `recordset` /
  `row` events.
- SQL Server → DuckDB type map: `varchar/nvarchar/char → VARCHAR`,
  `int/bigint → BIGINT`, `decimal/numeric/money → DOUBLE`,
  `bit → BOOLEAN`, `datetime/date → TIMESTAMP`, `float/real → DOUBLE`.
- Trusted connection: detect `trustedConnection: true` in JSON connection config.
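The type map above translates directly into a lookup table. The fallback to `VARCHAR` for unlisted types is an assumption in this sketch; the real adapter may handle more types or throw:

```typescript
// Sketch of the SQL Server → DuckDB type map listed above.
const MSSQL_TO_DUCKDB: Record<string, string> = {
  varchar: "VARCHAR", nvarchar: "VARCHAR", char: "VARCHAR",
  int: "BIGINT", bigint: "BIGINT",
  decimal: "DOUBLE", numeric: "DOUBLE", money: "DOUBLE",
  float: "DOUBLE", real: "DOUBLE",
  bit: "BOOLEAN",
  datetime: "TIMESTAMP", date: "TIMESTAMP",
};

function duckDbType(sqlServerType: string): string {
  // Assumption: unknown types fall back to VARCHAR rather than failing.
  return MSSQL_TO_DUCKDB[sqlServerType.toLowerCase()] ?? "VARCHAR";
}
```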

### csv source

- `csv-parse` options: `{ columns: true, skip_empty_lines: true, bom: true }`.
  `bom: true` strips the UTF-8 BOM common in Excel-generated CSVs.
- All columns inferred as `VARCHAR` in DuckDB.
- Support glob patterns: concatenate all matching files into a single staging table.

### xlsx source

- SheetJS: convert to CSV via `xlsx.utils.sheet_to_csv`, then pipe through csv-parse.
- Log a warning if the workbook has more than one sheet and `source.sheet` is unset.

### rest source

- `axios-retry`: 3 retries, exponential backoff, retry on 429 and 5xx.
- Flatten nested JSON using the `__` separator (`address.postCode` → `address__postCode`).
- All three pagination types must be supported: offset, page, cursor.
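The `__` flattening rule for the rest source can be sketched as a recursive helper. Array handling is an assumption here (arrays are kept verbatim); the real adapter may differ:

```typescript
// Sketch of the "__" flattening described above: nested objects become
// flat keys, e.g. { address: { postCode: "EH1 1AA" } } becomes
// { address__postCode: "EH1 1AA" }. Arrays are passed through as-is
// (assumption, not specified by the doc).
function flatten(obj: Record<string, unknown>, prefix = ""): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    const name = prefix ? `${prefix}__${key}` : key;
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      Object.assign(out, flatten(value as Record<string, unknown>, name));
    } else {
      out[name] = value;
    }
  }
  return out;
}
```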

### IFS target

- UTF-8 CSV via `csv-stringify`.
- `includeHeader` defaults to `false` for this adapter.
- Apply `target.columnOrder` if specified.
- Format date columns using `dayjs` with `target.dateFormat` (default `YYYY-MM-DD`).

### BlueCherry target (src/adapters/target/bluecherry.ts)

BlueCherry ERP (CGS — Computer Generated Solutions) uses fixed-format CSV for
bulk import. Each entity type has a required column set. The adapter validates
required columns at `connect()` time, before any data is read.

**Required columns per entity:**

```typescript
const REQUIRED_COLUMNS: Record<string, string[]> = {
  Style: [
    'StyleNo', 'StyleDesc', 'Division', 'Season',
    'CostPrice', 'RetailPrice', 'ActiveFlag',
  ],
  Vendor: [
    'VendorNo', 'VendorName', 'Country', 'CurrencyCode',
  ],
  PurchaseOrder: [
    'PONumber', 'VendorNo', 'Season', 'OrderDate', 'DeliveryDate',
  ],
  PODetail: [
    'PONumber', 'StyleNo', 'ColourCode', 'SizeCode', 'Quantity', 'CostPrice',
  ],
  Season: [
    'SeasonCode', 'SeasonDesc', 'StartDate', 'EndDate',
  ],
  ColourSize: [
    'StyleNo', 'ColourCode', 'ColourDesc', 'SizeCode', 'SizeDesc',
  ],
};
```

**Behaviour:**
- `includeHeader` defaults to `true`.
- Default `dateFormat` is `MM/DD/YYYY` (BlueCherry is US-origin software).
- Any column whose name ends with `Date` (case-insensitive) is automatically
  formatted using `target.dateFormat` via `dayjs`.
- `nullValue` (default `""`) is used for all null/undefined fields.
- At `connect()`:
  1. Verify `target.entity` is in `REQUIRED_COLUMNS`. Throw `ConfigError` if not.
  2. Query `store.columnNames('stg_transformed')` and verify all required columns
     are present. Throw `ConfigError` listing any missing columns.
  3. If `target.template` is a file path, read its header row and use it as the
     definitive column order for the output. If `target.template === 'default'`,
     use the required columns list as column order, with any additional columns
     from `stg_transformed` appended.

**Note on BlueCherry column names:** The column names in `REQUIRED_COLUMNS` are
internal conventions for this toolkit. Verify them against the actual BlueCherry
import documentation before running a live migration. The `template` feature exists
precisely to override these if the client's BlueCherry instance uses different names.
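Two of the `connect()`-time behaviours above, date-column detection by name and missing-column reporting, can be sketched directly. `REQUIRED` is abbreviated to one entity and both function names are illustrative:

```typescript
// Sketch of two BlueCherry behaviours: a column whose name ends with
// "Date" (case-insensitive) is auto-formatted, and required columns are
// checked per entity before any data is read.
const REQUIRED: Record<string, string[]> = {
  Vendor: ["VendorNo", "VendorName", "Country", "CurrencyCode"],
};

function isDateColumn(name: string): boolean {
  return name.toLowerCase().endsWith("date"); // "OrderDate", "DELIVERYDATE", ...
}

function missingColumns(entity: string, present: string[]): string[] {
  const required = REQUIRED[entity];
  if (!required) throw new Error(`Unknown BlueCherry entity: ${entity}`);
  return required.filter(c => !present.includes(c));
}
```

In the real adapter a non-empty `missingColumns` result would become a `ConfigError` listing the gaps.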

### Business Central REST target

- OAuth2 client credentials: `POST https://login.microsoftonline.com/{tenantId}/oauth2/v2.0/token`
- Cache token in memory; refresh 60 seconds before expiry.
- OData `$batch`: `POST {baseUrl}/api/{version}/companies({company})/$batch`
  with `Content-Type: multipart/mixed; boundary=batch_{uuid}`.
  Maximum 100 operations per batch request.
- HTTP 409 with `onConflict: upsert` → issue PATCH to individual entity URL.
- HTTP 4xx (non-409): log error, increment `rowsFailed`, continue if
  `run.onError: continue`.
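The token-cache rule (reuse until 60 seconds before expiry) can be sketched as a closure. `fetchToken` and `makeTokenCache` are hypothetical names standing in for the real OAuth2 call and adapter internals:

```typescript
// Sketch of in-memory token caching: reuse the cached token until
// 60 s before its expiry, then fetch a fresh one. The clock is
// injectable so the behaviour is testable without real time.
interface Token { accessToken: string; expiresAt: number } // epoch ms

function makeTokenCache(fetchToken: () => Promise<Token>, now: () => number = Date.now) {
  let cached: Token | null = null;
  return async function getToken(): Promise<string> {
    const refreshMarginMs = 60_000; // refresh 60 seconds before expiry
    if (!cached || now() >= cached.expiresAt - refreshMarginMs) {
      cached = await fetchToken();
    }
    return cached.accessToken;
  };
}
```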

---

## ═══════════════════════════════════════════════════════════
## PIPELINE RUNNER — EXECUTION ORDER
## ═══════════════════════════════════════════════════════════

**Important:** `ConfigLoader.load()` interpolates `${ENV_VAR}` tokens from
`process.env` but does **not** call `loadEnv()` / `dotenv.config()` itself.
The CLI entry point must call `loadEnv()` before invoking the loader. This keeps
`ConfigLoader` side-effect-free and testable (tests stub `process.env` directly).

```
 1. Load + validate config       ConfigLoader.load(yamlPath)
 2. Resolve output directory     create if not exists
 3. Open DuckDB staging store    StagingStore.open(dbPath)
 4. Connect source adapter
 5. Extract → 'stg_raw'          log: rows extracted
 5a. Disconnect source adapter   always in finally
 5b. Phase 4a Enrich (optional)  runs only when:
                                 - `enrich:` block configured
                                 - --no-enrich NOT set
                                 - mode != validate-only and not dryRun
                                 - @caracal-lynx/sluice-enrich is installed
                                   and has called registerEnrichPhase()
                                 Otherwise skipped (WARN log if the last bullet
                                 fails). Writes new columns to 'stg_raw'.
 6. Run DQ rules against 'stg_raw'
    a. Collect all RuleViolations
    b. Write rejection CSV
    c. Write summary JSON
    d. Log DQ summary (info)
    e. If stopOnCritical AND criticalCount > 0 → throw PipelineDQError
 7. Resolve all lookups          LookupResolver.loadAll()
 8. Transform 'stg_raw' → 'stg_transformed' (batch by batchSize)
 9. If dryRun === true → STOP (log summary, exit 0)
10. If mode === 'validate-only' → STOP (log summary, exit 0)
11. Connect target adapter
12. Load 'stg_transformed' → target
12a. Disconnect target adapter   always in finally
13. Close DuckDB staging store   always in finally
14. Write run state file         {outputDir}/{name}-state.json
15. Log final summary (info)
```

**Run state file** `{outputDir}/{name}-state.json`:
```json
{
  "pipeline": "acme-corp-customers",
  "lastRunAt": "2026-04-15T09:30:00.000Z",
  "lastMode": "full",
  "rowsExtracted": 1842,
  "rowsLoaded": 1801,
  "criticalViolations": 0,
  "warnings": 41,
  "incrementalSince": ""
}
```

Used by `mode: incremental` to auto-determine the `since` timestamp.
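The `since` resolution can be sketched as a tiny pure function. `resolveIncrementalSince` is hypothetical, and the "no state means fall back to a full run" behaviour is an assumption, not stated by the spec:

```typescript
// Sketch of `since` resolution for mode: incremental. An explicit
// run.incrementalSince wins; otherwise fall back to the previous state
// file's lastRunAt; with neither, return null (assumed to mean a
// full extract).
interface RunState { lastRunAt?: string }

function resolveIncrementalSince(
  explicitSince: string | undefined,
  previousState: RunState | null,
): string | null {
  if (explicitSince) return explicitSince;
  return previousState?.lastRunAt ?? null;
}
```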

### Multi-source execution order (`MultiSourcePipelineRunner`)

For a pipeline with `sources` + `merge`, the CLI selects
`MultiSourcePipelineRunner` (a subclass of `PipelineRunner` that overrides
`run()`, `profile()`, and `writeStateFile()` and reuses the protected
`runExtract`, `runDQ`, `runTransform`, `runLoad` phase methods).

```
 1. Load + validate config       ConfigLoader.load(yamlPath)
 2. Load plugins                 files + sluice.config.yaml (Tier 2/3)
 3. Resolve output dir, open DuckDB staging store
 4. For each source (priority-ordered):
    a. runExtract → 'stg_raw_{sourceId}'
    b. If source.rename is set   StagingStore.renameColumns(...)
    c. If mode: incremental AND source.id === merge.incrementalSource:
       apply TRY_CAST(... AS TIMESTAMP) >= since filter
    d. Filter dq.rules by sourceId; runDQ against 'stg_raw_{sourceId}'
       (writes per-source rejection CSV, stops on critical)
    e. Rewrite 'stg_raw_{sourceId}' to only the accepted rows
 5. MergeEngine.run(store, sources, merge)
    → creates 'stg_merge_joined', 'stg_merged', 'stg_merge_conflicts'
    → writes conflictLog CSV if configured
 5a. Phase 4a Enrich (optional)  runs once against 'stg_merged' if
                                 `enrich:` block is present and the four
                                 gating conditions hold (see single-source
                                 step 5b above). Single post-merge pass —
                                 never per-source.
 6. runDQ on the post-merge rules (no sourceId) against 'stg_merged'
 7. Filter rejected rows; runTransform against the filtered merge result
 8. If dryRun OR validate-only → STOP
 9. runLoad → target adapter reads 'stg_transformed'
10. writeStateFile → per-source lastRunAt block + top-level summary
11. Close DuckDB
```

**Multi-source state file** adds a `sources` block keyed by source id:

```json
{
  "pipeline": "style-co-products-merged",
  "lastRunAt": "2026-04-19T09:30:00.000Z",
  "lastMode": "incremental",
  "rowsMerged": 3201,
  "rowsLoaded": 3188,
  "criticalViolations": 0,
  "warnings": 14,
  "incrementalSince": "",
  "sources": {
    "sql-server": {
      "lastRunAt": "2026-04-19T09:30:00.000Z",
      "rowsExtracted": 2910,
      "incrementalSince": "2026-04-18T22:00:00.000Z"
    },
    "excel": { "lastRunAt": "...", "rowsExtracted": 412, "incrementalSince": "" }
  }
}
```

---

## ═══════════════════════════════════════════════════════════
## DUCKDB STAGING STORE (src/staging/store.ts)
## ═══════════════════════════════════════════════════════════

```typescript
class StagingStore {
  constructor(private dbPath: string) {} // ':memory:' for dryRun/tests

  async open(): Promise<void>
  async close(): Promise<void>
  async createTable(name: string, columns: ColumnMeta[]): Promise<void>
  async insertBatch(table: string, rows: Record<string, unknown>[]): Promise<void>
  async query<T>(sql: string, params?: unknown[]): Promise<T[]>
  async tableExists(name: string): Promise<boolean>
  async dropTable(name: string): Promise<void>
  async rowCount(table: string): Promise<number>
  async columnNames(table: string): Promise<string[]>
  async exportToCsv(
    table: string,
    outputPath: string,
    options?: { delimiter?: string; header?: boolean; encoding?: string }
  ): Promise<void>
  async renameColumns(               // Phase 3: used by MultiSourcePipelineRunner
    tableName: string,               // after a per-source extract. Implemented as
    renames: Record<string, string>  // CREATE OR REPLACE TABLE ... AS SELECT ...
  ): Promise<void>                   // Unknown keys log a warning, not an error.
}
```

Default DuckDB path: `{outputDir}/{pipelineName}.duckdb`
Use `':memory:'` when `dryRun: true` or `stagingDb: ':memory:'`.
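The `CREATE OR REPLACE TABLE ... AS SELECT` approach behind `renameColumns` can be sketched as a pure SQL builder. `buildRenameSql` is illustrative and its quoting is simplified; the real store may build the statement differently:

```typescript
// Sketch of the rename-via-rewrite approach: build a SELECT list that
// aliases renamed columns and passes the rest through, then recreate
// the table in place. Unknown rename keys are surfaced for a warning
// rather than failing, matching the behaviour described above.
function buildRenameSql(
  tableName: string,
  existing: string[],
  renames: Record<string, string>,
): { sql: string; unknown: string[] } {
  const unknown = Object.keys(renames).filter(k => !existing.includes(k));
  const select = existing
    .map(c => (renames[c] ? `"${c}" AS "${renames[c]}"` : `"${c}"`))
    .join(", ");
  return {
    sql: `CREATE OR REPLACE TABLE "${tableName}" AS SELECT ${select} FROM "${tableName}"`,
    unknown,
  };
}
```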

---

## ═══════════════════════════════════════════════════════════
## TRANSFORM ENGINE (src/transform/engine.ts)
## ═══════════════════════════════════════════════════════════

### Field type behaviours

| type | behaviour |
|---|---|
| `string` | `String(value)`, cleanse ops, then truncate to `max` |
| `number` | `Math.round(Number(value))`. Throw `TransformError` if NaN. |
| `decimal` | `parseFloat(value).toFixed(precision)` stored as string |
| `boolean` | `['1','true','yes','y','t'].includes(String(v).toLowerCase())` |
| `date` | Parse with `dayjs(value, format)`; output as `target.dateFormat` or ISO |
| `lookup` | `LookupResolver.resolve(lookupName, value)` |
| `concat` | Join `from[]` with `separator`, then cleanse |
| `constant` | Emit `value` verbatim |
| `expression` | `ExpressionEvaluator.evaluate(expression, row)` |
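Three rows of the table above translate directly into small pure functions. These are illustrative sketches mirroring the listed behaviours, not the engine's actual code:

```typescript
// boolean row: truthy token set, lowercased comparison.
function toBoolean(v: unknown): boolean {
  return ["1", "true", "yes", "y", "t"].includes(String(v).toLowerCase());
}

// decimal row: parseFloat + toFixed(precision), stored as a string.
function toDecimal(v: unknown, precision: number): string {
  return parseFloat(String(v)).toFixed(precision);
}

// string row (truncation step only): String(value), then cut to `max`.
function toBoundedString(v: unknown, max?: number): string {
  const s = String(v);
  return max !== undefined ? s.slice(0, max) : s;
}
```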

### Expression evaluator (src/transform/expression.ts)

**Must not use `eval()` or `new Function()`.**

1. Expression does NOT start with `js:` → use `expr-eval` Parser.
   Provide `row` as a variable containing all source field values.
2. Expression starts with `js:` → strip prefix, execute via
   `vm.runInNewContext(code, { row, Date, Math, JSON, String, Number, Boolean })`.
   Log a `warn` whenever the `js:` path is taken.
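The sandboxed `js:` path can be sketched in a few lines. This is illustrative only: the real evaluator also logs the warning and routes non-`js:` expressions through `expr-eval`:

```typescript
// Sketch of the `js:` branch: strip the prefix and evaluate the code in
// a fresh vm context exposing only the whitelisted globals, so the
// expression cannot reach process, require, or the file system.
import vm from "node:vm";

function evaluateJsExpression(expression: string, row: Record<string, unknown>): unknown {
  const code = expression.slice("js:".length);
  return vm.runInNewContext(code, { row, Date, Math, JSON, String, Number, Boolean });
}
```

`vm` is not a hard security boundary against hostile code, which is one reason the spec logs a `warn` every time this path is taken.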

---

## ═══════════════════════════════════════════════════════════
## DQ REPORTER OUTPUT (src/dq/reporter.ts)
## ═══════════════════════════════════════════════════════════

**Rejection CSV** columns: `row_index`, `field`, `value`, `rule`, `severity`, `message`

**Summary JSON** (`{outputDir}/{name}-dq-summary.json`):
```json
{
  "pipeline": "acme-corp-customers",
  "runAt": "2026-04-15T09:30:00Z",
  "rowsChecked": 1842,
  "rowsPassed": 1801,
  "rowsRejected": 41,
  "violations": { "critical": 0, "warning": 38, "info": 3 },
  "byField": {
    "POST_CODE": { "critical": 0, "warning": 22 },
    "EMAIL": { "critical": 0, "warning": 16 }
  }
}
```

---

## ═══════════════════════════════════════════════════════════
## ERROR TYPES (src/utils/errors.ts)
## ═══════════════════════════════════════════════════════════

```typescript
export class PipelineError extends Error {
  constructor(message: string, public readonly cause?: unknown) {
    super(message);
    this.name = this.constructor.name;
    if (Error.captureStackTrace) {
      Error.captureStackTrace(this, this.constructor);
    }
  }
}
export class ConfigError extends PipelineError {}
export class SourceError extends PipelineError {}
export class StagingError extends PipelineError {}
export class DQError extends PipelineError {}
export class PipelineDQError extends DQError {
  constructor(
    public readonly criticalCount: number,
    public readonly reportPath: string,
  ) {
    super(`Pipeline halted: ${criticalCount} critical DQ violations. See ${reportPath}`);
  }
}
export class TransformError extends PipelineError {}
export class ExpressionError extends TransformError {}
export class LoadError extends PipelineError {}
export class EnrichError extends PipelineError {} // Phase 4a — exit code 4
```

All error subclasses inherit `this.name = this.constructor.name` from
`PipelineError`, so `err.name` reflects the actual class (e.g. `"ConfigError"`,
`"PipelineDQError"`). `Error.captureStackTrace` (V8-only) trims the constructor
frame from stack traces for cleaner output.
1585
- ## ═══════════════════════════════════════════════════════════
1586
- ## CLI (src/cli.ts)
1587
- ## ═══════════════════════════════════════════════════════════
1588
-
1589
- ```
1590
- sluice run <pipeline.yaml> Full pipeline run (auto-detects single vs multi-source)
1591
- sluice validate <pipeline.yaml> DQ + transform only; no load
1592
- sluice profile <pipeline.yaml> Extract + column profiling; no DQ
1593
- sluice check <pipeline.yaml> Config validation only; no execution
1594
- sluice plugins List all loaded rule/transform/merge plugins
1595
- sluice merge list-strategies List all registered merge strategies
1596
- sluice merge info <strategy> Show details about a specific merge strategy
1597
-
1598
- Global options:
1599
- --log-level <level> debug | info | warn | error
1600
- --env <file> Path to .env file (default: ./.env)
1601
- --output <dir> Override outputDir
1602
- --plugins <dir...> Additional plugin directory/directories to load
1603
- --dry-run Force dryRun: true
1604
- --silent Suppress the progress bar on stdout (logs still go to stderr)
-
- `sluice run` options:
- --no-enrich Skip the Phase 4a enrich phase even if `enrich:` is configured.
- (validate / profile / check do not run enrich at all, by design.)
- ```
-
- **Progress feedback:** `sluice run`, `sluice validate`, and `sluice profile`
- render a phase-by-phase progress bar to stdout via
- `src/utils/progress.ts → ProgressReporter`, with per-phase emoji icons
- (🔎 extract · 🛡️ DQ · 🔀 merge · 🌐 enrich · 🔧 transform · 📤 load), an ETA for
- determinate phases, and a coloured ✅/⚠️/❌ run-summary line. The bar
- degrades gracefully:
- - `--silent` → no stdout output at all
- - `--log-level debug` → bar disabled; per-row debug lines are used instead
- - `process.stdout.isTTY` → false: plain-ASCII lines (one per phase),
- no emojis, no ANSI escapes — log-file friendly
- - `NO_COLOR` env var → ANSI colour dropped (handled by `picocolors`)
-
- **Exit codes:** `0` success · `1` pipeline error · `2` DQ critical violations · `3` config error · `4` enrich error (Phase 4a)
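The mapping can be sketched as follows (hypothetical helper — the real wiring lives in `src/cli.ts`; error names follow `src/utils/errors.ts`):

```typescript
// Hedged sketch of mapping thrown errors to the documented exit codes.
function exitCodeFor(err: unknown): number {
  if (!(err instanceof Error)) return 1;
  switch (err.name) {
    case "ConfigError":     return 3; // config error
    case "PipelineDQError": return 2; // DQ critical violations
    case "EnrichError":     return 4; // Phase 4a enrich error
    default:                return 1; // any other pipeline error
  }
}
```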
-
- ---
-
- ## ═══════════════════════════════════════════════════════════
- ## LOGGING (src/utils/logger.ts)
- ## ═══════════════════════════════════════════════════════════
-
- Single `pino` instance. All log records (every level) go to **stderr**; stdout
- is reserved exclusively for the progress bar and final summary rendered by
- `ProgressReporter`. This mirrors how git, cargo, and npm split streams.
-
- No `console.log` in `src/`. Operators who want logs in a file can run
- `sluice run p.yaml 2>run.log` — the bar stays visible on the terminal while
- every pino record is captured to the file. Use `--log-level error` to narrow
- the file to errors only.
-
- | Level | Used for |
- |---|---|
- | `debug` | Per-row progress, SQL queries, lookup cache hits |
- | `info` | Phase transitions, row counts, file paths, run summary |
- | `warn` | DQ warnings, missing optional lookups, `js:` expression usage |
- | `error` | All caught errors before re-throw |
-
- Dev: `npx sluice run pipeline.yaml --silent 2>&1 | npx pino-pretty` (logs are on stderr, so redirect before piping)
-
- ---
-
- ## ═══════════════════════════════════════════════════════════
- ## ENVIRONMENT VARIABLES (.env.example)
- ## ═══════════════════════════════════════════════════════════
-
- ```bash
- # ── Acme Corp — source ────────────────────────────────────
- SOURCE_MSSQL=mssql://user:password@serverlegacy.example.local/LegacyDB
-
- # ── Acme Corp — IFS target ────────────────────────────────
- IFS_IMPORT_PATH=C:\IFS\Import
-
- # ── Business Central target (any client using the `bc` adapter) ──
- BC_BASE_URL=https://api.businesscentral.dynamics.com/v2.0
- BC_TENANT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
- BC_CLIENT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
- BC_CLIENT_SECRET=your-client-secret
- BC_COMPANY=Example Company Ltd
-
- # ── Style Co — source ───────────────────────────────────
- SOURCE_2_MSSQL=mssql://user:password@serverlegacy2.example.local/LegacyDB
-
- # ── Style Co — BlueCherry (file-based; no API creds) ───
- BC_IMPORT_PATH=C:\BlueCherry\Import
-
- # ── Runtime ───────────────────────────────────────────────────
- NODE_ENV=development
- LOG_LEVEL=info
- ```
-
- ---
-
- ## ═══════════════════════════════════════════════════════════
- ## TESTING
- ## ═══════════════════════════════════════════════════════════
-
- - **Vitest only.** No Jest.
- - Unit tests: mock all I/O with `vi.mock`.
- - Integration tests: real DuckDB (`:memory:`) + CSV fixtures.
- - No tests against live SQL Server, BC, IFS, or BlueCherry.
- - Target: 80% line coverage across `src/dq/` and `src/transform/`.
- - Both full example pipelines in this file must parse cleanly in the config tests.
-
- **Required test cases:**
-
- Config loader: `${ENV_VAR}` resolution · missing var → `ConfigError` ·
- invalid YAML → `ZodError` · minimal pipeline with all defaults · both example
- pipelines in this spec parse cleanly.
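The `${ENV_VAR}` resolution those tests exercise can be sketched as (hedged; the real loader wraps the failure in `ConfigError`):

```typescript
// Hedged sketch of ${ENV_VAR} interpolation: unknown variables throw.
function interpolateEnv(
  raw: string,
  env: Record<string, string | undefined>,
): string {
  return raw.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_m, name: string) => {
    const value = env[name];
    if (value === undefined) {
      throw new Error(`Missing environment variable: ${name}`);
    }
    return value;
  });
}
```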
-
- DQ engine: `notNull` on null/empty/whitespace · `unique` with duplicates ·
- `ukPostcode` valid and invalid formats · `allowedValues` case sensitivity ·
- `stopOnCritical` throws `PipelineDQError` · reporter writes correct CSV and JSON.
-
- Transform engine: `concat` with separator · `lookup` miss + `optional: true` → null ·
- `lookup` miss + `optional: false` → `TransformError` · `expression` basic eval ·
- `expression` with `js:` prefix · `cleanse: trim|titleCase` · `cleanse: padStart:6:0` ·
- `cleanse: normaliseUnicode` · `type: date` with `format: DD/MM/YYYY` ·
- `type: boolean` all truthy/falsy variants.
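The lookup-miss cases above can be sketched as (hypothetical helper — the real logic lives in `src/transform/lookup.ts`; the default-before-optional ordering is an assumption):

```typescript
// Hedged sketch of type: lookup resolution: hit → value,
// miss + default → default, miss + optional → null,
// otherwise throw (TransformError in src).
function resolveLookup(
  table: Map<string, string>,
  key: string,
  opts: { optional?: boolean; default?: string } = {},
): string | null {
  const hit = table.get(key);
  if (hit !== undefined) return hit;
  if (opts.default !== undefined) return opts.default;
  if (opts.optional) return null;
  throw new Error(`Lookup miss for key "${key}"`);
}
```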
-
- BlueCherry adapter: missing required column → `ConfigError` at `connect()` ·
- date columns formatted with `target.dateFormat` · header row present ·
- `nullValue` respected · `template` CSV used as column order.
-
- Staging store: insert/query round-trip all DuckDB types · `exportToCsv` delimiter
- and header options · `:memory:` mode works correctly.
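For illustration, identifier quoting in the spirit of `staging/schema.ts` (`quoteIdent` is named in the repository tree; this body is an assumption):

```typescript
// Hedged sketch of DuckDB identifier quoting: wrap in double quotes
// and escape embedded double quotes by doubling them.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}
```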
-
- ---
-
- ## ═══════════════════════════════════════════════════════════
- ## BUILD, SCRIPTS, CI
- ## ═══════════════════════════════════════════════════════════
-
- **package.json scripts:**
- ```json
- {
- "name": "@caracal-lynx/sluice",
- "scripts": {
- "build": "tsc -p tsconfig.json",
- "dev": "tsx watch src/cli.ts",
- "lint": "eslint src tests",
- "format": "prettier --write src tests",
- "test": "vitest run",
- "test:watch": "vitest",
- "test:cov": "vitest run --coverage",
- "sluice": "tsx src/cli.ts"
- },
- "bin": { "sluice": "dist/cli.js" }
- }
- ```
-
- Use `tsx` (not `ts-node`) for development execution — handles tsconfig path aliases
- on Windows without extra configuration.
-
- **GitHub Actions** (`.github/workflows/ci.yml`):
- ```yaml
- on: [push, pull_request]
- jobs:
- test:
- runs-on: ubuntu-latest
- steps:
- - uses: actions/checkout@v4
- - uses: actions/setup-node@v4
- with: { node-version: '24', cache: 'npm' }
- - run: npm ci
- - run: npm run lint
- - run: npm run build
- - run: npm run test:cov
- - uses: actions/upload-artifact@v4
- with: { name: coverage, path: coverage/ }
- ```
-
- ---
-
- ## ═══════════════════════════════════════════════════════════
- ## WINDOWS / POWERSHELL NOTES
- ## ═══════════════════════════════════════════════════════════
-
- - All file paths: `path.join()` / `path.resolve()`. Never string concat with `/`.
- - `.env` uses LF line endings (set in `.gitattributes`).
- - DuckDB npm package includes the `win32-x64` native binary automatically.
- - Do not write Windows-only shell commands in CI (CI runs ubuntu-latest).
- - Developer shell: PowerShell 7 on Windows Terminal.
-
- ---
-
- ## ═══════════════════════════════════════════════════════════
- ## WHAT NOT TO DO
- ## ═══════════════════════════════════════════════════════════
-
- - Do not use `ts-node` — use `tsx`.
- - Do not use `jest` — use `vitest`.
- - Do not use `console.log` in `src/` — use the pino logger.
- - Do not write manual TypeScript interfaces for config types — use `z.infer<>`.
- - Do not use `eval()` or `new Function()` — use `expr-eval` or `vm.runInNewContext`.
- - Do not hard-code connection strings, credentials, or client-specific values.
- - Do not import from `@duckdb/node-api` directly outside `src/staging/store.ts`.
- - Do not create `StagingStore` instances outside `PipelineRunner`.
- - Do not add UI, REST server, or dashboard code.
- - Do not add adapter-specific logic to `PipelineRunner`.
- - Do not invent new top-level YAML keys — the schema is fixed.
- - Do not add cleanse ops without adding them to the reference table in this file.
- - Do not add BlueCherry entity types to `REQUIRED_COLUMNS` without verifying
- column names against actual BlueCherry import documentation first.
- - Do not use `dayjs` plugins without importing them explicitly at the call site.
-
- ---
-
- ## ═══════════════════════════════════════════════════════════
- ## SUGGESTED BUILD ORDER FOR CLAUDE CODE
- ## ═══════════════════════════════════════════════════════════
-
- Work phase by phase. Do not start the next phase until the current phase passes
- `npm run build` and `npm test` without errors. Ask before proceeding if anything
- in this spec is ambiguous.
-
- 1. **Scaffold** — `package.json`, `tsconfig.json`, `src/utils/`, `src/config/`.
- Verify both example pipelines parse cleanly.
- 2. **Staging store** — `src/staging/`. Unit tests with `:memory:`.
- 3. **Source adapters** — `csv` first, then `mssql`, `pg`, `xlsx`, `rest`.
- Mock all external connections in tests.
- 4. **DQ engine** — `src/dq/` including all rules and reporter.
- 5. **Transform engine** — `src/transform/` — all types, cleanse ops, expression eval.
- 6. **Target adapters** — `csv` → `ifs` → `bluecherry` → `bc` (BC is most complex;
- mock OAuth2 token endpoint in tests).
- 7. **PipelineRunner** — wire all phases; integration test both fixture pipelines.
- 8. **CLI** — all commands and exit codes.
- 9. **CI** — `.github/workflows/ci.yml`.
-
- ---
-
- *This file is the authoritative specification for Sluice. If anything in the
- codebase contradicts this file, the codebase is wrong. Update this file whenever
- the architecture evolves — then tell Claude Code to re-read it before continuing.*
+ # Sluice — CLAUDE.md
+ # Project specification for Claude Code
+ # Sluice: config-driven ETL toolkit for ERP data migrations
+ # npm package: @caracal-lynx/sluice
+ # Owner: Michael Scott, Caracal Lynx Ltd. (SC826823)
+ # Last updated: 2026-04-20
+
+ ---
+
+ ## Project overview
+
+ **Sluice** is a config-driven ETL toolkit for ERP data migrations, developed and
+ maintained by Caracal Lynx Ltd. The engine is written once; each client
+ engagement is delivered as a folder of YAML pipeline configs. There is no UI, no
+ server, and no cloud dependency — just the `sluice` CLI and a set of TypeScript
+ modules that can be imported by other tools (e.g. n8n custom nodes, GitHub Actions).
+
+ *Clean data flows through.*
+
+ **Known clients and targets:**
+
+ | Client | Source(s) | Target ERP | Adapter |
+ |---|---|---|---|
+ | Acme Corp | MSSQL legacy DB | IFS ERP | `ifs` |
+ | Style Co | MSSQL / CSV exports | BlueCherry ERP | `bluecherry` |
+
+ **Primary use cases:**
+ - Extract data from legacy SQL databases, CSV/Excel exports, and REST APIs
+ - Validate data quality against a configurable rule set
+ - Transform field mappings, apply lookups, cleanse values, evaluate expressions
+ - Load output to BC via REST API, IFS via CSV import, BlueCherry via CSV import,
+ or generic CSV/JSON for any other target
+ - Run from the command line on a developer laptop (Windows, PowerShell 7)
+ - Run unattended in GitHub Actions CI
+
+ **Non-goals:**
+ - No web UI or dashboard
+ - No streaming / real-time ingestion
+ - No data warehouse or lake — DuckDB is used only as a local staging store
+ - No multi-tenant SaaS — this is a consultant's toolkit, not a product
+
+ **Related docs:**
+ - [README.md](README.md) — install, quick-start, composite rules (Tier 1)
+ - [PLUGINS.md](PLUGINS.md) — Tier 2 (file) and Tier 3 (npm) plugin author guide
+ - [docs/architecture-diagrams.md](docs/architecture-diagrams.md) — Mermaid diagrams of
+ the single- and multi-source pipeline flow
+
+ ---
+
+ ## Repository structure
+
+ ```
+ sluice/
+ ├── CLAUDE.md ← you are here
+ ├── PLUGINS.md ← Tier 2 / Tier 3 plugin author guide
+ ├── README.md
+ ├── package.json
+ ├── tsconfig.json
+ ├── tsconfig.test.json
+ ├── .env.example
+ ├── .gitignore
+ ├── eslint.config.js
+ ├── .prettierrc
+ ├── .github/workflows/ci.yml
+ ├── docs/
+ │ └── architecture-diagrams.md
+ ├── examples/ ← sample pipelines (not run by tests)
+
+ ├── src/
+ │ ├── index.ts ← public API barrel (re-exports from all modules)
+ │ ├── cli.ts ← commander CLI entry point
+ │ ├── runner.ts ← PipelineRunner (single-source)
+ │ ├── multi-source-runner.ts ← MultiSourcePipelineRunner (extends PipelineRunner)
+ │ │
+ │ ├── config/
+ │ │ ├── index.ts ← re-exports schema + types
+ │ │ ├── schema.ts ← Zod schema (PipelineSchema + sub-schemas)
+ │ │ ├── loader.ts ← YAML load + ${ENV_VAR} interp + composite-rule expansion + parse
+ │ │ └── types.ts ← re-exports of all inferred Zod types + guards
+ │ │
+ │ ├── adapters/
+ │ │ ├── source/
+ │ │ │ ├── index.ts ← barrel (self-registers built-ins on import)
+ │ │ │ ├── registry.ts ← SourceAdapterRegistry
+ │ │ │ ├── types.ts ← SourceAdapter + ExtractResult
+ │ │ │ ├── mssql.ts
+ │ │ │ ├── pg.ts
+ │ │ │ ├── csv.ts
+ │ │ │ ├── xlsx.ts
+ │ │ │ └── rest.ts
+ │ │ └── target/
+ │ │ ├── index.ts ← barrel (self-registers built-ins on import)
+ │ │ ├── registry.ts ← TargetAdapterRegistry
+ │ │ ├── types.ts ← TargetAdapter + LoadResult
+ │ │ ├── bc.ts ← Business Central REST (+ BcTokenManager)
+ │ │ ├── ifs.ts ← IFS ERP CSV import
+ │ │ ├── bluecherry.ts ← BlueCherry ERP CSV import
+ │ │ ├── csv.ts ← generic CSV
+ │ │ └── pg.ts
+ │ │
+ │ ├── staging/
+ │ │ ├── index.ts ← barrel
+ │ │ ├── store.ts ← DuckDB wrapper (the only file that imports `@duckdb/node-api`)
+ │ │ └── schema.ts ← ColumnMeta, quoteIdent, buildCreateTableSql
+ │ │
+ │ ├── dq/
+ │ │ ├── index.ts ← barrel
+ │ │ ├── engine.ts ← DQEngine
+ │ │ ├── reporter.ts ← writeRejectionCsv, writeSummaryJson
+ │ │ ├── types.ts ← DQSummary, ViolationCounts
+ │ │ └── rules/
+ │ │ ├── index.ts ← BUILT_IN_RULES map (id → Rule instance)
+ │ │ ├── types.ts ← Rule = RulePlugin, RuleViolation (re-exported from plugins)
+ │ │ ├── notNull.ts
+ │ │ ├── unique.ts
+ │ │ ├── pattern.ts
+ │ │ ├── email.ts
+ │ │ ├── ukPostcode.ts
+ │ │ ├── maxLength.ts
+ │ │ ├── minMax.ts
+ │ │ └── allowedValues.ts
+ │ │
+ │ ├── transform/
+ │ │ ├── index.ts
+ │ │ ├── engine.ts ← TransformEngine (built-in types + custom plugins)
+ │ │ ├── lookup.ts
+ │ │ ├── cleanse.ts
+ │ │ ├── expression.ts ← expr-eval + `js:` vm sandbox
+ │ │ └── types.ts ← TransformResult
+ │ │
+ │ ├── merge/ ← multi-source merge engine + strategies
+ │ │ ├── index.ts ← MergeStrategyRegistry (pre-registers all built-ins)
+ │ │ ├── engine.ts ← MergeEngine
+ │ │ ├── sql-builder.ts ← shared JOIN + coalesce SQL helpers
+ │ │ ├── conflict-log.ts ← conflict CSV writer
+ │ │ ├── types.ts ← MergeStrategyPlugin, MergeSourceMeta, MergeResult
+ │ │ └── strategies/
+ │ │ ├── index.ts
+ │ │ ├── coalesce.ts
+ │ │ ├── priority-override.ts
+ │ │ ├── union.ts
+ │ │ └── intersect.ts
+ │ │
+ │ ├── plugins/ ← Tier 2 / Tier 3 plugin system
+ │ │ ├── index.ts ← barrel
+ │ │ ├── types.ts ← RulePlugin, TransformPlugin, PluginPackage
+ │ │ ├── registry.ts ← RuleRegistry, TransformRegistry (custom plugin holders)
+ │ │ └── loader.ts ← loadPlugins (file-based), loadNpmPlugins (sluice.config.yaml)
+ │ │
+ │ ├── enrich/ ← Phase 4a public surface (types only)
+ │ │ └── types.ts ← EnrichPlugin, EnrichResult, EnrichOptions, EnrichSummary,
+ │ │ EnrichPhaseFactory (implementation lives in private
+ │ │ @caracal-lynx/sluice-enrich package)
+ │ │
+ │ └── utils/
+ │ ├── index.ts
+ │ ├── logger.ts ← pino singleton
+ │ ├── env.ts ← loadEnv + requireEnv
+ │ └── errors.ts
+
+ ├── tests/
+ │ ├── fixtures/
+ │ │ ├── acme-corp-customers.pipeline.yaml
+ │ │ ├── style-co-styles.pipeline.yaml
+ │ │ ├── style-co-products-merged.pipeline.yaml ← multi-source
+ │ │ ├── multi-source-no-merge.pipeline.yaml ← negative-path multi-source
+ │ │ ├── shared-rules.yaml ← composite rule library
+ │ │ └── plugins/ ← test plugin fixtures (Tier 2 files)
+ │ │
+ │ ├── unit/
+ │ │ ├── cli.test.ts
+ │ │ ├── runner.test.ts
+ │ │ ├── adapters/
+ │ │ │ ├── source/ ← csv, mssql, pg, rest, xlsx
+ │ │ │ └── target/ ← bc, bluecherry, ifs, pg
+ │ │ ├── config/ ← loader, schema, multi-source, composite-expansion
+ │ │ ├── dq/ ← engine, reporter, rules
+ │ │ ├── merge/ ← engine, registry, strategies
+ │ │ ├── plugins/ ← loader, registry, composite-expansion
+ │ │ ├── staging/ ← store
+ │ │ └── transform/ ← cleanse, expression, engine, custom
+ │ │
+ │ └── integration/
+ │ ├── cli-check.test.ts
+ │ ├── cli-commands.test.ts
+ │ ├── cli-plugins.test.ts
+ │ ├── csv-to-csv-mvp.test.ts
+ │ ├── dq-integration.test.ts
+ │ ├── style-co-styles-mini.test.ts
+ │ ├── merge-strategies.test.ts
+ │ ├── multi-source-runner.test.ts
+ │ └── runner-plugin-wiring.test.ts
+
+ └── clients/ ← gitignored in this repo; each client
+ ├── acme-corp/ gets their own private repo
+ │ ├── .env
+ │ ├── customers.pipeline.yaml
+ │ ├── items.pipeline.yaml
+ │ ├── vendors.pipeline.yaml
+ │ └── lookups/
+ └── style-co/
+ ├── .env
+ ├── styles.pipeline.yaml
+ ├── vendors.pipeline.yaml
+ ├── purchase-orders.pipeline.yaml
+ └── lookups/
+ ```
+
+ ---
+
+ ## Technology stack
+
+ | Concern | Package | Notes |
+ |---|---|---|
+ | Language | TypeScript 5.x | `strict: true`, `exactOptionalPropertyTypes: true` |
+ | Runtime | Node.js 24 LTS | No Bun, no Deno — must run in GitHub Actions |
+ | Config parsing | `js-yaml` | YAML 1.2 only |
+ | Config validation | `zod` v3 | All config types inferred from Zod |
+ | SQL Server | `mssql` | Trusted + SQL auth both supported |
+ | PostgreSQL | `pg` + `@types/pg` | |
+ | CSV | `csv-parse` + `csv-stringify` | Streaming |
+ | Excel | `xlsx` (SheetJS) | Read-only |
+ | HTTP | `axios` + `axios-retry` | 3 retries, exponential backoff |
+ | Dates | `dayjs` | All date parsing and formatting |
+ | Staging | `@duckdb/node-api` | Embedded; no server. Replaces deprecated `duckdb` package — ABI-stable (no `npm rebuild` after Node ABI bumps). |
+ | CLI | `commander` v12 | |
+ | Logging | `pino` | JSON; `pino-pretty` in dev |
+ | Testing | `vitest` | No Jest |
+ | Env vars | `dotenv` | Loaded once at CLI entry |
+ | Linting | `eslint` + `@typescript-eslint` | |
+ | Formatting | `prettier` | 2-space, single quotes, trailing commas |
+ | Expressions | `expr-eval` | Safe expression parser; no eval() |
+
+ ---
+
+ ## TypeScript conventions
+
+ - **All config types come from Zod inference.** Do not write manual `type` or
+ `interface` declarations for anything that maps to pipeline config.
+ Use `z.infer<typeof SomeSchema>`.
+ - **No `any`.** Use `unknown` and narrow explicitly.
+ - **No `eval()` or `Function()`** anywhere. See expression evaluator section.
+ - **Async throughout.** All I/O must be `async/await`. No callbacks.
+ - **Error handling:** throw typed errors from `src/utils/errors.ts`. Never throw
+ raw strings. Catch at the `PipelineRunner` boundary.
+ - **Barrel exports:** each directory has an `index.ts`. Do not import from internal
+ files across module boundaries.
+ - **No circular imports.** Dependency direction:
+ `cli` → `runner` / `multi-source-runner` → `adapters`, `staging`, `dq`,
+ `transform`, `merge`, `plugins`, `config`, `enrich`. `plugins/` is imported by
+ `runner`, `dq`, `transform`, and `merge`; it must not import any of them.
+ `enrich/` is type-only (no runtime imports of other modules in this repo —
+ the implementation lives in the private `@caracal-lynx/sluice-enrich`).
+ Utils are imported by everyone.
+ - **Path aliases:** `@/` → `src/` in tsconfig.
+
+ ---
+
+ ## ═══════════════════════════════════════════════════════════
+ ## YAML PIPELINE CONFIG SPECIFICATION
+ ## ═══════════════════════════════════════════════════════════
+
+ Every pipeline is a single YAML file. One file = one migrated entity
+ (e.g. customers, items, vendors, styles, purchase orders).
+
+ ### Top-level structure
+
+ ```yaml
+ pipeline: { ... } # identity and metadata
+ source: { ... } # where to read from
+ enrich: { ... } # OPTIONAL — Phase 4a; external API lookups (private)
+ dq: { ... } # data quality rules
+ transform: { ... } # field mappings and lookups
+ target: { ... } # where to write to
+ run: { ... } # execution options (all fields optional; all have defaults)
+ ```
+
+ > **Phase 4a — Enrich Phase (private):** the `enrich:` block, when present, runs after Extract (and after Merge for multi-source pipelines) and before DQ. The framework that drives it lives in the **private** `@caracal-lynx/sluice-enrich` package — the open-source core only ships the Zod schema, the public `EnrichPlugin` interface (`src/enrich/types.ts`), and the `registerEnrichPhase()` injection hook on `PipelineRunner`. With `sluice-enrich` not installed, an `enrich:` block is parsed and validated but the phase is skipped with a `WARN` log. See [docs/PHASE-04-enrich-phase.md](docs/PHASE-04-enrich-phase.md) for the full spec.
+
+ ---
+
+ ### `pipeline` section
+
+ ```yaml
+ pipeline:
+ name: acme-corp-customers # REQUIRED. Slug: lowercase, hyphens only.
+ # Used in output filenames and log messages.
+ client: acme-corp # REQUIRED. Client identifier.
+ version: "1.0" # REQUIRED. Quote to ensure string type.
+ entity: CustomerInfo # REQUIRED. Logical entity name (used in
+ # load reports and target adapter metadata).
+ description: > # Optional. Human-readable description.
+ Customer master migration —
+ legacy SQL to IFS ERP
+ ```
+
+ ---
+
+ ### `source` section
+
+ Exactly one of `query`, `file`, or `endpoint` must be present.
+
+ ```yaml
+ source:
+ adapter: mssql # REQUIRED. One of: mssql | pg | csv | xlsx | rest
+
+ # ── SQL adapters (mssql, pg) ──────────────────────────────
+ connection: ${SOURCE_MSSQL} # Connection string from .env.
+ # mssql: mssql://user:pass@host/database
+ # Or a JSON string for trusted/advanced config.
+ query: |
+ SELECT c.CUST_CODE, c.CUST_NAME, c.POST_CODE
+ FROM dbo.Customers c
+ WHERE c.Active = 1
+
+ # ── CSV adapter ───────────────────────────────────────────
+ file: ./data/customers.csv # Path or glob (./data/export-*.csv).
+ delimiter: "," # Default: ","
+ encoding: utf-8 # Default: utf-8
+
+ # ── XLSX adapter ──────────────────────────────────────────
+ file: ./data/customers.xlsx
+ sheet: "Customer Export" # Sheet name or 0-based index. Default: 0.
+
+ # ── REST adapter ──────────────────────────────────────────
+ endpoint: ${API_BASE}/customers # Full URL. ${ENV_VAR} resolved at runtime.
+ headers: # Optional. Added to every request.
+ Authorization: Bearer ${API_TOKEN}
+ Accept: application/json
+ pagination: # Optional. Omit for single-page responses.
+ type: offset # offset | cursor | page
+ pageSize: 100
+ pageParam: skip # Query param name for the offset/page value.
+ totalField: data.total # Dot-path to total count in response body.
+ dataField: data.items # Dot-path to the records array.
+ cursorField: nextCursor # For cursor pagination: field in response body.
+ cursorParam: cursor # For cursor pagination: query param name.
+ ```
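The dot-path fields above (`dataField: data.items`, `totalField: data.total`) imply a small resolver; a hedged sketch of that mechanism:

```typescript
// Hedged sketch of dot-path resolution against a response body,
// e.g. "data.items" → response.data.items.
function getPath(obj: unknown, path: string): unknown {
  let current: unknown = obj;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}
```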
+
+ ---
+
+ ### `dq` section
+
+ ```yaml
+ dq:
+ stopOnCritical: true # Default: true. Halt pipeline if any critical rule fails.
+ rejectionFile: ./output/acme-corp-customers-rejected.csv
+ # Default: ./output/{pipeline.name}-rejected.csv
+
+ rules:
+ - field: FIELD_NAME # Source column name (pre-transform).
+ checks:
+
+ # notNull — fails if null, undefined, empty string, or whitespace-only
+ - type: notNull
+ severity: critical
+
+ # unique — fails if value appears more than once across the full dataset
+ - type: unique
+ severity: critical
+
+ # pattern — ECMAScript regex, tested with new RegExp(value)
+ - type: pattern
+ value: "^[A-Z0-9]{3,10}$"
+ severity: warning
+ message: "Must be 3-10 uppercase alphanumeric characters"
+ # message is optional; overrides default.
+
+ # email — RFC 5322-ish email validation
+ - type: email
+ severity: warning
+
+ # ukPostcode — all current UK postcode formats; strips spaces before testing
+ - type: ukPostcode
+ severity: warning
+
+ # maxLength — maximum string length (integer)
+ - type: maxLength
+ value: 100
+ severity: warning
+
+ # min / max — numeric comparison; coerces value to float
+ - type: min
+ value: 0
+ severity: critical
+ - type: max
+ value: 500000
+ severity: warning
+
+ # allowedValues — case-sensitive array of permitted string values
+ - type: allowedValues
+ value: [GB, IE, US, DE, FR]
+ severity: warning
+
+ # Severity:
+ # critical row is rejected; pipeline halts if stopOnCritical: true
+ # warning row is flagged in rejection report but NOT removed from output
+ # info recorded in summary JSON only
+ ```
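The `notNull` semantics above boil down to one predicate; a hedged sketch (the real `Rule` interface lives in `src/dq/rules/types.ts` and may differ):

```typescript
// Hedged sketch: null, undefined, empty string, and whitespace-only
// all violate notNull; anything else passes.
function notNullViolated(value: unknown): boolean {
  return (
    value === null ||
    value === undefined ||
    (typeof value === "string" && value.trim() === "")
  );
}
```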
+
+ ---
+
+ ### `transform` section
+
+ ```yaml
+ transform:
+
+ # ── Lookup tables ─────────────────────────────────────────
+ # Loaded once at start of transform phase, cached in memory.
+ lookups:
+ - name: currencyMap # Referenced by field mappings.
+ source: # Any source adapter works here.
+ adapter: csv
+ file: ./lookups/currency-codes.csv
+ key: legacyCode # Column to match against source value.
+ value: isoCode # Column to return as resolved value.
+
+ - name: acctMgrMap
+ source:
+ adapter: mssql
+ connection: ${SOURCE_MSSQL}
+ query: "SELECT STAFF_ID as key, IFS_USER_ID as value FROM dbo.Staff"
+ key: key
+ value: value
+
+ # ── Field mappings ────────────────────────────────────────
+ fields:
+
+ # type: string
+ - from: CUST_CODE
+ to: CustomerNo
+ type: string
+ max: 20 # Optional. Truncate after cleanse.
+
+ - from: CUST_NAME
+ to: Name
+ type: string
+ max: 100
+ cleanse: trim|titleCase # Pipe-separated cleanse ops. See table below.
+
+ # type: number — coerce to integer; throws if NaN
+ - from: QTY
+ to: Quantity
+ type: number
+
+ # type: decimal — fixed precision; stored as string in staging
+ - from: CREDIT_LIMIT
+ to: CreditLimit
+ type: decimal
+ precision: 2 # Default: 2
+
+ # type: boolean
+ # Truthy: '1','true','yes','y','t' (case-insensitive). All else false.
+ - from: IS_ACTIVE
+ to: Active
+ type: boolean
+
+ # type: date — parse source date, output as dateFormat (default ISO)
+ - from: START_DATE
+ to: StartDate
+ type: date
+ format: DD/MM/YYYY # Optional source parse format (dayjs tokens).
+
+ # type: lookup — resolve via a named lookup table
+ - from: CURRENCY
+ to: CurrencyCode
+ type: lookup
+ lookup: currencyMap # Must match a lookup name above.
+ default: GBP # Emitted when lookup key not found.
+ optional: false # Default: false. true = null on miss (no error).
+
+ # type: concat — join multiple source fields
+ - from: [ADDR1, ADDR2] # Array of source field names.
+ to: Address1
+ type: concat
+ separator: ", " # Default: " "
+ cleanse: trim|nullIfEmpty
+
+ # type: constant — emit a fixed value regardless of source data
+ - to: CustomerGroup
+ type: constant
+ value: DOMESTIC
+
+ # type: expression — evaluate against source row
+ - to: SearchName
+ type: expression
+ value: "row.CUST_NAME.toUpperCase().substring(0, 20)"
+ # For logic beyond expr-eval, prefix with js:
+ # value: "js: row.PRICE * (1 - row.DISCOUNT / 100)"
+
+ # Common optional field properties:
+ # optional: true null result does not cause a pipeline error
+ # default: <val> fallback value if source is null/empty
+ # max: <n> truncate string to n chars AFTER cleanse
+ ```
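The `js:` branch can be sketched as (hedged; the real code lives in `src/transform/expression.ts`): strip the prefix and evaluate in a `vm` sandbox that exposes only `row` — never `eval()` / `new Function()`, per the conventions.

```typescript
// Hedged sketch of evaluating a `js:`-prefixed expression in a
// sandboxed vm context exposing only the source row.
import * as vm from "node:vm";

function evalJsExpression(expr: string, row: Record<string, unknown>): unknown {
  const source = expr.replace(/^js:\s*/, "");
  return vm.runInNewContext(source, { row }, { timeout: 100 });
}
```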
+
+ #### Cleanse operations reference
+
+ Applied left-to-right in the pipe chain. Defined in `src/transform/cleanse.ts`.
+
+ | Op | Example input | Example output |
+ |---|---|---|
+ | `trim` | `" hello "` | `"hello"` |
+ | `uppercase` | `"hello"` | `"HELLO"` |
+ | `lowercase` | `"HELLO"` | `"hello"` |
+ | `titleCase` | `"john smith"` | `"John Smith"` |
+ | `stripNonAlpha` | `"AB-12!"` | `"AB"` |
+ | `stripNonNumeric` | `"AB-12!"` | `"12"` |
+ | `stripWhitespace` | `"h e l l o"` | `"hello"` |
+ | `padStart:6:0` | `"42"` | `"000042"` |
+ | `truncate:20` | 21-char string | 20-char string |
+ | `nullIfEmpty` | `""` | `null` |
+ | `normaliseQuotes` | `"it\u2019s"` | `"it's"` |
+ | `normaliseUnicode` | `"caf\u00e9"` | `"cafe"` (NFD→ASCII) |
+
+ ---
+
+ ### `target` section
+
+ ```yaml
+ target:
+ adapter: ifs # REQUIRED. One of:
+                                 # bc | ifs | bluecherry | csv | pg
+
+ # ── IFS adapter ───────────────────────────────────────────
+ adapter: ifs
+ output: ./output/acme-corp-customers-ifs.csv
+ entity: CustomerInfo # IFS entity name (used in import log).
+ includeHeader: false # Default: false (standard IFS import format).
+ columnOrder: # Optional. Forces specific column ordering.
+ - CustomerNo # Must match transform 'to' field names.
+ - Name
+ - Address1
+ dateFormat: YYYY-MM-DD # Default: YYYY-MM-DD
+ delimiter: "," # Default: ","
+ encoding: utf-8 # Default: utf-8
+
+ # ── BlueCherry adapter ────────────────────────────────────
+ adapter: bluecherry
+ entity: Style # REQUIRED. One of: Style | Vendor |
+ # PurchaseOrder | PODetail | Season | ColourSize
+ output: ./output/style-co-styles-bc.csv
+ template: default # Optional. 'default' uses built-in required
+ # columns. Or path to a header-only template CSV
+ # whose first row defines column order.
+ includeHeader: true # Default: true (BlueCherry expects headers).
+ dateFormat: MM/DD/YYYY # Default: MM/DD/YYYY (BlueCherry is US-origin).
+ delimiter: ","
+ encoding: utf-8
+ nullValue: "" # How nulls are rendered. Default: ""
+
+ # ── Business Central REST adapter ─────────────────────────
+ adapter: bc
+ baseUrl: ${BC_BASE_URL}
+ company: ${BC_COMPANY}
+ entity: customers # OData entity name (lowercase, plural).
+ apiVersion: v2.0 # Default: v2.0
+ onConflict: fail # fail | upsert. Default: fail.
+ batchEndpoint: true # Use OData $batch. Default: true.
+
561
+ # ── Generic CSV adapter ───────────────────────────────────
562
+ adapter: csv
563
+ output: ./output/data.csv
564
+ includeHeader: true
565
+ delimiter: ","
566
+ encoding: utf-8
567
+ nullValue: ""
568
+
569
+ # ── PostgreSQL adapter ────────────────────────────────────
570
+ adapter: pg
571
+ connection: ${TARGET_PG}
572
+ table: customers
573
+ schema: public # Default: public
574
+ onConflict: fail # fail | upsert | ignore
575
+ upsertKey: [customer_no] # REQUIRED if onConflict: upsert
576
+ ```
577
+
578
+ ---
579
+
580
+ ### `run` section
581
+
582
+ All fields optional. Shown with defaults.
583
+
584
+ ```yaml
585
+ run:
586
+ mode: full # full | incremental | validate-only
587
+ batchSize: 500 # Rows per DuckDB insert batch.
588
+ onError: continue # continue | stop
589
+ logLevel: info # debug | info | warn | error
590
+ dryRun: false # true: DQ + transform, no output written.
591
+ outputDir: ./output # Base directory for all output files.
592
+ stagingDb: "" # DuckDB path. Default: {outputDir}/{name}.duckdb
593
+ # Set ':memory:' to force in-memory mode.
594
+ incrementalField: UPDATED_AT # Source field for incremental mode.
595
+ incrementalSince: "" # ISO datetime. If empty, reads from state file.
596
+ ```
597
+
598
+ ---
599
+
600
+ ### Full example — Acme Corp customers (MSSQL → IFS)
601
+
602
+ ```yaml
603
+ pipeline:
604
+ name: acme-corp-customers
605
+ client: acme-corp
606
+ version: "1.0"
607
+ entity: CustomerInfo
608
+ description: Customer master — legacy Sage SQL to IFS ERP
609
+
610
+ source:
611
+ adapter: mssql
612
+ connection: ${SOURCE_MSSQL}
613
+ query: |
614
+ SELECT
615
+ c.CUST_CODE, c.CUST_NAME, c.ADDR1, c.ADDR2,
616
+ c.POST_CODE, c.COUNTRY, c.EMAIL, c.TEL,
617
+ c.CREDIT_LIMIT, c.CURRENCY, c.ACCT_MGR_ID
618
+ FROM dbo.Customers c
619
+ WHERE c.Active = 1 AND c.DELETED = 0
620
+
621
+ dq:
622
+ stopOnCritical: true
623
+ rejectionFile: ./output/acme-corp-customers-rejected.csv
624
+ rules:
625
+ - field: CUST_CODE
626
+ checks:
627
+ - { type: notNull, severity: critical }
628
+ - { type: unique, severity: critical }
629
+ - { type: pattern, value: "^[A-Z0-9]{3,10}$", severity: warning }
630
+ - field: CUST_NAME
631
+ checks:
632
+ - { type: notNull, severity: critical }
633
+ - { type: maxLength, value: 100, severity: warning }
634
+ - field: POST_CODE
635
+ checks:
636
+ - { type: ukPostcode, severity: warning }
637
+ - field: EMAIL
638
+ checks:
639
+ - { type: email, severity: warning }
640
+ - field: CREDIT_LIMIT
641
+ checks:
642
+ - { type: min, value: 0, severity: critical }
643
+ - { type: max, value: 500000, severity: warning }
644
+ - field: COUNTRY
645
+ checks:
646
+ - { type: allowedValues, value: [GB, IE, US, DE, FR], severity: warning }
647
+
648
+ transform:
649
+ lookups:
650
+ - name: currencyMap
651
+ source: { adapter: csv, file: ./lookups/currency-codes.csv }
652
+ key: legacyCode
653
+ value: isoCode
654
+ - name: acctMgrMap
655
+ source:
656
+ adapter: mssql
657
+ connection: ${SOURCE_MSSQL}
658
+ query: "SELECT STAFF_ID as key, IFS_USER_ID as value FROM dbo.Staff"
659
+ key: key
660
+ value: value
661
+ fields:
662
+ - { from: CUST_CODE, to: CustomerNo, type: string, max: 20 }
663
+ - { from: CUST_NAME, to: Name, type: string, max: 100, cleanse: trim|titleCase }
664
+ - { from: [ADDR1, ADDR2], to: Address1, type: concat, separator: ", ", cleanse: trim }
665
+ - { from: POST_CODE, to: ZipCode, type: string, cleanse: trim|uppercase }
666
+ - { from: COUNTRY, to: Country, type: string, default: GB }
667
+ - { from: CURRENCY, to: CurrencyCode, type: lookup, lookup: currencyMap, default: GBP }
668
+ - { from: ACCT_MGR_ID, to: SalesmanCode, type: lookup, lookup: acctMgrMap, optional: true }
669
+ - { from: CREDIT_LIMIT, to: CreditLimit, type: decimal, precision: 2 }
670
+ - { from: EMAIL, to: Email, type: string, cleanse: trim|lowercase }
671
+ - { to: CustomerGroup, type: constant, value: DOMESTIC }
672
+ - { to: SearchName, type: expression, value: "row.CUST_NAME.toUpperCase().substring(0, 20)" }
673
+
674
+ target:
675
+ adapter: ifs
676
+ entity: CustomerInfo
677
+ output: ./output/acme-corp-customers-ifs.csv
678
+ includeHeader: false
679
+ columnOrder: [CustomerNo, Name, Address1, ZipCode, Country, CurrencyCode,
680
+ SalesmanCode, CreditLimit, Email, CustomerGroup, SearchName]
681
+
682
+ run:
683
+ mode: full
684
+ batchSize: 500
685
+ logLevel: info
686
+ dryRun: false
687
+ ```
688
+
689
+ ---
690
+
691
+ ### Full example — Style Co styles (CSV → BlueCherry)
692
+
693
+ ```yaml
694
+ pipeline:
695
+ name: style-co-styles
696
+ client: style-co
697
+ version: "1.0"
698
+ entity: Style
699
+ description: Style master migration from legacy CSV exports to BlueCherry ERP
700
+
701
+ source:
702
+ adapter: csv
703
+ file: ./data/styles-export.csv
704
+ encoding: utf-8
705
+
706
+ dq:
707
+ stopOnCritical: true
708
+ rejectionFile: ./output/style-co-styles-rejected.csv
709
+ rules:
710
+ - field: STYLE_NO
711
+ checks:
712
+ - { type: notNull, severity: critical }
713
+ - { type: unique, severity: critical }
714
+ - { type: maxLength, value: 20, severity: warning }
715
+ - field: STYLE_DESC
716
+ checks:
717
+ - { type: notNull, severity: critical }
718
+ - { type: maxLength, value: 255, severity: warning }
719
+ - field: DIVISION
720
+ checks:
721
+ - { type: notNull, severity: critical }
722
+ - { type: allowedValues, value: [WOMENS, MENS, ACCESSORIES], severity: warning }
723
+ - field: SEASON_CODE
724
+ checks:
725
+ - { type: notNull, severity: warning }
726
+ - { type: pattern, value: "^(SS|AW)[0-9]{2}$", severity: warning }
727
+ - field: COST_PRICE
728
+ checks:
729
+ - { type: min, value: 0, severity: critical }
730
+ - { type: max, value: 9999.99, severity: warning }
731
+ - field: RETAIL_PRICE
732
+ checks:
733
+ - { type: min, value: 0, severity: critical }
734
+
735
+ transform:
736
+ lookups:
737
+ - name: divisionMap
738
+ source: { adapter: csv, file: ./lookups/division-codes.csv }
739
+ key: legacyCode
740
+ value: bcCode
741
+ - name: vendorMap
742
+ source: { adapter: csv, file: ./lookups/vendor-codes.csv }
743
+ key: legacyVendorCode
744
+ value: bcVendorNo
745
+ fields:
746
+ - { from: STYLE_NO, to: StyleNo, type: string, max: 20, cleanse: trim|uppercase }
747
+ - { from: STYLE_DESC, to: StyleDesc, type: string, max: 255, cleanse: trim|normaliseUnicode }
748
+ - { from: DIVISION, to: Division, type: lookup, lookup: divisionMap }
749
+ - { from: SEASON_CODE, to: Season, type: string, max: 10 }
750
+ - { from: VENDOR_CODE, to: VendorNo, type: lookup, lookup: vendorMap, optional: true }
751
+ - { from: COST_PRICE, to: CostPrice, type: decimal, precision: 2 }
752
+ - { from: RETAIL_PRICE, to: RetailPrice, type: decimal, precision: 2 }
753
+ - { from: WEIGHT_KG, to: Weight, type: decimal, precision: 3, default: "0.000" }
754
+ - { from: COUNTRY_ORIG, to: CountryOrigin, type: string, default: GB }
755
+ - { from: FIBRE_CONTENT, to: FibreContent, type: string, max: 200, cleanse: trim }
756
+ - { to: ActiveFlag, type: constant, value: "Y" }
757
+ - { to: CreatedDate, type: expression, value: "js: new Date().toLocaleDateString('en-US')" }
758
+
759
+ target:
760
+ adapter: bluecherry
761
+ entity: Style
762
+ output: ./output/style-co-styles-bc.csv
763
+ includeHeader: true
764
+ dateFormat: MM/DD/YYYY
765
+ nullValue: ""
766
+
767
+ run:
768
+ mode: full
769
+ batchSize: 200
770
+ logLevel: info
771
+ dryRun: false
772
+ ```
773
+
774
+ ---
775
+
776
+ ## ═══════════════════════════════════════════════════════════
777
+ ## MULTI-SOURCE PIPELINES (Phase 3)
778
+ ## ═══════════════════════════════════════════════════════════
779
+
780
+ A multi-source pipeline replaces the single `source:` block with a top-level
781
+ `sources:` array (min 2 entries) plus a `merge:` block. The rest of the YAML
782
+ (`pipeline`, `dq`, `transform`, `target`, `run`) is unchanged. `PipelineSchema`
783
+ requires *either* `source` (single) *or* both `sources` + `merge` (multi) —
784
+ never both — and the CLI auto-routes multi-source configs to
785
+ `MultiSourcePipelineRunner` (see `src/cli.ts:createRunnerForPipeline`).
786
+
787
+ ### Top-level layout
788
+
789
+ ```yaml
790
+ pipeline: { ... }
791
+ sources: [ { ... }, { ... } ] # REQUIRED in multi-source mode; min 2 entries
792
+ merge: { ... } # REQUIRED when `sources` is present
793
+ dq: { ... }
794
+ transform: { ... }
795
+ target: { ... }
796
+ run: { ... }
797
+ ```
798
+
799
+ ### `sources` entries
800
+
801
+ Each entry is a `SourceConfig` with three extra multi-source-only fields:
802
+
803
+ ```yaml
804
+ sources:
805
+ - id: sql-server # REQUIRED. Lowercase alphanumeric + hyphens only;
806
+ # must be unique across the array; used as the
807
+ # staging table suffix (stg_raw_sql-server).
808
+ priority: 1 # REQUIRED. Positive integer. Lower priority =
809
+ # higher precedence in coalesce / priority-override.
810
+ adapter: mssql
811
+ connection: ${SOURCE_2_MSSQL}
812
+ query: |
813
+ SELECT STYLE_NO, STYLE_DESC, COST_PRICE FROM dbo.Styles WHERE Active = 1
814
+
815
+ - id: excel
816
+ priority: 2
817
+ adapter: xlsx
818
+ file: ./data/product-data.xlsx
819
+ sheet: "Products"
820
+ rename: # Optional. { 'old column': 'new column' }.
821
+ Style Number: STYLE_NO # Applied in-place after extract, before DQ and
822
+ Description: STYLE_DESC # merge. Intended for CSV/XLSX sources where
823
+ Fibre: FIBRE_CONTENT # column headers are fixed; SQL/REST sources
824
+ # should rename in the query or field selection.
825
+ # Unknown keys are logged as warnings, not errors.
826
+ ```
827
+
828
+ ### `merge` block
829
+
830
+ ```yaml
831
+ merge:
832
+ key: STYLE_NO # REQUIRED. Single column name or array of
833
+ # columns (composite key). Must exist in every
834
+ # source after `rename` is applied.
835
+
836
+ strategy: coalesce # Default: coalesce. One of:
837
+ # coalesce first non-null value wins
838
+ # (priority-ordered; whitespace
839
+ # treated as blank)
840
+ # priority-override highest-priority source
841
+ # wins (even if null/blank)
842
+ # union all rows from all sources
843
+ # (dedupe by key)
844
+ # intersect only rows present in ALL
845
+ # sources
846
+
847
+ onUnmatched: include # Default: include. One of:
848
+ # include (default) keep unmatched rows
849
+ # exclude drop them
850
+ # warn keep and log a warning
851
+ # error fail the pipeline
852
+ # Ignored by `intersect`, which always excludes.
853
+
854
+ fieldStrategies: # Optional. Per-field overrides of the
855
+ # top-level strategy.
856
+ - field: FIBRE_CONTENT
857
+ source: excel # Force this field to always come from the
858
+ # named source, ignoring priority.
859
+ - field: COST_PRICE
860
+ strategy: priority-override # Override just this field's strategy.
861
+
862
+ conflictLog: ./output/style-co-products-conflicts.csv
863
+ # Optional. CSV of (key, field, winning_source,
864
+ # winning_value, source_values). Only written
865
+ # when at least one conflict is detected.
866
+
867
+ incrementalSource: sql-server # REQUIRED when `run.mode: incremental`.
868
+ # Must match one of the source `id` values.
869
+ # The named source is filtered by
870
+ # `run.incrementalField` / state-file lastRunAt;
871
+ # other sources run full each time.
872
+ ```
873
+
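As a reference for the `coalesce` semantics above (first non-null value wins, priority-ordered, whitespace treated as blank), here is a hypothetical pure-function sketch of the per-field decision; the real engine performs the merge in DuckDB SQL:

```typescript
// Illustrative coalesce for one field: values are ordered by source
// priority (priority 1 first). Null, undefined, and whitespace-only
// values are skipped; the first "real" value wins.
export function coalesceField(
  valuesByPriority: Array<string | null | undefined>,
): string | null {
  for (const v of valuesByPriority) {
    if (v !== null && v !== undefined && v.trim() !== '') return v;
  }
  return null;
}
```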
874
+ ### Multi-source DQ rules
875
+
876
+ `dq.rules[].sourceId` (optional) scopes a rule to a specific pre-merge source
877
+ table. Rules without `sourceId` run post-merge against `stg_merged`:
878
+
879
+ ```yaml
880
+ dq:
881
+ stopOnCritical: true
882
+ rules:
883
+ - field: STYLE_NO # Pre-merge: runs against stg_raw_sql-server only.
884
+ sourceId: sql-server
885
+ checks:
886
+ - { type: notNull, severity: critical }
887
+ - { type: unique, severity: critical }
888
+
889
+ - field: STYLE_DESC # Post-merge: runs against stg_merged.
890
+ checks:
891
+ - { type: notNull, severity: critical }
892
+ - { type: maxLength, value: 255, severity: warning }
893
+ ```
894
+
895
+ Per-source rejection files are auto-named by appending `-{sourceId}` to the
896
+ configured `rejectionFile` stem. Rows failing a critical pre-merge rule are
897
+ filtered out of that source's staging table *before* the merge phase.
898
+
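The `-{sourceId}` suffix rule can be sketched as follows (a hypothetical helper; the function name is illustrative, not the actual internal API):

```typescript
import * as path from 'node:path';

// Append "-{sourceId}" to the rejectionFile stem, preserving the
// extension, per the naming rule described above.
export function perSourceRejectionFile(
  rejectionFile: string,
  sourceId: string,
): string {
  const ext = path.extname(rejectionFile); // e.g. ".csv"
  const stem = ext ? rejectionFile.slice(0, -ext.length) : rejectionFile;
  return `${stem}-${sourceId}${ext}`;
}
```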
899
+ ### Full example
900
+
901
+ See [tests/fixtures/style-co-products-merged.pipeline.yaml](tests/fixtures/style-co-products-merged.pipeline.yaml)
902
+ for a complete, tested multi-source pipeline (MSSQL + REST + XLSX → BlueCherry
903
+ with `coalesce` + `fieldStrategies` + `incrementalSource`).
904
+
905
+ ### Invocation
906
+
907
+ ```bash
908
+ sluice check tests/fixtures/style-co-products-merged.pipeline.yaml
909
+ sluice run tests/fixtures/style-co-products-merged.pipeline.yaml
910
+ sluice merge list-strategies
911
+ sluice merge info coalesce
912
+ ```
913
+
914
+ ---
915
+
916
+ ## ═══════════════════════════════════════════════════════════
917
+ ## ZOD SCHEMA (src/config/schema.ts)
918
+ ## ═══════════════════════════════════════════════════════════
919
+
920
+ Reproduce this schema exactly. Do not invent additional fields or rename enums.
921
+
922
+ ```typescript
923
+ import { z } from 'zod';
924
+
925
+ const Severity = z.enum(['critical', 'warning', 'info']);
926
+ const SourceAd = z.enum(['mssql', 'pg', 'csv', 'xlsx', 'rest']);
927
+ const TargetAd = z.enum(['bc', 'ifs', 'bluecherry', 'csv', 'pg', 'rest']);
928
+ const CleanseOps = z.string().regex(/^[a-zA-Z|:0-9]+$/);
929
+
930
+ const PaginationSchema = z.object({
931
+ type: z.enum(['offset', 'cursor', 'page']),
932
+ pageSize: z.number().int().positive().default(100),
933
+ pageParam: z.string().optional(),
934
+ totalField: z.string().optional(),
935
+ dataField: z.string().optional(),
936
+ cursorField: z.string().optional(),
937
+ cursorParam: z.string().optional(),
938
+ });
939
+
940
+ export const SourceSchema = z.object({
941
+ adapter: SourceAd,
942
+ connection: z.string().optional(),
943
+ query: z.string().optional(),
944
+ file: z.string().optional(),
945
+ endpoint: z.string().optional(),
946
+ headers: z.record(z.string()).optional(),
947
+ delimiter: z.string().default(','),
948
+ encoding: z.string().default('utf-8'),
949
+ sheet: z.union([z.string(), z.number()]).optional(),
950
+ pagination: PaginationSchema.optional(),
951
+ }).refine(s => s.query || s.file || s.endpoint,
952
+ { message: 'source must have query, file, or endpoint' });
953
+
954
+ const CheckType = z.enum([
955
+ 'notNull', 'unique', 'pattern', 'email', 'ukPostcode',
956
+ 'maxLength', 'min', 'max', 'allowedValues',
957
+ ]);
958
+
959
+ const CheckSchema = z.object({
960
+ type: CheckType,
961
+ value: z.union([z.string(), z.number(), z.array(z.string())]).optional(),
962
+ severity: Severity,
963
+ message: z.string().optional(),
964
+ });
965
+
966
+ const DqRuleSchema = z.object({
967
+ field: z.string(),
968
+ checks: z.array(CheckSchema).min(1),
969
+ });
970
+
971
+ export const DqSchema = z.object({
972
+ stopOnCritical: z.boolean().default(true),
973
+ rejectionFile: z.string().optional(),
974
+ rules: z.array(DqRuleSchema).default([]),
975
+ });
976
+
977
+ const LookupSchema = z.object({
978
+ name: z.string(),
979
+ source: SourceSchema,
980
+ key: z.string(),
981
+ value: z.string(),
982
+ });
983
+
984
+ const FieldType = z.enum([
985
+ 'string', 'number', 'decimal', 'boolean', 'date',
986
+ 'lookup', 'concat', 'constant', 'expression',
987
+ ]);
988
+
989
+ const FieldMappingSchema = z.object({
990
+ from: z.union([z.string(), z.array(z.string())]).optional(),
991
+ to: z.string(),
992
+ type: FieldType,
993
+ max: z.number().optional(),
994
+ precision: z.number().optional(),
995
+ format: z.string().optional(),
996
+ cleanse: CleanseOps.optional(),
997
+ lookup: z.string().optional(),
998
+ separator: z.string().optional(),
999
+ value: z.union([z.string(), z.number(), z.boolean()]).optional(),
1000
+ default: z.union([z.string(), z.number(), z.boolean(), z.null()]).optional(),
1001
+ optional: z.boolean().default(false),
1002
+ });
1003
+
1004
+ export const TransformSchema = z.object({
1005
+ lookups: z.array(LookupSchema).default([]),
1006
+ fields: z.array(FieldMappingSchema).min(1),
1007
+ });
1008
+
1009
+ export const TargetSchema = z.object({
1010
+ adapter: TargetAd,
1011
+ output: z.string().optional(),
1012
+ entity: z.string().optional(),
1013
+ includeHeader: z.boolean().optional(),
1014
+ columnOrder: z.array(z.string()).optional(),
1015
+ dateFormat: z.string().optional(),
1016
+ delimiter: z.string().default(','),
1017
+ encoding: z.string().default('utf-8'),
1018
+ nullValue: z.string().default(''),
1019
+ template: z.string().optional(),
1020
+ // BC REST
1021
+ baseUrl: z.string().optional(),
1022
+ company: z.string().optional(),
1023
+ apiVersion: z.string().default('v2.0'),
1024
+ onConflict: z.enum(['fail', 'upsert', 'ignore']).default('fail'),
1025
+ upsertKey: z.array(z.string()).optional(),
1026
+ batchEndpoint: z.boolean().default(true),
1027
+ // PostgreSQL
1028
+ connection: z.string().optional(),
1029
+ table: z.string().optional(),
1030
+ schema: z.string().default('public'),
1031
+ });
1032
+
1033
+ export const RunSchema = z.object({
1034
+ mode: z.enum(['full', 'incremental', 'validate-only']).default('full'),
1035
+ batchSize: z.number().int().positive().default(500),
1036
+ onError: z.enum(['continue', 'stop']).default('continue'),
1037
+ logLevel: z.enum(['debug', 'info', 'warn', 'error']).default('info'),
1038
+ dryRun: z.boolean().default(false),
1039
+ outputDir: z.string().default('./output'),
1040
+ stagingDb: z.string().default(''),
1041
+ // Phase 4a — enrich tuning (consumed by @caracal-lynx/sluice-enrich)
1042
+ enrichConcurrency: z.number().int().positive().default(5),
1043
+ enrichTimeoutMs: z.number().int().positive().default(5000),
1044
+ enrichMaxRetries: z.number().int().min(0).max(5).default(3),
1045
+ incrementalField: z.string().optional(),
1046
+ incrementalSince: z.string().optional(),
1047
+ });
1048
+
1049
+ export const PipelineSchema = z.object({
1050
+ pipeline: z.object({
1051
+ name: z.string().regex(/^[a-z0-9-]+$/),
1052
+ client: z.string(),
1053
+ version: z.string(),
1054
+ entity: z.string(),
1055
+ description: z.string().optional(),
1056
+ }),
1057
+ source: SourceSchema,
1058
+ enrich: EnrichSchema.optional(), // Phase 4a — runs between Extract/Merge and DQ
1059
+ dq: DqSchema,
1060
+ transform: TransformSchema,
1061
+ target: TargetSchema,
1062
+ run: RunSchema.default({}),
1063
+ });
1064
+
1065
+ // Inferred types — use these everywhere; do not write manual interfaces.
1066
+ export type Pipeline = z.infer<typeof PipelineSchema>;
1067
+ export type SourceConfig = z.infer<typeof SourceSchema>;
1068
+ export type TargetConfig = z.infer<typeof TargetSchema>;
1069
+ export type RunConfig = z.infer<typeof RunSchema>;
1070
+ export type FieldMapping = z.infer<typeof FieldMappingSchema>;
1071
+ export type DqRule = z.infer<typeof DqRuleSchema>;
1072
+ export type Lookup = z.infer<typeof LookupSchema>;
1073
+ ```
1074
+
1075
+ ### Phase 2 schema additions (already in `src/config/schema.ts`)
1076
+
1077
+ The following are forward-looking additions that extend the canonical schema above.
1078
+ They are live in the codebase and tested. Do not remove them.
1079
+
1080
+ - **`DqSchema.rulesFile`** (`z.string().optional()`) — path to a composite rule
1081
+ library YAML file. `ConfigLoader` expands composite rule references into
1082
+ built-in check types before Zod validation, so the pipeline runner only sees
1083
+ standard checks.
1084
+ - **`FieldType` includes `'custom'`** — delegates to a `TransformPlugin` via
1085
+ `customOp`. Requires `customOp` to be set (enforced by a `.refine()`).
1086
+ - **`FieldMappingSchema.customOp`** (`z.string().optional()`) — plugin ID for
1087
+ `type: custom` fields.
1088
+ - **`FieldMappingSchema.options`** (`z.record(z.unknown()).optional()`) — arbitrary
1089
+ per-plugin config passed through to the transform plugin.
1090
+ - **`FieldMappingSchema` refinement** — field types in `TYPES_REQUIRING_FROM`
1091
+ (`string`, `number`, `decimal`, `boolean`, `date`, `lookup`, `concat`) must
1092
+ declare `from`. Only `constant`, `expression`, and `custom` may omit it.
1093
+ - **`TargetSchema` refinement** — when `onConflict: 'upsert'`, a non-empty
1094
+ `upsertKey` is required (checked at config-parse time).
1095
+ - **`ToolkitConfigSchema`** — schema for `sluice.config.yaml` (toolkit-level
1096
+ plugin loading). Consumed by `PipelineRunner.loadAllPlugins()` via
1097
+ `plugins/loader.ts → loadNpmPlugins()` at the start of every run.
1098
+ - **`CompositeRuleSchema` / `CompositeRuleLibrarySchema`** — schemas for the
1099
+ shared rule library YAML files referenced by `dq.rulesFile`.
1100
+
1101
+ ### Phase 3 schema additions (multi-source merge)
1102
+
1103
+ - **`DqRuleSchema.sourceId`** (`z.string().optional()`) — scopes a rule to a
1104
+ named pre-merge source; omitted for post-merge rules.
1105
+ - **`PipelineSchema.source`** — now `optional()`; mutually exclusive with
1106
+ `sources` (enforced by `.refine()`).
1107
+ - **`PipelineSchema.sources`** (`z.array(MultiSourceEntrySchema).min(2).optional()`)
1108
+ — the multi-source array. Refinement also checks unique source ids and
1109
+ (in incremental mode) that `merge.incrementalSource` matches a source id.
1110
+ - **`PipelineSchema.merge`** (`MergeSchema.optional()`) — per-pipeline merge
1111
+ config. Defaults: `strategy: 'coalesce'`, `onUnmatched: 'include'`.
1112
+ - **`MergeSchema`** — `key`, `strategy`, `onUnmatched`, `fieldStrategies[]`,
1113
+ `conflictLog`, `incrementalSource`.
1114
+ - **`MergeFieldStrategySchema`** — per-field override: `field`, optional
1115
+ `strategy`, optional `source` (at least one required).
1116
+ - **`MultiSourceEntrySchema`** — extends `SourceBaseSchema` with `id`,
1117
+ `priority`, and optional `rename`.
1118
+ - **`isSingleSource(p)` / `isMultiSource(p)`** — exported type guards that
1119
+ narrow `Pipeline` to the single- or multi-source shape.
1120
+
1121
+ ---
1122
+
1123
+ ## ═══════════════════════════════════════════════════════════
1124
+ ## PLUGIN INTERFACES
1125
+ ## ═══════════════════════════════════════════════════════════
1126
+
1127
+ ### SourceAdapter (src/adapters/source/types.ts)
1128
+
1129
+ ```typescript
1130
+ export interface SourceAdapter {
1131
+ readonly id: string;
1132
+ connect(config: SourceConfig): Promise<void>;
1133
+ extract(
1134
+ config: SourceConfig,
1135
+ store: StagingStore,
1136
+ runConfig: RunConfig,
1137
+ onProgress: (rows: number) => void,
1138
+ targetTable?: string // defaults to 'stg_raw'; set per-source in
1139
+ // multi-source pipelines
1140
+ ): Promise<ExtractResult>;
1141
+ disconnect(): Promise<void>;
1142
+ }
1143
+
1144
+ export interface ExtractResult {
1145
+ rowsExtracted: number;
1146
+ tableName: string; // caller-supplied; 'stg_raw' for single-source,
1147
+ // 'stg_raw_{sourceId}' for each source in a
1148
+ // multi-source pipeline
1149
+ columns: ColumnMeta[];
1150
+ }
1151
+
1152
+ export interface ColumnMeta {
1153
+ name: string;
1154
+ duckDbType: string; // VARCHAR | BIGINT | DOUBLE | BOOLEAN | TIMESTAMP
1155
+ }
1156
+ ```
1157
+
1158
+ ### TargetAdapter (src/adapters/target/types.ts)
1159
+
1160
+ ```typescript
1161
+ export interface TargetAdapter {
1162
+ readonly id: string;
1163
+ connect(config: TargetConfig): Promise<void>;
1164
+ load(
1165
+ config: TargetConfig,
1166
+ store: StagingStore,
1167
+ runConfig: RunConfig,
1168
+ onProgress: (rows: number) => void
1169
+ ): Promise<LoadResult>;
1170
+ disconnect(): Promise<void>;
1171
+ }
1172
+
1173
+ export interface LoadResult {
1174
+ rowsLoaded: number;
1175
+ rowsFailed: number;
1176
+ outputPath?: string; // set for file-based targets
1177
+ }
1178
+ ```
1179
+
1180
+ ### DQ Rule (src/dq/rules/types.ts)
1181
+
1182
+ ```typescript
1183
+ export interface Rule {
1184
+ readonly id: string;
1185
+ validate(
1186
+ value: unknown,
1187
+ config: CheckConfig,
1188
+ rowIndex: number,
1189
+ field: string
1190
+ ): RuleViolation | null;
1191
+ }
1192
+
1193
+ export interface RuleViolation {
1194
+ field: string;
1195
+ rowIndex: number;
1196
+ value: unknown;
1197
+ rule: string;
1198
+ severity: 'critical' | 'warning' | 'info';
1199
+ message: string;
1200
+ }
1201
+ ```
1202
+
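For concreteness, a sketch of how a built-in check such as `notNull` might implement this interface. The `CheckConfig` shape here is an assumption pared down to what the sketch needs:

```typescript
// Assumed minimal CheckConfig; the real type carries the full check schema.
interface CheckConfig {
  severity: 'critical' | 'warning' | 'info';
  message?: string;
}

interface RuleViolation {
  field: string;
  rowIndex: number;
  value: unknown;
  rule: string;
  severity: CheckConfig['severity'];
  message: string;
}

// Example Rule implementation: empty string treated as null, matching
// typical DQ behaviour for CSV-sourced data (an assumption here).
export const notNullRule = {
  id: 'notNull',
  validate(
    value: unknown,
    config: CheckConfig,
    rowIndex: number,
    field: string,
  ): RuleViolation | null {
    if (value !== null && value !== undefined && value !== '') return null;
    return {
      field,
      rowIndex,
      value,
      rule: 'notNull',
      severity: config.severity,
      message: config.message ?? `${field} must not be null or empty`,
    };
  },
};
```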
1203
+ ### MergeStrategyPlugin (src/merge/types.ts)
1204
+
1205
+ ```typescript
1206
+ export interface MergeSourceMeta {
1207
+ id: string;
1208
+ priority: number;
1209
+ tableName: string; // e.g. 'stg_raw_sql-server'
1210
+ }
1211
+
1212
+ export interface MergeResult {
1213
+ rowsMerged: number;
1214
+ conflicts: number; // fields where two non-null values disagreed
1215
+ unmatched: number; // records present in only one source
1216
+ tableName: 'stg_merged';
1217
+ }
1218
+
1219
+ export interface MergeStrategyPlugin {
1220
+ readonly id: string; // matches MergeSchema.strategy value
1221
+ readonly description?: string; // shown by `sluice merge list-strategies`
1222
+
1223
+ merge(
1224
+ store: StagingStore,
1225
+ sources: MergeSourceMeta[], // priority-ordered (priority 1 first)
1226
+ config: MergeConfig,
1227
+ ): Promise<MergeResult>;
1228
+ }
1229
+ ```
1230
+
1231
+ Built-in strategies: `coalesce`, `priority-override`, `union`, `intersect`
1232
+ (all pre-registered in `MergeStrategyRegistry`; live in
1233
+ `src/merge/strategies/*.ts`). Custom strategies can be dropped into a
1234
+ `plugins/` folder as `*.merge.ts` files exporting `const mergeStrategy`.
1235
+
1236
+ ---
1237
+
1238
+ ## ═══════════════════════════════════════════════════════════
1239
+ ## ADAPTER IMPLEMENTATION NOTES
1240
+ ## ═══════════════════════════════════════════════════════════
1241
+
1242
+ ### mssql source
1243
+
1244
+ - Stream results: `request.stream = true` with `recordset`/`row`/`done` events.
1245
+ - SQL Server → DuckDB type map: `varchar/nvarchar/char → VARCHAR`,
1246
+ `int/bigint → BIGINT`, `decimal/numeric/money → DOUBLE`,
1247
+ `bit → BOOLEAN`, `datetime/date → TIMESTAMP`, `float/real → DOUBLE`.
1248
+ - Trusted connection: detect `trustedConnection: true` in JSON connection config.
1249
+
1250
+ ### csv source
1251
+
1252
+ - `csv-parse` options: `{ columns: true, skip_empty_lines: true, bom: true }`.
1253
+ `bom: true` strips the UTF-8 BOM common in Excel-generated CSVs.
1254
+ - All columns inferred as `VARCHAR` in DuckDB.
1255
+ - Support glob patterns: concatenate all matching files into a single staging table.
1256
+
1257
+ ### xlsx source
1258
+
1259
+ - SheetJS: convert to CSV via `xlsx.utils.sheet_to_csv`, then pipe through csv-parse.
1260
+ - Log a warning if workbook has more than one sheet and `source.sheet` is unset.
1261
+
1262
+ ### rest source
1263
+
1264
+ - `axios-retry`: 3 retries, exponential backoff, retry on 429 and 5xx.
1265
+ - Flatten nested JSON using `__` separator (`address.postCode` → `address__postCode`).
1266
+ - All three pagination types must be supported: offset, page, cursor.
1267
+
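The `__` flattening convention can be sketched as below (arrays are passed through untouched in this sketch; the real adapter's array handling may differ):

```typescript
// Flatten nested JSON objects using "__" as the path separator, e.g.
// { address: { postCode: 'AB1' } } → { address__postCode: 'AB1' }.
export function flatten(
  obj: Record<string, unknown>,
  prefix = '',
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [k, v] of Object.entries(obj)) {
    const key = prefix ? `${prefix}__${k}` : k;
    if (v !== null && typeof v === 'object' && !Array.isArray(v)) {
      Object.assign(out, flatten(v as Record<string, unknown>, key));
    } else {
      out[key] = v; // leaves, nulls, and arrays kept as-is
    }
  }
  return out;
}
```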
1268
+ ### IFS target
1269
+
1270
+ - UTF-8 CSV via `csv-stringify`.
1271
+ - `includeHeader` defaults to `false` for this adapter.
1272
+ - Apply `target.columnOrder` if specified.
1273
+ - Format date columns using `dayjs` with `target.dateFormat` (default `YYYY-MM-DD`).
1274
+
1275
+ ### BlueCherry target (src/adapters/target/bluecherry.ts)
1276
+
1277
+ BlueCherry ERP (CGS — Computer Generated Solutions) uses fixed-format CSV for
1278
+ bulk import. Each entity type has a required column set. The adapter validates
1279
+ required columns at `connect()` time, before any data is read.
1280
+
1281
+ **Required columns per entity:**
1282
+
1283
+ ```typescript
1284
+ const REQUIRED_COLUMNS: Record<string, string[]> = {
1285
+ Style: [
1286
+ 'StyleNo', 'StyleDesc', 'Division', 'Season',
1287
+ 'CostPrice', 'RetailPrice', 'ActiveFlag',
1288
+ ],
1289
+ Vendor: [
1290
+ 'VendorNo', 'VendorName', 'Country', 'CurrencyCode',
1291
+ ],
1292
+ PurchaseOrder: [
1293
+ 'PONumber', 'VendorNo', 'Season', 'OrderDate', 'DeliveryDate',
1294
+ ],
1295
+ PODetail: [
1296
+ 'PONumber', 'StyleNo', 'ColourCode', 'SizeCode', 'Quantity', 'CostPrice',
1297
+ ],
1298
+ Season: [
1299
+ 'SeasonCode', 'SeasonDesc', 'StartDate', 'EndDate',
1300
+ ],
1301
+ ColourSize: [
1302
+ 'StyleNo', 'ColourCode', 'ColourDesc', 'SizeCode', 'SizeDesc',
1303
+ ],
1304
+ };
1305
+ ```
1306
+
1307
+ **Behaviour:**
1308
+ - `includeHeader` defaults to `true`.
1309
+ - Default `dateFormat` is `MM/DD/YYYY` (BlueCherry is US-origin software).
1310
+ - Any column whose name ends with `Date` (case-insensitive) is automatically
1311
+ formatted using `target.dateFormat` via `dayjs`.
1312
+ - `nullValue` (default `""`) is used for all null/undefined fields.
1313
+ - At `connect()`:
1314
+ 1. Verify `target.entity` is in `REQUIRED_COLUMNS`. Throw `ConfigError` if not.
1315
+ 2. Query `store.columnNames('stg_transformed')` and verify all required columns
1316
+ are present. Throw `ConfigError` listing any missing columns.
1317
+ 3. If `target.template` is a file path, read its header row and use it as the
1318
+ definitive column order for the output. If `target.template === 'default'`,
1319
+ use the required columns list as column order, with any additional columns
1320
+ from `stg_transformed` appended.
1321
+
1322
+ **Note on BlueCherry column names:** The column names in `REQUIRED_COLUMNS` are
1323
+ internal conventions for this toolkit. Verify them against the actual BlueCherry
1324
+ import documentation before running a live migration. The `template` feature exists
1325
+ precisely to override these if the client's BlueCherry instance uses different names.
1326
+
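A minimal sketch of the `connect()`-time required-column check described in step 2 above (the helper name is illustrative, not the adapter's actual internal API):

```typescript
// Given the entity's required column list and the columns present in
// 'stg_transformed', return the missing ones (empty array = valid).
export function missingColumns(required: string[], staged: string[]): string[] {
  const have = new Set(staged);
  return required.filter((c) => !have.has(c));
}
```

The adapter would throw a `ConfigError` listing the result whenever it is non-empty.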
1327
+ ### Business Central REST target
1328
+
1329
+ - OAuth2 client credentials: `POST https://login.microsoftonline.com/{tenantId}/oauth2/v2.0/token`
1330
+ - Cache token in memory; refresh 60 seconds before expiry.
1331
+ - OData `$batch`: `POST {baseUrl}/api/{version}/companies({company})/$batch`
1332
+ with `Content-Type: multipart/mixed; boundary=batch_{uuid}`.
1333
+ Maximum 100 operations per batch request.
1334
+ - HTTP 409 with `onConflict: upsert` → issue PATCH to individual entity URL.
1335
+ - HTTP 4xx (non-409): log error, increment `rowsFailed`, continue if
1336
+ `run.onError: continue`.
1337
+
1338
+ ---
1339
+
1340
+ ## ═══════════════════════════════════════════════════════════
1341
+ ## PIPELINE RUNNER — EXECUTION ORDER
1342
+ ## ═══════════════════════════════════════════════════════════
1343
+
1344
+ **Important:** `ConfigLoader.load()` interpolates `${ENV_VAR}` tokens from
1345
+ `process.env` but does **not** call `loadEnv()` / `dotenv.config()` itself.
1346
+ The CLI entry point must call `loadEnv()` before invoking the loader. This keeps
1347
+ `ConfigLoader` side-effect-free and testable (tests stub `process.env` directly).
1348
+
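A sketch of that interpolation contract — resolve each `${ENV_VAR}` token from a supplied env map rather than touching `process.env` directly (the error message wording here is an assumption):

```typescript
// Replace ${ENV_VAR} tokens with values from `env`; throw on any
// unresolved token so misconfiguration fails fast at load time.
export function interpolate(
  text: string,
  env: Record<string, string | undefined>,
): string {
  return text.replace(/\$\{([A-Z0-9_]+)\}/g, (_m, name: string) => {
    const v = env[name];
    if (v === undefined) throw new Error(`Missing environment variable: ${name}`);
    return v;
  });
}
```

Tests can pass a plain object for `env`; the CLI passes `process.env` after `loadEnv()` has run.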
1349
+ ```
1350
+ 1. Load + validate config ConfigLoader.load(yamlPath)
1351
+ 2. Resolve output directory create if not exists
1352
+ 3. Open DuckDB staging store StagingStore.open(dbPath)
1353
+ 4. Connect source adapter
1354
+ 5. Extract → 'stg_raw' log: rows extracted
1355
+ 5a. Disconnect source adapter always in finally
1356
+ 5b. Phase 4a Enrich (optional) runs only when:
1357
+ - `enrich:` block configured
1358
+ - --no-enrich NOT set
1359
+ - mode != validate-only and not dryRun
1360
+ - @caracal-lynx/sluice-enrich is installed
1361
+ and has called registerEnrichPhase()
1362
+ Otherwise skipped (WARN log if last bullet
1363
+ fails). Writes new columns to 'stg_raw'.
1364
+ 6. Run DQ rules against 'stg_raw'
1365
+ a. Collect all RuleViolations
1366
+ b. Write rejection CSV
1367
+ c. Write summary JSON
1368
+ d. Log DQ summary (info)
1369
+ e. If stopOnCritical AND criticalCount > 0 → throw PipelineDQError
1370
+ 7. Resolve all lookups LookupResolver.loadAll()
1371
+ 8. Transform 'stg_raw' → 'stg_transformed' (batch by batchSize)
1372
+ 9. If dryRun === true → STOP (log summary, exit 0)
1373
+ 10. If mode === 'validate-only' → STOP (log summary, exit 0)
1374
+ 11. Connect target adapter
1375
+ 12. Load 'stg_transformed' → target
1376
+ 12a.Disconnect target adapter always in finally
1377
+ 13. Close DuckDB staging store always in finally
1378
+ 14. Write run state file {outputDir}/{name}-state.json
1379
+ 15. Log final summary (info)
1380
+ ```

**Run state file** `{outputDir}/{name}-state.json`:
```json
{
  "pipeline": "acme-corp-customers",
  "lastRunAt": "2026-04-15T09:30:00.000Z",
  "lastMode": "full",
  "rowsExtracted": 1842,
  "rowsLoaded": 1801,
  "criticalViolations": 0,
  "warnings": 41,
  "incrementalSince": ""
}
```

Used by `mode: incremental` to auto-determine the `since` timestamp.
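One plausible resolution order, sketched as a pure function over the parsed state file — the field names follow the example above, but the helper name, the CLI override parameter, and the exact precedence are assumptions, not the real runner's code:

```typescript
// Hypothetical helper: derive the incremental `since` timestamp from the
// run state file. Precedence shown here (explicit override → stored
// incrementalSince → lastRunAt → full extract) is an assumption.
interface RunState {
  lastRunAt?: string;
  incrementalSince?: string;
}

export function resolveIncrementalSince(
  state: RunState | undefined,
  cliSince?: string, // hypothetical explicit override, e.g. from a CLI flag
): string | undefined {
  if (cliSince) return cliSince;
  if (!state) return undefined; // no prior run: fall back to a full extract
  // Prefer an explicitly recorded incrementalSince; otherwise last run time.
  return state.incrementalSince || state.lastRunAt || undefined;
}
```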

### Multi-source execution order (`MultiSourcePipelineRunner`)

For a pipeline with `sources` + `merge`, the CLI selects
`MultiSourcePipelineRunner` (a subclass of `PipelineRunner` that overrides
`run()`, `profile()`, and `writeStateFile()` and reuses the protected
`runExtract`, `runDQ`, `runTransform`, `runLoad` phase methods).

```
 1. Load + validate config        ConfigLoader.load(yamlPath)
 2. Load plugins                  files + sluice.config.yaml (Tier 2/3)
 3. Resolve output dir, open DuckDB staging store
 4. For each source (priority-ordered):
    a. runExtract → 'stg_raw_{sourceId}'
    b. If source.rename is set    StagingStore.renameColumns(...)
    c. If mode: incremental AND source.id === merge.incrementalSource:
       apply TRY_CAST(... AS TIMESTAMP) >= since filter
    d. Filter dq.rules by sourceId; runDQ against 'stg_raw_{sourceId}'
       (writes per-source rejection CSV, stops on critical)
    e. Rewrite 'stg_raw_{sourceId}' to only the accepted rows
 5. MergeEngine.run(store, sources, merge)
    → creates 'stg_merge_joined', 'stg_merged', 'stg_merge_conflicts'
    → writes conflictLog CSV if configured
 5a. Phase 4a Enrich (optional)   runs once against 'stg_merged' if
                                  `enrich:` block is present and the four
                                  gating conditions hold (see single-source
                                  step 5b above). Single post-merge pass —
                                  never per-source.
 6. runDQ on the post-merge rules (no sourceId) against 'stg_merged'
 7. Filter rejected rows; runTransform against the filtered merge result
 8. If dryRun OR validate-only → STOP
 9. runLoad → target adapter reads 'stg_transformed'
10. writeStateFile → per-source lastRunAt block + top-level summary
11. Close DuckDB
```

**Multi-source state file** adds a `sources` block keyed by source id:

```json
{
  "pipeline": "style-co-products-merged",
  "lastRunAt": "2026-04-19T09:30:00.000Z",
  "lastMode": "incremental",
  "rowsMerged": 3201,
  "rowsLoaded": 3188,
  "criticalViolations": 0,
  "warnings": 14,
  "incrementalSince": "",
  "sources": {
    "sql-server": {
      "lastRunAt": "2026-04-19T09:30:00.000Z",
      "rowsExtracted": 2910,
      "incrementalSince": "2026-04-18T22:00:00.000Z"
    },
    "excel": { "lastRunAt": "...", "rowsExtracted": 412, "incrementalSince": "" }
  }
}
```

---

## ═══════════════════════════════════════════════════════════
## DUCKDB STAGING STORE (src/staging/store.ts)
## ═══════════════════════════════════════════════════════════

```typescript
class StagingStore {
  constructor(private dbPath: string) {} // ':memory:' for dryRun/tests

  async open(): Promise<void>
  async close(): Promise<void>
  async createTable(name: string, columns: ColumnMeta[]): Promise<void>
  async insertBatch(table: string, rows: Record<string, unknown>[]): Promise<void>
  async query<T>(sql: string, params?: unknown[]): Promise<T[]>
  async tableExists(name: string): Promise<boolean>
  async dropTable(name: string): Promise<void>
  async rowCount(table: string): Promise<number>
  async columnNames(table: string): Promise<string[]>
  async exportToCsv(
    table: string,
    outputPath: string,
    options?: { delimiter?: string; header?: boolean; encoding?: string }
  ): Promise<void>
  async renameColumns(              // Phase 3: used by MultiSourcePipelineRunner
    tableName: string,              // after a per-source extract. Implemented as
    renames: Record<string, string> // CREATE OR REPLACE TABLE ... AS SELECT ...
  ): Promise<void>                  // Unknown keys log a warning, not an error.
}
```

Default DuckDB path: `{outputDir}/{pipelineName}.duckdb`
Use `':memory:'` when `dryRun: true` or `stagingDb: ':memory:'`.
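The `renameColumns` contract above (CREATE OR REPLACE TABLE … AS SELECT, warn on unknown keys) can be sketched as pure SQL generation — `buildRenameSql` is a hypothetical helper, not the store's actual internals:

```typescript
// Sketch of the SQL renameColumns() could emit, per the signature above.
// Unknown rename keys produce a warning, never an error, matching the spec.
export function buildRenameSql(
  table: string,
  existingColumns: string[],
  renames: Record<string, string>,
  warn: (msg: string) => void,
): string {
  for (const key of Object.keys(renames)) {
    if (!existingColumns.includes(key)) {
      warn(`renameColumns: column "${key}" not found in ${table}`); // warn, don't throw
    }
  }
  // Every existing column survives; renamed ones get an AS alias.
  const selectList = existingColumns
    .map((col) => (renames[col] ? `"${col}" AS "${renames[col]}"` : `"${col}"`))
    .join(', ');
  return `CREATE OR REPLACE TABLE "${table}" AS SELECT ${selectList} FROM "${table}"`;
}
```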

---

## ═══════════════════════════════════════════════════════════
## TRANSFORM ENGINE (src/transform/engine.ts)
## ═══════════════════════════════════════════════════════════

### Field type behaviours

| type | behaviour |
|---|---|
| `string` | `String(value)`, cleanse ops, then truncate to `max` |
| `number` | `Math.round(Number(value))`. Throw `TransformError` if NaN. |
| `decimal` | `parseFloat(value).toFixed(precision)` stored as string |
| `boolean` | `['1','true','yes','y','t'].includes(String(v).toLowerCase())` |
| `date` | Parse with `dayjs(value, format)`; output as `target.dateFormat` or ISO |
| `lookup` | `LookupResolver.resolve(lookupName, value)` |
| `concat` | Join `from[]` with `separator`, then cleanse |
| `constant` | Emit `value` verbatim |
| `expression` | `ExpressionEvaluator.evaluate(expression, row)` |
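The scalar coercions in the table translate almost directly into code. A sketch of the four simplest ones (function names are illustrative; the real engine may structure this differently):

```typescript
// Sketches of the string/number/decimal/boolean rows of the table above.
export const TRUTHY = ['1', 'true', 'yes', 'y', 't'];

export function coerceBoolean(v: unknown): boolean {
  return TRUTHY.includes(String(v).toLowerCase());
}

export function coerceNumber(v: unknown): number {
  const n = Math.round(Number(v));
  // Per the table: NaN is a TransformError (plain Error used in this sketch).
  if (Number.isNaN(n)) throw new Error(`TransformError: not a number: ${String(v)}`);
  return n;
}

export function coerceDecimal(v: unknown, precision: number): string {
  return parseFloat(String(v)).toFixed(precision); // stored as string
}

export function coerceString(v: unknown, max?: number): string {
  const s = String(v); // cleanse ops would run here, before truncation
  return max !== undefined ? s.slice(0, max) : s;
}
```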

### Expression evaluator (src/transform/expression.ts)

**Must not use `eval()` or `new Function()`.**

1. Expression does NOT start with `js:` → use `expr-eval` Parser.
   Provide `row` as a variable containing all source field values.
2. Expression starts with `js:` → strip prefix, execute via
   `vm.runInNewContext(code, { row, Date, Math, JSON, String, Number, Boolean })`.
   Log a `warn` whenever the `js:` path is taken.
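The `js:` branch alone, sketched with Node's built-in `vm` module (the non-`js:` branch would go through the `expr-eval` Parser and is omitted here). The sandbox exposes only the globals listed above, so `process`, `require`, etc. are undefined inside it:

```typescript
import * as vm from 'node:vm';

// Sketch of the `js:` escape hatch. The caller is expected to log a warn
// before taking this path, per the spec.
export function evalJsExpression(
  expression: string,
  row: Record<string, unknown>,
): unknown {
  const code = expression.slice('js:'.length);
  // Only these globals exist inside the sandbox context.
  return vm.runInNewContext(code, { row, Date, Math, JSON, String, Number, Boolean });
}
```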

---

## ═══════════════════════════════════════════════════════════
## DQ REPORTER OUTPUT (src/dq/reporter.ts)
## ═══════════════════════════════════════════════════════════

**Rejection CSV** columns: `row_index`, `field`, `value`, `rule`, `severity`, `message`

**Summary JSON** (`{outputDir}/{name}-dq-summary.json`):
```json
{
  "pipeline": "acme-corp-customers",
  "runAt": "2026-04-15T09:30:00Z",
  "rowsChecked": 1842,
  "rowsPassed": 1801,
  "rowsRejected": 41,
  "violations": { "critical": 0, "warning": 38, "info": 3 },
  "byField": {
    "POST_CODE": { "critical": 0, "warning": 22 },
    "EMAIL": { "critical": 0, "warning": 16 }
  }
}
```
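How the reporter could aggregate a flat violation list into that summary shape — field names match the JSON above, but the helper name and the rule that only non-`info` violations reject a row are assumptions:

```typescript
// Sketch: fold RuleViolations into the summary JSON shape shown above.
type Severity = 'critical' | 'warning' | 'info';
interface RuleViolation { rowIndex: number; field: string; severity: Severity; }

export function summariseViolations(rowsChecked: number, violations: RuleViolation[]) {
  const bySeverity: Record<Severity, number> = { critical: 0, warning: 0, info: 0 };
  const byField: Record<string, Partial<Record<Severity, number>>> = {};
  const rejectedRows = new Set<number>(); // a row with several violations rejects once
  for (const v of violations) {
    bySeverity[v.severity] += 1;
    const fieldCounts = byField[v.field] ?? (byField[v.field] = {});
    fieldCounts[v.severity] = (fieldCounts[v.severity] ?? 0) + 1;
    if (v.severity !== 'info') rejectedRows.add(v.rowIndex); // assumption: info never rejects
  }
  return {
    rowsChecked,
    rowsPassed: rowsChecked - rejectedRows.size,
    rowsRejected: rejectedRows.size,
    violations: bySeverity,
    byField,
  };
}
```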

---

## ═══════════════════════════════════════════════════════════
## ERROR TYPES (src/utils/errors.ts)
## ═══════════════════════════════════════════════════════════

```typescript
export class PipelineError extends Error {
  constructor(message: string, public readonly cause?: unknown) {
    super(message);
    this.name = this.constructor.name;
    if (Error.captureStackTrace) {
      Error.captureStackTrace(this, this.constructor);
    }
  }
}
export class ConfigError extends PipelineError {}
export class SourceError extends PipelineError {}
export class StagingError extends PipelineError {}
export class DQError extends PipelineError {}
export class PipelineDQError extends DQError {
  constructor(
    public readonly criticalCount: number,
    public readonly reportPath: string,
  ) {
    super(`Pipeline halted: ${criticalCount} critical DQ violations. See ${reportPath}`);
  }
}
export class TransformError extends PipelineError {}
export class ExpressionError extends TransformError {}
export class LoadError extends PipelineError {}
export class EnrichError extends PipelineError {} // Phase 4a — exit code 4
```

All error subclasses inherit `this.name = this.constructor.name` from
`PipelineError`, so `err.name` reflects the actual class (e.g. `"ConfigError"`,
`"PipelineDQError"`). `Error.captureStackTrace` (V8-only) trims the constructor
frame from stack traces for cleaner output.
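A plausible mapping from error class to process exit code, consistent with the exit codes listed in the CLI section of this spec. The stand-in classes below use plain `Error` constructors to stay self-contained; the real hierarchy (and whatever `cli.ts` actually does) may differ:

```typescript
// Stand-in error classes — the real ones live in src/utils/errors.ts and
// have richer constructors (e.g. PipelineDQError takes criticalCount, reportPath).
class PipelineError extends Error {}
class ConfigError extends PipelineError {}
class DQError extends PipelineError {}
class PipelineDQError extends DQError {}
class EnrichError extends PipelineError {}

// Hypothetical mapping in the CLI's top-level catch block.
export function exitCodeFor(err: unknown): number {
  if (err instanceof ConfigError) return 3;     // config error
  if (err instanceof PipelineDQError) return 2; // DQ critical violations
  if (err instanceof EnrichError) return 4;     // enrich error (Phase 4a)
  return 1;                                     // any other pipeline error
}
```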

---

## ═══════════════════════════════════════════════════════════
## CLI (src/cli.ts)
## ═══════════════════════════════════════════════════════════

```
sluice run <pipeline.yaml>        Full pipeline run (auto-detects single vs multi-source)
sluice validate <pipeline.yaml>   DQ + transform only; no load
sluice profile <pipeline.yaml>    Extract + column profiling; no DQ
sluice check <pipeline.yaml>      Config validation only; no execution
sluice plugins                    List all loaded rule/transform/merge plugins
sluice merge list-strategies      List all registered merge strategies
sluice merge info <strategy>      Show details about a specific merge strategy

Global options:
  --log-level <level>   debug | info | warn | error
  --env <file>          Path to .env file (default: ./.env)
  --output <dir>        Override outputDir
  --plugins <dir...>    Additional plugin directory/directories to load
  --dry-run             Force dryRun: true
  --silent              Suppress the progress bar on stdout (logs still go to stderr)

`sluice run` options:
  --no-enrich           Skip the Phase 4a enrich phase even if `enrich:` is configured.
                        (validate / profile / check do not run enrich at all, by design.)
```

**Progress feedback:** `sluice run`, `sluice validate`, and `sluice profile`
render a phase-by-phase progress bar to stdout via
`src/utils/progress.ts → ProgressReporter`, with per-phase emoji icons
(🔎 extract · 🛡️ DQ · 🔀 merge · 🌐 enrich · 🔧 transform · 📤 load), an ETA for
determinate phases, and a coloured ✅/⚠️/❌ run-summary line. The bar
degrades gracefully:
- `--silent` → no stdout output at all
- `--log-level debug` → bar disabled; per-row debug lines are used instead
- `process.stdout.isTTY === false` → plain-ASCII lines (one per phase),
  no emojis, no ANSI escapes — log-file friendly
- `NO_COLOR` env var set → ANSI colour dropped (handled by `picocolors`)

**Exit codes:** `0` success · `1` pipeline error · `2` DQ critical violations · `3` config error · `4` enrich error (Phase 4a)
1624
+
1625
+ ---
1626
+
1627
+ ## ═══════════════════════════════════════════════════════════
1628
+ ## LOGGING (src/utils/logger.ts)
1629
+ ## ═══════════════════════════════════════════════════════════
1630
+
1631
+ Single `pino` instance. All log records (every level) go to **stderr**; stdout
1632
+ is reserved exclusively for the progress bar and final summary rendered by
1633
+ `ProgressReporter`. This mirrors how git, cargo, and npm split streams.
1634
+
1635
+ No `console.log` in `src/`. Operators who want logs in a file can run
1636
+ `sluice run p.yaml 2>run.log` — the bar stays visible on the terminal while
1637
+ every pino record is captured to the file. Use `--log-level error` to narrow
1638
+ the file to errors only.
1639
+
1640
+ | Level | Used for |
1641
+ |---|---|
1642
+ | `debug` | Per-row progress, SQL queries, lookup cache hits |
1643
+ | `info` | Phase transitions, row counts, file paths, run summary |
1644
+ | `warn` | DQ warnings, missing optional lookups, `js:` expression usage |
1645
+ | `error` | All caught errors before re-throw |
1646
+
1647
+ Dev: `npx sluice run pipeline.yaml | npx pino-pretty`
1648
+
1649
+ ---
1650
+
1651
+ ## ═══════════════════════════════════════════════════════════
1652
+ ## ENVIRONMENT VARIABLES (.env.example)
1653
+ ## ═══════════════════════════════════════════════════════════
1654
+
1655
+ ```bash
1656
+ # ── Acme Corp — source ────────────────────────────────────
1657
+ SOURCE_MSSQL=mssql://user:password@serverlegacy.example.local/LegacyDB
1658
+
1659
+ # ── Acme Corp — IFS target ────────────────────────────────
1660
+ IFS_IMPORT_PATH=C:\IFS\Import
1661
+
1662
+ # ── Business Central target (any client using the `bc` adapter) ──
1663
+ BC_BASE_URL=https://api.businesscentral.dynamics.com/v2.0
1664
+ BC_TENANT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
1665
+ BC_CLIENT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
1666
+ BC_CLIENT_SECRET=your-client-secret
1667
+ BC_COMPANY=Example Company Ltd
1668
+
1669
+ # ── Style Co — source ───────────────────────────────────
1670
+ SOURCE_2_MSSQL=mssql://user:password@serverlegacy2.example.local/LegacyDB
1671
+
1672
+ # ── Style Co — BlueCherry (file-based; no API creds) ───
1673
+ BC_IMPORT_PATH=C:\BlueCherry\Import
1674
+
1675
+ # ── Runtime ───────────────────────────────────────────────────
1676
+ NODE_ENV=development
1677
+ LOG_LEVEL=info
1678
+ ```

---

## ═══════════════════════════════════════════════════════════
## TESTING
## ═══════════════════════════════════════════════════════════

- **Vitest only.** No Jest.
- Unit tests: mock all I/O with `vi.mock`.
- Integration tests: real DuckDB (`:memory:`) + CSV fixtures.
- No tests against live SQL Server, BC, IFS, or BlueCherry.
- Target: 80% line coverage across `src/dq/` and `src/transform/`.
- Both full example pipelines in this file must parse cleanly in the config tests.

**Required test cases:**

Config loader: `${ENV_VAR}` resolution · missing var → `ConfigError` ·
invalid YAML → `ZodError` · minimal pipeline with all defaults · both example
pipelines in this spec parse cleanly.

DQ engine: `notNull` on null/empty/whitespace · `unique` with duplicates ·
`ukPostcode` valid and invalid formats · `allowedValues` case sensitivity ·
`stopOnCritical` throws `PipelineDQError` · reporter writes correct CSV and JSON.

Transform engine: `concat` with separator · `lookup` miss + `optional: true` → null ·
`lookup` miss + `optional: false` → `TransformError` · `expression` basic eval ·
`expression` with `js:` prefix · `cleanse: trim|titleCase` · `cleanse: padStart:6:0` ·
`cleanse: normaliseUnicode` · `type: date` with `format: DD/MM/YYYY` ·
`type: boolean` all truthy/falsy variants.

BlueCherry adapter: missing required column → `ConfigError` at `connect()` ·
date columns formatted with `target.dateFormat` · header row present ·
`nullValue` respected · `template` CSV used as column order.

Staging store: insert/query round-trip all DuckDB types · `exportToCsv` delimiter
and header options · `:memory:` mode works correctly.
1715
+
1716
+ ---
1717
+
1718
+ ## ═══════════════════════════════════════════════════════════
1719
+ ## BUILD, SCRIPTS, CI
1720
+ ## ═══════════════════════════════════════════════════════════
1721
+
1722
+ **package.json scripts:**
1723
+ ```json
1724
+ {
1725
+ "name": "@caracal-lynx/sluice",
1726
+ "scripts": {
1727
+ "build": "tsc -p tsconfig.json",
1728
+ "dev": "tsx watch src/cli.ts",
1729
+ "lint": "eslint src tests",
1730
+ "format": "prettier --write src tests",
1731
+ "test": "vitest run",
1732
+ "test:watch": "vitest",
1733
+ "test:cov": "vitest run --coverage",
1734
+ "sluice": "tsx src/cli.ts"
1735
+ },
1736
+ "bin": { "sluice": "dist/cli.js" }
1737
+ }
1738
+ ```
1739
+
1740
+ Use `tsx` (not `ts-node`) for development execution — handles tsconfig path aliases
1741
+ on Windows without extra configuration.
1742
+
1743
+ **GitHub Actions** (`.github/workflows/ci.yml`):
1744
+ ```yaml
1745
+ on: [push, pull_request]
1746
+ jobs:
1747
+ test:
1748
+ runs-on: ubuntu-latest
1749
+ steps:
1750
+ - uses: actions/checkout@v4
1751
+ - uses: actions/setup-node@v4
1752
+ with: { node-version: '24', cache: 'npm' }
1753
+ - run: npm ci
1754
+ - run: npm run lint
1755
+ - run: npm run build
1756
+ - run: npm run test:cov
1757
+ - uses: actions/upload-artifact@v4
1758
+ with: { name: coverage, path: coverage/ }
1759
+ ```
1760
+
1761
+ ---
1762
+
1763
+ ## ═══════════════════════════════════════════════════════════
1764
+ ## WINDOWS / POWERSHELL NOTES
1765
+ ## ═══════════════════════════════════════════════════════════
1766
+
1767
+ - All file paths: `path.join()` / `path.resolve()`. Never string concat with `/`.
1768
+ - `.env` uses LF line endings (set in `.gitattributes`).
1769
+ - DuckDB npm package includes the `win32-x64` native binary automatically.
1770
+ - Do not write Windows-only shell commands in CI (CI runs ubuntu-latest).
1771
+ - Developer shell: PowerShell 7 on Windows Terminal.
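The first bullet in practice — `path.join()` picks the host separator, so the same helper works under PowerShell on Windows and in CI on ubuntu-latest (`stateFilePath` is an illustrative helper, not a real module export):

```typescript
import * as path from 'node:path';

// Illustrative helper: build {outputDir}/{name}-state.json portably.
// path.join normalises separators for the host OS; 'out' + '/' + name would not.
export function stateFilePath(outputDir: string, pipelineName: string): string {
  return path.join(outputDir, `${pipelineName}-state.json`);
}
```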

---

## ═══════════════════════════════════════════════════════════
## WHAT NOT TO DO
## ═══════════════════════════════════════════════════════════

- Do not use `ts-node` — use `tsx`.
- Do not use `jest` — use `vitest`.
- Do not use `console.log` in `src/` — use the pino logger.
- Do not write manual TypeScript interfaces for config types — use `z.infer<>`.
- Do not use `eval()` or `new Function()` — use `expr-eval` or `vm.runInNewContext`.
- Do not hard-code connection strings, credentials, or client-specific values.
- Do not import from `@duckdb/node-api` directly outside `src/staging/store.ts`.
- Do not create `StagingStore` instances outside `PipelineRunner`.
- Do not add UI, REST server, or dashboard code.
- Do not add adapter-specific logic to `PipelineRunner`.
- Do not invent new top-level YAML keys — the schema is fixed.
- Do not add cleanse ops without adding them to the reference table in this file.
- Do not add BlueCherry entity types to `REQUIRED_COLUMNS` without verifying
  column names against actual BlueCherry import documentation first.
- Do not use `dayjs` plugins without importing them explicitly at the call site.

---

## ═══════════════════════════════════════════════════════════
## SUGGESTED BUILD ORDER FOR CLAUDE CODE
## ═══════════════════════════════════════════════════════════

Work phase by phase. Do not start the next phase until the current phase passes
`npm run build` and `npm test` without errors. Ask before proceeding if anything
in this spec is ambiguous.

1. **Scaffold** — `package.json`, `tsconfig.json`, `src/utils/`, `src/config/`.
   Verify both example pipelines parse cleanly.
2. **Staging store** — `src/staging/`. Unit tests with `:memory:`.
3. **Source adapters** — `csv` first, then `mssql`, `pg`, `xlsx`, `rest`.
   Mock all external connections in tests.
4. **DQ engine** — `src/dq/` including all rules and reporter.
5. **Transform engine** — `src/transform/` — all types, cleanse ops, expression eval.
6. **Target adapters** — `csv` → `ifs` → `bluecherry` → `bc` (BC is most complex;
   mock OAuth2 token endpoint in tests).
7. **PipelineRunner** — wire all phases; integration test both fixture pipelines.
8. **CLI** — all commands and exit codes.
9. **CI** — `.github/workflows/ci.yml`.
1817
+
1818
+ ---
1819
+
1820
+ *This file is the authoritative specification for Sluice. If anything in the
1821
+ codebase contradicts this file, the codebase is wrong. Update this file whenever
1822
+ the architecture evolves — then tell Claude Code to re-read it before continuing.*