@caracal-lynx/sluice 0.1.2 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,681 +1,683 @@
1
- ![Sluice](./images/sluice_banner.png)
2
-
3
- > *"A sluice is a channel that controls the flow of water. Sluice is a toolkit that controls the flow of data. Except data doesn't flood your basement. Usually."*
4
-
5
- **`@caracal-lynx/sluice`** — a config-driven ETL toolkit for ERP data migrations, built by [Caracal Lynx Ltd.](https://caracallynx.com).
6
-
7
- [![npm](https://img.shields.io/npm/v/@caracal-lynx/sluice)](https://www.npmjs.com/package/@caracal-lynx/sluice)
8
- [![Node 24](https://img.shields.io/badge/Node-24_LTS-green)](https://nodejs.org)
9
- [![TypeScript](https://img.shields.io/badge/TypeScript-6.x-blue)](https://www.typescriptlang.org)
10
- [![License](https://img.shields.io/badge/license-Elastic_2.0-blue)](LICENCE-FAQ.md)
11
- <!-- TODO: add Docs badge once Phase 8 ships -->
12
-
13
- ---
14
-
15
- > **Data quality is the hidden blocker for both migrations and AI adoption.**
16
- >
17
- > Sluice is a data migration and data quality tool that validates your data *before* it reaches its destination — not after. You describe the entire migration as a YAML file: where the data comes from, the quality rules it has to pass, how each field maps to the target. Sluice validates the source, transforms it, and loads only the clean records — the bad rows go to a rejection report so you can fix the source.
18
- >
19
- > *Clean data flows through.*
20
-
21
- ---
22
-
23
- ## 🤔 What is this thing?
24
-
25
- ![Gold Sluice](./images/sluice-for-gold.jpg)
26
-
27
- Sluice takes the pain out of ERP data migrations. You know the drill — a client has 20 years of customer records in a legacy SQL database, and they need them in a shiny new ERP system by Monday. The data is a mess, the field names are cryptic, and someone has helpfully stored postcodes in a column called `ADDR5`.
28
-
29
- Sluice lets you describe the entire migration as a **YAML pipeline config** — where to get the data, what quality rules to enforce, how to transform the fields, and where to load the result. The engine is written once; every client engagement is just a folder of YAML files.
30
-
31
- **No UI. No server. No cloud dependency.** Just the `sluice` CLI, TypeScript modules, and a strong cup of tea.
32
-
33
- ---
34
-
35
- ## ✨ What it does
36
-
37
- The data flows through four stages — like water through a sluice gate:
38
-
39
- ```
40
- 💾 Source(s) 🔍 Data Quality ✨ Transform 🎯 Target
41
- ───────────────── → ───────────────── → ───────────────── → ─────────────────
42
- MSSQL / CSV / Validate rules Map fields Business Central
43
- XLSX / REST / Reject bad rows Apply lookups IFS ERP
44
- PostgreSQL Write DQ report Cleanse values BlueCherry ERP
45
- Evaluate expressions CSV / PostgreSQL
46
- (1..N sources)
47
-
48
- 🔀 Optional Merge
49
- coalesce, union,
50
- intersect, priority
51
- ```
52
-
53
- Under the bonnet, all extracted data passes through a **local DuckDB staging store** before being transformed and loaded. Think of it as a staging area where data sits while it gets its act together before being presented to the target ERP. 🦆
54
-
55
- Pipelines can be **single-source** (one YAML per entity, one `source:` block) or **multi-source** — 2+ sources merged on a key column using one of four built-in strategies before DQ and transform run. See [Multi-Source Merge](#-multi-source-merge) below.
56
-
57
- ---
58
-
59
- ## 🏗️ Architecture
60
-
61
- ### Single-source pipeline
62
-
63
- ```mermaid
64
- flowchart LR
65
- A[📄 Pipeline YAML] --> B[⚙️ Config Loader<br/>Zod validation<br/>ENV var resolution<br/>Composite rule expansion]
66
- B --> C[🔌 Source Adapter<br/>mssql / pg / csv<br/>xlsx / rest]
67
- C --> D[(🦆 DuckDB<br/>stg_raw)]
68
- D --> E[🔍 DQ Engine<br/>Rules validation<br/>Rejection report]
69
- E --> F[✨ Transform Engine<br/>Field mapping<br/>Lookup resolution<br/>Cleanse ops<br/>Custom plugins]
70
- F --> G[(🦆 DuckDB<br/>stg_transformed)]
71
- G --> H[🎯 Target Adapter<br/>bc / ifs / bluecherry<br/>csv / pg]
72
- H --> I[📦 Output<br/>CSV / REST / DB]
73
- E -->|❌ critical failures| J[🛑 Pipeline halted<br/>dq-summary.json<br/>rejected.csv]
74
- ```
75
-
76
- ### Multi-source pipeline
77
-
78
- ```mermaid
79
- flowchart LR
80
- A[📄 Pipeline YAML<br/>sources + merge] --> B[⚙️ Config Loader]
81
- B --> C1[🔌 Source 1]
82
- B --> C2[🔌 Source 2]
83
- B --> C3[🔌 Source N]
84
- C1 --> D1[(🦆 stg_raw_src1<br/>+ rename + per-source DQ)]
85
- C2 --> D2[(🦆 stg_raw_src2<br/>+ rename + per-source DQ)]
86
- C3 --> D3[(🦆 stg_raw_srcN<br/>+ rename + per-source DQ)]
87
- D1 --> M[🔀 MergeEngine<br/>coalesce / union<br/>intersect / priority-override]
88
- D2 --> M
89
- D3 --> M
90
- M --> G[(🦆 stg_merged<br/>+ stg_merge_conflicts.csv)]
91
- G --> E[🔍 Post-merge DQ]
92
- E --> F[✨ Transform → stg_transformed]
93
- F --> H[🎯 Target Adapter]
94
- H --> I[📦 Output]
95
- ```
96
-
97
- ---
98
-
99
- ## 🧰 Tech Stack
100
-
101
- | What | Package | Why |
102
- |------|---------|-----|
103
- | 🔤 Language | TypeScript 5.x `strict` | Because `any` is a cry for help |
104
- | 🟢 Runtime | Node.js 24 LTS | In LTS support until April 2028; OpenSSL 3.5; ESM-stable |
105
- | 📋 Config | `js-yaml` + `zod` | YAML in, typed objects out |
106
- | 🗄️ SQL Server | `mssql` | Because the legacy DB is always SQL Server |
107
- | 📊 Staging | `@duckdb/node-api` (embedded) | Promise-native, ABI-stable — no server, no `npm rebuild` after Node version bumps |
108
- | 📁 CSV | `csv-parse` + `csv-stringify` | Streaming, handles BOM, the works |
109
- | 📈 Excel | `xlsx` (SheetJS) | Read-only — we're migrating away from it, after all |
110
- | 🌐 HTTP | `axios` + `axios-retry` | 3 retries, exponential backoff, rate limit respect |
111
- | 📅 Dates | `dayjs` | Because time zones are already somebody else's problem |
112
- | 🖥️ CLI | `commander` v12 | Clean commands, sane flags |
113
- | 📝 Logging | `pino` | Structured JSON logs — pretty in dev, parseable in CI |
114
- | 🧪 Testing | `vitest` | Not Jest. Never Jest. |
115
- | 🔒 Expressions | `expr-eval` | Safe expression parsing — no `eval()` here, thank you very much |
116
-
117
- ---
118
-
119
- ## 🧩 Extension model
120
-
121
- Sluice's pipeline schema is fixed by design (readability, reviewability, predictable validation). Anything you can't express in the schema, you add via plugins. Three tiers, scaling from "no code, no install" to "publishable npm package":
122
-
123
- | Tier | What it is | Where it lives | Best for |
124
- |---|---|---|---|
125
- | **Tier 1** | YAML composite rules — bundle built-in DQ checks under a single ID | `shared/rules.yaml` in your project | Reusing common check combinations across pipelines without writing code |
126
- | **Tier 2** | TypeScript file plugins — `*.rule.ts` / `*.transform.ts` / `*.merge.ts` | `plugins/` next to your YAML | Custom logic for one project; rapid iteration |
127
- | **Tier 3** | npm packages exporting `register()` | npmjs.com (public or private) | Distributing rules / adapters / strategies across teams or as paid products |
128
-
129
- See **[PLUGINS.md](PLUGINS.md)** for the full author's guide with worked examples for all three tiers.
130
-
131
- ---
132
-
133
- ## 🚀 Quick Start
134
-
135
- A complete pipeline in 20 lines: read a CSV, validate emails, lowercase them, write the clean rows to a new CSV. The full file is checked into the repo at [`examples/hello-world.pipeline.yaml`](examples/hello-world.pipeline.yaml) with sample data at [`examples/data/hello-world.csv`](examples/data/hello-world.csv).
136
-
137
- ```yaml
138
- pipeline:
139
- name: hello-world
140
- client: demo
141
- version: "1.0"
142
- entity: Customer
143
-
144
- source:
145
- adapter: csv
146
- file: ./examples/data/hello-world.csv
147
-
148
- dq:
149
- rules:
150
- - field: email
151
- checks:
152
- - { type: notNull, severity: critical }
153
- - { type: email, severity: warning }
154
-
155
- transform:
156
- fields:
157
- - { from: name, to: Name, type: string, cleanse: trim }
158
- - { from: email, to: Email, type: string, cleanse: trim|lowercase }
159
- - { from: country, to: Country, type: string, default: GB }
160
-
161
- target:
162
- adapter: csv
163
- output: ./output/hello-world-clean.csv
164
- ```
165
-
166
- Run it end to end:
167
-
168
- ```bash
169
- # 1. Install
170
- npm install -g @caracal-lynx/sluice
171
-
172
- # 2. Validate the config without touching any data
173
- sluice check examples/hello-world.pipeline.yaml
174
-
175
- # 3. Dry-run: extract + DQ + transform but don't write the target
176
- sluice run examples/hello-world.pipeline.yaml --dry-run
177
-
178
- # 4. Live run — writes ./output/hello-world-clean.csv +
179
- # ./output/hello-world-rejected.csv (if any DQ failures)
180
- sluice run examples/hello-world.pipeline.yaml
181
- ```
182
-
183
- The sample data has one row with a malformed email — that's a `warning`, so the row is kept in the output but flagged in `output/hello-world-rejected.csv`. Open both CSVs side by side to see what passed and what got reported. Add more `unknown@bad`-style rows to generate more warnings, or strip an email entirely to see how a `critical` `notNull` failure halts the pipeline before any output is written.
184
-
185
- ### Other CLI commands
186
-
187
- ```bash
188
- # Run DQ + transform; skip the load (faster than --dry-run for spec checks)
189
- sluice validate customers.pipeline.yaml
190
-
191
- # Profile source data — column stats, distinct counts, samples; no DQ
192
- sluice profile customers.pipeline.yaml
193
-
194
- # Inspect loaded plugins and merge strategies
195
- sluice plugins
196
- sluice merge list-strategies
197
- sluice merge info coalesce
198
- ```
199
-
200
- ### CLI flags
201
-
202
- | Flag | What it does |
203
- |------|-------------|
204
- | `--log-level debug\|info\|warn\|error` | How chatty do you want the logs? |
205
- | `--env <file>` | Path to your `.env` file (default: `./.env`) |
206
- | `--output <dir>` | Override the output directory |
207
- | `--plugins <dir...>` | Load additional plugin directories (alongside the pipeline `plugins/` folder) |
208
- | `--dry-run` | Extract + DQ + transform, but don't write a single byte to the target |
209
-
210
- When multiple plugin directories resolve to the same absolute path (for example,
211
- `--plugins ./plugins`), Sluice de-duplicates them before loading.
212
-
213
- ### Exit codes
214
-
215
- | Code | Meaning |
216
- |------|---------|
217
- | `0` | All good |
218
- | `1` | ❌ Pipeline error |
219
- | `2` | 🛑 Critical DQ violations halted the pipeline |
220
- | `3` | 📋 Config validation failed |
221
-
222
- ---
223
-
224
- ## 📄 Pipeline Config Format
225
-
226
- Each migration entity gets its own YAML file. One entity, one file. Nice and tidy.
227
-
228
- ```
229
- 💡 One YAML file = one migrated entity
230
- (customers, items, vendors, styles, purchase orders, etc.)
231
- ```
232
-
233
- A single-source pipeline has five sections (plus an optional `run:` block):
234
-
235
- ```yaml
236
- pipeline: { name, client, version, entity, description }
237
- source: { adapter, connection/file/endpoint, ... }
238
- dq: { rules, stopOnCritical, rejectionFile }
239
- transform: { lookups, fields }
240
- target: { adapter, output/baseUrl, ... }
241
- run: { mode, batchSize, logLevel, dryRun, ... } # all optional
242
- ```
243
-
244
- A multi-source pipeline swaps `source:` for `sources:` + `merge:`:
245
-
246
- ```yaml
247
- pipeline: { ... }
248
- sources: [ { id, priority, adapter, ..., rename? }, ... ] # 2+ entries
249
- merge: { key, strategy, onUnmatched, fieldStrategies, conflictLog, incrementalSource? }
250
- dq: { ... } # rules can be scoped via sourceId
251
- transform: { ... }
252
- target: { ... }
253
- run: { ... }
254
- ```
255
-
256
- `PipelineSchema` requires *either* `source:` (single) *or* both `sources:` + `merge:` (multi) — never both. The CLI auto-routes based on which shape the YAML has, so there's no flag to remember.
257
-
258
- ### 📥 Source Adapters
259
-
260
- | Adapter | Use when... |
261
- |---------|-------------|
262
- | `mssql` | The legacy system is SQL Server (it's always SQL Server) |
263
- | `pg` | The legacy system is PostgreSQL (you lucky thing) |
264
- | `csv` | Someone emailed you a CSV export at 11pm the night before go-live |
265
- | `xlsx` | Same as above but Excel, complete with merged cells and mystery formatting |
266
- | `rest` | The source system has an API! Progress! |
267
-
268
- ### 🎯 Target Adapters
269
-
270
- | Adapter | Loads to... |
271
- |---------|-------------|
272
- | `bc` | Microsoft Dynamics 365 Business Central (via OData REST + OAuth2) |
273
- | `ifs` | IFS ERP (via fixed-format CSV import — no header, specific column order) |
274
- | `bluecherry` | BlueCherry ERP / CGS (CSV import, US-format dates, headers required) |
275
- | `csv` | Generic CSV — for anything else or for manual inspection |
276
- | `pg` | PostgreSQL — useful for intermediate staging or custom targets |
277
-
278
- ### 🔍 Data Quality Rules
279
-
280
- Nine built-in rule types, configurable per field:
281
-
282
- ```yaml
283
- dq:
284
- stopOnCritical: true
285
- rules:
286
- - field: CUST_CODE
287
- checks:
288
- - { type: notNull, severity: critical } # 💥 stops the pipeline
289
- - { type: unique, severity: critical }
290
- - { type: pattern, value: "^[A-Z0-9]{3,10}$", severity: warning }
291
-
292
- - field: EMAIL
293
- checks:
294
- - { type: email, severity: warning } # ⚠️ flagged but not rejected
295
-
296
- - field: POST_CODE
297
- checks:
298
- - { type: ukPostcode, severity: warning } # 🇬🇧 all UK formats
299
- ```
300
-
301
- | Rule | What it checks |
302
- |------|---------------|
303
- | `notNull` | Not null, not empty, not just whitespace |
304
- | `unique` | No duplicates across the whole dataset |
305
- | `pattern` | ECMAScript regex |
306
- | `email` | RFC 5322-ish email validation |
307
- | `ukPostcode` | All current UK postcode formats |
308
- | `maxLength` | String length cap |
309
- | `min` / `max` | Numeric range |
310
- | `allowedValues` | Enum-style allowed value list |
311
-
312
- Severity levels: `critical` (row rejected, pipeline can halt) · `warning` (flagged in report, row kept) · `info` (summary only)
313
-
314
- ### ✨ Transform: Field Mapping Types
315
-
316
- | Type | What it does |
317
- |------|-------------|
318
- | `string` | Cast + optional cleanse ops + optional truncation |
319
- | `number` | Integer coercion (NaN = error) |
320
- | `decimal` | Fixed-precision decimal stored as string |
321
- | `boolean` | `'1','true','yes','y','t'` → true. Everything else → false |
322
- | `date` | Parse source date, output in target format |
323
- | `lookup` | Resolve via a CSV or SQL lookup table |
324
- | `concat` | Join multiple source fields with a separator |
325
- | `constant` | Emit a fixed value (e.g. `CustomerGroup: DOMESTIC`) |
326
- | `expression` | Evaluate an expression against the source row |
327
- | `custom` | Delegate to a `TransformPlugin` via `customOp` (Phase 2) |
328
-
329
- ### 🧹 Cleanse Operations
330
-
331
- Pipe-chain them: `cleanse: trim|titleCase|normaliseUnicode`
332
-
333
- | Op | Before | After |
334
- |----|--------|-------|
335
- | `trim` | `" hello "` | `"hello"` |
336
- | `uppercase` | `"hello"` | `"HELLO"` |
337
- | `lowercase` | `"HELLO"` | `"hello"` |
338
- | `titleCase` | `"john smith"` | `"John Smith"` |
339
- | `stripNonAlpha` | `"AB-12!"` | `"AB"` |
340
- | `stripNonNumeric` | `"AB-12!"` | `"12"` |
341
- | `padStart:6:0` | `"42"` | `"000042"` |
342
- | `nullIfEmpty` | `""` | `null` |
343
- | `normaliseUnicode` | `"café"` | `"cafe"` |
344
- | `normaliseQuotes` | `"it’s"` | `"it's"` |
345
-
346
- ---
347
-
348
- ## 📁 Repository Structure
349
-
350
- ```
351
- sluice/
352
- ├── src/
353
- │   ├── cli.ts                    ← CLI entry point (commander)
354
- │   ├── runner.ts                 ← PipelineRunner — single-source orchestration
355
- │   ├── multi-source-runner.ts    ← MultiSourcePipelineRunner (Phase 3)
356
- │   ├── config/                   ← Zod schema, YAML loader, ENV var + composite expansion
357
- │   ├── adapters/
358
- │   │   ├── source/               ← mssql, pg, csv, xlsx, rest
359
- │   │   └── target/               ← bc, ifs, bluecherry, csv, pg
360
- │   ├── staging/                  ← DuckDB wrapper (stg_raw → stg_merged → stg_transformed)
361
- │   ├── dq/                       ← DQ engine, rules, rejection reporter
362
- │   ├── transform/                ← Transform engine, lookup resolver, cleanse ops
363
- │   ├── merge/                    ← MergeEngine, SQL builder, 4 built-in strategies
364
- │   ├── plugins/                  ← Rule/Transform/Merge registries + file & npm loaders
365
- │   └── utils/                    ← logger (pino), errors, env helpers
366
- ├── tests/
367
- │   ├── fixtures/                 ← sample pipeline YAMLs, CSV/rules data, plugin files
368
- │   ├── unit/                     ← unit tests (all I/O mocked)
369
- │   └── integration/              ← real DuckDB :memory: + CSV fixtures
370
- └── clients/                      ← 🙈 gitignored — each client has their own repo
371
-     ├── acme-corp/                ← Acme Corp pipelines
372
-     └── style-co/                 ← Style Co pipelines
373
- ```
374
-
375
- ---
376
-
377
- ## ⚙️ Environment Variables
378
-
379
- Connection strings and credentials live in `.env` (never in YAML files, never in Git).
380
-
381
- ```bash
382
- # .env
383
- SOURCE_MSSQL=mssql://user:password@serverlegacy.example.local/LegacyDB
384
- BC_BASE_URL=https://api.businesscentral.dynamics.com/v2.0
385
- BC_TENANT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
386
- BC_CLIENT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
387
- BC_CLIENT_SECRET=your-secret-here
388
- BC_COMPANY=Example Company Ltd
389
- ```
390
-
391
- Reference them in YAML with `${ENV_VAR}` — resolved at runtime, never stored in config:
392
-
393
- ```yaml
394
- source:
395
- adapter: mssql
396
- connection: ${SOURCE_MSSQL}
397
- ```
398
-
399
- ---
400
-
401
- ## 🧩 Phase 2: Extension System
402
-
403
- Phase 2 adds a three-tier plugin system so you can extend Sluice without touching the core engine.
404
-
405
- ### Tier 1 — Composite Rules (YAML) 📋
406
-
407
- Name a bundle of checks in a shared rules file and reference them like built-ins:
408
-
409
- ```yaml
410
- # shared/rules.yaml
411
- rules:
412
- - id: style-coStyleNo
413
- checks:
414
- - { type: notNull, severity: critical }
415
- - { type: pattern, value: "^[A-Z]{2}[0-9]{4}$", severity: critical }
416
- - { type: maxLength, value: 6, severity: critical }
417
- ```
418
-
419
- ```yaml
420
- # In your pipeline:
421
- dq:
422
- rulesFile: ../../shared/rules.yaml
423
- rules:
424
- - field: STYLE_NO
425
- checks:
426
- - { type: style-coStyleNo } # expands to the three checks above ✨
427
- ```
428
-
429
- ### Tier 2 — Plugin Files (TypeScript) 🔌
430
-
431
- Drop a `*.rule.ts`, `*.transform.ts`, or `*.merge.ts` file into a `plugins/` folder next to your pipeline YAMLs. Auto-discovered at startup:
432
-
433
- ```typescript
434
- // plugins/ukVatNumber.rule.ts
435
- export const rule: RulePlugin = {
436
- id: 'ukVatNumber',
437
- validate(value, config, rowIndex, field) {
438
- const valid = /^GB([0-9]{9}|[0-9]{12}|(GD|HA)[0-9]{3})$/.test(String(value));
439
- return valid ? null : { field, rowIndex, value, rule: 'ukVatNumber',
440
- severity: config.severity, message: 'Invalid UK VAT number' };
441
- }
442
- };
443
- ```
444
-
445
- ### Tier 3 — npm Packages 📦
446
-
447
- When plugins are useful across multiple clients, promote them to scoped npm packages and declare them in `sluice.config.yaml`:
448
-
449
- ```yaml
450
- # sluice.config.yaml
451
- plugins:
452
- - package: "@caracal-lynx/etl-rules-uk"
453
- - package: "@caracal-lynx/etl-rules-fashion"
454
- - package: "@caracal-lynx/etl-transform-ifs"
455
- ```
456
-
457
- All three tiers use the same registry interfaces and are invoked identically by the engines. The engine doesn't know or care which tier a rule came from. 🤷
458
-
459
- ### List Loaded Plugins
460
-
461
- ```bash
462
- sluice plugins
463
-
464
- # Include extra plugin directories outside the pipeline folder
465
- sluice plugins --plugins ./shared/plugins ./team/plugins
466
- ```
467
-
468
- Output:
469
- ```
470
- 📋 Data Quality Rules:
471
-   • ukVatNumber
472
-   • bcAccountCode
473
-   • iso8601Date
474
-
475
- 🔄 Transform Operations:
476
-   • slugGenerator
477
-   • normalizeCompanyName
478
-   • fixedDecimal
479
-
480
- 🔀 Merge Strategies:
481
-   • coalesce
482
-   • priority-override
483
-   • union
484
-   • intersect
485
- ```
486
-
487
- ### Getting Started with Plugins
488
-
489
- Detailed guide: **[PLUGINS.md](./PLUGINS.md)**
490
-
491
- - Create a custom DQ rule
492
- - Create a custom transform operation
493
- - Create a custom merge strategy
494
- - Package plugins as npm packages
495
- - Test and debug plugins
496
- - Real-world examples
497
-
498
- ---
499
-
500
- ## 🔀 Multi-Source Merge
501
-
502
- Phase 3 lets a single pipeline extract from **2+ sources** and merge them on a key column before DQ and transform. Useful when the master record for an entity is scattered across systems — master data in SQL Server, pricing enrichment in an Excel sheet, product descriptions in a REST API, and so on.
503
-
504
- ### Built-in merge strategies
505
-
506
- | Strategy | Behaviour | When to use |
507
- |---|---|---|
508
- | `coalesce` | First non-null value wins (priority-ordered; whitespace treated as blank) | Enriching a primary source with fallback data from lower-priority sources |
509
- | `priority-override` | Highest-priority source wins, even if null or blank | Strict priority — the trusted source is the trusted source, full stop |
510
- | `union` | All rows from all sources, deduplicated by key | Combining independent datasets (e.g. multi-warehouse inventory) |
511
- | `intersect` | Only rows present in **all** sources | Reconciliation / "find the records that agree" |
512
-
513
- Custom strategies can be dropped in as `*.merge.ts` plugins or shipped as npm packages — same three-tier model as DQ rules and transforms.
514
-
515
- ### A minimal multi-source pipeline
516
-
517
- ```yaml
518
- pipeline:
519
- name: style-co-products-merged
520
- client: style-co
521
- version: "1.0"
522
- entity: Style
523
-
524
- sources:
525
- - id: sql-server # staging table: stg_raw_sql-server
526
- priority: 1 # lower = higher precedence
527
- adapter: mssql
528
- connection: ${SOURCE_2_MSSQL}
529
- query: "SELECT STYLE_NO, STYLE_DESC, COST_PRICE FROM dbo.Styles WHERE Active = 1"
530
-
531
- - id: excel
532
- priority: 2
533
- adapter: xlsx
534
- file: ./data/product-data.xlsx
535
- sheet: "Products"
536
- rename: # applied in-place after extract, before DQ
537
- Style Number: STYLE_NO
538
- Description: STYLE_DESC
539
- Fibre: FIBRE_CONTENT
540
-
541
- merge:
542
- key: STYLE_NO # single column or array for composite keys
543
- strategy: coalesce
544
- onUnmatched: include # include | exclude | warn | error
545
- fieldStrategies: # per-field overrides
546
- - { field: FIBRE_CONTENT, source: excel } # pin to one source
547
- - { field: COST_PRICE, strategy: priority-override }
548
- conflictLog: ./output/style-co-products-conflicts.csv # optional CSV of field disagreements
549
-
550
- dq:
551
- stopOnCritical: true
552
- rules:
553
- - field: STYLE_NO # 🎯 pre-merge: scoped to one source
554
- sourceId: sql-server
555
- checks: [ { type: notNull, severity: critical }, { type: unique, severity: critical } ]
556
- - field: STYLE_DESC # 🎯 post-merge: runs against stg_merged
557
- checks: [ { type: notNull, severity: critical } ]
558
-
559
- transform: { ... }
560
- target: { ... }
561
- ```
562
-
563
- Pre-merge rules (`sourceId: …`) run against each source's staging table before merging and generate per-source rejection CSVs (suffixed `-{sourceId}`). Post-merge rules (no `sourceId`) run once against `stg_merged`.
564
-
565
- ### Incremental multi-source
566
-
567
- ```yaml
568
- merge:
569
- incrementalSource: sql-server # must match a source id; required in incremental mode
570
- run:
571
- mode: incremental
572
- incrementalField: UPDATED_AT
573
- ```
574
-
575
- Only the named source is filtered by timestamp; other sources run full each time. The state file gains a per-source `sources` block tracking each source's last run time.
576
-
577
- ### Inspect merge strategies
578
-
579
- ```bash
580
- sluice merge list-strategies # ids + descriptions for all registered strategies
581
- sluice merge info coalesce # details for one strategy
582
- ```
583
-
584
- A full working example lives at [tests/fixtures/style-co-products-merged.pipeline.yaml](tests/fixtures/style-co-products-merged.pipeline.yaml).
585
-
586
- ---
587
-
588
- ## 🧪 Testing
589
-
590
- ```bash
591
- npm test # run tests once
592
- npm run test:watch # watch mode (great for TDD)
593
- npm run test:cov # with coverage report
594
- ```
595
-
596
- - **Unit tests** mock all I/O with `vi.mock` — no live databases required
597
- - **Integration tests** use real DuckDB (`:memory:`) with CSV fixtures
598
- - Target: 80% line coverage across `src/dq/` and `src/transform/`
599
- - CI runs on `ubuntu-latest` via GitHub Actions
600
-
601
- ---
602
-
603
- ## 🏗️ Development
604
-
605
- ```bash
606
- npm run build # tsc compile
607
- npm run dev # tsx watch src/cli.ts (live reload)
608
- npm run lint # eslint
609
- npm run format # prettier
610
-
611
- # Pretty logs in dev:
612
- npm run dev -- run customers.pipeline.yaml | npx pino-pretty
613
- ```
614
-
615
- > **Note:** Uses `tsx`, not `ts-node`. Path aliases work correctly on Windows without extra configuration. 🪟
616
-
617
- ---
618
-
619
- ## 🚫 Things Sluice Is Not
620
-
621
- - ❌ A web application or dashboard (there's no UI — this is a good thing)
622
- - ❌ A streaming / real-time ingestion platform
623
- - ❌ A data warehouse
624
- - ❌ A multi-tenant SaaS product
625
- - ❌ An excuse to use `eval()` anywhere
626
-
627
- ---
628
-
629
- ## 🏢 Sluice + Caracal Lynx Professional Services
630
-
631
- The Sluice core CLI is open-source and free to use. Caracal Lynx offers additional paid services built on top of it:
632
-
633
- | Service | What it is |
634
- |---|---|
635
- | **Enrichment Service** | Async API lookups (EU VAT, UK VAT, trade tariff) — fills gaps in source data |
636
- | **Application Adapters** | Pre-built ERP adapters (IFS, Business Central, BlueCherry) |
637
- | **Domain Rule Packages** | UK compliance rules, fashion/retail data standards |
638
- | **Client-Specific Plugins** | Bespoke plugins tailored to your source system and data model |
639
- | **Sluice MCP Server** 🚧 | AI-assisted migration using Claude — agentic pipeline authoring, live schema inspection, automatic DQ iteration. *Coming soon — Phase 9.* |
640
- | **Migration Delivery** | Full end-to-end data migration, delivered by Caracal Lynx |
641
-
642
- 📧 **michael.scott@caracallynx.com**
643
- 🌐 **[caracallynx.com](https://caracallynx.com)**
644
-
645
- ---
646
-
647
- ## 🤝 Community
648
-
649
- - 🐛 [Report a bug or request a feature](https://github.com/caracal-lynx/sluice/issues/new/choose)
650
- - 💬 [Ask a question or share a use case](https://github.com/caracal-lynx/sluice/discussions)
651
- - 🤲 [Contributing guide](CONTRIBUTING.md)
652
- - 🤝 [Code of Conduct](CODE_OF_CONDUCT.md)
653
-
654
- ---
655
-
656
- ## 🔐 Security
657
-
658
- Found a vulnerability? Please **do not** open a public issue. See [SECURITY.md](SECURITY.md) for the disclosure process — `security@caracallynx.com`, 48-hour acknowledgement, 90-day disclosure SLA.
659
-
660
- ---
661
-
662
- ## ⚖️ Licence
663
-
664
- Sluice is licensed under the [Elastic Licence 2.0](LICENSE). See [LICENCE-FAQ.md](LICENCE-FAQ.md) for a plain-English explainer of what you can and can't do with it. Short version: use it freely for your own data migrations; don't resell it as a hosted service or strip the licence headers.
665
-
666
- ---
667
-
668
- ## 🏷️ About
669
-
670
- Built and maintained by [Caracal Lynx Ltd.](https://caracallynx.com) (SC826823) — Gretna, Scotland.
671
-
672
- ```
673
- npm package: @caracal-lynx/sluice
674
- owner: Caracal Lynx Ltd. (SC826823)
675
- author: Michael Scott
676
- maintainers: Michael Scott, Carolyn Scott, Andrew Scott, Duncan Scott
677
- ```
678
-
679
- ---
680
-
681
- *Clean data flows through.* 💧
1
+ ![Sluice](./images/sluice_banner.png)
2
+
3
+ > *"A sluice is a channel that controls the flow of water. Sluice is a toolkit that controls the flow of data. Except data doesn't flood your basement. Usually."*
4
+
5
+ **`@caracal-lynx/sluice`** — a config-driven ETL toolkit for ERP data migrations, built by [Caracal Lynx Ltd.](https://caracallynx.com).
6
+
7
+ [![npm](https://img.shields.io/npm/v/@caracal-lynx/sluice)](https://www.npmjs.com/package/@caracal-lynx/sluice)
8
+ [![Node 24](https://img.shields.io/badge/Node-24_LTS-green)](https://nodejs.org)
9
+ [![TypeScript](https://img.shields.io/badge/TypeScript-6.x-blue)](https://www.typescriptlang.org)
10
+ [![License](https://img.shields.io/badge/license-Elastic_2.0-blue)](LICENCE-FAQ.md)
11
+ [![Docs](https://img.shields.io/badge/docs-caracal--lynx.github.io-00b8d4)](https://caracal-lynx.github.io/sluice/)
12
+
13
+ 📖 **Full documentation:** <https://caracal-lynx.github.io/sluice/>
14
+
15
+ ---
16
+
17
+ > **Data quality is the hidden blocker for both migrations and AI adoption.**
18
+ >
19
+ > Sluice is a data migration and data quality tool that validates your data *before* it reaches its destination — not after. You describe the entire migration as a YAML file: where the data comes from, the quality rules it has to pass, how each field maps to the target. Sluice validates the source, transforms it, and loads only the clean records — the bad rows go to a rejection report so you can fix the source.
20
+ >
21
+ > *Clean data flows through.*
22
+
23
+ ---
24
+
25
+ ## 🤔 What is this thing?
26
+
27
+ ![Gold Sluice](./images/sluice-for-gold.jpg)
28
+
29
+ Sluice takes the pain out of ERP data migrations. You know the drill — a client has 20 years of customer records in a legacy SQL database, and they need them in a shiny new ERP system by Monday. The data is a mess, the field names are cryptic, and someone has helpfully stored postcodes in a column called `ADDR5`.
30
+
31
+ Sluice lets you describe the entire migration as a **YAML pipeline config** — where to get the data, what quality rules to enforce, how to transform the fields, and where to load the result. The engine is written once; every client engagement is just a folder of YAML files.
32
+
33
+ **No UI. No server. No cloud dependency.** Just the `sluice` CLI, TypeScript modules, and a strong cup of tea. ☕
34
+
35
+ ---
36
+
37
+ ## ✨ What it does
38
+
39
+ The data flows through four stages — like water through a sluice gate:
40
+
41
+ ```
42
+ 💾 Source(s) 🔍 Data Quality ✨ Transform 🎯 Target
43
+ ───────────────── → ───────────────── → ───────────────── → ─────────────────
44
+ MSSQL / CSV / Validate rules Map fields Business Central
45
+ XLSX / REST / Reject bad rows Apply lookups IFS ERP
46
+ PostgreSQL Write DQ report Cleanse values BlueCherry ERP
47
+ Evaluate expressions CSV / PostgreSQL
48
+ (1..N sources)
49
+
50
+ 🔀 Optional Merge
51
+ coalesce, union,
52
+ intersect, priority
53
+ ```
54
+
55
+ Under the bonnet, all extracted data passes through a **local DuckDB staging store** before being transformed and loaded. Think of it as a staging area where data sits while it gets its act together before being presented to the target ERP. 🦆
56
+
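+ For a feel of what that staging store is, here is a minimal sketch using `@duckdb/node-api` directly. It is illustrative only; the `stg_raw` table name follows the convention used in this README, and none of this is Sluice's actual internals:
+
+ ```typescript
+ // Minimal DuckDB staging sketch (illustrative only, not Sluice's internals).
+ import { DuckDBInstance } from '@duckdb/node-api';
+
+ const instance = await DuckDBInstance.create(':memory:'); // or a file path
+ const db = await instance.connect();
+
+ // Extracted rows land in a staging table before DQ and transform touch them.
+ await db.run(`CREATE TABLE stg_raw (cust_code VARCHAR, email VARCHAR)`);
+ await db.run(`INSERT INTO stg_raw VALUES ('AB123', 'jo@example.com')`);
+
+ const result = await db.runAndReadAll(`SELECT * FROM stg_raw`);
+ console.log(result.getRows()); // [ [ 'AB123', 'jo@example.com' ] ]
+ ```
+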
57
+ Pipelines can be **single-source** (one YAML per entity, one `source:` block) or **multi-source** — 2+ sources merged on a key column using one of four built-in strategies before DQ and transform run. See [Multi-Source Merge](#-multi-source-merge) below.
58
+
59
+ ---
60
+
61
+ ## 🏗️ Architecture
62
+
63
+ ### Single-source pipeline
64
+
65
+ ```mermaid
66
+ flowchart LR
67
+ A[📄 Pipeline YAML] --> B[⚙️ Config Loader<br/>Zod validation<br/>ENV var resolution<br/>Composite rule expansion]
68
+ B --> C[🔌 Source Adapter<br/>mssql / pg / csv<br/>xlsx / rest]
69
+ C --> D[(🦆 DuckDB<br/>stg_raw)]
70
+ D --> E[🔍 DQ Engine<br/>Rules validation<br/>Rejection report]
71
+ E --> F[✨ Transform Engine<br/>Field mapping<br/>Lookup resolution<br/>Cleanse ops<br/>Custom plugins]
72
+ F --> G[(🦆 DuckDB<br/>stg_transformed)]
73
+ G --> H[🎯 Target Adapter<br/>bc / ifs / bluecherry<br/>csv / pg]
74
+ H --> I[📦 Output<br/>CSV / REST / DB]
75
+ E -->|❌ critical failures| J[🛑 Pipeline halted<br/>dq-summary.json<br/>rejected.csv]
76
+ ```
77
+
78
+ ### Multi-source pipeline
79
+
80
+ ```mermaid
81
+ flowchart LR
82
+ A[📄 Pipeline YAML<br/>sources + merge] --> B[⚙️ Config Loader]
83
+ B --> C1[🔌 Source 1]
84
+ B --> C2[🔌 Source 2]
85
+ B --> C3[🔌 Source N]
86
+ C1 --> D1[(🦆 stg_raw_src1<br/>+ rename + per-source DQ)]
87
+ C2 --> D2[(🦆 stg_raw_src2<br/>+ rename + per-source DQ)]
88
+ C3 --> D3[(🦆 stg_raw_srcN<br/>+ rename + per-source DQ)]
89
+ D1 --> M[🔀 MergeEngine<br/>coalesce / union<br/>intersect / priority-override]
90
+ D2 --> M
91
+ D3 --> M
92
+ M --> G[(🦆 stg_merged<br/>+ stg_merge_conflicts.csv)]
93
+ G --> E[🔍 Post-merge DQ]
94
+ E --> F[✨ Transform → stg_transformed]
95
+ F --> H[🎯 Target Adapter]
96
+ H --> I[📦 Output]
97
+ ```
98
+
99
+ ---
100
+
101
+ ## 🧰 Tech Stack
102
+
103
+ | What | Package | Why |
104
+ |------|---------|-----|
105
+ | 🔤 Language | TypeScript 5.x `strict` | Because `any` is a cry for help |
106
+ | 🟢 Runtime | Node.js 24 LTS | In LTS support until April 2028; OpenSSL 3.5; ESM-stable |
107
+ | 📋 Config | `js-yaml` + `zod` | YAML in, typed objects out |
108
+ | 🗄️ SQL Server | `mssql` | Because the legacy DB is always SQL Server |
109
+ | 📊 Staging | `@duckdb/node-api` (embedded) | Promise-native, ABI-stable — no server, no `npm rebuild` after Node version bumps |
110
+ | 📁 CSV | `csv-parse` + `csv-stringify` | Streaming, handles BOM, the works |
111
+ | 📈 Excel | `xlsx` (SheetJS) | Read-only — we're migrating away from it, after all |
112
+ | 🌐 HTTP | `axios` + `axios-retry` | 3 retries, exponential backoff, rate limit respect |
113
+ | 📅 Dates | `dayjs` | Because time zones are already somebody else's problem |
114
+ | 🖥️ CLI | `commander` v12 | Clean commands, sane flags |
115
+ | 📝 Logging | `pino` | Structured JSON logs — pretty in dev, parseable in CI |
116
+ | 🧪 Testing | `vitest` | Not Jest. Never Jest. |
117
+ | 🔒 Expressions | `expr-eval` | Safe expression parsing — no `eval()` here, thank you very much |
118
+
119
+ ---
120
+
121
+ ## 🧩 Extension model
122
+
123
+ Sluice's pipeline schema is fixed by design (readability, reviewability, predictable validation). Anything you can't express in the schema, you add via plugins. Three tiers, scaling from "no code, no install" to "publishable npm package":
124
+
125
+ | Tier | What it is | Where it lives | Best for |
126
+ |---|---|---|---|
127
+ | **Tier 1** | YAML composite rules — bundle built-in DQ checks under a single ID | `shared/rules.yaml` in your project | Reusing common check combinations across pipelines without writing code |
128
+ | **Tier 2** | TypeScript file plugins — `*.rule.ts` / `*.transform.ts` / `*.merge.ts` | `plugins/` next to your YAML | Custom logic for one project; rapid iteration |
129
+ | **Tier 3** | npm packages exporting `register()` | npmjs.com (public or private) | Distributing rules / adapters / strategies across teams or as paid products |
130
+
131
+ See **[PLUGINS.md](PLUGINS.md)** for the full author's guide with worked examples for all three tiers.
132
+
133
+ ---
134
+
135
+ ## 🚀 Quick Start
136
+
137
+ A complete pipeline in 20 lines: read a CSV, validate emails, lowercase them, write the clean rows to a new CSV. The full file is checked into the repo at [`examples/hello-world.pipeline.yaml`](examples/hello-world.pipeline.yaml) with sample data at [`examples/data/hello-world.csv`](examples/data/hello-world.csv).
138
+
139
+ ```yaml
140
+ pipeline:
141
+ name: hello-world
142
+ client: demo
143
+ version: "1.0"
144
+ entity: Customer
145
+
146
+ source:
147
+ adapter: csv
148
+ file: ./examples/data/hello-world.csv
149
+
150
+ dq:
151
+ rules:
152
+ - field: email
153
+ checks:
154
+ - { type: notNull, severity: critical }
155
+ - { type: email, severity: warning }
156
+
157
+ transform:
158
+ fields:
159
+ - { from: name, to: Name, type: string, cleanse: trim }
160
+ - { from: email, to: Email, type: string, cleanse: trim|lowercase }
161
+ - { from: country, to: Country, type: string, default: GB }
162
+
163
+ target:
164
+ adapter: csv
165
+ output: ./output/hello-world-clean.csv
166
+ ```
167
+
168
+ Run it end to end:
169
+
170
+ ```bash
171
+ # 1. Install
172
+ npm install -g @caracal-lynx/sluice
173
+
174
+ # 2. Validate the config without touching any data
175
+ sluice check examples/hello-world.pipeline.yaml
176
+
177
+ # 3. Dry-run: extract + DQ + transform but don't write the target
178
+ sluice run examples/hello-world.pipeline.yaml --dry-run
179
+
180
+ # 4. Live run — writes ./output/hello-world-clean.csv +
181
+ # ./output/hello-world-rejected.csv (if any DQ failures)
182
+ sluice run examples/hello-world.pipeline.yaml
183
+ ```
184
+
185
+ The sample data has one row with a malformed email — that's a `warning`, so the row is kept in the output but flagged in `output/hello-world-rejected.csv`. Open both CSVs side by side to see what passed and what got reported. Add more `unknown@bad`-style rows to generate more warnings, or strip an email entirely to see how a `critical` `notNull` failure halts the pipeline before any output is written.
186
+
187
+ ### Other CLI commands
188
+
189
+ ```bash
190
+ # Run DQ + transform; skip the load (faster than --dry-run for spec checks)
191
+ sluice validate customers.pipeline.yaml
192
+
193
+ # Profile source data — column stats, distinct counts, samples; no DQ
194
+ sluice profile customers.pipeline.yaml
195
+
196
+ # Inspect loaded plugins and merge strategies
197
+ sluice plugins
198
+ sluice merge list-strategies
199
+ sluice merge info coalesce
200
+ ```
201
+
202
+ ### CLI flags
203
+
204
+ | Flag | What it does |
205
+ |------|-------------|
206
+ | `--log-level debug\|info\|warn\|error` | How chatty do you want the logs? |
207
+ | `--env <file>` | Path to your `.env` file (default: `./.env`) |
208
+ | `--output <dir>` | Override the output directory |
209
+ | `--plugins <dir...>` | Load additional plugin directories (alongside the pipeline `plugins/` folder) |
210
+ | `--dry-run` | Extract + DQ + transform, but don't write a single byte to the target |
211
+
212
+ When multiple plugin directories resolve to the same absolute path (for example,
213
+ `--plugins ./plugins`), Sluice de-duplicates them before loading.
214
+
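+ A sketch of that de-duplication, assuming nothing more than Node's `path` module (illustrative, not the actual implementation):
+
+ ```typescript
+ import path from 'node:path';
+
+ // Three spellings of the same directory collapse to one absolute path.
+ const dirs = ['./plugins', 'plugins', './plugins/../plugins'];
+ const unique = [...new Set(dirs.map((d) => path.resolve(d)))];
+ console.log(unique); // a single absolute path, loaded once
+ ```
+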
215
+ ### Exit codes
216
+
217
+ | Code | Meaning |
218
+ |------|---------|
219
+ | `0` | All good |
220
+ | `1` | ❌ Pipeline error |
221
+ | `2` | 🛑 Critical DQ violations halted the pipeline |
222
+ | `3` | 📋 Config validation failed |
223
+
224
+ ---
225
+
226
+ ## 📄 Pipeline Config Format
227
+
228
+ Each migration entity gets its own YAML file. One entity, one file. Nice and tidy.
229
+
230
+ ```
231
+ 💡 One YAML file = one migrated entity
232
+ (customers, items, vendors, styles, purchase orders, etc.)
233
+ ```
234
+
235
+ A single-source pipeline has five sections (plus an optional `run:` block):
236
+
237
+ ```yaml
238
+ pipeline: { name, client, version, entity, description }
239
+ source: { adapter, connection/file/endpoint, ... }
240
+ dq: { rules, stopOnCritical, rejectionFile }
241
+ transform: { lookups, fields }
242
+ target: { adapter, output/baseUrl, ... }
243
+ run: { mode, batchSize, logLevel, dryRun, ... } # all optional
244
+ ```
245
+
246
+ A multi-source pipeline swaps `source:` for `sources:` + `merge:`:
247
+
248
+ ```yaml
249
+ pipeline: { ... }
250
+ sources: [ { id, priority, adapter, ..., rename? }, ... ] # 2+ entries
251
+ merge: { key, strategy, onUnmatched, fieldStrategies, conflictLog, incrementalSource? }
252
+ dq: { ... } # rules can be scoped via sourceId
253
+ transform: { ... }
254
+ target: { ... }
255
+ run: { ... }
256
+ ```
257
+
258
+ `PipelineSchema` requires *either* `source:` (single) *or* both `sources:` + `merge:` (multi) — never both. The CLI auto-routes based on which shape the YAML has, so there's no flag to remember.
259
+
260
+ ### 📥 Source Adapters
261
+
262
+ | Adapter | Use when... |
263
+ |---------|-------------|
264
+ | `mssql` | The legacy system is SQL Server (it's always SQL Server) |
265
+ | `pg` | The legacy system is PostgreSQL (you lucky thing) |
266
+ | `csv` | Someone emailed you a CSV export at 11pm the night before go-live |
267
+ | `xlsx` | Same as above but Excel, complete with merged cells and mystery formatting |
268
+ | `rest` | The source system has an API! Progress! |
269
+
270
+ ### 🎯 Target Adapters
271
+
272
+ | Adapter | Loads to... |
273
+ |---------|-------------|
274
+ | `bc` | Microsoft Dynamics 365 Business Central (via OData REST + OAuth2) |
275
+ | `ifs` | IFS ERP (via fixed-format CSV import — no header, specific column order) |
276
+ | `bluecherry` | BlueCherry ERP / CGS (CSV import, US-format dates, headers required) |
277
+ | `csv` | Generic CSV — for anything else or for manual inspection |
278
+ | `pg` | PostgreSQL — useful for intermediate staging or custom targets |
279
+
280
+ ### 🔍 Data Quality Rules
281
+
282
+ Nine built-in rule types, configurable per field:
283
+
284
+ ```yaml
285
+ dq:
286
+ stopOnCritical: true
287
+ rules:
288
+ - field: CUST_CODE
289
+ checks:
290
+ - { type: notNull, severity: critical } # 💥 stops the pipeline
291
+ - { type: unique, severity: critical }
292
+ - { type: pattern, value: "^[A-Z0-9]{3,10}$", severity: warning }
293
+
294
+ - field: EMAIL
295
+ checks:
296
+ - { type: email, severity: warning } # ⚠️ flagged but not rejected
297
+
298
+ - field: POST_CODE
299
+ checks:
300
+ - { type: ukPostcode, severity: warning } # 🇬🇧 all UK formats
301
+ ```
302
+
303
+ | Rule | What it checks |
304
+ |------|---------------|
305
+ | `notNull` | Not null, not empty, not just whitespace |
306
+ | `unique` | No duplicates across the whole dataset |
307
+ | `pattern` | ECMAScript regex |
308
+ | `email` | RFC 5322-ish email validation |
309
+ | `ukPostcode` | All current UK postcode formats |
310
+ | `maxLength` | String length cap |
311
+ | `min` / `max` | Numeric range |
312
+ | `allowedValues` | Enum-style allowed value list |
313
+
314
+ Severity levels: `critical` (row rejected, pipeline can halt) · `warning` (flagged in report, row kept) · `info` (summary only)
315
+
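+ To pin down the `notNull` semantics from the table above, a one-liner sketch (illustrative):
+
+ ```typescript
+ // null, undefined, '', and whitespace-only strings all fail notNull.
+ const passesNotNull = (v: unknown): boolean =>
+   v !== null && v !== undefined && String(v).trim() !== '';
+
+ console.log(passesNotNull('AB123')); // true
+ console.log(passesNotNull('   '));   // false
+ ```
+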
316
+ ### ✨ Transform: Field Mapping Types
317
+
318
+ | Type | What it does |
319
+ |------|-------------|
320
+ | `string` | Cast + optional cleanse ops + optional truncation |
321
+ | `number` | Integer coercion (NaN = error) |
322
+ | `decimal` | Fixed-precision decimal stored as string |
323
+ | `boolean` | `'1','true','yes','y','t'` → true. Everything else → false |
324
+ | `date` | Parse source date, output in target format |
325
+ | `lookup` | Resolve via a CSV or SQL lookup table |
326
+ | `concat` | Join multiple source fields with a separator |
327
+ | `constant` | Emit a fixed value (e.g. `CustomerGroup: DOMESTIC`) |
328
+ | `expression` | Evaluate an expression against the source row |
329
+ | `custom` | Delegate to a `TransformPlugin` via `customOp` (Phase 2) |
330
+
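+ The `expression` type is backed by `expr-eval` (see Tech Stack). A sketch of that style of evaluation, with hypothetical field names:
+
+ ```typescript
+ import { Parser } from 'expr-eval';
+
+ // Parse once, evaluate per row. No eval() in sight.
+ const parser = new Parser();
+ const expr = parser.parse('UNIT_PRICE * (1 - DISCOUNT_PCT / 100)');
+
+ const row = { UNIT_PRICE: 120, DISCOUNT_PCT: 25 }; // hypothetical source row
+ console.log(expr.evaluate(row)); // 90
+ ```
+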
331
+ ### 🧹 Cleanse Operations
332
+
333
+ Pipe-chain them: `cleanse: trim|titleCase|normaliseUnicode`
334
+
335
+ | Op | Before | After |
336
+ |----|--------|-------|
337
+ | `trim` | `" hello "` | `"hello"` |
338
+ | `uppercase` | `"hello"` | `"HELLO"` |
339
+ | `lowercase` | `"HELLO"` | `"hello"` |
340
+ | `titleCase` | `"john smith"` | `"John Smith"` |
341
+ | `stripNonAlpha` | `"AB-12!"` | `"AB"` |
342
+ | `stripNonNumeric` | `"AB-12!"` | `"12"` |
343
+ | `padStart:6:0` | `"42"` | `"000042"` |
344
+ | `nullIfEmpty` | `""` | `null` |
345
+ | `normaliseUnicode` | `"café"` | `"cafe"` |
346
+ | `normaliseQuotes` | `"it’s"` | `"it's"` |
347
+
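+ A toy version of how such a chain could be applied, with three of the ops; Sluice's actual cleanse engine may differ:
+
+ ```typescript
+ // Parse "trim|lowercase|nullIfEmpty" and apply the ops left to right.
+ const ops: Record<string, (v: string) => string | null> = {
+   trim: (v) => v.trim(),
+   lowercase: (v) => v.toLowerCase(),
+   nullIfEmpty: (v) => (v === '' ? null : v),
+ };
+
+ function cleanse(value: string, chain: string): string | null {
+   let out: string | null = value;
+   for (const op of chain.split('|')) {
+     if (out === null) break; // a null ends the chain early
+     out = ops[op](out);
+   }
+   return out;
+ }
+
+ console.log(cleanse('  HELLO  ', 'trim|lowercase')); // "hello"
+ console.log(cleanse('   ', 'trim|nullIfEmpty'));     // null
+ ```
+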
348
+ ---
349
+
350
+ ## 📁 Repository Structure
351
+
352
+ ```
353
+ sluice/
354
+ ├── src/
355
+ │   ├── cli.ts                    ← CLI entry point (commander)
356
+ │   ├── runner.ts                 ← PipelineRunner — single-source orchestration
357
+ │   ├── multi-source-runner.ts    ← MultiSourcePipelineRunner (Phase 3)
358
+ │   ├── config/                   ← Zod schema, YAML loader, ENV var + composite expansion
359
+ │   ├── adapters/
360
+ │   │   ├── source/               ← mssql, pg, csv, xlsx, rest
361
+ │   │   └── target/               ← bc, ifs, bluecherry, csv, pg
362
+ │   ├── staging/                  ← DuckDB wrapper (stg_raw → stg_merged → stg_transformed)
363
+ │   ├── dq/                       ← DQ engine, rules, rejection reporter
364
+ │   ├── transform/                ← Transform engine, lookup resolver, cleanse ops
365
+ │   ├── merge/                    ← MergeEngine, SQL builder, 4 built-in strategies
366
+ │   ├── plugins/                  ← Rule/Transform/Merge registries + file & npm loaders
367
+ │   └── utils/                    ← logger (pino), errors, env helpers
368
+ ├── tests/
369
+ │   ├── fixtures/                 ← sample pipeline YAMLs, CSV/rules data, plugin files
370
+ │   ├── unit/                     ← unit tests (all I/O mocked)
371
+ │   └── integration/              ← real DuckDB :memory: + CSV fixtures
372
+ └── clients/                      ← 🙈 gitignored — each client has their own repo
373
+     ├── acme-corp/                ← Acme Corp pipelines
374
+     └── style-co/                 ← Style Co pipelines
375
+ ```
376
+
377
+ ---
378
+
379
+ ## ⚙️ Environment Variables
380
+
381
+ Connection strings and credentials live in `.env` (never in YAML files, never in Git).
382
+
383
+ ```bash
384
+ # .env
385
+ SOURCE_MSSQL=mssql://user:password@serverlegacy.example.local/LegacyDB
386
+ BC_BASE_URL=https://api.businesscentral.dynamics.com/v2.0
387
+ BC_TENANT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
388
+ BC_CLIENT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
389
+ BC_CLIENT_SECRET=your-secret-here
390
+ BC_COMPANY=Example Company Ltd
391
+ ```
392
+
393
+ Reference them in YAML with `${ENV_VAR}` — resolved at runtime, never stored in config:
394
+
395
+ ```yaml
396
+ source:
397
+ adapter: mssql
398
+ connection: ${SOURCE_MSSQL}
399
+ ```
400
+
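+ The substitution itself is simple string interpolation. A sketch, assuming the plain `${VAR}` form shown above (Sluice's actual resolver may differ):
+
+ ```typescript
+ // Replace ${VAR} references with values from process.env.
+ function resolveEnvVars(raw: string): string {
+   return raw.replace(/\$\{(\w+)\}/g, (_, name: string) => {
+     const value = process.env[name];
+     if (value === undefined) throw new Error(`Missing environment variable: ${name}`);
+     return value;
+   });
+ }
+
+ process.env.SOURCE_MSSQL = 'mssql://user:password@serverlegacy.example.local/LegacyDB';
+ console.log(resolveEnvVars('connection: ${SOURCE_MSSQL}'));
+ ```
+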
401
+ ---
402
+
403
+ ## 🧩 Phase 2: Extension System
404
+
405
+ Phase 2 adds a three-tier plugin system so you can extend Sluice without touching the core engine.
406
+
407
+ ### Tier 1 — Composite Rules (YAML) 📋
408
+
409
+ Name a bundle of checks in a shared rules file and reference them like built-ins:
410
+
411
+ ```yaml
412
+ # shared/rules.yaml
413
+ rules:
414
+ - id: style-coStyleNo
415
+ checks:
416
+ - { type: notNull, severity: critical }
417
+ - { type: pattern, value: "^[A-Z]{2}[0-9]{4}$", severity: critical }
418
+ - { type: maxLength, value: 6, severity: critical }
419
+ ```
420
+
421
+ ```yaml
422
+ # In your pipeline:
423
+ dq:
424
+ rulesFile: ../../shared/rules.yaml
425
+ rules:
426
+ - field: STYLE_NO
427
+ checks:
428
+ - { type: style-coStyleNo } # expands to the three checks above ✨
429
+ ```
430
+
431
+ ### Tier 2 — Plugin Files (TypeScript) 🔌
432
+
433
+ Drop a `*.rule.ts`, `*.transform.ts`, or `*.merge.ts` file into a `plugins/` folder next to your pipeline YAMLs. Auto-discovered at startup:
434
+
435
+ ```typescript
436
+ // plugins/ukVatNumber.rule.ts
437
+ export const rule: RulePlugin = {
438
+ id: 'ukVatNumber',
439
+ validate(value, config, rowIndex, field) {
440
+ const valid = /^GB([0-9]{9}|[0-9]{12}|(GD|HA)[0-9]{3})$/.test(String(value));
441
+ return valid ? null : { field, rowIndex, value, rule: 'ukVatNumber',
442
+ severity: config.severity, message: 'Invalid UK VAT number' };
443
+ }
444
+ };
445
+ ```
446
+
447
+ ### Tier 3 — npm Packages 📦
448
+
449
+ When plugins are useful across multiple clients, promote them to scoped npm packages and declare them in `sluice.config.yaml`:
450
+
451
+ ```yaml
452
+ # sluice.config.yaml
453
+ plugins:
454
+ - package: "@caracal-lynx/etl-rules-uk"
455
+ - package: "@caracal-lynx/etl-rules-fashion"
456
+ - package: "@caracal-lynx/etl-transform-ifs"
457
+ ```
458
+
459
+ All three tiers use the same registry interfaces and are invoked identically by the engines. The engine doesn't know or care which tier a rule came from. 🤷
460
+
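+ For orientation, here is what a Tier 3 package's entry point might look like. The `register()` export is documented above; the `SluiceRegistry` shape, its `addRule()` method, and the import path are assumptions for illustration, and PLUGINS.md has the actual contract:
+
+ ```typescript
+ // Hypothetical Tier 3 entry point. The SluiceRegistry shape, addRule(),
+ // and the RulePlugin import path are assumptions; see PLUGINS.md.
+ import type { RulePlugin } from '@caracal-lynx/sluice';
+
+ interface SluiceRegistry {
+   addRule(rule: RulePlugin): void;
+ }
+
+ const iso8601Date: RulePlugin = {
+   id: 'iso8601Date',
+   validate(value, config, rowIndex, field) {
+     const valid = /^\d{4}-\d{2}-\d{2}$/.test(String(value));
+     return valid ? null : { field, rowIndex, value, rule: 'iso8601Date',
+       severity: config.severity, message: 'Not an ISO 8601 date' };
+   },
+ };
+
+ export function register(registry: SluiceRegistry): void {
+   registry.addRule(iso8601Date);
+ }
+ ```
+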
461
+ ### List Loaded Plugins
462
+
463
+ ```bash
464
+ sluice plugins
465
+
466
+ # Include extra plugin directories outside the pipeline folder
467
+ sluice plugins --plugins ./shared/plugins ./team/plugins
468
+ ```
469
+
470
+ Output:
471
+ ```
472
+ 📋 Data Quality Rules:
473
+   • ukVatNumber
474
+   • bcAccountCode
475
+   • iso8601Date
476
+
477
+ 🔄 Transform Operations:
478
+   • slugGenerator
479
+   • normalizeCompanyName
480
+   • fixedDecimal
481
+
482
+ 🔀 Merge Strategies:
483
+   • coalesce
484
+   • priority-override
485
+   • union
486
+   • intersect
487
+ ```
488
+
489
+ ### Getting Started with Plugins
490
+
491
+ Detailed guide: **[PLUGINS.md](./PLUGINS.md)**
492
+
493
+ - Create a custom DQ rule
494
+ - Create a custom transform operation
495
+ - Create a custom merge strategy
496
+ - Package plugins as npm packages
497
+ - Test and debug plugins
498
+ - Real-world examples
499
+
500
+ ---
501
+
502
+ ## 🔀 Multi-Source Merge
503
+
504
+ Phase 3 lets a single pipeline extract from **2+ sources** and merge them on a key column before DQ and transform. Useful when the master record for an entity is scattered across systems — master data in SQL Server, pricing enrichment in an Excel sheet, product descriptions in a REST API, and so on.
505
+
506
+ ### Built-in merge strategies
507
+
508
+ | Strategy | Behaviour | When to use |
509
+ |---|---|---|
510
+ | `coalesce` | First non-null value wins (priority-ordered; whitespace treated as blank) | Enriching a primary source with fallback data from lower-priority sources |
511
+ | `priority-override` | Highest-priority source wins, even if null or blank | Strict priority — the trusted source is the trusted source, full stop |
512
+ | `union` | All rows from all sources, deduplicated by key | Combining independent datasets (e.g. multi-warehouse inventory) |
513
+ | `intersect` | Only rows present in **all** sources | Reconciliation / "find the records that agree" |
514
+
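+ To make `coalesce` concrete, here is the row-level behaviour in miniature; a toy illustration, not the MergeEngine's SQL implementation:
+
+ ```typescript
+ // Toy coalesce: rows for one key, ordered by source priority (1 first).
+ // First non-blank value per field wins; whitespace counts as blank.
+ type Row = Record<string, string | null>;
+
+ const isBlank = (v: string | null | undefined) =>
+   v === null || v === undefined || v.trim() === '';
+
+ function coalesce(rowsByPriority: Row[]): Row {
+   const merged: Row = {};
+   for (const row of rowsByPriority) {
+     for (const [field, value] of Object.entries(row)) {
+       if (isBlank(merged[field]) && !isBlank(value)) merged[field] = value;
+     }
+   }
+   return merged;
+ }
+
+ const sqlServer = { STYLE_NO: 'AB1234', STYLE_DESC: 'Merino jumper', FIBRE_CONTENT: '' };
+ const excel = { STYLE_NO: 'AB1234', STYLE_DESC: 'Jumper', FIBRE_CONTENT: '100% wool' };
+ console.log(coalesce([sqlServer, excel]));
+ // { STYLE_NO: 'AB1234', STYLE_DESC: 'Merino jumper', FIBRE_CONTENT: '100% wool' }
+ ```
+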
515
+ Custom strategies can be dropped in as `*.merge.ts` plugins or shipped as npm packages — same three-tier model as DQ rules and transforms.
516
+
517
+ ### A minimal multi-source pipeline
518
+
519
+ ```yaml
520
+ pipeline:
521
+ name: style-co-products-merged
522
+ client: style-co
523
+ version: "1.0"
524
+ entity: Style
525
+
526
+ sources:
527
+ - id: sql-server # staging table: stg_raw_sql-server
528
+ priority: 1 # lower = higher precedence
529
+ adapter: mssql
530
+ connection: ${SOURCE_2_MSSQL}
531
+ query: "SELECT STYLE_NO, STYLE_DESC, COST_PRICE FROM dbo.Styles WHERE Active = 1"
532
+
533
+ - id: excel
534
+ priority: 2
535
+ adapter: xlsx
536
+ file: ./data/product-data.xlsx
537
+ sheet: "Products"
538
+ rename: # applied in-place after extract, before DQ
539
+ Style Number: STYLE_NO
540
+ Description: STYLE_DESC
541
+ Fibre: FIBRE_CONTENT
542
+
543
+ merge:
544
+ key: STYLE_NO # single column or array for composite keys
545
+ strategy: coalesce
546
+ onUnmatched: include # include | exclude | warn | error
547
+ fieldStrategies: # per-field overrides
548
+ - { field: FIBRE_CONTENT, source: excel } # pin to one source
549
+ - { field: COST_PRICE, strategy: priority-override }
550
+ conflictLog: ./output/style-co-products-conflicts.csv # optional CSV of field disagreements
551
+
552
+ dq:
553
+ stopOnCritical: true
554
+ rules:
555
+ - field: STYLE_NO # 🎯 pre-merge: scoped to one source
556
+ sourceId: sql-server
557
+ checks: [ { type: notNull, severity: critical }, { type: unique, severity: critical } ]
558
+ - field: STYLE_DESC # 🎯 post-merge: runs against stg_merged
559
+ checks: [ { type: notNull, severity: critical } ]
560
+
561
+ transform: { ... }
562
+ target: { ... }
563
+ ```
564
+
565
+ Pre-merge rules (`sourceId: …`) run against each source's staging table before merging and generate per-source rejection CSVs (suffixed `-{sourceId}`). Post-merge rules (no `sourceId`) run once against `stg_merged`.
566
+
567
+ ### Incremental multi-source
568
+
569
+ ```yaml
570
+ merge:
571
+ incrementalSource: sql-server # must match a source id; required in incremental mode
572
+ run:
573
+ mode: incremental
574
+ incrementalField: UPDATED_AT
575
+ ```
576
+
577
+ Only the named source is filtered by timestamp; other sources run full each time. The state file gains a per-source `sources` block tracking each source's last run time.
578
+
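+ A sketch of that `sources` block; the surrounding file shape is a guess for illustration:
+
+ ```typescript
+ // Hypothetical state-file contents after an incremental multi-source run.
+ // Only `sources` is documented above; the other fields are illustrative guesses.
+ const state = {
+   pipeline: 'style-co-products-merged',
+   sources: {
+     'sql-server': { lastRun: '2025-11-03T06:00:12Z' }, // filtered by UPDATED_AT
+     'excel': { lastRun: '2025-11-03T06:00:12Z' },      // always runs full
+   },
+ };
+ ```
+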
579
+ ### Inspect merge strategies
580
+
581
+ ```bash
582
+ sluice merge list-strategies # ids + descriptions for all registered strategies
583
+ sluice merge info coalesce # details for one strategy
584
+ ```
585
+
586
+ A full working example lives at [tests/fixtures/style-co-products-merged.pipeline.yaml](tests/fixtures/style-co-products-merged.pipeline.yaml).
587
+
588
+ ---
589
+
590
+ ## 🧪 Testing
591
+
592
+ ```bash
593
+ npm test # run tests once
594
+ npm run test:watch # watch mode (great for TDD)
595
+ npm run test:cov # with coverage report
596
+ ```
597
+
598
+ - **Unit tests** mock all I/O with `vi.mock` — no live databases required
599
+ - **Integration tests** use real DuckDB (`:memory:`) with CSV fixtures
600
+ - Target: 80% line coverage across `src/dq/` and `src/transform/`
601
+ - CI runs on `ubuntu-latest` via GitHub Actions
602
+
603
+ ---
604
+
605
+ ## 🏗️ Development
606
+
607
+ ```bash
608
+ npm run build # tsc compile
609
+ npm run dev # tsx watch src/cli.ts (live reload)
610
+ npm run lint # eslint
611
+ npm run format # prettier
612
+
613
+ # Pretty logs in dev:
614
+ npm run dev -- run customers.pipeline.yaml | npx pino-pretty
615
+ ```
616
+
617
+ > **Note:** Uses `tsx`, not `ts-node`. Path aliases work correctly on Windows without extra configuration. 🪟
618
+
619
+ ---
620
+
621
+ ## 🚫 Things Sluice Is Not
622
+
623
+ - ❌ A web application or dashboard (there's no UI — this is a good thing)
624
+ - ❌ A streaming / real-time ingestion platform
625
+ - ❌ A data warehouse
626
+ - ❌ A multi-tenant SaaS product
627
+ - ❌ An excuse to use `eval()` anywhere
628
+
629
+ ---
630
+
631
+ ## 🏢 Sluice + Caracal Lynx Professional Services
632
+
633
+ The Sluice core CLI is open-source and free to use. Caracal Lynx offers additional paid services built on top of it:
634
+
635
+ | Service | What it is |
636
+ |---|---|
637
+ | **Enrichment Service** | Async API lookups (EU VAT, UK VAT, trade tariff) — fills gaps in source data |
638
+ | **Application Adapters** | Pre-built ERP adapters (IFS, Business Central, BlueCherry) |
639
+ | **Domain Rule Packages** | UK compliance rules, fashion/retail data standards |
640
+ | **Client-Specific Plugins** | Bespoke plugins tailored to your source system and data model |
641
+ | **Sluice MCP Server** 🚧 | AI-assisted migration using Claude — agentic pipeline authoring, live schema inspection, automatic DQ iteration. *Coming soon — Phase 9.* |
642
+ | **Migration Delivery** | Full end-to-end data migration, delivered by Caracal Lynx |
643
+
644
+ 📧 **sluice@caracallynx.com**
645
+ 🌐 **[caracallynx.com](https://caracallynx.com)**
646
+
647
+ ---
648
+
649
+ ## 🤝 Community
650
+
651
+ - 🐛 [Report a bug or request a feature](https://github.com/caracal-lynx/sluice/issues/new/choose)
652
+ - 💬 [Ask a question or share a use case](https://github.com/caracal-lynx/sluice/discussions)
653
+ - 🤲 [Contributing guide](CONTRIBUTING.md)
654
+ - 🤝 [Code of Conduct](CODE_OF_CONDUCT.md)
655
+
656
+ ---
657
+
658
+ ## 🔐 Security
659
+
660
+ Found a vulnerability? Please **do not** open a public issue. See [SECURITY.md](SECURITY.md) for the disclosure process — `security@caracallynx.com`, 48-hour acknowledgement, 90-day disclosure SLA.
661
+
662
+ ---
663
+
664
+ ## ⚖️ Licence
665
+
666
+ Sluice is licensed under the [Elastic Licence 2.0](LICENSE). See [LICENCE-FAQ.md](LICENCE-FAQ.md) for a plain-English explainer of what you can and can't do with it. Short version: use it freely for your own data migrations; don't resell it as a hosted service or strip the licence headers.
667
+
668
+ ---
669
+
670
+ ## 🏷️ About
671
+
672
+ Built and maintained by [Caracal Lynx Ltd.](https://caracallynx.com) (SC826823) — Gretna, Scotland.
673
+
674
+ ```
675
+ npm package: @caracal-lynx/sluice
676
+ owner: Caracal Lynx Ltd. (SC826823)
677
+ author: Michael Scott
678
+ maintainers: Michael Scott, Carolyn Scott, Andrew Scott, Duncan Scott
679
+ ```
680
+
681
+ ---
682
+
683
+ *Clean data flows through.* 💧