@caracal-lynx/sluice 0.1.1 → 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (3)
  1. package/PLUGINS.md +294 -0
  2. package/README.md +119 -11
  3. package/package.json +3 -2
package/PLUGINS.md ADDED
@@ -0,0 +1,294 @@
# Sluice — Plugin Author Guide

Sluice's pipeline configs are intentionally constrained — the schema is fixed, the field types are enumerated, the DQ check types are an explicit list. That keeps configs readable and reviewable. But every real migration eventually needs *something* the built-ins don't cover: a regex pattern that only makes sense for one client's data, a date format only one ERP uses, a merge strategy with custom precedence rules.

Plugins fill that gap without forcing you to fork the engine. Sluice exposes a **three-tier extension model** that scales from "I just want a reusable composite rule for this client" to "I want to publish a paid adapter package on npm."

---

## The three tiers at a glance

| Tier | What it is | Where it lives | Who writes it | Distribution |
|---|---|---|---|---|
| **Tier 1 — Composite YAML rules** | A named bundle of built-in DQ checks | YAML files in your project | Anyone — no code | In your repo |
| **Tier 2 — File-based plugins** | TypeScript module exporting a `RulePlugin`, `TransformPlugin`, or `MergeStrategyPlugin` | `plugins/*.{rule,transform,merge}.ts` in your project | Anyone with TypeScript | In your repo |
| **Tier 3 — npm package plugins** | A published npm package with a `register()` function | Anywhere installable via `npm install` | Plugin authors | npmjs.com (public or private) |

You can mix tiers freely — a single pipeline can pull composite rules (Tier 1), a local dev's file plugin (Tier 2), and an installed npm rule pack (Tier 3) all at once.

---

## Tier 1 — Composite YAML rules

A composite rule is a **named bundle of built-in checks** that you reference in pipeline DQ rules by a single ID. Useful when the same combination of checks (e.g., `notNull` + `pattern` + `maxLength` for an internal style number) repeats across many fields and many pipelines.

### Library file

Composite rules live in a YAML file referenced by `dq.rulesFile`. Convention: `shared/rules.yaml` at the repo root.

```yaml
# shared/rules.yaml
version: "1.0"

rules:
  - id: ukVatNumber
    description: UK VAT registration number — format only (existence check is a separate concern)
    checks:
      - { type: pattern, value: "^GB([0-9]{9}|[0-9]{12}|(GD|HA)[0-9]{3})$", severity: warning }

  - id: positivePrice
    description: Price must be a non-negative number with sensible upper bound
    checks:
      - { type: notNull, severity: critical }
      - { type: min, value: 0, severity: critical }
      - { type: max, value: 99999.99, severity: warning }
```

### Use in a pipeline

```yaml
# customers.pipeline.yaml
dq:
  rulesFile: ./shared/rules.yaml   # tell ConfigLoader where to find composite rules
  rules:
    - field: VAT_NUMBER
      checks:
        - { type: ukVatNumber }    # expands to the pattern check above at load time

    - field: COST_PRICE
      checks:
        - { type: positivePrice }  # expands to all three checks at load time
```

`ConfigLoader` expands composite-rule references into their underlying built-in checks **before** Zod validation runs, so the DQ engine only ever sees standard check types.

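The expansion step is easy to picture. A minimal sketch of what load-time expansion could look like (illustrative only: `expandChecks`, the `Check` shape, and the map-based lookup are assumptions, not Sluice's actual internals):

```typescript
// Illustrative sketch of composite-rule expansion, not Sluice's real code.
// Assumption: composite rules are held as a map of id -> built-in check list.
type Check = { type: string; value?: unknown; severity?: string };
type CompositeRules = Map<string, Check[]>;

function expandChecks(checks: Check[], composites: CompositeRules): Check[] {
  // A reference to a composite id is replaced by its bundled checks;
  // anything else is assumed to be a built-in check and passes through.
  return checks.flatMap((check) => composites.get(check.type) ?? [check]);
}

// The positivePrice bundle from shared/rules.yaml above:
const composites: CompositeRules = new Map([
  ["positivePrice", [
    { type: "notNull", severity: "critical" },
    { type: "min", value: 0, severity: "critical" },
    { type: "max", value: 99999.99, severity: "warning" },
  ]],
]);

console.log(expandChecks([{ type: "positivePrice" }], composites).length); // 3
```

Since expansion runs before Zod validation, an unresolved composite reference would presumably surface at load time as an unknown check type rather than as a runtime failure.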
### When to reach for Tier 1

- The same combination of checks repeats across multiple fields or pipelines.
- The combination doesn't need any custom code — built-in checks suffice.
- You want non-developers (data analysts, project managers) to be able to read and edit the rules.

### Constraints

- Composite rule IDs must be valid identifiers (`^[a-zA-Z][a-zA-Z0-9_-]*$`) and must not collide with built-in check type names (`notNull`, `unique`, `pattern`, etc.).
- Composite rules can only contain built-in checks — they cannot reference other composite rules (no nesting).

---

## Tier 2 — File-based plugins

When you need actual logic — a check that calls a custom regex with side conditions, a transform that does business-specific date parsing, a merge strategy that picks the maximum value across sources — write a TypeScript plugin file.

Plugins are auto-discovered from a `plugins/` directory next to your pipeline YAML (or from any directory passed via `--plugins`).

### File naming convention

| Filename suffix | Plugin type | Exported symbol |
|---|---|---|
| `*.rule.ts` (or `.rule.js`) | DQ rule | `export const rule: RulePlugin` |
| `*.transform.ts` (or `.transform.js`) | Field transform | `export const transform: TransformPlugin` |
| `*.merge.ts` (or `.merge.js`) | Merge strategy | `export const mergeStrategy: MergeStrategyPlugin` |

### Example — DQ rule plugin

```typescript
// plugins/ifs-customer-no.rule.ts
import type { RulePlugin } from '@caracal-lynx/sluice';

export const rule: RulePlugin = {
  id: 'ifsCustomerNo',
  description: 'IFS customer number — three uppercase letters followed by 4–7 digits',

  validate(value, config, rowIndex, field) {
    if (typeof value !== 'string') return null;
    if (/^[A-Z]{3}[0-9]{4,7}$/.test(value)) return null;
    return {
      field,
      rowIndex,
      value,
      rule: 'ifsCustomerNo',
      severity: config.severity,
      message: config.message ?? `${value} is not a valid IFS customer number`,
    };
  },
};
```

Use it in a pipeline:

```yaml
dq:
  rules:
    - field: CUSTOMER_NO
      checks:
        - { type: ifsCustomerNo, severity: critical }
```

### Example — transform plugin

```typescript
// plugins/season-from-date.transform.ts
import type { TransformPlugin } from '@caracal-lynx/sluice';

export const transform: TransformPlugin = {
  id: 'seasonFromDate',
  description: 'Derives a fashion season code (SS25, AW25 …) from a YYYY-MM-DD launch date',

  apply(value, row, config) {
    if (typeof value !== 'string') return null;
    const match = /^(\d{4})-(\d{2})-/.exec(value);
    if (!match) return null;
    const [, year, month] = match;
    const yy = year!.slice(2);
    const mm = parseInt(month!, 10);
    return mm >= 1 && mm <= 6 ? `SS${yy}` : `AW${yy}`;
  },
};
```

Use it in a pipeline:

```yaml
transform:
  fields:
    - { from: LAUNCH_DATE, to: Season, type: custom, customOp: seasonFromDate }
```

### Example — merge strategy plugin

```typescript
// plugins/max-cost.merge.ts
import type { MergeStrategyPlugin } from '@caracal-lynx/sluice';

export const mergeStrategy: MergeStrategyPlugin = {
  id: 'max-cost',
  description: 'Coalesce by key, picking the highest COST_PRICE across all sources',

  async merge(store, sources, config) {
    // Implementation uses the StagingStore SQL surface — see docs for full example
    // ...
    return { rowsMerged: 0, conflicts: 0, unmatched: 0, tableName: 'stg_merged' };
  },
};
```

### Loading & discovery

By default, the runner scans `{cwd}/plugins/` for files matching the suffixes above. You can pass extra directories with `--plugins`:

```bash
sluice run customers.pipeline.yaml --plugins ./shared/plugins --plugins ./team/plugins
```

All discovered plugins are registered before any pipeline phase runs. Duplicate IDs (across files, directories, or with built-ins) raise a `ConfigError` at startup — fail fast.

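That fail-fast duplicate check can be sketched as a tiny registry. Everything below is hypothetical (the real `ConfigError` class and registry internals are not shown in this guide); it only illustrates the behaviour described above:

```typescript
// Illustrative duplicate-ID detection, not Sluice's actual registry code.
class ConfigError extends Error {}                // assumed error type

class Registry<T extends { id: string }> {
  private items = new Map<string, T>();

  constructor(builtinIds: string[] = []) {
    // Seed with built-in IDs so plugins cannot shadow them.
    for (const id of builtinIds) this.items.set(id, { id } as T);
  }

  register(item: T): void {
    if (this.items.has(item.id)) {
      throw new ConfigError(`Duplicate plugin id: ${item.id}`);
    }
    this.items.set(item.id, item);
  }
}

const rules = new Registry<{ id: string }>(["notNull", "unique", "pattern"]);
rules.register({ id: "ifsCustomerNo" });          // ok
try {
  rules.register({ id: "notNull" });              // collides with a built-in
} catch (e) {
  console.log(e instanceof ConfigError);          // true
}
```

Registering before any phase runs means a colliding ID fails the whole invocation immediately, instead of silently shadowing a built-in mid-pipeline.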
### Constraints

- **Plugins must be pure.** No I/O, no async, no mutation of the input row, no global state. The DQ and transform engines call them in tight loops — side effects break determinism.
- **Errors must be predictable.** `RulePlugin.validate` returns `null` for "valid"; throw only on unrecoverable bugs. `TransformPlugin.apply` returns the transformed value; throw `TransformError` to fail the row.
- **Plugin IDs are global.** `ifsCustomerNo` lives in the same namespace as the built-in `notNull`, `unique`, etc. Choose distinctive IDs.

---

## Tier 3 — npm package plugins

When you want to **distribute** a plugin — to other projects in your organisation, to a client engagement, or to the public — package it as an npm module and register it via Sluice's package-discovery mechanism.

### Package shape

A plugin package exports a `register()` function that registers any combination of rules, transforms, and merge strategies:

```typescript
// @your-org/sluice-rules-uk/src/index.ts
import type {
  PluginPackage,
  RuleRegistry,
  TransformRegistry,
  MergeStrategyRegistry,
} from '@caracal-lynx/sluice';
import { ukVatNumber } from './rules/uk-vat-number.js';
import { ukPostcode } from './rules/uk-postcode-strict.js';
import { sortCodeAccount } from './rules/sort-code-account.js';

export const plugin: PluginPackage = {
  register(rules: RuleRegistry, transforms: TransformRegistry, options, merges) {
    rules.register(ukVatNumber);
    rules.register(ukPostcode);
    rules.register(sortCodeAccount);
    // transforms.register(...) and merges?.register(...) work the same way
  },
};
```

`package.json`:

```json
{
  "name": "@your-org/sluice-rules-uk",
  "version": "1.0.0",
  "main": "dist/index.js",
  "types": "dist/index.d.ts",
  "peerDependencies": {
    "@caracal-lynx/sluice": "^0.1.0"
  }
}
```

### Wiring it into a pipeline project

Each Sluice project can declare its npm plugin packages in a top-level `sluice.config.yaml`:

```yaml
# sluice.config.yaml — alongside your pipeline YAMLs
version: "1.0"

plugins:
  - package: '@your-org/sluice-rules-uk'
    options:                 # passed verbatim to register()
      enableExperimental: false
  - package: '@your-org/sluice-rules-fashion'
```

Then `npm install` the packages and `sluice run` will load them automatically:

```bash
npm install @your-org/sluice-rules-uk @your-org/sluice-rules-fashion
sluice run customers.pipeline.yaml
```

### When to reach for Tier 3

- You want to share a plugin across multiple projects or teams.
- You want to publish a commercial plugin (paid adapter, domain rule pack).
- You want versioning and changelogs separate from your pipeline configs.

### Distribution

Public plugins go on the public npm registry. Private plugins use a private registry (npmjs.com Pro plan, GitHub Packages, Verdaccio, etc.) — Sluice doesn't care which.

If you publish a public plugin, please declare `@caracal-lynx/sluice` as a `peerDependency` rather than a direct dependency, with a SemVer range that matches the Sluice public API surface you depend on.

---

## Plugin contracts (the rules)

Regardless of tier, every Sluice plugin must obey these:

1. **Pure.** No filesystem, network, database, or environment access. No timers, no `Math.random()` without a seed, no `Date.now()`. Plugins run inside tight loops — side effects break determinism and reproducibility.
2. **Synchronous.** `RulePlugin.validate` and `TransformPlugin.apply` run on the engine's synchronous hot path. (`MergeStrategyPlugin.merge` is the one `async` interface, because merge strategies query the staging DB; its async-ness is bounded to that.)
3. **Idempotent and stateless.** Calling a plugin twice with the same input must produce the same output. No instance state, no closures over external mutables.
4. **Throw `TransformError` / return `RuleViolation`** — don't throw raw strings, don't return `undefined`. The engine catches at the pipeline boundary.
5. **Don't mutate the row.** Plugins receive the source row by reference for cross-field reads; treat it as read-only.

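Because plugins must be pure and stateless, they are trivial to smoke-test outside the engine. A hypothetical harness (the local `Violation` type is a simplification, and the inlined rule mirrors the `ifsCustomerNo` example above; only the `validate` call shape follows this guide):

```typescript
// Hypothetical harness: call a rule plugin twice with the same input and
// compare results - a stateless, idempotent plugin must agree with itself.
type Violation = { field: string; rowIndex: number; message: string } | null;

const rule = {
  id: "ifsCustomerNo",
  validate(value: unknown, _config: unknown, rowIndex: number, field: string): Violation {
    if (typeof value !== "string") return null;            // non-strings are skipped
    if (/^[A-Z]{3}[0-9]{4,7}$/.test(value)) return null;   // valid -> no violation
    return { field, rowIndex, message: `${value} is not a valid IFS customer number` };
  },
};

const a = rule.validate("ABC1234", {}, 0, "CUSTOMER_NO");
const b = rule.validate("ABC1234", {}, 0, "CUSTOMER_NO");
console.log(JSON.stringify(a) === JSON.stringify(b));              // true - deterministic
console.log(rule.validate("abc", {}, 1, "CUSTOMER_NO") !== null);  // true - lowercase rejected
```

The same two-calls-same-result check works for transforms; any plugin that fails it is holding state and will misbehave in the engine's hot loop.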
The one exception to Rule 1 is the **enrich phase** (Phase 4) — `EnrichPlugin` is async and may call external APIs. That's a separate plugin interface (`@caracal-lynx/sluice-enrich`) with its own contract; see the enrich-phase docs for details.

---

## More

- The full schema reference for pipeline YAML, including how built-in checks and transforms work, lives in [CLAUDE.md](CLAUDE.md).
- The runtime types (`RulePlugin`, `TransformPlugin`, `MergeStrategyPlugin`, `PluginPackage`) are exported from the package root: `import type { RulePlugin } from '@caracal-lynx/sluice'`.
- Working examples of all three tiers ship in this repo's `tests/fixtures/plugins/` and `tests/fixtures/shared-rules.yaml`.

Questions, gaps, or contributions to this guide? Open a Discussion or send a PR.

— Caracal Lynx Ltd.
package/README.md CHANGED
@@ -4,9 +4,19 @@
 
 **`@caracal-lynx/sluice`** — a config-driven ETL toolkit for ERP data migrations, built by [Caracal Lynx Ltd.](https://caracallynx.com).
 
+ [![npm](https://img.shields.io/npm/v/@caracal-lynx/sluice)](https://www.npmjs.com/package/@caracal-lynx/sluice)
  [![Node 24](https://img.shields.io/badge/Node-24_LTS-green)](https://nodejs.org)
  [![TypeScript](https://img.shields.io/badge/TypeScript-6.x-blue)](https://www.typescriptlang.org)
- [![License](https://img.shields.io/badge/license-Elastic_2.0-blue)](LICENSE)
+ [![License](https://img.shields.io/badge/license-Elastic_2.0-blue)](LICENCE-FAQ.md)
+ <!-- TODO: add Docs badge once Phase 8 ships -->
+
+ ---
+
+ > **Data quality is the hidden blocker for both migrations and AI adoption.**
+ >
+ > Sluice is a data migration and data quality tool that validates your data *before* it reaches its destination — not after. You describe the entire migration as a YAML file: where the data comes from, the quality rules it has to pass, how each field maps to the target. Sluice validates the source, transforms it, and loads only the clean records — the bad rows go to a rejection report so you can fix the source.
+ >
+ > *Clean data flows through.*
 
 ---
 
@@ -106,22 +116,79 @@ flowchart LR
 
 ---
 
+ ## 🧩 Extension model
+
+ Sluice's pipeline schema is fixed by design (readability, reviewability, predictable validation). Anything you can't express in the schema, you add via plugins. Three tiers, scaling from "no code, no install" to "publishable npm package":
+
+ | Tier | What it is | Where it lives | Best for |
+ |---|---|---|---|
+ | **Tier 1** | YAML composite rules — bundle built-in DQ checks under a single ID | `shared/rules.yaml` in your project | Reusing common check combinations across pipelines without writing code |
+ | **Tier 2** | TypeScript file plugins — `*.rule.ts` / `*.transform.ts` / `*.merge.ts` | `plugins/` next to your YAML | Custom logic for one project; rapid iteration |
+ | **Tier 3** | npm packages exporting `register()` | npmjs.com (public or private) | Distributing rules / adapters / strategies across teams or as paid products |
+
+ See **[PLUGINS.md](PLUGINS.md)** for the full author's guide with worked examples for all three tiers.
+
+ ---
+
  ## 🚀 Quick Start
 
+ A complete pipeline in 20 lines: read a CSV, validate emails, lowercase them, write the clean rows to a new CSV. The full file is checked into the repo at [`examples/hello-world.pipeline.yaml`](examples/hello-world.pipeline.yaml) with sample data at [`examples/data/hello-world.csv`](examples/data/hello-world.csv).
+
+ ```yaml
+ pipeline:
+   name: hello-world
+   client: demo
+   version: "1.0"
+   entity: Customer
+
+ source:
+   adapter: csv
+   file: ./examples/data/hello-world.csv
+
+ dq:
+   rules:
+     - field: email
+       checks:
+         - { type: notNull, severity: critical }
+         - { type: email, severity: warning }
+
+ transform:
+   fields:
+     - { from: name, to: Name, type: string, cleanse: trim }
+     - { from: email, to: Email, type: string, cleanse: trim|lowercase }
+     - { from: country, to: Country, type: string, default: GB }
+
+ target:
+   adapter: csv
+   output: ./output/hello-world-clean.csv
+ ```
+
+ Run it end to end:
+
  ```bash
- # Install
- npm install @caracal-lynx/sluice
+ # 1. Install
+ npm install -g @caracal-lynx/sluice
 
- # Check a pipeline config is valid (no data touched)
- sluice check customers.pipeline.yaml
+ # 2. Validate the config without touching any data
+ sluice check examples/hello-world.pipeline.yaml
 
- # Run DQ and transform but don't write output
- sluice validate customers.pipeline.yaml
+ # 3. Dry-run: extract + DQ + transform but don't write the target
+ sluice run examples/hello-world.pipeline.yaml --dry-run
+
+ # 4. Live run — writes ./output/hello-world-clean.csv +
+ #    ./output/hello-world-rejected.csv (if any DQ failures)
+ sluice run examples/hello-world.pipeline.yaml
+ ```
+
+ The sample data has one row with a malformed email — that's a `warning`, so the row is kept in the output but flagged in `output/hello-world-rejected.csv`. Open both CSVs side by side to see what passed and what got reported. Add an `unknown@bad`-style row (or strip an email entirely) to see how a `critical` failure halts the pipeline before any output is written.
+
+ ### Other CLI commands
 
- # Go for it 🚀
- sluice run customers.pipeline.yaml
+ ```bash
+ # Run DQ + transform; skip the load (faster than --dry-run for spec checks)
+ sluice validate customers.pipeline.yaml
 
- # Profile source data (column stats, no DQ)
+ # Profile source data: column stats, distinct counts, samples; no DQ
  sluice profile customers.pipeline.yaml
 
  # Inspect loaded plugins and merge strategies
@@ -559,7 +626,48 @@ npm run dev -- run customers.pipeline.yaml | npx pino-pretty
 
 ---
 
- ## 📦 Package Info
+ ## 🏢 Sluice + Caracal Lynx Professional Services
+
+ The Sluice core CLI is open-source and free to use. Caracal Lynx offers additional paid services built on top of it:
+
+ | Service | What it is |
+ |---|---|
+ | **Enrichment Service** | Async API lookups (EU VAT, UK VAT, trade tariff) — fills gaps in source data |
+ | **Application Adapters** | Pre-built ERP adapters (IFS, Business Central, BlueCherry) |
+ | **Domain Rule Packages** | UK compliance rules, fashion/retail data standards |
+ | **Client-Specific Plugins** | Bespoke plugins tailored to your source system and data model |
+ | **Sluice MCP Server** 🚧 | AI-assisted migration using Claude — agentic pipeline authoring, live schema inspection, automatic DQ iteration. *Coming soon — Phase 9.* |
+ | **Migration Delivery** | Full end-to-end data migration, delivered by Caracal Lynx |
+
+ 📧 **michael.scott@caracallynx.com**
+ 🌐 **[caracallynx.com](https://caracallynx.com)**
+
+ ---
+
+ ## 🤝 Community
+
+ - 🐛 [Report a bug or request a feature](https://github.com/caracal-lynx/sluice/issues/new/choose)
+ - 💬 [Ask a question or share a use case](https://github.com/caracal-lynx/sluice/discussions)
+ - 🤲 [Contributing guide](CONTRIBUTING.md)
+ - 🤝 [Code of Conduct](CODE_OF_CONDUCT.md)
+
+ ---
+
+ ## 🔐 Security
+
+ Found a vulnerability? Please **do not** open a public issue. See [SECURITY.md](SECURITY.md) for the disclosure process — `security@caracallynx.com`, 48-hour acknowledgement, 90-day disclosure SLA.
+
+ ---
+
+ ## ⚖️ Licence
+
+ Sluice is licensed under the [Elastic Licence 2.0](LICENSE). See [LICENCE-FAQ.md](LICENCE-FAQ.md) for a plain-English explainer of what you can and can't do with it. Short version: use it freely for your own data migrations; don't resell it as a hosted service or strip the licence headers.
+
+ ---
+
+ ## 🏷️ About
+
+ Built and maintained by [Caracal Lynx Ltd.](https://caracallynx.com) (SC826823) — Gretna, Scotland.
 
  ```
  npm package: @caracal-lynx/sluice
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@caracal-lynx/sluice",
-  "version": "0.1.1",
+  "version": "0.1.2",
   "description": "Config-driven ETL toolkit for ERP data migrations",
   "license": "Elastic-2.0",
   "author": "Caracal Lynx Ltd. <michael.scott@caracallynx.com> (https://caracallynx.com)",
@@ -34,7 +34,8 @@
     "README.md",
     "LICENSE",
     "LICENCE-FAQ.md",
-    "CLAUDE.md"
+    "CLAUDE.md",
+    "PLUGINS.md"
   ],
   "publishConfig": {
     "access": "public"