goldenpipe 0.1.0
- package/LICENSE +21 -0
- package/README.md +157 -0
- package/dist/cli.cjs +1055 -0
- package/dist/cli.cjs.map +1 -0
- package/dist/cli.d.cts +1 -0
- package/dist/cli.d.ts +1 -0
- package/dist/cli.js +1053 -0
- package/dist/cli.js.map +1 -0
- package/dist/core/index.cjs +898 -0
- package/dist/core/index.cjs.map +1 -0
- package/dist/core/index.d.cts +439 -0
- package/dist/core/index.d.ts +439 -0
- package/dist/core/index.js +861 -0
- package/dist/core/index.js.map +1 -0
- package/dist/index.cjs +898 -0
- package/dist/index.cjs.map +1 -0
- package/dist/index.d.cts +2 -0
- package/dist/index.d.ts +2 -0
- package/dist/index.js +861 -0
- package/dist/index.js.map +1 -0
- package/dist/node/index.cjs +1081 -0
- package/dist/node/index.cjs.map +1 -0
- package/dist/node/index.d.cts +43 -0
- package/dist/node/index.d.ts +43 -0
- package/dist/node/index.js +1039 -0
- package/dist/node/index.js.map +1 -0
- package/package.json +90 -0
package/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 Ben Severn
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
package/README.md
ADDED
|
@@ -0,0 +1,157 @@
|
|
|
1
|
+
# goldenpipe
|
|
2
|
+
|
|
3
|
+
Golden Suite orchestrator for TypeScript — chains **GoldenCheck → GoldenFlow → GoldenMatch** into one adaptive, pluggable pipeline. TypeScript port of the [`goldenpipe`](https://github.com/benseverndev-oss/goldenmatch/tree/main/packages/python/goldenpipe) Python library.
|
|
4
|
+
|
|
5
|
+
It composes the edge-safe cores of the three sibling packages:
|
|
6
|
+
|
|
7
|
+
- [`goldencheck`](https://www.npmjs.com/package/goldencheck) — data-quality scan (`scanData`)
|
|
8
|
+
- [`goldenflow`](https://www.npmjs.com/package/goldenflow) — transforms / standardization (`TransformEngine`)
|
|
9
|
+
- [`goldenmatch`](https://www.npmjs.com/package/goldenmatch) — dedupe / entity resolution (`dedupe`)
|
|
10
|
+
|
|
11
|
+
Data flows through the pipeline as `Row[]` (arrays of plain objects).
|
|
12
|
+
|
|
13
|
+
## Install
|
|
14
|
+
|
|
15
|
+
```bash
|
|
16
|
+
npm install goldenpipe
|
|
17
|
+
# the three siblings come along as dependencies
|
|
18
|
+
```
|
|
19
|
+
|
|
20
|
+
`yaml` is an optional peer dependency, needed only for YAML config loading:
|
|
21
|
+
|
|
22
|
+
```bash
|
|
23
|
+
npm install yaml
|
|
24
|
+
```
|
|
25
|
+
|
|
26
|
+
## Quick start
|
|
27
|
+
|
|
28
|
+
```ts
|
|
29
|
+
import { runDf } from "goldenpipe";
|
|
30
|
+
|
|
31
|
+
const rows = [
|
|
32
|
+
{ first_name: "John", last_name: "Smith", email: "john@example.com" },
|
|
33
|
+
{ first_name: "Jon", last_name: "Smith", email: "john@example.com" },
|
|
34
|
+
{ first_name: "Jane", last_name: "Doe", email: "jane@example.com" },
|
|
35
|
+
];
|
|
36
|
+
|
|
37
|
+
// Zero-config: runs goldencheck.scan -> goldenflow.transform -> goldenmatch.dedupe
|
|
38
|
+
const result = await runDf(rows);
|
|
39
|
+
|
|
40
|
+
console.log(result.status); // "success"
|
|
41
|
+
console.log(result.inputRows); // 3
|
|
42
|
+
console.log(result.artifacts.golden); // golden (canonical) records
|
|
43
|
+
console.log(result.artifacts.unique); // distinct records
|
|
44
|
+
```
|
|
45
|
+
|
|
46
|
+
> **Async:** the runner is async because GoldenMatch's `dedupe` is async. `runDf`, `runStages`, `Pipeline.run`, and the node `run(source)` all return promises.
|
|
47
|
+
|
|
48
|
+
### From a CSV file (Node)
|
|
49
|
+
|
|
50
|
+
```ts
|
|
51
|
+
import { run } from "goldenpipe/node";
|
|
52
|
+
|
|
53
|
+
const result = await run("people.csv"); // zero-config
|
|
54
|
+
const result2 = await run("people.csv", { config: "pipeline.yml" });
|
|
55
|
+
```
|
|
56
|
+
|
|
57
|
+
### Custom pipeline config
|
|
58
|
+
|
|
59
|
+
```ts
|
|
60
|
+
import { runDf, makePipelineConfig, makeStageSpec } from "goldenpipe";
|
|
61
|
+
|
|
62
|
+
const config = makePipelineConfig({
|
|
63
|
+
pipeline: "check-and-dedupe",
|
|
64
|
+
stages: [
|
|
65
|
+
"goldencheck.scan",
|
|
66
|
+
makeStageSpec({ use: "goldenmatch.dedupe", config: { threshold: 0.9 } }),
|
|
67
|
+
// omit goldenflow.transform to skip transformation
|
|
68
|
+
],
|
|
69
|
+
});
|
|
70
|
+
|
|
71
|
+
const result = await runDf(rows, config);
|
|
72
|
+
```
|
|
73
|
+
|
|
74
|
+
### Programmatic stages
|
|
75
|
+
|
|
76
|
+
```ts
|
|
77
|
+
import { runStages, stage, StageStatus } from "goldenpipe";
|
|
78
|
+
|
|
79
|
+
const myStage = stage(
|
|
80
|
+
{ name: "tagger", produces: ["tag"], consumes: ["df"] },
|
|
81
|
+
(ctx) => {
|
|
82
|
+
ctx.artifacts.tag = (ctx.df ?? []).length;
|
|
83
|
+
return { status: StageStatus.SUCCESS };
|
|
84
|
+
},
|
|
85
|
+
);
|
|
86
|
+
|
|
87
|
+
const result = await runStages([myStage], rows);
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
## CLI
|
|
91
|
+
|
|
92
|
+
```bash
|
|
93
|
+
goldenpipe-js run people.csv [-c pipeline.yml] [-v] # run the chain on a CSV
|
|
94
|
+
goldenpipe-js stages # list registered stages
|
|
95
|
+
goldenpipe-js validate -c pipeline.yml # dry-run wiring validation
|
|
96
|
+
goldenpipe-js init [-d .] # scaffold a goldenpipe.yml
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
## Architecture
|
|
100
|
+
|
|
101
|
+
```mermaid
|
|
102
|
+
flowchart LR
|
|
103
|
+
L[load] --> C[goldencheck.scan]
|
|
104
|
+
C --> F[goldenflow.transform]
|
|
105
|
+
F --> M[goldenmatch.dedupe]
|
|
106
|
+
```
|
|
107
|
+
|
|
108
|
+
| Stage | Wraps | Produces |
|
|
109
|
+
|-------|-------|----------|
|
|
110
|
+
| `load` | built-in | `df` |
|
|
111
|
+
| `goldencheck.scan` | `scanData(TabularData)` | `findings`, `profile`, `column_contexts` |
|
|
112
|
+
| `goldenflow.transform` | `new TransformEngine(cfg).transformDf(rows)` | `df`, `manifest` |
|
|
113
|
+
| `goldenmatch.dedupe` | `await dedupe(rows, { config })` | `clusters`, `golden`, `unique`, `dupes`, `match_stats`, `scored_pairs` |
|
|
114
|
+
|
|
115
|
+
The engine layer mirrors the Python design:
|
|
116
|
+
|
|
117
|
+
- **registry** — a STATIC registry (`buildDefaultRegistry()`) replacing Python's entry-point discovery.
|
|
118
|
+
- **resolver** — builds an `ExecutionPlan`, auto-prepends `load`, validates `consumes`/`produces` wiring.
|
|
119
|
+
- **router** — applies a stage's `Decision` (skip / insert / abort) to the remaining plan.
|
|
120
|
+
- **runner** — async stage execution with per-stage error handling + `skipIf` gating.
|
|
121
|
+
- **reporter** — assembles the `PipeResult` (status, stages, artifacts, errors, reasoning, timing).
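
The resolver's wiring check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not goldenpipe's implementation: `StageInfo` and `resolvePlan` are hypothetical names, and the only behavior taken from the README is that `load` is auto-prepended and that each stage's `consumes` must be satisfied by an earlier stage's `produces`.

```typescript
// Hypothetical shape for illustration; not goldenpipe's actual types.
interface StageInfo {
  name: string;
  consumes: string[];
  produces: string[];
}

// Build an ordered plan: prepend "load" if it is missing, then verify that
// every artifact a stage consumes was produced by an earlier stage.
function resolvePlan(stages: StageInfo[]): StageInfo[] {
  const load: StageInfo = { name: "load", consumes: [], produces: ["df"] };
  const plan = stages.some((s) => s.name === "load")
    ? [...stages]
    : [load, ...stages];

  const available = new Set<string>();
  for (const s of plan) {
    const missing = s.consumes.filter((a) => !available.has(a));
    if (missing.length > 0) {
      throw new Error(
        `stage "${s.name}" consumes unproduced artifacts: ${missing.join(", ")}`,
      );
    }
    for (const a of s.produces) available.add(a);
  }
  return plan;
}

const plan = resolvePlan([
  {
    name: "goldencheck.scan",
    consumes: ["df"],
    produces: ["findings", "profile", "column_contexts"],
  },
  { name: "goldenmatch.dedupe", consumes: ["df"], produces: ["golden", "unique"] },
]);
console.log(plan.map((s) => s.name).join(" -> "));
```

Failing at plan-build time rather than mid-run is the point of the check: a stage that consumes an artifact nobody produces is rejected before any data is processed.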
|
|
122
|
+
|
|
123
|
+
A **column-context pipeline** carries semantic metadata across stages: GoldenCheck builds `ColumnContext`s (name-regex classification + IQR cardinality banding + identifier inference), GoldenFlow enriches them (date transforms confirm date type), and GoldenMatch consumes them to build a targeted dedupe config (`buildConfigFromContexts`) instead of re-profiling.
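
As a rough sketch of that hand-off, the three steps could look like the following. Everything here is an assumption made for illustration: `ColumnContextSketch`, the regex rules, and the `{ exact, fuzzy }` plan are hypothetical and do not reproduce the real `ColumnContext` type or the output of `buildConfigFromContexts`. IQR cardinality banding and identifier inference are left out.

```typescript
type SemanticType = "email" | "name" | "date" | "unknown";

// Hypothetical context record; the real ColumnContext likely carries more.
interface ColumnContextSketch {
  name: string;
  semanticType: SemanticType;
  confirmed: boolean; // set once a later stage corroborates the guess
}

// Illustrative name rules only; the first matching rule wins.
const NAME_RULES: Array<[RegExp, SemanticType]> = [
  [/email/i, "email"],
  [/name$/i, "name"],
  [/(^|_)(date|dob)$|_at$/i, "date"],
];

// Step 1 (check): classify a column from its name alone.
function classifyByName(column: string): ColumnContextSketch {
  const hit = NAME_RULES.find(([pattern]) => pattern.test(column));
  return { name: column, semanticType: hit ? hit[1] : "unknown", confirmed: false };
}

// Step 2 (flow): a successful date transform confirms the date guess.
function confirmDates(
  contexts: ColumnContextSketch[],
  parsedAsDate: Set<string>,
): ColumnContextSketch[] {
  return contexts.map((c) =>
    c.semanticType === "date" && parsedAsDate.has(c.name)
      ? { ...c, confirmed: true }
      : c,
  );
}

// Step 3 (match): pick match columns from the contexts instead of re-profiling.
function planMatching(contexts: ColumnContextSketch[]): {
  exact: string[];
  fuzzy: string[];
} {
  return {
    exact: contexts.filter((c) => c.semanticType === "email").map((c) => c.name),
    fuzzy: contexts.filter((c) => c.semanticType === "name").map((c) => c.name),
  };
}

const contexts = confirmDates(
  ["first_name", "last_name", "email", "signup_date"].map(classifyByName),
  new Set(["signup_date"]),
);
const matchPlan = planMatching(contexts);
console.log(matchPlan);
```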
|
|
124
|
+
|
|
125
|
+
## Decisions (adaptive routing)
|
|
126
|
+
|
|
127
|
+
`severityGate`, `piiRouter`, and `rowCountGate` are ported. They are not wired into the default chain — add them to a custom runner / stage that returns their `Decision`.
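
To make the skip / insert / abort vocabulary concrete, here is a minimal sketch of how a decision might be applied to the stages that have not run yet. The `Decision` union, `applyDecision`, and `tinyInputGate` are hypothetical names invented for this example; they do not reproduce goldenpipe's `Decision` type or the logic of `rowCountGate`.

```typescript
// Hypothetical Decision shape; the real goldenpipe type may differ.
type Decision =
  | { kind: "skip"; stages: string[] }
  | { kind: "insert"; stages: string[] }
  | { kind: "abort"; reason: string };

// Apply one decision to the names of the stages still waiting to run.
function applyDecision(remaining: string[], d: Decision): string[] {
  switch (d.kind) {
    case "skip":
      return remaining.filter((name) => !d.stages.includes(name));
    case "insert":
      return [...d.stages, ...remaining];
    case "abort":
      return [];
  }
}

// A made-up gate keyed on row count: skip dedupe when there is too little data.
function tinyInputGate(rowCount: number, minRows: number): Decision | null {
  return rowCount < minRows
    ? { kind: "skip", stages: ["goldenmatch.dedupe"] }
    : null;
}

const remaining = ["goldenflow.transform", "goldenmatch.dedupe"];
const decision = tinyInputGate(1, 2);
const next = decision ? applyDecision(remaining, decision) : remaining;
console.log(next);
```

Keeping the gate a pure function that returns a decision, and letting a separate step rewrite the plan, means gates can be unit-tested without running a pipeline.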
|
|
128
|
+
|
|
129
|
+
> **TS sibling skew:** GoldenCheck-JS `Finding.severity` is a numeric enum (INFO/WARNING/ERROR) with no `"critical"` level, and there is no `"pii_detection"` check. So `severityGate` and `piiRouter` are effectively no-ops against current GoldenCheck-JS output — they exist for structural parity and so custom stages emitting those findings still route.
|
|
130
|
+
|
|
131
|
+
## Deferred (not in this v1 port)
|
|
132
|
+
|
|
133
|
+
- **`identity_resolve` stage** — GoldenMatch-JS Identity Graph wiring through the pipeline. The edge-safe `InMemoryIdentityStore` exists in `goldenmatch`, but the pipeline-driven `resolveClusters` population is not yet exposed.
|
|
134
|
+
- **`infer_schema` stage** — InferMap-based schema inference is not ported.
|
|
135
|
+
- **Servers/TUI** — the FastAPI REST API, A2A agent server, MCP server, and Textual TUI from the Python CLI are not ported.
|
|
136
|
+
|
|
137
|
+
### Sibling version-skew artifacts
|
|
138
|
+
|
|
139
|
+
The TS siblings are version-skewed from the Python ones, so some artifacts the Python pipeline surfaces are shaped differently or absent here:
|
|
140
|
+
|
|
141
|
+
- `golden` artifact maps to GoldenMatch-JS `DedupeResult.goldenRecords` (the Python sibling exposes `.golden`).
|
|
142
|
+
- `scored_pairs` is GoldenMatch-JS `result.scoredPairs` (camelCase).
|
|
143
|
+
- `matchkey_used` is derived from the *built config's* first matchkey — the JS `DedupeResult` does not carry the resolved matchkey list back (the Python result does after auto-config).
|
|
144
|
+
- The Python `goldencheck.scan` adapter calls `scan_file(path)`, so the in-memory `run_df` path fails that stage. GoldenCheck-JS's `scanData` operates on rows, so the TS adapter's scan **succeeds** in both the in-memory (`runDf`) and file (`run`) paths.
|
|
145
|
+
|
|
146
|
+
## Cross-language parity
|
|
147
|
+
|
|
148
|
+
`tests/parity/pipe-parity.test.ts` asserts skew-robust invariants (`status`, `input_rows`, ordered per-stage status/skip sequence, final `golden`/`unique` counts) against Python-generated goldens in `tests/fixtures/pipe_parity.json`. Regenerate the goldens with:
|
|
149
|
+
|
|
150
|
+
```bash
|
|
151
|
+
uv run --project packages/python/goldenpipe python \
|
|
152
|
+
packages/python/goldenpipe/scripts/emit_ts_parity_fixtures.py
|
|
153
|
+
```
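
A comparison over skew-robust invariants could be sketched as below. The `ParitySnapshot` shape is an assumption: only `status` and `input_rows` are named in this README, while `stage_statuses`, `golden_count`, and `unique_count` are hypothetical field names standing in for the per-stage sequence and final counts. The actual layout of `pipe_parity.json` may differ.

```typescript
// Hypothetical snapshot of the invariants compared across languages.
interface ParitySnapshot {
  status: string;
  input_rows: number;
  stage_statuses: string[]; // ordered, one entry per stage
  golden_count: number;
  unique_count: number;
}

// Return a list of human-readable mismatches; an empty list means parity.
function diffInvariants(expected: ParitySnapshot, actual: ParitySnapshot): string[] {
  const problems: string[] = [];
  if (expected.status !== actual.status) {
    problems.push(`status: ${expected.status} != ${actual.status}`);
  }
  if (expected.input_rows !== actual.input_rows) {
    problems.push(`input_rows: ${expected.input_rows} != ${actual.input_rows}`);
  }
  if (expected.stage_statuses.join("|") !== actual.stage_statuses.join("|")) {
    problems.push("stage status sequence differs");
  }
  if (expected.golden_count !== actual.golden_count) {
    problems.push(`golden: ${expected.golden_count} != ${actual.golden_count}`);
  }
  if (expected.unique_count !== actual.unique_count) {
    problems.push(`unique: ${expected.unique_count} != ${actual.unique_count}`);
  }
  return problems;
}

const pythonGolden: ParitySnapshot = {
  status: "success",
  input_rows: 3,
  stage_statuses: ["success", "success", "success", "success"],
  golden_count: 1,
  unique_count: 1,
};
const tsRun: ParitySnapshot = { ...pythonGolden, unique_count: 2 };
const mismatches = diffInvariants(pythonGolden, tsRun);
console.log(mismatches);
```

Comparing counts and status sequences rather than full records is what keeps such a check stable when the sibling packages differ in field naming or record shape.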
|
|
154
|
+
|
|
155
|
+
## License
|
|
156
|
+
|
|
157
|
+
MIT
|