goldenpipe 0.1.0

package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Ben Severn
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,157 @@
1
+ # goldenpipe
2
+
3
+ Golden Suite orchestrator for TypeScript — chains **GoldenCheck → GoldenFlow → GoldenMatch** into one adaptive, pluggable pipeline. TypeScript port of the [`goldenpipe`](https://github.com/benseverndev-oss/goldenmatch/tree/main/packages/python/goldenpipe) Python library.
4
+
5
+ It composes the edge-safe cores of the three sibling packages:
6
+
7
+ - [`goldencheck`](https://www.npmjs.com/package/goldencheck) — data-quality scan (`scanData`)
8
+ - [`goldenflow`](https://www.npmjs.com/package/goldenflow) — transforms / standardization (`TransformEngine`)
9
+ - [`goldenmatch`](https://www.npmjs.com/package/goldenmatch) — dedupe / entity resolution (`dedupe`)
10
+
11
+ Data flows through the pipeline as `Row[]` (arrays of plain objects).
12
+
13
+ ## Install
14
+
15
+ ```bash
16
+ npm install goldenpipe
17
+ # the three siblings come along as dependencies
18
+ ```
19
+
20
+ `yaml` is an optional peer dependency, needed only for YAML config loading:
21
+
22
+ ```bash
23
+ npm install yaml
24
+ ```
25
+
26
+ ## Quick start
27
+
28
+ ```ts
29
+ import { runDf } from "goldenpipe";
30
+
31
+ const rows = [
32
+ { first_name: "John", last_name: "Smith", email: "john@example.com" },
33
+ { first_name: "Jon", last_name: "Smith", email: "john@example.com" },
34
+ { first_name: "Jane", last_name: "Doe", email: "jane@example.com" },
35
+ ];
36
+
37
+ // Zero-config: runs goldencheck.scan -> goldenflow.transform -> goldenmatch.dedupe
38
+ const result = await runDf(rows);
39
+
40
+ console.log(result.status); // "success"
41
+ console.log(result.inputRows); // 3
42
+ console.log(result.artifacts.golden); // golden (canonical) records
43
+ console.log(result.artifacts.unique); // distinct records
44
+ ```
45
+
46
+ > **Async:** the runner is async because GoldenMatch's `dedupe` is async. `runDf`, `runStages`, `Pipeline.run`, and the node `run(source)` all return promises.
47
+
48
+ ### From a CSV file (Node)
49
+
50
+ ```ts
51
+ import { run } from "goldenpipe/node";
52
+
53
+ const result = await run("people.csv"); // zero-config
54
+ const result2 = await run("people.csv", { config: "pipeline.yml" });
55
+ ```
56
+
57
+ ### Custom pipeline config
58
+
59
+ ```ts
60
+ import { runDf, makePipelineConfig, makeStageSpec } from "goldenpipe";
61
+
62
+ const config = makePipelineConfig({
63
+ pipeline: "check-and-dedupe",
64
+ stages: [
65
+ "goldencheck.scan",
66
+ makeStageSpec({ use: "goldenmatch.dedupe", config: { threshold: 0.9 } }),
67
+ // omit goldenflow.transform to skip transformation
68
+ ],
69
+ });
70
+
71
+ const result = await runDf(rows, config);
72
+ ```
73
+
74
+ ### Programmatic stages
75
+
76
+ ```ts
77
+ import { runStages, stage, StageStatus } from "goldenpipe";
78
+
79
+ const myStage = stage(
80
+ { name: "tagger", produces: ["tag"], consumes: ["df"] },
81
+ (ctx) => {
82
+ ctx.artifacts.tag = (ctx.df ?? []).length;
83
+ return { status: StageStatus.SUCCESS };
84
+ },
85
+ );
86
+
87
+ const result = await runStages([myStage], rows);
88
+ ```
89
+
90
+ ## CLI
91
+
92
+ ```bash
93
+ goldenpipe-js run people.csv [-c pipeline.yml] [-v] # run the chain on a CSV
94
+ goldenpipe-js stages # list registered stages
95
+ goldenpipe-js validate -c pipeline.yml # dry-run wiring validation
96
+ goldenpipe-js init [-d .] # scaffold a goldenpipe.yml
97
+ ```
98
+
99
+ ## Architecture
100
+
101
+ ```mermaid
102
+ flowchart LR
103
+ L[load] --> C[goldencheck.scan]
104
+ C --> F[goldenflow.transform]
105
+ F --> M[goldenmatch.dedupe]
106
+ ```
107
+
108
+ | Stage | Wraps | Produces |
109
+ |-------|-------|----------|
110
+ | `load` | built-in | `df` |
111
+ | `goldencheck.scan` | `scanData(TabularData)` | `findings`, `profile`, `column_contexts` |
112
+ | `goldenflow.transform` | `new TransformEngine(cfg).transformDf(rows)` | `df`, `manifest` |
113
+ | `goldenmatch.dedupe` | `await dedupe(rows, { config })` | `clusters`, `golden`, `unique`, `dupes`, `match_stats`, `scored_pairs` |
114
+
115
+ The engine layer mirrors the Python design:
116
+
117
+ - **registry** — a STATIC registry (`buildDefaultRegistry()`) replacing Python's entry-point discovery.
118
+ - **resolver** — builds an `ExecutionPlan`, auto-prepends `load`, validates `consumes`/`produces` wiring.
119
+ - **router** — applies a stage's `Decision` (skip / insert / abort) to the remaining plan.
120
+ - **runner** — async stage execution with per-stage error handling + `skipIf` gating.
121
+ - **reporter** — assembles the `PipeResult` (status, stages, artifacts, errors, reasoning, timing).
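
The resolver step described above can be sketched roughly as follows. This is a minimal illustration, not goldenpipe's actual implementation: `StageInfo` and `resolvePlan` are hypothetical names, and the real `ExecutionPlan` type likely carries more than a list of stage names.

```ts
// Hypothetical shape for illustration; the real goldenpipe stage metadata may differ.
interface StageInfo {
  name: string;
  consumes: string[];
  produces: string[];
}

// Prepend `load` if it is missing, then check that every artifact a stage
// consumes was produced by some earlier stage in the plan.
function resolvePlan(stages: StageInfo[]): { plan: string[]; errors: string[] } {
  const load: StageInfo = { name: "load", consumes: [], produces: ["df"] };
  const ordered = stages.some((s) => s.name === "load") ? stages : [load, ...stages];

  const available = new Set<string>();
  const errors: string[] = [];
  for (const s of ordered) {
    for (const need of s.consumes) {
      if (!available.has(need)) {
        errors.push(`stage "${s.name}" consumes "${need}" but no earlier stage produces it`);
      }
    }
    for (const out of s.produces) available.add(out);
  }
  return { plan: ordered.map((s) => s.name), errors };
}

// A chain whose inputs are all satisfied once `load` is prepended.
const ok = resolvePlan([
  { name: "goldencheck.scan", consumes: ["df"], produces: ["findings"] },
  { name: "goldenmatch.dedupe", consumes: ["df"], produces: ["golden", "unique"] },
]);

// A stage that needs `findings` without any scan stage before it.
const bad = resolvePlan([{ name: "tagger", consumes: ["findings"], produces: ["tag"] }]);
```

In this sketch a wiring problem is collected as an error string rather than thrown, so a dry-run command such as `validate` could report every problem at once. Whether the real resolver throws or collects is not stated here.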
122
+
123
+ A **column-context pipeline** carries semantic metadata across stages: GoldenCheck builds `ColumnContext`s (name-regex classification + IQR cardinality banding + identifier inference), GoldenFlow enriches them (date transforms confirm date type), and GoldenMatch consumes them to build a targeted dedupe config (`buildConfigFromContexts`) instead of re-profiling.
124
+
125
+ ## Decisions (adaptive routing)
126
+
127
+ `severityGate`, `piiRouter`, and `rowCountGate` are ported but are not wired into the default chain. To use them, call them from a custom runner or stage that returns their `Decision`.
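
To make the routing idea concrete, here is a minimal sketch of how a skip / insert / abort `Decision` might be applied to the remaining plan. The `Decision` shape, `applyDecision`, and `tinyInputGate` are all assumptions made for illustration; they are not the signatures of the ported gates.

```ts
// Hypothetical Decision shape; the field names are assumptions, not goldenpipe's API.
type Decision =
  | { kind: "skip"; stages: string[] }
  | { kind: "insert"; stages: string[] }
  | { kind: "abort"; reason: string };

// Apply one decision to the list of stage names that have not run yet.
function applyDecision(remaining: string[], d: Decision): string[] {
  switch (d.kind) {
    case "skip":
      // Drop the named stages, keep the rest in order.
      return remaining.filter((name) => !d.stages.includes(name));
    case "insert":
      // Run the inserted stages next, before everything already planned.
      return [...d.stages, ...remaining];
    case "abort":
      // Nothing else runs.
      return [];
  }
}

// Hypothetical gate in the spirit of a row-count check: skip dedupe on tiny inputs.
function tinyInputGate(rowCount: number, minRows: number): Decision | null {
  return rowCount < minRows ? { kind: "skip", stages: ["goldenmatch.dedupe"] } : null;
}

const remaining = ["goldenflow.transform", "goldenmatch.dedupe"];
const decision = tinyInputGate(1, 2);
const next = decision ? applyDecision(remaining, decision) : remaining;
```

A custom stage would compute such a decision from `ctx` and hand it back to the runner. How a stage result actually carries a `Decision` in goldenpipe is not shown in this README, so check the package types before relying on this shape.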
128
+
129
+ > **TS sibling skew:** GoldenCheck-JS `Finding.severity` is a numeric enum (INFO/WARNING/ERROR) with no `"critical"` level, and there is no `"pii_detection"` check. So `severityGate` and `piiRouter` are effectively no-ops against current GoldenCheck-JS output — they exist for structural parity and so custom stages emitting those findings still route.
130
+
131
+ ## Deferred (not in this v1 port)
132
+
133
+ - **`identity_resolve` stage** — GoldenMatch-JS Identity Graph wiring through the pipeline. The edge-safe `InMemoryIdentityStore` exists in `goldenmatch`, but the pipeline-driven `resolveClusters` population is not yet exposed.
134
+ - **`infer_schema` stage** — InferMap-based schema inference is not ported.
135
+ - **Servers/TUI** — the FastAPI REST API, A2A agent server, MCP server, and Textual TUI from the Python CLI are not ported.
136
+
137
+ ### Sibling version-skew artifacts
138
+
139
+ The TS siblings are version-skewed from the Python ones, so some artifacts the Python pipeline surfaces are shaped differently or absent here:
140
+
141
+ - `golden` artifact maps to GoldenMatch-JS `DedupeResult.goldenRecords` (the Python sibling exposes `.golden`).
142
+ - `scored_pairs` is GoldenMatch-JS `result.scoredPairs` (camelCase).
143
+ - `matchkey_used` is derived from the *built config's* first matchkey — the JS `DedupeResult` does not carry the resolved matchkey list back (the Python result does after auto-config).
144
+ - The Python `goldencheck.scan` adapter calls `scan_file(path)`, so the in-memory `run_df` path fails that stage. GoldenCheck-JS's `scanData` operates on rows, so the TS adapter's scan **succeeds** in both the in-memory (`runDf`) and file (`run`) paths.
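
The first three mappings above could look roughly like this inside the dedupe adapter. It is a sketch under stated assumptions: `DedupeResultLike` lists only the two fields named in this section, and `BuiltConfigLike` with a `matchkeys[].name` field is a guess at the built config's shape, not a documented type.

```ts
type Row = Record<string, unknown>;

// Assumed subset of GoldenMatch-JS's DedupeResult: only the fields named above.
interface DedupeResultLike {
  goldenRecords: Row[];
  scoredPairs: unknown[];
}

// Hypothetical config shape; the README only says the first matchkey of the
// built config is used, not how matchkeys are represented.
interface BuiltConfigLike {
  matchkeys: { name: string }[];
}

// Translate the JS result into the snake_case artifact names the pipeline exposes.
function toArtifacts(result: DedupeResultLike, config: BuiltConfigLike) {
  return {
    golden: result.goldenRecords,
    scored_pairs: result.scoredPairs,
    // Derived from the config because the result does not carry matchkeys back.
    matchkey_used: config.matchkeys[0]?.name ?? null,
  };
}

const artifacts = toArtifacts(
  { goldenRecords: [{ email: "john@example.com" }], scoredPairs: [] },
  { matchkeys: [{ name: "email_exact" }] },
);
```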
145
+
146
+ ## Cross-language parity
147
+
148
+ `tests/parity/pipe-parity.test.ts` asserts skew-robust invariants (`status`, `input_rows`, ordered per-stage status/skip sequence, final `golden`/`unique` counts) against Python-generated goldens in `tests/fixtures/pipe_parity.json`. Regenerate the goldens with:
149
+
150
+ ```bash
151
+ uv run --project packages/python/goldenpipe python \
152
+ packages/python/goldenpipe/scripts/emit_ts_parity_fixtures.py
153
+ ```
154
+
155
+ ## License
156
+
157
+ MIT