datacontract-x 0.1.0__tar.gz

Files changed (41)
  1. datacontract_x-0.1.0/LICENSE +21 -0
  2. datacontract_x-0.1.0/PKG-INFO +328 -0
  3. datacontract_x-0.1.0/README.md +290 -0
  4. datacontract_x-0.1.0/datacontract_x.egg-info/PKG-INFO +328 -0
  5. datacontract_x-0.1.0/datacontract_x.egg-info/SOURCES.txt +39 -0
  6. datacontract_x-0.1.0/datacontract_x.egg-info/dependency_links.txt +1 -0
  7. datacontract_x-0.1.0/datacontract_x.egg-info/entry_points.txt +2 -0
  8. datacontract_x-0.1.0/datacontract_x.egg-info/requires.txt +13 -0
  9. datacontract_x-0.1.0/datacontract_x.egg-info/top_level.txt +1 -0
  10. datacontract_x-0.1.0/dcx/__init__.py +10 -0
  11. datacontract_x-0.1.0/dcx/api.py +1277 -0
  12. datacontract_x-0.1.0/dcx/apply/__init__.py +10 -0
  13. datacontract_x-0.1.0/dcx/apply/snowflake.py +623 -0
  14. datacontract_x-0.1.0/dcx/cli.py +129 -0
  15. datacontract_x-0.1.0/dcx/enrich/__init__.py +41 -0
  16. datacontract_x-0.1.0/dcx/enrich/all.py +151 -0
  17. datacontract_x-0.1.0/dcx/enrich/base.py +216 -0
  18. datacontract_x-0.1.0/dcx/enrich/columns.py +429 -0
  19. datacontract_x-0.1.0/dcx/enrich/quality.py +395 -0
  20. datacontract_x-0.1.0/dcx/enrich/tags.py +482 -0
  21. datacontract_x-0.1.0/dcx/exporters/__init__.py +6 -0
  22. datacontract_x-0.1.0/dcx/exporters/command.py +111 -0
  23. datacontract_x-0.1.0/dcx/exporters/snowflake.py +523 -0
  24. datacontract_x-0.1.0/dcx/import_commands.py +163 -0
  25. datacontract_x-0.1.0/dcx/importers/__init__.py +7 -0
  26. datacontract_x-0.1.0/dcx/importers/kafka.py +143 -0
  27. datacontract_x-0.1.0/dcx/importers/registry.py +163 -0
  28. datacontract_x-0.1.0/dcx/importers/snowflake.py +487 -0
  29. datacontract_x-0.1.0/dcx/serve.py +33 -0
  30. datacontract_x-0.1.0/dcx/target/__init__.py +10 -0
  31. datacontract_x-0.1.0/dcx/target/command.py +1257 -0
  32. datacontract_x-0.1.0/pyproject.toml +74 -0
  33. datacontract_x-0.1.0/setup.cfg +4 -0
  34. datacontract_x-0.1.0/tests/test_api.py +567 -0
  35. datacontract_x-0.1.0/tests/test_apply.py +597 -0
  36. datacontract_x-0.1.0/tests/test_enrich.py +1146 -0
  37. datacontract_x-0.1.0/tests/test_kafka_import.py +144 -0
  38. datacontract_x-0.1.0/tests/test_smoke.py +19 -0
  39. datacontract_x-0.1.0/tests/test_snowflake_export.py +616 -0
  40. datacontract_x-0.1.0/tests/test_snowflake_import.py +515 -0
  41. datacontract_x-0.1.0/tests/test_target.py +477 -0
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 dcx contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,328 @@
1
+ Metadata-Version: 2.4
2
+ Name: datacontract-x
3
+ Version: 0.1.0
4
+ Summary: Data Contract eXtended — AI-native, platform-extensible data contracts: LLM enrichment (descriptions, tags, data quality), live import, and apply. Built on datacontract-cli.
5
+ Author: MickaelBZH
6
+ License-Expression: MIT
7
+ Project-URL: Homepage, https://github.com/MickaelBZH/data-contract-x
8
+ Project-URL: Repository, https://github.com/MickaelBZH/data-contract-x
9
+ Project-URL: Issues, https://github.com/MickaelBZH/data-contract-x/issues
10
+ Keywords: data-contract,odcs,data-quality,data-governance,data-catalog,llm,ai,datacontract-cli,metadata,tagging,snowflake
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Environment :: Console
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Intended Audience :: Information Technology
15
+ Classifier: Operating System :: OS Independent
16
+ Classifier: Programming Language :: Python :: 3 :: Only
17
+ Classifier: Programming Language :: Python :: 3.10
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: Topic :: Database
21
+ Classifier: Topic :: Software Development :: Quality Assurance
22
+ Requires-Python: <3.13,>=3.10
23
+ Description-Content-Type: text/markdown
24
+ License-File: LICENSE
25
+ Requires-Dist: datacontract-cli<0.13,>=0.12.4
26
+ Requires-Dist: typer<0.26,>=0.18.0
27
+ Requires-Dist: open-data-contract-standard<4.0.0,>=3.1.2
28
+ Requires-Dist: fastapi<0.137.0,>=0.115.0
29
+ Requires-Dist: uvicorn<1.0,>=0.30
30
+ Requires-Dist: pyyaml<7.0,>=6.0
31
+ Requires-Dist: snowflake-connector-python<4.0,>=3.0
32
+ Requires-Dist: litellm<2.0,>=1.50
33
+ Provides-Extra: dev
34
+ Requires-Dist: pytest>=8.0; extra == "dev"
35
+ Requires-Dist: ruff>=0.6; extra == "dev"
36
+ Requires-Dist: httpx<1.0,>=0.27; extra == "dev"
37
+ Dynamic: license-file
38
+
39
+ <p align="center">
40
+ <img src="assets/logo.svg" alt="dcx — Data Contract eXtended" width="520">
41
+ </p>
42
+
43
+ <h3 align="center">Data Contract e<strong>X</strong>tended — AI-native, platform-extensible data contracts</h3>
44
+
45
+ <p align="center">
46
+ Author data contracts with an LLM, sync them with your live platforms.<br>
47
+ A lean, no-fork extension of <a href="https://github.com/datacontract/datacontract-cli">datacontract-cli</a>, built on the <a href="https://bitol.io/">Open Data Contract Standard (ODCS)</a>.
48
+ </p>
49
+
50
+ <p align="center">
51
+ <img alt="PyPI" src="https://img.shields.io/pypi/v/datacontract-x?color=6366F1&label=pypi">
52
+ <img alt="Python" src="https://img.shields.io/badge/python-3.10%20|%203.11%20|%203.12-3776AB?logo=python&logoColor=white">
53
+ <img alt="License: MIT" src="https://img.shields.io/badge/license-MIT-22c55e">
54
+ <img alt="ODCS" src="https://img.shields.io/badge/ODCS-v3.1-0EA5E9">
55
+ <img alt="Built on datacontract-cli" src="https://img.shields.io/badge/built%20on-datacontract--cli-6366F1">
56
+ </p>
57
+
58
+ ---
59
+
60
+ ## What is dcx?
61
+
62
+ **dcx (Data Contract eXtended)** adds three things to the Open Data Contract Standard workflow that plain datacontract-cli doesn't do:
63
+
64
+ 1. **AI authoring** — use an LLM to enrich a contract with column descriptions, validation constraints, governance **tags** from your own catalog, and an executable **data-quality** suite.
65
+ 2. **Live import** — build a contract *from* a running system (its real columns, keys, comments, tags).
66
+ 3. **Apply** — push the contract's governance *back* to the platform (comments, tags, data-quality, and the table itself).
67
+
68
+ It's **platform-extensible by design**: each platform is a small importer / exporter / apply module that plugs into datacontract-cli's factories. **Snowflake is the first end-to-end platform** (import → enrich → apply), with Kafka import today and more platforms built to slot in the same way.
69
+
70
+ The pipeline is: **import** a live schema into an ODCS contract → **enrich** it (columns · tags · quality) → **apply** it back to the platform, or **export** it to SQL / docs / schemas. Everything is available both as a **CLI** and as a **REST API** (`dcx api`).
71
+
72
+ ## Why dcx?
73
+
74
+ - 🧠 **AI authoring that's safe to ship.** Forced tool-calling, `temperature=0`, and strict server-side validation against the ODCS schema — the model can only produce spec-valid output, never free-form guesses.
75
+ - 🏷️ **A tag *manager*, not a tag guesser.** You define a controlled [tag catalog](#the-tag-catalog) (names, allowed values, examples); the LLM classifies columns into *your* vocabulary, with optional defaults.
76
+ - ✅ **Executable, portable data quality.** Quality rules prefer ODCS `library` metrics (portable, mappable to platform-native checks) and fall back to portable `sql` checks — across all seven ODCS dimensions.
77
+ - 🔌 **Any LLM provider.** Powered by [litellm](https://github.com/BerriAI/litellm) — Anthropic, OpenAI, Azure, Bedrock, Gemini, Ollama, … behind one `--model` flag.
78
+ - 🧩 **Pluggable platforms, no fork.** You keep all 30+ upstream importers/exporters and `lint` / `test` / `changelog`, and gain the AI + platform layer on top.
79
+ - 🔐 **Auth that makes sense per surface.** Live platform operations over the API use **caller-supplied OAuth**; secrets are never CLI flags.
80
+
81
+ ## Install
82
+
83
+ ```bash
84
+ pip install datacontract-x
85
+ ```
86
+
87
+ The import package and CLI are both `dcx`:
88
+
89
+ ```bash
90
+ dcx --help
91
+ dcx info
92
+ ```
93
+
94
+ From source (for development):
95
+
96
+ ```bash
97
+ git clone https://github.com/MickaelBZH/data-contract-x.git
98
+ cd data-contract-x
99
+ pip install -e ".[dev]"
100
+ ```
101
+
102
+ > Requires Python 3.10–3.12. Installing pulls in `datacontract-cli`, `litellm`, FastAPI, and the platform connectors automatically.
103
+
104
+ ## Quickstart
105
+
106
+ The full loop — import a live schema, enrich it with an LLM, sync it back. Snowflake here is the example platform.
107
+
108
+ ```bash
109
+ # 1. Import an existing schema into a contract (real columns, PKs, comments, tags)
110
+ dcx import snowflake --database MY_DB --schema LOAD --authenticator externalbrowser --output contract.yaml
111
+
112
+ # 2. Enrich with an LLM: descriptions + constraints + tags + data-quality tests
113
+ export ANTHROPIC_API_KEY=... # or OPENAI_API_KEY / AZURE_API_KEY / ...
114
+ dcx enrich all contract.yaml --catalog tags_catalog.yaml --output contract.enriched.yaml
115
+
116
+ # 3. Preview exactly what will run — no connection needed
117
+ dcx apply snowflake contract.enriched.yaml --include-quality --dry-run
118
+
119
+ # 4. Apply it: creates the table if missing, governs it (comments + tags + DQ) if it exists
120
+ dcx apply snowflake contract.enriched.yaml --include-quality
121
+ ```
122
+
123
+ ---
124
+
125
+ ## Commands
126
+
127
+ Every command is `dcx <command>`, and most are mirrored to a REST endpoint when you run [`dcx api`](#rest-api). Each section below lists the sub-commands, a CLI example, and the matching API call. Run `dcx <command> --help` for the full option list.
128
+
129
+ ### `import` — build a contract from a source
130
+
131
+ | Sub-command | Source |
132
+ |---|---|
133
+ | `dcx import snowflake` | A live Snowflake schema (columns, primary keys, comments, tags) |
134
+ | `dcx import kafka` | A Kafka topic's value schema (Confluent Schema Registry) |
135
+ | `dcx import <format>` | A file/document — `sql`, `avro`, `dbml`, `glue`, `bigquery`, `unity`, `jsonschema`, `json`, `odcs`, `parquet`, `csv`, `protobuf`, `spark`, `iceberg`, `excel`, `dbt` |
136
+
137
+ ```bash
138
+ dcx import snowflake --database MY_DB --schema LOAD --authenticator externalbrowser --output contract.yaml
139
+ dcx import kafka --schema-registry https://sr:8081 --topic orders --output contract.yaml
140
+ dcx import sql --source schema.sql --dialect snowflake --output contract.yaml
141
+ ```
142
+
143
+ **API**
144
+ - `POST /import/snowflake` — live import, authenticated by the caller's Snowflake OAuth token (`Authorization: Bearer <token>`).
145
+ - `POST /import/{format}` — file-based importers; send the document inline as `source_content`.
146
+ - *(Kafka import is CLI-only.)*
147
+
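As a rough illustration of where a topic's value schema lives, the sketch below builds the Confluent Schema Registry lookup URL. It assumes the registry's default TopicNameStrategy, under which the value schema of topic `orders` is registered as subject `orders-value`. The helper name `value_schema_url` is hypothetical, and dcx may resolve subjects differently.

```python
from urllib.parse import quote


def value_schema_url(registry_url: str, topic: str) -> str:
    """Build the URL of the latest value schema for a topic (hypothetical helper).

    Assumption: default TopicNameStrategy, i.e. subject = "<topic>-value".
    """
    # Percent-encode the subject so unusual topic names stay a single path segment.
    subject = quote(f"{topic}-value", safe="")
    return f"{registry_url.rstrip('/')}/subjects/{subject}/versions/latest"


print(value_schema_url("https://sr:8081", "orders"))
# https://sr:8081/subjects/orders-value/versions/latest
```

A `GET` on that URL returns a JSON document whose `schema` field holds the registered schema text.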
148
+ ### `enrich` — AI authoring with an LLM
149
+
150
+ | Sub-command | Adds |
151
+ |---|---|
152
+ | `dcx enrich columns` | Business descriptions, `logicalTypeOptions` constraints, `required` / `unique` flags |
153
+ | `dcx enrich tags` | Governance tags, classified against your [tag catalog](#the-tag-catalog) |
154
+ | `dcx enrich quality` | An executable data-quality suite across all ODCS dimensions |
155
+ | `dcx enrich all` | columns → tags → quality, in that order so each stage grounds the next |
156
+
157
+ Each sub-command is independent and idempotent (existing values are preserved unless you pass `--overwrite`). The provider key is read from the environment — there is no `--api-key` flag. Use `--model` for any litellm model and `--base-url` for a proxy / Azure / Ollama endpoint.
158
+
159
+ ```bash
160
+ dcx enrich columns contract.yaml --output contract.enriched.yaml
161
+ dcx enrich tags contract.yaml --catalog tags_catalog.yaml --output contract.tagged.yaml
162
+ dcx enrich quality contract.yaml --model gpt-4o --output contract.dq.yaml
163
+ dcx enrich all contract.yaml --catalog tags_catalog.yaml --output contract.full.yaml
164
+ ```
165
+
166
+ **API** (the LLM key comes from the *server's* environment)
167
+ - `POST /enrich/columns` · `POST /enrich/quality`
168
+ - `POST /enrich/tags` · `POST /enrich/all` — take the tag catalog inline in the request body.
169
+
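The "preserved unless you pass `--overwrite`" rule described above can be pictured as a simple merge. This is a minimal sketch under stated assumptions, not dcx's code: `merge_enrichment` is a hypothetical helper, and treating `None` and the empty string as "not yet set" is an assumption.

```python
def merge_enrichment(existing: dict, proposed: dict, overwrite: bool = False) -> dict:
    """Merge LLM-proposed fields into a column definition (hypothetical helper)."""
    merged = dict(existing)
    for key, value in proposed.items():
        # Assumption: missing keys, None and "" count as unset; False and 0 count as set.
        already_set = merged.get(key) not in (None, "")
        if overwrite or not already_set:
            merged[key] = value
    return merged


column = {"name": "email", "description": "Customer email address"}
proposed = {"description": "E-mail of the customer", "required": True}

once = merge_enrichment(column, proposed)
twice = merge_enrichment(once, proposed)

print(once["description"])  # Customer email address (the existing value is kept)
print(twice == once)        # True: a second run changes nothing
```

Running the merge twice yields the same result, which is what makes re-running an enrichment step safe.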
170
+ ### `export` — convert a contract to a target format
171
+
172
+ | Sub-command | Output |
173
+ |---|---|
174
+ | `dcx export snowflake-full` | A Snowflake setup script: DDL + tags + Data Metric Functions, in one file |
175
+ | `dcx export <format>` | Any upstream format — `sql`, `jsonschema`, `html`, `markdown`, `mermaid`, `dbt-*`, `avro`, `protobuf`, `bigquery`, `spark`, `sqlalchemy`, `iceberg`, `sodacl`, `great-expectations`, `dbml`, `pydantic-model`, `odcs`, `rdf`, `go`, `excel`, … |
176
+
177
+ `snowflake-full` options: `--include-tags`, `--include-quality`, `--create-tags`, `--tag-namespace DB.SCHEMA`, and `--structured-types` (render nested columns as typed `OBJECT(field type, …)` / `ARRAY(type)`).
178
+
179
+ ```bash
180
+ dcx export snowflake-full contract.yaml --include-quality --create-tags --output setup.sql
181
+ dcx export html contract.yaml --output contract.html
182
+ ```
183
+
184
+ **API**
185
+ - `POST /export/{format}` — including `POST /export/snowflake-full`. The response media type depends on the format (JSON / YAML / text / binary).
186
+
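To make the `--structured-types` rendering concrete, here is a small sketch of how a nested column could be turned into Snowflake's typed `OBJECT(...)` / `ARRAY(...)` syntax. The dictionary shape (`type`, `fields`, `items`) and the function `render_type` are assumptions chosen for illustration. The exporter's real data model is the ODCS contract and may differ.

```python
def render_type(col: dict) -> str:
    """Render a (hypothetical) nested column spec as a Snowflake structured type."""
    kind = col["type"].upper()
    if kind == "OBJECT" and col.get("fields"):
        # Each nested field becomes "name TYPE", rendered recursively.
        inner = ", ".join(f'{f["name"]} {render_type(f)}' for f in col["fields"])
        return f"OBJECT({inner})"
    if kind == "ARRAY" and col.get("items"):
        return f"ARRAY({render_type(col['items'])})"
    # Scalars, and untyped OBJECT / ARRAY columns, are emitted as-is.
    return kind


address = {
    "type": "object",
    "fields": [
        {"name": "city", "type": "varchar"},
        {"name": "phones", "type": "array", "items": {"type": "varchar"}},
    ],
}
print(render_type(address))
# OBJECT(city VARCHAR, phones ARRAY(VARCHAR))
```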
187
+ ### `apply` — push governance to a live platform
188
+
189
+ | Sub-command | Target |
190
+ |---|---|
191
+ | `dcx apply snowflake` | A live Snowflake account |
192
+
193
+ With the default `--ddl-mode auto` you don't need to know whether the table exists: **missing tables are created** (`CREATE TABLE IF NOT EXISTS`) and **existing ones are governed** — column/table comments, tags, and (with `--include-quality`) data-quality metrics. For existing tables, dcx also **compares the live schema to the contract** and reports drift as warnings — or, with `--strict`, an error that aborts before any change (the check uses `DESCRIBE TABLE`, so it needs no active warehouse).
194
+
195
+ | Option | Effect |
196
+ |---|---|
197
+ | `--ddl-mode auto\|always\|never` | create-if-missing-then-govern (default) · always `CREATE TABLE` · govern existing only |
198
+ | `--strict` | fail instead of warn on schema drift |
199
+ | `--structured-types` | typed nested `OBJECT(...)` / `ARRAY(...)` |
200
+ | `--include-quality` · `--create-tags` · `--tag-namespace` | data-metric functions · `CREATE TAG IF NOT EXISTS` · qualify tag refs |
201
+ | `--dry-run` | print the SQL without connecting |
202
+
203
+ ```bash
204
+ dcx apply snowflake contract.yaml --dry-run # preview
205
+ dcx apply snowflake contract.yaml --include-quality # create-or-govern
206
+ ```
207
+
208
+ **API**
209
+ - `POST /apply/snowflake` — authenticated by the caller's Snowflake OAuth token. Supports `dry_run`, `ddl_mode`, `strict`, `structured_types`, … and returns the executed SQL plus any drift `warnings`.
210
+
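The drift check described above (warn by default, abort with `--strict`) can be sketched as a comparison of two column maps. This is an illustrative sketch only: `check_drift` and `SchemaDriftError` are hypothetical names, and which differences count as drift (missing columns, extra columns, type mismatches) is an assumption.

```python
class SchemaDriftError(Exception):
    """Raised in strict mode when the live table differs from the contract."""


def check_drift(contract: dict[str, str], live: dict[str, str], strict: bool = False) -> list[str]:
    """Compare column name -> type maps and return human-readable warnings."""
    warnings = []
    for name, contract_type in contract.items():
        if name not in live:
            warnings.append(f"column {name} is in the contract but not in the table")
        elif live[name].upper() != contract_type.upper():
            warnings.append(f"column {name}: contract type {contract_type}, live type {live[name]}")
    for name in live:
        if name not in contract:
            warnings.append(f"column {name} is in the table but not in the contract")
    if strict and warnings:
        # Abort before any statement is executed.
        raise SchemaDriftError("; ".join(warnings))
    return warnings


contract_cols = {"ID": "NUMBER", "EMAIL": "VARCHAR"}
live_cols = {"ID": "NUMBER", "EMAIL": "TEXT", "LEGACY": "VARCHAR"}
for warning in check_drift(contract_cols, live_cols):
    print("warning:", warning)
```

In practice the `live` map would be filled from the platform's own metadata, for example the rows returned by `DESCRIBE TABLE`.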
211
+ ### `target` — bind a contract to a platform
212
+
213
+ `dcx target <type>` sets the contract's server block and resolves each column's `physicalType` for that platform. ~30 types: `snowflake`, `bigquery`, `databricks`, `postgres`, `redshift`, `mysql`, `sqlserver`, `oracle`, `s3`, `kafka`, `trino`, `athena`, `glue`, `duckdb`, `local`, …
214
+
215
+ ```bash
216
+ dcx target snowflake contract.yaml --output contract.snowflake.yaml
217
+ ```
218
+
219
+ **API**
220
+ - `POST /target/{type}` — one route per supported platform type.
221
+
222
+ ### From datacontract-cli
223
+
224
+ These commands work unchanged — `dcx <command>` behaves exactly like `datacontract <command>`.
225
+
226
+ | Command | Sub-commands | Purpose | API |
227
+ |---|---|---|---|
228
+ | `dcx init` | — | Create an empty data contract | — |
229
+ | `dcx lint` | — | Validate a contract against the ODCS schema | `POST /lint` |
230
+ | `dcx test` | — | Run schema + data-quality tests against a configured server | `POST /test` |
231
+ | `dcx ci` | — | `test` for CI/CD — emits GitHub Actions annotations | — |
232
+ | `dcx changelog` | — | Semantic changelog between two contract versions | `POST /changelog` |
233
+ | `dcx catalog` | — | Render an HTML catalog of many contracts | — |
234
+ | `dcx publish` | — | Publish a contract to Entropy Data | — |
235
+ | `dcx dbt` | `sync` | Sync contracts into a dbt project | — |
236
+
237
+ ### `api` / `info`
238
+
239
+ ```bash
240
+ dcx api --port 4242 # start the REST server (Swagger UI at /docs)
241
+ dcx info # show dcx + datacontract-cli versions (API: GET /info)
242
+ ```
243
+
244
+ ---
245
+
246
+ ## The tag catalog
247
+
248
+ `dcx enrich tags` does **controlled-vocabulary** tagging: instead of letting the model invent tags, you give it a catalog of allowed names and values, and it classifies each column into that vocabulary. The catalog is a small YAML (or JSON) file — the only extra input auto-tagging needs.
249
+
250
+ ```yaml
251
+ # tags_catalog.yaml
252
+ tags:
253
+ - name: DATA_CLASSIFICATION # the tag name (becomes the platform TAG name)
254
+ description: > # tells the model what this tag is for
255
+ Data sensitivity level. Assign exactly one — the highest level that applies.
256
+ multiple: false # false = at most one value per column; true = many
257
+ values:
258
+ - value: PUBLIC # the model may only pick from these values
259
+ description: Non-sensitive data that can be shared freely.
260
+ examples: [country_code, currency, language, product_category] # guide classification
261
+ - value: INTERNAL
262
+ description: Internal business data, not for public release. The default.
263
+ default: true # assigned when the model picks nothing else
264
+ examples: [order_id, status, created_at, loyalty_points]
265
+ - value: CONFIDENTIAL
266
+ description: Personal data or sensitive business data; need-to-know access.
267
+ examples: [full_name, email, phone, home_address, date_of_birth]
268
+ - value: RESTRICTED
269
+ description: Highly sensitive data under legal/regulatory controls (financial, health, credentials, IDs).
270
+ examples: [national_id, passport_number, iban, credit_card_number, health_status]
271
+
272
+ - name: DATA_DOMAIN # you can define several tags
273
+ description: The business domain that owns the column.
274
+ multiple: false
275
+ values:
276
+ - value: CUSTOMER
277
+ examples: [customer_id, email, loyalty_points]
278
+ - value: FINANCE
279
+ examples: [amount, currency, invoice_id, iban]
280
+ ```
281
+
282
+ | Field | Meaning |
283
+ |---|---|
284
+ | `name` | Tag name. Required. Becomes the tag key everywhere downstream. |
285
+ | `description` | What the tag means — given to the model as classification guidance. |
286
+ | `multiple` | `false` (default): at most one value per column. `true`: a column may carry several. |
287
+ | `values[].value` | An allowed value. **The model may only assign values listed here** — anything else is dropped. |
288
+ | `values[].description` | What the value means — strongly improves accuracy. |
289
+ | `values[].examples` | Example column names that fit this value — the model's strongest signal. |
290
+ | `values[].default` | If `true`, assigned to columns the model leaves unclassified for this tag. At most one per tag. |
291
+
292
+ Assigned tags are written on each column as `NAME=VALUE` (e.g. `DATA_CLASSIFICATION=CONFIDENTIAL`) — the convention `export snowflake-full` and `apply snowflake` consume. A worked catalog and example contracts live in [`examples/`](examples/).
293
+
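The catalog rules in the table above (only listed values survive, at most one value unless `multiple` is true, a default when nothing is picked) can be sketched as a post-processing step on the model's proposals. This is a minimal sketch, not dcx's implementation: `resolve_tags` is a hypothetical function, the catalog is inlined as a dict rather than loaded from YAML, and keeping the first proposed value when several are offered for a single-valued tag is an assumption.

```python
def resolve_tags(catalog: dict, proposed: dict[str, list[str]]) -> list[str]:
    """Filter model-proposed values through the catalog; return NAME=VALUE strings."""
    resolved = []
    for tag in catalog["tags"]:
        allowed = [v["value"] for v in tag["values"]]
        # Drop anything the catalog does not list.
        picked = [v for v in proposed.get(tag["name"], []) if v in allowed]
        if not tag.get("multiple", False):
            picked = picked[:1]  # at most one value per column (assumption: keep the first)
        if not picked:
            # Fall back to the value flagged `default: true`, if the tag has one.
            picked = [v["value"] for v in tag["values"] if v.get("default")][:1]
        resolved.extend(f'{tag["name"]}={value}' for value in picked)
    return resolved


catalog = {
    "tags": [
        {
            "name": "DATA_CLASSIFICATION",
            "multiple": False,
            "values": [
                {"value": "PUBLIC"},
                {"value": "INTERNAL", "default": True},
                {"value": "CONFIDENTIAL"},
            ],
        },
        {"name": "DATA_DOMAIN", "values": [{"value": "CUSTOMER"}, {"value": "FINANCE"}]},
    ]
}

# "SECRET" is not in the catalog, so it is dropped and the default applies.
print(resolve_tags(catalog, {"DATA_CLASSIFICATION": ["SECRET"], "DATA_DOMAIN": ["CUSTOMER"]}))
# ['DATA_CLASSIFICATION=INTERNAL', 'DATA_DOMAIN=CUSTOMER']
```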
294
+ ## REST API
295
+
296
+ ```bash
297
+ dcx api --port 4242 # Swagger UI at http://127.0.0.1:4242/docs
298
+ ```
299
+
300
+ Most commands above are mirrored to an endpoint (the per-command sections list which), with request **and** response schemas in the OpenAPI spec. Auth model:
301
+
302
+ - **Live platform operations** (`/import/snowflake`, `/apply/snowflake`) act *as the caller* — the OAuth bearer token comes from the `Authorization` header, so the server never uses ambient credentials for someone else's data.
303
+ - **Enrichment** (`/enrich/*`) uses the **server's** LLM key (from the environment). Put service-level auth/quota in front of it before exposing it publicly.
304
+ - **The CLI never takes secrets as flags** — platform secrets come from env vars or the platform's own config; LLM keys from the provider's standard env var.
305
+
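To illustrate the caller-supplied OAuth model, the sketch below prepares a `POST /import/snowflake` request with the token in the `Authorization` header, using only the standard library. The JSON field names (`database`, `schema`) and the helper `build_import_request` are assumptions for illustration. The authoritative request schema is the OpenAPI spec served at `/docs`.

```python
import json
import urllib.request


def build_import_request(base_url: str, oauth_token: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST /import/snowflake call (hypothetical helper)."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/import/snowflake",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # The caller's own Snowflake OAuth token, as described in the auth model.
            "Authorization": f"Bearer {oauth_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Field names in the payload are illustrative guesses, not the documented schema.
req = build_import_request(
    "http://127.0.0.1:4242",
    "<snowflake-oauth-token>",
    {"database": "MY_DB", "schema": "LOAD"},
)
print(req.get_method(), req.full_url)
# POST http://127.0.0.1:4242/import/snowflake
# urllib.request.urlopen(req) would send it to a running `dcx api` server.
```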
306
+ ## How it fits with datacontract-cli
307
+
308
+ dcx is a **separate package that depends on datacontract-cli as a library**, not a fork. It registers new importers (`snowflake`, `kafka`) and the `snowflake-full` exporter into the upstream factories, adds `target` / `enrich` / `apply` sub-apps and live-import commands to the upstream Typer app, and mirrors most commands to FastAPI for `dcx api`. You therefore keep all of upstream's importers, exporters, `lint`, `test`, and `changelog`, and gain the AI + platform layer on top.
309
+
310
+ ## Development
311
+
312
+ ```bash
313
+ pip install -e ".[dev]"
314
+ pytest # 211 tests
315
+ ruff check dcx # lint
316
+ ```
317
+
318
+ Tests never hit live services or real LLMs — platform connections, the Schema Registry, and every LLM call are mocked, so the suite stays fast and offline. See [`RELEASING.md`](RELEASING.md) for the PyPI release process.
319
+
320
+ ## Contributing
321
+
322
+ Issues and PRs welcome. Please run `pytest` and `ruff check dcx` before opening a PR, and add tests for new behavior.
323
+
324
+ ## License
325
+
326
+ [MIT](LICENSE) © MickaelBZH.
327
+
328
+ <p align="center"><sub>Built on <a href="https://github.com/datacontract/datacontract-cli">datacontract-cli</a> · <a href="https://bitol.io/">Open Data Contract Standard</a> · <a href="https://github.com/BerriAI/litellm">litellm</a></sub></p>
@@ -0,0 +1,290 @@
1
+ <p align="center">
2
+ <img src="assets/logo.svg" alt="dcx — Data Contract eXtended" width="520">
3
+ </p>
4
+
5
+ <h3 align="center">Data Contract e<strong>X</strong>tended — AI-native, platform-extensible data contracts</h3>
6
+
7
+ <p align="center">
8
+ Author data contracts with an LLM, sync them with your live platforms.<br>
9
+ A lean, no-fork extension of <a href="https://github.com/datacontract/datacontract-cli">datacontract-cli</a>, built on the <a href="https://bitol.io/">Open Data Contract Standard (ODCS)</a>.
10
+ </p>
11
+
12
+ <p align="center">
13
+ <img alt="PyPI" src="https://img.shields.io/pypi/v/datacontract-x?color=6366F1&label=pypi">
14
+ <img alt="Python" src="https://img.shields.io/badge/python-3.10%20|%203.11%20|%203.12-3776AB?logo=python&logoColor=white">
15
+ <img alt="License: MIT" src="https://img.shields.io/badge/license-MIT-22c55e">
16
+ <img alt="ODCS" src="https://img.shields.io/badge/ODCS-v3.1-0EA5E9">
17
+ <img alt="Built on datacontract-cli" src="https://img.shields.io/badge/built%20on-datacontract--cli-6366F1">
18
+ </p>
19
+
20
+ ---
21
+
22
+ ## What is dcx?
23
+
24
+ **dcx (Data Contract eXtended)** adds three things to the Open Data Contract Standard workflow that plain datacontract-cli doesn't do:
25
+
26
+ 1. **AI authoring** — use an LLM to enrich a contract with column descriptions, validation constraints, governance **tags** from your own catalog, and an executable **data-quality** suite.
27
+ 2. **Live import** — build a contract *from* a running system (its real columns, keys, comments, tags).
28
+ 3. **Apply** — push the contract's governance *back* to the platform (comments, tags, data-quality, and the table itself).
29
+
30
+ It's **platform-extensible by design**: each platform is a small importer / exporter / apply module that plugs into datacontract-cli's factories. **Snowflake is the first end-to-end platform** (import → enrich → apply), with Kafka import today and more platforms built to slot in the same way.
31
+
32
+ The pipeline is: **import** a live schema into an ODCS contract → **enrich** it (columns · tags · quality) → **apply** it back to the platform, or **export** it to SQL / docs / schemas. Everything is available both as a **CLI** and as a **REST API** (`dcx api`).
33
+
34
+ ## Why dcx?
35
+
36
+ - 🧠 **AI authoring that's safe to ship.** Forced tool-calling, `temperature=0`, and strict server-side validation against the ODCS schema — the model can only produce spec-valid output, never free-form guesses.
37
+ - 🏷️ **A tag *manager*, not a tag guesser.** You define a controlled [tag catalog](#the-tag-catalog) (names, allowed values, examples); the LLM classifies columns into *your* vocabulary, with optional defaults.
38
+ - ✅ **Executable, portable data quality.** Quality rules prefer ODCS `library` metrics (portable, mappable to platform-native checks) and fall back to portable `sql` checks — across all seven ODCS dimensions.
39
+ - 🔌 **Any LLM provider.** Powered by [litellm](https://github.com/BerriAI/litellm) — Anthropic, OpenAI, Azure, Bedrock, Gemini, Ollama, … behind one `--model` flag.
40
+ - 🧩 **Pluggable platforms, no fork.** You keep all 30+ upstream importers/exporters and `lint` / `test` / `changelog`, and gain the AI + platform layer on top.
41
+ - 🔐 **Auth that makes sense per surface.** Live platform operations over the API use **caller-supplied OAuth**; secrets are never CLI flags.
42
+
43
+ ## Install
44
+
45
+ ```bash
46
+ pip install datacontract-x
47
+ ```
48
+
49
+ The import package and CLI are both `dcx`:
50
+
51
+ ```bash
52
+ dcx --help
53
+ dcx info
54
+ ```
55
+
56
+ From source (for development):
57
+
58
+ ```bash
59
+ git clone https://github.com/MickaelBZH/data-contract-x.git
60
+ cd data-contract-x
61
+ pip install -e ".[dev]"
62
+ ```
63
+
64
+ > Requires Python 3.10–3.12. Installing pulls in `datacontract-cli`, `litellm`, FastAPI, and the platform connectors automatically.
65
+
66
+ ## Quickstart
67
+
68
+ The full loop — import a live schema, enrich it with an LLM, sync it back. Snowflake here is the example platform.
69
+
70
+ ```bash
71
+ # 1. Import an existing schema into a contract (real columns, PKs, comments, tags)
72
+ dcx import snowflake --database MY_DB --schema LOAD --authenticator externalbrowser --output contract.yaml
73
+
74
+ # 2. Enrich with an LLM: descriptions + constraints + tags + data-quality tests
75
+ export ANTHROPIC_API_KEY=... # or OPENAI_API_KEY / AZURE_API_KEY / ...
76
+ dcx enrich all contract.yaml --catalog tags_catalog.yaml --output contract.enriched.yaml
77
+
78
+ # 3. Preview exactly what will run — no connection needed
79
+ dcx apply snowflake contract.enriched.yaml --include-quality --dry-run
80
+
81
+ # 4. Apply it: creates the table if missing, governs it (comments + tags + DQ) if it exists
82
+ dcx apply snowflake contract.enriched.yaml --include-quality
83
+ ```
84
+
85
+ ---
86
+
87
+ ## Commands
88
+
89
+ Every command is `dcx <command>`, and most are mirrored to a REST endpoint when you run [`dcx api`](#rest-api). Each section below lists the sub-commands, a CLI example, and the matching API call. Run `dcx <command> --help` for the full option list.
90
+
91
+ ### `import` — build a contract from a source
92
+
93
+ | Sub-command | Source |
94
+ |---|---|
95
+ | `dcx import snowflake` | A live Snowflake schema (columns, primary keys, comments, tags) |
96
+ | `dcx import kafka` | A Kafka topic's value schema (Confluent Schema Registry) |
97
+ | `dcx import <format>` | A file/document — `sql`, `avro`, `dbml`, `glue`, `bigquery`, `unity`, `jsonschema`, `json`, `odcs`, `parquet`, `csv`, `protobuf`, `spark`, `iceberg`, `excel`, `dbt` |
98
+
99
+ ```bash
100
+ dcx import snowflake --database MY_DB --schema LOAD --authenticator externalbrowser --output contract.yaml
101
+ dcx import kafka --schema-registry https://sr:8081 --topic orders --output contract.yaml
102
+ dcx import sql --source schema.sql --dialect snowflake --output contract.yaml
103
+ ```
104
+
105
+ **API**
106
+ - `POST /import/snowflake` — live import, authenticated by the caller's Snowflake OAuth token (`Authorization: Bearer <token>`).
107
+ - `POST /import/{format}` — file-based importers; send the document inline as `source_content`.
108
+ - *(Kafka import is CLI-only.)*
109
+
110
+ ### `enrich` — AI authoring with an LLM
111
+
112
+ | Sub-command | Adds |
113
+ |---|---|
114
+ | `dcx enrich columns` | Business descriptions, `logicalTypeOptions` constraints, `required` / `unique` flags |
115
+ | `dcx enrich tags` | Governance tags, classified against your [tag catalog](#the-tag-catalog) |
116
+ | `dcx enrich quality` | An executable data-quality suite across all ODCS dimensions |
117
+ | `dcx enrich all` | columns → tags → quality, in that order so each stage grounds the next |
118
+
119
+ Each sub-command is independent and idempotent (existing values are preserved unless you pass `--overwrite`). The provider key is read from the environment — there is no `--api-key` flag. Use `--model` for any litellm model and `--base-url` for a proxy / Azure / Ollama endpoint.
120
+
121
+ ```bash
122
+ dcx enrich columns contract.yaml --output contract.enriched.yaml
123
+ dcx enrich tags contract.yaml --catalog tags_catalog.yaml --output contract.tagged.yaml
124
+ dcx enrich quality contract.yaml --model gpt-4o --output contract.dq.yaml
125
+ dcx enrich all contract.yaml --catalog tags_catalog.yaml --output contract.full.yaml
126
+ ```
127
+
128
+ **API** (the LLM key comes from the *server's* environment)
129
+ - `POST /enrich/columns` · `POST /enrich/quality`
130
+ - `POST /enrich/tags` · `POST /enrich/all` — take the tag catalog inline in the request body.
131
+
132
+ ### `export` — convert a contract to a target format
133
+
134
+ | Sub-command | Output |
135
+ |---|---|
136
+ | `dcx export snowflake-full` | A Snowflake setup script: DDL + tags + Data Metric Functions, in one file |
137
+ | `dcx export <format>` | Any upstream format — `sql`, `jsonschema`, `html`, `markdown`, `mermaid`, `dbt-*`, `avro`, `protobuf`, `bigquery`, `spark`, `sqlalchemy`, `iceberg`, `sodacl`, `great-expectations`, `dbml`, `pydantic-model`, `odcs`, `rdf`, `go`, `excel`, … |
138
+
139
+ `snowflake-full` options: `--include-tags`, `--include-quality`, `--create-tags`, `--tag-namespace DB.SCHEMA`, and `--structured-types` (render nested columns as typed `OBJECT(field type, …)` / `ARRAY(type)`).
140
+
141
+ ```bash
142
+ dcx export snowflake-full contract.yaml --include-quality --create-tags --output setup.sql
143
+ dcx export html contract.yaml --output contract.html
144
+ ```
145
+
+ **API**
+ - `POST /export/{format}` — including `POST /export/snowflake-full`. The response media type depends on the format (JSON / YAML / text / binary).
+
+ ### `apply` — push governance to a live platform
+
+ | Sub-command | Target |
+ |---|---|
+ | `dcx apply snowflake` | A live Snowflake account |
+
+ With the default `--ddl-mode auto` you don't need to know whether the table exists: **missing tables are created** (`CREATE TABLE IF NOT EXISTS`) and **existing ones are governed** — column/table comments, tags, and (with `--include-quality`) data-quality metrics. For existing tables, dcx also **compares the live schema to the contract** and reports drift as warnings — or, with `--strict`, an error that aborts before any change (the check uses `DESCRIBE TABLE`, so it needs no active warehouse).
+
+ | Option | Effect |
+ |---|---|
+ | `--ddl-mode auto\|always\|never` | create-if-missing-then-govern (default) · always `CREATE TABLE` · govern existing only |
+ | `--strict` | fail instead of warn on schema drift |
+ | `--structured-types` | typed nested `OBJECT(...)` / `ARRAY(...)` |
+ | `--include-quality` · `--create-tags` · `--tag-namespace` | data-metric functions · `CREATE TAG IF NOT EXISTS` · qualify tag refs |
+ | `--dry-run` | print the SQL without connecting |
+
+ ```bash
+ dcx apply snowflake contract.yaml --dry-run # preview
+ dcx apply snowflake contract.yaml --include-quality # create-or-govern
+ ```
+
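The interaction between `--ddl-mode`, `--strict`, and `--include-quality` can be read as a small decision procedure. The following is a sketch of that reading, not the actual dcx code: `plan_apply` and the step names are hypothetical, and skipping a missing table under `never` is an assumption drawn from "govern existing only".

```python
def plan_apply(
    ddl_mode: str,
    table_exists: bool,
    drift: list[str],
    strict: bool = False,
    include_quality: bool = False,
) -> list[str]:
    """Return the ordered steps one table would go through (illustrative only)."""
    if ddl_mode not in {"auto", "always", "never"}:
        raise ValueError(f"unknown ddl mode: {ddl_mode}")
    # Drift is only checked for tables that already exist; strict aborts first.
    if table_exists and drift and strict:
        raise RuntimeError("schema drift: " + "; ".join(drift))

    steps: list[str] = []
    if ddl_mode == "always":
        steps.append("create_table")
    elif ddl_mode == "auto" and not table_exists:
        steps.append("create_table_if_not_exists")
    elif not table_exists:
        return steps  # "never" with a missing table: assumed to be a no-op

    steps += ["comments", "tags"]
    if include_quality:
        steps.append("data_metric_functions")
    return steps


print(plan_apply("auto", table_exists=False, drift=[]))
# → ['create_table_if_not_exists', 'comments', 'tags']
```

Without `strict`, the same drift list would be reported as warnings while governance proceeds.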
+ **API**
+ - `POST /apply/snowflake` — authenticated by the caller's Snowflake OAuth token. Supports `dry_run`, `ddl_mode`, `strict`, `structured_types`, … and returns the executed SQL plus any drift `warnings`.
+
+ ### `target` — bind a contract to a platform
+
+ `dcx target <type>` sets the contract's server block and resolves each column's `physicalType` for that platform. ~30 types: `snowflake`, `bigquery`, `databricks`, `postgres`, `redshift`, `mysql`, `sqlserver`, `oracle`, `s3`, `kafka`, `trino`, `athena`, `glue`, `duckdb`, `local`, …
+
+ ```bash
+ dcx target snowflake contract.yaml --output contract.snowflake.yaml
+ ```
+
+ **API**
+ - `POST /target/{type}` — one route per supported platform type.
+
+ ### From datacontract-cli
+
+ These commands work unchanged — `dcx <command>` behaves exactly like `datacontract <command>`.
+
+ | Command | Sub-commands | Purpose | API |
+ |---|---|---|---|
+ | `dcx init` | — | Create an empty data contract | — |
+ | `dcx lint` | — | Validate a contract against the ODCS schema | `POST /lint` |
+ | `dcx test` | — | Run schema + data-quality tests against a configured server | `POST /test` |
+ | `dcx ci` | — | `test` for CI/CD — emits GitHub Actions annotations | — |
+ | `dcx changelog` | — | Semantic changelog between two contract versions | `POST /changelog` |
+ | `dcx catalog` | — | Render an HTML catalog of many contracts | — |
+ | `dcx publish` | — | Publish a contract to Entropy Data | — |
+ | `dcx dbt` | `sync` | Sync contracts into a dbt project | — |
+
+ ### `api` / `info`
+
+ ```bash
+ dcx api --port 4242 # start the REST server (Swagger UI at /docs)
+ dcx info # show dcx + datacontract-cli versions (API: GET /info)
+ ```
+
+ ---
+
+ ## The tag catalog
+
+ `dcx enrich tags` does **controlled-vocabulary** tagging: instead of letting the model invent tags, you give it a catalog of allowed names and values, and it classifies each column into that vocabulary. The catalog is a small YAML (or JSON) file — the only extra input auto-tagging needs.
+
+ ```yaml
+ # tags_catalog.yaml
+ tags:
+   - name: DATA_CLASSIFICATION        # the tag name (becomes the platform TAG name)
+     description: >                   # tells the model what this tag is for
+       Data sensitivity level. Assign exactly one — the highest level that applies.
+     multiple: false                  # false = at most one value per column; true = many
+     values:
+       - value: PUBLIC                # the model may only pick from these values
+         description: Non-sensitive data that can be shared freely.
+         examples: [country_code, currency, language, product_category]   # guide classification
+       - value: INTERNAL
+         description: Internal business data, not for public release. The default.
+         default: true                # assigned when the model picks nothing else
+         examples: [order_id, status, created_at, loyalty_points]
+       - value: CONFIDENTIAL
+         description: Personal data or sensitive business data; need-to-know access.
+         examples: [full_name, email, phone, home_address, date_of_birth]
+       - value: RESTRICTED
+         description: Highly sensitive data under legal/regulatory controls (financial, health, credentials, IDs).
+         examples: [national_id, passport_number, iban, credit_card_number, health_status]
+
+   - name: DATA_DOMAIN                # you can define several tags
+     description: The business domain that owns the column.
+     multiple: false
+     values:
+       - value: CUSTOMER
+         examples: [customer_id, email, loyalty_points]
+       - value: FINANCE
+         examples: [amount, currency, invoice_id, iban]
+ ```
+
+ | Field | Meaning |
+ |---|---|
+ | `name` | Tag name. Required. Becomes the tag key everywhere downstream. |
+ | `description` | What the tag means — given to the model as classification guidance. |
+ | `multiple` | `false` (default): at most one value per column. `true`: a column may carry several. |
+ | `values[].value` | An allowed value. **The model may only assign values listed here** — anything else is dropped. |
+ | `values[].description` | What the value means — strongly improves accuracy. |
+ | `values[].examples` | Example column names that fit this value — the model's strongest signal. |
+ | `values[].default` | If `true`, assigned to columns the model leaves unclassified for this tag. At most one per tag. |
+
+ Assigned tags are written on each column as `NAME=VALUE` (e.g. `DATA_CLASSIFICATION=CONFIDENTIAL`) — the convention `export snowflake-full` and `apply snowflake` consume. A worked catalog and example contracts live in [`examples/`](examples/).
+
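The catalog rules described above (unknown values are dropped, `default` fills gaps, `multiple: false` keeps a single value) can be sketched in a few lines. This is a hedged illustration rather than the dcx implementation: `resolve_tags` is a hypothetical helper, the catalog is inlined as a dict instead of being loaded from YAML, and keeping the *first* proposed value when `multiple` is false is an assumption.

```python
catalog = {
    "tags": [
        {
            "name": "DATA_CLASSIFICATION",
            "multiple": False,
            "values": [
                {"value": "PUBLIC"},
                {"value": "INTERNAL", "default": True},
                {"value": "CONFIDENTIAL"},
                {"value": "RESTRICTED"},
            ],
        },
        {
            "name": "DATA_DOMAIN",
            "multiple": False,
            "values": [{"value": "CUSTOMER"}, {"value": "FINANCE"}],
        },
    ]
}


def resolve_tags(catalog: dict, proposed: dict[str, list[str]]) -> list[str]:
    """Turn raw model proposals for one column into NAME=VALUE strings."""
    resolved: list[str] = []
    for tag in catalog["tags"]:
        allowed = [v["value"] for v in tag["values"]]
        # Anything outside the controlled vocabulary is dropped.
        picked = [v for v in proposed.get(tag["name"], []) if v in allowed]
        if not picked:
            # Fall back to the tag's default value, if it declares one.
            picked = [v["value"] for v in tag["values"] if v.get("default")]
        if not tag.get("multiple", False):
            picked = picked[:1]
        resolved += [f"{tag['name']}={value}" for value in picked]
    return resolved


tags = resolve_tags(
    catalog,
    {"DATA_CLASSIFICATION": ["SECRET"], "DATA_DOMAIN": ["CUSTOMER", "FINANCE"]},
)
print(tags)
# → ['DATA_CLASSIFICATION=INTERNAL', 'DATA_DOMAIN=CUSTOMER']
```

Here `SECRET` is not in the catalog, so the column falls back to the `INTERNAL` default, and the second `DATA_DOMAIN` proposal is discarded because the tag allows only one value.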
+ ## REST API
+
+ ```bash
+ dcx api --port 4242 # Swagger UI at http://127.0.0.1:4242/docs
+ ```
+
+ Every command above is mirrored to an endpoint, with request **and** response schemas in the OpenAPI spec. Auth model:
+
+ - **Live platform operations** (`/import/snowflake`, `/apply/snowflake`) act *as the caller* — the OAuth bearer token comes from the `Authorization` header, so the server never uses ambient credentials for someone else's data.
+ - **Enrichment** (`/enrich/*`) uses the **server's** LLM key (from the environment). Put service-level auth/quota in front of it before exposing it publicly.
+ - **The CLI never takes secrets as flags** — platform secrets come from env vars or the platform's own config; LLM keys from the provider's standard env var.
+
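As an illustration of the "act as the caller" model, the sketch below builds (but does not send) a request to `/apply/snowflake` with the caller's OAuth token in the `Authorization` header. The JSON body shape is an assumption: the `contract` key is hypothetical, and although `dry_run`, `ddl_mode`, and `strict` are named in this README, whether they travel in the body or as query parameters should be checked against the OpenAPI spec at `/docs`.

```python
import json
import urllib.request


def build_apply_request(base_url: str, token: str, contract_yaml: str) -> urllib.request.Request:
    """Build a POST /apply/snowflake request that carries the caller's token."""
    payload = {
        "contract": contract_yaml,  # hypothetical field name
        "dry_run": True,            # preview only
        "ddl_mode": "auto",
        "strict": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/apply/snowflake",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_apply_request("http://127.0.0.1:4242", "my-oauth-token", "# contract YAML goes here")
# Sending it would be: urllib.request.urlopen(req)
```

Because the token is supplied per request, the server needs no stored Snowflake credentials of its own for this route.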
+ ## How it fits with datacontract-cli
+
+ dcx is a **separate package that depends on datacontract-cli as a library** — no fork. It registers new importers (`snowflake`, `kafka`) and the `snowflake-full` exporter into the upstream factories, adds `target` / `enrich` / `apply` sub-apps and live-import commands to the upstream Typer app, and mirrors every command to FastAPI for `dcx api`. So you keep all of upstream's importers, exporters, `lint`, `test`, and `changelog`, and gain the AI + platform layer on top.
+
+ ## Development
+
+ ```bash
+ pip install -e ".[dev]"
+ pytest # 211 tests
+ ruff check dcx # lint
+ ```
+
+ Tests never hit live services or real LLMs — platform connections, the Schema Registry, and every LLM call are mocked, so the suite stays fast and offline. See [`RELEASING.md`](RELEASING.md) for the PyPI release process.
+
+ ## Contributing
+
+ Issues and PRs welcome. Please run `pytest` and `ruff check dcx` before opening a PR, and add tests for new behavior.
+
+ ## License
+
+ [MIT](LICENSE) © MickaelBZH.
+
+ <p align="center"><sub>Built on <a href="https://github.com/datacontract/datacontract-cli">datacontract-cli</a> · <a href="https://bitol.io/">Open Data Contract Standard</a> · <a href="https://github.com/BerriAI/litellm">litellm</a></sub></p>