opencode-skills-collection 2.0.0 → 2.0.2
- package/bundled-skills/.antigravity-install-manifest.json +6 -1
- package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
- package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
- package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
- package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
- package/bundled-skills/docs/users/bundles.md +1 -1
- package/bundled-skills/docs/users/claude-code-skills.md +1 -1
- package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
- package/bundled-skills/docs/users/getting-started.md +1 -1
- package/bundled-skills/docs/users/kiro-integration.md +1 -1
- package/bundled-skills/docs/users/usage.md +4 -4
- package/bundled-skills/docs/users/visual-guide.md +4 -4
- package/bundled-skills/manage-skills/SKILL.md +187 -0
- package/bundled-skills/monte-carlo-monitor-creation/SKILL.md +222 -0
- package/bundled-skills/monte-carlo-monitor-creation/references/comparison-monitor.md +426 -0
- package/bundled-skills/monte-carlo-monitor-creation/references/custom-sql-monitor.md +207 -0
- package/bundled-skills/monte-carlo-monitor-creation/references/metric-monitor.md +292 -0
- package/bundled-skills/monte-carlo-monitor-creation/references/table-monitor.md +231 -0
- package/bundled-skills/monte-carlo-monitor-creation/references/validation-monitor.md +404 -0
- package/bundled-skills/monte-carlo-prevent/SKILL.md +252 -0
- package/bundled-skills/monte-carlo-prevent/references/TROUBLESHOOTING.md +23 -0
- package/bundled-skills/monte-carlo-prevent/references/parameters.md +32 -0
- package/bundled-skills/monte-carlo-prevent/references/workflows.md +478 -0
- package/bundled-skills/monte-carlo-push-ingestion/SKILL.md +363 -0
- package/bundled-skills/monte-carlo-push-ingestion/references/anomaly-detection.md +87 -0
- package/bundled-skills/monte-carlo-push-ingestion/references/custom-lineage.md +203 -0
- package/bundled-skills/monte-carlo-push-ingestion/references/direct-http-api.md +207 -0
- package/bundled-skills/monte-carlo-push-ingestion/references/prerequisites.md +150 -0
- package/bundled-skills/monte-carlo-push-ingestion/references/push-lineage.md +160 -0
- package/bundled-skills/monte-carlo-push-ingestion/references/push-metadata.md +158 -0
- package/bundled-skills/monte-carlo-push-ingestion/references/push-query-logs.md +219 -0
- package/bundled-skills/monte-carlo-push-ingestion/references/validation.md +257 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/sample_verify.py +357 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/collect_and_push_lineage.py +70 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/collect_and_push_metadata.py +65 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/collect_and_push_query_logs.py +70 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/collect_lineage.py +214 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/collect_metadata.py +160 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/collect_query_logs.py +164 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/push_lineage.py +198 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/push_metadata.py +193 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/push_query_logs.py +207 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery-iceberg/collect_and_push_metadata.py +71 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery-iceberg/collect_and_push_query_logs.py +64 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery-iceberg/collect_metadata.py +253 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery-iceberg/collect_query_logs.py +149 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery-iceberg/push_metadata.py +190 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery-iceberg/push_query_logs.py +208 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/collect_and_push_lineage.py +83 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/collect_and_push_metadata.py +77 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/collect_and_push_query_logs.py +83 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/collect_lineage.py +240 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/collect_metadata.py +212 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/collect_query_logs.py +204 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/push_lineage.py +192 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/push_metadata.py +178 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/push_query_logs.py +200 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/collect_and_push_lineage.py +119 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/collect_and_push_metadata.py +119 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/collect_and_push_query_logs.py +117 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/collect_lineage.py +265 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/collect_metadata.py +313 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/collect_query_logs.py +284 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/push_lineage.py +309 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/push_metadata.py +245 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/push_query_logs.py +255 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/collect_and_push_lineage.py +78 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/collect_and_push_metadata.py +80 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/collect_and_push_query_logs.py +88 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/collect_lineage.py +235 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/collect_metadata.py +219 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/collect_query_logs.py +239 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/push_lineage.py +178 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/push_metadata.py +178 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/push_query_logs.py +196 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/collect_and_push_lineage.py +154 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/collect_and_push_metadata.py +137 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/collect_and_push_query_logs.py +137 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/collect_lineage.py +349 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/collect_metadata.py +329 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/collect_query_logs.py +254 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/push_lineage.py +307 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/push_metadata.py +228 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/push_query_logs.py +248 -0
- package/bundled-skills/monte-carlo-push-ingestion/scripts/test_template_sdk_usage.py +340 -0
- package/bundled-skills/monte-carlo-validation-notebook/SKILL.md +685 -0
- package/bundled-skills/monte-carlo-validation-notebook/scripts/generate_notebook_url.py +141 -0
- package/bundled-skills/monte-carlo-validation-notebook/scripts/resolve_dbt_schema.py +161 -0
- package/package.json +1 -1
- package/skills_index.json +503 -61
@@ -0,0 +1,207 @@
# Direct HTTP API (without pycarlo)

The `pycarlo` SDK is optional. You can call the push APIs directly over HTTPS from any
language or tool (curl, Postman, etc.) as long as you:
- authenticate with an integration key whose scope is `Ingestion`
- send a JSON body that matches the ingest schema
- send to the correct integration gateway endpoint

## Endpoint

The host is environment-specific:
- **Production**: `https://integrations.getmontecarlo.com`

## Authentication headers

All requests use the same headers:
```
x-mcd-id: <integration-key-id>
x-mcd-token: <integration-key-secret>
Content-Type: application/json
```

## Response

On success, all endpoints return:
```json
{"invocation_id": "<uuid>"}
```

Save the `invocation_id` — it is the primary trace ID for debugging across downstream systems.
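
The header and response conventions above carry over to any HTTP client. As a minimal sketch using only the Python standard library, a request could be assembled as follows. The helper names `build_push_request` and `extract_invocation_id` are hypothetical and are not part of pycarlo; the host, header names, and response shape are the ones listed above.

```python
import json
import urllib.request

# Production host, as listed in the Endpoint section.
BASE_URL = "https://integrations.getmontecarlo.com"


def build_push_request(path, payload, key_id, key_secret):
    """Return an unsent POST request for one push endpoint (hypothetical helper)."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "x-mcd-id": key_id,
            "x-mcd-token": key_secret,
            "Content-Type": "application/json",
        },
    )


def extract_invocation_id(response_body):
    """Read invocation_id from a success response body (str or bytes)."""
    return json.loads(response_body)["invocation_id"]
```

Passing the returned object to `urllib.request.urlopen` would send it. Building the request separately keeps the headers easy to inspect before anything leaves the machine.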

---

## Metadata example

`POST /ingest/v1/metadata`

```bash
curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/metadata" \
  -H "Content-Type: application/json" \
  -H "x-mcd-id: <integration-key-id>" \
  -H "x-mcd-token: <integration-key-secret>" \
  -d '{
    "event_type": "RELATIONAL_ASSET",
    "resource": {
      "uuid": "<warehouse-uuid>",
      "resource_type": "snowflake"
    },
    "events": [
      {
        "type": "TABLE",
        "metadata": {
          "name": "orders",
          "database": "analytics",
          "schema": "public",
          "description": "Orders table"
        },
        "fields": [
          {"name": "id", "type": "INTEGER"},
          {"name": "amount", "type": "DECIMAL(10,2)"}
        ],
        "volume": {
          "row_count": 1000000,
          "byte_count": 111111111
        },
        "freshness": {
          "last_update_time": "2026-03-12T14:30:00Z"
        }
      }
    ]
  }'
```

`volume` and `freshness` are optional — you can push schema-only metadata.
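
Because `volume` and `freshness` are optional, a payload builder can leave those keys out entirely when the values are unknown. The sketch below assumes the event shape shown in the curl example above; `build_table_event` is a hypothetical helper name, not an SDK function.

```python
def build_table_event(name, database, schema, fields, volume=None, freshness=None):
    """Build one entry for the `events` list of a metadata push.

    `fields` is a list of (column_name, column_type) pairs. `volume` and
    `freshness` are dicts shaped like the curl example, or None to omit them.
    """
    event = {
        "type": "TABLE",
        "metadata": {"name": name, "database": database, "schema": schema},
        "fields": [{"name": n, "type": t} for n, t in fields],
    }
    if volume is not None:
        event["volume"] = volume
    if freshness is not None:
        event["freshness"] = freshness
    return event


# Schema-only event: no volume or freshness keys are emitted.
schema_only = build_table_event(
    "orders", "analytics", "public", [("id", "INTEGER"), ("amount", "DECIMAL(10,2)")]
)
```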

---

## Table lineage example

`POST /ingest/v1/lineage` with `event_type: "LINEAGE"`

```bash
curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/lineage" \
  -H "Content-Type: application/json" \
  -H "x-mcd-id: <integration-key-id>" \
  -H "x-mcd-token: <integration-key-secret>" \
  -d '{
    "event_type": "LINEAGE",
    "resource": {
      "uuid": "<warehouse-uuid>",
      "resource_type": "snowflake"
    },
    "events": [
      {
        "source": {
          "name": "orders_raw",
          "database": "analytics",
          "schema": "public"
        },
        "destination": {
          "name": "orders_curated",
          "database": "analytics",
          "schema": "public"
        }
      }
    ]
  }'
```

---

## Column lineage example

`POST /ingest/v1/lineage` with `event_type: "COLUMN_LINEAGE"`

Same endpoint as table lineage. Column lineage automatically creates the parent table-level
edge too.

```bash
curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/lineage" \
  -H "Content-Type: application/json" \
  -H "x-mcd-id: <integration-key-id>" \
  -H "x-mcd-token: <integration-key-secret>" \
  -d '{
    "event_type": "COLUMN_LINEAGE",
    "resource": {
      "uuid": "<warehouse-uuid>",
      "resource_type": "snowflake"
    },
    "events": [
      {
        "source": {
          "name": "customers",
          "database": "analytics",
          "schema": "public"
        },
        "destination": {
          "name": "customer_orders",
          "database": "analytics",
          "schema": "public"
        },
        "col_mappings": [
          {
            "destination_col": "customer_id",
            "source_cols": ["customer_id"]
          },
          {
            "destination_col": "full_name",
            "source_cols": ["first_name", "last_name"]
          }
        ]
      }
    ]
  }'
```

---

## Query log example

`POST /ingest/v1/querylogs`

**Important**: this endpoint uses `log_type` instead of `resource_type` in the resource object.
This is the only endpoint where the field name differs.

```bash
curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/querylogs" \
  -H "Content-Type: application/json" \
  -H "x-mcd-id: <integration-key-id>" \
  -H "x-mcd-token: <integration-key-secret>" \
  -d '{
    "event_type": "QUERY_LOG",
    "resource": {
      "uuid": "<warehouse-uuid>",
      "log_type": "snowflake"
    },
    "events": [
      {
        "start_time": "2026-03-02T12:00:00Z",
        "end_time": "2026-03-02T12:00:05Z",
        "query_text": "SELECT * FROM analytics.public.orders",
        "query_id": "query-123",
        "user": "analyst@company.com",
        "returned_rows": 10
      }
    ]
  }'
```

Supported `log_type` values: `snowflake`, `bigquery`, `databricks`, `redshift`, `hive-s3`,
`athena`, `teradata`, `clickhouse`, `databricks-metastore-sql-warehouse`, `s3`, `presto-s3`.
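
Since the query log endpoint is the one place where the resource object uses `log_type`, it can help to build that body in a dedicated function that also rejects unsupported values early. The following is a sketch only: `build_query_log_body` is a hypothetical name, and the field names and allowed values are copied from the example and list above.

```python
SUPPORTED_LOG_TYPES = frozenset({
    "snowflake", "bigquery", "databricks", "redshift", "hive-s3", "athena",
    "teradata", "clickhouse", "databricks-metastore-sql-warehouse", "s3",
    "presto-s3",
})


def build_query_log_body(warehouse_uuid, log_type, events):
    """Build the JSON body for POST /ingest/v1/querylogs (hypothetical helper)."""
    if log_type not in SUPPORTED_LOG_TYPES:
        raise ValueError(f"unsupported log_type: {log_type!r}")
    return {
        "event_type": "QUERY_LOG",
        # Note: `log_type`, not `resource_type`, for this endpoint only.
        "resource": {"uuid": warehouse_uuid, "log_type": log_type},
        "events": list(events),
    }
```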

---

## Batching

The compressed request body must not exceed **1MB** (Kinesis limit). For large payloads, split
events into multiple requests. Each request returns its own `invocation_id`.
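
One way to respect the size limit is to grow a batch greedily and start a new one as soon as the compressed body would go over. The sketch below rests on two assumptions that the text does not state: that the body is gzip-compressed, and that 1,000,000 bytes is a safe reading of "1MB". The function names are hypothetical.

```python
import gzip
import json

# Assumption: a conservative interpretation of the 1MB limit.
MAX_COMPRESSED_BYTES = 1_000_000


def compressed_size(envelope, events):
    """Size in bytes of the gzip-compressed JSON body for these events."""
    body = dict(envelope, events=events)
    return len(gzip.compress(json.dumps(body).encode("utf-8")))


def split_into_batches(envelope, events, limit=MAX_COMPRESSED_BYTES):
    """Greedily group events so each multi-event batch compresses to <= limit."""
    batches, current = [], []
    for event in events:
        candidate = current + [event]
        if current and compressed_size(envelope, candidate) > limit:
            batches.append(current)   # flush the batch that still fit
            current = [event]         # start a new batch with this event
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches
```

A single event that is larger than the limit on its own still ends up in a batch by itself, so a caller would need to handle that case separately. Recompressing on every step is quadratic in the worst case, which is acceptable for a sketch but worth replacing with a size estimate for very large pushes.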

## Expiration summary

| Flow | Expiration |
|---|---|
| Table metadata | Never expires |
| Table lineage | Never expires |
| Column lineage | Expires after 10 days |
| Query logs | Same as pulled query logs |

@@ -0,0 +1,150 @@
# Prerequisites

## Two keys, two purposes

Push ingestion requires **two separate Monte Carlo API keys** — one for pushing data, one
for reading/verifying it. They use identical header names but different endpoints.

| Key | Purpose | Endpoint |
|---|---|---|
| **Ingestion key** (scope=`Ingestion`) | Push metadata, lineage, query logs | `https://integrations.getmontecarlo.com` |
| **GraphQL API key** | Verify pushed data, run management mutations | `https://api.getmontecarlo.com/graphql` |

Both authenticate with:
```
x-mcd-id: <key-id>
x-mcd-token: <key-secret>
```

The secret for both is shown **only once** at creation time — store it securely immediately.

---

## Create the Ingestion key (for pushing)

Use the Monte Carlo CLI:

```bash
montecarlo integrations create-key \
  --scope Ingestion \
  --description "Push ingestion key"
```

Output:
```
Key id: <id>
Key secret: <secret> ← only shown once
```

Install the CLI if needed:
```bash
pip install montecarlodata
montecarlo configure # enter your API key when prompted
```

**Optional — restrict to a specific warehouse:**
If you want the key to only work for one warehouse UUID, use the GraphQL mutation instead:

```graphql
mutation {
  createIntegrationKey(
    description: "Push key for warehouse XYZ"
    scope: Ingestion
    warehouseIds: ["<warehouse-uuid>"]
  ) {
    key { id secret }
  }
}
```

---

## Create the GraphQL API key (for verification)

1. Go to **https://getmontecarlo.com/settings/api**
2. Click **Add**
3. Choose key type (personal or account-level — account-level requires Account Owner role)
4. Copy the **Key ID** and **Secret** immediately

The GraphQL endpoint is: `https://api.getmontecarlo.com/graphql`

Test it:
```bash
curl -s -X POST https://api.getmontecarlo.com/graphql \
  -H "x-mcd-id: <id>" \
  -H "x-mcd-token: <secret>" \
  -H "Content-Type: application/json" \
  -d '{"query": "{ getUser { email } }"}' | python3 -m json.tool
```

---

## Find your warehouse (resource) UUID

The Ingestion key needs to reference the correct MC resource UUID. To find it:

```graphql
query {
  getUser {
    account {
      warehouses {
        uuid
        name
        connectionType
      }
    }
  }
}
```

Or in the MC UI: **Settings → Integrations** → click the warehouse → copy the UUID from the URL.
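
When the query above is run from a script, the UUID still has to be picked out of the response. A minimal sketch, assuming the response uses the standard GraphQL `data` envelope and mirrors the shape of the query; `find_warehouse_uuid` is a hypothetical helper and the sample values are made up for illustration:

```python
def find_warehouse_uuid(response, warehouse_name):
    """Return the UUID of the single warehouse with the given name."""
    warehouses = response["data"]["getUser"]["account"]["warehouses"]
    matches = [w["uuid"] for w in warehouses if w["name"] == warehouse_name]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one warehouse named {warehouse_name!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


# Illustrative response only; real values come from the GraphQL API.
sample_response = {
    "data": {"getUser": {"account": {"warehouses": [
        {"uuid": "1111-aaaa", "name": "prod-snowflake", "connectionType": "snowflake"},
        {"uuid": "2222-bbbb", "name": "lake", "connectionType": "data-lake"},
    ]}}}
}
```

Failing loudly on zero or multiple matches avoids silently pushing to the wrong resource.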

---

## Install pycarlo (optional)

The pycarlo SDK simplifies push calls, but is not required. You can also call the push APIs
directly via HTTP/curl — see `references/direct-http-api.md`.

```bash
pip install pycarlo
```

Initialize the ingestion client in your script:

```python
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService

client = Client(session=Session(
    mcd_id="<ingestion-key-id>",
    mcd_token="<ingestion-key-secret>",
    scope="Ingestion",
))
service = IngestionService(mc_client=client)
```

Load credentials from environment variables (recommended):

```python
import os
service = IngestionService(mc_client=Client(session=Session(
    mcd_id=os.environ["MCD_INGEST_ID"],
    mcd_token=os.environ["MCD_INGEST_TOKEN"],
    scope="Ingestion",
)))
```

---

## Environment variable conventions

The script templates use these env var names by default:

| Variable | Key type | Used by |
|---|---|---|
| `MCD_INGEST_ID` | Ingestion key ID | push and collect_and_push scripts |
| `MCD_INGEST_TOKEN` | Ingestion key secret | push and collect_and_push scripts |
| `MCD_ID` | GraphQL API key ID | verification scripts, slash commands |
| `MCD_TOKEN` | GraphQL API key secret | verification scripts, slash commands |
| `MCD_RESOURCE_UUID` | Warehouse UUID | all scripts |
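
A script can check for the variables it needs up front and report every missing name at once, instead of failing on the first `KeyError`. This is a sketch under the naming convention in the table above; `load_settings` and the two tuples are hypothetical and do not come from the script templates.

```python
import os

# Variable groups taken from the table above.
INGEST_VARS = ("MCD_INGEST_ID", "MCD_INGEST_TOKEN", "MCD_RESOURCE_UUID")
VERIFY_VARS = ("MCD_ID", "MCD_TOKEN", "MCD_RESOURCE_UUID")


def load_settings(names, environ=None):
    """Return {name: value} for the given variables, or raise listing all gaps."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Accepting an optional `environ` mapping makes the function easy to exercise without touching the real process environment.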

@@ -0,0 +1,160 @@
# Pushing Table and Column Lineage

## Overview

Both table-level and column-level lineage use the same endpoint: `POST /ingest/v1/lineage`.
The `event_type` field distinguishes them:
- `LINEAGE` — table-level: source table → destination table
- `COLUMN_LINEAGE` — column-level: source table.column → destination table.column
  (also automatically creates the parent table-level edge)

Push lineage is **typically visible in the MC lineage graph within seconds to a few minutes**
via the fast direct path (PushLineageProcessor → S3 CSVs → neo4jLineageLoaderPrivate → Neo4j).

**Expiration**:
- Pushed **table lineage does not expire** (`expire_at = 9999-12-31`).
- Pushed **column lineage expires after 10 days** (same as pulled column lineage).

**Batching**: For large numbers of lineage events, split into batches. The compressed request
body must not exceed **1MB** (Kinesis limit).

## pycarlo models

```python
from pycarlo.features.ingestion import (
    IngestionService,
    LineageEvent,
    LineageAssetRef,
    ColumnLineageField,
    ColumnLineageSourceField,
)
```

## Table lineage example

```python
event = LineageEvent(
    destination=LineageAssetRef(
        database="analytics",
        schema="public",
        table="customer_orders",
    ),
    sources=[
        LineageAssetRef(database="analytics", schema="public", table="customers"),
        LineageAssetRef(database="analytics", schema="public", table="orders"),
    ],
)

result = service.send_lineage(
    resource_uuid="<your-resource-uuid>",
    resource_type="data-lake",
    events=[event],
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id)
```

## Column lineage example

```python
event = LineageEvent(
    destination=LineageAssetRef(
        database="analytics",
        schema="public",
        table="customer_orders",
    ),
    sources=[
        LineageAssetRef(database="analytics", schema="public", table="customers"),
        LineageAssetRef(database="analytics", schema="public", table="orders"),
    ],
    # column mappings: dest_col ← src_table.src_col
    fields=[
        ColumnLineageField(
            destination_field="customer_id",
            source_fields=[
                ColumnLineageSourceField(
                    database="analytics", schema="public",
                    table="customers", field="customer_id",
                )
            ],
        ),
        ColumnLineageField(
            destination_field="order_amount",
            source_fields=[
                ColumnLineageSourceField(
                    database="analytics", schema="public",
                    table="orders", field="amount",
                )
            ],
        ),
    ],
)

result = service.send_lineage(
    resource_uuid=resource_uuid,
    resource_type="data-lake",
    events=[event],
)
```

Column lineage push automatically creates a table-level edge too, so you don't need to
send separate table and column lineage events for the same relationship.

## Extracting lineage from SQL logs

For warehouses that don't expose a native lineage table, extract lineage by parsing query
history SQL for `CREATE TABLE AS SELECT`, `INSERT INTO ... SELECT`, and `MERGE INTO` patterns.

Simplified example regex:
```python
import re

CTAS_PATTERN = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\S+)\s+AS\s+SELECT",
    re.IGNORECASE,
)
# (?:TABLE\s+)? skips the TABLE keyword in Hive-style `INSERT OVERWRITE TABLE t`,
# which would otherwise be captured as the destination name.
INSERT_PATTERN = re.compile(
    r"INSERT\s+(?:OVERWRITE\s+)?(?:INTO\s+)?(?:TABLE\s+)?(\S+).*?FROM\s+(\S+)",
    re.IGNORECASE | re.DOTALL,
)
```

For Snowflake, BigQuery, and Redshift the query history tables provide this SQL.
For Databricks, use `system.access.table_lineage` directly (no parsing needed).
For Hive, parse the HiveServer2 log file.
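
To illustrate how regex-based extraction could turn one SQL statement into a lineage edge, here is a deliberately simple sketch. The patterns are a restricted variant written for this example (identifiers limited to word characters and dots), and `extract_edge` is a hypothetical helper. It does not handle CTEs, subquery aliases, quoted identifiers, or `MERGE INTO`, so a real pipeline would likely want a SQL parser.

```python
import re

# Destination patterns: CTAS first, then INSERT (with optional Hive TABLE keyword).
DEST_PATTERNS = [
    re.compile(
        r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?([\w.]+)\s+AS\s+SELECT",
        re.IGNORECASE,
    ),
    re.compile(r"INSERT\s+(?:OVERWRITE\s+|INTO\s+)(?:TABLE\s+)?([\w.]+)", re.IGNORECASE),
]
# Every name that follows FROM or JOIN is treated as a source table.
SOURCE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_edge(sql):
    """Return (destination, [sources]) for a CTAS/INSERT statement, else None."""
    for pattern in DEST_PATTERNS:
        match = pattern.search(sql)
        if match:
            break
    else:
        return None  # no pattern matched: not a statement we understand
    destination = match.group(1)
    sources = []
    # Only look for sources in the text after the destination name.
    for name in SOURCE_PATTERN.findall(sql[match.end():]):
        if name != destination and name not in sources:
            sources.append(name)
    return destination, sources
```

Each returned pair maps naturally onto one table-level lineage event: the first element is the destination and the list holds its sources.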

## Output manifest (include invocation_id)

```python
import json
from datetime import datetime, timezone

manifest = {
    "resource_uuid": resource_uuid,
    "invocation_id": service.extract_invocation_id(result),  # ← save this
    "collected_at": datetime.now(tz=timezone.utc).isoformat(),
    "edges": [
        {
            "destination": {"database": e.destination.database, "table": e.destination.table},
            "sources": [{"database": s.database, "table": s.table} for s in e.sources],
        }
        for e in events
    ],
}
with open("lineage_output.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

## How push lineage is distinguished from query-derived lineage

Push-ingested lineage nodes and edges carry `origin = push_ingest` in Neo4j and
`origin_type = DIRECT_LINEAGE` in the normalized lineage model. This prevents the lineage
DAG from overwriting them with query-log-derived edges and gives MC a clear audit trail.

## Neo4j node expiry

Push-ingested **table lineage** nodes and edges are written with `expire_at = 9999-12-31`
(never expire). This is handled internally by PushLineageProcessor — you do not need to set
this manually when using `send_lineage()`.

Push-ingested **column lineage** expires after **10 days**, same as pulled column lineage.

For custom nodes created via GraphQL mutations, you **do** need to set
`expireAt: "9999-12-31"` explicitly — see `references/custom-lineage.md`.

@@ -0,0 +1,158 @@
# Pushing Table Metadata

## Overview

Metadata push sends three types of signals per table:
- **Schema** — column names and types
- **Volume** — row count and byte count
- **Freshness** — last update timestamp

All three travel together in a single `RelationalAsset` object via `POST /ingest/v1/metadata`.

**Expiration**: Pushed table metadata **does not expire**. Once pushed, it remains in Monte
Carlo until explicitly deleted via `deletePushIngestedTables`.

**Batching**: For large numbers of tables, split assets into batches. The compressed request
body must not exceed **1MB** (Kinesis limit).

## pycarlo models

```python
from pycarlo.features.ingestion import (
    IngestionService,
    RelationalAsset,
    AssetMetadata,
    AssetField,
    AssetVolume,
    AssetFreshness,
)
```

## Minimal example

```python
asset = RelationalAsset(
    type="TABLE",  # ONLY "TABLE" or "VIEW" — normalize warehouse-native values
    metadata=AssetMetadata(
        name="orders",
        database="analytics",
        schema="public",
        description="Order transactions",
    ),
    fields=[
        AssetField(name="order_id", type="INTEGER"),
        AssetField(name="amount", type="DECIMAL"),
        AssetField(name="created_at", type="TIMESTAMP"),
    ],
    volume=AssetVolume(
        row_count=1_500_000,
        byte_count=250_000_000,
    ),
    freshness=AssetFreshness(
        last_update_time="2024-03-01T12:00:00Z",  # ISO 8601 string, NOT a datetime object
    ),
)

result = service.send_metadata(
    resource_uuid="<your-resource-uuid>",
    resource_type="data-lake",  # see note below on resource_type
    events=[asset],
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id)  # save this!
```

## resource_type

The `resource_type` value must match the type of the MC resource (warehouse connection) you
are pushing to. Use the same string that appears in the MC UI or the `connectionType` field
from `getUser { account { warehouses { connectionType } } }`.

Common values:
- `"data-lake"` — Hive, EMR, Glue, generic data lake connections
- `"snowflake"` — Snowflake
- `"bigquery"` — BigQuery
- `"databricks"` — Databricks Unity Catalog
- `"redshift"` — Redshift

## Asset type

The `type` parameter on `RelationalAsset` must be one of two values (uppercase):
- `"TABLE"` — tables, external tables, dynamic tables, materialized views, etc.
- `"VIEW"` — views, secure views

**Important**: Warehouse-native type values like `"BASE TABLE"` (Snowflake), `"MANAGED"` /
`"EXTERNAL"` (Databricks), or `"MATERIALIZED_VIEW"` (BigQuery) are **NOT accepted** by the
MC API and will cause a 400 error. Always normalize to `"TABLE"` or `"VIEW"` before pushing.
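
The normalization step can be a small lookup that fails loudly on anything it does not recognize. The sketch below covers only the native values named above plus a few obvious spellings; the set contents and the name `normalize_asset_type` are assumptions for illustration and should be extended for the warehouse in use.

```python
# Native values that map to "TABLE" (per the list above, materialized views count as tables).
TABLE_TYPES = frozenset({
    "TABLE", "BASE TABLE", "EXTERNAL TABLE", "EXTERNAL", "MANAGED",
    "DYNAMIC TABLE", "MATERIALIZED VIEW",
})
# Native values that map to "VIEW".
VIEW_TYPES = frozenset({"VIEW", "SECURE VIEW"})


def normalize_asset_type(native_type):
    """Map a warehouse-native table type to "TABLE" or "VIEW"."""
    cleaned = native_type.strip().upper().replace("_", " ")
    if cleaned in VIEW_TYPES:
        return "VIEW"
    if cleaned in TABLE_TYPES:
        return "TABLE"
    raise ValueError(f"unrecognized asset type: {native_type!r}")
```

Raising on unknown values surfaces a new native type at collection time rather than as a 400 error from the API.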

## Field types

Normalize to SQL-standard uppercase strings. Monte Carlo accepts any string but canonical
values like `INTEGER`, `BIGINT`, `VARCHAR`, `FLOAT`, `BOOLEAN`, `TIMESTAMP`, `DATE`,
`DECIMAL`, `ARRAY`, `STRUCT` work best with downstream features.
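
A field-type normalizer might uppercase the native string and map a few common aliases onto the canonical names listed above, while keeping any precision suffix such as `(10,2)`. The alias table here is purely illustrative and is not a mapping defined by Monte Carlo; `normalize_field_type` is a hypothetical helper.

```python
# Hypothetical alias table; extend or adjust per warehouse.
TYPE_ALIASES = {
    "INT": "INTEGER",
    "STRING": "VARCHAR",
    "BOOL": "BOOLEAN",
}


def normalize_field_type(native_type):
    """Uppercase a column type and replace known aliases, keeping any (p,s) suffix."""
    cleaned = native_type.strip().upper()
    base, paren, rest = cleaned.partition("(")
    base = base.strip()
    return TYPE_ALIASES.get(base, base) + paren + rest
```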

## Volume and freshness are optional

If your warehouse doesn't expose row counts or last-modified timestamps, omit `volume`
and/or `freshness` — schema-only metadata is valid.

If you send `freshness`, each push must carry a **changed** `last_update_time` to count as
a new data point for the anomaly detector (repeated identical timestamps don't advance the
training clock).
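
Since `last_update_time` is expected as an ISO 8601 string and only a changed value counts as a new data point, a collector can format the timestamp and compare it with the value from the previous push in one step. A minimal sketch, assuming the previous value is kept somewhere by the caller (for example in the output manifest); `freshness_if_changed` is a hypothetical helper.

```python
from datetime import datetime, timezone


def freshness_if_changed(last_update, previous_iso):
    """Return an ISO 8601 UTC string, or None if it equals the previous push."""
    if last_update.tzinfo is None:
        raise ValueError("last_update must be timezone-aware")
    current_iso = last_update.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return None if current_iso == previous_iso else current_iso
```

A `None` result signals that the timestamp is unchanged, which tells the caller that this push would not add a data point for the freshness detector.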

## Freshness + volume only mode (skip schema)

For periodic pushes (e.g. hourly cron), you often don't need to re-collect the full schema
on every run — field definitions rarely change. Collection scripts can support a
`--only-freshness-and-volume` flag that skips the `COLUMNS` / `INFORMATION_SCHEMA` query
and omits `fields` from the manifest. This is significantly faster on warehouses with many
tables. Use the full collection (with fields) on the first push and on a daily schedule,
and the freshness+volume only mode for hourly pushes in between. See the
[BigQuery Iceberg example](https://github.com/monte-carlo-data/mcd-public-resources/tree/main/examples/push-ingestion/bigquery/push-iceberg-tables)
for a working implementation of this pattern.

## Batch multiple tables

`events` accepts a list. Push all tables in a single call or in batches:

```python
result = service.send_metadata(
    resource_uuid=resource_uuid,
    resource_type="data-lake",
    events=[asset1, asset2, asset3, ...],
)
```

## Output manifest (include invocation_id)

Always write a local manifest so you can trace issues later:

```python
import json
from datetime import datetime, timezone

manifest = {
    "resource_uuid": resource_uuid,
    "invocation_id": service.extract_invocation_id(result),  # ← critical for debugging
    "collected_at": datetime.now(tz=timezone.utc).isoformat(),
    "assets": [
        {
            "database": a.metadata.database,
            "schema": a.metadata.schema,
            "table": a.metadata.name,
            "row_count": a.volume.row_count if a.volume else None,
            "fields": [{"name": f.name, "type": f.type} for f in a.fields],
        }
        for a in assets
    ],
}
with open("metadata_output.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

## Push frequency for anomaly detection

To keep volume and freshness anomaly detectors active:
- Push **at most once per hour** (pushing more frequently produces unpredictable behavior)
- Push **consistently** — gaps longer than a few days will deactivate detectors
- See `references/anomaly-detection.md` for minimum sample requirements
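
The two cadence rules above can be checked by a scheduler before each run. The sketch below is an illustration only: the three-day threshold is an assumed stand-in for "a few days", and `push_decision` with its return labels is hypothetical.

```python
from datetime import datetime, timedelta, timezone

MIN_INTERVAL = timedelta(hours=1)  # "at most once per hour"
MAX_GAP = timedelta(days=3)        # assumption: stand-in for "a few days"


def push_decision(last_push, now):
    """Return "wait", "push", or "push-late" based on time since the last push."""
    if last_push is None:
        return "push"        # nothing pushed yet
    elapsed = now - last_push
    if elapsed < MIN_INTERVAL:
        return "wait"        # too soon since the previous push
    if elapsed > MAX_GAP:
        return "push-late"   # push anyway, but detectors may have deactivated
    return "push"
```

Taking `now` as a parameter instead of reading the clock inside the function keeps the rule deterministic and easy to test.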