opencode-skills-collection 2.0.0 → 2.0.2

Files changed (90)
  1. package/bundled-skills/.antigravity-install-manifest.json +6 -1
  2. package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
  3. package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
  4. package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
  5. package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
  6. package/bundled-skills/docs/users/bundles.md +1 -1
  7. package/bundled-skills/docs/users/claude-code-skills.md +1 -1
  8. package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
  9. package/bundled-skills/docs/users/getting-started.md +1 -1
  10. package/bundled-skills/docs/users/kiro-integration.md +1 -1
  11. package/bundled-skills/docs/users/usage.md +4 -4
  12. package/bundled-skills/docs/users/visual-guide.md +4 -4
  13. package/bundled-skills/manage-skills/SKILL.md +187 -0
  14. package/bundled-skills/monte-carlo-monitor-creation/SKILL.md +222 -0
  15. package/bundled-skills/monte-carlo-monitor-creation/references/comparison-monitor.md +426 -0
  16. package/bundled-skills/monte-carlo-monitor-creation/references/custom-sql-monitor.md +207 -0
  17. package/bundled-skills/monte-carlo-monitor-creation/references/metric-monitor.md +292 -0
  18. package/bundled-skills/monte-carlo-monitor-creation/references/table-monitor.md +231 -0
  19. package/bundled-skills/monte-carlo-monitor-creation/references/validation-monitor.md +404 -0
  20. package/bundled-skills/monte-carlo-prevent/SKILL.md +252 -0
  21. package/bundled-skills/monte-carlo-prevent/references/TROUBLESHOOTING.md +23 -0
  22. package/bundled-skills/monte-carlo-prevent/references/parameters.md +32 -0
  23. package/bundled-skills/monte-carlo-prevent/references/workflows.md +478 -0
  24. package/bundled-skills/monte-carlo-push-ingestion/SKILL.md +363 -0
  25. package/bundled-skills/monte-carlo-push-ingestion/references/anomaly-detection.md +87 -0
  26. package/bundled-skills/monte-carlo-push-ingestion/references/custom-lineage.md +203 -0
  27. package/bundled-skills/monte-carlo-push-ingestion/references/direct-http-api.md +207 -0
  28. package/bundled-skills/monte-carlo-push-ingestion/references/prerequisites.md +150 -0
  29. package/bundled-skills/monte-carlo-push-ingestion/references/push-lineage.md +160 -0
  30. package/bundled-skills/monte-carlo-push-ingestion/references/push-metadata.md +158 -0
  31. package/bundled-skills/monte-carlo-push-ingestion/references/push-query-logs.md +219 -0
  32. package/bundled-skills/monte-carlo-push-ingestion/references/validation.md +257 -0
  33. package/bundled-skills/monte-carlo-push-ingestion/scripts/sample_verify.py +357 -0
  34. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/collect_and_push_lineage.py +70 -0
  35. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/collect_and_push_metadata.py +65 -0
  36. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/collect_and_push_query_logs.py +70 -0
  37. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/collect_lineage.py +214 -0
  38. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/collect_metadata.py +160 -0
  39. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/collect_query_logs.py +164 -0
  40. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/push_lineage.py +198 -0
  41. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/push_metadata.py +193 -0
  42. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery/push_query_logs.py +207 -0
  43. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery-iceberg/collect_and_push_metadata.py +71 -0
  44. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery-iceberg/collect_and_push_query_logs.py +64 -0
  45. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery-iceberg/collect_metadata.py +253 -0
  46. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery-iceberg/collect_query_logs.py +149 -0
  47. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery-iceberg/push_metadata.py +190 -0
  48. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/bigquery-iceberg/push_query_logs.py +208 -0
  49. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/collect_and_push_lineage.py +83 -0
  50. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/collect_and_push_metadata.py +77 -0
  51. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/collect_and_push_query_logs.py +83 -0
  52. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/collect_lineage.py +240 -0
  53. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/collect_metadata.py +212 -0
  54. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/collect_query_logs.py +204 -0
  55. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/push_lineage.py +192 -0
  56. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/push_metadata.py +178 -0
  57. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/databricks/push_query_logs.py +200 -0
  58. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/collect_and_push_lineage.py +119 -0
  59. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/collect_and_push_metadata.py +119 -0
  60. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/collect_and_push_query_logs.py +117 -0
  61. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/collect_lineage.py +265 -0
  62. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/collect_metadata.py +313 -0
  63. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/collect_query_logs.py +284 -0
  64. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/push_lineage.py +309 -0
  65. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/push_metadata.py +245 -0
  66. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/hive/push_query_logs.py +255 -0
  67. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/collect_and_push_lineage.py +78 -0
  68. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/collect_and_push_metadata.py +80 -0
  69. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/collect_and_push_query_logs.py +88 -0
  70. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/collect_lineage.py +235 -0
  71. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/collect_metadata.py +219 -0
  72. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/collect_query_logs.py +239 -0
  73. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/push_lineage.py +178 -0
  74. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/push_metadata.py +178 -0
  75. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/redshift/push_query_logs.py +196 -0
  76. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/collect_and_push_lineage.py +154 -0
  77. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/collect_and_push_metadata.py +137 -0
  78. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/collect_and_push_query_logs.py +137 -0
  79. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/collect_lineage.py +349 -0
  80. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/collect_metadata.py +329 -0
  81. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/collect_query_logs.py +254 -0
  82. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/push_lineage.py +307 -0
  83. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/push_metadata.py +228 -0
  84. package/bundled-skills/monte-carlo-push-ingestion/scripts/templates/snowflake/push_query_logs.py +248 -0
  85. package/bundled-skills/monte-carlo-push-ingestion/scripts/test_template_sdk_usage.py +340 -0
  86. package/bundled-skills/monte-carlo-validation-notebook/SKILL.md +685 -0
  87. package/bundled-skills/monte-carlo-validation-notebook/scripts/generate_notebook_url.py +141 -0
  88. package/bundled-skills/monte-carlo-validation-notebook/scripts/resolve_dbt_schema.py +161 -0
  89. package/package.json +1 -1
  90. package/skills_index.json +503 -61
package/bundled-skills/monte-carlo-push-ingestion/references/direct-http-api.md
@@ -0,0 +1,207 @@
1
+ # Direct HTTP API (without pycarlo)
2
+
3
+ The `pycarlo` SDK is optional. You can call the push APIs directly over HTTPS from any
4
+ language or tool (curl, Postman, etc.) as long as you:
5
+ - authenticate with an integration key whose scope is `Ingestion`
6
+ - send a JSON body that matches the ingest schema
7
+ - send to the correct integration gateway endpoint
8
+
9
+ ## Endpoint
10
+
11
+ The host is environment-specific:
12
+ - **Production**: `https://integrations.getmontecarlo.com`
13
+
14
+ ## Authentication headers
15
+
16
+ All requests use the same headers:
17
+ ```
18
+ x-mcd-id: <integration-key-id>
19
+ x-mcd-token: <integration-key-secret>
20
+ Content-Type: application/json
21
+ ```
22
+
23
+ ## Response
24
+
25
+ On success, all endpoints return:
26
+ ```json
27
+ {"invocation_id": "<uuid>"}
28
+ ```
29
+
30
+ Save the `invocation_id` — it is the primary trace ID for debugging across downstream systems.
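The headers and response shape above can be sketched in Python using only the standard library. This is a minimal illustration rather than an official client: the helper names `build_ingest_request` and `send` are hypothetical, and the sketch assumes the success response is exactly the JSON object shown above.

```python
import json
import urllib.request

# Production host taken from the "Endpoint" section above.
INGEST_HOST = "https://integrations.getmontecarlo.com"


def build_ingest_request(path, key_id, key_secret, body):
    """Build (but do not send) a POST request for an ingest endpoint.

    `path` is an endpoint path such as "/ingest/v1/metadata".
    """
    return urllib.request.Request(
        url=INGEST_HOST + path,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "x-mcd-id": key_id,
            "x-mcd-token": key_secret,
            "Content-Type": "application/json",
        },
    )


def send(request, timeout=30):
    """Send the request and return the invocation_id from the response.

    Assumes a successful response body of the form {"invocation_id": "<uuid>"}.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["invocation_id"]
```

Keeping request construction separate from sending makes it possible to inspect or log the request without touching the network.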
31
+
32
+ ---
33
+
34
+ ## Metadata example
35
+
36
+ `POST /ingest/v1/metadata`
37
+
38
+ ```bash
39
+ curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/metadata" \
40
+ -H "Content-Type: application/json" \
41
+ -H "x-mcd-id: <integration-key-id>" \
42
+ -H "x-mcd-token: <integration-key-secret>" \
43
+ -d '{
44
+ "event_type": "RELATIONAL_ASSET",
45
+ "resource": {
46
+ "uuid": "<warehouse-uuid>",
47
+ "resource_type": "snowflake"
48
+ },
49
+ "events": [
50
+ {
51
+ "type": "TABLE",
52
+ "metadata": {
53
+ "name": "orders",
54
+ "database": "analytics",
55
+ "schema": "public",
56
+ "description": "Orders table"
57
+ },
58
+ "fields": [
59
+ {"name": "id", "type": "INTEGER"},
60
+ {"name": "amount", "type": "DECIMAL(10,2)"}
61
+ ],
62
+ "volume": {
63
+ "row_count": 1000000,
64
+ "byte_count": 111111111
65
+ },
66
+ "freshness": {
67
+ "last_update_time": "2026-03-12T14:30:00Z"
68
+ }
69
+ }
70
+ ]
71
+ }'
72
+ ```
73
+
74
+ `volume` and `freshness` are optional — you can push schema-only metadata.
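As a rough sketch of the optional fields, the payload above could be assembled in Python so that `volume` and `freshness` are simply left out when unknown. The function names are hypothetical. Treating `description` as optional and sending `volume` only when both counts are known are assumptions made for this sketch, not documented rules.

```python
def build_table_event(name, database, schema, fields, description=None,
                      row_count=None, byte_count=None, last_update_time=None):
    """Build one entry of the `events` list for POST /ingest/v1/metadata.

    `fields` is a list of (column_name, column_type) pairs.
    """
    event = {
        "type": "TABLE",
        "metadata": {"name": name, "database": database, "schema": schema},
        "fields": [{"name": n, "type": t} for n, t in fields],
    }
    if description is not None:
        event["metadata"]["description"] = description
    # Assumption: include volume only when both counts are available.
    if row_count is not None and byte_count is not None:
        event["volume"] = {"row_count": row_count, "byte_count": byte_count}
    if last_update_time is not None:
        event["freshness"] = {"last_update_time": last_update_time}
    return event


def build_metadata_body(warehouse_uuid, resource_type, events):
    """Wrap events in the request envelope shown in the curl example."""
    return {
        "event_type": "RELATIONAL_ASSET",
        "resource": {"uuid": warehouse_uuid, "resource_type": resource_type},
        "events": list(events),
    }
```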
75
+
76
+ ---
77
+
78
+ ## Table lineage example
79
+
80
+ `POST /ingest/v1/lineage` with `event_type: "LINEAGE"`
81
+
82
+ ```bash
83
+ curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/lineage" \
84
+ -H "Content-Type: application/json" \
85
+ -H "x-mcd-id: <integration-key-id>" \
86
+ -H "x-mcd-token: <integration-key-secret>" \
87
+ -d '{
88
+ "event_type": "LINEAGE",
89
+ "resource": {
90
+ "uuid": "<warehouse-uuid>",
91
+ "resource_type": "snowflake"
92
+ },
93
+ "events": [
94
+ {
95
+ "source": {
96
+ "name": "orders_raw",
97
+ "database": "analytics",
98
+ "schema": "public"
99
+ },
100
+ "destination": {
101
+ "name": "orders_curated",
102
+ "database": "analytics",
103
+ "schema": "public"
104
+ }
105
+ }
106
+ ]
107
+ }'
108
+ ```
109
+
110
+ ---
111
+
112
+ ## Column lineage example
113
+
114
+ `POST /ingest/v1/lineage` with `event_type: "COLUMN_LINEAGE"`
115
+
116
+ Same endpoint as table lineage. Column lineage automatically creates the parent table-level
117
+ edge too.
118
+
119
+ ```bash
120
+ curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/lineage" \
121
+ -H "Content-Type: application/json" \
122
+ -H "x-mcd-id: <integration-key-id>" \
123
+ -H "x-mcd-token: <integration-key-secret>" \
124
+ -d '{
125
+ "event_type": "COLUMN_LINEAGE",
126
+ "resource": {
127
+ "uuid": "<warehouse-uuid>",
128
+ "resource_type": "snowflake"
129
+ },
130
+ "events": [
131
+ {
132
+ "source": {
133
+ "name": "customers",
134
+ "database": "analytics",
135
+ "schema": "public"
136
+ },
137
+ "destination": {
138
+ "name": "customer_orders",
139
+ "database": "analytics",
140
+ "schema": "public"
141
+ },
142
+ "col_mappings": [
143
+ {
144
+ "destination_col": "customer_id",
145
+ "source_cols": ["customer_id"]
146
+ },
147
+ {
148
+ "destination_col": "full_name",
149
+ "source_cols": ["first_name", "last_name"]
150
+ }
151
+ ]
152
+ }
153
+ ]
154
+ }'
155
+ ```
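Because table and column lineage share one endpoint and differ only in `event_type` and the presence of `col_mappings`, a single builder can cover both. The sketch below mirrors the JSON shapes in the two curl examples. The helper names are hypothetical, and choosing the event type from whether column mappings were supplied is a design choice of this sketch.

```python
def table_ref(database, schema, name):
    """Reference to a table, in the shape used by `source` and `destination`."""
    return {"name": name, "database": database, "schema": schema}


def build_lineage_body(warehouse_uuid, resource_type, source, destination,
                       col_mappings=None):
    """Build a request body for POST /ingest/v1/lineage.

    `col_mappings` maps a destination column to its list of source columns.
    When it is given, the body is a COLUMN_LINEAGE event, otherwise LINEAGE.
    """
    event = {"source": source, "destination": destination}
    if col_mappings:
        event["col_mappings"] = [
            {"destination_col": dest, "source_cols": list(srcs)}
            for dest, srcs in col_mappings.items()
        ]
    return {
        "event_type": "COLUMN_LINEAGE" if col_mappings else "LINEAGE",
        "resource": {"uuid": warehouse_uuid, "resource_type": resource_type},
        "events": [event],
    }
```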
156
+
157
+ ---
158
+
159
+ ## Query log example
160
+
161
+ `POST /ingest/v1/querylogs`
162
+
163
+ **Important**: this endpoint uses `log_type` instead of `resource_type` in the resource object.
164
+ This is the only endpoint where the field name differs.
165
+
166
+ ```bash
167
+ curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/querylogs" \
168
+ -H "Content-Type: application/json" \
169
+ -H "x-mcd-id: <integration-key-id>" \
170
+ -H "x-mcd-token: <integration-key-secret>" \
171
+ -d '{
172
+ "event_type": "QUERY_LOG",
173
+ "resource": {
174
+ "uuid": "<warehouse-uuid>",
175
+ "log_type": "snowflake"
176
+ },
177
+ "events": [
178
+ {
179
+ "start_time": "2026-03-02T12:00:00Z",
180
+ "end_time": "2026-03-02T12:00:05Z",
181
+ "query_text": "SELECT * FROM analytics.public.orders",
182
+ "query_id": "query-123",
183
+ "user": "analyst@company.com",
184
+ "returned_rows": 10
185
+ }
186
+ ]
187
+ }'
188
+ ```
189
+
190
+ Supported `log_type` values: `snowflake`, `bigquery`, `databricks`, `redshift`, `hive-s3`,
191
+ `athena`, `teradata`, `clickhouse`, `databricks-metastore-sql-warehouse`, `s3`, `presto-s3`.
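Since the query log endpoint is the one place where the resource object uses `log_type`, a small builder that checks the value up front can catch mistakes before a request is sent. This is an illustrative sketch: the function name is hypothetical, and the set simply copies the values listed above.

```python
# Values copied from the "Supported `log_type` values" list above.
SUPPORTED_LOG_TYPES = frozenset({
    "snowflake", "bigquery", "databricks", "redshift", "hive-s3", "athena",
    "teradata", "clickhouse", "databricks-metastore-sql-warehouse", "s3",
    "presto-s3",
})


def build_query_log_body(warehouse_uuid, log_type, events):
    """Build a request body for POST /ingest/v1/querylogs.

    Note the `log_type` key in `resource` (not `resource_type`).
    """
    if log_type not in SUPPORTED_LOG_TYPES:
        raise ValueError(f"unsupported log_type: {log_type!r}")
    return {
        "event_type": "QUERY_LOG",
        "resource": {"uuid": warehouse_uuid, "log_type": log_type},
        "events": list(events),
    }
```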
192
+
193
+ ---
194
+
195
+ ## Batching
196
+
197
+ The compressed request body must not exceed **1 MB** (Kinesis limit). For large payloads, split
198
+ events into multiple requests. Each request returns its own `invocation_id`.
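One way to split events under the size limit is sketched below. Several things here are assumptions: the text says only that the body is "compressed", so gzip is used as a stand-in for measuring size; the limit is read conservatively as 1,000,000 bytes; and the helper names are hypothetical. The loop recompresses the candidate batch for every event, which is simple but not efficient for very large inputs.

```python
import gzip
import json

# Assumption: a conservative reading of the 1 MB limit.
MAX_COMPRESSED_BYTES = 1_000_000


def compressed_size(envelope, events):
    """Size in bytes of the gzip-compressed JSON body (gzip is an assumption)."""
    body = dict(envelope, events=events)
    return len(gzip.compress(json.dumps(body).encode("utf-8")))


def split_into_batches(envelope, events, limit=MAX_COMPRESSED_BYTES):
    """Greedily group events so each compressed request body fits in `limit`.

    `envelope` holds the non-event keys (`event_type`, `resource`).
    A single event that is too large on its own still gets its own batch.
    """
    batches, current = [], []
    for event in events:
        candidate = current + [event]
        if current and compressed_size(envelope, candidate) > limit:
            batches.append(current)
            current = [event]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches
```

Each resulting batch would be sent as its own request, and each response carries its own `invocation_id` to record.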
199
+
200
+ ## Expiration summary
201
+
202
+ | Flow | Expiration |
203
+ |---|---|
204
+ | Table metadata | Never expires |
205
+ | Table lineage | Never expires |
206
+ | Column lineage | Expires after 10 days |
207
+ | Query logs | Same as pulled query logs |
package/bundled-skills/monte-carlo-push-ingestion/references/prerequisites.md
@@ -0,0 +1,150 @@
1
+ # Prerequisites
2
+
3
+ ## Two keys, two purposes
4
+
5
+ Push ingestion requires **two separate Monte Carlo API keys** — one for pushing data, one
6
+ for reading/verifying it. They use identical header names but different endpoints.
7
+
8
+ | Key | Purpose | Endpoint |
9
+ |---|---|---|
10
+ | **Ingestion key** (scope=`Ingestion`) | Push metadata, lineage, query logs | `https://integrations.getmontecarlo.com` |
11
+ | **GraphQL API key** | Verify pushed data, run management mutations | `https://api.getmontecarlo.com/graphql` |
12
+
13
+ Both authenticate with:
14
+ ```
15
+ x-mcd-id: <key-id>
16
+ x-mcd-token: <key-secret>
17
+ ```
18
+
19
+ The secret for both is shown **only once** at creation time — store it securely immediately.
20
+
21
+ ---
22
+
23
+ ## Create the Ingestion key (for pushing)
24
+
25
+ Use the Monte Carlo CLI:
26
+
27
+ ```bash
28
+ montecarlo integrations create-key \
29
+ --scope Ingestion \
30
+ --description "Push ingestion key"
31
+ ```
32
+
33
+ Output:
34
+ ```
35
+ Key id: <id>
36
+ Key secret: <secret> ← only shown once
37
+ ```
38
+
39
+ Install the CLI if needed:
40
+ ```bash
41
+ pip install montecarlodata
42
+ montecarlo configure # enter your API key when prompted
43
+ ```
44
+
45
+ **Optional — restrict to a specific warehouse:**
46
+ If you want the key to only work for one warehouse UUID, use the GraphQL mutation instead:
47
+
48
+ ```graphql
49
+ mutation {
50
+ createIntegrationKey(
51
+ description: "Push key for warehouse XYZ"
52
+ scope: Ingestion
53
+ warehouseIds: ["<warehouse-uuid>"]
54
+ ) {
55
+ key { id secret }
56
+ }
57
+ }
58
+ ```
59
+
60
+ ---
61
+
62
+ ## Create the GraphQL API key (for verification)
63
+
64
+ 1. Go to **https://getmontecarlo.com/settings/api**
65
+ 2. Click **Add**
66
+ 3. Choose key type (personal or account-level — account-level requires Account Owner role)
67
+ 4. Copy the **Key ID** and **Secret** immediately
68
+
69
+ The GraphQL endpoint is: `https://api.getmontecarlo.com/graphql`
70
+
71
+ Test it:
72
+ ```bash
73
+ curl -s -X POST https://api.getmontecarlo.com/graphql \
74
+ -H "x-mcd-id: <id>" \
75
+ -H "x-mcd-token: <secret>" \
76
+ -H "Content-Type: application/json" \
77
+ -d '{"query": "{ getUser { email } }"}' | python3 -m json.tool
78
+ ```
79
+
80
+ ---
81
+
82
+ ## Find your warehouse (resource) UUID
83
+
84
+ Every push request must reference the correct MC resource (warehouse) UUID. To find it:
85
+
86
+ ```graphql
87
+ query {
88
+ getUser {
89
+ account {
90
+ warehouses {
91
+ uuid
92
+ name
93
+ connectionType
94
+ }
95
+ }
96
+ }
97
+ }
98
+ ```
99
+
100
+ Or in the MC UI: **Settings → Integrations** → click the warehouse → copy the UUID from the URL.
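The lookup can also be scripted. The sketch below builds the request body for the query above and picks a warehouse by name from the parsed response. It assumes the standard GraphQL response envelope (`{"data": {...}}`) with the fields nested as in the query. The function names are hypothetical.

```python
import json

# Same query as above, on one line.
WAREHOUSES_QUERY = (
    "query { getUser { account { warehouses { uuid name connectionType } } } }"
)


def warehouses_request_body():
    """JSON body to POST to the GraphQL endpoint."""
    return json.dumps({"query": WAREHOUSES_QUERY})


def find_warehouse(response, name):
    """Return the single warehouse entry whose `name` matches.

    `response` is the parsed JSON response; the nesting is assumed to
    mirror the query.
    """
    warehouses = response["data"]["getUser"]["account"]["warehouses"]
    matches = [w for w in warehouses if w["name"] == name]
    if len(matches) != 1:
        raise LookupError(f"expected one warehouse named {name!r}, found {len(matches)}")
    return matches[0]
```

The `connectionType` of the returned entry is the value that `resource_type` should match when pushing.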
101
+
102
+ ---
103
+
104
+ ## Install pycarlo (optional)
105
+
106
+ The pycarlo SDK simplifies push calls, but is not required. You can also call the push APIs
107
+ directly via HTTP/curl — see `references/direct-http-api.md`.
108
+
109
+ ```bash
110
+ pip install pycarlo
111
+ ```
112
+
113
+ Initialize the ingestion client in your script:
114
+
115
+ ```python
116
+ from pycarlo.core import Client, Session
117
+ from pycarlo.features.ingestion import IngestionService
118
+
119
+ client = Client(session=Session(
120
+     mcd_id="<ingestion-key-id>",
121
+     mcd_token="<ingestion-key-secret>",
122
+     scope="Ingestion",
123
+ ))
124
+ service = IngestionService(mc_client=client)
125
+ ```
126
+
127
+ Load credentials from environment variables (recommended):
128
+
129
+ ```python
130
+ import os
131
+ service = IngestionService(mc_client=Client(session=Session(
132
+     mcd_id=os.environ["MCD_INGEST_ID"],
132
+     mcd_token=os.environ["MCD_INGEST_TOKEN"],
133
+     scope="Ingestion",
135
+ )))
136
+ ```
137
+
138
+ ---
139
+
140
+ ## Environment variable conventions
141
+
142
+ The script templates use these env var names by default:
143
+
144
+ | Variable | Key type | Used by |
145
+ |---|---|---|
146
+ | `MCD_INGEST_ID` | Ingestion key ID | push and collect_and_push scripts |
147
+ | `MCD_INGEST_TOKEN` | Ingestion key secret | push and collect_and_push scripts |
148
+ | `MCD_ID` | GraphQL API key ID | verification scripts, slash commands |
149
+ | `MCD_TOKEN` | GraphQL API key secret | verification scripts, slash commands |
150
+ | `MCD_RESOURCE_UUID` | Warehouse UUID | all scripts |
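For scripts that call the HTTP APIs directly, the two key pairs in the table can be turned into request headers as sketched below. The helper names are hypothetical. Failing early when a variable is missing is a design choice of this sketch, meant to avoid sending a request with an empty credential.

```python
import os


def headers_from_env(id_var, token_var, env=None):
    """Build the x-mcd-* headers from a pair of environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in (id_var, token_var) if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "x-mcd-id": env[id_var],
        "x-mcd-token": env[token_var],
        "Content-Type": "application/json",
    }


def ingestion_headers(env=None):
    """Headers for the push endpoints (Ingestion key)."""
    return headers_from_env("MCD_INGEST_ID", "MCD_INGEST_TOKEN", env)


def graphql_headers(env=None):
    """Headers for the GraphQL endpoint (GraphQL API key)."""
    return headers_from_env("MCD_ID", "MCD_TOKEN", env)
```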
package/bundled-skills/monte-carlo-push-ingestion/references/push-lineage.md
@@ -0,0 +1,160 @@
1
+ # Pushing Table and Column Lineage
2
+
3
+ ## Overview
4
+
5
+ Both table-level and column-level lineage use the same endpoint: `POST /ingest/v1/lineage`.
6
+ The `event_type` field distinguishes them:
7
+ - `LINEAGE` — table-level: source table → destination table
8
+ - `COLUMN_LINEAGE` — column-level: source table.column → destination table.column
9
+ (also automatically creates the parent table-level edge)
10
+
11
+ Pushed lineage is **typically visible in the MC lineage graph within seconds to a few minutes**
12
+ via the fast direct path (PushLineageProcessor → S3 CSVs → neo4jLineageLoaderPrivate → Neo4j).
13
+
14
+ **Expiration**:
15
+ - Pushed **table lineage does not expire** (`expire_at = 9999-12-31`).
16
+ - Pushed **column lineage expires after 10 days** (same as pulled column lineage).
17
+
18
+ **Batching**: For large numbers of lineage events, split into batches. The compressed request
19
+ body must not exceed **1 MB** (Kinesis limit).
20
+
21
+ ## pycarlo models
22
+
23
+ ```python
24
+ from pycarlo.features.ingestion import (
25
+     IngestionService,
25
+     LineageEvent,
26
+     LineageAssetRef,
27
+     ColumnLineageField,
28
+     ColumnLineageSourceField,
30
+ )
31
+ ```
32
+
33
+ ## Table lineage example
34
+
35
+ ```python
36
+ event = LineageEvent(
37
+     destination=LineageAssetRef(
37
+         database="analytics",
38
+         schema="public",
39
+         table="customer_orders",
40
+     ),
41
+     sources=[
42
+         LineageAssetRef(database="analytics", schema="public", table="customers"),
43
+         LineageAssetRef(database="analytics", schema="public", table="orders"),
44
+     ],
45
+ )
46
+
47
+ result = service.send_lineage(
48
+     resource_uuid="<your-resource-uuid>",
49
+     resource_type="data-lake",
50
+     events=[event],
52
+ )
53
+ invocation_id = service.extract_invocation_id(result)
54
+ print("invocation_id:", invocation_id)
55
+ ```
56
+
57
+ ## Column lineage example
58
+
59
+ ```python
60
+ event = LineageEvent(
61
+     destination=LineageAssetRef(
61
+         database="analytics",
62
+         schema="public",
63
+         table="customer_orders",
64
+     ),
65
+     sources=[
66
+         LineageAssetRef(database="analytics", schema="public", table="customers"),
67
+         LineageAssetRef(database="analytics", schema="public", table="orders"),
68
+     ],
69
+     # column mappings: dest_col ← src_table.src_col
70
+     fields=[
71
+         ColumnLineageField(
72
+             destination_field="customer_id",
73
+             source_fields=[
74
+                 ColumnLineageSourceField(
75
+                     database="analytics", schema="public",
76
+                     table="customers", field="customer_id",
77
+                 )
78
+             ],
79
+         ),
80
+         ColumnLineageField(
81
+             destination_field="order_amount",
82
+             source_fields=[
83
+                 ColumnLineageSourceField(
84
+                     database="analytics", schema="public",
85
+                     table="orders", field="amount",
86
+                 )
87
+             ],
88
+         ),
89
+     ],
90
+ )
91
+
92
+ result = service.send_lineage(
93
+     resource_uuid=resource_uuid,
94
+     resource_type="data-lake",
95
+     events=[event],
97
+ )
98
+ ```
99
+
100
+ Column lineage push automatically creates a table-level edge too, so you don't need to
101
+ send separate table and column lineage events for the same relationship.
102
+
103
+ ## Extracting lineage from SQL logs
104
+
105
+ For warehouses that don't expose a native lineage table, extract lineage by parsing query
106
+ history SQL for `CREATE TABLE AS SELECT`, `INSERT INTO ... SELECT`, and `MERGE INTO` patterns.
107
+
108
+ Simplified example regex:
109
+ ```python
110
+ import re
111
+
112
+ CTAS_PATTERN = re.compile(
113
+     r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\S+)\s+AS\s+SELECT",
113
+     re.IGNORECASE,
114
+ )
115
+ INSERT_PATTERN = re.compile(
116
+     r"INSERT\s+(?:OVERWRITE\s+)?(?:INTO\s+)?(\S+).*?FROM\s+(\S+)",
117
+     re.IGNORECASE | re.DOTALL,
119
+ )
120
+ ```
121
+
122
+ For Snowflake, BigQuery, and Redshift the query history tables provide this SQL.
123
+ For Databricks, use `system.access.table_lineage` directly (no parsing needed).
124
+ For Hive, parse the HiveServer2 log file.
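To show how the two patterns above might be applied, the sketch below turns one SQL statement into a (destination, sources) pair. The `SOURCE_PATTERN` regex and the `extract_edge` helper are assumptions added for illustration. The sketch does not handle `MERGE INTO`, subqueries, CTE names, or quoted identifiers, so it is a starting point rather than a parser.

```python
import re

# The two patterns from the text above.
CTAS_PATTERN = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\S+)\s+AS\s+SELECT",
    re.IGNORECASE,
)
INSERT_PATTERN = re.compile(
    r"INSERT\s+(?:OVERWRITE\s+)?(?:INTO\s+)?(\S+).*?FROM\s+(\S+)",
    re.IGNORECASE | re.DOTALL,
)
# Assumption: source tables are unquoted identifiers that follow FROM or JOIN.
SOURCE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def extract_edge(sql):
    """Return (destination, sorted sources) for a CTAS or INSERT ... SELECT.

    Returns None when the statement matches neither pattern or has no sources.
    """
    ctas = CTAS_PATTERN.search(sql)
    if ctas:
        destination, rest = ctas.group(1), sql[ctas.end():]
    else:
        insert = INSERT_PATTERN.search(sql)
        if not insert:
            return None
        destination, rest = insert.group(1), sql[insert.end(1):]
    sources = sorted(
        {s for s in SOURCE_PATTERN.findall(rest) if s.lower() != destination.lower()}
    )
    return (destination, sources) if sources else None
```

Each returned pair corresponds to one `LineageEvent` with a single destination and a list of sources.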
125
+
126
+ ## Output manifest (include invocation_id)
127
+
128
+ ```python
129
+ import json
+ from datetime import datetime, timezone
+
+ # `events` is the list of LineageEvent objects passed to send_lineage()
+ manifest = {
130
+     "resource_uuid": resource_uuid,
131
+     "invocation_id": service.extract_invocation_id(result),  # ← save this
132
+     "collected_at": datetime.now(tz=timezone.utc).isoformat(),
133
+     "edges": [
134
+         {
135
+             "destination": {"database": e.destination.database, "table": e.destination.table},
136
+             "sources": [{"database": s.database, "table": s.table} for s in e.sources],
137
+         }
138
+         for e in events
139
+     ],
140
+ }
141
+ with open("lineage_output.json", "w") as f:
142
+     json.dump(manifest, f, indent=2)
143
+ ```
144
+
145
+ ## How push lineage is distinguished from query-derived lineage
146
+
147
+ Push-ingested lineage nodes and edges carry `origin = push_ingest` in Neo4j and
148
+ `origin_type = DIRECT_LINEAGE` in the normalized lineage model. This prevents the lineage
149
+ DAG from overwriting them with query-log-derived edges and gives MC a clear audit trail.
150
+
151
+ ## Neo4j node expiry
152
+
153
+ Push-ingested **table lineage** nodes and edges are written with `expire_at = 9999-12-31`
154
+ (never expire). This is handled internally by PushLineageProcessor — you do not need to set
155
+ this manually when using `send_lineage()`.
156
+
157
+ Push-ingested **column lineage** expires after **10 days**, same as pulled column lineage.
158
+
159
+ For custom nodes created via GraphQL mutations, you **do** need to set
160
+ `expireAt: "9999-12-31"` explicitly — see `references/custom-lineage.md`.
package/bundled-skills/monte-carlo-push-ingestion/references/push-metadata.md
@@ -0,0 +1,158 @@
1
+ # Pushing Table Metadata
2
+
3
+ ## Overview
4
+
5
+ Metadata push sends three types of signals per table:
6
+ - **Schema** — column names and types
7
+ - **Volume** — row count and byte count
8
+ - **Freshness** — last update timestamp
9
+
10
+ All three travel together in a single `RelationalAsset` object via `POST /ingest/v1/metadata`.
11
+
12
+ **Expiration**: Pushed table metadata **does not expire**. Once pushed, it remains in Monte
13
+ Carlo until explicitly deleted via `deletePushIngestedTables`.
14
+
15
+ **Batching**: For large numbers of tables, split assets into batches. The compressed request
16
+ body must not exceed **1 MB** (Kinesis limit).
17
+
18
+ ## pycarlo models
19
+
20
+ ```python
21
+ from pycarlo.features.ingestion import (
22
+     IngestionService,
23
+     RelationalAsset,
24
+     AssetMetadata,
25
+     AssetField,
26
+     AssetVolume,
27
+     AssetFreshness,
28
+ )
29
+ ```
30
+
31
+ ## Minimal example
32
+
33
+ ```python
34
+ asset = RelationalAsset(
35
+     type="TABLE",  # ONLY "TABLE" or "VIEW"; normalize warehouse-native values
36
+     metadata=AssetMetadata(
37
+         name="orders",
38
+         database="analytics",
39
+         schema="public",
40
+         description="Order transactions",
41
+     ),
42
+     fields=[
43
+         AssetField(name="order_id", type="INTEGER"),
44
+         AssetField(name="amount", type="DECIMAL"),
45
+         AssetField(name="created_at", type="TIMESTAMP"),
46
+     ],
47
+     volume=AssetVolume(
48
+         row_count=1_500_000,
49
+         byte_count=250_000_000,
50
+     ),
51
+     freshness=AssetFreshness(
52
+         last_update_time="2024-03-01T12:00:00Z",  # ISO 8601 string, NOT a datetime object
53
+     ),
54
+ )
55
+
56
+ result = service.send_metadata(
57
+     resource_uuid="<your-resource-uuid>",
58
+     resource_type="data-lake",  # see note below on resource_type
59
+     events=[asset],
60
+ )
61
+ invocation_id = service.extract_invocation_id(result)
62
+ print("invocation_id:", invocation_id) # save this!
63
+ ```
64
+
65
+ ## resource_type
66
+
67
+ The `resource_type` value must match the type of the MC resource (warehouse connection) you
68
+ are pushing to. Use the same string that appears in the MC UI or the `connectionType` field
69
+ from `getUser { account { warehouses { connectionType } } }`.
70
+
71
+ Common values:
72
+ - `"data-lake"` — Hive, EMR, Glue, generic data lake connections
73
+ - `"snowflake"` — Snowflake
74
+ - `"bigquery"` — BigQuery
75
+ - `"databricks"` — Databricks Unity Catalog
76
+ - `"redshift"` — Redshift
77
+
78
+ ## Asset type
79
+
80
+ The `type` parameter on `RelationalAsset` must be one of two values (uppercase):
81
+ - `"TABLE"` — tables, external tables, dynamic tables, materialized views, etc.
82
+ - `"VIEW"` — views, secure views
83
+
84
+ **Important**: Warehouse-native type values like `"BASE TABLE"` (Snowflake), `"MANAGED"` /
85
+ `"EXTERNAL"` (Databricks), or `"MATERIALIZED_VIEW"` (BigQuery) are **NOT accepted** by the
86
+ MC API and will cause a 400 error. Always normalize to `"TABLE"` or `"VIEW"` before pushing.
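The normalization step could look like the sketch below. Only the native values named in the text (`BASE TABLE`, `MANAGED`, `EXTERNAL`, `MATERIALIZED_VIEW`) come from the source. The remaining spellings in the two sets, and the function name, are assumptions for illustration and should be checked against what each warehouse actually returns. Unknown values raise an error so they are not silently pushed as the wrong type.

```python
# Native values that map to "TABLE". "BASE TABLE", "MANAGED", "EXTERNAL" and
# "MATERIALIZED VIEW" come from the text; the other spellings are assumptions.
TABLE_TYPES = {
    "TABLE", "BASE TABLE", "MANAGED", "EXTERNAL", "EXTERNAL TABLE",
    "DYNAMIC TABLE", "MATERIALIZED VIEW",
}
# Native values that map to "VIEW". "SECURE VIEW" is an assumed spelling.
VIEW_TYPES = {"VIEW", "SECURE VIEW"}


def normalize_asset_type(native_type):
    """Map a warehouse-native table type to "TABLE" or "VIEW"."""
    cleaned = native_type.strip().upper().replace("_", " ")
    if cleaned in VIEW_TYPES:
        return "VIEW"
    if cleaned in TABLE_TYPES:
        return "TABLE"
    raise ValueError(f"unrecognized table type: {native_type!r}")
```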
87
+
88
+ ## Field types
89
+
90
+ Normalize to SQL-standard uppercase strings. Monte Carlo accepts any string, but canonical
91
+ values like `INTEGER`, `BIGINT`, `VARCHAR`, `FLOAT`, `BOOLEAN`, `TIMESTAMP`, `DATE`,
92
+ `DECIMAL`, `ARRAY`, `STRUCT` work best with downstream features.
93
+
94
+ ## Volume and freshness are optional
95
+
96
+ If your warehouse doesn't expose row counts or last-modified timestamps, omit `volume`
97
+ and/or `freshness` — schema-only metadata is valid.
98
+
99
+ If you send `freshness`, each push must carry a **changed** `last_update_time` to count as
100
+ a new data point for the anomaly detector (repeated identical timestamps don't advance the
101
+ training clock).
102
+
103
+ ## Freshness + volume only mode (skip schema)
104
+
105
+ For periodic pushes (e.g. hourly cron), you often don't need to re-collect the full schema
106
+ on every run — field definitions rarely change. Collection scripts can support a
107
+ `--only-freshness-and-volume` flag that skips the `COLUMNS` / `INFORMATION_SCHEMA` query
108
+ and omits `fields` from the manifest. This is significantly faster on warehouses with many
109
+ tables. Use the full collection (with fields) on the first push and on a daily schedule,
110
+ and the freshness+volume only mode for hourly pushes in between. See the
111
+ [BigQuery Iceberg example](https://github.com/monte-carlo-data/mcd-public-resources/tree/main/examples/push-ingestion/bigquery/push-iceberg-tables)
112
+ for a working implementation of this pattern.
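A collection script might wire up the flag roughly as follows. This is a sketch under assumptions, not the linked implementation: `collect_table` and the `fetch_fields` callback are hypothetical stand-ins for whatever runs the column query, and the manifest keys are illustrative.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for a collection script."""
    parser = argparse.ArgumentParser(description="Collect table metadata (sketch)")
    parser.add_argument(
        "--only-freshness-and-volume",
        action="store_true",
        help="skip the column query and omit `fields` from the manifest",
    )
    return parser.parse_args(argv)


def collect_table(table, only_freshness_and_volume, fetch_fields):
    """Build one manifest entry for `table` (a dict of already-collected values).

    `fetch_fields` is a hypothetical callback that runs the column query;
    it is not called at all in freshness+volume only mode.
    """
    entry = {
        "database": table["database"],
        "schema": table["schema"],
        "table": table["name"],
        "row_count": table.get("row_count"),
        "last_update_time": table.get("last_update_time"),
    }
    if not only_freshness_and_volume:
        entry["fields"] = fetch_fields(table)
    return entry
```

Passing the column query in as a callback keeps the expensive step out of the hourly path entirely.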
113
+
114
+ ## Batch multiple tables
115
+
116
+ `events` accepts a list. Push all tables in a single call or in batches:
117
+
118
+ ```python
119
+ result = service.send_metadata(
120
+     resource_uuid=resource_uuid,
121
+     resource_type="data-lake",
122
+     events=[asset1, asset2, asset3, ...],
123
+ )
124
+ ```
125
+
126
+ ## Output manifest (include invocation_id)
127
+
128
+ Always write a local manifest so you can trace issues later:
129
+
130
+ ```python
131
+ import json
132
+ from datetime import datetime, timezone
133
+
134
+ manifest = {
135
+     "resource_uuid": resource_uuid,
136
+     "invocation_id": service.extract_invocation_id(result),  # ← critical for debugging
137
+     "collected_at": datetime.now(tz=timezone.utc).isoformat(),
138
+     "assets": [
139
+         {
140
+             "database": a.metadata.database,
141
+             "schema": a.metadata.schema,
142
+             "table": a.metadata.name,
143
+             "row_count": a.volume.row_count if a.volume else None,
144
+             "fields": [{"name": f.name, "type": f.type} for f in a.fields],
145
+         }
146
+         for a in assets
147
+     ],
148
+ }
149
+ with open("metadata_output.json", "w") as f:
150
+     json.dump(manifest, f, indent=2)
151
+ ```
152
+
153
+ ## Push frequency for anomaly detection
154
+
155
+ To keep volume and freshness anomaly detectors active:
156
+ - Push **at most once per hour** (pushing more frequently produces unpredictable behavior)
157
+ - Push **consistently** — gaps longer than a few days will deactivate detectors
158
+ - See `references/anomaly-detection.md` for minimum sample requirements
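A scheduled job can enforce the hourly ceiling with a small guard like the one below. The function name is hypothetical, and how the last push time is stored (for example, read back from the local manifest) is left to the caller.

```python
from datetime import datetime, timedelta, timezone

# "At most once per hour", per the guidance above.
MIN_PUSH_INTERVAL = timedelta(hours=1)


def should_push(last_push_at, now=None):
    """Return True when at least one hour has passed since the last push.

    `last_push_at` is a timezone-aware datetime, or None if nothing has
    been pushed yet.
    """
    now = now or datetime.now(timezone.utc)
    return last_push_at is None or now - last_push_at >= MIN_PUSH_INTERVAL
```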