npm - opencode-skills-collection - Versions diffs - 2.0.0 → 2.0.2 - Mend

opencode-skills-collection 2.0.0 → 2.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (90)

package/bundled-skills/monte-carlo-push-ingestion/references/push-query-logs.md ADDED

@@ -0,0 +1,219 @@
+# Pushing Query Logs
+## Overview
+Query logs let Monte Carlo build table usage history, populate query lineage, and surface
+query-level insights in the catalog. Push them via `POST /ingest/v1/querylogs`.
+**Important timing note**: MC processes pushed query logs asynchronously. Logs pushed now
+may not be visible in `getAggregatedQueries` for **at least 15-20 minutes**. This is expected
+behavior, not a bug.
+**Expiration**: Pushed query logs expire on the same schedule as pulled query logs.
+**Batching**: For large query log sets, split events into batches. The compressed request body
+must not exceed **1MB** (Kinesis limit). A conservative default is 250 entries per batch.
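The batching rule above can be sketched roughly as follows. This is an illustration only, not pycarlo's own batching logic: `chunk_entries` and `compressed_size` are hypothetical helpers, the entries are plain dicts rather than `QueryLogEntry` objects, and gzip over JSON is an assumed stand-in for however the request body is actually compressed.

```python
import gzip
import json

MAX_COMPRESSED_BYTES = 1_000_000  # conservative reading of the 1MB limit
DEFAULT_BATCH_SIZE = 250          # conservative default from the text above


def compressed_size(batch):
    """Approximate the compressed payload size (assumption: gzip over JSON)."""
    return len(gzip.compress(json.dumps(batch).encode("utf-8")))


def chunk_entries(entries, batch_size=DEFAULT_BATCH_SIZE):
    """Yield batches of at most batch_size entries, halving any batch that is still too large."""
    for i in range(0, len(entries), batch_size):
        pending = [entries[i:i + batch_size]]
        while pending:
            batch = pending.pop(0)
            if len(batch) > 1 and compressed_size(batch) > MAX_COMPRESSED_BYTES:
                mid = len(batch) // 2
                pending[:0] = [batch[:mid], batch[mid:]]  # keep original order
            else:
                yield batch


# Hypothetical sample data: 600 small entries
sample = [{"query_id": f"q-{n}", "query_text": "SELECT 1"} for n in range(600)]
batches = list(chunk_entries(sample))
```

Each yielded batch would then be passed to its own `send_query_logs` call. Halving oversized batches keeps the entry-count default simple while still respecting the size limit for unusually long SQL text.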
+## pycarlo model
+```python
+from pycarlo.features.ingestion import IngestionService, QueryLogEntry
+```
+`QueryLogEntry` required fields:
+- `start_time` (`datetime`) — when the query started
+- `end_time` (`datetime`) — when the query finished (**required**, easy to miss)
+- `query_text` (`str`) — the SQL statement
+Optional fields:
+- `query_id` (`str`) — warehouse-assigned query ID
+- `user` (`str`) — user/email who ran the query
+- `returned_rows` (`int`) — rows returned to the client
+- `default_database` (`str`) — default database context
+## Basic example
+```python
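+from datetime import datetime, timezone
+# `service` is assumed to be an existing IngestionService instance; its construction is not shown here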
+entries = [
+    QueryLogEntry(
+        start_time=datetime(2024, 3, 1, 10, 0, 0, tzinfo=timezone.utc),
+        end_time=datetime(2024, 3, 1, 10, 0, 5, tzinfo=timezone.utc),
+        query_text="SELECT * FROM analytics.public.orders WHERE status = 'pending'",
+        query_id="query-abc-123",
+        user="analyst@company.com",
+        returned_rows=847,
+    ),
+]
+result = service.send_query_logs(
+    resource_uuid="<your-resource-uuid>",
+    log_type="snowflake",   # ← warehouse-specific! see table below
+    entries=entries,
+)
+invocation_id = service.extract_invocation_id(result)
+print("invocation_id:", invocation_id)
+```
+## log_type per warehouse
+**Important**: the query-log endpoint uses `log_type`, not `resource_type`. This is the only
+push endpoint where the field name differs from metadata/lineage. The `log_type` value must
+match what the MC normalizer expects for your warehouse. Using the wrong value causes:
+`ValueError: Unsupported ingest query-log log_type: <value>`
+| Warehouse | log_type |
+|---|---|
+| Snowflake | `"snowflake"` |
+| BigQuery | `"bigquery"` |
+| Databricks | `"databricks"` |
+| Redshift | `"redshift"` |
+| Hive (EMR/S3) | `"hive-s3"` |
+| Athena | `"athena"` |
+| Teradata | `"teradata"` |
+| ClickHouse | `"clickhouse"` |
+| Databricks (SQL Warehouse) | `"databricks-metastore-sql-warehouse"` |
+| S3 | `"s3"` |
+| Presto (S3) | `"presto-s3"` |
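A client-side check can catch a wrong `log_type` before the request is sent. The sketch below is an assumption-laden illustration: `SUPPORTED_LOG_TYPES` contains only the values listed in the table above (the service may accept others), and `check_log_type` is a hypothetical helper that is not part of pycarlo.

```python
# Values copied from the table above; not guaranteed to be exhaustive.
SUPPORTED_LOG_TYPES = frozenset({
    "snowflake",
    "bigquery",
    "databricks",
    "redshift",
    "hive-s3",
    "athena",
    "teradata",
    "clickhouse",
    "databricks-metastore-sql-warehouse",
    "s3",
    "presto-s3",
})


def check_log_type(log_type):
    """Return log_type unchanged, or raise the same kind of error described above."""
    if log_type not in SUPPORTED_LOG_TYPES:
        raise ValueError(f"Unsupported ingest query-log log_type: {log_type}")
    return log_type


validated = check_log_type("hive-s3")
```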
+## Warehouse-specific fields
+Some warehouses support extra fields beyond the base `QueryLogEntry`. Pass them as keyword
+arguments — the normalizer knows which fields are valid per warehouse.
+**Snowflake extras:**
+```python
+QueryLogEntry(
+    ...
+    bytes_scanned=1024000,
+    warehouse_name="COMPUTE_WH",
+    warehouse_size="X-Small",
+    role_name="ANALYST",
+    query_tag="reporting",
+    execution_status="SUCCESS",
+)
+```
+**BigQuery extras:**
+```python
+QueryLogEntry(
+    ...
+    total_bytes_billed=10485760,
+    statement_type="SELECT",
+    job_type="QUERY",
+    default_dataset="analytics.public",
+)
+```
+**Athena extras:**
+```python
+QueryLogEntry(
+    ...
+    bytes_scanned=2048000,
+    catalog="AwsDataCatalog",
+    database="analytics",
+    output_location="s3://my-bucket/results/",
+    state="SUCCEEDED",
+)
+```
+## Collecting query logs per warehouse
+### Snowflake
+```sql
+SELECT
+    query_id,
+    query_text,
+    start_time,
+    end_time,
+    user_name,
+    database_name,
+    warehouse_name,
+    bytes_scanned,
+    rows_produced AS returned_rows,
+    execution_status
+FROM snowflake.account_usage.query_history
+WHERE start_time >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
+  AND start_time < DATEADD(hour, -1, CURRENT_TIMESTAMP())   -- skip the last hour (view latency)
+  AND execution_status = 'SUCCESS'
+ORDER BY start_time
+```
+Note: `ACCOUNT_USAGE` views have up to 45 minutes of latency. Don't collect the last hour.
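When the collection window is computed in Python rather than in SQL, the latency note can be applied as below. This is a minimal sketch: `collection_window` is a hypothetical helper, and the 24-hour lookback and 1-hour lag are simply the figures used in this section.

```python
from datetime import datetime, timedelta, timezone


def collection_window(now, lookback_hours=24, lag_hours=1):
    """Return (start, end) so the window ends lag_hours before now."""
    start = now - timedelta(hours=lookback_hours)
    end = now - timedelta(hours=lag_hours)
    return start, end


now = datetime(2024, 3, 2, 12, 0, 0, tzinfo=timezone.utc)
window_start, window_end = collection_window(now)
```

Ending the window an hour early means the most recent queries are picked up by the next run instead of being missed while the view catches up.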
+### BigQuery
+```python
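+from google.cloud import bigquery
+# project_id, start_dt, and end_dt (datetimes) are supplied by the caller
+client = bigquery.Client(project=project_id)
+jobs = client.list_jobs(all_users=True, min_creation_time=start_dt, max_creation_time=end_dt)
+rows = []
+for job in jobs:
+    # Only query jobs expose SQL text; skip load, copy, and extract jobs
+    if getattr(job, "query", None):
+        rows.append({
+            "query_id": job.job_id, "query_text": job.query,
+            "start_time": job.created, "end_time": job.ended, "user": job.user_email,
+        })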
+### Databricks
+```sql
+SELECT
+    statement_id AS query_id,
+    statement_text AS query_text,
+    start_time,
+    end_time,
+    executed_by AS user,
+    produced_rows AS returned_rows
+FROM system.query.history
+WHERE start_time >= DATEADD(HOUR, -24, NOW())
+  AND execution_status = 'FINISHED'
+```
+### Redshift (modern clusters)
+```sql
+SELECT
+    query_id,
+    query_text,   -- truncated at 4000 chars; assemble long queries from SYS_QUERY_TEXT
+    start_time,
+    end_time,
+    user_id,
+    status
+FROM sys_query_history
+WHERE start_time >= DATEADD(hour, -24, GETDATE())
+  AND status = 'success'
+```
+For long queries (text > 4000 chars), assemble the full text from `SYS_QUERY_TEXT`:
+```sql
+SELECT query_id, LISTAGG(text, '') WITHIN GROUP (ORDER BY sequence) AS full_text
+FROM sys_query_text
+WHERE query_id = <id>
+GROUP BY query_id
+```
+### Hive
+Parse the HiveServer2 log file (default: `/tmp/root/hive.log`) for lines matching:
+```
+(Executing|Starting) command\(queryId=(\S*)\): (?P<command>.*)
+```
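Applying the pattern above from Python might look like the sketch below. The sample log lines are invented for illustration (real HiveServer2 lines may be laid out differently), and `parse_hive_log` is a hypothetical helper.

```python
import re

# Same pattern as above: group 2 is the queryId, "command" is the SQL text.
HIVE_COMMAND_RE = re.compile(
    r"(Executing|Starting) command\(queryId=(\S*)\): (?P<command>.*)"
)


def parse_hive_log(lines):
    """Return (query_id, command) pairs for every line that matches the pattern."""
    found = []
    for line in lines:
        match = HIVE_COMMAND_RE.search(line)
        if match:
            found.append((match.group(2), match.group("command")))
    return found


# Invented sample lines, for illustration only
sample_lines = [
    "2024-03-01T10:00:00 INFO ql.Driver: Executing command(queryId=hive_20240301_0001): SELECT COUNT(*) FROM orders",
    "2024-03-01T10:00:01 INFO ql.Driver: Semantic Analysis Completed",
]
parsed = parse_hive_log(sample_lines)
```

The sketch works line by line, so a statement that spans several log lines is captured only up to the first line break.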
+## Output manifest (include invocation_id)
+```python
+import json
+from datetime import datetime, timezone
+manifest = {
+    "resource_uuid": resource_uuid,
+    "invocation_id": service.extract_invocation_id(result),   # ← save this
+    "collected_at": datetime.now(tz=timezone.utc).isoformat(),
+    "entry_count": len(entries),
+    "window_start": min(e.start_time for e in entries).isoformat(),
+    "window_end": max(e.end_time for e in entries).isoformat(),
+    "queries": [
+        {
+            "query_id": e.query_id,
+            "start_time": e.start_time.isoformat(),
+            "end_time": e.end_time.isoformat(),
+            "returned_rows": e.returned_rows,
+            "query": e.query_text[:200],   # truncate for readability
+        }
+        for e in entries
+    ],
+}
+with open("query_logs_output.json", "w") as f:
+    json.dump(manifest, f, indent=2)
+```

package/bundled-skills/monte-carlo-push-ingestion/references/validation.md ADDED

@@ -0,0 +1,257 @@
+# Validating Pushed Data
+All verification queries go to the GraphQL endpoint at `https://api.getmontecarlo.com/graphql` and authenticate with the **GraphQL API key**.
+---
+## Resolve a table's MCON and fullTableId
+Before running most queries you need either the `mcon` or `fullTableId`.
+`fullTableId` format: `<database>:<schema>.<table>` — e.g. `analytics:public.orders`
+```graphql
+query GetTable($fullTableId: String!, $dwId: UUID!) {
+  getTable(fullTableId: $fullTableId, dwId: $dwId) {
+    mcon
+    fullTableId
+    displayName
+  }
+}
+```
+Variables:
+```json
+{
+  "fullTableId": "analytics:public.orders",
+  "dwId": "<warehouse-uuid>"
+}
+```
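The `fullTableId` format described above is easy to get wrong by hand, so a tiny builder can help. This is a sketch only: `full_table_id` is a hypothetical helper, and it does no case normalization or escaping, which may or may not match what the API expects.

```python
def full_table_id(database, schema, table):
    """Build "<database>:<schema>.<table>" as described above."""
    return f"{database}:{schema}.{table}"


variables = {
    "fullTableId": full_table_id("analytics", "public", "orders"),
    "dwId": "<warehouse-uuid>",  # placeholder, as in the example above
}
```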
+---
+## Verify metadata (schema + columns)
+```graphql
+query GetTableMetadata($mcon: String!) {
+  getTable(mcon: $mcon) {
+    mcon
+    fullTableId
+    versions {
+      edges {
+        node {
+          fields {
+            name
+            fieldType
+          }
+        }
+      }
+    }
+  }
+}
+```
+Check that the fields list matches your pushed schema.
+---
+## Verify volume and freshness metrics
+Use `getMetricsV4` to fetch row counts and last-modified timestamps:
+```graphql
+query GetMetrics(
+  $mcon: String!
+  $metricName: String!
+  $startTime: DateTime!
+  $endTime: DateTime!
+) {
+  getMetricsV4(
+    dwId: null
+    mcon: $mcon
+    metricName: $metricName
+    startTime: $startTime
+    endTime: $endTime
+  ) {
+    metricsJson
+  }
+}
+```
+Variables (row count):
+```json
+{
+  "mcon": "<table-mcon>",
+  "metricName": "total_row_count",
+  "startTime": "2024-03-01T00:00:00Z",
+  "endTime": "2024-03-02T00:00:00Z"
+}
+```
+`metricsJson` is a JSON string. Parse it and look for `value` and `measurementTimestamp`
+(camelCase) in each data point.
+Other useful metric names:
+- `"total_row_count"` — row count
+- `"total_byte_count"` — byte size
+- `"total_row_count_last_changed_on"` — Unix epoch float of when the row count last changed
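Parsing `metricsJson` could look like the sketch below. The exact shape of the decoded value is an assumption here: the code treats it as a JSON array of data points that each carry `value` and `measurementTimestamp`, with ISO-8601 timestamps in a uniform format so that string comparison orders them correctly. `latest_point` and the sample payload are hypothetical.

```python
import json


def latest_point(metrics_json):
    """Decode metricsJson and return the most recent data point, or None if empty."""
    points = json.loads(metrics_json)  # assumption: a list of data-point objects
    if not points:
        return None
    return max(points, key=lambda p: p["measurementTimestamp"])


# Invented payload, for illustration only
sample_metrics_json = json.dumps([
    {"value": 1200, "measurementTimestamp": "2024-03-01T06:00:00Z"},
    {"value": 1275, "measurementTimestamp": "2024-03-01T18:00:00Z"},
])
latest = latest_point(sample_metrics_json)
```

Comparing `latest["value"]` with the row count that was pushed is then a quick check. If the real payload nests the points under another key, only the `json.loads` line needs adjusting.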
+---
+## Verify table lineage
+```graphql
+query GetTableLineage($mcon: String!) {
+  getTableLineage(mcon: $mcon, direction: "upstream", hops: 1) {
+    connectedNodes {
+      mcon
+      displayName
+      objectType
+    }
+    flattenedEdges {
+      directlyConnectedMcons
+    }
+  }
+}
+```
+Check that your expected source tables appear in `connectedNodes` or
+`flattenedEdges[].directlyConnectedMcons`.
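The check described above can be automated with a small set comparison. In this sketch `missing_upstreams` is a hypothetical helper, `lineage` is the `getTableLineage` object from the response, and the MCON strings are placeholders rather than real identifiers.

```python
def missing_upstreams(lineage, expected_mcons):
    """Return the expected MCONs that appear in neither connectedNodes nor flattenedEdges."""
    seen = {node["mcon"] for node in lineage.get("connectedNodes") or []}
    for edge in lineage.get("flattenedEdges") or []:
        seen.update(edge.get("directlyConnectedMcons") or [])
    return sorted(set(expected_mcons) - seen)


# Placeholder response, for illustration only
sample_lineage = {
    "connectedNodes": [{"mcon": "mcon-a", "displayName": "raw.orders", "objectType": "table"}],
    "flattenedEdges": [{"directlyConnectedMcons": ["mcon-b"]}],
}
missing = missing_upstreams(sample_lineage, ["mcon-a", "mcon-b", "mcon-c"])
```

An empty result means every expected source table was found.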
+---
+## Verify column lineage
+```graphql
+query GetColumnLineage($mcon: String!, $column: String!) {
+  getDerivedTablesPartialLineage(mcon: $mcon, column: $column, pageSize: 1000) {
+    destinations {
+      table { mcon displayName }
+      columns { columnName }
+    }
+  }
+}
+```
+Variables: `mcon` = source table MCON, `column` = source column name.
+Check that each destination table and column appears in the response.
+---
+## Verify query logs
+```graphql
+query GetAggregatedQueries(
+  $mcon: String!
+  $queryType: String!
+  $startTime: DateTime!
+  $endTime: DateTime!
+  $first: Int
+  $after: String
+) {
+  getAggregatedQueries(
+    mcon: $mcon
+    queryType: $queryType
+    startTime: $startTime
+    endTime: $endTime
+    first: $first
+    after: $after
+  ) {
+    edges { node { queryHash queryCount lastSeen } }
+    pageInfo { hasNextPage endCursor }
+  }
+}
+```
+Variables:
+```json
+{
+  "mcon": "<table-mcon>",
+  "queryType": "read",
+  "startTime": "2024-03-01T00:00:00Z",
+  "endTime": "2024-03-02T00:00:00Z",
+  "first": 100
+}
+```
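Because the query exposes `pageInfo`, results beyond the first page can be collected by following `endCursor`. The sketch below assumes a caller-supplied `run_query(variables)` callable that returns the `getAggregatedQueries` object (with `edges` and `pageInfo`). `fetch_all_nodes` and the fake two-page backend are hypothetical.

```python
def fetch_all_nodes(run_query, variables, page_size=100):
    """Follow endCursor until hasNextPage is false and return every node."""
    nodes, cursor = [], None
    while True:
        page = run_query({**variables, "first": page_size, "after": cursor})
        nodes.extend(edge["node"] for edge in page["edges"])
        if not page["pageInfo"]["hasNextPage"]:
            return nodes
        cursor = page["pageInfo"]["endCursor"]


def fake_run_query(variables):
    """Stand-in for a real GraphQL call: serves two fixed pages keyed by cursor."""
    pages = {
        None: {
            "edges": [{"node": {"queryHash": "h1"}}],
            "pageInfo": {"hasNextPage": True, "endCursor": "c1"},
        },
        "c1": {
            "edges": [{"node": {"queryHash": "h2"}}],
            "pageInfo": {"hasNextPage": False, "endCursor": None},
        },
    }
    return pages[variables["after"]]


all_nodes = fetch_all_nodes(fake_run_query, {"mcon": "<table-mcon>", "queryType": "read"})
```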
+**Remember**: query logs are processed asynchronously. Allow at least 15-20 minutes, and up to 1 hour, after a push before expecting results. If you see 0 results
+immediately after pushing, wait and try again.
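Since pushed query logs are processed asynchronously, a verification script may prefer to poll rather than fail on the first empty response. The sketch below is one way to do that: `wait_for_results` is a hypothetical helper, `fetch` is any caller-supplied callable that returns the current list of results, and the default of six attempts ten minutes apart is an arbitrary choice, not a documented recommendation.

```python
import time


def wait_for_results(fetch, attempts=6, delay_seconds=600, sleep=time.sleep):
    """Call fetch() until it returns a non-empty result or the attempts run out."""
    for attempt in range(attempts):
        result = fetch()
        if result:
            return result
        if attempt < attempts - 1:
            sleep(delay_seconds)  # no sleep after the final attempt
    return []


# Demonstration with a fake fetch and a recording sleep, so nothing actually waits
responses = iter([[], [], [{"queryHash": "h1"}]])
waits = []
found = wait_for_results(lambda: next(responses), sleep=waits.append)
```

Passing `sleep` as a parameter keeps the helper testable without real delays.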
+---
+## Check detector thresholds (anomaly detection status)
+```graphql
+query GetDetectorStatus($mcon: String!) {
+  getTable(mcon: $mcon) {
+    thresholds {
+      freshness {
+        lower { value }
+        upper { value }
+        status
+      }
+      size {
+        lower { value }
+        upper { value }
+        status
+      }
+    }
+  }
+}
+```
+`status` will be `"no data"` or `"inactive"` on a newly-pushed table. Detectors need
+historical data to train — see `references/anomaly-detection.md` for requirements.
+---
+## Table management operations
+### Delete push-ingested tables
+Only works on push-ingested tables — pull-collected tables are excluded by default.
+```graphql
+mutation DeletePushTables($mcons: [String!]!) {
+  deletePushIngestedTables(mcons: $mcons) {
+    success
+    deletedCount
+  }
+}
+```
+Variables:
+```json
+{
+  "mcons": ["<mcon-1>", "<mcon-2>"]
+}
+```
+Resolve MCONs first with `getTable(fullTableId: ..., dwId: ...)`.
+---
+## Python helper
+```python
+import requests, json
+GRAPHQL_URL = "https://api.getmontecarlo.com/graphql"
+def graphql(query: str, variables: dict, key_id: str, key_token: str) -> dict:
+    resp = requests.post(
+        GRAPHQL_URL,
+        json={"query": query, "variables": variables},
+        headers={
+            "x-mcd-id": key_id,
+            "x-mcd-token": key_token,
+            "Content-Type": "application/json",
+        },
+        timeout=30,  # avoid hanging indefinitely on a stalled connection
+    )
+    resp.raise_for_status()
+    data = resp.json()
+    if "errors" in data:
+        raise RuntimeError(json.dumps(data["errors"], indent=2))
+    return data["data"]
+```