@clickzetta/cz-cli-darwin-x64 0.3.16 → 0.3.18
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/cz-cli +0 -0
- package/bin/skills/clickzetta-batch-sync-pipeline/SKILL.md +386 -0
- package/bin/skills/clickzetta-cdc-sync-pipeline/SKILL.md +548 -0
- package/bin/skills/clickzetta-data-ingest-pipeline/SKILL.md +220 -0
- package/bin/skills/clickzetta-data-ingest-pipeline/eval_cases.jsonl +5 -0
- package/bin/skills/clickzetta-dynamic-table/SKILL.md +112 -0
- package/bin/skills/clickzetta-dynamic-table/best-practices/dimension-table-join-guide.md +257 -0
- package/bin/skills/clickzetta-dynamic-table/best-practices/medallion-and-stream-patterns.md +124 -0
- package/bin/skills/clickzetta-dynamic-table/best-practices/non-partitioned-merge-into-warning.md +96 -0
- package/bin/skills/clickzetta-dynamic-table/best-practices/performance-optimization.md +109 -0
- package/bin/skills/clickzetta-file-import-pipeline/SKILL.md +156 -0
- package/bin/skills/clickzetta-kafka-ingest-pipeline/SKILL.md +751 -0
- package/bin/skills/clickzetta-kafka-ingest-pipeline/eval_cases.jsonl +5 -0
- package/bin/skills/clickzetta-kafka-ingest-pipeline/references/kafka-pipe-syntax.md +324 -0
- package/bin/skills/clickzetta-oss-ingest-pipeline/SKILL.md +537 -0
- package/bin/skills/clickzetta-query-optimizer/SKILL.md +156 -0
- package/bin/skills/clickzetta-query-optimizer/references/explain.md +56 -0
- package/bin/skills/clickzetta-query-optimizer/references/hints-and-sortkey.md +78 -0
- package/bin/skills/clickzetta-query-optimizer/references/optimize.md +65 -0
- package/bin/skills/clickzetta-query-optimizer/references/result-cache.md +49 -0
- package/bin/skills/clickzetta-query-optimizer/references/show-jobs.md +42 -0
- package/bin/skills/clickzetta-realtime-sync-pipeline/SKILL.md +276 -0
- package/bin/skills/clickzetta-sql-pipeline-manager/SKILL.md +379 -0
- package/bin/skills/clickzetta-sql-pipeline-manager/evals/evals.json +166 -0
- package/bin/skills/clickzetta-sql-pipeline-manager/references/dynamic-table.md +185 -0
- package/bin/skills/clickzetta-sql-pipeline-manager/references/materialized-view.md +129 -0
- package/bin/skills/clickzetta-sql-pipeline-manager/references/pipe.md +222 -0
- package/bin/skills/clickzetta-sql-pipeline-manager/references/table-stream.md +125 -0
- package/bin/skills/clickzetta-table-stream-pipeline/SKILL.md +206 -0
- package/bin/skills/clickzetta-vcluster-manager/SKILL.md +212 -0
- package/bin/skills/clickzetta-vcluster-manager/references/vc-cache.md +54 -0
- package/bin/skills/clickzetta-vcluster-manager/references/vcluster-ddl.md +150 -0
- package/bin/skills/clickzetta-volume-manager/SKILL.md +292 -0
- package/bin/skills/clickzetta-volume-manager/references/volume-ddl.md +199 -0
- package/package.json +1 -1
- /package/bin/skills/{dt-creator → clickzetta-dynamic-table/dt-creator}/SKILL.md +0 -0
- /package/bin/skills/{dt-creator → clickzetta-dynamic-table/dt-creator}/references/dt-declaration-strategy.md +0 -0
- /package/bin/skills/{dt-creator → clickzetta-dynamic-table/dt-creator}/references/incremental-config-reference.md +0 -0
- /package/bin/skills/{dt-creator → clickzetta-dynamic-table/dt-creator}/references/refresh-history-guide.md +0 -0
- /package/bin/skills/{dt-creator → clickzetta-dynamic-table/dt-creator}/references/sql-limitations.md +0 -0
- /package/bin/skills/{dynamic-table-alter → clickzetta-dynamic-table/dynamic-table-alter}/SKILL.md +0 -0
@@ -0,0 +1,379 @@
---
name: clickzetta-sql-pipeline-manager
description: >
  Manages SQL data pipeline objects in ClickZetta Lakehouse, including Dynamic Tables,
  Materialized Views, Table Streams, and Pipes.
  Covers the full lifecycle: create, alter, suspend/resume, drop, and status inspection.
  Covers SQL commands only; the Lakehouse Studio GUI is out of scope.

  Triggers when the user says "create a dynamic table", "create a materialized view",
  "create a Pipe", "create a table stream", "suspend/resume a dynamic table",
  "view refresh history", "change the refresh interval", "ingest from Kafka",
  "continuous import from object storage", "CDC change capture", "incremental computation",
  "real-time ETL", "data pipeline", "pipeline", "stream processing",
  "dynamic table refresh failed", "design an ETL for me", "build a data pipeline",
  "data ingestion plan", "Medallion Architecture", "Bronze Silver Gold",
  "lakehouse layering", "Bronze layer", "Silver layer", or "Gold layer".
  Keywords: SQL pipeline, dynamic table, materialized view, table stream, Pipe, data pipeline
---

# ClickZetta SQL Data Pipeline Management
## ⚠️ Key Syntax Differences: ClickZetta vs. Standard SQL / Snowflake

These are the easiest things to get wrong; the ClickZetta-specific syntax below is required:

| Feature | ❌ Wrong (Snowflake / standard SQL) | ✅ Correct ClickZetta syntax |
|---|---|---|
| Dynamic table compute cluster | `WAREHOUSE = compute_wh` | `vcluster default` (the name follows directly, no equals sign) |
| Dynamic table refresh schedule | `TARGET_LAG = '1 minutes'` | `REFRESH INTERVAL 1 MINUTE vcluster default` |
| Kafka read function | `TABLE(READ_KAFKA(KAFKA_BROKER => ...))` | `read_kafka('broker', 'topic', '', 'group', '', '', '', '', 'raw', 'raw', 0, MAP(...))` (positional arguments) |
| Materialized view scheduled refresh | `REFRESH EVERY 1 HOUR` | `REFRESH INTERVAL 60 MINUTE vcluster default` (same syntax as dynamic tables) |
| Materialized view manual refresh | `REFRESH MATERIALIZED VIEW` inside CREATE | Run `REFRESH MATERIALIZED VIEW <name>;` as a separate statement |
| Changing a dynamic table's SQL | `ALTER DYNAMIC TABLE ... AS ...` | `CREATE OR REPLACE DYNAMIC TABLE ...` (ALTER cannot modify the AS clause) |
| JSON field access | `$1:field::TYPE` or `data:key` | `parse_json(value::string)['field']::TYPE` or `data['key']` |
| COPY INTO import format | `FILE_FORMAT = (TYPE = CSV)` | `USING CSV OPTIONS(...)` |
| COPY INTO export format | `USING CSV` | `FILE_FORMAT = (TYPE = CSV)` |

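To make the first two rows concrete, here is the same dynamic table written both ways; a minimal sketch reusing the table and filter from Scenario A below (names illustrative):

```sql
-- ❌ Snowflake-style DDL that ClickZetta rejects:
-- CREATE DYNAMIC TABLE dwd.orders_clean
--   TARGET_LAG = '1 minutes'
--   WAREHOUSE = compute_wh
-- AS SELECT * FROM ods.orders WHERE amount > 0;

-- ✅ ClickZetta syntax: REFRESH INTERVAL plus a bare vcluster name
CREATE OR REPLACE DYNAMIC TABLE dwd.orders_clean
REFRESH INTERVAL 1 MINUTE vcluster default
AS SELECT * FROM ods.orders WHERE amount > 0;
```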
---

## Pipeline Wizard

When the user wants to design or build a complete data pipeline, this is the highest-priority mode. Trigger phrases include:
"design/build an ETL for me", "a complete data pipeline", "ingest data from Kafka/OSS", "ODS→DWD→DWS", "end-to-end pipeline",
"Medallion Architecture", "Bronze/Silver/Gold", "lakehouse layering".

### Layer Naming Conventions

Users may use different layering vocabularies for the same idea; keep whichever naming the user prefers:

| What the user says | Meaning | Suggested schema naming |
|---|---|---|
| Bronze / Silver / Gold | Medallion Architecture | `bronze` / `silver` / `gold` |
| ODS / DWD / DWS | Layering convention common in Chinese data warehouses | `ods` / `dwd` / `dws` |
| Raw / Cleansed / Aggregated | Generic English terms | `raw` / `cleansed` / `agg` |

**Do not map Bronze to ODS, Silver to DWD, and so on; keep the naming the user chose, and use the matching schema and table-name prefixes directly in the SQL.**

**Schema names must carry a business/project prefix to avoid collisions with other projects.** If the user has not given a prefix, ask for the project or business-domain name, then generate prefixed schema names:

```sql
-- ❌ Collision-prone; do not generate this
CREATE SCHEMA IF NOT EXISTS bronze;

-- ✅ Add a project prefix
CREATE SCHEMA IF NOT EXISTS ecommerce_bronze;
CREATE SCHEMA IF NOT EXISTS ecommerce_silver;
CREATE SCHEMA IF NOT EXISTS ecommerce_gold;
```

### Requirements Gathering

**If the user has already supplied enough information (data source, fields, layering needs, project prefix), generate the complete SQL directly; do not ask again.**

If information is missing, ask about all five points below at once (one combined question, not one-by-one follow-ups):

> To generate the complete pipeline SQL, a few details need confirming:
> 1. **Project/business prefix**: What prefix should schema names carry (e.g. `ecommerce`, `risk`, `ads`)? A prefix is required to avoid collisions when several projects share a workspace.
> 2. **Data source**: Kafka (broker/topic) / object storage (Volume path/format) / an existing table (does it receive UPDATE/DELETE)?
> 3. **Field structure**: Column names and types of the target table?
> 4. **Layering needs**: How many layers, and what does each one do (cleansing/aggregation/dimensional modeling)?
> 5. **Refresh frequency**: Real-time (seconds) / near-real-time (minutes) / low-frequency (hours/days)?

### Generating the Complete SQL

Once answered, generate complete end-to-end SQL that includes all of the following:

```
1. Schema creation (CREATE SCHEMA IF NOT EXISTS, using the user's layer names)
2. Landing-layer tables (when ingesting from an external source)
3. Data entry point (Pipe or Table Stream, chosen by source)
4. Intermediate-layer dynamic tables (cleansing/filtering, REFRESH INTERVAL N MINUTE vcluster <name>)
5. Serving-layer dynamic tables (aggregation/dimensions, REFRESH INTERVAL N MINUTE vcluster <name>)
6. Verification commands (SHOW + REFRESH HISTORY)
7. Operations (SUSPEND/RESUME)
```

**Source → entry-object selection rules:**
- Kafka → `CREATE PIPE ... AS COPY INTO ... FROM (SELECT ... FROM read_kafka('broker', 'topic', '', 'group', '', '', '', '', 'raw', 'raw', 0, MAP(...)))`
- Object storage (OSS/S3/COS) → `CREATE PIPE ... VIRTUAL_CLUSTER = 'name' INGEST_MODE = 'LIST_PURGE' AS COPY INTO ... FROM VOLUME <volume_name> USING <format> PURGE=true` (see the sketch after this list)
- Existing table with UPDATE/DELETE → `CREATE TABLE STREAM ... WITH PROPERTIES ('TABLE_STREAM_MODE' = 'STANDARD')`, with the intermediate layer filtering `__change_type IN ('INSERT', 'UPDATE_AFTER', 'DELETE')`
- Existing table, INSERT-only → a Dynamic Table selecting `FROM` the source table directly

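A minimal sketch of the object-storage rule above, assuming Parquet files on a volume named `my_oss_vol` and a landing table `ods.events` (the pipe and table names are illustrative, and real files may need format options):

```sql
-- Continuously import files landed on object storage;
-- LIST_PURGE mode deletes each file after it is loaded
CREATE PIPE oss_events_pipe
  VIRTUAL_CLUSTER = 'default'
  INGEST_MODE = 'LIST_PURGE'
AS
COPY INTO ods.events
FROM VOLUME my_oss_vol
USING PARQUET        -- swap for CSV OPTIONS(...) etc. to match the files
PURGE=true;
```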
**Refresh-frequency rules:**
- The first transformation layer (Bronze→Silver or ODS→DWD) uses the refresh interval the user asked for (e.g. `REFRESH INTERVAL 1 MINUTE vcluster default`)
- Downstream layers set their own refresh intervals to match business needs (e.g. `REFRESH INTERVAL 5 MINUTE vcluster default`)

---

## Object Type Quick Reference

| Object | Best for | Key characteristics |
|---|---|---|
| **Dynamic Table** | Real-time / near-real-time incremental ETL | Defined in SQL, automatic incremental refresh, second/minute-level latency |
| **Materialized View** | Accelerating fixed aggregations | Precomputed and stored; manual or scheduled full refresh |
| **Table Stream** | CDC change data capture | Captures INSERT/UPDATE/DELETE; consumed together with a Dynamic Table |
| **Pipe** | Continuous data ingestion | Automatic continuous import from Kafka or object storage, no scheduler required |

## Decision Tree

```
User requirement
├── Continuous ingestion from an external source (Kafka / OSS / S3)
│   └── → Pipe
├── Real-time / incremental transformation of an existing table
│   ├── Must see UPDATE/DELETE → Table Stream + Dynamic Table
│   └── INSERT-only appends → Dynamic Table (query the source table directly)
├── Fixed aggregation, no real-time requirement
│   └── → Materialized View
└── Multi-layer ETL (ODS→DWD→DWS or Bronze→Silver→Gold)
    └── → Cascaded Dynamic Tables (each layer with its own REFRESH INTERVAL)
```

## Step 0: Confirm the Connection

Before doing anything, confirm a connection to ClickZetta Lakehouse. See the `clickzetta-lakehouse-connect` skill for connection parameters.

## Step 1: Choose the Object Type

Pick the object type from the decision tree and read the matching reference file:

| Object | Reference file |
|---|---|
| Dynamic Table | [references/dynamic-table.md](references/dynamic-table.md) |
| Materialized View | [references/materialized-view.md](references/materialized-view.md) |
| Table Stream | [references/table-stream.md](references/table-stream.md) |
| Pipe | [references/pipe.md](references/pipe.md) |

## Step 2: Generate and Run the SQL

After reading the reference file, generate complete, runnable SQL from the parameters the user provided.

**Required-parameter checklist:**
- Dynamic Table: `REFRESH INTERVAL N MINUTE vcluster <name>`, the AS query
- Table Stream: source table name, TABLE_STREAM_MODE (STANDARD or APPEND_ONLY)
- Pipe (Kafka): bootstrap_servers, topic, group_id, target table (positional-argument syntax)
- Pipe (object storage): Volume path, file format, target table, `PURGE=true` (LIST_PURGE mode)

If the user does not specify a VCLUSTER, default to `default` (a general-purpose GP cluster).

## Step 3: Verify

```sql
-- Verify dynamic tables
SHOW TABLES WHERE is_dynamic = true;
SHOW DYNAMIC TABLE REFRESH HISTORY <name> LIMIT 5;

-- Verify materialized views
SHOW TABLES WHERE is_materialized_view = true;

-- Verify table streams
SHOW TABLE STREAMS;
SELECT COUNT(*) FROM <stream_name>; -- number of changes waiting to be consumed

-- Verify pipes
SHOW PIPES;
```

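For a single pipe, the troubleshooting table at the end of this document points to `DESC PIPE EXTENDED` for detailed state; a minimal invocation, assuming the pipe name is passed directly (using the pipe created in Scenario A below):

```sql
-- Detailed pipe status, including ingestion state
DESC PIPE EXTENDED kafka_orders_pipe;
```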
---

## Typical Scenarios

### Scenario A: Kafka → Dynamic Tables (real-time ETL)

```sql
-- Step 1: Create a Pipe that continuously ingests Kafka data into the ODS layer
-- ⚠️ Note: ClickZetta does not support CREATE OR REPLACE PIPE; use CREATE PIPE, or DROP then CREATE
CREATE PIPE kafka_orders_pipe
  VIRTUAL_CLUSTER = 'default'
  BATCH_INTERVAL_IN_SECONDS = '60'
AS
COPY INTO ods.orders FROM (
  SELECT
    j['order_id']::STRING,
    j['user_id']::STRING,
    j['amount']::DECIMAL(10,2),
    j['status']::STRING,
    j['created_at']::TIMESTAMP
  FROM (
    SELECT parse_json(value::string) AS j
    FROM read_kafka(
      'kafka.example.com:9092',  -- bootstrap_servers
      'orders',                  -- topic
      '',                        -- reserved
      'lakehouse_ingest',        -- group_id
      '', '', '', '',            -- positional args left empty; managed by the Pipe
      'raw', 'raw', 0,
      MAP('kafka.security.protocol', 'PLAINTEXT')
    )
  )
);

-- Step 2: DWD-layer cleansing via a dynamic table (incremental refresh every minute)
CREATE OR REPLACE DYNAMIC TABLE dwd.orders_clean
REFRESH INTERVAL 1 MINUTE vcluster default
AS
SELECT
  order_id,
  user_id,
  amount,
  UPPER(status) AS status,
  created_at,
  DATE(created_at) AS dt
FROM ods.orders
WHERE amount > 0;

-- Step 3: DWS-layer aggregation via a dynamic table (refresh every 5 minutes)
CREATE OR REPLACE DYNAMIC TABLE dws.order_hourly
REFRESH INTERVAL 5 MINUTE vcluster default
AS
SELECT
  DATE_TRUNC('hour', created_at) AS hour,
  status,
  COUNT(*) AS order_cnt,
  SUM(amount) AS total_amount
FROM dwd.orders_clean
GROUP BY 1, 2;
```

### Scenario B: Table Stream + Dynamic Table (CDC upsert)

```sql
-- Step 1: Create a stream on the source table to capture changes
CREATE TABLE STREAM ods.orders_stream
ON TABLE ods.orders
WITH PROPERTIES ('TABLE_STREAM_MODE' = 'STANDARD');

-- Step 2: A dynamic table consumes the stream, keeping inserted and post-update rows
CREATE OR REPLACE DYNAMIC TABLE dwd.orders_latest
REFRESH INTERVAL 2 MINUTE vcluster default
AS
SELECT order_id, user_id, amount, status, created_at
FROM ods.orders_stream
WHERE __change_type IN ('INSERT', 'UPDATE_AFTER');
```

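If the consumer must also apply DELETEs and keep exactly one current row per key, the stream can instead be drained into a regular table with `MERGE INTO` (the eval cases in this package name that as an alternative way to consume a stream). A sketch, assuming a pre-created regular table `dwd.orders_latest` and at most one pending change per `order_id` in each batch:

```sql
-- Drain pending changes from the stream into a regular table (upsert + delete)
MERGE INTO dwd.orders_latest t
USING (
    SELECT order_id, user_id, amount, status, created_at, __change_type
    FROM ods.orders_stream
    WHERE __change_type IN ('INSERT', 'UPDATE_AFTER', 'DELETE')
) s
ON t.order_id = s.order_id
WHEN MATCHED AND s.__change_type = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET
    t.user_id = s.user_id,
    t.amount = s.amount,
    t.status = s.status,
    t.created_at = s.created_at
WHEN NOT MATCHED AND s.__change_type <> 'DELETE' THEN INSERT
    (order_id, user_id, amount, status, created_at)
    VALUES (s.order_id, s.user_id, s.amount, s.status, s.created_at);
```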
### Scenario C: Materialized View to Accelerate BI Queries

```sql
-- Create a materialized view that refreshes hourly
-- ⚠️ Note: ClickZetta does not support CREATE OR REPLACE MATERIALIZED VIEW
-- Option 1: DROP, then CREATE (recommended)
DROP MATERIALIZED VIEW IF EXISTS dws.mv_daily_revenue;
CREATE MATERIALIZED VIEW dws.mv_daily_revenue
COMMENT 'Daily revenue summary for BI tools'
REFRESH INTERVAL 60 MINUTE vcluster default
AS
SELECT
  DATE(created_at) AS day,
  region,
  SUM(amount) AS revenue,
  COUNT(DISTINCT user_id) AS uv
FROM dwd.orders_clean
GROUP BY 1, 2;

-- Option 2: BUILD DEFERRED + DISABLE QUERY REWRITE (complex; not recommended)
-- CREATE OR REPLACE MATERIALIZED VIEW ... BUILD DEFERRED DISABLE QUERY REWRITE AS ...

-- Trigger a refresh manually
REFRESH MATERIALIZED VIEW dws.mv_daily_revenue;

-- Drop a materialized view (⚠️ must use DROP MATERIALIZED VIEW, not DROP TABLE)
DROP MATERIALIZED VIEW dws.mv_daily_revenue;
```

### Scenario D: Operations

```sql
-- Suspend a dynamic table (e.g. during cluster maintenance)
ALTER DYNAMIC TABLE dwd.orders_clean SUSPEND;

-- Resume it
ALTER DYNAMIC TABLE dwd.orders_clean RESUME;

-- Inspect refresh history to diagnose failures
SHOW DYNAMIC TABLE REFRESH HISTORY dwd.orders_clean LIMIT 10;

-- Pause a pipe
ALTER PIPE kafka_orders_pipe SET PIPE_EXECUTION_PAUSED = true;

-- Resume a pipe
ALTER PIPE kafka_orders_pipe SET PIPE_EXECUTION_PAUSED = false;
```

### Scenario E: Parameterized Dynamic Table (refresh by partition)

Use the `SESSION_CONFIGS()` function to define a parameterized query, then pass a partition value at refresh time to control the full or incremental refresh range:

```sql
-- Create a parameterized dynamic table (parameters defined via SESSION_CONFIGS)
CREATE OR REPLACE DYNAMIC TABLE dwd.orders_partitioned
REFRESH INTERVAL 30 MINUTE vcluster default
AS
SELECT order_id, user_id, amount, status, created_at, DATE(created_at) AS dt
FROM ods.orders
WHERE dt = SESSION_CONFIGS('target_date', CAST(CURRENT_DATE() AS STRING));

-- Trigger a refresh manually, passing the parameter
REFRESH DYNAMIC TABLE dwd.orders_partitioned
WITH PROPERTIES ('target_date' = '2024-06-15');
```

> **When to use**: when converting a traditional daily/hourly full-rebuild ETL job into an incremental one, replace the scheduler variable (e.g. `${bizdate}`) with SESSION_CONFIGS to get parameterized partition refresh.

### Scenario F: DML on Dynamic Tables (manual data fixes)

⚠️ **Important**: ClickZetta dynamic tables **do not support DML** (INSERT/UPDATE/DELETE). To fix data, use one of these approaches:

**Option 1: Fix the source and let the dynamic table rebuild (recommended)**
```sql
-- 1. Correct the data in the source table
-- 2. Wait for the dynamic table's automatic refresh (the next REFRESH INTERVAL does a full refresh)
```

**Option 2: Use a regular table instead of a dynamic table**
```sql
-- For data that needs frequent manual correction, prefer a regular table
-- plus a scheduled job over a dynamic table
CREATE TABLE dwd.orders_manual (
  order_id   STRING,
  user_id    STRING,
  amount     DECIMAL(10,2),
  status     STRING,
  created_at TIMESTAMP,
  dt         DATE
);
```

> ⚠️ **Dynamic table limitations**:
> - Dynamic tables are read-only; INSERT/UPDATE/DELETE are not supported
> - Fix data in the source table; the dynamic table refreshes automatically
> - For manually managed data, use a regular table plus a Studio scheduled job

---

## Common Errors

| Error | Cause | Fix |
|---|---|---|
| `VCluster not available` | Compute cluster not running, or wrong name | Verify the VCLUSTER name and check cluster status |
| Dynamic table refresh fails | SQL error, or source-table schema changed | `SHOW DYNAMIC TABLE REFRESH HISTORY WHERE name = 'xxx'` for error details |
| Stream returns no data | Already consumed, or past the retention window | Check the source table's `data_retention_days`; confirm whether the stream was already consumed |
| Pipe stops ingesting | Kafka offset problem or broken connection | `DESC PIPE EXTENDED` for status; check the Kafka connection |
| `Cannot ALTER AS clause` | Tried to change a dynamic table's SQL via ALTER | Use `CREATE OR REPLACE DYNAMIC TABLE` instead |
| `CREATE OR REPLACE PIPE` syntax error | ClickZetta does not support this syntax | Use `CREATE PIPE`, or `DROP PIPE` then `CREATE` |
| `CREATE OR REPLACE MATERIALIZED VIEW` syntax error | Only supported in `REWRITE DISABLED + BUILD DEFER` mode | Prefer `DROP MATERIALIZED VIEW` + `CREATE MATERIALIZED VIEW` |
| `DROP TABLE` on a materialized view fails | Object type mismatch | Use `DROP MATERIALIZED VIEW` (not `DROP TABLE`) |
| DML on a dynamic table fails with `not allowed` | Dynamic tables do not support DML | Fix data in the source table, or use a regular table plus a scheduled job |
| `SET cz.sql.dt.allow.dml` fails | Session statement not supported | Dynamic tables do not support DML; use another approach |

---

## Reference Documentation

- [Incremental computation overview](https://www.yunqi.tech/documents/streaming_data_pipeline_overview)
- [Dynamic Table](https://www.yunqi.tech/documents/dynamic-table)
- [Table Stream change data capture](https://www.yunqi.tech/documents/table_stream)
- [Materialized views](https://www.yunqi.tech/documents/materialized_ddl)
- [Pipe overview](https://www.yunqi.tech/documents/pipe-summary)
- [Real-time ETL with Dynamic Tables](https://www.yunqi.tech/documents/tutorials-streaming-data-pipeline-with_dynamic-table)
- [Full documentation index for LLMs](https://yunqi.tech/llms-full.txt)
@@ -0,0 +1,166 @@
{
  "skill_name": "clickzetta-sql-pipeline-manager",
  "evals": [
    {
      "id": 1,
      "prompt": "Create a dynamic table dwd.orders_clean for me that refreshes every minute, taking rows with amount > 0 from the ods.orders table, on the default_ap compute cluster.",
      "expected_output": "Generates CREATE OR REPLACE DYNAMIC TABLE dwd.orders_clean with TARGET_LAG = '1 minutes', VCLUSTER = default_ap, and AS SELECT ... FROM ods.orders WHERE amount > 0",
      "files": [],
      "assertions": [
        {"text": "Contains CREATE DYNAMIC TABLE or CREATE OR REPLACE DYNAMIC TABLE"},
        {"text": "Contains dwd.orders_clean as the table name"},
        {"text": "Contains TARGET_LAG = '1 minutes' or an equivalent"},
        {"text": "Contains VCLUSTER = default_ap"},
        {"text": "Contains the WHERE amount > 0 filter"},
        {"text": "Contains FROM ods.orders"}
      ]
    },
    {
      "id": 2,
      "prompt": "My dynamic table dws.order_hourly failed to refresh. How do I troubleshoot it?",
      "expected_output": "Uses SHOW DYNAMIC TABLE REFRESH HISTORY dws.order_hourly to inspect the failure, explains common causes (SQL errors, source-table schema changes, cluster unavailable), and provides troubleshooting steps",
      "files": [],
      "assertions": [
        {"text": "Contains the SHOW DYNAMIC TABLE REFRESH HISTORY command"},
        {"text": "Contains the table name dws.order_hourly"},
        {"text": "Mentions at least one common failure cause (SQL error / source-table change / cluster unavailable)"},
        {"text": "Provides concrete troubleshooting steps or commands"}
      ]
    },
    {
      "id": 3,
      "prompt": "I want to continuously import order data from Kafka into the ods.orders table. The broker is kafka.prod.com:9092, the topic is orders, and the data is JSON.",
      "expected_output": "Generates a CREATE OR REPLACE PIPE statement using the READ_KAFKA function, with KAFKA_BROKER, KAFKA_TOPIC, KAFKA_GROUP_ID, KAFKA_DATA_FORMAT = 'json', and the INSERT INTO ods.orders column mapping",
      "files": [],
      "assertions": [
        {"text": "Contains a CREATE PIPE statement"},
        {"text": "Contains the READ_KAFKA function"},
        {"text": "Contains kafka.prod.com:9092 as the broker address"},
        {"text": "Contains KAFKA_TOPIC set to orders"},
        {"text": "Contains KAFKA_DATA_FORMAT = 'json' or an equivalent"},
        {"text": "Contains INSERT INTO ods.orders"}
      ]
    },
    {
      "id": 4,
      "prompt": "I need to capture UPDATE and DELETE changes on the ods.orders table and sync them into the dwd.orders_dim table. How do I do that?",
      "expected_output": "First CREATE TABLE STREAM ON TABLE ods.orders MODE = STANDARD, then create a Dynamic Table or use MERGE INTO to consume the stream; explains the _change_type field values (update_postimage / delete)",
      "files": [],
      "assertions": [
        {"text": "Contains CREATE TABLE STREAM"},
        {"text": "Contains ON TABLE ods.orders"},
        {"text": "Contains MODE = STANDARD or explains standard mode"},
        {"text": "Mentions the _change_type field"},
        {"text": "Mentions the update_postimage or delete change types"},
        {"text": "Provides a way to consume the stream (MERGE INTO or a Dynamic Table)"}
      ]
    },
    {
      "id": 5,
      "prompt": "Create a materialized view dws.mv_daily_revenue for me that refreshes automatically every hour and totals revenue per region per day, based on dwd.orders_clean.",
      "expected_output": "Generates CREATE OR REPLACE MATERIALIZED VIEW with REFRESH AUTO EVERY '1 hours', VCLUSTER, and AS SELECT DATE(created_at), region, SUM(amount) FROM dwd.orders_clean GROUP BY 1,2",
      "files": [],
      "assertions": [
        {"text": "Contains CREATE MATERIALIZED VIEW"},
        {"text": "Contains dws.mv_daily_revenue as the view name"},
        {"text": "Contains REFRESH AUTO EVERY '1 hours' or an equivalent"},
        {"text": "Contains FROM dwd.orders_clean"},
        {"text": "Contains a SUM aggregation of revenue or amount"},
        {"text": "Contains a GROUP BY over the date and region"}
      ]
    },
    {
      "id": 6,
      "prompt": "I want a three-layer ETL: ODS raw layer → DWD cleansing layer → DWS aggregation layer, all as dynamic tables. DWD should refresh every minute and DWS should follow DWD. How should this be designed?",
      "expected_output": "DWD dynamic table with TARGET_LAG = '1 minutes', DWS dynamic table with TARGET_LAG = DOWNSTREAM; gives complete CREATE DYNAMIC TABLE examples for both layers and explains what DOWNSTREAM means",
      "files": [],
      "assertions": [
        {"text": "Contains two CREATE DYNAMIC TABLE statements (DWD and DWS layers)"},
        {"text": "DWD layer contains TARGET_LAG = '1 minutes'"},
        {"text": "DWS layer contains TARGET_LAG = DOWNSTREAM"},
        {"text": "Explains what DOWNSTREAM means (refresh follows the upstream table)"},
        {"text": "The DWS-layer query reads from the DWD-layer table"}
      ]
    },
    {
      "id": 7,
      "prompt": "I need to pause refreshes for all dynamic tables and resume them after maintenance finishes tomorrow. How do I do that?",
      "expected_output": "Use ALTER DYNAMIC TABLE <name> SUSPEND to pause and ALTER DYNAMIC TABLE <name> RESUME to resume; suggests listing all dynamic tables first with SHOW TABLES WHERE is_dynamic=true",
      "files": [],
      "assertions": [
        {"text": "Contains ALTER DYNAMIC TABLE ... SUSPEND"},
        {"text": "Contains ALTER DYNAMIC TABLE ... RESUME"},
        {"text": "Mentions SHOW TABLES WHERE is_dynamic=true or an equivalent way to list dynamic tables"}
      ]
    },
    {
      "id": 8,
      "prompt": "What is the difference between a dynamic table and a materialized view? My scenario is an aggregate report that runs once a night. Which is the better fit?",
      "expected_output": "Explains that dynamic tables fit real-time/incremental scenarios and materialized views fit low-frequency full refreshes; for a once-a-day report, recommends a materialized view (REFRESH MANUAL or AUTO EVERY '1 days') as the cheaper option",
      "files": [],
      "assertions": [
        {"text": "States that dynamic tables fit real-time or incremental computation"},
        {"text": "States that materialized views fit low-frequency or manual refresh"},
        {"text": "Explicitly recommends a materialized view for the once-a-day report"},
        {"text": "Mentions concrete syntax such as REFRESH MANUAL or REFRESH AUTO EVERY '1 days'"}
      ]
    },
    {
      "id": 9,
      "prompt": "Design a complete data pipeline for me: ingest user behavior data from Kafka (broker: kafka.prod.com:9092, topic: user_events, JSON format) with fields user_id, event_type, page_id, and ts (timestamp). I need three layers: an ODS raw layer, a DWD cleansing layer (filter out invalid event_type), and a DWS aggregation layer (daily UV and PV per page). DWD refreshes every 2 minutes and DWS follows DWD.",
      "expected_output": "Generates complete end-to-end SQL: 1) the ODS table, 2) a Kafka Pipe (using READ_KAFKA), 3) a DWD dynamic table (TARGET_LAG='2 minutes', VCLUSTER, filtering invalid event_type), 4) a DWS dynamic table (TARGET_LAG=DOWNSTREAM, UV/PV aggregated by day + page_id)",
      "files": [],
      "assertions": [
        {"text": "Contains the ODS-layer CREATE TABLE statement"},
        {"text": "Contains CREATE PIPE using READ_KAFKA to ingest from kafka.prod.com:9092"},
        {"text": "Contains a DWD-layer CREATE DYNAMIC TABLE with TARGET_LAG = '2 minutes' and VCLUSTER"},
        {"text": "Contains a DWS-layer CREATE DYNAMIC TABLE with TARGET_LAG = DOWNSTREAM"},
        {"text": "DWS layer contains COUNT(DISTINCT user_id) for UV or COUNT(*) for PV"},
        {"text": "DWS layer groups by date and page_id (GROUP BY)"},
        {"text": "Contains verification commands (SHOW or DESC)"}
      ]
    },
    {
      "id": 10,
      "prompt": "An operational database's orders table keeps receiving INSERTs and UPDATEs and is already synced into ClickZetta as ods.orders (columns: order_id, customer_id, amount, status, updated_at). I need a complete warehouse pipeline: a DWD layer holding the latest state of each order, and a DWS layer counting orders and totaling amounts per status per day. It must be near real time, with latency under 1 minute.",
      "expected_output": "Uses a Table Stream (MODE=STANDARD) to capture UPDATE changes; a DWD dynamic table consumes the stream filtering update_postimage with TARGET_LAG='1 minutes'; a DWS dynamic table with TARGET_LAG=DOWNSTREAM does the aggregation",
      "files": [],
      "assertions": [
        {"text": "Contains CREATE TABLE STREAM ON TABLE ods.orders with MODE = STANDARD"},
        {"text": "The DWD dynamic table consumes the stream, filtering _change_type IN ('insert', 'update_postimage')"},
        {"text": "DWD layer contains TARGET_LAG = '1 minutes' and VCLUSTER"},
        {"text": "DWS layer contains TARGET_LAG = DOWNSTREAM"},
        {"text": "DWS layer groups by date and status, with COUNT and SUM aggregations"},
        {"text": "Contains verification commands (SHOW DYNAMIC TABLE REFRESH HISTORY or SHOW TABLE STREAMS)"}
      ]
    },
    {
      "id": 12,
      "prompt": "I want to implement a Medallion Architecture on ClickZetta. Data comes from S3 (Volume name s3_raw_vol, path /events/, JSON format) with fields event_id, user_id, event_name, properties (nested JSON), and created_at. The Bronze layer lands the raw data as-is, the Silver layer cleanses it (drop records with an empty event_name, extract page_url and referrer from properties), and the Gold layer counts events and distinct users per event_name per day. Silver refreshes every 5 minutes and Gold follows Silver.",
      "expected_output": "Uses bronze/silver/gold schema names (not mapped to ODS/DWD/DWS) and generates complete SQL: a Bronze table + OSS Pipe (COPY INTO FROM '@s3_raw_vol/events/'), a Silver dynamic table (TARGET_LAG='5 minutes', filters event_name IS NULL, extracts properties:page_url and properties:referrer), and a Gold dynamic table (TARGET_LAG=DOWNSTREAM, COUNT and COUNT DISTINCT user_id by day + event_name)",
      "files": [],
      "assertions": [
        {"text": "Uses bronze/silver/gold schema names (not ods/dwd/dws)"},
        {"text": "Contains CREATE PIPE with COPY INTO FROM '@s3_raw_vol/events/' and FILE_FORMAT TYPE = 'json'"},
        {"text": "Silver layer contains CREATE DYNAMIC TABLE with TARGET_LAG = '5 minutes', VCLUSTER, filtering event_name IS NULL"},
        {"text": "Silver layer extracts the nested JSON fields (properties:page_url or $1:page_url style syntax)"},
        {"text": "Gold layer contains CREATE DYNAMIC TABLE with TARGET_LAG = DOWNSTREAM"},
        {"text": "Gold layer contains COUNT(DISTINCT user_id) and COUNT(*) or COUNT(event_id), grouped by date and event_name"}
      ]
    },
    {
      "id": 11,
      "prompt": "Log files are uploaded hourly to OSS (Volume name my_oss_vol, path /logs/app/, Parquet format) with fields log_id, user_id, action, duration_ms, and log_time. Build me a complete pipeline: auto-import into ODS, a DWD cleansing layer (filter out duration_ms <= 0), and a DWS layer with the daily average duration and count per action. No real-time requirement; hourly refresh is enough.",
      "expected_output": "Uses an object-storage Pipe (COPY INTO from @my_oss_vol/logs/app/, TYPE=parquet), a DWD dynamic table with TARGET_LAG='1 hours', and a DWS dynamic table with TARGET_LAG=DOWNSTREAM aggregating AVG(duration_ms) and COUNT(*)",
      "files": [],
      "assertions": [
        {"text": "Contains CREATE PIPE with COPY INTO FROM '@my_oss_vol/logs/app/'"},
        {"text": "Contains FILE_FORMAT = (TYPE = 'parquet') or an equivalent"},
        {"text": "DWD layer contains CREATE DYNAMIC TABLE with TARGET_LAG = '1 hours', VCLUSTER, filtering duration_ms <= 0"},
        {"text": "DWS layer contains TARGET_LAG = DOWNSTREAM"},
        {"text": "DWS layer contains AVG(duration_ms) for the average duration"},
        {"text": "DWS layer groups by date and action"}
      ]
    }
  ]
}