@clickzetta/cz-cli-darwin-arm64 0.3.87 → 0.3.88

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (27)
  1. package/bin/cz-cli +0 -0
  2. package/bin/skills/clickzetta-dynamic-table/SKILL.md +169 -169
  3. package/bin/skills/clickzetta-dynamic-table/best-practices/dimension-table-join-guide.md +126 -126
  4. package/bin/skills/clickzetta-dynamic-table/best-practices/medallion-and-stream-patterns.md +25 -25
  5. package/bin/skills/clickzetta-dynamic-table/best-practices/non-partitioned-merge-into-warning.md +48 -48
  6. package/bin/skills/clickzetta-dynamic-table/best-practices/performance-optimization.md +51 -51
  7. package/bin/skills/clickzetta-dynamic-table/best-practices/scheduling-guide.md +59 -59
  8. package/bin/skills/clickzetta-dynamic-table/dt-creator/SKILL.md +8 -7
  9. package/bin/skills/clickzetta-dynamic-table/dt-creator/references/dt-declaration-strategy.md +99 -99
  10. package/bin/skills/clickzetta-dynamic-table/dt-creator/references/incremental-config-reference.md +188 -188
  11. package/bin/skills/clickzetta-dynamic-table/dt-creator/references/refresh-history-guide.md +117 -117
  12. package/bin/skills/clickzetta-dynamic-table/dt-creator/references/sql-limitations.md +29 -29
  13. package/bin/skills/clickzetta-dynamic-table/dynamic-table-alter/SKILL.md +80 -79
  14. package/bin/skills/clickzetta-dynamic-table/sql-to-dt/SKILL.md +15 -15
  15. package/bin/skills/clickzetta-dynamic-table/sql-to-dt/references/sql2dt-column-validation-rules.md +61 -61
  16. package/bin/skills/clickzetta-dynamic-table/sql-to-dt/references/sql2dt-conversion-rules.md +100 -100
  17. package/bin/skills/clickzetta-dynamic-table/sql-to-dt/references/sql2dt-placeholder-rules.md +64 -64
  18. package/bin/skills/clickzetta-dynamic-table/sql-to-dt/references/sql2dt-refresh-rules.md +32 -32
  19. package/bin/skills/clickzetta-dynamic-table/sql-to-dt/references/sql2dt-self-reference-rules.md +21 -21
  20. package/bin/skills/clickzetta-dynamic-table/sql-to-dt/references/sql2dt-workflow.md +71 -71
  21. package/bin/skills/clickzetta-sql-pipeline-manager/SKILL.md +203 -202
  22. package/bin/skills/clickzetta-sql-pipeline-manager/references/dynamic-table.md +62 -62
  23. package/bin/skills/clickzetta-sql-pipeline-manager/references/materialized-view.md +34 -34
  24. package/bin/skills/clickzetta-sql-pipeline-manager/references/pipe.md +61 -61
  25. package/bin/skills/clickzetta-sql-pipeline-manager/references/table-stream.md +41 -41
  26. package/bin/skills/clickzetta-table-stream-pipeline/SKILL.md +103 -101
  27. package/package.json +1 -1
@@ -1,53 +1,53 @@
1
- # 非分区表 + 持续写入:DT 风险告警与 MERGE INTO 替代建议
1
+ # Non-partitioned Table + Continuous Writes: DT Risk Alert and MERGE INTO Alternative
2
2
 
3
- ## 触发条件
3
+ ## Trigger Conditions
4
4
 
5
- 当用户要创建的 DT 同时满足以下条件时,**必须向用户发出告警**:
5
+ When the DT the user is about to create simultaneously meets all of the following conditions, **an alert must be issued to the user**:
6
6
 
7
- 1. DT 本身是非分区表(没有 `PARTITIONED BY`,也没有 `SESSION_CONFIGS()` 引用)
8
- 2. 源表也是非分区表,且数据会持续写入(如 Kafka 消费落地表、CDC 明细表)
9
- 3. SQL 中包含按主键去重的窗口函数模式:`ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) WHERE rn = 1`
7
+ 1. The DT itself is a non-partitioned table (no `PARTITIONED BY` and no `SESSION_CONFIGS()` references)
8
+ 2. The source table is also a non-partitioned table with continuous writes (e.g., a Kafka consumer landing table, a CDC detail table)
9
+ 3. The SQL contains a window function deduplication pattern by primary key: `ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) WHERE rn = 1`
10
10
 
11
- ## 告警内容
11
+ ## Alert Content
12
12
 
13
- 向用户说明以下三个风险:
13
+ Explain the following three risks to the user:
14
14
 
15
- ### 风险 1:存储无限膨胀
15
+ ### Risk 1: Unbounded Storage Growth
16
16
 
17
- 非分区的 DT 和非分区的源表都没有自动的数据生命周期管理机制(`data_lifecycle` 仅对分区表生效)。随着数据持续写入:
18
- - 源表数据量无限增长
19
- - DT 的状态表全局维护,随数据量线性膨胀
20
- - 目标表数据量同步增长
21
- - 三者叠加,存储成本持续上升且不可控
17
+ Non-partitioned DTs and non-partitioned source tables both lack automatic data lifecycle management (`data_lifecycle` only works for partitioned tables). As data continues to be written:
18
+ - Source table data grows without bound
19
+ - DT state tables are maintained globally and grow linearly with data volume
20
+ - Target table data grows in sync
21
+ - All three combined result in continuously rising and uncontrollable storage costs
22
22
 
23
- ### 风险 2:源表归档引发性能灾难
23
+ ### Risk 2: Source Table Archiving Causes Performance Disaster
24
24
 
25
- 当存储膨胀到一定程度,运维人员通常会对源表进行归档——将历史数据迁移到冷存储或归档表,然后从源表中删除以释放空间。此时:
25
+ When storage grows to a certain point, operations teams typically archive the source table — migrating historical data to cold storage or an archive table, then deleting it from the source table to free space. At this point:
26
26
 
27
- - DT 会捕获源表的删除事件,并将其反映到增量计算结果中
28
- - `ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) WHERE rn = 1` 的删除处理代价极高:
29
- - 窗口函数无法增量处理删除——需要回读该 key 下的所有历史数据重新排序
30
- - 非分区表没有分区边界来限制回读范围,可能需要扫描整张表
31
- - 大规模归档会产生海量删除变更,每个 key 都需要独立回算
32
- - 一次源表归档可能导致 DT REFRESH 耗时从秒级飙升到小时级甚至失败
27
+ - The DT captures the source table's delete events and reflects them in incremental computation results
28
+ - Handling deletes for the `ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) WHERE rn = 1` pattern is extremely expensive:
29
+ - Window functions cannot handle deletes incrementally — they need to re-read all historical data for that key and re-sort
30
+ - Non-partitioned tables have no partition boundaries to limit the re-read scope, so the engine may need to scan the entire table
31
+ - Large-scale archiving produces massive delete changes; each key needs to be independently recomputed
32
+ - A single source table archiving operation can cause DT REFRESH duration to spike from seconds to hours, or even fail
33
33
 
34
- ### 风险 3:无法过滤归档产生的删除事件
34
+ ### Risk 3: Cannot Filter Archive-generated Delete Events
35
35
 
36
- DT 的增量引擎自动捕获源表的所有变更(INSERT / UPDATE / DELETE),用户无法干预这个过程。SQL 中的 `WHERE op <> 'DELETE'` 过滤的是业务层面的删除标记,而不是源表物理删除产生的删除变更。用户没有任何手段告诉 DT "这些删除是归档操作,请忽略"
36
+ The DT's incremental engine automatically captures all changes to the source table (INSERT / UPDATE / DELETE); users cannot intervene in this process. `WHERE op <> 'DELETE'` in the SQL filters business-level delete markers, not physical deletes from the source table. Users have no way to tell the DT "these deletes are archiving operations; please ignore them."
37
37
 
38
- ## 推荐替代方案
38
+ ## Recommended Alternative
39
39
 
40
- 建议用户使用 MERGE INTO + Table Stream 替代:
40
+ Suggest the user use MERGE INTO + Table Stream instead:
41
41
 
42
42
  ```sql
43
- -- Step 1: 源表开启变更跟踪
43
+ -- Step 1: Enable change tracking on source table
44
44
  ALTER TABLE source_table SET PROPERTIES ('change_tracking' = 'true');
45
45
 
46
- -- Step 2: 创建 Table Stream
46
+ -- Step 2: Create Table Stream
47
47
  CREATE TABLE STREAM source_stream ON TABLE source_table
48
48
  WITH (TABLE_STREAM_MODE = 'STANDARD', SHOW_INITIAL_ROWS = TRUE);
49
49
 
50
- -- Step 3: 创建目标表
50
+ -- Step 3: Create target table
51
51
  CREATE TABLE target_table (
52
52
  id BIGINT,
53
53
  col1 STRING,
@@ -55,7 +55,7 @@ CREATE TABLE target_table (
55
55
  event_time TIMESTAMP
56
56
  );
57
57
 
58
- -- Step 4: 定时调度 MERGE INTO 消费 Stream
58
+ -- Step 4: Scheduled MERGE INTO to consume Stream
59
59
  MERGE INTO target_table t
60
60
  USING (
61
61
  SELECT id, col1, col2, event_time,
@@ -68,29 +68,29 @@ WHEN NOT MATCHED AND s.op = 'UPSERT' THEN INSERT
68
68
  (id, col1, col2, event_time) VALUES (s.id, s.col1, s.col2, s.event_time);
69
69
  ```
70
70
 
71
- MERGE INTO + Table Stream 的优势:
72
- - **每次计算独立**:只消费 Stream 中的增量数据,不依赖源表全量状态
73
- - **归档免疫**:源表归档时,可在 USING 子查询中通过 WHERE 条件过滤归档产生的删除事件
74
- - **目标表独立管理**:目标表的生命周期与源表解耦,可独立制定归档策略
75
- - **offset 自动推进**:MERGE INTO 消费 Stream offset 自动推进,下次只处理新变更
71
+ Advantages of MERGE INTO + Table Stream:
72
+ - **Each computation is independent**: only consumes incremental data from the Stream; does not depend on the source table's full state
73
+ - **Archive-immune**: when the source table is archived, archive-generated delete events can be filtered via WHERE conditions in the USING subquery
74
+ - **Independent target table management**: the target table's lifecycle is decoupled from the source table; independent archiving strategies can be set
75
+ - **Offset auto-advances**: after MERGE INTO consumes the Stream, the offset automatically advances; only new changes are processed next time
76
76
 
77
- ## 告警话术模板
77
+ ## Alert Message Template
78
78
 
79
- 当检测到用户的 DT 满足触发条件时,使用以下话术:
79
+ When the user's DT is detected to meet the trigger conditions, use the following message:
80
80
 
81
- > ⚠️ **风险提示**:您正在创建一个非分区的 Dynamic Table,且源表也是非分区的持续写入表。这种组合存在以下长期运维风险:
81
+ > ⚠️ **Risk Warning**: You are creating a non-partitioned Dynamic Table, and the source table is also a non-partitioned table with continuous writes. This combination has the following long-term operational risks:
82
82
  >
83
- > 1. **存储无限膨胀**:源表、DT 目标表、DT 状态表三者都会持续增长,且无法通过 `data_lifecycle` 自动清理
84
- > 2. **源表归档会引发性能灾难**:当您需要对源表进行归档(迁移历史数据后删除)时,DT 会捕获这些删除事件。由于 SQL 中包含 `ROW_NUMBER() ... WHERE rn = 1` 的去重逻辑,每个被删除的 key 都需要回读历史数据重新排序,非分区表没有边界限制,可能导致 REFRESH 性能严重回退
85
- > 3. **无法过滤归档删除**:DT 增量引擎自动捕获源表所有变更,您无法告诉 DT 忽略归档操作产生的删除
83
+ > 1. **Unbounded storage growth**: the source table, DT target table, and DT state tables will all grow continuously and cannot be automatically cleaned up via `data_lifecycle`
84
+ > 2. **Source table archiving will cause a performance disaster**: when you need to archive the source table (migrate historical data then delete), the DT will capture these delete events. Because the SQL contains `ROW_NUMBER() ... WHERE rn = 1` deduplication logic, each deleted key needs to re-read historical data and re-sort; non-partitioned tables have no boundary limits, which may cause serious REFRESH performance regression
85
+ > 3. **Cannot filter archive deletes**: the DT incremental engine automatically captures all source table changes; you cannot tell the DT to ignore deletes generated by archiving operations
86
86
  >
87
- > **建议**:对于这类"非分区 CDC 明细表合并为结果表"的场景,推荐使用 MERGE INTO + Table Stream 方案。每次只消费增量数据,源表归档时可通过 WHERE 条件过滤删除事件,不会影响下游。
87
+ > **Recommendation**: For this type of "merge non-partitioned CDC detail table into result table" scenario, the MERGE INTO + Table Stream approach is recommended. Each run only consumes incremental data; archive-generated delete events can be filtered via WHERE conditions, without affecting downstream.
88
88
 
89
- ## 判断逻辑
89
+ ## Detection Logic
90
90
 
91
- 在帮助用户创建 DT 时,按以下顺序检查:
91
+ When helping users create a DT, check in the following order:
92
92
 
93
- 1. DT 是否有 `PARTITIONED BY` `SESSION_CONFIGS()`如果有,不触发告警
94
- 2. 源表是否是持续写入的非分区表(如 Kafka 消费表、CDC 明细表)→ 如果不是,不触发告警
95
- 3. SQL 是否包含 `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC) WHERE rn = 1` 模式如果包含,风险最高,必须告警
96
- 4. 即使没有 ROW_NUMBER,只要满足条件 1+2,也应提醒用户注意存储膨胀风险
93
+ 1. Does the DT have `PARTITIONED BY` or `SESSION_CONFIGS()`? → If yes, do not trigger alert
94
+ 2. Is the source table a non-partitioned table with continuous writes (e.g., Kafka consumer table, CDC detail table)? → If not, do not trigger alert
95
+ 3. Does the SQL contain the `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC) WHERE rn = 1` pattern? → If yes, highest risk; must alert
96
+ 4. Even without ROW_NUMBER, if conditions 1+2 are met, remind the user of the storage growth risk
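
The four checks in the detection logic above can be sketched as a small classifier. This is a minimal illustration, not part of the package: `DtCandidate`, `alert_level`, and the regular expressions are hypothetical names, and the two source-table flags are assumed to be supplied by the caller, since the skill text does not say how they are determined. The regex only looks for the `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC)` shape and does not verify the `WHERE rn = 1` filter.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration; a real implementation would
# likely parse the SQL instead of matching text.
PARTITIONED_OR_PARAMETERIZED = re.compile(
    r"\bPARTITIONED\s+BY\b|\bSESSION_CONFIGS\s*\(", re.IGNORECASE
)
ROW_NUMBER_DEDUP = re.compile(
    r"ROW_NUMBER\s*\(\s*\)\s*OVER\s*\(\s*PARTITION\s+BY\b.*?\bORDER\s+BY\b.*?\bDESC\b",
    re.IGNORECASE | re.DOTALL,
)


@dataclass
class DtCandidate:
    sql: str                            # full DT definition text (DDL plus query)
    source_is_partitioned: bool         # assumed to be known by the caller
    source_has_continuous_writes: bool  # e.g. Kafka landing table, CDC detail table


def alert_level(dt: DtCandidate) -> str:
    """Return 'none', 'storage_reminder', or 'must_alert'."""
    # Check 1: a partitioned or parameterized DT does not trigger the alert.
    if PARTITIONED_OR_PARAMETERIZED.search(dt.sql):
        return "none"
    # Check 2: the source must be non-partitioned with continuous writes.
    if dt.source_is_partitioned or not dt.source_has_continuous_writes:
        return "none"
    # Check 3: the dedup window pattern is the highest-risk case.
    if ROW_NUMBER_DEDUP.search(dt.sql):
        return "must_alert"
    # Check 4: conditions 1+2 alone still warrant a storage-growth reminder.
    return "storage_reminder"
```

Returning a level rather than a boolean keeps check 4 (the storage reminder without `ROW_NUMBER`) distinct from the mandatory alert in check 3.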
@@ -1,109 +1,109 @@
1
- # Dynamic Table 性能优化指南
1
+ # Dynamic Table Performance Optimization Guide
2
2
 
3
- 本文档从 SQL 写法、数据特征、管道设计三个维度,帮助用户写出增量刷新性能更好的 DT。
3
+ This document helps users write DTs with better incremental refresh performance, covering three dimensions: SQL authoring, data characteristics, and pipeline design.
4
4
 
5
- ## 核心原则:增量刷新的代价模型
5
+ ## Core Principle: The Cost Model of Incremental Refresh
6
6
 
7
- 增量刷新的性能取决于三个因素:
8
- 1. **变更量占比**:每次刷新时,源表中有多少数据发生了变化。变更量越小,增量越划算
9
- 2. **算子类型**:不同 SQL 算子的增量代价差异很大
10
- 3. **数据局部性**:变更数据在 JOIN key / GROUP BY key / PARTITION BY key 上的分布是否集中
7
+ Incremental refresh performance depends on three factors:
8
+ 1. **Change ratio**: how much data in the source table changed during each refresh. The smaller the change volume, the more incremental refresh pays off.
9
+ 2. **Operator type**: different SQL operators have very different incremental costs.
10
+ 3. **Data locality**: whether the changed data is concentrated on JOIN keys / GROUP BY keys / PARTITION BY keys.
11
11
 
12
- 当变更量超过总数据量的较大比例时,增量刷新可能反而比全量刷新更慢,因为增量需要额外的变更数据计算、去重合并、状态表读写等开销。
12
+ When the change volume exceeds a significant proportion of total data, incremental refresh may actually be slower than full refresh, because incremental refresh carries extra overhead: computing change data, deduplicating and merging, and reading and writing state tables.
13
13
 
14
- ## SQL 写法优化
14
+ ## SQL Writing Optimization
15
15
 
16
- ### 1. 优先使用 INNER JOIN 而非 OUTER JOIN
16
+ ### 1. Prefer INNER JOIN over OUTER JOIN
17
17
 
18
- INNER JOIN 的增量计算比 OUTER JOIN 更高效:
19
- - INNER JOIN:只需要计算 A 的变更数据 JOIN B 的全量数据 + A 的全量数据 JOIN B 的变更数据
20
- - LEFT/RIGHT/FULL OUTER JOIN:还需要额外处理 NULL 填充、反向撤回等逻辑
18
+ INNER JOIN incremental computation is more efficient than OUTER JOIN:
19
+ - INNER JOIN: only needs to compute A's change data JOIN B's full data + A's full data JOIN B's change data
20
+ - LEFT/RIGHT/FULL OUTER JOIN: also needs to handle NULL filling, reverse retraction, and other logic
21
21
 
22
- 如果业务上可以保证参照完整性(即 JOIN key 一定能匹配),优先用 INNER JOIN
22
+ If business logic can guarantee referential integrity (i.e., JOIN keys will always match), prefer INNER JOIN.
23
23
 
24
24
  ```sql
25
- -- ❌ 不必要的 LEFT JOIN(如果 product 一定存在)
25
+ -- ❌ Unnecessary LEFT JOIN (if product is guaranteed to exist)
26
26
  SELECT o.*, p.name FROM orders o LEFT JOIN products p ON o.pid = p.id;
27
27
 
28
- -- ✅ 改用 INNER JOIN
28
+ -- ✅ Switch to INNER JOIN
29
29
  SELECT o.*, p.name FROM orders o INNER JOIN products p ON o.pid = p.id;
30
30
  ```
31
31
 
32
- ### 2. 减少不必要的 DISTINCT
32
+ ### 2. Reduce Unnecessary DISTINCT
33
33
 
34
- 每次增量刷新时,DISTINCT 需要对受影响的 key 做重算。如果上游数据已经去重,或者可以通过其他方式保证唯一性,去掉 DISTINCT
34
+ On every incremental refresh, DISTINCT needs to recompute affected keys. If upstream data is already deduplicated, or uniqueness can be guaranteed another way, remove DISTINCT.
35
35
 
36
36
  ```sql
37
- -- ❌ 冗余的 DISTINCT
37
+ -- ❌ Redundant DISTINCT
38
38
  SELECT DISTINCT user_id, user_name FROM user_events;
39
39
  ```
40
40
 
41
- ### 3. 窗口函数必须有 PARTITION BY
41
+ ### 3. Window Functions Must Have PARTITION BY
42
42
 
43
- 没有 PARTITION BY 的窗口函数会导致每次增量刷新都全量重算整个窗口。加上 PARTITION BY 后,只需要重算受影响的分区。
43
+ Window functions without PARTITION BY cause every incremental refresh to fully recompute the entire window. With PARTITION BY, only affected partitions need to be recomputed.
44
44
 
45
45
  ```sql
46
- -- ❌ 全局窗口,每次增量都全量重算
46
+ -- ❌ Global window; every incremental refresh does a full recomputation
47
47
  SELECT *, ROW_NUMBER() OVER (ORDER BY created_at DESC) AS rn FROM events;
48
48
 
49
- -- ✅ 加上 PARTITION BY,只重算有变更的分区
49
+ -- ✅ Add PARTITION BY; only recompute partitions with changes
50
50
  SELECT *, ROW_NUMBER() OVER (PARTITION BY category ORDER BY created_at DESC) AS rn FROM events;
51
51
  ```
52
52
 
53
- ### 4. 聚合 key 尽量使用简单列引用
53
+ ### 4. Use Simple Column References as Aggregation Keys
54
54
 
55
- 复合表达式作为 GROUP BY key 会降低增量效率,因为引擎需要对表达式求值后才能判断哪些 key 受影响。
55
+ Compound expressions as GROUP BY keys reduce incremental efficiency, because the engine needs to evaluate the expression before determining which keys are affected.
56
56
 
57
57
  ```sql
58
- -- ❌ 复合表达式作为 GROUP BY key
58
+ -- ❌ Compound expression as GROUP BY key
59
59
  SELECT DATE_TRUNC('hour', ts) AS hour, SUM(amount)
60
60
  FROM transactions
61
61
  GROUP BY DATE_TRUNC('hour', ts);
62
62
 
63
- -- ✅ 如果可能,在上游预计算好 key
64
- -- 或者拆分为两个 DT(见下文"管道拆分"
63
+ -- ✅ If possible, pre-compute the key column upstream
64
+ -- Or split into two DTs (see "Pipeline Splitting" below)
65
65
  ```
66
66
 
67
- ### 5. 尽可能使用分区条件限制数据范围
67
+ ### 5. Use Partition Conditions to Limit Data Range Where Possible
68
68
 
69
- DT SQL 中对源表添加分区过滤条件,可以显著减少每次增量刷新需要扫描的数据量。
69
+ Adding partition filter conditions on source tables in the DT's SQL can significantly reduce the amount of data that needs to be scanned on each incremental refresh.
70
70
 
71
71
  ```sql
72
- -- ❌ 不加分区条件,每次扫描全表
72
+ -- ❌ No partition condition; scans the full table every time
73
73
  SELECT o.*, p.name
74
74
  FROM orders o JOIN products p ON o.pid = p.id;
75
75
 
76
- -- ✅ 通过分区条件限制数据范围
76
+ -- ✅ Limit data range with partition condition
77
77
  SELECT o.*, p.name
78
78
  FROM orders o JOIN products p ON o.pid = p.id
79
79
  WHERE o.ds = SESSION_CONFIGS()['dt.args.ds'];
80
80
  ```
81
81
 
82
- ## 管道拆分:复杂 DT 拆成多级
82
+ ## Pipeline Splitting: Break Complex DTs into Multiple Levels
83
83
 
84
- 当一个 DT SQL 包含多个 JOIN + 聚合 + 窗口函数时,考虑拆分为多个 DT,每个 DT 只做一件事。
84
+ When a DT's SQL contains multiple JOINs + aggregations + window functions, consider splitting it into multiple DTs, each doing one thing.
85
85
 
86
- 好处:
87
- - 每个 DT 的增量计算更简单、更快
88
- - 中间 DT 可以被多个下游 DT 复用
89
- - 出问题时更容易定位是哪一层的问题
90
- - 不同层可以使用不同的优化策略
86
+ Benefits:
87
+ - Each DT's incremental computation is simpler and faster
88
+ - Intermediate DTs can be reused by multiple downstream DTs
89
+ - Easier to pinpoint which layer has a problem when issues arise
90
+ - Different layers can use different optimization strategies
91
91
 
92
- ## 数据特征与增量效率
92
+ ## Data Characteristics and Incremental Efficiency
93
93
 
94
- ### 变更量占比
94
+ ### Change Ratio
95
95
 
96
- 增量刷新在变更量占总数据量比例较小时效果最好。经验值:
97
- - < 5%:增量刷新通常显著优于全量
98
- - 5% ~ 20%:取决于具体算子和数据分布
99
- - \> 20%:可能需要评估是否全量刷新更合适
96
+ Incremental refresh works best when the change volume is a small proportion of total data. Rule of thumb:
97
+ - < 5%: incremental refresh is usually significantly better than full
98
+ - 5% ~ 20%: depends on specific operators and data distribution
99
+ - \> 20%: may need to evaluate whether full refresh is more appropriate
100
100
 
101
- ### Append-Only 源表
101
+ ### Append-Only Source Tables
102
102
 
103
- 如果源表只有 INSERT 没有 UPDATE/DELETE 可以显著优化:
104
- - 增量引擎知道变更数据只有新增(无撤回),可以跳过去重合并等操作
105
- - 聚合可以直接累加,不需要维护完整的中间状态
103
+ If the source table only has INSERT and no UPDATE/DELETE, significant optimization is possible:
104
+ - The incremental engine knows change data only has additions (no retractions), and can skip deduplication merging and other operations
105
+ - Aggregation can directly accumulate without maintaining complete intermediate state
106
106
 
107
- ### 变更数据的分布
107
+ ### Distribution of Changed Data
108
108
 
109
- 如果变更数据集中在少数 key 上(如最近时间段的数据),增量效率高。如果变更分散在大量 key 上,聚合和窗口函数需要重算大量分区,效率下降。
109
+ If changed data is concentrated on a few keys (e.g., recent time periods), incremental efficiency is high. If changes are spread across many keys, aggregation and window functions need to recompute many partitions, reducing efficiency.
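
The change-ratio rule of thumb from the performance guide above can be expressed as a tiny helper. This is a sketch under stated assumptions: `refresh_advice` is a hypothetical name, the 5% and 20% thresholds come from the guide's rule of thumb, and treating exactly 5% or exactly 20% as part of the middle band is an assumption the guide does not specify.

```python
def refresh_advice(changed_rows: int, total_rows: int) -> str:
    """Classify a refresh by the share of source rows that changed.

    Returns 'incremental', 'depends', or 'evaluate_full'.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    ratio = changed_rows / total_rows
    if ratio < 0.05:
        # Small change set: incremental refresh is usually clearly better.
        return "incremental"
    if ratio <= 0.20:
        # Middle band: depends on operators and data distribution.
        return "depends"
    # Large change set: evaluate whether full refresh is more appropriate.
    return "evaluate_full"
```

The result is only a starting point; per the guide, operator type and data locality also affect which refresh mode is cheaper.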
@@ -1,19 +1,19 @@
1
- # Dynamic Table 调度方式选择指南
1
+ # Dynamic Table Scheduling Method Selection Guide
2
2
 
3
- ## 两种调度方式对比
3
+ ## Comparison of Two Scheduling Methods
4
4
 
5
- | 方式 | 做法 | 优点 | 缺点 |
5
+ | Method | Approach | Advantages | Disadvantages |
6
6
  |------|------|------|------|
7
- | **DDL 内置调度**(REFRESH INTERVAL | CREATE DYNAMIC TABLE 时写 `REFRESH INTERVAL` 子句,由 Lakehouse 自动触发 | 简单,无需额外配置 | 无告警、无依赖编排、刷新状态只能手动 SQL 查询 |
8
- | **Studio Task 调度**(推荐) | Studio 创建定时任务,任务内容为 `REFRESH DYNAMIC TABLE` 命令 | 支持上下游依赖、统一告警、可视化监控 | 需要额外创建 Task |
7
+ | **DDL built-in scheduling** (REFRESH INTERVAL) | Write a `REFRESH INTERVAL` clause in CREATE DYNAMIC TABLE; Lakehouse triggers automatically | Simple; no additional configuration needed | No alerts, no dependency orchestration; refresh status can only be checked via manual SQL |
8
+ | **Studio Task scheduling** (recommended) | Create a scheduled task in Studio; task content is the `REFRESH DYNAMIC TABLE` command | Supports upstream/downstream dependencies, unified alerts, visual monitoring | Requires creating an additional Task |
9
9
 
10
- **生产环境推荐使用 Studio Task 调度。** DDL 内置调度适合快速验证和开发测试阶段。
10
+ **Studio Task scheduling is recommended for production environments.** DDL built-in scheduling is suitable for quick validation and development/testing phases.
11
11
 
12
12
  ---
13
13
 
14
- ## DDL 内置调度
14
+ ## DDL Built-in Scheduling
15
15
 
16
- CREATE 语句中通过 `REFRESH INTERVAL` 子句定义刷新频率,Lakehouse 自动周期性触发:
16
+ Define the refresh frequency via the `REFRESH INTERVAL` clause in the CREATE statement; Lakehouse then triggers the refresh automatically at that interval:
17
17
 
18
18
  ```sql
19
19
  CREATE DYNAMIC TABLE sales_daily
@@ -25,111 +25,111 @@ FROM orders
25
25
  GROUP BY 1;
26
26
  ```
27
27
 
28
- ### 弊端
28
+ ### Drawbacks
29
29
 
30
- - **无告警**:刷新失败不会主动通知,只能手动执行 SQL 查询状态
31
- - **无依赖编排**:无法声明"等上游任务完成后再刷新",只能靠时间间隔错开
32
- - **监控成本高**:需要定期手动执行以下命令检查刷新是否正常
30
+ - **No alerts**: refresh failures are not proactively notified; status can only be checked by manually executing SQL
31
+ - **No dependency orchestration**: cannot declare "refresh only after upstream task completes"; can only stagger by time interval
32
+ - **High monitoring cost**: you need to run the following command manually at regular intervals to confirm that refreshes are succeeding
33
33
 
34
34
  ```sql
35
- -- 查看刷新历史,确认 state 是否为 SUCCEED
35
+ -- View refresh history; confirm state is SUCCEED
36
36
  SHOW DYNAMIC TABLE REFRESH HISTORY WHERE name = 'your_dt_name';
37
37
  ```
38
38
 
39
- 关键字段说明:
39
+ Key field descriptions:
40
40
 
41
- | 字段 | 含义 |
41
+ | Field | Meaning |
42
42
  |------|------|
43
43
  | `state` | SUCCEED / FAILED / RUNNING / QUEUED |
44
44
  | `refresh_mode` | INCREMENTAL / FULL / NO_DATA |
45
- | `error_message` | 失败时的错误信息 |
46
- | `duration` | 本次刷新耗时 |
47
- | `stats` | 增量行数(rows_inserted / rows_deleted |
45
+ | `error_message` | Error message on failure |
46
+ | `duration` | Duration of this refresh |
47
+ | `stats` | Incremental row count (rows_inserted / rows_deleted) |
48
48
 
49
49
  ---
50
50
 
51
- ## Studio Task 调度(生产推荐)
51
+ ## Studio Task Scheduling (Recommended for Production)
52
52
 
53
- Studio 中创建 SQL 任务,任务内容为 REFRESH 命令,通过 Studio 的调度系统管理执行。
53
+ Create a SQL task in Studio whose content is the REFRESH command; Studio's scheduling system manages its execution.
54
54
 
55
- ### Task 内容
55
+ ### Task Content
56
56
 
57
- **非分区 DT:**
57
+ **Non-partitioned DT:**
58
58
 
59
59
  ```sql
60
60
  REFRESH DYNAMIC TABLE schema_name.dt_name;
61
61
  ```
62
62
 
63
- **分区 DT(带参数):**
63
+ **Partitioned DT (with parameters):**
64
64
 
65
65
  ```sql
66
66
  SET dt.args.ds = '${bizdate}';
67
67
  REFRESH DYNAMIC TABLE schema_name.dt_name PARTITION (ds = '${bizdate}');
68
68
  ```
69
69
 
70
- `${bizdate}` Studio 调度引擎在每次执行时自动替换为业务日期。
70
+ `${bizdate}` is automatically replaced with the business date by the Studio scheduling engine at each execution.
71
71
 
72
- ### 必须配置自依赖
72
+ ### Must Configure Self-dependency
73
73
 
74
- 同一张 DT 禁止并发 REFRESH(会导致写冲突或数据不一致)。Task 必须开启**自依赖**,确保上一个实例完成后才启动下一个实例。
74
+ Concurrent REFRESH on the same DT is prohibited (causes write conflicts or data inconsistency). The Task must enable **self-dependency** to ensure the next instance starts only after the previous one completes.
75
75
 
76
- ### 上游依赖配置
76
+ ### Upstream Dependency Configuration
77
77
 
78
- - 如果 DT 的源表数据需要等上游任务产出后才能刷新配置上游依赖
79
- - 如果源表数据不要求同步就绪(如实时写入表)→ 可以不配置上游依赖
78
+ - If the DT's source table data needs to wait for an upstream task to produce before refreshing → configure upstream dependency
79
+ - If source table data does not require synchronized readiness (e.g., real-time write table) → upstream dependency is optional
80
80
 
81
- ### 告警配置
81
+ ### Alert Configuration
82
82
 
83
- Studio Task 支持以下告警规则,生产环境建议全部配置:
83
+ Studio Tasks support the following alert rules; all are recommended for production environments:
84
84
 
85
- - **失败告警**:任务执行失败时通知
86
- - **超时告警**:刷新耗时超过阈值时通知(用于发现性能回退)
87
- - **未运行告警**:任务在预期时间内未启动时通知
85
+ - **Failure alert**: notify when task execution fails
86
+ - **Timeout alert**: notify when refresh duration exceeds a threshold (used to detect performance regression)
87
+ - **Not-run alert**: notify when the task has not started within the expected time
88
88
 
89
89
  ---
90
90
 
91
- ## 多级 DT 管道的调度编排
91
+ ## Scheduling Orchestration for Multi-level DT Pipelines
92
92
 
93
- 当存在多张 DT 形成上下游依赖时(如 DT_A → DT_B → DT_C),每张 DT 对应一个 Studio Task,通过任务依赖关系保证执行顺序:
93
+ When multiple DTs form upstream/downstream dependencies (e.g., DT_A → DT_B → DT_C), each DT corresponds to one Studio Task; task dependency relationships ensure execution order:
94
94
 
95
95
  ```
96
96
  Task_A (REFRESH DT_A)
97
- └─ Task_B (REFRESH DT_B,依赖 Task_A)
98
- └─ Task_C (REFRESH DT_C,依赖 Task_B)
97
+ └─ Task_B (REFRESH DT_B, depends on Task_A)
98
+ └─ Task_C (REFRESH DT_C, depends on Task_B)
99
99
  ```
100
100
 
101
- 不同分区的 REFRESH 可以并行执行(分配到不同 Task 实例),同一分区/非分区 DT 禁止并发。
101
+ REFRESHes for different partitions can run in parallel (assigned to different Task instances); concurrent refresh of the same partition/non-partitioned DT is prohibited.
102
102
 
103
103
  ---
104
104
 
105
- ## 判断逻辑:向用户推荐调度方式
105
+ ## Decision Logic: Recommend Scheduling Method to Users
106
106
 
107
- 在帮助用户创建或配置 DT 时,按以下逻辑推荐:
107
+ When helping users create or configure a DT, recommend based on the following logic:
108
108
 
109
- 1. **是否有 Studio?**
110
- - 始终推荐 Studio Task 调度,无论是开发还是生产环境
111
- - 使用 DDL 内置调度或第三方调度引擎
109
+ 1. **Is Studio available?**
110
+ - Yes → always recommend Studio Task scheduling, regardless of development or production environment
111
+ - No → use DDL built-in scheduling or a third-party scheduling engine
112
112
 
113
- 2. **是否有上下游依赖?**
114
- - 有(如源表由另一个任务产出)→ 必须用 Studio Task,配置上游依赖
115
- - 仍推荐 Studio Task,获得告警能力
113
+ 2. **Are there upstream/downstream dependencies?**
114
+ - Yes (e.g., source table is produced by another task) → must use Studio Task; configure upstream dependency
115
+ - No → still recommend Studio Task to gain alert capability
116
116
 
117
- 3. **用户已经写了 REFRESH INTERVAL 子句?**
118
- - 提示:可以去掉 REFRESH INTERVAL 子句,改用 Studio Task 调度,获得告警和依赖管理能力
119
- - REFRESH INTERVAL Studio Task 可以共存,但会导致双重触发,建议二选一
117
+ 3. **User has already written a REFRESH INTERVAL clause?**
118
+ - Suggest: the REFRESH INTERVAL clause can be removed and replaced with Studio Task scheduling to gain alert and dependency management capability
119
+ - REFRESH INTERVAL and Studio Task can coexist, but will cause double triggering; choosing one is recommended
120
120
 
121
121
  ---
122
122
 
123
- ## 告警话术模板
123
+ ## Alert Message Template
124
124
 
125
- 当用户使用 DDL 内置调度时,使用以下话术提示:
125
+ When the user is using DDL built-in scheduling, use the following message:
126
126
 
127
- > 💡 **建议**:您当前使用的是 DDL 内置调度(REFRESH INTERVAL),这种方式存在以下局限:
127
+ > 💡 **Suggestion**: You are currently using DDL built-in scheduling (REFRESH INTERVAL), which has the following limitations:
128
128
  >
129
- > 1. **无告警**:刷新失败不会主动通知,需要手动执行 `SHOW DYNAMIC TABLE REFRESH HISTORY` 查看状态
130
- > 2. **无依赖编排**:无法声明上下游任务依赖关系,只能靠时间间隔错开
129
+ > 1. **No alerts**: refresh failures are not proactively notified; you need to manually execute `SHOW DYNAMIC TABLE REFRESH HISTORY` to check status
130
+ > 2. **No dependency orchestration**: upstream/downstream task dependencies cannot be declared; can only stagger by time interval
131
131
  >
132
- > **推荐**:在 Studio 中创建定时调度任务,任务内容为 `REFRESH DYNAMIC TABLE schema.dt_name`,并配置:
133
- > - 自依赖(防止并发刷新)
134
- > - 失败告警 + 超时告警
135
- > - 上游依赖(如果源表由其他任务产出)
132
+ > **Recommendation**: Create a scheduled task in Studio with content `REFRESH DYNAMIC TABLE schema.dt_name`, and configure:
133
+ > - Self-dependency (prevent concurrent refresh)
134
+ > - Failure alert + timeout alert
135
+ > - Upstream dependency (if source table is produced by other tasks)
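
The scheduling decision logic in the guide above can be sketched as a function that turns three yes/no answers into a list of recommendations. This is an illustrative sketch only: `SchedulingContext` and `recommend_scheduling` are hypothetical names, and the message strings paraphrase the guide rather than quote any real tool output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SchedulingContext:
    studio_available: bool         # question 1 in the decision logic
    has_upstream_dependency: bool  # question 2
    uses_refresh_interval: bool    # question 3


def recommend_scheduling(ctx: SchedulingContext) -> List[str]:
    """Return recommendations in the order the guide asks its questions."""
    if not ctx.studio_available:
        # Without Studio, fall back to DDL scheduling or an external scheduler.
        return ["Use DDL built-in scheduling or a third-party scheduling engine."]

    notes = [
        "Use a Studio Task that runs REFRESH DYNAMIC TABLE, with self-dependency enabled."
    ]
    if ctx.has_upstream_dependency:
        notes.append("Configure an upstream dependency on the producing task.")
    if ctx.uses_refresh_interval:
        # Keeping both schedulers would trigger each refresh twice.
        notes.append("Remove the REFRESH INTERVAL clause to avoid double triggering.")
    return notes
```

Self-dependency is included in the base recommendation because the guide requires it for every Studio Task, regardless of the other answers.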
@@ -1,15 +1,16 @@
1
1
  ---
2
2
  name: dt-creator
3
3
  description: |
4
- 创建 Dynamic Table 的参考资料索引。涵盖静态分区 DT 与动态分区 DT 的声明策略、
5
- 增量计算支持的 SQL 模式、增量刷新配置项说明、以及刷新历史的查询方式。
4
+ Reference index for creating Dynamic Tables. Covers declaration strategies for static partition DT
5
+ vs dynamic partition DT, SQL patterns supported by incremental computation, incremental refresh
6
+ configuration options, and how to query refresh history.
6
7
  ---
7
8
 
8
- # DT Creator — 参考资料索引
9
+ # DT Creator — Reference Index
9
10
 
10
11
  ## references/
11
12
 
12
- - **dt-declaration-strategy.md** — DT 声明策略(静态分区 DT vs 动态分区 DT 的创建语法与选择)
13
- - **sql-limitations.md** — SQL 支持矩阵(JOIN、聚合、窗口函数、非确定性函数等的支持情况)
14
- - **incremental-config-reference.md** — 增量计算配置参考(刷新策略、源表特征声明、状态表管理等)
15
- - **refresh-history-guide.md** — 增量刷新历史查询(SHOW REFRESH HISTORY / DESC HISTORY / information_schema
13
+ - **dt-declaration-strategy.md** — DT declaration strategy (creation syntax and selection between static partition DT and dynamic partition DT)
14
+ - **sql-limitations.md** — SQL support matrix (support status for JOIN, aggregation, window functions, non-deterministic functions, etc.)
15
+ - **incremental-config-reference.md** — Incremental computation configuration reference (refresh strategy, source table characteristic declarations, state table management, etc.)
16
+ - **refresh-history-guide.md** — Incremental refresh history queries (SHOW REFRESH HISTORY / DESC HISTORY / information_schema)