@clickzetta/cz-cli-darwin-x64 0.3.92 → 0.3.93

Files changed (69)
  1. package/bin/cz-cli +0 -0
  2. package/bin/skills/clickzetta-ai-function/SKILL.md +109 -0
  3. package/bin/skills/clickzetta-ai-function/eval_cases.jsonl +4 -0
  4. package/bin/skills/clickzetta-ai-function/references/ai-function-ddl.md +106 -0
  5. package/bin/skills/clickzetta-batch-sync-pipeline/SKILL.md +124 -124
  6. package/bin/skills/clickzetta-batch-sync-pipeline/eval_cases.jsonl +5 -5
  7. package/bin/skills/clickzetta-bi-connect/SKILL.md +79 -78
  8. package/bin/skills/clickzetta-bi-connect/references/bi-tools.md +56 -56
  9. package/bin/skills/clickzetta-cdc-sync-pipeline/SKILL.md +386 -382
  10. package/bin/skills/clickzetta-cdc-sync-pipeline/eval_cases.jsonl +5 -5
  11. package/bin/skills/clickzetta-data-ingest-pipeline/SKILL.md +73 -212
  12. package/bin/skills/clickzetta-data-science/SKILL.md +57 -56
  13. package/bin/skills/clickzetta-data-science/references/bitmap-profile.md +38 -38
  14. package/bin/skills/clickzetta-data-science/references/data-patterns.md +16 -16
  15. package/bin/skills/clickzetta-data-science/references/setup.md +28 -28
  16. package/bin/skills/clickzetta-data-science/references/stats-functions.md +44 -44
  17. package/bin/skills/clickzetta-data-science/references/write-and-infer.md +22 -22
  18. package/bin/skills/clickzetta-data-science/references/zettapark-api.md +32 -32
  19. package/bin/skills/clickzetta-dw-modeling/SKILL.md +1 -1
  20. package/bin/skills/clickzetta-external-function/SKILL.md +51 -109
  21. package/bin/skills/clickzetta-external-function/eval_cases.jsonl +4 -4
  22. package/bin/skills/clickzetta-external-function/references/external-function-ddl.md +39 -77
  23. package/bin/skills/clickzetta-java-sdk/SKILL.md +49 -48
  24. package/bin/skills/clickzetta-java-sdk/eval_cases.jsonl +12 -12
  25. package/bin/skills/clickzetta-java-sdk/references/bulkload.md +34 -34
  26. package/bin/skills/clickzetta-java-sdk/references/realtime.md +44 -44
  27. package/bin/skills/clickzetta-kafka-ingest-pipeline/SKILL.md +273 -507
  28. package/bin/skills/clickzetta-kafka-ingest-pipeline/references/kafka-pipe-syntax.md +197 -231
  29. package/bin/skills/clickzetta-oss-ingest-pipeline/SKILL.md +231 -304
  30. package/bin/skills/clickzetta-realtime-sync-pipeline/SKILL.md +180 -179
  31. package/bin/skills/clickzetta-realtime-sync-pipeline/eval_cases.jsonl +5 -5
  32. package/bin/skills/clickzetta-semantic-view/SKILL.md +74 -72
  33. package/bin/skills/clickzetta-semantic-view/eval_cases.jsonl +12 -12
  34. package/bin/skills/clickzetta-semantic-view/references/semantic-view-reference.md +75 -75
  35. package/bin/skills/clickzetta-sql-migration/SKILL.md +128 -0
  36. package/bin/skills/clickzetta-sql-migration/eval_cases.jsonl +10 -0
  37. package/bin/skills/clickzetta-sql-migration/references/ddl-reference.md +350 -0
  38. package/bin/skills/clickzetta-sql-migration/references/dml-differences.md +192 -0
  39. package/bin/skills/clickzetta-sql-migration/references/dml-reference.md +279 -0
  40. package/bin/skills/{clickzetta-sql-syntax-guide → clickzetta-sql-migration}/references/dql-reference.md +128 -128
  41. package/bin/skills/clickzetta-sql-migration/references/function-mapping.md +194 -0
  42. package/bin/skills/clickzetta-sql-migration/references/functions-reference.md +372 -0
  43. package/bin/skills/clickzetta-sql-migration/references/implicit-type-conversion.md +143 -0
  44. package/bin/skills/clickzetta-sql-migration/references/migration-databricks.md +260 -0
  45. package/bin/skills/{clickzetta-sql-syntax-guide → clickzetta-sql-migration}/references/migration-snowflake.md +112 -112
  46. package/bin/skills/clickzetta-sql-migration/references/vs-snowflake.md +346 -0
  47. package/bin/skills/clickzetta-sql-migration/references/vs-spark.md +229 -0
  48. package/bin/skills/clickzetta-studio-task-manager/SKILL.md +326 -329
  49. package/bin/skills/clickzetta-table-lineage/SKILL.md +57 -55
  50. package/bin/skills/clickzetta-table-lineage/eval_cases.jsonl +1 -1
  51. package/bin/skills/clickzetta-table-lineage/references/normalize_func.sql +5 -5
  52. package/bin/skills/clickzetta-table-lineage/references/table_cost.sql +6 -6
  53. package/bin/skills/clickzetta-table-lineage/references/table_relation.sql +2 -2
  54. package/bin/skills/clickzetta-volume-manager/SKILL.md +186 -100
  55. package/bin/skills/clickzetta-volume-manager/references/volume-ddl.md +153 -52
  56. package/package.json +1 -1
  57. package/bin/skills/clickzetta-dynamic-table/best-practices/scheduling-guide.md +0 -135
  58. package/bin/skills/clickzetta-dynamic-table/dt-creator/references/dt-declaration-strategy.md +0 -185
  59. package/bin/skills/clickzetta-dynamic-table/dt-creator/references/refresh-history-guide.md +0 -260
  60. package/bin/skills/clickzetta-dynamic-table/dynamic-table-alter/SKILL.md +0 -191
  61. package/bin/skills/clickzetta-sql-syntax-guide/SKILL.md +0 -249
  62. package/bin/skills/clickzetta-sql-syntax-guide/eval_cases.jsonl +0 -3
  63. package/bin/skills/clickzetta-sql-syntax-guide/references/ddl-reference.md +0 -350
  64. package/bin/skills/clickzetta-sql-syntax-guide/references/dml-reference.md +0 -279
  65. package/bin/skills/clickzetta-sql-syntax-guide/references/functions-reference.md +0 -372
  66. package/bin/skills/clickzetta-sql-syntax-guide/references/migration-databricks.md +0 -260
  67. package/bin/skills/clickzetta-sql-syntax-guide/references/vs-snowflake.md +0 -346
  68. package/bin/skills/clickzetta-sql-syntax-guide/references/vs-spark.md +0 -229
  69. /package/bin/skills/{clickzetta-sql-syntax-guide → clickzetta-sql-migration}/LICENSE +0 -0
@@ -1,370 +1,367 @@
1
1
  ---
2
2
  name: clickzetta-studio-task-manager
3
3
  description: |
4
- 管理 ClickZetta Lakehouse Studio 任务,覆盖任务类型说明(离线同步/多表离线同步/实时同步/
5
- 多表实时同步/数据开发)、任务目录组织、任务类型区分、cz-cli task 命令族、
6
- 调度配置、依赖管理和常见问题排查。实现"建管分离"工程规范:DDL 任务草稿化、ETL 任务调度化、
7
- Dynamic Table 自动刷新化。
8
- 当用户说"创建 Studio 任务"、"任务目录"、"任务调度"、"cz-cli task"、"任务依赖"、
9
- "任务失败""任务状态""整库同步任务""ETL 任务编排"、"任务管理"、
10
- "建管分离""DDL 任务""调度 DAG""任务文件夹""Studio 任务"
11
- "离线同步""实时同步""多表实时同步""数据开发任务""任务类型"
12
- "选哪种同步""同步任务区别"时触发。
4
+ Manage ClickZetta Lakehouse Studio tasks, covering task type descriptions (batch sync/multi-table batch sync/
5
+ real-time sync/multi-table real-time sync/data development), task folder organization, task type differentiation,
6
+ cz-cli task command family, scheduling configuration, dependency management, and common issue troubleshooting.
7
+ Implements the "separation of DDL and pipeline management" engineering standard: DDL tasks as drafts,
8
+ ETL tasks with scheduling, Dynamic Tables with auto-refresh.
9
+ Triggered when the user says "create Studio task", "task folder", "task scheduling", "cz-cli task",
10
+ "task dependency", "task failed", "task status", "full database sync task", "ETL task orchestration",
11
+ "task management", "separation of DDL and pipeline", "DDL task", "scheduling DAG", "task folder",
12
+ "Studio task", "batch sync", "real-time sync", "multi-table real-time sync", "data development task",
13
+ "task types", "which sync to choose", "sync task differences".
13
14
  Keywords: Studio task, task management, cz-cli task, scheduling, DAG, DDL draft, ETL pipeline, task folder, offline sync, realtime sync, CDC, task types
14
15
  ---
15
16
 
16
- # ClickZetta Studio 任务管理
17
+ # ClickZetta Studio Task Management
17
18
 
18
- ## 向导:明确操作意图
19
+ ## Wizard: Clarify Intent
19
20
 
20
- 收到任务管理请求后,优先使用交互式问答工具(如 `question`)收集意图;若无此类工具,则用文字列出选项:
21
+ Upon receiving a task management request, use an interactive question tool (e.g., `question`) to collect intent. If no such tool is available, list options in text:
21
22
 
22
23
  ```
23
24
  question({
24
25
  questions: [{
25
- question: "你想做什么?",
26
+ question: "What would you like to do?",
26
27
  options: [
27
- { label: "从零搭建新管道", description: "创建目录、DDL 任务、同步任务、ETL 任务" },
28
- { label: "管理现有任务", description: "查看状态、修改配置、配置依赖、重跑、补数" },
29
- { label: "排查任务问题", description: "失败诊断、依赖检查、日志分析加载 clickzetta-pipeline-review" },
30
- { label: "规范检查", description: "检查现有任务是否符合建管分离规范" }
28
+ { label: "Build a new pipeline from scratch", description: "Create folders, DDL tasks, sync tasks, ETL tasks" },
29
+ { label: "Manage existing tasks", description: "View status, modify config, configure dependencies, rerun, backfill" },
30
+ { label: "Troubleshoot task issues", description: "Failure diagnosis, dependency check, log analysis → load clickzetta-pipeline-review" },
31
+ { label: "Standards compliance check", description: "Check if existing tasks follow separation of DDL and pipeline standards" }
31
32
  ]
32
33
  }]
33
34
  })
34
35
  ```
35
36
 
36
- **如果用户已经明确说了要做什么,直接执行,不再询问。**
37
+ **If the user has clearly stated what they want to do, proceed directly without asking.**
37
38
 
38
- 对于**从零搭建**,还需收集:业务域/项目名称、数据源类型、分层结构。
39
+ For **building from scratch**, also collect: business domain/project name, data source type, layering structure.
39
40
 
40
- ## 数据管道向导(从零搭建时使用)
41
+ ## Data Pipeline Wizard (Used When Building from Scratch)
41
42
 
42
- 完整流程:**需求理解数据探索技术选型方案确认执行**
43
+ Full process: **Requirements Understanding → Data Exploration → Technical Selection → Plan Confirmation → Execution**
43
44
 
44
45
  ---
45
46
 
46
- ### Step 0:需求输入
47
+ ### Step 0: Requirements Input
47
48
 
48
- **优先询问用户是否有需求文档**(PRD、需求说明、数仓设计文档等):
49
+ **First ask if the user has a requirements document** (PRD, requirements spec, data warehouse design doc, etc.):
49
50
 
50
51
  ```
51
52
  question({
52
53
  questions: [{
53
- question: "开始前,你有需求文档或背景说明吗?",
54
+ question: "Before we start, do you have a requirements document or background description?",
54
55
  options: [
55
- { label: "有,我来提供", description: "请粘贴文档内容或上传文件,我来提取关键信息" },
56
- { label: "没有,口头描述", description: "我来引导你回答几个关键问题" }
56
+ { label: "Yes, I'll provide it", description: "Paste document content or upload file, I'll extract key information" },
57
+ { label: "No, I'll describe verbally", description: "I'll guide you through a few key questions" }
57
58
  ]
58
59
  }]
59
60
  })
60
61
  ```
61
62
 
62
- **如果有文档**:读取文档,自动提取业务场景、数据源、目标产出、时效性要求,跳到 Step 1
63
+ **If document provided**: read the document, auto-extract business scenario, data sources, target outputs, freshness requirements, skip to Step 1.
63
64
 
64
- **如果没有文档**,收集以下业务需求(优先使用交互式工具;若无,一次性文字列出):
65
+ **If no document**, collect the following business requirements (prefer interactive tools; if unavailable, list all in text):
65
66
 
66
67
  ```
67
68
  question({
68
69
  questions: [
69
70
  {
70
- question: "这个管道服务于什么业务场景?",
71
+ question: "What business scenario does this pipeline serve?",
71
72
  options: [
72
- { label: "BI 报表 / 数据看板", description: "固定报表,指标体系明确,T+1 或小时级" },
73
- { label: "实时监控 / 运营看板", description: "分钟级延迟,关注实时指标" },
74
- { label: "数据科学 / 特征工程", description: "供模型训练或推理使用" },
75
- { label: "数据共享 / 对外输出", description: "提供给其他系统或团队使用" }
73
+ { label: "BI reports / dashboards", description: "Fixed reports, clear metric system, T+1 or hourly" },
74
+ { label: "Real-time monitoring / ops dashboard", description: "Minute-level latency, focus on real-time metrics" },
75
+ { label: "Data science / feature engineering", description: "For model training or inference" },
76
+ { label: "Data sharing / external output", description: "Provided to other systems or teams" }
76
77
  ]
77
78
  },
78
79
  {
79
- question: "数据消费方是谁?",
80
+ question: "Who are the data consumers?",
80
81
  options: [
81
- { label: "BI 工具(Superset/Tableau 等)", description: "需要宽表或聚合表" },
82
- { label: "数据分析师(SQL 查询)", description: "需要清洗后的明细表" },
83
- { label: "下游系统 / API", description: "需要结构化输出" },
84
- { label: "数据科学家(Python/ZettaPark", description: "需要特征表或原始明细" }
82
+ { label: "BI tools (Superset/Tableau, etc.)", description: "Need wide tables or aggregation tables" },
83
+ { label: "Data analysts (SQL queries)", description: "Need cleaned detail tables" },
84
+ { label: "Downstream systems / APIs", description: "Need structured output" },
85
+ { label: "Data scientists (Python/ZettaPark)", description: "Need feature tables or raw detail" }
85
86
  ]
86
87
  },
87
88
  {
88
- question: "数据时效性要求?",
89
+ question: "Data freshness requirements?",
89
90
  options: [
90
- { label: "T+1(次日可用)", description: "每天凌晨跑批,早上数据就绪" },
91
- { label: "小时级", description: "每小时更新一次" },
92
- { label: "分钟级", description: "近实时,延迟 < 10 分钟" },
93
- { label: "秒级实时", description: "CDC 持续同步,秒级延迟" }
91
+ { label: "T+1 (available next day)", description: "Batch run at midnight, data ready in the morning" },
92
+ { label: "Hourly", description: "Updated every hour" },
93
+ { label: "Minute-level", description: "Near real-time, latency < 10 minutes" },
94
+ { label: "Second-level real-time", description: "CDC continuous sync, second-level latency" }
94
95
  ]
95
96
  }
96
97
  ]
97
98
  })
98
99
  ```
99
100
 
100
- 还需口头确认(文字追问,不用菜单):
101
- - **核心指标口径**:如果涉及 GMV、活跃用户等业务指标,确认计算口径
102
- - **项目/业务域名称**:用于任务目录和 Schema 命名(如 `ecommerce_dw`)
101
+ Also confirm verbally (text follow-up, no menu needed):
102
+ - **Core metric definitions**: if involving GMV, active users, or other business metrics, confirm calculation logic
103
+ - **Project/business domain name**: used for task folder and schema naming (e.g., `ecommerce_dw`)
103
104
 
104
105
  ---
105
106
 
106
- ### Step 1:数据探索(AI 自主执行,不问用户)
107
+ ### Step 1: Data Exploration (AI executes autonomously, no user input needed)
107
108
 
108
- 收集到需求后,立即探查数据现状:
109
+ After collecting requirements, immediately explore the current data state:
109
110
 
110
111
  ```sql
111
- -- 查看相关 schema 和表
112
+ -- View related schemas and tables
112
113
  SHOW SCHEMAS;
113
- SHOW TABLES IN <相关schema>;
114
+ SHOW TABLES IN <relevant_schema>;
114
115
 
115
- -- 查表大小和行数
116
+ -- Check table sizes and row counts
116
117
  SELECT table_schema, table_name,
117
118
  ROUND(bytes/1024.0/1024/1024, 2) AS size_gb, row_count
118
119
  FROM information_schema.tables
119
120
  WHERE table_type = 'MANAGED_TABLE'
120
121
  ORDER BY bytes DESC NULLS LAST LIMIT 20;
121
122
 
122
- -- 抽样了解字段含义
123
+ -- Sample to understand field meanings
123
124
  SELECT * FROM <schema>.<table> LIMIT 5;
124
125
  ```
125
126
 
126
- 同时用 `cz-cli datasource list` 查看已配置的外部数据源。
127
+ Also use `cz-cli datasource list` to view configured external data sources.
127
128
 
128
129
  ---
129
130
 
130
- ### Step 2:技术选型(选数据源类型和接入方式)
131
+ ### Step 2: Technical Selection (Choose data source type and ingestion method)
131
132
 
132
- 基于需求和数据探索结果,用交互式工具收集技术选型:
133
+ Based on requirements and data exploration results, use interactive tools to collect technical choices:
133
134
 
134
- **选数据源类型:**
135
+ **Select data source type:**
135
136
  ```
136
- question({ questions: [{ question: "数据来自哪里?", options: [
137
- { label: "外部数据库", description: "MySQL / PostgreSQL / SQL Server / Oracle " },
138
- { label: "Kafka 消息队列", description: "Kafka Topic → Lakehouse" },
139
- { label: "对象存储", description: "OSS / S3 / COS 文件导入" },
140
- { label: "Lakehouse 内部 ETL 分层", description: "ODS→DWD→DWS/ADSSQL 任务 + Dynamic Table" },
141
- { label: "端到端完整管道", description: "数据接入 + 分层建模 + 聚合" },
142
- { label: "不确定,先探索数据", description: "先看现有数据再给方案建议" }
137
+ question({ questions: [{ question: "Where does the data come from?", options: [
138
+ { label: "External database", description: "MySQL / PostgreSQL / SQL Server / Oracle, etc." },
139
+ { label: "Kafka message queue", description: "Kafka Topic → Lakehouse" },
140
+ { label: "Object storage", description: "OSS / S3 / COS file import" },
141
+ { label: "Lakehouse internal ETL layering", description: "ODS→DWD→DWS/ADS, SQL tasks + Dynamic Table" },
142
+ { label: "End-to-end complete pipeline", description: "Data ingestion + layered modeling + aggregation" },
143
+ { label: "Not sure, explore data first", description: "Look at existing data before recommending an approach" }
143
144
  ]}]})
144
145
  ```
145
146
 
146
- **追问(仅部分选项需要):**
147
+ **Follow-up (only needed for certain options):**
147
148
 
148
- 选了"外部数据库"
149
+ Selected "External database":
149
150
  ```
150
- question({ questions: [{ question: "同步时效性?", options: [
151
- { label: "实时同步(秒级)", description: "CDC,基于 Binlog/WALs,持续运行" },
152
- { label: "离线批量(小时/天级)", description: "周期性全量同步,配置 Cron" }
151
+ question({ questions: [{ question: "Sync freshness?", options: [
152
+ { label: "Real-time sync (second-level)", description: "CDC, based on Binlog/WALs, continuously running" },
153
+ { label: "Batch offline (hourly/daily)", description: "Periodic full sync, configure Cron" }
153
154
  ]}]})
154
155
  ```
155
156
 
156
- 选了"对象存储"
157
+ Selected "Object storage":
157
158
  ```
158
- question({ questions: [{ question: "接入方式?", options: [
159
- { label: "SQL Pipe(持续自动导入)", description: "LIST_PURGE EVENT_NOTIFICATION 模式" },
160
- { label: "Studio 离线同步任务", description: "周期性批量导入,配置 Cron" }
159
+ question({ questions: [{ question: "Ingestion method?", options: [
160
+ { label: "SQL Pipe (continuous auto-import)", description: "LIST_PURGE or EVENT_NOTIFICATION mode" },
161
+ { label: "Studio batch sync task", description: "Periodic batch import, configure Cron" }
161
162
  ]}]})
162
163
  ```
163
164
 
164
- 选了"Kafka"
165
+ Selected "Kafka":
165
166
  ```
166
- question({ questions: [{ question: "接入方式?", options: [
167
- { label: "SQL PipeREAD_KAFKA", description: " SQL,灵活,推荐工程师使用" },
168
- { label: "Studio 实时同步任务", description: "图形化配置,支持 JSONPath 计算列" }
167
+ question({ questions: [{ question: "Ingestion method?", options: [
168
+ { label: "SQL Pipe (READ_KAFKA)", description: "Pure SQL, flexible, recommended for engineers" },
169
+ { label: "Studio real-time sync task", description: "GUI configuration, supports JSONPath computed columns" }
169
170
  ]}]})
170
171
  ```
171
172
 
172
173
  ---
173
174
 
174
- ### Step 3:方案确认(必须执行,不得跳过)
175
+ ### Step 3: Plan Confirmation (Must execute, cannot skip)
175
176
 
176
- 综合需求和技术选型,向用户呈现完整方案摘要,请求确认:
177
+ Combining requirements and technical selection, present a complete plan summary to the user for confirmation:
177
178
 
178
179
  ```
179
180
  question({
180
181
  questions: [{
181
- question: "确认以下方案后开始构建:\n业务场景:<场景>\n数据源:<数据源名称>\n同步方式:<离线/实时/SQL Pipe>\n分层结构:<ODS/DWD/DWS Bronze/Silver/Gold>\n目标 Schema:<schema>\n调度:<Cron 或持续运行>\n是否开始?",
182
+ question: "Confirm the following plan to start building:\nBusiness scenario: <scenario>\nData source: <source_name>\nSync method: <batch/real-time/SQL Pipe>\nLayering structure: <ODS/DWD/DWS or Bronze/Silver/Gold>\nTarget schema: <schema>\nScheduling: <Cron or continuous running>\nReady to start?",
182
183
  options: [
183
- { label: "确认,开始构建", description: "加载对应 skill,开始创建任务" },
184
- { label: "需要调整", description: "重新收集信息" }
184
+ { label: "Confirmed, start building", description: "Load corresponding skill, begin creating tasks" },
185
+ { label: "Need adjustments", description: "Re-collect information" }
185
186
  ]
186
187
  }]
187
188
  })
188
189
  ```
189
190
 
190
- 用户确认后,按路由表加载对应 skill
191
+ After user confirmation, load the corresponding skill per routing table:
191
192
 
192
- **路由表**
193
+ **Routing Table**
193
194
 
194
- | 数据源 | 时效性/方式 | 加载 skill |
195
- |---|---|---|
196
- | 外部数据库 | 实时单表 CDC | `clickzetta-realtime-sync-pipeline` |
197
- | 外部数据库 | 实时多表/整库 CDC | `clickzetta-cdc-sync-pipeline` |
198
- | 外部数据库 | 离线批量 | `clickzetta-batch-sync-pipeline` |
195
+ | Data Source | Freshness/Method | Load Skill |
196
+ |-------------|-----------------|------------|
197
+ | External database | Real-time single-table CDC | `clickzetta-realtime-sync-pipeline` |
198
+ | External database | Real-time multi-table/full database CDC | `clickzetta-cdc-sync-pipeline` |
199
+ | External database | Batch offline | `clickzetta-batch-sync-pipeline` |
199
200
  | Kafka | SQL Pipe | `clickzetta-kafka-ingest-pipeline` |
200
- | Kafka | Studio 实时同步 | `clickzetta-realtime-sync-pipeline` |
201
- | 对象存储 | SQL Pipe | `clickzetta-oss-ingest-pipeline` |
202
- | 对象存储 | Studio 离线同步 | `clickzetta-batch-sync-pipeline` |
203
- | Lakehouse 内部 ETL 分层 | — | `clickzetta-sql-pipeline-manager` |
204
- | 端到端完整管道 / 不确定 | — | `clickzetta-dw-modeling` |
201
+ | Kafka | Studio real-time sync | `clickzetta-realtime-sync-pipeline` |
202
+ | Object storage | SQL Pipe | `clickzetta-oss-ingest-pipeline` |
203
+ | Object storage | Studio batch sync | `clickzetta-batch-sync-pipeline` |
204
+ | Lakehouse internal ETL layering | — | `clickzetta-sql-pipeline-manager` |
205
+ | End-to-end complete pipeline / Not sure | — | `clickzetta-dw-modeling` |
205
206
 
206
- > 实时 CDC 单表 vs 多表:用户说"整库""多张表"→ `cdc-sync-pipeline`;"一张表"→ `realtime-sync-pipeline`;不确定时追问。
207
+ > Real-time CDC single vs multi-table: user says "full database" or "multiple tables" → `cdc-sync-pipeline`; "one table" → `realtime-sync-pipeline`; if unclear, ask.
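
The routing table above can be read as a plain lookup. The following Python sketch is illustrative only and is not part of the package: `ROUTES` and `route_skill` are hypothetical names, and the keys simply restate the table rows in lowercase. A result of `None` corresponds to the "if unclear, ask" case.

```python
from typing import Optional

# Hypothetical lookup that restates the routing table; not part of cz-cli.
ROUTES = {
    ("external database", "real-time single-table cdc"): "clickzetta-realtime-sync-pipeline",
    ("external database", "real-time multi-table cdc"): "clickzetta-cdc-sync-pipeline",
    ("external database", "batch offline"): "clickzetta-batch-sync-pipeline",
    ("kafka", "sql pipe"): "clickzetta-kafka-ingest-pipeline",
    ("kafka", "studio real-time sync"): "clickzetta-realtime-sync-pipeline",
    ("object storage", "sql pipe"): "clickzetta-oss-ingest-pipeline",
    ("object storage", "studio batch sync"): "clickzetta-batch-sync-pipeline",
    ("lakehouse internal etl layering", None): "clickzetta-sql-pipeline-manager",
    ("end-to-end complete pipeline", None): "clickzetta-dw-modeling",
    ("not sure", None): "clickzetta-dw-modeling",
}


def route_skill(source: str, method: Optional[str] = None) -> Optional[str]:
    """Return the skill name for a (source, method) pair, or None if unknown."""
    key = (source.strip().lower(), method.strip().lower() if method else None)
    return ROUTES.get(key)
```

For example, `route_skill("Kafka", "SQL Pipe")` resolves to the Kafka ingest skill, while an unrecognized combination returns `None` so the caller can ask a follow-up question.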
207
208
 
208
209
  ---
209
210
 
210
- ## Studio 任务类型说明
211
+ ## Studio Task Types
211
212
 
212
- Studio 提供四大类任务,选错类型是最常见的工程错误:
213
+ Studio provides four major task categories. Choosing the wrong type is the most common engineering mistake:
213
214
 
214
- ### 离线同步(单表)
215
- 将单张源表周期性全量同步到 Lakehouse
215
+ ### Batch Sync (Single-table)
216
+ Periodically full-sync a single source table to Lakehouse.
216
217
 
217
- - **适用场景**:单表定期覆盖更新、数据时效性要求不高(按天/小时批量)、资源优化(不需要实时)
218
- - **运行模式**:周期调度(需配置 Cron),每次全量覆盖或追加
219
- - **数据源**:MySQLPostgreSQLSQL Server 等关系型数据库
220
- - **对应 skill**:`clickzetta-batch-sync-pipeline`(单表模式)
218
+ - **Use cases**: single-table periodic overwrite, low data freshness requirements (daily/hourly batch), resource optimization (no real-time needed)
219
+ - **Run mode**: scheduled (Cron required), full overwrite or append each run
220
+ - **Data sources**: MySQL, PostgreSQL, SQL Server, and other relational databases
221
+ - **Corresponding skill**: `clickzetta-batch-sync-pipeline` (single-table mode)
221
222
 
222
- ### 多表离线同步
223
- 将多张源表或整库周期性批量同步到 Lakehouse
223
+ ### Multi-table Batch Sync
224
+ Periodically batch-sync multiple tables or an entire database to Lakehouse.
224
225
 
225
- - **适用场景**:
226
- - 整库迁移(批量同步所有表,减少逐表配置工作量)
227
- - 分库分表合并(多个分库分表合并到统一目标表)
228
- - 定期数据校准(周期性全量同步确保目标端与源端一致)
229
- - **运行模式**:周期调度(需配置 Cron),支持整库镜像、多表镜像、多表合并三种模式
230
- - **数据源**:MySQLPostgreSQLSQL Server
231
- - **对应 skill**:`clickzetta-batch-sync-pipeline`(多表模式)
226
+ - **Use cases**:
227
+ - Full database migration (batch sync all tables, reducing per-table configuration effort)
228
+ - Sharded table merge (merge multiple sharded tables into a unified target table)
229
+ - Periodic data calibration (periodic full sync to ensure target matches source)
230
+ - **Run mode**: scheduled (Cron required), supports full database mirror, multi-table mirror, and sharded table merge modes
231
+ - **Data sources**: MySQL, PostgreSQL, SQL Server, etc.
232
+ - **Corresponding skill**: `clickzetta-batch-sync-pipeline` (multi-table mode)
232
233
 
233
- ### 实时同步(单表)
234
- 将单张 Kafka Topic 数据持续实时同步到 Lakehouse
234
+ ### Real-time Sync (Single-table)
235
+ Continuously sync a single Kafka topic to Lakehouse in real time.
235
236
 
236
- - **适用场景**:Kafka 消息流实时入湖、秒级/分钟级延迟要求、单 Topic 精细化同步
237
- - **运行模式**:持续运行(无需配置 Cron,提交即运行)
238
- - **数据源**:**仅支持 Kafka**(JSON 消息解析,支持 JSONPath 计算列)
239
- - **对应 skill**:`clickzetta-realtime-sync-pipeline`
237
+ - **Use cases**: Kafka message stream real-time ingestion, second/minute-level latency requirements, single-topic fine-grained sync
238
+ - **Run mode**: continuously running (no Cron needed, runs upon submission)
239
+ - **Data sources**: **Kafka only** (JSON message parsing, supports JSONPath computed columns)
240
+ - **Corresponding skill**: `clickzetta-realtime-sync-pipeline`
240
241
 
241
- ### 多表实时同步(CDC
242
- MySQL / PostgreSQL 整库或多表通过 CDC 实时同步到 Lakehouse,包含全量 + 增量两阶段。
242
+ ### Multi-table Real-time Sync (CDC)
243
+ Sync entire MySQL / PostgreSQL databases or multiple tables to Lakehouse via CDC in real time, with full load + incremental two-phase sync.
243
244
 
244
- - **适用场景**:数据库整库实时镜像、秒级端到端时效性、分库分表实时合并
245
- - **运行模式**:持续运行(无需配置 Cron,提交即运行)
246
- - **数据源**:
245
+ - **Use cases**: full database real-time mirror, second-level end-to-end latency, sharded table real-time merge
246
+ - **Run mode**: continuously running (no Cron needed, runs upon submission)
247
+ - **Data sources**:
247
248
 
248
- | 类型 | 增量读取模式 | 支持版本 |
249
- |---|---|---|
250
- | MySQL 类(含 Aurora MySQLPolarDB MySQL | Binlog | 5.6 及以上、8.x |
251
- | PostgreSQL 类(含 Aurora PGPolarDB PG | WALs 日志 | 14 及以上 |
249
+ | Type | Incremental Read Mode | Supported Versions |
250
+ |------|----------------------|-------------------|
251
+ | MySQL (including Aurora MySQL, PolarDB MySQL) | Binlog | 5.6+, 8.x |
252
+ | PostgreSQL (including Aurora PG, PolarDB PG) | WALs | 14+ |
252
253
 
253
- - **对应 skill**:`clickzetta-cdc-sync-pipeline`
254
+ - **Corresponding skill**: `clickzetta-cdc-sync-pipeline`
254
255
 
255
- ### 数据开发任务(SQL / Python / Shell
256
- Studio 中编写和调度数据处理逻辑,是数仓 ETL 的核心载体。
256
+ ### Data Development Tasks (SQL / Python / Shell)
257
+ Write and schedule data processing logic in Studio, the core vehicle for data warehouse ETL.
257
258
 
258
- - **SQL 任务**:ODS→DWD 清洗转换、数据质量检查、临时数据修复
259
- - **Python 任务**:自定义数据处理脚本、调用外部 API、机器学习推理
260
- - **Shell 任务**:系统命令、文件操作、调用外部工具
261
- - **运行模式**:周期调度(配置 Cron)或手动触发
262
- - **对应 skill**:`clickzetta-studio-task-manager`(本 skill
259
+ - **SQL tasks**: ODS→DWD cleaning/transformation, data quality checks, ad-hoc data repairs
260
+ - **Python tasks**: custom data processing scripts, external API calls, ML inference
261
+ - **Shell tasks**: system commands, file operations, external tool invocations
262
+ - **Run mode**: scheduled (Cron) or manual trigger
263
+ - **Corresponding skill**: `clickzetta-studio-task-manager` (this skill)
263
264
 
264
- ### 四类任务对比速查
265
+ ### Four Task Types Quick Comparison
265
266
 
266
- | 任务类型 | 数据源 | 同步粒度 | 运行模式 | 时效性 |
267
- |---|---|---|---|---|
268
- | 离线同步 | 关系型数据库 | 单表 | 周期调度 | 小时/天级 |
269
- | 多表离线同步 | 关系型数据库 | 多表/整库 | 周期调度 | 小时/天级 |
270
- | 实时同步 | **仅 Kafka** | Topic | 持续运行 | 秒/分钟级 |
271
- | 多表实时同步 | MySQL / PostgreSQL | 多表/整库 | 持续运行 | 秒级 |
272
- | 数据开发 | 任意(SQL/Python/Shell | 自定义逻辑 | 周期调度或手动 | 取决于调度频率 |
267
+ | Task Type | Data Source | Sync Granularity | Run Mode | Freshness |
268
+ |-----------|------------|-----------------|----------|-----------|
269
+ | Batch Sync | Relational DB | Single table | Scheduled | Hourly/daily |
270
+ | Multi-table Batch Sync | Relational DB | Multi-table/full DB | Scheduled | Hourly/daily |
271
+ | Real-time Sync | **Kafka only** | Single topic | Continuously running | Seconds/minutes |
272
+ | Multi-table Real-time Sync | MySQL / PostgreSQL | Multi-table/full DB | Continuously running | Seconds |
273
+ | Data Development | Any (SQL/Python/Shell) | Custom logic | Scheduled or manual | Depends on schedule frequency |
273
274
 
274
275
  ---
275
276
 
276
- ## 核心原则:建管分离
277
+ ## Core Principle: Separation of DDL and Pipeline Management
277
278
 
278
- **不同类型的任务,调度策略完全不同。** 混淆任务类型是最常见的工程错误。
279
+ **Different task types have completely different scheduling strategies.** Confusing task types is the most common engineering mistake.
279
280
 
280
- | 任务类型 | 典型内容 | Studio 任务类型 | 调度配置 | 状态 |
281
- |---|---|---|---|---|
282
- | **DDL 建表任务** | CREATE TABLE / CREATE SCHEMA | SQL 任务 | ❌ 禁止 Cron,禁止依赖 | DRAFT |
283
- | **数据同步任务** | 外部数据源(关系型数据库/对象存储)→ ODS | **SINGLE_DI / MULTI_DI / REALTIME**(不是 SQL 任务) | ✅ 配置 Cron(离线)或持续运行(实时) | PUBLISHED |
284
- | **ETL 转换任务** | ODS→DWD 清洗 SQLLakehouse 内部) | SQL 任务 | ✅ 配置 Cron + 依赖上游同步 | PUBLISHED |
285
- | **数据质量任务** | 行数检查、NULL 率验证 | SQL 任务 | ✅ 配置 Cron + 依赖 ETL | PUBLISHED |
286
- | **DWS/ADS 聚合层** | 指标汇总、报表宽表 | ❌ 使用 Dynamic Table,不建任务 | — | — |
281
+ | Task Type | Typical Content | Studio Task Type | Scheduling | Status |
282
+ |-----------|----------------|-----------------|------------|--------|
283
+ | **DDL table creation** | CREATE TABLE / CREATE SCHEMA | SQL task | ❌ No Cron, no dependencies | DRAFT |
284
+ | **Data sync tasks** | External source (relational DB/object storage) → ODS | **SINGLE_DI / MULTI_DI / REALTIME** (not SQL tasks) | ✅ Configure Cron (batch) or continuous (real-time) | PUBLISHED |
285
+ | **ETL transformation** | ODS→DWD cleaning SQL (Lakehouse internal) | SQL task | ✅ Configure Cron + depend on upstream sync | PUBLISHED |
286
+ | **Data quality tasks** | Row count checks, NULL rate validation | SQL task | ✅ Configure Cron + depend on ETL | PUBLISHED |
287
+ | **DWS/ADS aggregation** | Metric summaries, report wide tables | ❌ Use Dynamic Table, no task needed | — | — |
287
288
 
288
- **数据同步任务支持的数据源:**
289
- - 离线同步(SINGLE_DI/MULTI_DI):MySQLPostgreSQLOracleSQL Server 等关系型数据库,以及 OSS/COS/S3 对象存储
290
- - 实时同步单表(REALTIME):Kafka
291
- - 多表实时同步 CDCMySQLBinlog5.6+/8.x)、PostgreSQLWALs14+)
289
+ **Data sync task supported data sources:**
290
+ - Batch sync (SINGLE_DI/MULTI_DI): MySQL, PostgreSQL, Oracle, SQL Server and other relational databases, plus OSS/COS/S3 object storage
291
+ - Single-table real-time sync (REALTIME): Kafka
292
+ - Multi-table real-time sync CDC: MySQL (Binlog, 5.6+/8.x), PostgreSQL (WALs, 14+)
292
293
 
293
- **其他数据访问方式(不是数据同步任务):**
294
- - Kafka/OSS/S3/COS → 也可以用 SQL Pipe(`READ_KAFKA`/Volume Pipe),与 Studio 同步任务都合法,根据场景选择
295
- - Hive/Databricks/Snowflake Open Catalog → External Catalog 联邦只读查询,不做数据同步
294
+ **Other data access methods (not data sync tasks):**
295
+ - Kafka/OSS/S3/COS → can also use SQL Pipe (`READ_KAFKA`/Volume Pipe), both Studio sync tasks and SQL Pipes are valid — choose based on scenario
296
+ - Hive/Databricks/Snowflake Open Catalog → External Catalog federated read-only queries, not data sync
296
297
 
297
- > ⚠️ **DDL 任务绝对不能配 Cron**:建表语句重复执行会引发 `SCHEDULE_TASK_HAD_CHILDREN_NODES_EXCEPTION` 等调度冲突。DDL 任务执行完成后立即降级为 DRAFT
298
+ > ⚠️ **DDL tasks must never have Cron**: repeated execution of CREATE TABLE statements causes `SCHEDULE_TASK_HAD_CHILDREN_NODES_EXCEPTION` and other scheduling conflicts. DDL tasks should be demoted to DRAFT immediately after execution.
298
299
 
299
- > ⚠️ **DWS/ADS 层不要建调度任务**:Dynamic Table 系统自动刷新,额外建任务是冗余计算,浪费资源。
300
+ > ⚠️ **Do not create scheduled tasks for the DWS/ADS layer**: the system refreshes Dynamic Tables automatically, so an additional scheduled task only repeats computation and wastes resources.
300
301
 
301
- > ⚠️ **严禁用 SQL 任务替代数据同步任务**:不能用 SQL 任务写 `SELECT FROM EXTERNAL` 模拟同步(语法不支持),也不能用 JDBC 任务(JDBC 只能在外部数据库执行 SQL,不支持将数据同步到 Lakehouse)。
302
+ > ⚠️ **Never use SQL tasks as a substitute for data sync tasks**: you cannot use SQL tasks to write `SELECT FROM EXTERNAL` to simulate sync (syntax not supported), nor use JDBC tasks (JDBC can only execute SQL on external databases, cannot sync data to Lakehouse).
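
The scheduling rules in the table and warnings above could be checked mechanically before tasks are published. This is a minimal sketch under stated assumptions: the `Task` dataclass, its field names, and `check_task` are hypothetical and do not correspond to any cz-cli or Studio API. Only the DDL, ETL, and data quality rows are covered.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    # Hypothetical in-memory description of a Studio task; field names are illustrative.
    name: str
    kind: str                      # "ddl", "etl", or "dqc"
    cron: Optional[str] = None     # Cron expression, or None if unscheduled
    depends_on: List[str] = field(default_factory=list)
    status: str = "DRAFT"


def check_task(task: Task) -> List[str]:
    """Return the rule violations for one task, following the table above."""
    problems = []
    if task.kind == "ddl":
        # DDL tasks: no Cron, no dependencies, kept as DRAFT.
        if task.cron:
            problems.append("DDL task must not have Cron")
        if task.depends_on:
            problems.append("DDL task must not have dependencies")
        if task.status != "DRAFT":
            problems.append("DDL task should be DRAFT")
    elif task.kind in ("etl", "dqc"):
        # ETL and data quality tasks: Cron plus an upstream dependency, PUBLISHED.
        label = task.kind.upper()
        if not task.cron:
            problems.append(f"{label} task needs Cron")
        if not task.depends_on:
            problems.append(f"{label} task needs an upstream dependency")
        if task.status != "PUBLISHED":
            problems.append(f"{label} task should be PUBLISHED")
    return problems
```

A DDL task that has been given a Cron expression yields a single violation, while a correctly configured ETL task yields an empty list.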
302
303
 
303
304
  ---
304
305
 
305
- ## 任务目录组织规范
306
+ ## Task Folder Organization Standards
306
307
 
307
- 每个数仓项目在 Studio 中创建独立任务目录,统一管理所有任务资产:
308
+ Each data warehouse project creates an independent task folder in Studio to manage all task assets uniformly:
308
309
 
309
310
  ```
310
- <业务域>_dw/ 项目任务目录(如 shenyu_gateway_dwecommerce_dw
311
- ├── 00_sync_<source>_to_ods ← 数据同步(Cron,最早执行)
312
- ├── 01_ddl_ods ← ODS 建表(DRAFT,不调度,手动执行一次)
313
- ├── 02_ddl_dwd ← DWD 建表(DRAFT,不调度,手动执行一次)
314
- ├── 03_ddl_dws_ads ← DWS/ADS 动态表建表(DRAFT,不调度)
315
- ├── 04_transform_ods_to_dwd ← ODS→DWD 清洗(Cron,依赖 00
316
- └── 05_dqc_check ← 数据质量检查(Cron,依赖 04,可选)
311
+ <business_domain>_dw/ Project task folder (e.g., shenyu_gateway_dw, ecommerce_dw)
312
+ ├── 00_sync_<source>_to_ods ← Data sync (Cron, runs earliest)
313
+ ├── 01_ddl_ods ← ODS table creation (DRAFT, no scheduling, run once manually)
314
+ ├── 02_ddl_dwd ← DWD table creation (DRAFT, no scheduling, run once manually)
315
+ ├── 03_ddl_dws_ads ← DWS/ADS Dynamic Table creation (DRAFT, no scheduling)
316
+ ├── 04_transform_ods_to_dwd ← ODS→DWD transformation (Cron, depends on 00)
317
+ └── 05_dqc_check ← Data quality check (Cron, depends on 04, optional)
317
318
  ```
318
319
 
319
- > DWS/ADS 层由 Dynamic Table 自动刷新,**无需创建任务**。
320
+ > DWS/ADS layer is auto-refreshed by Dynamic Tables — **no task creation needed**.
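The naming convention in the tree above can be checked mechanically. The following is a minimal sketch, assuming every task name follows the `NN_<kind>_...` pattern shown there; the helper name `expected_schedule_mode` and its return strings are hypothetical and are not part of `cz-cli`.

```python
import re

# Hypothetical helper (not a cz-cli feature): infers the scheduling mode a
# task should have from the NN_<kind>_... naming convention shown above.
_TASK_NAME = re.compile(r"^(\d{2})_([a-z]+)_")


def expected_schedule_mode(task_name: str) -> str:
    """Return 'DRAFT' for DDL tasks and 'CRON' for sync/transform/dqc tasks."""
    match = _TASK_NAME.match(task_name)
    if match is None:
        raise ValueError(f"name does not follow NN_<kind>_ convention: {task_name!r}")
    kind = match.group(2)
    if kind == "ddl":
        return "DRAFT"  # DDL tasks are run once manually and never scheduled
    if kind in {"sync", "transform", "dqc"}:
        return "CRON"
    raise ValueError(f"unknown task kind {kind!r} in {task_name!r}")
```

For example, `expected_schedule_mode("02_ddl_dwd")` yields `"DRAFT"`, which can be compared against the status reported for the task in Studio.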
320
321
 
321
322
  ---
322
323
 
323
- ## cz-cli task 命令族
324
+ ## cz-cli task Command Family
324
325
 
325
- ### 任务目录管理
326
+ ### Task Folder Management
326
327
 
327
328
  ```bash
328
- # 创建任务目录
329
+ # Create task folder
329
330
  cz-cli task folder create <folder_name>
330
331
 
331
- # 列出所有任务目录
332
+ # List all task folders
332
333
  cz-cli task folder list
333
334
  ```
334
335
 
335
- ### 任务查询
336
+ ### Task Queries
336
337
 
337
338
  ```bash
338
- # 列出所有任务
339
+ # List all tasks
339
340
  cz-cli task list
340
341
 
341
- # 按目录过滤
342
+ # Filter by folder
342
343
  cz-cli task list --folder <folder_name>
343
344
 
344
- # 查看任务详情
345
+ # View task details
345
346
  cz-cli task get <task_id>
346
347
 
347
- # 查看任务状态
348
- cz-cli task status <task_id>
349
348
  ```
350
349
 
351
- ### 任务执行
350
+ ### Task Execution
352
351
 
353
352
  ```bash
354
- # 手动触发任务运行
353
+ # Manually trigger task run
355
354
  cz-cli task run <task_id>
356
355
 
357
- # 查看任务运行日志
356
+ # View task run logs
358
357
  cz-cli task logs <task_id>
359
358
 
360
- # 查看最近一次运行实例
361
- cz-cli task instances <task_id> --limit 5
362
359
  ```
363
360
 
364
- ### 任务创建
361
+ ### Task Creation
365
362
 
366
363
  ```bash
367
- # 创建 SQL 任务(ETL/DDL
364
+ # Create SQL task (ETL/DDL)
368
365
  cz-cli task create \
369
366
  --name "04_transform_ods_to_dwd" \
370
367
  --type SQL \
@@ -372,281 +369,281 @@ cz-cli task create \
372
369
  --vcluster default \
373
370
  --sql-file ./transform.sql
374
371
 
375
- # 创建数据同步任务(单表)
372
+ # Create data sync task (single-table)
376
373
  cz-cli task create \
377
374
  --name "00_sync_mysql_to_ods" \
378
375
  --type SINGLE_DI \
379
376
  --folder <folder_name>
380
377
  ```
381
378
 
382
- > ⚠️ **整库同步任务(MULTI_DI)的能力边界**:`cz-cli` 可以创建任务框架,但源端/目标端字段映射配置**必须在 Studio UI 中手动完成**。推荐 SOP
383
- > 1. `cz-cli task create --type MULTI_DI` 创建任务框架
384
- > 2. 复制输出的任务链接,在浏览器中打开
385
- > 3. Studio UI 中配置源端数据库、目标端 Schema、字段映射
386
- > 4. 点击发布运行
379
+ > ⚠️ **Full database sync task (MULTI_DI) capability boundary**: `cz-cli` can create the task framework, but source/target column mapping configuration **must be completed manually in Studio UI**. Recommended SOP:
380
+ > 1. `cz-cli task create --type MULTI_DI` to create task framework
381
+ > 2. Copy the output task link, open in browser
382
+ > 3. Configure source database, target schema, column mapping in Studio UI
383
+ > 4. Click publish to run
387
384
 
388
385
  ---
389
386
 
390
- ## 调度配置最佳实践
387
+ ## Scheduling Configuration Best Practices
391
388
 
392
- ### Cron 表达式参考
389
+ ### Cron Expression Reference
393
390
 
394
391
  ```
395
- # 每天 02:00 执行(数据同步)
392
+ # Daily at 02:00 (data sync)
396
393
  0 2 * * *
397
394
 
398
- # 每天 02:30 执行(ETL 转换,同步完成后 30 分钟)
395
+ # Daily at 02:30 (ETL transformation, 30 minutes after sync completes)
399
396
  30 2 * * *
400
397
 
401
- # 每天 03:00 执行(数据质量检查)
398
+ # Daily at 03:00 (data quality check)
402
399
  0 3 * * *
403
400
 
404
- # 每小时执行
401
+ # Every hour
405
402
  0 * * * *
406
403
  ```
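The expressions above use the five-field form, while the `save-cron` examples in the engineering SOP use a seven-field form such as `'0 30 2 * * ? *'`. The sketch below shows one way to map between the two, assuming the seven-field form is Quartz-style (seconds first, `?` for day-of-week, trailing year). The function name `to_quartz` is hypothetical, and which form a given command accepts should be confirmed against the CLI itself.

```python
def to_quartz(cron5: str) -> str:
    """Convert 'min hour dom month dow' into a 7-field Quartz-style string.

    Assumption: only expressions whose day-of-week field is '*' are handled,
    because day-of-week numbering differs between the two dialects.
    """
    fields = cron5.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {cron5!r}")
    minute, hour, day_of_month, month, day_of_week = fields
    if day_of_week != "*":
        raise ValueError("day-of-week conversion is not covered by this sketch")
    # seconds=0, day-of-week='?', year='*'
    return f"0 {minute} {hour} {day_of_month} {month} ? *"


print(to_quartz("30 2 * * *"))
```

With the 02:30 expression above as input, the result matches the string passed to `save-cron` for `04_transform_ods_to_dwd` in the SOP.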
407
404
 
408
- ### 依赖配置原则
405
+ ### Dependency Configuration Principles
409
406
 
410
407
  ```
411
- 正确的依赖链:
412
- 00_syncCron 02:00
413
- 依赖
414
- 04_transformCron 02:30
415
- 依赖
416
- 05_dqcCron 03:00
408
+ Correct dependency chain:
409
+ 00_sync (Cron 02:00)
410
+ ↓ (04_transform depends on 00_sync)
411
+ 04_transform (Cron 02:30)
412
+ ↓ (05_dqc depends on 04_transform)
413
+ 05_dqc (Cron 03:00)
417
414
 
418
- 错误的依赖:
419
- ❌ DDL 任务(01/02/03)不应出现在依赖链中
420
- ❌ Dynamic Table 不应出现在依赖链中
415
+ Incorrect dependencies:
416
+ ❌ DDL tasks (01/02/03) should not appear in the dependency chain
417
+ ❌ Dynamic Tables should not appear in the dependency chain
421
418
  ```
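Both rules can be validated before publishing. The following is a sketch under stated assumptions: the dependency map is built by hand (or from whatever the dependency listing returns), and the function names `find_cycle` and `ddl_tasks_in_chain` are hypothetical helpers, not part of `cz-cli`.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """deps maps a task to the tasks it depends on. Returns one cycle or None."""
    in_progress, done = set(), set()
    path: List[str] = []

    def visit(task: str) -> Optional[List[str]]:
        in_progress.add(task)
        path.append(task)
        for upstream in deps.get(task, []):
            if upstream in in_progress:  # back edge: a cycle closes here
                return path[path.index(upstream):] + [upstream]
            if upstream not in done:
                cycle = visit(upstream)
                if cycle:
                    return cycle
        path.pop()
        in_progress.discard(task)
        done.add(task)
        return None

    for task in deps:
        if task not in done:
            cycle = visit(task)
            if cycle:
                return cycle
    return None


def ddl_tasks_in_chain(deps: Dict[str, List[str]]) -> List[str]:
    """List DDL tasks (names containing '_ddl_') that appear anywhere in deps."""
    names = set(deps) | {u for ups in deps.values() for u in ups}
    return sorted(name for name in names if "_ddl_" in name)


chain = {"00_sync": [], "04_transform": ["00_sync"], "05_dqc": ["04_transform"]}
print(find_cycle(chain), ddl_tasks_in_chain(chain))
```

A depth-first search is used because it reports the offending path, which is more useful when fixing a publish failure than a bare yes/no answer.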
422
419
 
423
420
  ---
424
421
 
425
- ## 数据同步任务类型选择
422
+ ## Data Sync Task Type Selection
426
423
 
427
- | 场景 | 任务类型 | 说明 |
428
- |---|---|---|
429
- | MySQL/PG 单表同步到 Lakehouse | `SINGLE_DI` | 简单,CLI 可完全配置 |
430
- | MySQL/PG 整库同步(多表镜像) | `MULTI_DI` | CLI 创建框架,UI 配置映射 |
431
- | Kafka 实时接入 | `REALTIME_SYNC` | 持续运行,无需 Cron |
432
- | 文件批量导入(OSS/S3 | SQL 任务(COPY INTO | SQL 任务执行 COPY INTO |
424
+ | Scenario | Task Type | Notes |
425
+ |----------|-----------|-------|
426
+ | MySQL/PG single-table sync to Lakehouse | `SINGLE_DI` | Simple, CLI can fully configure |
427
+ | MySQL/PG full database sync (multi-table mirror) | `MULTI_DI` | CLI creates framework, UI configures mapping |
428
+ | Kafka real-time ingestion | `REALTIME_SYNC` | Continuously running, no Cron needed |
429
+ | File batch import (OSS/S3) | SQL task (COPY INTO) | Use SQL task to execute COPY INTO |
433
430
 
434
431
  ---
435
432
 
436
- ## 常见问题排查
433
+ ## Common Issue Troubleshooting
437
434
 
438
- | 问题 | 原因 | 解决方案 |
439
- |---|---|---|
440
- | `SCHEDULE_TASK_HAD_CHILDREN_NODES_EXCEPTION` | DDL 任务被配置了 Cron 或依赖 | 清除 DDL 任务的调度配置,降级为 DRAFT |
441
- | 任务发布失败,提示循环依赖 | 任务 A 依赖 BB 又依赖 A | 检查依赖链,去除环形依赖 |
442
- | 同步任务一直失败,无明确报错 | 字段类型不兼容(如 MySQL BIT(1) vs Lakehouse BOOLEAN | 检查字段类型映射,参考下方类型映射表 |
443
- | 整库同步任务创建后无法运行 | MULTI_DI 任务缺少字段映射配置 | 进入 Studio UI 配置源端/目标端映射后重新发布 |
444
- | ETL 任务未按时触发 | 上游同步任务失败,依赖未满足 | 先修复上游同步任务,再手动触发 ETL |
445
- | DWS 层数据未更新 | 误建了调度任务但 Dynamic Table 未刷新 | 删除冗余调度任务,确认 Dynamic Table 状态为 RUNNING |
446
- | 任务运行成功但数据为空 | SQL 逻辑问题(如 LEFT JOIN 过滤条件位置错误) | 检查 SQLLEFT JOIN 右表过滤条件必须在 ON 子句 |
435
+ | Issue | Cause | Solution |
436
+ |-------|-------|----------|
437
+ | `SCHEDULE_TASK_HAD_CHILDREN_NODES_EXCEPTION` | DDL task was configured with Cron or dependencies | Clear DDL task scheduling config, demote to DRAFT |
438
+ | Task publish failed, circular dependency | Task A depends on B, B depends on A | Check dependency chain, remove circular dependencies |
439
+ | Sync task keeps failing, no clear error | Column type incompatibility (e.g., MySQL BIT(1) vs Lakehouse BOOLEAN) | Check column type mapping, refer to type mapping table below |
440
+ | Full database sync task cannot run after creation | MULTI_DI task missing column mapping config | Enter Studio UI to configure source/target mapping, then republish |
441
+ | ETL task not triggered on time | Upstream sync task failed, dependency not satisfied | Fix upstream sync task first, then manually trigger ETL |
442
+ | DWS layer data not updated | Mistakenly created scheduled task but Dynamic Table not refreshing | Delete redundant scheduled task, confirm Dynamic Table status is RUNNING |
443
+ | Task run succeeded but data is empty | SQL logic issue (e.g., LEFT JOIN filter condition in wrong position) | Check the SQL: filter conditions on the right table of a LEFT JOIN must be in the ON clause |
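The last row is easiest to see with a small worked example. The sketch below uses Python's built-in `sqlite3` module purely for illustration (an assumption: it is not ClickZetta Lakehouse, but LEFT JOIN semantics are standard SQL); the `orders` and `users` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER);
    CREATE TABLE users  (user_id INTEGER, status TEXT);
    INSERT INTO orders VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO users  VALUES (10, 'active'), (20, 'inactive');
    """
)

# Filter in WHERE: rows without a matching active user are discarded,
# so the LEFT JOIN silently behaves like an INNER JOIN.
where_rows = conn.execute(
    """
    SELECT o.order_id, u.status
    FROM orders o
    LEFT JOIN users u ON o.user_id = u.user_id
    WHERE u.status = 'active'
    ORDER BY o.order_id
    """
).fetchall()

# Filter in ON: every left-table row is kept; unmatched rows get NULL.
on_rows = conn.execute(
    """
    SELECT o.order_id, u.status
    FROM orders o
    LEFT JOIN users u ON o.user_id = u.user_id AND u.status = 'active'
    ORDER BY o.order_id
    """
).fetchall()

print(len(where_rows), len(on_rows))
```

With the filter in `WHERE`, only one of the three orders survives; with the filter in `ON`, all three are returned and the unmatched ones carry a NULL status.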
447
444
 
448
- ### MySQL → Lakehouse 字段类型映射(同步任务常见踩坑)
445
+ ### MySQL → Lakehouse Column Type Mapping (Common Sync Pitfalls)
449
446
 
450
- | MySQL 类型 | ❌ 不要用 | ✅ ODS 层用 | DWD 层转换 |
451
- |---|---|---|---|
447
+ | MySQL Type | ❌ Don't Use | ✅ ODS Layer Use | DWD Layer Conversion |
448
+ |-----------|-------------|-----------------|---------------------|
452
449
  | `BIT(1)` | `BOOLEAN` | `TINYINT` | `CAST(col AS BOOLEAN)` |
453
- | `DATETIME` | `DATETIME` | `TIMESTAMP` | 直接用 |
454
- | `ENUM('a','b')` | `ENUM` | `STRING` | 直接用 |
455
- | `TEXT` / `LONGTEXT` | `TEXT` | `STRING` | 直接用 |
456
- | `DECIMAL(p,s)` | `FLOAT` | `DECIMAL(p,s)` | 直接用 |
450
+ | `DATETIME` | `DATETIME` | `TIMESTAMP` | Use directly |
451
+ | `ENUM('a','b')` | `ENUM` | `STRING` | Use directly |
452
+ | `TEXT` / `LONGTEXT` | `TEXT` | `STRING` | Use directly |
453
+ | `DECIMAL(p,s)` | `FLOAT` | `DECIMAL(p,s)` | Use directly |
457
454
  | `TINYINT(1)` | `BOOLEAN` | `TINYINT` | `CAST(col AS BOOLEAN)` |
458
455
 
459
- > **ODS 层原则:宽泛类型优先**,同步成功后在 DWD 层做精确类型转换,避免同步阶段因类型不兼容失败。
456
+ > **ODS layer principle: prefer broad types** — sync successfully first, then do precise type conversion in the DWD layer to avoid sync failures due to type incompatibility.
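The mapping table can be encoded as a lookup so that ODS DDL and DWD casts are generated consistently. This is a minimal sketch that covers only the rows listed above; the names `ods_type` and `dwd_expr` are hypothetical, and any MySQL type outside the table is rejected rather than guessed.

```python
import re

# ODS-layer type per MySQL base type, taken from the table above.
_ODS_TYPE = {
    "BIT": "TINYINT",       # BIT(1)
    "TINYINT": "TINYINT",   # TINYINT(1)
    "DATETIME": "TIMESTAMP",
    "ENUM": "STRING",
    "TEXT": "STRING",
    "LONGTEXT": "STRING",
}


def ods_type(mysql_type: str) -> str:
    """Return the ODS column type for a MySQL column type from the table."""
    match = re.match(r"\s*([A-Za-z]+)", mysql_type)
    if match is None:
        raise ValueError(f"cannot parse type: {mysql_type!r}")
    base = match.group(1).upper()
    if base == "DECIMAL":
        # DECIMAL(p,s) is kept as-is (never FLOAT).
        return re.sub(r"\s+", "", mysql_type).upper()
    if base not in _ODS_TYPE:
        raise KeyError(f"type not covered by the mapping table: {mysql_type!r}")
    return _ODS_TYPE[base]


def dwd_expr(column: str, mysql_type: str) -> str:
    """Return the DWD select expression: a BOOLEAN cast for flag columns."""
    normalized = re.sub(r"\s+", "", mysql_type).upper()
    if normalized in {"BIT(1)", "TINYINT(1)"}:
        return f"CAST({column} AS BOOLEAN)"
    return column  # "Use directly"


print(ods_type("BIT(1)"), dwd_expr("is_deleted", "BIT(1)"))
```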
460
457
 
461
458
  ---
462
459
 
463
- ## 完整工程化 SOP
460
+ ## Complete Engineering SOP
464
461
 
465
- ### 代码资产化原则
462
+ ### Code-as-Asset Principle
466
463
 
467
- **数据管道开发 / 数仓建模场景下,所有 SQL 代码都应保存为 Studio 任务,作为可管理的代码资产。**
464
+ **In data pipeline development / data warehouse modeling scenarios, all SQL code should be saved as Studio tasks as manageable code assets.**
468
465
 
469
- - 任务是代码的载体,不只是调度配置
470
- - 即使是一次性执行的 DDL,也应保存为 DRAFT 任务,方便查阅、复用和多环境迁移
471
- - 不需要保存为任务的场景:SELECT 查询、临时修复 SQL、一次性验证查询
466
+ - Tasks are the vehicle for code, not just scheduling configurations
467
+ - Even one-time DDL executions should be saved as DRAFT tasks for easy reference, reuse, and multi-environment migration
468
+ - Scenarios that don't need to be saved as tasks: SELECT queries, ad-hoc fix SQL, one-time validation queries
472
469
 
473
- ### 新项目启动流程(含快速验证节点)
470
+ ### New Project Launch Process (with Quick Verification Checkpoints)
474
471
 
475
- 敏捷原则:**每步完成后立即验证,30 秒内知道是否成功,不等全链路跑完再发现问题。**
472
+ Agile principle: **verify immediately after each step, know within 30 seconds if it succeeded — don't wait until the full pipeline runs to discover issues.**
476
473
 
477
474
  ```
478
- 1. 创建任务目录
479
- cz-cli task folder create <业务域>_dw
475
+ 1. Create task folder
476
+ cz-cli task folder create <business_domain>_dw
480
477
 
481
- 2. ODS 层表,立即验证
478
+ 2. Create ODS layer tables, verify immediately
482
479
  cz-cli task save-content 01_ddl_ods --content "<ods_ddl_sql>"
483
480
  cz-cli task run 01_ddl_ods
484
- 验证:SHOW TABLES IN <ods_schema> → 确认表已创建
481
+ Verify: SHOW TABLES IN <ods_schema> → confirm tables created
485
482
 
486
- 3. 创建数据同步任务,手动触发一次,立即验证
487
- - 00_sync:整库或单表同步到 ODSMULTI_DI 需进 UI 配置映射)
483
+ 3. Create data sync task, trigger once manually, verify immediately
484
+ - 00_sync: full database or single-table sync to ODS (MULTI_DI requires UI mapping config)
488
485
  cz-cli task execute 00_sync
489
- 验证:SELECT COUNT(*) FROM <ods_schema>.<table> → 与源端行数对比
490
- SELECT * FROM <ods_schema>.<table> LIMIT 5 → 抽样检查字段
486
+ Verify: SELECT COUNT(*) FROM <ods_schema>.<table> → compare with source row count
487
+ SELECT * FROM <ods_schema>.<table> LIMIT 5 → sample check fields
491
488
 
492
- 4. DWD 层表,立即验证
489
+ 4. Create DWD layer tables, verify immediately
493
490
  cz-cli task save-content 02_ddl_dwd --content "<dwd_ddl_sql>"
494
491
  cz-cli task run 02_ddl_dwd
495
- 验证:SHOW TABLES IN <dwd_schema> → 确认表已创建
492
+ Verify: SHOW TABLES IN <dwd_schema> → confirm tables created
496
493
 
497
- 5. 生成 ETL 转换 SQL,先手动执行一次验证逻辑,再配调度
494
+ 5. Generate ETL transformation SQL, manually execute once to verify logic, then configure scheduling
498
495
  cz-cli task save-content 04_transform_ods_to_dwd --content "<etl_sql>"
499
- cz-cli task execute 04_transform_ods_to_dwd ← 先手动跑一次
500
- 验证:SELECT COUNT(*) FROM <dwd_schema>.<table> → 行数符合预期
501
- 检查关键字段非空率、LEFT JOIN 结果行数左表行数
502
- 确认无误后再配调度:
496
+ cz-cli task execute 04_transform_ods_to_dwd ← run manually first
497
+ Verify: SELECT COUNT(*) FROM <dwd_schema>.<table> → row count meets expectations
498
+ Check key field non-null rate; compare LEFT JOIN result rows with left table rows
499
+ After confirmation, configure scheduling:
503
500
  cz-cli task save-cron 04_transform_ods_to_dwd --cron '0 30 2 * * ? *'
504
501
  cz-cli task deploy 04_transform_ods_to_dwd
505
502
 
506
- 6. DWS/ADS Dynamic Table,立即触发首次刷新验证
503
+ 6. Create DWS/ADS Dynamic Tables, trigger first refresh for verification
507
504
  cz-cli task save-content 03_ddl_dws_ads --content "<dws_ads_ddl_sql>"
508
505
  cz-cli task run 03_ddl_dws_ads
509
506
  REFRESH DYNAMIC TABLE <dws_schema>.<table>
510
- 验证:SHOW DYNAMIC TABLE REFRESH HISTORY <schema>.<table> LIMIT 3
511
- → status = SUCCESS,行数符合聚合逻辑
507
+ Verify: SHOW DYNAMIC TABLE REFRESH HISTORY <schema>.<table> LIMIT 3
508
+ → status = SUCCESS, row count matches aggregation logic
512
509
 
513
- 7. 可选:数据质量检查任务(配 Cron + 依赖 04
510
+ 7. Optional: data quality check task (Cron + depends on 04)
514
511
  cz-cli task save-content 05_dqc_check --content "<dqc_sql>"
515
512
  cz-cli task save-cron 05_dqc_check --cron '0 0 3 * * ? *'
516
513
  cz-cli task deploy 05_dqc_check
517
514
  ```
518
515
 
519
- > **快速失败原则**:任何一步验证失败,立即停下来修复,不要继续往下走。ODS 数据不对,DWD 一定也不对。
516
+ > **Fail-fast principle**: if any step's verification fails, stop immediately and fix — don't continue. If ODS data is wrong, DWD will definitely be wrong too.
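The fail-fast rule can be expressed as an ordered list of checkpoints that stops at the first failure. This is a sketch only: how each check obtains its numbers (for example, by running the `SELECT COUNT(*)` queries from the steps above) is left to the caller, and the names `run_checkpoints`, `VerificationFailed`, and `row_counts_match` are hypothetical.

```python
from typing import Callable, List, Tuple


class VerificationFailed(Exception):
    """Raised as soon as one checkpoint does not pass."""


def row_counts_match(source_rows: int, target_rows: int) -> bool:
    """ODS check from step 3: the synced row count equals the source row count."""
    return source_rows == target_rows


def run_checkpoints(checkpoints: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run checks in pipeline order; later checks never run after a failure."""
    passed: List[str] = []
    for name, check in checkpoints:
        if not check():
            raise VerificationFailed(f"{name} failed (passed so far: {passed})")
        passed.append(name)
    return passed


# Usage with hard-coded counts standing in for real query results.
print(run_checkpoints([
    ("ods_row_count", lambda: row_counts_match(1000, 1000)),
    ("dwd_not_empty", lambda: 980 > 0),
]))
```

Raising an exception rather than returning a flag keeps a script from moving on to the DWD step when the ODS check has already failed.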
520
517
 
521
518
  ---
522
519
 
523
- ## 增量迭代向导
520
+ ## Incremental Iteration Guide
524
521
 
525
- **已有管道需要修改时,走增量流程,不要重走完整建管道流程。**
522
+ **When modifying an existing pipeline, follow the incremental process — don't re-run the full pipeline build process.**
526
523
 
527
- 当用户说"加一张表""加一个字段""加一个指标"" ETL 逻辑"时,优先使用交互式工具收集迭代类型:
524
+ When the user says "add a table", "add a field", "add a metric", "change ETL logic", use interactive tools to collect the iteration type:
528
525
 
529
526
  ```
530
527
  question({
531
528
  questions: [{
532
- question: "你想对现有管道做什么修改?",
529
+ question: "What modification do you want to make to the existing pipeline?",
533
530
  options: [
534
- { label: "新增同步表", description: "在现有同步任务里增加一张源表" },
535
- { label: "新增字段", description: "源表加了字段,ODS/DWD 需要跟进" },
536
- { label: "新增指标/DWS ", description: "新增聚合逻辑或 Dynamic Table" },
537
- { label: "修改 ETL 逻辑", description: "清洗规则、过滤条件、JOIN 关系变更" }
531
+ { label: "Add sync table", description: "Add a source table to an existing sync task" },
532
+ { label: "Add field", description: "Source table added a field, ODS/DWD need to follow" },
533
+ { label: "Add metric/DWS layer", description: "Add aggregation logic or Dynamic Table" },
534
+ { label: "Modify ETL logic", description: "Cleaning rules, filter conditions, JOIN relationship changes" }
538
535
  ]
539
536
  }]
540
537
  })
541
538
  ```
542
539
 
543
- ### 新增同步表
540
+ ### Add Sync Table
544
541
 
545
542
  ```
546
- 1. 查血缘,确认影响范围
547
- 加载 clickzetta-table-lineage,确认新表是否与现有表有关联
543
+ 1. Check lineage, confirm impact scope
544
+ Load clickzetta-table-lineage, confirm if new table has relationships with existing tables
548
545
 
549
- 2. 在现有同步任务里增加表(或新建单表同步任务)
550
- cz-cli task content 00_sync → 查看现有配置
551
- 按需修改后重新部署
546
+ 2. Add table to existing sync task (or create new single-table sync task)
547
+ cz-cli task content 00_sync → view existing config
548
+ Modify as needed and redeploy
552
549
 
553
- 3. 手动触发同步,立即验证
550
+ 3. Manually trigger sync, verify immediately
554
551
  cz-cli task execute 00_sync
555
552
  ✅ SELECT COUNT(*) FROM <ods_schema>.<new_table>
556
553
 
557
- 4. 如需 DWD 层处理,新增 ETL SQL 并追加到 04_transform 任务
558
- cz-cli task content 04_transform_ods_to_dwd → 查看现有 SQL
559
- 追加新表的清洗逻辑,手动执行验证后重新部署
554
+ 4. If DWD layer processing needed, add ETL SQL to 04_transform task
555
+ cz-cli task content 04_transform_ods_to_dwd → view existing SQL
556
+ Append new table cleaning logic, manually execute to verify, then redeploy
560
557
  ```
561
558
 
562
- ### 新增字段(Schema Evolution
559
+ ### Add Field (Schema Evolution)
563
560
 
564
561
  ```
565
- 1. 查血缘,识别所有受影响的下游任务/DT
566
- 加载 clickzetta-table-lineage
562
+ 1. Check lineage, identify all affected downstream tasks/DTs
563
+ Load clickzetta-table-lineage
567
564
 
568
- 2. 逐层更新(从上游到下游,不能跳层)
569
- ODS 层:ALTER TABLE <ods_schema>.<table> ADD COLUMN <col> <type>
570
- 验证:DESC TABLE <ods_schema>.<table> → 确认字段已加
565
+ 2. Update layer by layer (upstream to downstream, cannot skip layers)
566
+ ODS layer: ALTER TABLE <ods_schema>.<table> ADD COLUMN <col> <type>
567
+ Verify: DESC TABLE <ods_schema>.<table> → confirm field added
571
568
 
572
- DWD 层:更新 ETL SQL,加入新字段的清洗逻辑
573
- 手动执行 04_transform 验证后重新部署
574
- 验证:SELECT <new_col>, COUNT(*) FROM <dwd_schema>.<table> GROUP BY 1 LIMIT 5
569
+ DWD layer: update ETL SQL, add cleaning logic for new field
570
+ Manually execute 04_transform to verify, then redeploy
571
+ Verify: SELECT <new_col>, COUNT(*) FROM <dwd_schema>.<table> GROUP BY 1 LIMIT 5
575
572
 
576
- DWS/ADS 层(如需):Dynamic Table 不支持 ALTER,用 CREATE OR REPLACE 重建
577
- 重建后立即 REFRESH DYNAMIC TABLE
578
- 验证:SHOW DYNAMIC TABLE REFRESH HISTORY LIMIT 3 → status = SUCCESS
573
+ DWS/ADS layer (if needed): Dynamic Table doesn't support ALTER, use CREATE OR REPLACE to rebuild
574
+ Immediately REFRESH DYNAMIC TABLE after rebuild
575
+ Verify: SHOW DYNAMIC TABLE REFRESH HISTORY LIMIT 3 → status = SUCCESS
579
576
 
580
- 3. 更新 Studio 任务脚本(保持代码资产同步)
577
+ 3. Update Studio task scripts (keep code assets in sync)
581
578
  cz-cli task save-content <task_name> --content "<updated_sql>"
582
579
  ```
583
580
 
584
- ### 新增指标/DWS
581
+ ### Add Metric/DWS Layer
585
582
 
586
583
  ```
587
- 1. 确认指标口径(与用户确认计算逻辑,避免后期返工)
584
+ 1. Confirm metric definition (confirm calculation logic with user to avoid rework)
588
585
 
589
- 2. 检查 DWD 层是否有所需字段,没有先走"新增字段"流程
586
+ 2. Check if DWD layer has required fields — if not, follow "Add Field" process first
590
587
 
591
- 3. 创建新的 Dynamic Table
588
+ 3. Create new Dynamic Table
592
589
  CREATE OR REPLACE DYNAMIC TABLE <dws_schema>.<new_metric_table>
593
590
  REFRESH INTERVAL <n> <unit> vcluster <gp_cluster>
594
591
  AS SELECT ...;
595
592
  REFRESH DYNAMIC TABLE <dws_schema>.<new_metric_table>
596
- 验证:SELECT COUNT(*), SUM(<metric>) FROM <dws_schema>.<new_metric_table>
597
- 与已知基准值对比
593
+ Verify: SELECT COUNT(*), SUM(<metric>) FROM <dws_schema>.<new_metric_table>
594
+ Compare with known baseline values
598
595
 
599
- 4. 保存 DDL Studio 任务
596
+ 4. Save DDL to Studio task
600
597
  cz-cli task save-content 03_ddl_dws_ads --content "<updated_ddl>"
601
598
  ```
602
599
 
603
- ### 修改 ETL 逻辑
600
+ ### Modify ETL Logic
604
601
 
605
602
  ```
606
- 1. 查血缘,确认下游影响范围
607
- 加载 clickzetta-table-lineage
603
+ 1. Check lineage, confirm downstream impact scope
604
+ Load clickzetta-table-lineage
608
605
 
609
- 2. dev/测试环境先验证新逻辑(如有)
606
+ 2. Verify new logic in dev/test environment first (if available)
610
607
 
611
- 3. 更新 ETL SQL
612
- cz-cli task content 04_transform_ods_to_dwd → 查看现有逻辑
613
- 修改后先手动执行验证:
608
+ 3. Update ETL SQL
609
+ cz-cli task content 04_transform_ods_to_dwd → view existing logic
610
+ After modification, manually execute to verify:
614
611
  cz-cli task execute 04_transform_ods_to_dwd
615
- 验证:行数对比、关键字段抽样、与修改前结果对比
612
+ Verify: row count comparison, key field sampling, compare with pre-modification results
616
613
 
617
- 4. 验证通过后重新部署
614
+ 4. After verification passes, redeploy
618
615
  cz-cli task save-content 04_transform_ods_to_dwd --content "<new_sql>"
619
616
  cz-cli task deploy 04_transform_ods_to_dwd
620
617
 
621
- 5. 下游 Dynamic Table 如受影响,触发全量刷新
618
+ 5. If downstream Dynamic Tables are affected, trigger full refresh
622
619
  SET cz.optimizer.incremental.force.full.refresh = true;
623
620
  REFRESH DYNAMIC TABLE <dws_schema>.<table>;
624
621
  SET cz.optimizer.incremental.force.full.refresh = false;
625
622
  ```
626
623
 
627
- ### 交付验证 Checklist
624
+ ### Delivery Verification Checklist
628
625
 
629
- - [ ] 各层行数与预期一致
630
- - [ ] Dynamic Table 使用的 VCluster 存在且 `status = RUNNING`(`SHOW VCLUSTERS`)
631
- - [ ] Dynamic Table 刷新历史显示 SUCCESS
632
- - [ ] 关键字段 NULL 率在可接受范围
633
- - [ ] LEFT JOIN 结果行数左表行数
634
- - [ ] 所有 DDL 任务为 DRAFT 状态
635
- - [ ] DWS/ADS 层无冗余调度任务
636
- - [ ] 调度 DAG 无循环依赖
637
- - [ ] **ETL 任务依赖链完整**(`cz-cli task deps <task>` 验证,`task_dependencies` 不为空)
638
- - [ ] 关键表和字段已加注释(加载 `clickzetta-manage-comments`)
626
+ - [ ] Row counts at each layer match expectations
627
+ - [ ] Dynamic Table's VCluster exists and `status = RUNNING` (`SHOW VCLUSTERS`)
628
+ - [ ] Dynamic Table refresh history shows SUCCESS
629
+ - [ ] Key field NULL rate within acceptable range
630
+ - [ ] LEFT JOIN result row count checked against left table row count
631
+ - [ ] All DDL tasks are in DRAFT status
632
+ - [ ] No redundant scheduled tasks for DWS/ADS layer
633
+ - [ ] Scheduling DAG has no circular dependencies
634
+ - [ ] **ETL task dependency chain is complete** (`cz-cli task deps <task>` to verify, `task_dependencies` is not empty)
635
+ - [ ] Key tables and fields have comments (load `clickzetta-manage-comments`)
639
636
 
640
637
  ---
641
638
 
642
- ## 多环境管理(dev → prod
639
+ ## Multi-environment Management (dev → prod)
643
640
 
644
- ClickZetta 通过 **Workspace** 隔离环境(dev/staging/prod 对应不同 Workspace)。跨 Workspace 的管道迁移当前自动化程度有限,主要依赖手动操作。
641
+ ClickZetta isolates environments via **Workspace** (dev/staging/prod correspond to different Workspaces). Cross-Workspace pipeline migration currently has limited automation and mainly relies on manual operations.
645
642
 
646
- **当用户提出多环境迁移需求时**,告知以下限制并引导:
643
+ **When the user raises multi-environment migration needs**, inform them of the following limitations and guide accordingly:
647
644
 
648
- - 不同 Workspace 的数据源配置、Schema、VCluster 名称各自独立,迁移时需逐一确认和替换
649
- - 目前没有一键迁移工具,建议联系**数据运维(lh-dba 角色)**协助规划多环境策略
650
- - 可以用 `cz-cli task content <task_id>` 导出任务脚本,手动调整后在目标 Workspace 重建
645
+ - Data source configurations, schemas, and VCluster names are independent across different Workspaces — each must be confirmed and replaced during migration
646
+ - There is currently no one-click migration tool — recommend contacting **data operations (lh-dba role)** for help planning multi-environment strategy
647
+ - You can use `cz-cli task content <task_id>` to export task scripts, manually adjust, then recreate in the target Workspace
651
648
 
652
- > 多环境管理是平台能力演进方向,当前阶段建议在单 Workspace 内用 Schema 命名区分(如 `ecommerce_ods_dev` vs `ecommerce_ods`),降低迁移复杂度。
649
+ > Multi-environment management is an area where the platform is still evolving. For now, it is recommended to stay within a single Workspace and distinguish environments by schema name (e.g., `ecommerce_ods_dev` vs `ecommerce_ods`), which reduces migration complexity.
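The schema naming approach can be applied by templating the SQL before it is saved as a task. This is a minimal sketch, assuming the convention is "base name for prod, `_<env>` suffix otherwise" as in the `ecommerce_ods_dev` example; `schema_for_env` and `render_sql` are hypothetical helpers.

```python
def schema_for_env(base_schema: str, env: str) -> str:
    """Return the schema name for an environment: bare for prod, suffixed otherwise."""
    if env == "prod":
        return base_schema
    return f"{base_schema}_{env}"


def render_sql(template: str, env: str, **base_schemas: str) -> str:
    """Fill {placeholders} in a SQL template with environment-specific schemas.

    Note: str.format is used, so literal braces in the SQL must be doubled.
    """
    resolved = {key: schema_for_env(name, env) for key, name in base_schemas.items()}
    return template.format(**resolved)


print(render_sql("SELECT COUNT(*) FROM {ods}.orders", "dev", ods="ecommerce_ods"))
```

The rendered statement can then be passed to `cz-cli task save-content` in the same way as the hand-written SQL in the steps above.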