@clickzetta/cz-cli-darwin-arm64 0.3.92 → 0.3.94
- package/bin/cz-cli +0 -0
- package/bin/skills/clickzetta-ai-function/SKILL.md +109 -0
- package/bin/skills/clickzetta-ai-function/eval_cases.jsonl +4 -0
- package/bin/skills/clickzetta-ai-function/references/ai-function-ddl.md +106 -0
- package/bin/skills/clickzetta-batch-sync-pipeline/SKILL.md +124 -124
- package/bin/skills/clickzetta-batch-sync-pipeline/eval_cases.jsonl +5 -5
- package/bin/skills/clickzetta-bi-connect/SKILL.md +79 -78
- package/bin/skills/clickzetta-bi-connect/references/bi-tools.md +56 -56
- package/bin/skills/clickzetta-cdc-sync-pipeline/SKILL.md +386 -382
- package/bin/skills/clickzetta-cdc-sync-pipeline/eval_cases.jsonl +5 -5
- package/bin/skills/clickzetta-data-ingest-pipeline/SKILL.md +73 -212
- package/bin/skills/clickzetta-data-science/SKILL.md +57 -56
- package/bin/skills/clickzetta-data-science/references/bitmap-profile.md +38 -38
- package/bin/skills/clickzetta-data-science/references/data-patterns.md +16 -16
- package/bin/skills/clickzetta-data-science/references/setup.md +28 -28
- package/bin/skills/clickzetta-data-science/references/stats-functions.md +44 -44
- package/bin/skills/clickzetta-data-science/references/write-and-infer.md +22 -22
- package/bin/skills/clickzetta-data-science/references/zettapark-api.md +32 -32
- package/bin/skills/clickzetta-dw-modeling/SKILL.md +1 -1
- package/bin/skills/clickzetta-external-function/SKILL.md +51 -109
- package/bin/skills/clickzetta-external-function/eval_cases.jsonl +4 -4
- package/bin/skills/clickzetta-external-function/references/external-function-ddl.md +39 -77
- package/bin/skills/clickzetta-java-sdk/SKILL.md +49 -48
- package/bin/skills/clickzetta-java-sdk/eval_cases.jsonl +12 -12
- package/bin/skills/clickzetta-java-sdk/references/bulkload.md +34 -34
- package/bin/skills/clickzetta-java-sdk/references/realtime.md +44 -44
- package/bin/skills/clickzetta-kafka-ingest-pipeline/SKILL.md +273 -507
- package/bin/skills/clickzetta-kafka-ingest-pipeline/references/kafka-pipe-syntax.md +197 -231
- package/bin/skills/clickzetta-oss-ingest-pipeline/SKILL.md +231 -304
- package/bin/skills/clickzetta-realtime-sync-pipeline/SKILL.md +180 -179
- package/bin/skills/clickzetta-realtime-sync-pipeline/eval_cases.jsonl +5 -5
- package/bin/skills/clickzetta-semantic-view/SKILL.md +74 -72
- package/bin/skills/clickzetta-semantic-view/eval_cases.jsonl +12 -12
- package/bin/skills/clickzetta-semantic-view/references/semantic-view-reference.md +75 -75
- package/bin/skills/clickzetta-sql-migration/SKILL.md +128 -0
- package/bin/skills/clickzetta-sql-migration/eval_cases.jsonl +10 -0
- package/bin/skills/clickzetta-sql-migration/references/ddl-reference.md +350 -0
- package/bin/skills/clickzetta-sql-migration/references/dml-differences.md +192 -0
- package/bin/skills/clickzetta-sql-migration/references/dml-reference.md +279 -0
- package/bin/skills/{clickzetta-sql-syntax-guide → clickzetta-sql-migration}/references/dql-reference.md +128 -128
- package/bin/skills/clickzetta-sql-migration/references/function-mapping.md +194 -0
- package/bin/skills/clickzetta-sql-migration/references/functions-reference.md +372 -0
- package/bin/skills/clickzetta-sql-migration/references/implicit-type-conversion.md +143 -0
- package/bin/skills/clickzetta-sql-migration/references/migration-databricks.md +260 -0
- package/bin/skills/{clickzetta-sql-syntax-guide → clickzetta-sql-migration}/references/migration-snowflake.md +112 -112
- package/bin/skills/clickzetta-sql-migration/references/vs-snowflake.md +346 -0
- package/bin/skills/clickzetta-sql-migration/references/vs-spark.md +229 -0
- package/bin/skills/clickzetta-studio-task-manager/SKILL.md +326 -329
- package/bin/skills/clickzetta-table-lineage/SKILL.md +57 -55
- package/bin/skills/clickzetta-table-lineage/eval_cases.jsonl +1 -1
- package/bin/skills/clickzetta-table-lineage/references/normalize_func.sql +5 -5
- package/bin/skills/clickzetta-table-lineage/references/table_cost.sql +6 -6
- package/bin/skills/clickzetta-table-lineage/references/table_relation.sql +2 -2
- package/bin/skills/clickzetta-volume-manager/SKILL.md +186 -100
- package/bin/skills/clickzetta-volume-manager/references/volume-ddl.md +153 -52
- package/package.json +1 -1
- package/bin/skills/clickzetta-dynamic-table/best-practices/scheduling-guide.md +0 -135
- package/bin/skills/clickzetta-dynamic-table/dt-creator/references/dt-declaration-strategy.md +0 -185
- package/bin/skills/clickzetta-dynamic-table/dt-creator/references/refresh-history-guide.md +0 -260
- package/bin/skills/clickzetta-dynamic-table/dynamic-table-alter/SKILL.md +0 -191
- package/bin/skills/clickzetta-sql-syntax-guide/SKILL.md +0 -249
- package/bin/skills/clickzetta-sql-syntax-guide/eval_cases.jsonl +0 -3
- package/bin/skills/clickzetta-sql-syntax-guide/references/ddl-reference.md +0 -350
- package/bin/skills/clickzetta-sql-syntax-guide/references/dml-reference.md +0 -279
- package/bin/skills/clickzetta-sql-syntax-guide/references/functions-reference.md +0 -372
- package/bin/skills/clickzetta-sql-syntax-guide/references/migration-databricks.md +0 -260
- package/bin/skills/clickzetta-sql-syntax-guide/references/vs-snowflake.md +0 -346
- package/bin/skills/clickzetta-sql-syntax-guide/references/vs-spark.md +0 -229
- /package/bin/skills/{clickzetta-sql-syntax-guide → clickzetta-sql-migration}/LICENSE +0 -0
@@ -1,370 +1,367 @@
 ---
 name: clickzetta-studio-task-manager
 description: |
-
-
-
-
-
-  "
-  "
-  "
-  "
+  Manage ClickZetta Lakehouse Studio tasks, covering task type descriptions (batch sync/multi-table batch sync/
+  real-time sync/multi-table real-time sync/data development), task folder organization, task type differentiation,
+  cz-cli task command family, scheduling configuration, dependency management, and common issue troubleshooting.
+  Implements the "separation of DDL and pipeline management" engineering standard: DDL tasks as drafts,
+  ETL tasks with scheduling, Dynamic Tables with auto-refresh.
+  Triggered when the user says "create Studio task", "task folder", "task scheduling", "cz-cli task",
+  "task dependency", "task failed", "task status", "full database sync task", "ETL task orchestration",
+  "task management", "separation of DDL and pipeline", "DDL task", "scheduling DAG", "task folder",
+  "Studio task", "batch sync", "real-time sync", "multi-table real-time sync", "data development task",
+  "task types", "which sync to choose", "sync task differences".
   Keywords: Studio task, task management, cz-cli task, scheduling, DAG, DDL draft, ETL pipeline, task folder, offline sync, realtime sync, CDC, task types
 ---

-# ClickZetta Studio
+# ClickZetta Studio Task Management

-##
+## Wizard: Clarify Intent

-
+Upon receiving a task management request, use an interactive question tool (e.g., `question`) to collect intent. If no such tool is available, list options in text:

 ```
 question({
   questions: [{
-    question: "
+    question: "What would you like to do?",
     options: [
-      { label: "
-      { label: "
-      { label: "
-      { label: "
+      { label: "Build a new pipeline from scratch", description: "Create folders, DDL tasks, sync tasks, ETL tasks" },
+      { label: "Manage existing tasks", description: "View status, modify config, configure dependencies, rerun, backfill" },
+      { label: "Troubleshoot task issues", description: "Failure diagnosis, dependency check, log analysis → load clickzetta-pipeline-review" },
+      { label: "Standards compliance check", description: "Check if existing tasks follow separation of DDL and pipeline standards" }
     ]
   }]
 })
 ```

-
+**If the user has clearly stated what they want to do, proceed directly without asking.**

-
+For **building from scratch**, also collect: business domain/project name, data source type, layering structure.

-##
+## Data Pipeline Wizard (Used When Building from Scratch)

-
+Full process: **Requirements Understanding → Data Exploration → Technical Selection → Plan Confirmation → Execution**

 ---

-### Step 0
+### Step 0: Requirements Input

-
+**First ask if the user has a requirements document** (PRD, requirements spec, data warehouse design doc, etc.):

 ```
 question({
   questions: [{
-    question: "
+    question: "Before we start, do you have a requirements document or background description?",
     options: [
-      { label: "
-      { label: "
+      { label: "Yes, I'll provide it", description: "Paste document content or upload file, I'll extract key information" },
+      { label: "No, I'll describe verbally", description: "I'll guide you through a few key questions" }
     ]
   }]
 })
 ```

-
+**If document provided**: read the document, auto-extract business scenario, data sources, target outputs, freshness requirements, skip to Step 1.

-
+**If no document**, collect the following business requirements (prefer interactive tools; if unavailable, list all in text):

 ```
 question({
   questions: [
     {
-      question: "
+      question: "What business scenario does this pipeline serve?",
       options: [
-        { label: "BI
-        { label: "
-        { label: "
-        { label: "
+        { label: "BI reports / dashboards", description: "Fixed reports, clear metric system, T+1 or hourly" },
+        { label: "Real-time monitoring / ops dashboard", description: "Minute-level latency, focus on real-time metrics" },
+        { label: "Data science / feature engineering", description: "For model training or inference" },
+        { label: "Data sharing / external output", description: "Provided to other systems or teams" }
       ]
     },
     {
-      question: "
+      question: "Who are the data consumers?",
       options: [
-        { label: "BI
-        { label: "
-        { label: "
-        { label: "
+        { label: "BI tools (Superset/Tableau, etc.)", description: "Need wide tables or aggregation tables" },
+        { label: "Data analysts (SQL queries)", description: "Need cleaned detail tables" },
+        { label: "Downstream systems / APIs", description: "Need structured output" },
+        { label: "Data scientists (Python/ZettaPark)", description: "Need feature tables or raw detail" }
       ]
     },
     {
-      question: "
+      question: "Data freshness requirements?",
       options: [
-        { label: "T+1
-        { label: "
-        { label: "
-        { label: "
+        { label: "T+1 (available next day)", description: "Batch run at midnight, data ready in the morning" },
+        { label: "Hourly", description: "Updated every hour" },
+        { label: "Minute-level", description: "Near real-time, latency < 10 minutes" },
+        { label: "Second-level real-time", description: "CDC continuous sync, second-level latency" }
       ]
     }
   ]
 })
 ```

-
--
--
+Also confirm verbally (text follow-up, no menu needed):
+- **Core metric definitions**: if involving GMV, active users, or other business metrics, confirm calculation logic
+- **Project/business domain name**: used for task folder and schema naming (e.g., `ecommerce_dw`)

 ---

-### Step 1
+### Step 1: Data Exploration (AI executes autonomously, no user input needed)

-
+After collecting requirements, immediately explore the current data state:

 ```sql
---
+-- View related schemas and tables
 SHOW SCHEMAS;
-SHOW TABLES IN
+SHOW TABLES IN <relevant_schema>;

---
+-- Check table sizes and row counts
 SELECT table_schema, table_name,
        ROUND(bytes/1024.0/1024/1024, 2) AS size_gb, row_count
 FROM information_schema.tables
 WHERE table_type = 'MANAGED_TABLE'
 ORDER BY bytes DESC NULLS LAST LIMIT 20;

---
+-- Sample to understand field meanings
 SELECT * FROM <schema>.<table> LIMIT 5;
 ```

-
+Also use `cz-cli datasource list` to view configured external data sources.

 ---

-### Step 2
+### Step 2: Technical Selection (Choose data source type and ingestion method)

-
+Based on requirements and data exploration results, use interactive tools to collect technical choices:

-
+**Select data source type:**
 ```
-question({ questions: [{ question: "
-  { label: "
-  { label: "Kafka
-  { label: "
-  { label: "Lakehouse
-  { label: "
-  { label: "
+question({ questions: [{ question: "Where does the data come from?", options: [
+  { label: "External database", description: "MySQL / PostgreSQL / SQL Server / Oracle, etc." },
+  { label: "Kafka message queue", description: "Kafka Topic → Lakehouse" },
+  { label: "Object storage", description: "OSS / S3 / COS file import" },
+  { label: "Lakehouse internal ETL layering", description: "ODS→DWD→DWS/ADS, SQL tasks + Dynamic Table" },
+  { label: "End-to-end complete pipeline", description: "Data ingestion + layered modeling + aggregation" },
+  { label: "Not sure, explore data first", description: "Look at existing data before recommending an approach" }
 ]}]})
 ```

-
+**Follow-up (only needed for certain options):**

-
+Selected "External database":
 ```
-question({ questions: [{ question: "
-  { label: "
-  { label: "
+question({ questions: [{ question: "Sync freshness?", options: [
+  { label: "Real-time sync (second-level)", description: "CDC, based on Binlog/WALs, continuously running" },
+  { label: "Batch offline (hourly/daily)", description: "Periodic full sync, configure Cron" }
 ]}]})
 ```

-
+Selected "Object storage":
 ```
-question({ questions: [{ question: "
-  { label: "SQL Pipe
-  { label: "Studio
+question({ questions: [{ question: "Ingestion method?", options: [
+  { label: "SQL Pipe (continuous auto-import)", description: "LIST_PURGE or EVENT_NOTIFICATION mode" },
+  { label: "Studio batch sync task", description: "Periodic batch import, configure Cron" }
 ]}]})
 ```

-
+Selected "Kafka":
 ```
-question({ questions: [{ question: "
-  { label: "SQL Pipe
-  { label: "Studio
+question({ questions: [{ question: "Ingestion method?", options: [
+  { label: "SQL Pipe (READ_KAFKA)", description: "Pure SQL, flexible, recommended for engineers" },
+  { label: "Studio real-time sync task", description: "GUI configuration, supports JSONPath computed columns" }
 ]}]})
 ```

 ---

-### Step 3
+### Step 3: Plan Confirmation (Must execute, cannot skip)

-
+Combining requirements and technical selection, present a complete plan summary to the user for confirmation:

 ```
 question({
   questions: [{
-    question: "
+    question: "Confirm the following plan to start building:\nBusiness scenario: <scenario>\nData source: <source_name>\nSync method: <batch/real-time/SQL Pipe>\nLayering structure: <ODS/DWD/DWS or Bronze/Silver/Gold>\nTarget schema: <schema>\nScheduling: <Cron or continuous running>\nReady to start?",
     options: [
-      { label: "
-      { label: "
+      { label: "Confirmed, start building", description: "Load corresponding skill, begin creating tasks" },
+      { label: "Need adjustments", description: "Re-collect information" }
     ]
   }]
 })
 ```

-
+After user confirmation, load the corresponding skill per routing table:

-
+**Routing Table**

-|
-
-|
-|
-|
+| Data Source | Freshness/Method | Load Skill |
+|-------------|-----------------|------------|
+| External database | Real-time single-table CDC | `clickzetta-realtime-sync-pipeline` |
+| External database | Real-time multi-table/full database CDC | `clickzetta-cdc-sync-pipeline` |
+| External database | Batch offline | `clickzetta-batch-sync-pipeline` |
 | Kafka | SQL Pipe | `clickzetta-kafka-ingest-pipeline` |
-| Kafka | Studio
-|
-|
-| Lakehouse
-|
+| Kafka | Studio real-time sync | `clickzetta-realtime-sync-pipeline` |
+| Object storage | SQL Pipe | `clickzetta-oss-ingest-pipeline` |
+| Object storage | Studio batch sync | `clickzetta-batch-sync-pipeline` |
+| Lakehouse internal ETL layering | - | `clickzetta-sql-pipeline-manager` |
+| End-to-end complete pipeline / Not sure | - | `clickzetta-dw-modeling` |

->
+> Real-time CDC single vs multi-table: user says "full database" or "multiple tables" → `cdc-sync-pipeline`; "one table" → `realtime-sync-pipeline`; if unclear, ask.
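
The routing table and the single-vs-multi-table note can be read as a plain lookup. The sketch below is illustrative only and is not part of the package: the function names (`route_skill`, `cdc_scope`) and the short method keys such as `"sql-pipe"` are assumptions made for this example; only the skill names and the trigger phrases come from the table and note above.

```python
from typing import Optional

# (data source, method key) -> skill, transcribed from the routing table.
# The method keys are hypothetical shorthand for the "Freshness/Method" column.
ROUTES = {
    ("external database", "realtime-single"): "clickzetta-realtime-sync-pipeline",
    ("external database", "realtime-multi"): "clickzetta-cdc-sync-pipeline",
    ("external database", "batch"): "clickzetta-batch-sync-pipeline",
    ("kafka", "sql-pipe"): "clickzetta-kafka-ingest-pipeline",
    ("kafka", "studio-realtime"): "clickzetta-realtime-sync-pipeline",
    ("object storage", "sql-pipe"): "clickzetta-oss-ingest-pipeline",
    ("object storage", "studio-batch"): "clickzetta-batch-sync-pipeline",
}

# Rows of the table that do not depend on a method.
NO_METHOD = {
    "lakehouse internal etl layering": "clickzetta-sql-pipeline-manager",
    "end-to-end complete pipeline": "clickzetta-dw-modeling",
    "not sure": "clickzetta-dw-modeling",
}


def cdc_scope(user_text: str) -> Optional[str]:
    """Map the user's wording to a CDC scope; None means 'unclear, ask'."""
    text = user_text.lower()
    if "full database" in text or "multiple tables" in text:
        return "realtime-multi"
    if "one table" in text:
        return "realtime-single"
    return None


def route_skill(source: str, method: Optional[str] = None) -> Optional[str]:
    """Return the skill to load, or None when the combination is not in the table."""
    key = source.lower()
    if key in NO_METHOD:
        return NO_METHOD[key]
    return ROUTES.get((key, method))
```

A `None` result from either function corresponds to the "if unclear, ask" branch: the caller should go back to the user rather than guess.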

 ---

-## Studio
+## Studio Task Types

-Studio
+Studio provides four major task categories. Choosing the wrong type is the most common engineering mistake:

-###
-
+### Batch Sync (Single-table)
+Periodically full-sync a single source table to Lakehouse.

--
--
--
--
+- **Use cases**: single-table periodic overwrite, low data freshness requirements (daily/hourly batch), resource optimization (no real-time needed)
+- **Run mode**: scheduled (Cron required), full overwrite or append each run
+- **Data sources**: MySQL, PostgreSQL, SQL Server, and other relational databases
+- **Corresponding skill**: `clickzetta-batch-sync-pipeline` (single-table mode)

-###
-
+### Multi-table Batch Sync
+Periodically batch-sync multiple tables or an entire database to Lakehouse.

--
--
--
--
--
--
--
+- **Use cases**:
+  - Full database migration (batch sync all tables, reducing per-table configuration effort)
+  - Sharded table merge (merge multiple sharded tables into a unified target table)
+  - Periodic data calibration (periodic full sync to ensure target matches source)
+- **Run mode**: scheduled (Cron required), supports full database mirror, multi-table mirror, and sharded table merge modes
+- **Data sources**: MySQL, PostgreSQL, SQL Server, etc.
+- **Corresponding skill**: `clickzetta-batch-sync-pipeline` (multi-table mode)

-###
-
+### Real-time Sync (Single-table)
+Continuously sync a single Kafka topic to Lakehouse in real time.

--
--
--
--
+- **Use cases**: Kafka message stream real-time ingestion, second/minute-level latency requirements, single-topic fine-grained sync
+- **Run mode**: continuously running (no Cron needed, runs upon submission)
+- **Data sources**: **Kafka only** (JSON message parsing, supports JSONPath computed columns)
+- **Corresponding skill**: `clickzetta-realtime-sync-pipeline`

-###
-
+### Multi-table Real-time Sync (CDC)
+Sync entire MySQL / PostgreSQL databases or multiple tables to Lakehouse via CDC in real time, with full load + incremental two-phase sync.

--
--
--
+- **Use cases**: full database real-time mirror, second-level end-to-end latency, sharded table real-time merge
+- **Run mode**: continuously running (no Cron needed, runs upon submission)
+- **Data sources**:

-|
-
-| MySQL
-| PostgreSQL
+| Type | Incremental Read Mode | Supported Versions |
+|------|----------------------|-------------------|
+| MySQL (including Aurora MySQL, PolarDB MySQL) | Binlog | 5.6+, 8.x |
+| PostgreSQL (including Aurora PG, PolarDB PG) | WALs | 14+ |

--
+- **Corresponding skill**: `clickzetta-cdc-sync-pipeline`

-###
-
+### Data Development Tasks (SQL / Python / Shell)
+Write and schedule data processing logic in Studio, the core vehicle for data warehouse ETL.

-- **SQL
-- **Python
-- **Shell
--
--
+- **SQL tasks**: ODS→DWD cleaning/transformation, data quality checks, ad-hoc data repairs
+- **Python tasks**: custom data processing scripts, external API calls, ML inference
+- **Shell tasks**: system commands, file operations, external tool invocations
+- **Run mode**: scheduled (Cron) or manual trigger
+- **Corresponding skill**: `clickzetta-studio-task-manager` (this skill)

-###
+### Four Task Types Quick Comparison

-|
-
-|
-|
-|
-|
-|
+| Task Type | Data Source | Sync Granularity | Run Mode | Freshness |
+|-----------|------------|-----------------|----------|-----------|
+| Batch Sync | Relational DB | Single table | Scheduled | Hourly/daily |
+| Multi-table Batch Sync | Relational DB | Multi-table/full DB | Scheduled | Hourly/daily |
+| Real-time Sync | **Kafka only** | Single topic | Continuously running | Seconds/minutes |
+| Multi-table Real-time Sync | MySQL / PostgreSQL | Multi-table/full DB | Continuously running | Seconds |
+| Data Development | Any (SQL/Python/Shell) | Custom logic | Scheduled or manual | Depends on schedule frequency |

 ---

-##
+## Core Principle: Separation of DDL and Pipeline Management

-
+**Different task types have completely different scheduling strategies.** Confusing task types is the most common engineering mistake.

-|
-
-| **DDL
-|
-| **ETL
-|
-| **DWS/ADS
+| Task Type | Typical Content | Studio Task Type | Scheduling | Status |
+|-----------|----------------|-----------------|------------|--------|
+| **DDL table creation** | CREATE TABLE / CREATE SCHEMA | SQL task | ❌ No Cron, no dependencies | DRAFT |
+| **Data sync tasks** | External source (relational DB/object storage) → ODS | **SINGLE_DI / MULTI_DI / REALTIME** (not SQL tasks) | ✅ Configure Cron (batch) or continuous (real-time) | PUBLISHED |
+| **ETL transformation** | ODS→DWD cleaning SQL (Lakehouse internal) | SQL task | ✅ Configure Cron + depend on upstream sync | PUBLISHED |
+| **Data quality tasks** | Row count checks, NULL rate validation | SQL task | ✅ Configure Cron + depend on ETL | PUBLISHED |
+| **DWS/ADS aggregation** | Metric summaries, report wide tables | ❌ Use Dynamic Table, no task needed | - | - |

-
--
--
--
+**Data sync task supported data sources:**
+- Batch sync (SINGLE_DI/MULTI_DI): MySQL, PostgreSQL, Oracle, SQL Server and other relational databases, plus OSS/COS/S3 object storage
+- Single-table real-time sync (REALTIME): Kafka
+- Multi-table real-time sync CDC: MySQL (Binlog, 5.6+/8.x), PostgreSQL (WALs, 14+)

-
-- Kafka/OSS/S3/COS →
-- Hive/Databricks/Snowflake Open Catalog → External Catalog
+**Other data access methods (not data sync tasks):**
+- Kafka/OSS/S3/COS → can also use SQL Pipe (`READ_KAFKA`/Volume Pipe), both Studio sync tasks and SQL Pipes are valid; choose based on scenario
+- Hive/Databricks/Snowflake Open Catalog → External Catalog federated read-only queries, not data sync

-> ⚠️ **DDL
+> ⚠️ **DDL tasks must never have Cron**: repeated execution of CREATE TABLE statements causes `SCHEDULE_TASK_HAD_CHILDREN_NODES_EXCEPTION` and other scheduling conflicts. DDL tasks should be demoted to DRAFT immediately after execution.

-> ⚠️ **DWS/ADS
+> ⚠️ **Do not create scheduled tasks for DWS/ADS layer**: Dynamic Tables auto-refresh by the system. Creating additional tasks is redundant computation and wastes resources.

-> ⚠️
+> ⚠️ **Never use SQL tasks as a substitute for data sync tasks**: you cannot use SQL tasks to write `SELECT FROM EXTERNAL` to simulate sync (syntax not supported), nor use JDBC tasks (JDBC can only execute SQL on external databases, cannot sync data to Lakehouse).
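
The scheduling rules in the principle table lend themselves to a mechanical check, which is roughly what a "standards compliance check" would do. The following is a minimal sketch under stated assumptions: the `Task` record, its `kind` labels, and the `violations` function are hypothetical and do not reflect how `cz-cli` or Studio represent tasks; only the rules themselves (and the type names `SINGLE_DI`, `MULTI_DI`, `REALTIME`) are taken from the table and warnings above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Studio task types named in the principle table for data sync.
BATCH_SYNC_TYPES = {"SINGLE_DI", "MULTI_DI"}
SYNC_TYPES = BATCH_SYNC_TYPES | {"REALTIME"}


@dataclass
class Task:
    """Hypothetical in-memory description of a Studio task."""
    name: str
    kind: str                     # assumed labels: ddl, sync, etl, dqc, aggregation
    task_type: str                # e.g. SQL, SINGLE_DI, MULTI_DI, REALTIME
    cron: Optional[str] = None
    depends_on: List[str] = field(default_factory=list)
    status: str = "DRAFT"


def violations(task: Task) -> List[str]:
    """List the ways a task departs from the DDL/pipeline separation rules."""
    problems = []
    if task.kind == "ddl":
        if task.cron:
            problems.append("DDL task must not have Cron")
        if task.depends_on:
            problems.append("DDL task must not have dependencies")
        if task.status != "DRAFT":
            problems.append("DDL task should be demoted to DRAFT")
    elif task.kind == "sync":
        if task.task_type not in SYNC_TYPES:
            problems.append("sync must use SINGLE_DI / MULTI_DI / REALTIME")
        if task.task_type in BATCH_SYNC_TYPES and not task.cron:
            problems.append("batch sync needs Cron")
    elif task.kind in ("etl", "dqc"):
        if task.task_type != "SQL":
            problems.append("ETL and data quality tasks are SQL tasks")
        if not task.cron:
            problems.append("task needs Cron")
        if not task.depends_on:
            problems.append("task needs an upstream dependency")
    elif task.kind == "aggregation":
        problems.append("DWS/ADS aggregation should be a Dynamic Table, not a task")
    return problems
```

For example, a DDL task that was published with a Cron expression yields two findings, while an ETL SQL task with a Cron and an upstream sync dependency yields none.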
|
|
302
303
|
|
|
303
304
|
---
|
|
304
305
|
|
|
305
|
-
##
|
|
306
|
+
## Task Folder Organization Standards
|
|
306
307
|
|
|
307
|
-
|
|
308
|
+
Each data warehouse project creates an independent task folder in Studio to manage all task assets uniformly:
|
|
308
309
|
|
|
309
310
|
```
|
|
310
|
-
|
|
311
|
-
├── 00_sync_<source>_to_ods ←
|
|
312
|
-
├── 01_ddl_ods ← ODS
|
|
313
|
-
├── 02_ddl_dwd ← DWD
|
|
314
|
-
├── 03_ddl_dws_ads ← DWS/ADS
|
|
315
|
-
├── 04_transform_ods_to_dwd ← ODS→DWD
|
|
316
|
-
└── 05_dqc_check ←
|
|
311
|
+
<business_domain>_dw/ ← Project task folder (e.g., shenyu_gateway_dw, ecommerce_dw)
|
|
312
|
+
├── 00_sync_<source>_to_ods ← Data sync (Cron, runs earliest)
|
|
313
|
+
├── 01_ddl_ods ← ODS table creation (DRAFT, no scheduling, run once manually)
|
|
314
|
+
├── 02_ddl_dwd ← DWD table creation (DRAFT, no scheduling, run once manually)
|
|
315
|
+
├── 03_ddl_dws_ads ← DWS/ADS Dynamic Table creation (DRAFT, no scheduling)
|
|
316
|
+
├── 04_transform_ods_to_dwd ← ODS→DWD transformation (Cron, depends on 00)
|
|
317
|
+
└── 05_dqc_check ← Data quality check (Cron, depends on 04, optional)
|
|
317
318
|
```
|
|
318
319
|
|
|
319
|
-
> DWS/ADS
|
|
320
|
+
> DWS/ADS layer is auto-refreshed by Dynamic Tables — **no task creation needed**.
|
|
320
321
|
|
|
321
322
|
---
|
|
322
323
|
|
|
323
|
-
## cz-cli task
|
|
324
|
+
## cz-cli task Command Family
|
|
324
325
|
|
|
325
|
-
###
|
|
326
|
+
### Task Folder Management
|
|
326
327
|
|
|
327
328
|
```bash
|
|
328
|
-
#
|
|
329
|
+
# Create task folder
|
|
329
330
|
cz-cli task folder create <folder_name>
|
|
330
331
|
|
|
331
|
-
#
|
|
332
|
+
# List all task folders
|
|
332
333
|
cz-cli task folder list
|
|
333
334
|
```
|
|
334
335
|
|
|
335
|
-
###
|
|
336
|
+
### Task Queries
|
|
336
337
|
|
|
337
338
|
```bash
|
|
338
|
-
#
|
|
339
|
+
# List all tasks
|
|
339
340
|
cz-cli task list
|
|
340
341
|
|
|
341
|
-
#
|
|
342
|
+
# Filter by folder
|
|
342
343
|
cz-cli task list --folder <folder_name>
|
|
343
344
|
|
|
344
|
-
#
|
|
345
|
+
# View task details
|
|
345
346
|
cz-cli task get <task_id>
|
|
346
347
|
|
|
347
|
-
# 查看任务状态
|
|
348
|
-
cz-cli task status <task_id>
|
|
349
348
|
```
|
|
350
349
|
|
|
351
|
-
###
|
|
350
|
+
### Task Execution
|
|
352
351
|
|
|
353
352
|
```bash
|
|
354
|
-
#
|
|
353
|
+
# Manually trigger task run
|
|
355
354
|
cz-cli task run <task_id>
|
|
356
355
|
|
|
357
|
-
#
|
|
356
|
+
# View task run logs
|
|
358
357
|
cz-cli task logs <task_id>
|
|
359
358
|
|
|
360
|
-
# 查看最近一次运行实例
|
|
361
|
-
cz-cli task instances <task_id> --limit 5
|
|
362
359
|
```
|
|
363
360
|
|
|
364
|
-
###
|
|
361
|
+
### Task Creation
|
|
365
362
|
|
|
366
363
|
```bash
|
|
367
|
-
#
|
|
364
|
+
# Create SQL task (ETL/DDL)
|
|
368
365
|
cz-cli task create \
|
|
369
366
|
--name "04_transform_ods_to_dwd" \
|
|
370
367
|
--type SQL \
|
|
@@ -372,281 +369,281 @@ cz-cli task create \
|
|
|
372
369
|
--vcluster default \
|
|
373
370
|
--sql-file ./transform.sql
|
|
374
371
|
|
|
375
|
-
#
|
|
372
|
+
# Create data sync task (single-table)
|
|
376
373
|
cz-cli task create \
|
|
377
374
|
--name "00_sync_mysql_to_ods" \
|
|
378
375
|
--type SINGLE_DI \
|
|
379
376
|
--folder <folder_name>
|
|
380
377
|
```
|
|
381
378
|
|
|
382
|
-
> ⚠️
|
|
383
|
-
> 1. `cz-cli task create --type MULTI_DI`
|
|
384
|
-
> 2.
|
|
385
|
-
> 3.
|
|
386
|
-
> 4.
|
|
379
|
+
> ⚠️ **Full database sync task (MULTI_DI) capability boundary**: `cz-cli` can create the task framework, but source/target column mapping configuration **must be completed manually in Studio UI**. Recommended SOP:
|
|
380
|
+
> 1. `cz-cli task create --type MULTI_DI` to create task framework
|
|
381
|
+
> 2. Copy the output task link, open in browser
|
|
382
|
+
> 3. Configure source database, target schema, column mapping in Studio UI
|
|
383
|
+
> 4. Click publish to run
|
|
387
384
|
|
|
388
385
|
---
|
|
389
386
|
|
|
390
|
-
##
|
|
387
|
+
## Scheduling Configuration Best Practices
|
|
391
388
|
|
|
392
|
-
### Cron
|
|
389
|
+
### Cron Expression Reference
|
|
393
390
|
|
|
394
391
|
```
|
|
395
|
-
#
|
|
392
|
+
# Daily at 02:00 (data sync)
|
|
396
393
|
0 2 * * *
|
|
397
394
|
|
|
398
|
-
#
|
|
395
|
+
# Daily at 02:30 (ETL transformation, 30 minutes after sync completes)
|
|
399
396
|
30 2 * * *
|
|
400
397
|
|
|
401
|
-
#
|
|
398
|
+
# Daily at 03:00 (data quality check)
|
|
402
399
|
0 3 * * *
|
|
403
400
|
|
|
404
|
-
#
|
|
401
|
+
# Every hour
|
|
405
402
|
0 * * * *
|
|
406
403
|
```
|
|
407
404
|
|
|
408
|
-
###
|
|
405
|
+
### Dependency Configuration Principles
|
|
409
406
|
|
|
410
407
|
```
|
|
411
|
-
|
|
412
|
-
00_sync
|
|
413
|
-
↓
|
|
414
|
-
04_transform
|
|
415
|
-
↓
|
|
416
|
-
05_dqc
|
|
408
|
+
Correct dependency chain:
|
|
409
|
+
00_sync (Cron 02:00)
|
|
410
|
+
↓ is upstream of
|
|
411
|
+
04_transform (Cron 02:30)
|
|
412
|
+
↓ is upstream of
|
|
413
|
+
05_dqc (Cron 03:00)
|
|
417
414
|
|
|
418
|
-
|
|
419
|
-
❌ DDL
|
|
420
|
-
❌ Dynamic
|
|
415
|
+
Incorrect dependencies:
|
|
416
|
+
❌ DDL tasks (01/02/03) should not appear in the dependency chain
|
|
417
|
+
❌ Dynamic Tables should not appear in the dependency chain
|
|
421
418
|
```
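
The chain above can be checked mechanically before publishing. The following is a small Python sketch, not a `cz-cli` feature: it models each task as a node whose value is the set of upstream tasks it waits for (so `04_transform` waits for `00_sync`), then derives a run order and detects circular dependencies with the standard-library `graphlib` module. The task labels are shortened from the chain above; the functions `run_order` and `has_cycle` are illustrative names.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of upstream tasks it depends on.
# DDL tasks and Dynamic Tables are deliberately absent from the graph.
deps = {
    "00_sync": set(),
    "04_transform": {"00_sync"},
    "05_dqc": {"04_transform"},
}


def run_order(graph: dict[str, set[str]]) -> list[str]:
    """Return tasks in an order where every upstream task comes first."""
    return list(TopologicalSorter(graph).static_order())


def has_cycle(graph: dict[str, set[str]]) -> bool:
    """Report whether the dependency graph contains a circular dependency."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False


print(run_order(deps))  # prints: ['00_sync', '04_transform', '05_dqc']
print(has_cycle({"A": {"B"}, "B": {"A"}}))  # prints: True
```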
|
|
422
419
|
|
|
423
420
|
---
|
|
424
421
|
|
|
425
|
-
##
|
|
422
|
+
## Data Sync Task Type Selection
|
|
426
423
|
|
|
427
|
-
|
|
|
428
|
-
|
|
429
|
-
| MySQL/PG
|
|
430
|
-
| MySQL/PG
|
|
431
|
-
| Kafka
|
|
432
|
-
|
|
|
424
|
+
| Scenario | Task Type | Notes |
|
|
425
|
+
|----------|-----------|-------|
|
|
426
|
+
| MySQL/PG single-table sync to Lakehouse | `SINGLE_DI` | Simple, CLI can fully configure |
|
|
427
|
+
| MySQL/PG full database sync (multi-table mirror) | `MULTI_DI` | CLI creates framework, UI configures mapping |
|
|
428
|
+
| Kafka real-time ingestion | `REALTIME_SYNC` | Continuously running, no Cron needed |
|
|
429
|
+
| File batch import (OSS/S3) | SQL task (COPY INTO) | Use SQL task to execute COPY INTO |
|
|
433
430
|
|
|
434
431
|
---
|
|
435
432
|
|
|
436
|
-
##
|
|
433
|
+
## Common Issue Troubleshooting
|
|
437
434
|
|
|
438
|
-
|
|
|
439
|
-
|
|
440
|
-
| `SCHEDULE_TASK_HAD_CHILDREN_NODES_EXCEPTION` | DDL
|
|
441
|
-
|
|
|
442
|
-
|
|
|
443
|
-
|
|
|
444
|
-
| ETL
|
|
445
|
-
| DWS
|
|
446
|
-
|
|
|
435
|
+
| Issue | Cause | Solution |
|
|
436
|
+
|-------|-------|----------|
|
|
437
|
+
| `SCHEDULE_TASK_HAD_CHILDREN_NODES_EXCEPTION` | DDL task was configured with Cron or dependencies | Clear DDL task scheduling config, demote to DRAFT |
|
|
438
|
+
| Task publish failed, circular dependency | Task A depends on B, B depends on A | Check dependency chain, remove circular dependencies |
|
|
439
|
+
| Sync task keeps failing, no clear error | Column type incompatibility (e.g., MySQL BIT(1) vs Lakehouse BOOLEAN) | Check column type mapping, refer to type mapping table below |
|
|
440
|
+
| Full database sync task cannot run after creation | MULTI_DI task missing column mapping config | Enter Studio UI to configure source/target mapping, then republish |
|
|
441
|
+
| ETL task not triggered on time | Upstream sync task failed, dependency not satisfied | Fix upstream sync task first, then manually trigger ETL |
|
|
442
|
+
| DWS layer data not updated | A scheduled task was created by mistake, while the Dynamic Table itself is not refreshing | Delete the redundant scheduled task and confirm the Dynamic Table status is RUNNING |
|
|
443
|
+
| Task run succeeded but data is empty | SQL logic issue (e.g., LEFT JOIN filter condition in wrong position) | Check SQL — LEFT JOIN right-table filter conditions must be in the ON clause |
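
The last row is easiest to see with a tiny reproduction. The sketch below uses Python's built-in `sqlite3` as a stand-in engine (not Lakehouse) and two made-up tables, `customers` and `orders`. It shows that a right-table filter in `WHERE` drops unmatched left rows, while the same filter in `ON` keeps them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    INSERT INTO customers (id) VALUES (1), (2), (3);
    INSERT INTO orders (id, customer_id, status)
        VALUES (10, 1, 'paid'), (11, 2, 'cancelled');
    """
)

# Right-table filter in WHERE: rows whose right side is NULL or non-matching
# are discarded, so the LEFT JOIN behaves like an INNER JOIN.
where_rows = conn.execute(
    """
    SELECT c.id, o.id FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.status = 'paid'
    ORDER BY c.id
    """
).fetchall()

# Same filter in ON: every customer is kept; unmatched ones get NULL.
on_rows = conn.execute(
    """
    SELECT c.id, o.id FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id AND o.status = 'paid'
    ORDER BY c.id
    """
).fetchall()

print(len(where_rows), len(on_rows))  # prints: 1 3
```

With three customers on the left, only the `ON` variant satisfies the "LEFT JOIN result rows ≥ left table rows" check used elsewhere in this file.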
|
|
447
444
|
|
|
448
|
-
### MySQL → Lakehouse
|
|
445
|
+
### MySQL → Lakehouse Column Type Mapping (Common Sync Pitfalls)
|
|
449
446
|
|
|
450
|
-
| MySQL
|
|
451
|
-
|
|
447
|
+
| MySQL Type | ❌ Don't Use | ✅ ODS Layer Use | DWD Layer Conversion |
|
|
448
|
+
|-----------|-------------|-----------------|---------------------|
|
|
452
449
|
| `BIT(1)` | `BOOLEAN` | `TINYINT` | `CAST(col AS BOOLEAN)` |
|
|
453
|
-
| `DATETIME` | `DATETIME` | `TIMESTAMP` |
|
|
454
|
-
| `ENUM('a','b')` | `ENUM` | `STRING` |
|
|
455
|
-
| `TEXT` / `LONGTEXT` | `TEXT` | `STRING` |
|
|
456
|
-
| `DECIMAL(p,s)` | `FLOAT` | `DECIMAL(p,s)` |
|
|
450
|
+
| `DATETIME` | `DATETIME` | `TIMESTAMP` | Use directly |
|
|
451
|
+
| `ENUM('a','b')` | `ENUM` | `STRING` | Use directly |
|
|
452
|
+
| `TEXT` / `LONGTEXT` | `TEXT` | `STRING` | Use directly |
|
|
453
|
+
| `DECIMAL(p,s)` | `FLOAT` | `DECIMAL(p,s)` | Use directly |
|
|
457
454
|
| `TINYINT(1)` | `BOOLEAN` | `TINYINT` | `CAST(col AS BOOLEAN)` |
|
|
458
455
|
|
|
459
|
-
> **ODS
|
|
456
|
+
> **ODS layer principle: prefer broad types** — sync successfully first, then do precise type conversion in the DWD layer to avoid sync failures due to type incompatibility.
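
The mapping table can be encoded as a small lookup so that ODS DDL is generated consistently. This is a sketch in Python under stated assumptions: the function name `ods_type` is hypothetical, and `BIT` and `TINYINT` are mapped regardless of width even though the table lists only `BIT(1)` and `TINYINT(1)`.

```python
import re

# ODS-layer target types, copied from the mapping table above.
# Assumption: BIT and TINYINT are mapped the same way for any width.
ODS_TYPE = {
    "BIT": "TINYINT",
    "DATETIME": "TIMESTAMP",
    "ENUM": "STRING",
    "TEXT": "STRING",
    "LONGTEXT": "STRING",
    "TINYINT": "TINYINT",
}


def ods_type(mysql_type: str) -> str:
    """Map a MySQL column type to the ODS type recommended in the table."""
    match = re.fullmatch(r"\s*(\w+)\s*(\(.*\))?\s*", mysql_type)
    if match is None:
        raise ValueError(f"unrecognized type: {mysql_type!r}")
    base = match.group(1).upper()
    args = match.group(2) or ""
    if base == "DECIMAL":
        # Keep precision and scale; the table warns against FLOAT.
        return f"DECIMAL{args}"
    if base not in ODS_TYPE:
        raise KeyError(f"no mapping listed for {base}")
    return ODS_TYPE[base]


print(ods_type("BIT(1)"))         # prints: TINYINT
print(ods_type("DECIMAL(10,2)"))  # prints: DECIMAL(10,2)
```

Types that the table does not list raise `KeyError` instead of falling back to a guess, which keeps unmapped columns visible during review.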
|
|
460
457
|
|
|
461
458
|
---
|
|
462
459
|
|
|
463
|
-
##
|
|
460
|
+
## Complete Engineering SOP
|
|
464
461
|
|
|
465
|
-
###
|
|
462
|
+
### Code-as-Asset Principle
|
|
466
463
|
|
|
467
|
-
|
|
464
|
+
**In data pipeline development and data warehouse modeling, SQL code should be saved as Studio tasks so that it is managed as a code asset.**
|
|
468
465
|
|
|
469
|
-
-
|
|
470
|
-
-
|
|
471
|
-
-
|
|
466
|
+
- Tasks are the vehicle for code, not just scheduling configurations
|
|
467
|
+
- Even one-time DDL executions should be saved as DRAFT tasks for easy reference, reuse, and multi-environment migration
|
|
468
|
+
- Scenarios that don't need to be saved as tasks: SELECT queries, ad-hoc fix SQL, one-time validation queries
|
|
472
469
|
|
|
473
|
-
###
|
|
470
|
+
### New Project Launch Process (with Quick Verification Checkpoints)
|
|
474
471
|
|
|
475
|
-
|
|
472
|
+
Agile principle: **verify immediately after each step, know within 30 seconds if it succeeded — don't wait until the full pipeline runs to discover issues.**
|
|
476
473
|
|
|
477
474
|
```
|
|
478
|
-
1.
|
|
479
|
-
cz-cli task folder create
|
|
475
|
+
1. Create task folder
|
|
476
|
+
cz-cli task folder create <business_domain>_dw
|
|
480
477
|
|
|
481
|
-
2.
|
|
478
|
+
2. Create ODS layer tables, verify immediately
|
|
482
479
|
cz-cli task save-content 01_ddl_ods --content "<ods_ddl_sql>"
|
|
483
480
|
cz-cli task run 01_ddl_ods
|
|
484
|
-
✅
|
|
481
|
+
✅ Verify: SHOW TABLES IN <ods_schema> → confirm tables created
|
|
485
482
|
|
|
486
|
-
3.
|
|
487
|
-
- 00_sync
|
|
483
|
+
3. Create data sync task, trigger once manually, verify immediately
|
|
484
|
+
- 00_sync: full database or single-table sync to ODS (MULTI_DI requires UI mapping config)
|
|
488
485
|
cz-cli task execute 00_sync
|
|
489
|
-
✅
|
|
490
|
-
SELECT * FROM <ods_schema>.<table> LIMIT 5 →
|
|
486
|
+
✅ Verify: SELECT COUNT(*) FROM <ods_schema>.<table> → compare with source row count
|
|
487
|
+
SELECT * FROM <ods_schema>.<table> LIMIT 5 → sample check fields
|
|
491
488
|
|
|
492
|
-
4.
|
|
489
|
+
4. Create DWD layer tables, verify immediately
|
|
493
490
|
cz-cli task save-content 02_ddl_dwd --content "<dwd_ddl_sql>"
|
|
494
491
|
cz-cli task run 02_ddl_dwd
|
|
495
|
-
✅
|
|
492
|
+
✅ Verify: SHOW TABLES IN <dwd_schema> → confirm tables created
|
|
496
493
|
|
|
497
|
-
5.
|
|
494
|
+
5. Generate ETL transformation SQL, manually execute once to verify logic, then configure scheduling
|
|
498
495
|
cz-cli task save-content 04_transform_ods_to_dwd --content "<etl_sql>"
|
|
499
|
-
cz-cli task execute 04_transform_ods_to_dwd ←
|
|
500
|
-
✅
|
|
501
|
-
|
|
502
|
-
|
|
496
|
+
cz-cli task execute 04_transform_ods_to_dwd ← run manually first
|
|
497
|
+
✅ Verify: SELECT COUNT(*) FROM <dwd_schema>.<table> → row count meets expectations
|
|
498
|
+
Check key field non-null rate, LEFT JOIN result rows ≥ left table rows
|
|
499
|
+
After confirmation, configure scheduling:
|
|
503
500
|
cz-cli task save-cron 04_transform_ods_to_dwd --cron '0 30 2 * * ? *'
|
|
504
501
|
cz-cli task deploy 04_transform_ods_to_dwd
|
|
505
502
|
|
|
506
|
-
6.
|
|
503
|
+
6. Create DWS/ADS Dynamic Tables, trigger first refresh for verification
|
|
507
504
|
cz-cli task save-content 03_ddl_dws_ads --content "<dws_ads_ddl_sql>"
|
|
508
505
|
cz-cli task run 03_ddl_dws_ads
|
|
509
506
|
REFRESH DYNAMIC TABLE <dws_schema>.<table>
|
|
510
|
-
✅
|
|
511
|
-
→ status = SUCCESS
|
|
507
|
+
✅ Verify: SHOW DYNAMIC TABLE REFRESH HISTORY <schema>.<table> LIMIT 3
|
|
508
|
+
→ status = SUCCESS, row count matches aggregation logic
|
|
512
509
|
|
|
513
|
-
7.
|
|
510
|
+
7. Optional: data quality check task (Cron + depends on 04)
|
|
514
511
|
cz-cli task save-content 05_dqc_check --content "<dqc_sql>"
|
|
515
512
|
cz-cli task save-cron 05_dqc_check --cron '0 0 3 * * ? *'
|
|
516
513
|
cz-cli task deploy 05_dqc_check
|
|
517
514
|
```
|
|
518
515
|
|
|
519
|
-
>
|
|
516
|
+
> **Fail-fast principle**: if any step's verification fails, stop immediately and fix — don't continue. If ODS data is wrong, DWD will definitely be wrong too.
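
The verify-then-continue flow can be sketched as a small driver. The Python below is illustrative only: `run_pipeline` is a hypothetical helper, and the actions and checks are placeholder callables where a real script would invoke the `cz-cli` commands and verification queries from the steps above.

```python
from typing import Callable, Optional

# (step name, action to run, verification that returns True on success)
Step = tuple[str, Callable[[], None], Callable[[], bool]]


def run_pipeline(steps: list[Step]) -> tuple[list[str], Optional[str]]:
    """Run steps in order and stop at the first failed verification.

    Returns the names of the steps that passed and the name of the
    step that failed (None if every step passed).
    """
    passed: list[str] = []
    for name, action, verify in steps:
        action()
        if not verify():
            return passed, name  # fail fast: later steps never run
        passed.append(name)
    return passed, None


log: list[str] = []
steps: list[Step] = [
    ("01_ddl_ods", lambda: log.append("01_ddl_ods"), lambda: True),
    # Simulated failure, e.g. the ODS row count does not match the source.
    ("00_sync", lambda: log.append("00_sync"), lambda: False),
    ("02_ddl_dwd", lambda: log.append("02_ddl_dwd"), lambda: True),
]
passed, failed = run_pipeline(steps)
print(passed, failed)  # prints: ['01_ddl_ods'] 00_sync
```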
|
|
520
517
|
|
|
521
518
|
---
|
|
522
519
|
|
|
523
|
-
##
|
|
520
|
+
## Incremental Iteration Guide
|
|
524
521
|
|
|
525
|
-
|
|
522
|
+
**When modifying an existing pipeline, follow the incremental process — don't re-run the full pipeline build process.**
|
|
526
523
|
|
|
527
|
-
|
|
524
|
+
When the user says "add a table", "add a field", "add a metric", "change ETL logic", use interactive tools to collect the iteration type:
|
|
528
525
|
|
|
529
526
|
```
|
|
530
527
|
question({
|
|
531
528
|
questions: [{
|
|
532
|
-
question: "
|
|
529
|
+
question: "What modification do you want to make to the existing pipeline?",
|
|
533
530
|
options: [
|
|
534
|
-
{ label: "
|
|
535
|
-
{ label: "
|
|
536
|
-
{ label: "
|
|
537
|
-
{ label: "
|
|
531
|
+
{ label: "Add sync table", description: "Add a source table to an existing sync task" },
|
|
532
|
+
{ label: "Add field", description: "Source table added a field, ODS/DWD need to follow" },
|
|
533
|
+
{ label: "Add metric/DWS layer", description: "Add aggregation logic or Dynamic Table" },
|
|
534
|
+
{ label: "Modify ETL logic", description: "Cleaning rules, filter conditions, JOIN relationship changes" }
|
|
538
535
|
]
|
|
539
536
|
}]
|
|
540
537
|
})
|
|
541
538
|
```
|
|
542
539
|
|
|
543
|
-
###
|
|
540
|
+
### Add Sync Table
|
|
544
541
|
|
|
545
542
|
```
|
|
546
|
-
1.
|
|
547
|
-
|
|
543
|
+
1. Check lineage, confirm impact scope
|
|
544
|
+
Load clickzetta-table-lineage, confirm if new table has relationships with existing tables
|
|
548
545
|
|
|
549
|
-
2.
|
|
550
|
-
cz-cli task content 00_sync →
|
|
551
|
-
|
|
546
|
+
2. Add table to existing sync task (or create new single-table sync task)
|
|
547
|
+
cz-cli task content 00_sync → view existing config
|
|
548
|
+
Modify as needed and redeploy
|
|
552
549
|
|
|
553
|
-
3.
|
|
550
|
+
3. Manually trigger sync, verify immediately
|
|
554
551
|
cz-cli task execute 00_sync
|
|
555
552
|
✅ SELECT COUNT(*) FROM <ods_schema>.<new_table>
|
|
556
553
|
|
|
557
|
-
4.
|
|
558
|
-
cz-cli task content 04_transform_ods_to_dwd →
|
|
559
|
-
|
|
554
|
+
4. If DWD layer processing needed, add ETL SQL to 04_transform task
|
|
555
|
+
cz-cli task content 04_transform_ods_to_dwd → view existing SQL
|
|
556
|
+
Append new table cleaning logic, manually execute to verify, then redeploy
|
|
560
557
|
```
|
|
561
558
|
|
|
562
|
-
###
|
|
559
|
+
### Add Field (Schema Evolution)
|
|
563
560
|
|
|
564
561
|
```
|
|
565
|
-
1.
|
|
566
|
-
|
|
562
|
+
1. Check lineage, identify all affected downstream tasks/DTs
|
|
563
|
+
Load clickzetta-table-lineage
|
|
567
564
|
|
|
568
|
-
2.
|
|
569
|
-
ODS
|
|
570
|
-
✅
|
|
565
|
+
2. Update layer by layer (upstream to downstream, cannot skip layers)
|
|
566
|
+
ODS layer: ALTER TABLE <ods_schema>.<table> ADD COLUMN <col> <type>
|
|
567
|
+
✅ Verify: DESC TABLE <ods_schema>.<table> → confirm field added
|
|
571
568
|
|
|
572
|
-
DWD
|
|
573
|
-
|
|
574
|
-
✅
|
|
569
|
+
DWD layer: update ETL SQL, add cleaning logic for new field
|
|
570
|
+
Manually execute 04_transform to verify, then redeploy
|
|
571
|
+
✅ Verify: SELECT <new_col>, COUNT(*) FROM <dwd_schema>.<table> GROUP BY 1 LIMIT 5
|
|
575
572
|
|
|
576
|
-
DWS/ADS
|
|
577
|
-
|
|
578
|
-
✅
|
|
573
|
+
DWS/ADS layer (if needed): Dynamic Table doesn't support ALTER, use CREATE OR REPLACE to rebuild
|
|
574
|
+
Immediately REFRESH DYNAMIC TABLE after rebuild
|
|
575
|
+
✅ Verify: SHOW DYNAMIC TABLE REFRESH HISTORY LIMIT 3 → status = SUCCESS
|
|
579
576
|
|
|
580
|
-
3.
|
|
577
|
+
3. Update Studio task scripts (keep code assets in sync)
|
|
581
578
|
cz-cli task save-content <task_name> --content "<updated_sql>"
|
|
582
579
|
```
|
|
583
580
|
|
|
584
|
-
###
|
|
581
|
+
### Add Metric/DWS Layer
|
|
585
582
|
|
|
586
583
|
```
|
|
587
|
-
1.
|
|
584
|
+
1. Confirm metric definition (confirm calculation logic with user to avoid rework)
|
|
588
585
|
|
|
589
|
-
2.
|
|
586
|
+
2. Check if DWD layer has required fields — if not, follow "Add Field" process first
|
|
590
587
|
|
|
591
|
-
3.
|
|
588
|
+
3. Create new Dynamic Table
|
|
592
589
|
CREATE OR REPLACE DYNAMIC TABLE <dws_schema>.<new_metric_table>
|
|
593
590
|
REFRESH INTERVAL <n> <unit> vcluster <gp_cluster>
|
|
594
591
|
AS SELECT ...;
|
|
595
592
|
REFRESH DYNAMIC TABLE <dws_schema>.<new_metric_table>
|
|
596
|
-
✅
|
|
597
|
-
|
|
593
|
+
✅ Verify: SELECT COUNT(*), SUM(<metric>) FROM <dws_schema>.<new_metric_table>
|
|
594
|
+
Compare with known baseline values
|
|
598
595
|
|
|
599
|
-
4.
|
|
596
|
+
4. Save DDL to Studio task
|
|
600
597
|
cz-cli task save-content 03_ddl_dws_ads --content "<updated_ddl>"
|
|
601
598
|
```
|
|
602
599
|
|
|
603
|
-
###
|
|
600
|
+
### Modify ETL Logic
|
|
604
601
|
|
|
605
602
|
```
|
|
606
|
-
1.
|
|
607
|
-
|
|
603
|
+
1. Check lineage, confirm downstream impact scope
|
|
604
|
+
Load clickzetta-table-lineage
|
|
608
605
|
|
|
609
|
-
2.
|
|
606
|
+
2. Verify new logic in dev/test environment first (if available)
|
|
610
607
|
|
|
611
|
-
3.
|
|
612
|
-
cz-cli task content 04_transform_ods_to_dwd →
|
|
613
|
-
|
|
608
|
+
3. Update ETL SQL
|
|
609
|
+
cz-cli task content 04_transform_ods_to_dwd → view existing logic
|
|
610
|
+
After modification, manually execute to verify:
|
|
614
611
|
cz-cli task execute 04_transform_ods_to_dwd
|
|
615
|
-
✅
|
|
612
|
+
✅ Verify: row count comparison, key field sampling, compare with pre-modification results
|
|
616
613
|
|
|
617
|
-
4.
|
|
614
|
+
4. After verification passes, redeploy
|
|
618
615
|
cz-cli task save-content 04_transform_ods_to_dwd --content "<new_sql>"
|
|
619
616
|
cz-cli task deploy 04_transform_ods_to_dwd
|
|
620
617
|
|
|
621
|
-
5.
|
|
618
|
+
5. If downstream Dynamic Tables are affected, trigger full refresh
|
|
622
619
|
SET cz.optimizer.incremental.force.full.refresh = true;
|
|
623
620
|
REFRESH DYNAMIC TABLE <dws_schema>.<table>;
|
|
624
621
|
SET cz.optimizer.incremental.force.full.refresh = false;
|
|
625
622
|
```
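
In step 5, a failed `REFRESH` would leave the full-refresh flag switched on if the three statements are sent one by one. The Python sketch below wraps the sequence so the flag is always reset. It assumes some `execute(sql)` callable for the current SQL session, which is a placeholder here; `forced_full_refresh` and the table name `dws.example_table` are hypothetical.

```python
from contextlib import contextmanager
from typing import Callable, Iterator

FLAG = "cz.optimizer.incremental.force.full.refresh"


@contextmanager
def forced_full_refresh(execute: Callable[[str], None]) -> Iterator[None]:
    """Turn the full-refresh flag on, and always turn it off again."""
    execute(f"SET {FLAG} = true")
    try:
        yield
    finally:
        # Runs even if the REFRESH statement raises.
        execute(f"SET {FLAG} = false")


# Stand-in for a real SQL session: it only records the statements.
sent: list[str] = []
with forced_full_refresh(sent.append):
    sent.append("REFRESH DYNAMIC TABLE dws.example_table")

print(sent[-1])  # prints: SET cz.optimizer.incremental.force.full.refresh = false
```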
|
|
626
623
|
|
|
627
|
-
###
|
|
624
|
+
### Delivery Verification Checklist
|
|
628
625
|
|
|
629
|
-
- [ ]
|
|
630
|
-
- [ ] Dynamic Table
|
|
631
|
-
- [ ] Dynamic Table
|
|
632
|
-
- [ ]
|
|
633
|
-
- [ ] LEFT JOIN
|
|
634
|
-
- [ ]
|
|
635
|
-
- [ ] DWS/ADS
|
|
636
|
-
- [ ]
|
|
637
|
-
- [ ] **ETL
|
|
638
|
-
- [ ]
|
|
626
|
+
- [ ] Row counts at each layer match expectations
|
|
627
|
+
- [ ] Dynamic Table's VCluster exists and `status = RUNNING` (`SHOW VCLUSTERS`)
|
|
628
|
+
- [ ] Dynamic Table refresh history shows SUCCESS
|
|
629
|
+
- [ ] Key field NULL rate within acceptable range
|
|
630
|
+
- [ ] LEFT JOIN result row count ≥ left table row count
|
|
631
|
+
- [ ] All DDL tasks are in DRAFT status
|
|
632
|
+
- [ ] No redundant scheduled tasks for DWS/ADS layer
|
|
633
|
+
- [ ] Scheduling DAG has no circular dependencies
|
|
634
|
+
- [ ] **ETL task dependency chain is complete** (verify with `cz-cli task deps <task>`; `task_dependencies` must not be empty)
|
|
635
|
+
- [ ] Key tables and fields have comments (load `clickzetta-manage-comments`)
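
The "NULL rate within acceptable range" item needs a concrete number to compare against. One way to compute it is sketched below in Python, using the built-in `sqlite3` as a stand-in engine rather than Lakehouse. The table `dwd_orders`, its columns, and the helper `null_rate` are made up for illustration.

```python
import sqlite3


def null_rate(conn: sqlite3.Connection, table: str, column: str) -> float:
    """Fraction of rows in `table` where `column` IS NULL (0.0 if empty)."""
    # Identifiers cannot be bound as parameters, so pass trusted names only.
    total, non_null = conn.execute(
        f'SELECT COUNT(*), COUNT("{column}") FROM "{table}"'
    ).fetchone()
    return 0.0 if total == 0 else (total - non_null) / total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dwd_orders (id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO dwd_orders VALUES (?, ?)",
    [(1, 7), (2, None), (3, 8), (4, 9)],
)

print(null_rate(conn, "dwd_orders", "customer_id"))  # prints: 0.25
```

`COUNT(column)` skips NULLs while `COUNT(*)` does not, so a single query yields both totals.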
|
|
639
636
|
|
|
640
637
|
---
|
|
641
638
|
|
|
642
|
-
##
|
|
639
|
+
## Multi-environment Management (dev → prod)
|
|
643
640
|
|
|
644
|
-
ClickZetta
|
|
641
|
+
ClickZetta isolates environments via **Workspace** (dev/staging/prod correspond to different Workspaces). Cross-Workspace pipeline migration currently has limited automation and mainly relies on manual operations.
|
|
645
642
|
|
|
646
|
-
|
|
643
|
+
**When the user raises multi-environment migration needs**, inform them of the following limitations and guide accordingly:
|
|
647
644
|
|
|
648
|
-
-
|
|
649
|
-
-
|
|
650
|
-
-
|
|
645
|
+
- Data source configurations, schemas, and VCluster names are independent across different Workspaces — each must be confirmed and replaced during migration
|
|
646
|
+
- There is currently no one-click migration tool — recommend contacting **data operations (lh-dba role)** for help planning multi-environment strategy
|
|
647
|
+
- You can use `cz-cli task content <task_id>` to export task scripts, manually adjust, then recreate in the target Workspace
|
|
651
648
|
|
|
652
|
-
>
|
|
649
|
+
> Multi-environment management is a capability the platform is still developing. For now, the recommended approach is to differentiate environments by schema naming within a single Workspace (e.g., `ecommerce_ods_dev` vs `ecommerce_ods`), which reduces migration complexity.
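
The schema-naming convention is simple enough to centralize in one helper so SQL templates never hard-code an environment. This is a Python sketch under one assumption drawn from the example names: production uses the bare schema name and every other environment appends its name as a suffix. The function `schema_for` is hypothetical.

```python
def schema_for(base: str, env: str) -> str:
    """Return the schema name for an environment within one Workspace.

    Assumption: 'prod' uses the bare name; other environments append
    '_<env>', mirroring ecommerce_ods_dev vs ecommerce_ods.
    """
    env = env.strip().lower()
    if not env:
        raise ValueError("env must not be empty")
    return base if env == "prod" else f"{base}_{env}"


print(schema_for("ecommerce_ods", "dev"))   # prints: ecommerce_ods_dev
print(schema_for("ecommerce_ods", "prod"))  # prints: ecommerce_ods
```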
|