@clickzetta/cz-cli-darwin-x64 0.3.92 → 0.3.94
- package/bin/cz-cli +0 -0
- package/bin/skills/clickzetta-ai-function/SKILL.md +109 -0
- package/bin/skills/clickzetta-ai-function/eval_cases.jsonl +4 -0
- package/bin/skills/clickzetta-ai-function/references/ai-function-ddl.md +106 -0
- package/bin/skills/clickzetta-batch-sync-pipeline/SKILL.md +124 -124
- package/bin/skills/clickzetta-batch-sync-pipeline/eval_cases.jsonl +5 -5
- package/bin/skills/clickzetta-bi-connect/SKILL.md +79 -78
- package/bin/skills/clickzetta-bi-connect/references/bi-tools.md +56 -56
- package/bin/skills/clickzetta-cdc-sync-pipeline/SKILL.md +386 -382
- package/bin/skills/clickzetta-cdc-sync-pipeline/eval_cases.jsonl +5 -5
- package/bin/skills/clickzetta-data-ingest-pipeline/SKILL.md +73 -212
- package/bin/skills/clickzetta-data-science/SKILL.md +57 -56
- package/bin/skills/clickzetta-data-science/references/bitmap-profile.md +38 -38
- package/bin/skills/clickzetta-data-science/references/data-patterns.md +16 -16
- package/bin/skills/clickzetta-data-science/references/setup.md +28 -28
- package/bin/skills/clickzetta-data-science/references/stats-functions.md +44 -44
- package/bin/skills/clickzetta-data-science/references/write-and-infer.md +22 -22
- package/bin/skills/clickzetta-data-science/references/zettapark-api.md +32 -32
- package/bin/skills/clickzetta-dw-modeling/SKILL.md +1 -1
- package/bin/skills/clickzetta-external-function/SKILL.md +51 -109
- package/bin/skills/clickzetta-external-function/eval_cases.jsonl +4 -4
- package/bin/skills/clickzetta-external-function/references/external-function-ddl.md +39 -77
- package/bin/skills/clickzetta-java-sdk/SKILL.md +49 -48
- package/bin/skills/clickzetta-java-sdk/eval_cases.jsonl +12 -12
- package/bin/skills/clickzetta-java-sdk/references/bulkload.md +34 -34
- package/bin/skills/clickzetta-java-sdk/references/realtime.md +44 -44
- package/bin/skills/clickzetta-kafka-ingest-pipeline/SKILL.md +273 -507
- package/bin/skills/clickzetta-kafka-ingest-pipeline/references/kafka-pipe-syntax.md +197 -231
- package/bin/skills/clickzetta-oss-ingest-pipeline/SKILL.md +231 -304
- package/bin/skills/clickzetta-realtime-sync-pipeline/SKILL.md +180 -179
- package/bin/skills/clickzetta-realtime-sync-pipeline/eval_cases.jsonl +5 -5
- package/bin/skills/clickzetta-semantic-view/SKILL.md +74 -72
- package/bin/skills/clickzetta-semantic-view/eval_cases.jsonl +12 -12
- package/bin/skills/clickzetta-semantic-view/references/semantic-view-reference.md +75 -75
- package/bin/skills/clickzetta-sql-migration/SKILL.md +128 -0
- package/bin/skills/clickzetta-sql-migration/eval_cases.jsonl +10 -0
- package/bin/skills/clickzetta-sql-migration/references/ddl-reference.md +350 -0
- package/bin/skills/clickzetta-sql-migration/references/dml-differences.md +192 -0
- package/bin/skills/clickzetta-sql-migration/references/dml-reference.md +279 -0
- package/bin/skills/{clickzetta-sql-syntax-guide → clickzetta-sql-migration}/references/dql-reference.md +128 -128
- package/bin/skills/clickzetta-sql-migration/references/function-mapping.md +194 -0
- package/bin/skills/clickzetta-sql-migration/references/functions-reference.md +372 -0
- package/bin/skills/clickzetta-sql-migration/references/implicit-type-conversion.md +143 -0
- package/bin/skills/clickzetta-sql-migration/references/migration-databricks.md +260 -0
- package/bin/skills/{clickzetta-sql-syntax-guide → clickzetta-sql-migration}/references/migration-snowflake.md +112 -112
- package/bin/skills/clickzetta-sql-migration/references/vs-snowflake.md +346 -0
- package/bin/skills/clickzetta-sql-migration/references/vs-spark.md +229 -0
- package/bin/skills/clickzetta-studio-task-manager/SKILL.md +326 -329
- package/bin/skills/clickzetta-table-lineage/SKILL.md +57 -55
- package/bin/skills/clickzetta-table-lineage/eval_cases.jsonl +1 -1
- package/bin/skills/clickzetta-table-lineage/references/normalize_func.sql +5 -5
- package/bin/skills/clickzetta-table-lineage/references/table_cost.sql +6 -6
- package/bin/skills/clickzetta-table-lineage/references/table_relation.sql +2 -2
- package/bin/skills/clickzetta-volume-manager/SKILL.md +186 -100
- package/bin/skills/clickzetta-volume-manager/references/volume-ddl.md +153 -52
- package/package.json +1 -1
- package/bin/skills/clickzetta-dynamic-table/best-practices/scheduling-guide.md +0 -135
- package/bin/skills/clickzetta-dynamic-table/dt-creator/references/dt-declaration-strategy.md +0 -185
- package/bin/skills/clickzetta-dynamic-table/dt-creator/references/refresh-history-guide.md +0 -260
- package/bin/skills/clickzetta-dynamic-table/dynamic-table-alter/SKILL.md +0 -191
- package/bin/skills/clickzetta-sql-syntax-guide/SKILL.md +0 -249
- package/bin/skills/clickzetta-sql-syntax-guide/eval_cases.jsonl +0 -3
- package/bin/skills/clickzetta-sql-syntax-guide/references/ddl-reference.md +0 -350
- package/bin/skills/clickzetta-sql-syntax-guide/references/dml-reference.md +0 -279
- package/bin/skills/clickzetta-sql-syntax-guide/references/functions-reference.md +0 -372
- package/bin/skills/clickzetta-sql-syntax-guide/references/migration-databricks.md +0 -260
- package/bin/skills/clickzetta-sql-syntax-guide/references/vs-snowflake.md +0 -346
- package/bin/skills/clickzetta-sql-syntax-guide/references/vs-spark.md +0 -229
- /package/bin/skills/{clickzetta-sql-syntax-guide → clickzetta-sql-migration}/LICENSE +0 -0
@@ -1,5 +1,5 @@
-{"case_id":"001","type":"should_call","user_input":"
-{"case_id":"002","type":"should_call","user_input":"
-{"case_id":"003","type":"should_call","user_input":"CDC
-{"case_id":"004","type":"should_call","user_input":"
-{"case_id":"005","type":"should_call","user_input":"PostgreSQL
+{"case_id":"001","type":"should_call","user_input":"How do I sync an entire MySQL database to Lakehouse in real time?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["full database mirror","Binlog","CDC"]}
+{"case_id":"002","type":"should_call","user_input":"How do I merge sharded tables into a single Lakehouse table during sync?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["sharded table merge","merge"]}
+{"case_id":"003","type":"should_call","user_input":"The CDC sync task has a Binlog position expired error, what should I do?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["Binlog"]}
+{"case_id":"004","type":"should_call","user_input":"How do I configure alerts for multi-table real-time sync? Does it support Feishu notifications?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["alert","webhook"]}
+{"case_id":"005","type":"should_call","user_input":"What source database preparation is needed for PostgreSQL full database CDC sync?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["WAL","PostgreSQL"]}
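The added eval cases share one JSON-lines schema: `case_id`, `type`, `user_input`, `expected_skill`, and `expected_output_contains`. The sketch below shows one way such a case could be scored. The `check_case` helper and its case-insensitive substring matching are assumptions made for illustration; the package's real evaluation harness is not shown in this diff.

```python
import json

# One line copied from the added eval_cases.jsonl shown in this hunk.
CASE_LINE = (
    '{"case_id":"003","type":"should_call",'
    '"user_input":"The CDC sync task has a Binlog position expired error, what should I do?",'
    '"expected_skill":"clickzetta-cdc-sync-pipeline",'
    '"expected_output_contains":["Binlog"]}'
)


def check_case(case: dict, called_skill: str, output: str) -> bool:
    """Hypothetical check: the expected skill was called and every phrase appears."""
    if case["type"] == "should_call" and called_skill != case["expected_skill"]:
        return False
    lowered = output.lower()
    return all(p.lower() in lowered for p in case["expected_output_contains"])


case = json.loads(CASE_LINE)
passed = check_case(
    case, "clickzetta-cdc-sync-pipeline", "Reset the binlog position, then restart the task."
)
failed = check_case(case, "clickzetta-batch-sync-pipeline", "Reset the binlog position.")
```

Here `passed` is true because the right skill was called and the output mentions "Binlog"; `failed` is false because a different skill was called.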
@@ -1,237 +1,98 @@
 ---
 name: clickzetta-data-ingest-pipeline
 description: |
-ClickZetta Lakehouse
-
-
-
-
+Router skill: selects the best data ingestion method for ClickZetta Lakehouse based on
+data source, latency, sync scope, and continuity requirements. Routes to specialized skills:
+Kafka Pipe, OSS/S3/COS Pipe, Studio batch sync, Studio real-time sync, Studio CDC,
+file/URL import, or SDK ingestion.
+Trigger: user wants to get external data INTO Lakehouse but hasn't chosen a method, OR
+asks which ingestion approach fits their scenario.
+Covers: Kafka, MySQL, PostgreSQL, SQL Server, TiDB, OSS, S3, COS, local files, URLs,
+Java SDK, Python/ZettaPark. Does NOT cover: querying data, pipeline monitoring/diagnosis,
+Dynamic Table tuning, internal table transformations, or data export.
+Keywords: data ingestion, import, sync, ETL, migrate, load data, ingest, pipeline selection
 ---

-# Lakehouse
+# Lakehouse Data Ingestion Router

-
-and routes to the corresponding specialized Pipeline Skill, or performs simple import operations directly.
+Routes users to the correct ingestion skill based on data source, latency requirements, sync scope, and whether continuous sync is needed. Scoped to **external data → Lakehouse** only.

-##
+## Applicable Scenarios

--
--
--
-- Keywords: data import, data warehousing, data lake loading, data collection, data loading, pipeline selection
+- User wants to import data into ClickZetta Lakehouse but is unsure which method to use
+- User describes a data source and needs an ingestion recommendation
+- User asks about differences between ingestion methods

-##
+## Decision Matrix

-
-- **Execution environment**: cz-cli is installed and configured
+### Step 1: Collect Requirements

-
+1. **Data source type**: Kafka / Object storage (OSS/S3/COS) / Relational DB (MySQL/PostgreSQL/SQL Server/TiDB) / Local file / URL / Java SDK / Python
+2. **Latency**: Real-time (seconds) / Near real-time (minutes) / Batch (hours/days)
+3. **Sync scope**: Single table / Multiple tables / Entire database
+4. **Continuity**: One-time import / Continuous incremental sync
+5. **CDC needed**: Yes / No

-
+### Step 2: Route

-
-
-
-
-
-
-
-
-
-
-
-
-
-
-3. **Sync scope**: Single table / Multiple tables / Entire database
-4. **Continuous sync needed**: One-time import / Continuous incremental sync
-5. **CDC (change data capture) needed**: Yes / No
-
-### Step 2: Recommend an Approach Using the Decision Matrix
-
-| Data Source | Latency | Sync Scope | Recommended Method | Corresponding Skill |
-|--------|--------|---------|---------|-----------|
-| Kafka | Real-time/Near real-time | Single topic | Kafka PIPE continuous import (SQL) | `clickzetta-kafka-ingest-pipeline` |
-| Kafka | Real-time | Multiple topics | Studio real-time sync | `clickzetta-realtime-sync-pipeline` |
-| Object storage (OSS/S3/COS) | Near real-time/Batch | Files arriving continuously | PIPE continuous import | `clickzetta-oss-ingest-pipeline` |
-| Object storage | One-time | Bulk files | COPY INTO command | `clickzetta-file-import-pipeline` (COPY INTO section) |
-| MySQL/PostgreSQL/SQL Server | Real-time CDC | Single table | Studio real-time sync | `clickzetta-realtime-sync-pipeline` |
-| MySQL/PostgreSQL/SQL Server | Real-time CDC | Multiple tables/Entire database | Studio multi-table real-time sync | `clickzetta-cdc-sync-pipeline` |
-| MySQL/PostgreSQL/SQL Server | Offline batch | Single table | Studio offline sync | `clickzetta-batch-sync-pipeline` |
-| MySQL/PostgreSQL/SQL Server | Offline batch | Multiple tables | Studio multi-table offline sync | `clickzetta-batch-sync-pipeline` |
-| Local file / URL | One-time | Single/multiple files | URL download + COPY INTO | `clickzetta-file-import-pipeline` |
-| Streaming incremental computation | Near real-time | Driven by table changes | Dynamic Table + Stream | `clickzetta-incremental-compute-pipeline` |
-| Java application | Real-time/Batch | Programmatic write | Java SDK | (see the SDK import guide below) |
-| Python/ZettaPark | Batch | DataFrame | ZettaPark save_as_table | (see the SDK import guide below) |
-
-### Step 3: Route to a Specialized Skill or Execute Directly
-
-Based on the recommended approach, apply the following routing logic:
-
-**Scenarios with a corresponding specialized Skill** → tell the user the recommended approach and direct them to the corresponding Skill:
-- `clickzetta-kafka-ingest-pipeline`: Kafka PIPE pipeline setup
-- `clickzetta-oss-ingest-pipeline`: object storage PIPE pipeline setup
-- `clickzetta-batch-sync-pipeline`: Studio offline sync tasks
-- `clickzetta-realtime-sync-pipeline`: Studio real-time sync tasks
-- `clickzetta-cdc-sync-pipeline`: Studio multi-table real-time sync (CDC)
-- `clickzetta-incremental-compute-pipeline`: Dynamic Table + Stream incremental compute pipeline
-- `clickzetta-file-import-pipeline`: URL/file download import
-- `clickzetta-table-stream-pipeline`: Table Stream change data capture
-
-**Simple scenarios without a specialized Skill** → execute directly:
-
-#### SQL INSERT Import (Small Data Volume)
-```sql
-INSERT INTO schema_name.table_name (col1, col2, col3)
-VALUES ('val1', 'val2', 'val3');
-```
-
-#### COPY INTO Quick Import (from Volume)
-```sql
--- 1. Confirm the Volume contains files
-SHOW VOLUME DIRECTORY volume_name;
-
--- 2. Run COPY INTO
-COPY INTO schema_name.table_name
-FROM VOLUME volume_name
-USING CSV
-OPTIONS('header' = 'true');
-```
-
-#### Java SDK Import Guide
-Provide the key Java SDK configuration details:
-- Maven dependency coordinates
-- Connection configuration (endpoint, workspace, schema, vcluster)
-- Bulk write API: `BulkloadWriter`
-- Real-time write API: `RealtimeWriter`
-- Advise the user to consult the official documentation: `comprehensive_guide_to_ingesting_javasdk_buckload_realtime`
-
-#### ZettaPark (Python) Import Guide
-- `INSERT` approach: `session.sql("INSERT INTO ...")`
-- `save_as_table` approach: `df.write.save_as_table("table_name")`
-- Advise the user to consult the official documentation: `comprehensive_guide_to_ingesting_zettapark_save_as_table`
-
-## Heterogeneous Data Source Type Mapping (Required Reading for ODS-Layer Table Creation)
-
-When syncing data from relational databases (MySQL, PostgreSQL, etc.), the column types chosen for ODS-layer tables directly affect the sync success rate.
-
-**Core principle: use permissive types in the ODS layer, then convert to precise types in the DWD layer.**
-
-### MySQL → Lakehouse Type Mapping
-
-| MySQL Type | ❌ Do Not Use in ODS | ✅ Use in ODS | DWD-Layer Conversion |
-|---|---|---|---|
-| `BIT(1)` | `BOOLEAN` | `TINYINT` | `CAST(col AS BOOLEAN)` |
-| `TINYINT(1)` | `BOOLEAN` | `TINYINT` | `CAST(col AS BOOLEAN)` |
-| `DATETIME` | `DATETIME` | `TIMESTAMP` | Use as-is |
-| `ENUM('a','b')` | `ENUM` | `STRING` | Use as-is |
-| `TEXT` / `LONGTEXT` / `MEDIUMTEXT` | `TEXT` | `STRING` | Use as-is |
-| `DECIMAL(p,s)` | `FLOAT` / `DOUBLE` | `DECIMAL(p,s)` | Use as-is |
-| `JSON` | `JSON` | `STRING` | `parse_json(col)['field']` |
-| `SET('a','b')` | `SET` | `STRING` | Use as-is |
-
-> ⚠️ **Most common pitfall**: mapping `BIT(1)` to `BOOLEAN` causes the sync task to fail. Use `TINYINT` in the ODS layer instead, then convert with `CAST(col AS BOOLEAN)` in the DWD layer once the sync succeeds.
-
-### PostgreSQL → Lakehouse Type Mapping
-
-| PostgreSQL Type | ❌ Do Not Use in ODS | ✅ Use in ODS |
-|---|---|---|
-| `BOOLEAN` | `BOOLEAN` | `TINYINT` |
-| `SERIAL` / `BIGSERIAL` | `SERIAL` | `BIGINT` |
-| `JSONB` / `JSON` | `JSON` | `STRING` |
-| `ARRAY` | `ARRAY` | `STRING` (JSON-serialized) |
-| `UUID` | `UUID` | `STRING` |
-| `NUMERIC(p,s)` | `FLOAT` | `DECIMAL(p,s)` |
-
-## Data Warehousing vs. Data Lake Loading| Dimension | Data Warehousing | Data Lake Loading |
-|------|---------|---------|
-| Target | Lakehouse managed tables | User Volume (object storage) |
-| Format | Automatically converted to the internal columnar format | Original file format preserved |
-| Query performance | High (columnar storage + indexes) | Lower (raw files must be scanned) |
-| Use cases | Analytical queries, BI reports, data warehouse | Data staging, raw data archiving, cross-system sharing |
-| Common methods | Studio sync, PIPE, COPY INTO, SDK | PUT file, Python script upload |
+| Data Source | Latency | Scope | Route To |
+|-------------|---------|-------|----------|
+| Kafka | Real-time/Near real-time | Single topic (SQL Pipe) | `clickzetta-kafka-ingest-pipeline` |
+| Kafka | Real-time | Single topic (Studio task) | `clickzetta-realtime-sync-pipeline` |
+| OSS/S3/COS | Near real-time | Files arriving continuously | `clickzetta-oss-ingest-pipeline` |
+| OSS/S3/COS | One-time batch | Bulk files from bucket | `clickzetta-oss-ingest-pipeline` (batch mode) |
+| Local file / URL | One-time | Single/multiple files | `clickzetta-file-import-pipeline` |
+| MySQL/PG/SQL Server/TiDB | Real-time CDC | Single table | `clickzetta-realtime-sync-pipeline` |
+| MySQL/PG/SQL Server/TiDB | Real-time CDC | Multiple tables / Entire DB | `clickzetta-cdc-sync-pipeline` |
+| MySQL/PG/SQL Server | Offline batch | Single table | `clickzetta-batch-sync-pipeline` |
+| MySQL/PG/SQL Server | Offline batch | Multiple tables / Entire DB | `clickzetta-batch-sync-pipeline` |
+| Java application | Real-time/Batch | Programmatic write | `clickzetta-java-sdk` |
+| Python/ZettaPark | Batch | DataFrame write | `clickzetta-zettapark` / `clickzetta-app-python-sdk` |
+| Small data (manual) | One-time | Few rows | Direct SQL INSERT (see below) |
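The added decision matrix can be read as a lookup from collected requirements to a skill name. The sketch below is a simplified encoding of it, not code from the package: the `Requirement` tuple, its string values, and the `route` function are hypothetical names. It defaults Kafka to the SQL Pipe skill and leaves out the Python/ZettaPark row, which names two candidate skills.

```python
from typing import NamedTuple


class Requirement(NamedTuple):
    source: str   # "kafka", "object_storage", "file", "rdbms", or "java" (assumed labels)
    latency: str  # "realtime" or "batch" (assumed labels)
    scope: str    # "single" or "multi" (assumed labels)


def route(req: Requirement) -> str:
    """Return a skill name following the rows of the decision matrix."""
    if req.source == "kafka":
        # The matrix lists a SQL Pipe skill and a Studio task skill for Kafka;
        # this sketch picks the SQL Pipe skill by default.
        return "clickzetta-kafka-ingest-pipeline"
    if req.source == "object_storage":
        return "clickzetta-oss-ingest-pipeline"
    if req.source == "file":
        return "clickzetta-file-import-pipeline"
    if req.source == "rdbms":
        if req.latency == "realtime":
            if req.scope == "single":
                return "clickzetta-realtime-sync-pipeline"
            return "clickzetta-cdc-sync-pipeline"
        return "clickzetta-batch-sync-pipeline"
    if req.source == "java":
        return "clickzetta-java-sdk"
    raise ValueError(f"no route for {req!r}")


whole_db_cdc = route(Requirement("rdbms", "realtime", "multi"))
nightly_load = route(Requirement("rdbms", "batch", "single"))
```

Raising `ValueError` for unlisted sources keeps the sketch from silently guessing a route the matrix does not define.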

-
+### How to Choose Between Similar Options

-
+| Scenario | Option A | Option B | Differentiator |
+|----------|----------|----------|----------------|
+| Kafka → Lakehouse | `clickzetta-kafka-ingest-pipeline` | `clickzetta-realtime-sync-pipeline` | Pipe: SQL-native, flexible transforms, no Studio. Studio task: UI-based JSONPath config. |
+| OSS one-time import | `clickzetta-oss-ingest-pipeline` (batch) | `clickzetta-file-import-pipeline` | OSS skill: creates Connection + External Volume. File skill: uses existing/User Volume. |
+| Single-table CDC | `clickzetta-realtime-sync-pipeline` | `clickzetta-cdc-sync-pipeline` | Realtime-sync: simpler for single table. CDC: more operational depth, alerting, repair. |
+| Batch single vs multi | Same skill | Same skill | `clickzetta-batch-sync-pipeline` handles both; multi-table also supports sharded-table merge. |

-
+### Key Constraints

-
-
-
-
-
-
-
+| Skill | Constraints |
+|-------|-------------|
+| `clickzetta-kafka-ingest-pipeline` | PLAINTEXT/SASL_PLAINTEXT only (no SSL/mTLS); GP VCluster |
+| `clickzetta-realtime-sync-pipeline` | Sync VCluster required; Studio datasource config; no test mode before deployment |
+| `clickzetta-cdc-sync-pipeline` | Source tables must have PKs; Studio datasource (not SQL Connection); Sync VCluster |
+| `clickzetta-batch-sync-pipeline` | Studio datasource; Sync VCluster; field mapping configured in Studio UI |
+| `clickzetta-oss-ingest-pipeline` | Dedicated Volume per Pipe (no sharing); same cloud provider as Lakehouse instance |
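Some rows of the Key Constraints table lend themselves to a pre-flight check before a route is recommended. The following is a minimal sketch under stated assumptions: the `facts` dictionary and its keys (`kafka_security_protocol`, `has_primary_keys`, `has_sync_vcluster`) are hypothetical, and only three of the listed constraints are encoded.

```python
# Skills that the table says need a Sync VCluster.
STUDIO_SYNC_SKILLS = {
    "clickzetta-realtime-sync-pipeline",
    "clickzetta-cdc-sync-pipeline",
    "clickzetta-batch-sync-pipeline",
}


def constraint_violations(skill: str, facts: dict) -> list[str]:
    """Return human-readable problems for a chosen skill (illustrative subset only)."""
    problems = []
    if skill == "clickzetta-kafka-ingest-pipeline":
        # Hypothetical fact key; the table allows PLAINTEXT and SASL_PLAINTEXT only.
        if facts.get("kafka_security_protocol") not in ("PLAINTEXT", "SASL_PLAINTEXT"):
            problems.append("Kafka Pipe supports PLAINTEXT/SASL_PLAINTEXT only")
    if skill == "clickzetta-cdc-sync-pipeline" and not facts.get("has_primary_keys", False):
        problems.append("CDC source tables must have primary keys")
    if skill in STUDIO_SYNC_SKILLS and not facts.get("has_sync_vcluster", False):
        problems.append("a Sync VCluster is required")
    return problems


ssl_kafka = constraint_violations(
    "clickzetta-kafka-ingest-pipeline", {"kafka_security_protocol": "SSL"}
)
ready_cdc = constraint_violations(
    "clickzetta-cdc-sync-pipeline", {"has_primary_keys": True, "has_sync_vcluster": True}
)
```

An empty list means none of the encoded constraints were violated; it does not prove that the remaining table rows are satisfied.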

-
+## Direct Execution for Simple Scenarios

-
+For trivial imports that don't need a specialized skill:

-
-
-
-- Real-time → `clickzetta-realtime-sync-pipeline` or `clickzetta-cdc-sync-pipeline`
-- Offline → `clickzetta-batch-sync-pipeline`
-→ Direct each to its corresponding Skill
-
-### Example 3: Simple One-Time File Import
-
-User says: "I have a CSV file to import"
-
-Routing logic:
-1. Data source: local file
-2. One-time import
-→ Route to the `clickzetta-file-import-pipeline` Skill (supports file upload + COPY INTO)
-
-## Error Handling
-
-| Scenario | Resolution |
-|------|---------|
-| User cannot identify the data source type | Ask where the data currently lives (which system/service) to help determine it |
-| User's requirements span multiple import methods | Split into separate import tasks and route each to its corresponding Skill |
-| The recommended Skill does not exist yet | Provide the basic steps and key SQL/API for that import method, and point the user to the official documentation |
-| User's cloud environment does not support a connection type | Run `cz-cli sql "SHOW CONNECTIONS" --sync` to check the available connection types and recommend an alternative |
-| Very large data volume (TB scale) | Recommend importing in batches; prefer PIPE or Studio sync tasks (support checkpoint resume) |
-
-## Notes
-
-- This Skill is a routing entry point; it does not set up complex pipelines itself but directs users to specialized Skills
-- Simple scenarios (SQL INSERT, a single COPY INTO) can be completed directly in this Skill
-- When recommending an approach, consider the user's cloud environment (Alibaba Cloud/Tencent Cloud/AWS); supported connection types may differ between environments
-- Run `cz-cli sql "SHOW VCLUSTERS" --sync` to confirm the available virtual clusters; sync tasks require a SYNC-type VCluster
-- Data warehousing is the most common scenario; data lake loading is mainly for staging raw data or cross-system sharing
-
----
-
-## cz-cli Alternative Path
-
-> Use this section only when cz-cli is available and MCP is not.
-> This Skill is a routing entry point; the core logic of the cz-cli path lives in the "cz-cli Alternative Path" section of each specialized Skill.
-
-### Routing Notes
-
-When MCP is unavailable, every specialized Skill provides a cz-cli alternative path:
+```bash
+# SQL INSERT (small data volume)
+cz-cli sql "INSERT INTO schema_name.table_name (col1, col2, col3) VALUES ('val1', 'val2', 'val3')" --sync

-
-
-
-| Object storage (OSS/S3/COS) | PIPE continuous import | `clickzetta-oss-ingest-pipeline` → cz-cli alternative path |
-| MySQL/PostgreSQL/SQL Server (real-time, single table) | Studio real-time sync | `clickzetta-realtime-sync-pipeline` → cz-cli alternative path |
-| MySQL/PostgreSQL/SQL Server (real-time, multiple tables/entire database) | Studio multi-table real-time sync | `clickzetta-cdc-sync-pipeline` → cz-cli alternative path |
-| MySQL/PostgreSQL/SQL Server (offline batch) | Studio offline sync | `clickzetta-batch-sync-pipeline` → cz-cli alternative path |
+# COPY INTO from Volume
+cz-cli sql "COPY INTO schema_name.table_name FROM VOLUME volume_name USING CSV OPTIONS('header'='true') FILES('data.csv')" --sync
+```

-
+## Error Handling

-
+| Scenario | Resolution |
+|----------|-----------|
+| User cannot identify data source type | Ask where data currently lives (which system/service) |
+| Multiple data sources | Split into separate tasks, route each independently |
+| Cloud environment doesn't support a connection | `cz-cli sql "SHOW CONNECTIONS" --sync` to check available types |
+| TB-scale data volume | Prefer PIPE or Studio sync tasks (support checkpoint resume) |
+| No Sync VCluster available | Studio sync tasks (batch/realtime/CDC) require Sync VCluster — check with `cz-cli sql "SHOW VCLUSTERS" --sync` |

-
-# SQL INSERT import (small data volume)
-cz-cli agent run "Insert data into table <schema_name>.<table_name>: <col1>=<val1>, <col2>=<val2>" \
---format a2a --dangerously-skip-permissions
+## Notes

-
-
-
-```
+- This skill is a routing entry point — it does not execute complex pipeline setup
+- Data warehousing (into managed tables) is the default; data lake loading (into Volume/object storage) is for staging or cross-system sharing
+- Consider user's cloud environment (Alibaba Cloud/Tencent Cloud/AWS) — cross-cloud ingestion is not supported
@@ -1,50 +1,51 @@
 ---
 name: clickzetta-data-science
 description: |
-
-
-
-
-
-
-"
-"
-"
-"
-"
+End-to-end data science workflow guide for ClickZetta Lakehouse. Organized by
+work stage: environment setup (Python 3.10+ check/install), Jupyter Notebook
+configuration, project structure (Cookiecutter DS standard), data discovery,
+data quality assessment, data cleaning and integration, dataset construction,
+EDA, feature engineering (SQL + ZettaPark), and model inference deployment
+(BITMAP user profiling / UDF batch inference / vector search).
+Trigger when user mentions: "data science", "machine learning", "feature engineering",
+"EDA", "data exploration", "ZettaPark ML", "Jupyter connect Lakehouse", "notebook",
+"ipynb", "jupyter kernel", "%%sql", "magic command", "pandas read data",
+"data quality check", "data sampling", "TABLESAMPLE", "approx_percentile",
+"BITMAP user profile", "audience segmentation", "batch inference", "Python 3.10",
+"scikit-learn", "project directory structure", "config.json", ".env".
 Keywords: data science, Jupyter, EDA, feature engineering, ML, pandas, notebook
 ---

-# ClickZetta Lakehouse
+# ClickZetta Lakehouse Data Science Workflow

-##
+## Workflow Overview

 ```
-
-
-
+Environment Setup → Jupyter Config → Project Structure → Data Discovery → Data Quality → Data Cleaning
+↓
+Model Inference ← Feature Engineering ← EDA ← Dataset Build
 ```

 ---

-##
+## Hard Prerequisite

-**Python 3.10
+**Python 3.10+** (required by ZettaPark). If the user's environment is 3.9 or lower, provide an upgrade path before continuing:

 ```bash
 brew install pyenv && pyenv install 3.12.9 && pyenv local 3.12.9
 python -m venv .venv && source .venv/bin/activate
 ```

-
+See [references/setup.md](references/setup.md) for detailed setup steps.

 ---

-##
+## Project Structure

 ```
 my-ds-project/
-├── notebooks/ # 00-env-check.ipynb
+├── notebooks/ # 00-env-check.ipynb must be first
 │ ├── 00-env-check.ipynb
 │ ├── 01-data-discovery.ipynb
 │ ├── 02-data-quality.ipynb
@@ -52,74 +53,74 @@ my-ds-project/
 │ ├── 04-feature-engineering.ipynb
 │ └── 05-modeling.ipynb
 ├── src/
-│ ├── config.py #
+│ ├── config.py # connection config, see references/setup.md
 │ ├── data/
 │ └── features/
 ├── sql/
-├── data/ #
-├── models/ #
-├── .env #
-└── .env.example #
+├── data/ # all in .gitignore
+├── models/ # all in .gitignore
+├── .env # never commit to git
+└── .env.example # commit to git
 ```

-
+Environment variable naming: `CLICKZETTA_SERVICE` / `CLICKZETTA_INSTANCE` / `CLICKZETTA_WORKSPACE` / `CLICKZETTA_USERNAME` / `CLICKZETTA_PASSWORD` / `CLICKZETTA_VCLUSTER` / `CLICKZETTA_SCHEMA`.

 ---

-##
+## Data Write Rules

-|
+| Method | Verdict |
 |------|------|
-| `session.create_dataframe(df).write.save_as_table()` | ✅
-| `cursor`
-| `df.to_sql(conn, ...)` | ❌
-| SQLAlchemy `clickzetta://...` | ❌
+| `session.create_dataframe(df).write.save_as_table()` | ✅ Recommended |
+| `cursor` batch INSERT (500 rows per batch) | ✅ Fallback when Python 3.9 / ZettaPark unavailable |
+| `df.to_sql(conn, ...)` | ❌ Forbidden — raises `'list' object has no attribute 'keys'` |
+| SQLAlchemy `clickzetta://...` | ❌ Forbidden — dialect is unreliable |

-
+See [references/write-and-infer.md](references/write-and-infer.md) for code templates.

 ---
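The fallback row in the Data Write Rules table names a `cursor` batch INSERT with 500 rows per batch. A minimal sketch of that batching pattern follows, using the standard-library `sqlite3` module as a stand-in for a DB-API cursor. The helper names `batched` and `insert_in_batches` are hypothetical, and the `?` placeholder style is sqlite3's; the ClickZetta connector is not used here and its parameter style may differ.

```python
import sqlite3
from itertools import islice

BATCH_SIZE = 500  # batch size named in the Data Write Rules table


def batched(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def insert_in_batches(cursor, table, columns, rows, size=BATCH_SIZE):
    """Issue one executemany() per batch and return the number of batches."""
    placeholders = ", ".join("?" for _ in columns)  # qmark style (sqlite3)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    batches = 0
    for chunk in batched(rows, size):
        cursor.executemany(sql, chunk)
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")  # stand-in for a real Lakehouse connection
cur = conn.cursor()
cur.execute("CREATE TABLE events (id INTEGER, label TEXT)")
n_batches = insert_in_batches(
    cur, "events", ["id", "label"], ((i, f"row-{i}") for i in range(1200))
)
conn.commit()
row_count = cur.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

With 1200 rows and a batch size of 500, the loop issues three `executemany()` calls (500, 500, and 200 rows), which keeps each statement bounded without loading the whole input into memory.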

-##
+## Data Viewing Rules

--
--
+- Use `.show()` for quick inspection; avoid `.to_pandas()` when pandas is not needed
+- Always add `TABLESAMPLE ROW(10)` when working with large tables to prevent OOM

 ---

-##
+## Data Validation Rules

-
+After loading data, **immediately validate statistics against known baseline values** before proceeding with analysis.

-
+Common pitfall: raw athlete/user-level data where each participant in a team event has one row — a direct `SUM` will double-count. Correct approach: `SELECT DISTINCT event, medal, ...` to deduplicate first, then aggregate.

 ---
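The double-counting pitfall described under Data Validation Rules can be reproduced on a tiny table. The sketch below uses the standard-library `sqlite3` module and made-up rows (table name, athletes, and country are all illustrative) and uses `COUNT` rather than `SUM`; the same reasoning applies to any aggregate over participant-level rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (athlete TEXT, country TEXT, event TEXT, medal TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [
        # One relay gold medal, recorded once per team member.
        ("A", "X", "4x100m relay", "Gold"),
        ("B", "X", "4x100m relay", "Gold"),
        ("C", "X", "4x100m relay", "Gold"),
        ("D", "X", "4x100m relay", "Gold"),
        # One individual gold medal.
        ("E", "X", "100m", "Gold"),
    ],
)

# Naive aggregate over athlete-level rows: counts the relay four times.
naive = conn.execute(
    "SELECT COUNT(*) FROM results WHERE country = 'X' AND medal = 'Gold'"
).fetchone()[0]

# Deduplicate to one row per (country, event, medal) first, then aggregate.
deduped = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT country, event, medal FROM results) "
    "WHERE country = 'X' AND medal = 'Gold'"
).fetchone()[0]
```

The naive query reports 5 gold medals while the deduplicated query reports 2, which is the kind of gap a check against known baseline values would surface.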

-## ClickZetta SQL
+## Unsupported ClickZetta SQL Syntax

-|
+| Not Supported | Alternative |
 |--------|---------|
-| `CREATE OR REPLACE TABLE` | `CREATE TABLE IF NOT EXISTS
-| `ARRAY_AGG(col IGNORE NULLS)` | `MAX(col)`
-| `QUALIFY`
-| `UNION` / `INTERSECT` / `EXCEPT` | JOIN +
-| `BEGIN; COMMIT; ROLLBACK;` |
+| `CREATE OR REPLACE TABLE` | `CREATE TABLE IF NOT EXISTS` (regular tables don't support OR REPLACE) |
+| `ARRAY_AGG(col IGNORE NULLS)` | `MAX(col)` or `COALESCE()` |
+| `QUALIFY` clause | Subquery + `WHERE rn = 1` |
+| `UNION` / `INTERSECT` / `EXCEPT` | JOIN + application-layer merge |
+| `BEGIN; COMMIT; ROLLBACK;` | Use MERGE for atomic operations |
 | `NOW()` | `CURRENT_TIMESTAMP()` |

-
+For other syntax errors, load the `clickzetta-sql-migration` skill to see Snowflake/Databricks/Spark vs. ClickZetta syntax differences.

 ---
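The table lists "Subquery + `WHERE rn = 1`" as the replacement for a `QUALIFY` clause. The sketch below shows that rewrite with `ROW_NUMBER()` in a subquery, run against the standard-library `sqlite3` module as a stand-in engine. The `orders` table and its rows are made up, the bundled SQLite is assumed to be 3.25 or newer (window function support), and ClickZetta's own dialect is not exercised here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, order_id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10, "2024-01-01"),
        (1, 11, "2024-02-01"),
        (2, 20, "2024-01-15"),
    ],
)

# Equivalent of "... QUALIFY ROW_NUMBER() OVER (...) = 1":
# rank rows inside a subquery, then filter on the rank in the outer query.
latest = conn.execute(
    """
    SELECT user_id, order_id
    FROM (
        SELECT user_id, order_id,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
        FROM orders
    ) AS ranked
    WHERE rn = 1
    ORDER BY user_id
    """
).fetchall()
```

Each user keeps only the most recent order, so user 1 maps to order 11 and user 2 to order 20.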

-## Schema
+## Schema Context

-
+Always use fully qualified table names (`schema.table`) in SQL within Python code — do not rely on the current schema context.

 ---

-##
+## References

-- [
-- [
-- [
+- [Environment Setup & Project Config](references/setup.md) — environment setup, config.py template, Jupyter configuration
+- [Data Discovery / Quality / Cleaning / EDA Examples](references/data-patterns.md)
+- [Data Write / Feature Engineering / Model Inference Examples](references/write-and-infer.md)
 - [ZettaPark API](references/zettapark-api.md)
-- [
-- [BITMAP
+- [Statistical Analysis Functions](references/stats-functions.md)
+- [BITMAP User Profiling](references/bitmap-profile.md)