@clickzetta/cz-cli-darwin-arm64 0.3.92 → 0.3.94

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (69)
  1. package/bin/cz-cli +0 -0
  2. package/bin/skills/clickzetta-ai-function/SKILL.md +109 -0
  3. package/bin/skills/clickzetta-ai-function/eval_cases.jsonl +4 -0
  4. package/bin/skills/clickzetta-ai-function/references/ai-function-ddl.md +106 -0
  5. package/bin/skills/clickzetta-batch-sync-pipeline/SKILL.md +124 -124
  6. package/bin/skills/clickzetta-batch-sync-pipeline/eval_cases.jsonl +5 -5
  7. package/bin/skills/clickzetta-bi-connect/SKILL.md +79 -78
  8. package/bin/skills/clickzetta-bi-connect/references/bi-tools.md +56 -56
  9. package/bin/skills/clickzetta-cdc-sync-pipeline/SKILL.md +386 -382
  10. package/bin/skills/clickzetta-cdc-sync-pipeline/eval_cases.jsonl +5 -5
  11. package/bin/skills/clickzetta-data-ingest-pipeline/SKILL.md +73 -212
  12. package/bin/skills/clickzetta-data-science/SKILL.md +57 -56
  13. package/bin/skills/clickzetta-data-science/references/bitmap-profile.md +38 -38
  14. package/bin/skills/clickzetta-data-science/references/data-patterns.md +16 -16
  15. package/bin/skills/clickzetta-data-science/references/setup.md +28 -28
  16. package/bin/skills/clickzetta-data-science/references/stats-functions.md +44 -44
  17. package/bin/skills/clickzetta-data-science/references/write-and-infer.md +22 -22
  18. package/bin/skills/clickzetta-data-science/references/zettapark-api.md +32 -32
  19. package/bin/skills/clickzetta-dw-modeling/SKILL.md +1 -1
  20. package/bin/skills/clickzetta-external-function/SKILL.md +51 -109
  21. package/bin/skills/clickzetta-external-function/eval_cases.jsonl +4 -4
  22. package/bin/skills/clickzetta-external-function/references/external-function-ddl.md +39 -77
  23. package/bin/skills/clickzetta-java-sdk/SKILL.md +49 -48
  24. package/bin/skills/clickzetta-java-sdk/eval_cases.jsonl +12 -12
  25. package/bin/skills/clickzetta-java-sdk/references/bulkload.md +34 -34
  26. package/bin/skills/clickzetta-java-sdk/references/realtime.md +44 -44
  27. package/bin/skills/clickzetta-kafka-ingest-pipeline/SKILL.md +273 -507
  28. package/bin/skills/clickzetta-kafka-ingest-pipeline/references/kafka-pipe-syntax.md +197 -231
  29. package/bin/skills/clickzetta-oss-ingest-pipeline/SKILL.md +231 -304
  30. package/bin/skills/clickzetta-realtime-sync-pipeline/SKILL.md +180 -179
  31. package/bin/skills/clickzetta-realtime-sync-pipeline/eval_cases.jsonl +5 -5
  32. package/bin/skills/clickzetta-semantic-view/SKILL.md +74 -72
  33. package/bin/skills/clickzetta-semantic-view/eval_cases.jsonl +12 -12
  34. package/bin/skills/clickzetta-semantic-view/references/semantic-view-reference.md +75 -75
  35. package/bin/skills/clickzetta-sql-migration/SKILL.md +128 -0
  36. package/bin/skills/clickzetta-sql-migration/eval_cases.jsonl +10 -0
  37. package/bin/skills/clickzetta-sql-migration/references/ddl-reference.md +350 -0
  38. package/bin/skills/clickzetta-sql-migration/references/dml-differences.md +192 -0
  39. package/bin/skills/clickzetta-sql-migration/references/dml-reference.md +279 -0
  40. package/bin/skills/{clickzetta-sql-syntax-guide → clickzetta-sql-migration}/references/dql-reference.md +128 -128
  41. package/bin/skills/clickzetta-sql-migration/references/function-mapping.md +194 -0
  42. package/bin/skills/clickzetta-sql-migration/references/functions-reference.md +372 -0
  43. package/bin/skills/clickzetta-sql-migration/references/implicit-type-conversion.md +143 -0
  44. package/bin/skills/clickzetta-sql-migration/references/migration-databricks.md +260 -0
  45. package/bin/skills/{clickzetta-sql-syntax-guide → clickzetta-sql-migration}/references/migration-snowflake.md +112 -112
  46. package/bin/skills/clickzetta-sql-migration/references/vs-snowflake.md +346 -0
  47. package/bin/skills/clickzetta-sql-migration/references/vs-spark.md +229 -0
  48. package/bin/skills/clickzetta-studio-task-manager/SKILL.md +326 -329
  49. package/bin/skills/clickzetta-table-lineage/SKILL.md +57 -55
  50. package/bin/skills/clickzetta-table-lineage/eval_cases.jsonl +1 -1
  51. package/bin/skills/clickzetta-table-lineage/references/normalize_func.sql +5 -5
  52. package/bin/skills/clickzetta-table-lineage/references/table_cost.sql +6 -6
  53. package/bin/skills/clickzetta-table-lineage/references/table_relation.sql +2 -2
  54. package/bin/skills/clickzetta-volume-manager/SKILL.md +186 -100
  55. package/bin/skills/clickzetta-volume-manager/references/volume-ddl.md +153 -52
  56. package/package.json +1 -1
  57. package/bin/skills/clickzetta-dynamic-table/best-practices/scheduling-guide.md +0 -135
  58. package/bin/skills/clickzetta-dynamic-table/dt-creator/references/dt-declaration-strategy.md +0 -185
  59. package/bin/skills/clickzetta-dynamic-table/dt-creator/references/refresh-history-guide.md +0 -260
  60. package/bin/skills/clickzetta-dynamic-table/dynamic-table-alter/SKILL.md +0 -191
  61. package/bin/skills/clickzetta-sql-syntax-guide/SKILL.md +0 -249
  62. package/bin/skills/clickzetta-sql-syntax-guide/eval_cases.jsonl +0 -3
  63. package/bin/skills/clickzetta-sql-syntax-guide/references/ddl-reference.md +0 -350
  64. package/bin/skills/clickzetta-sql-syntax-guide/references/dml-reference.md +0 -279
  65. package/bin/skills/clickzetta-sql-syntax-guide/references/functions-reference.md +0 -372
  66. package/bin/skills/clickzetta-sql-syntax-guide/references/migration-databricks.md +0 -260
  67. package/bin/skills/clickzetta-sql-syntax-guide/references/vs-snowflake.md +0 -346
  68. package/bin/skills/clickzetta-sql-syntax-guide/references/vs-spark.md +0 -229
  69. package/bin/skills/{clickzetta-sql-syntax-guide → clickzetta-sql-migration}/LICENSE +0 -0
@@ -1,5 +1,5 @@
1
- {"case_id":"001","type":"should_call","user_input":"怎么把 MySQL 整库实时同步到 Lakehouse","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["整库镜像","Binlog","CDC"]}
2
- {"case_id":"002","type":"should_call","user_input":"分库分表的数据怎么合并同步到 Lakehouse 一张表?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["多表合并","分库分表"]}
3
- {"case_id":"003","type":"should_call","user_input":"CDC 同步任务 Binlog 位点过期了怎么办?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["Binlog"]}
4
- {"case_id":"004","type":"should_call","user_input":"多表实时同步怎么配置告警?支持飞书通知吗?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["告警","webhook"]}
5
- {"case_id":"005","type":"should_call","user_input":"PostgreSQL 整库 CDC 同步需要源端做什么准备?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["WAL","PostgreSQL"]}
1
+ {"case_id":"001","type":"should_call","user_input":"How do I sync an entire MySQL database to Lakehouse in real time?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["full database mirror","Binlog","CDC"]}
2
+ {"case_id":"002","type":"should_call","user_input":"How do I merge sharded tables into a single Lakehouse table during sync?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["sharded table merge","merge"]}
3
+ {"case_id":"003","type":"should_call","user_input":"The CDC sync task has a Binlog position expired error, what should I do?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["Binlog"]}
4
+ {"case_id":"004","type":"should_call","user_input":"How do I configure alerts for multi-table real-time sync? Does it support Feishu notifications?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["alert","webhook"]}
5
+ {"case_id":"005","type":"should_call","user_input":"What source database preparation is needed for PostgreSQL full database CDC sync?","expected_skill":"clickzetta-cdc-sync-pipeline","expected_output_contains":["WAL","PostgreSQL"]}
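The five rewritten cases in this hunk share one record shape: `case_id`, `type`, `user_input`, `expected_skill`, and `expected_output_contains`. The sketch below shows one way to sanity-check such a JSONL file after a bulk translation like this one. The function name `validate_eval_cases` and the specific checks are assumptions made for illustration; nothing here is taken from cz-cli or from how the package actually evaluates skills.

```python
import json

# Keys observed in every eval case shown in the hunk above.
REQUIRED_KEYS = {
    "case_id",
    "type",
    "user_input",
    "expected_skill",
    "expected_output_contains",
}


def validate_eval_cases(jsonl_text: str) -> list[str]:
    """Return a list of problems found; an empty list means no issue was detected."""
    problems = []
    seen_ids = set()
    for lineno, raw in enumerate(jsonl_text.splitlines(), start=1):
        if not raw.strip():
            continue  # tolerate blank lines
        try:
            case = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(case, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"line {lineno}: missing keys {sorted(missing)}")
            continue
        if case["case_id"] in seen_ids:
            problems.append(f"line {lineno}: duplicate case_id {case['case_id']}")
        seen_ids.add(case["case_id"])
        expected = case["expected_output_contains"]
        if not isinstance(expected, list) or not expected:
            problems.append(f"line {lineno}: expected_output_contains must be a non-empty list")
    return problems


# One case copied from the hunk above.
SAMPLE = (
    '{"case_id":"003","type":"should_call",'
    '"user_input":"The CDC sync task has a Binlog position expired error, what should I do?",'
    '"expected_skill":"clickzetta-cdc-sync-pipeline",'
    '"expected_output_contains":["Binlog"]}'
)

if __name__ == "__main__":
    print(validate_eval_cases(SAMPLE))
```

A check like this catches structural breakage only. It cannot tell whether a translated expectation such as `"full database mirror"` still matches what the translated skill outputs.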
@@ -1,237 +1,98 @@
1
1
  ---
2
2
  name: clickzetta-data-ingest-pipeline
3
3
  description: |
4
- ClickZetta Lakehouse 数据导入总览与路由。根据用户的数据源类型、实时性要求、数据量等条件,
5
- 推荐最合适的数据导入方式,并引导到对应的专项 Skill 或直接执行简单导入操作。
6
- 当用户说"导入数据到 Lakehouse"、"数据入仓"、"数据入湖"、"怎么把数据导进来"、
7
- "数据采集"、"数据加载"、"ingest data"、"load data"、"数据导入方案选择"时触发。
8
- Keywords: data ingestion, import, routing, pipeline selection, data source
4
+ Router skill: selects the best data ingestion method for ClickZetta Lakehouse based on
5
+ data source, latency, sync scope, and continuity requirements. Routes to specialized skills:
6
+ Kafka Pipe, OSS/S3/COS Pipe, Studio batch sync, Studio real-time sync, Studio CDC,
7
+ file/URL import, or SDK ingestion.
8
+ Trigger: user wants to get external data INTO Lakehouse but hasn't chosen a method, OR
9
+ asks which ingestion approach fits their scenario.
10
+ Covers: Kafka, MySQL, PostgreSQL, SQL Server, TiDB, OSS, S3, COS, local files, URLs,
11
+ Java SDK, Python/ZettaPark. Does NOT cover: querying data, pipeline monitoring/diagnosis,
12
+ Dynamic Table tuning, internal table transformations, or data export.
13
+ Keywords: data ingestion, import, sync, ETL, migrate, load data, ingest, pipeline selection
9
14
  ---
10
15
 
11
- # Lakehouse 数据导入总览与路由
16
+ # Lakehouse Data Ingestion Router
12
17
 
13
- 根据用户的数据源、实时性需求、数据规模等条件,推荐最合适的数据导入方式,
14
- 并路由到对应的专项 Pipeline Skill 或直接执行简单导入操作。
18
+ Routes users to the correct ingestion skill based on data source, latency requirements, sync scope, and whether continuous sync is needed. Scoped to **external data → Lakehouse** only.
15
19
 
16
- ## 适用场景
20
+ ## Applicable Scenarios
17
21
 
18
- - 用户想把数据导入 ClickZetta Lakehouse,但不确定用哪种方式
19
- - 用户描述了数据源(Kafka、MySQL、OSS、文件等),需要推荐导入方案
20
- - 用户需要了解各种导入方式的适用场景和差异
21
- - 关键词:数据导入、数据入仓、数据入湖、数据采集、数据加载、pipeline 选择
22
+ - User wants to import data into ClickZetta Lakehouse but is unsure which method to use
23
+ - User describes a data source and needs an ingestion recommendation
24
+ - User asks about differences between ingestion methods
22
25
 
23
- ## 前置依赖
26
+ ## Decision Matrix
24
27
 
25
- - ClickZetta Lakehouse 账户,具备创建工作空间、Schema、表、PIPE、任务等权限
26
- - **执行环境**:已安装并配置 cz-cli
28
+ ### Step 1: Collect Requirements
27
29
 
28
- ## 执行环境
30
+ 1. **Data source type**: Kafka / Object storage (OSS/S3/COS) / Relational DB (MySQL/PostgreSQL/SQL Server/TiDB) / Local file / URL / Java SDK / Python
31
+ 2. **Latency**: Real-time (seconds) / Near real-time (minutes) / Batch (hours/days)
32
+ 3. **Sync scope**: Single table / Multiple tables / Entire database
33
+ 4. **Continuity**: One-time import / Continuous incremental sync
34
+ 5. **CDC needed**: Yes / No
29
35
 
30
- 所有操作通过 `cz-cli` 执行:
36
+ ### Step 2: Route
31
37
 
32
- ```bash
33
- cz-cli --version
34
- ```
35
-
36
- 需要 cz-cli,请参考官方文档安装并完成配置后重试。
37
-
38
- ## 数据导入方式决策树
39
-
40
- ### 步骤 1:确认数据源类型和需求
41
-
42
- 向用户收集以下信息:
43
-
44
- 1. **数据源类型**:Kafka / 对象存储(OSS/S3/COS) / 关系型数据库(MySQL/PostgreSQL/SQL Server) / 本地文件 / URL/Web 文件 / Java SDK / ZettaPark
45
- 2. **实时性要求**:实时(秒级延迟)/ 准实时(分钟级)/ 离线批量(小时/天级)
46
- 3. **同步范围**:单表 / 多表 / 整库
47
- 4. **是否需要持续同步**:一次性导入 / 持续增量同步
48
- 5. **是否需要 CDC(变更数据捕获)**:是 / 否
49
-
50
- ### 步骤 2:根据决策矩阵推荐方案
51
-
52
- | 数据源 | 实时性 | 同步范围 | 推荐方式 | 对应 Skill |
53
- |--------|--------|---------|---------|-----------|
54
- | Kafka | 实时/准实时 | 单 topic | Kafka PIPE 持续导入(SQL) | `clickzetta-kafka-ingest-pipeline` |
55
- | Kafka | 实时 | 多 topic | Studio 实时同步 | `clickzetta-realtime-sync-pipeline` |
56
- | 对象存储 (OSS/S3/COS) | 准实时/批量 | 文件持续到达 | PIPE 持续导入 | `clickzetta-oss-ingest-pipeline` |
57
- | 对象存储 | 一次性 | 批量文件 | COPY INTO 命令 | `clickzetta-file-import-pipeline`(COPY INTO 部分) |
58
- | MySQL/PostgreSQL/SQL Server | 实时 CDC | 单表 | Studio 实时同步 | `clickzetta-realtime-sync-pipeline` |
59
- | MySQL/PostgreSQL/SQL Server | 实时 CDC | 多表/整库 | Studio 多表实时同步 | `clickzetta-cdc-sync-pipeline` |
60
- | MySQL/PostgreSQL/SQL Server | 离线批量 | 单表 | Studio 离线同步 | `clickzetta-batch-sync-pipeline` |
61
- | MySQL/PostgreSQL/SQL Server | 离线批量 | 多表 | Studio 多表离线同步 | `clickzetta-batch-sync-pipeline` |
62
- | 本地文件 / URL | 一次性 | 单文件/多文件 | URL 下载 + COPY INTO | `clickzetta-file-import-pipeline` |
63
- | 流式增量计算 | 准实时 | 表变更驱动 | Dynamic Table + Stream | `clickzetta-incremental-compute-pipeline` |
64
- | Java 应用 | 实时/批量 | 程序写入 | Java SDK | (见下方 SDK 导入指引) |
65
- | Python/ZettaPark | 批量 | DataFrame | ZettaPark save_as_table | (见下方 SDK 导入指引) |
66
-
67
- ### 步骤 3:路由到专项 Skill 或直接执行
68
-
69
- 根据推荐方案,执行以下路由逻辑:
70
-
71
- **有对应专项 Skill 的场景** → 告知用户推荐方案,引导使用对应 Skill:
72
- - `clickzetta-kafka-ingest-pipeline`:Kafka PIPE 管道搭建
73
- - `clickzetta-oss-ingest-pipeline`:对象存储 PIPE 管道搭建
74
- - `clickzetta-batch-sync-pipeline`:Studio 离线同步任务
75
- - `clickzetta-realtime-sync-pipeline`:Studio 实时同步任务
76
- - `clickzetta-cdc-sync-pipeline`:Studio 多表实时同步(CDC)
77
- - `clickzetta-incremental-compute-pipeline`:Dynamic Table + Stream 增量计算管道
78
- - `clickzetta-file-import-pipeline`:URL/文件下载导入
79
- - `clickzetta-table-stream-pipeline`:Table Stream 变更数据捕获
80
-
81
- **无专项 Skill 的简单场景** → 直接执行:
82
-
83
- #### SQL INSERT 导入(小数据量)
84
- ```sql
85
- INSERT INTO schema_name.table_name (col1, col2, col3)
86
- VALUES ('val1', 'val2', 'val3');
87
- ```
88
-
89
- #### COPY INTO 快速导入(从 Volume)
90
- ```sql
91
- -- 1. 确认 Volume 中有文件
92
- SHOW VOLUME DIRECTORY volume_name;
93
-
94
- -- 2. 执行 COPY INTO
95
- COPY INTO schema_name.table_name
96
- FROM VOLUME volume_name
97
- USING CSV
98
- OPTIONS('header' = 'true');
99
- ```
100
-
101
- #### Java SDK 导入指引
102
- 提供 Java SDK 的关键配置信息:
103
- - Maven 依赖坐标
104
- - 连接配置(endpoint、workspace、schema、vcluster)
105
- - 批量写入 API:`BulkloadWriter`
106
- - 实时写入 API:`RealtimeWriter`
107
- - 建议用户参考官方文档:`comprehensive_guide_to_ingesting_javasdk_buckload_realtime`
108
-
109
- #### ZettaPark (Python) 导入指引
110
- - `INSERT` 方式:`session.sql("INSERT INTO ...")`
111
- - `save_as_table` 方式:`df.write.save_as_table("table_name")`
112
- - 建议用户参考官方文档:`comprehensive_guide_to_ingesting_zettapark_save_as_table`
113
-
114
- ## 异构数据源类型映射(ODS 层建表必读)
115
-
116
- 从关系型数据库(MySQL/PostgreSQL 等)同步数据时,ODS 层建表的字段类型选择直接影响同步成功率。
117
-
118
- **核心原则:ODS 层使用宽泛类型,DWD 层再做精确转换。**
119
-
120
- ### MySQL → Lakehouse 类型映射
121
-
122
- | MySQL 类型 | ❌ ODS 层不要用 | ✅ ODS 层用 | DWD 层转换 |
123
- |---|---|---|---|
124
- | `BIT(1)` | `BOOLEAN` | `TINYINT` | `CAST(col AS BOOLEAN)` |
125
- | `TINYINT(1)` | `BOOLEAN` | `TINYINT` | `CAST(col AS BOOLEAN)` |
126
- | `DATETIME` | `DATETIME` | `TIMESTAMP` | 直接用 |
127
- | `ENUM('a','b')` | `ENUM` | `STRING` | 直接用 |
128
- | `TEXT` / `LONGTEXT` / `MEDIUMTEXT` | `TEXT` | `STRING` | 直接用 |
129
- | `DECIMAL(p,s)` | `FLOAT` / `DOUBLE` | `DECIMAL(p,s)` | 直接用 |
130
- | `JSON` | `JSON` | `STRING` | `parse_json(col)['field']` |
131
- | `SET('a','b')` | `SET` | `STRING` | 直接用 |
132
-
133
- > ⚠️ **最常见踩坑**:`BIT(1)` 映射为 `BOOLEAN` 会导致同步任务失败。ODS 层改为 `TINYINT`,同步成功后在 DWD 层用 `CAST(col AS BOOLEAN)` 转换。
134
-
135
- ### PostgreSQL → Lakehouse 类型映射
136
-
137
- | PostgreSQL 类型 | ❌ ODS 层不要用 | ✅ ODS 层用 |
138
- |---|---|---|
139
- | `BOOLEAN` | `BOOLEAN` | `TINYINT` |
140
- | `SERIAL` / `BIGSERIAL` | `SERIAL` | `BIGINT` |
141
- | `JSONB` / `JSON` | `JSON` | `STRING` |
142
- | `ARRAY` | `ARRAY` | `STRING`(JSON 序列化) |
143
- | `UUID` | `UUID` | `STRING` |
144
- | `NUMERIC(p,s)` | `FLOAT` | `DECIMAL(p,s)` |
145
-
146
- ## 数据入仓 vs 数据入湖| 维度 | 数据入仓 | 数据入湖 |
147
- |------|---------|---------|
148
- | 目标 | Lakehouse 托管表 | 用户 Volume(对象存储) |
149
- | 格式 | 自动转为内部列式格式 | 保持原始文件格式 |
150
- | 查询性能 | 高(列式存储 + 索引) | 较低(需扫描原始文件) |
151
- | 适用场景 | 分析查询、BI 报表、数据仓库 | 数据暂存、原始数据归档、跨系统共享 |
152
- | 常用方式 | Studio 同步、PIPE、COPY INTO、SDK | PUT 文件、Python 脚本上传 |
38
+ | Data Source | Latency | Scope | Route To |
39
+ |-------------|---------|-------|----------|
40
+ | Kafka | Real-time/Near real-time | Single topic (SQL Pipe) | `clickzetta-kafka-ingest-pipeline` |
41
+ | Kafka | Real-time | Single topic (Studio task) | `clickzetta-realtime-sync-pipeline` |
42
+ | OSS/S3/COS | Near real-time | Files arriving continuously | `clickzetta-oss-ingest-pipeline` |
43
+ | OSS/S3/COS | One-time batch | Bulk files from bucket | `clickzetta-oss-ingest-pipeline` (batch mode) |
44
+ | Local file / URL | One-time | Single/multiple files | `clickzetta-file-import-pipeline` |
45
+ | MySQL/PG/SQL Server/TiDB | Real-time CDC | Single table | `clickzetta-realtime-sync-pipeline` |
46
+ | MySQL/PG/SQL Server/TiDB | Real-time CDC | Multiple tables / Entire DB | `clickzetta-cdc-sync-pipeline` |
47
+ | MySQL/PG/SQL Server | Offline batch | Single table | `clickzetta-batch-sync-pipeline` |
48
+ | MySQL/PG/SQL Server | Offline batch | Multiple tables / Entire DB | `clickzetta-batch-sync-pipeline` |
49
+ | Java application | Real-time/Batch | Programmatic write | `clickzetta-java-sdk` |
50
+ | Python/ZettaPark | Batch | DataFrame write | `clickzetta-zettapark` / `clickzetta-app-python-sdk` |
51
+ | Small data (manual) | One-time | Few rows | Direct SQL INSERT (see below) |
153
52
 
154
- ## 示例
53
+ ### How to Choose Between Similar Options
155
54
 
156
- ### 示例 1:用户不确定导入方式
55
+ | Scenario | Option A | Option B | Differentiator |
56
+ |----------|----------|----------|----------------|
57
+ | Kafka → Lakehouse | `clickzetta-kafka-ingest-pipeline` | `clickzetta-realtime-sync-pipeline` | Pipe: SQL-native, flexible transforms, no Studio. Studio task: UI-based JSONPath config. |
58
+ | OSS one-time import | `clickzetta-oss-ingest-pipeline` (batch) | `clickzetta-file-import-pipeline` | OSS skill: creates Connection + External Volume. File skill: uses existing/User Volume. |
59
+ | Single-table CDC | `clickzetta-realtime-sync-pipeline` | `clickzetta-cdc-sync-pipeline` | Realtime-sync: simpler for single table. CDC: more operational depth, alerting, repair. |
60
+ | Batch single vs multi | Same skill | Same skill | `clickzetta-batch-sync-pipeline` handles both; multi-table also supports sharded-table merge. |
157
61
 
158
- 用户说:"我有一个 MySQL 数据库,想把里面的订单表实时同步到 Lakehouse"
62
+ ### Key Constraints
159
63
 
160
- 路由逻辑:
161
- 1. 数据源:MySQL(关系型数据库)
162
- 2. 实时性:实时
163
- 3. 同步范围:单表
164
- 4. 需要 CDC:是(实时同步意味着需要捕获变更)
165
- 推荐:Studio 实时同步
166
- 路由到 `clickzetta-realtime-sync-pipeline` Skill
64
+ | Skill | Constraints |
65
+ |-------|-------------|
66
+ | `clickzetta-kafka-ingest-pipeline` | PLAINTEXT/SASL_PLAINTEXT only (no SSL/mTLS); GP VCluster |
67
+ | `clickzetta-realtime-sync-pipeline` | Sync VCluster required; Studio datasource config; no test mode before deployment |
68
+ | `clickzetta-cdc-sync-pipeline` | Source tables must have PKs; Studio datasource (not SQL Connection); Sync VCluster |
69
+ | `clickzetta-batch-sync-pipeline` | Studio datasource; Sync VCluster; field mapping configured in Studio UI |
70
+ | `clickzetta-oss-ingest-pipeline` | Dedicated Volume per Pipe (no sharing); same cloud provider as Lakehouse instance |
167
71
 
168
- ### 示例 2:多种数据源混合场景
72
+ ## Direct Execution for Simple Scenarios
169
73
 
170
- 用户说:"我们有 Kafka 的用户行为日志,还有 MySQL 的业务数据,都要导入 Lakehouse"
74
+ For trivial imports that don't need a specialized skill:
171
75
 
172
- 路由逻辑:
173
- 1. Kafka 用户行为日志 → `clickzetta-kafka-ingest-pipeline`(PIPE 持续导入)
174
- 2. MySQL 业务数据 → 确认实时性需求:
175
- - 实时 → `clickzetta-realtime-sync-pipeline` 或 `clickzetta-cdc-sync-pipeline`
176
- - 离线 → `clickzetta-batch-sync-pipeline`
177
- → 分别引导到对应 Skill
178
-
179
- ### 示例 3:简单的一次性文件导入
180
-
181
- 用户说:"我有一个 CSV 文件要导入"
182
-
183
- 路由逻辑:
184
- 1. 数据源:本地文件
185
- 2. 一次性导入
186
- → 路由到 `clickzetta-file-import-pipeline` Skill(支持文件上传 + COPY INTO)
187
-
188
- ## 错误处理
189
-
190
- | 场景 | 处理方式 |
191
- |------|---------|
192
- | 用户无法确定数据源类型 | 询问数据当前存储位置(哪个系统/服务),帮助判断 |
193
- | 用户需求跨多种导入方式 | 拆分为多个独立的导入任务,分别路由到对应 Skill |
194
- | 推荐的 Skill 尚未创建 | 提供该导入方式的基本步骤和关键 SQL/API,引导用户参考官方文档 |
195
- | 用户的云环境不支持某种连接 | 执行 `cz-cli sql "SHOW CONNECTIONS" --sync` 检查可用连接类型,推荐替代方案 |
196
- | 数据量极大(TB 级) | 建议分批导入,优先使用 PIPE 或 Studio 同步任务(支持断点续传) |
197
-
198
- ## 注意事项
199
-
200
- - 本 Skill 是路由入口,不直接执行复杂的 pipeline 搭建,而是引导到专项 Skill
201
- - 对于简单场景(SQL INSERT、单次 COPY INTO),可以直接在本 Skill 中完成
202
- - 推荐方案时需考虑用户的云环境(阿里云/腾讯云/AWS),不同环境支持的连接类型可能不同
203
- - 执行 `cz-cli sql "SHOW VCLUSTERS" --sync` 确认可用的虚拟集群,同步任务需要 SYNC 类型的 VCluster
204
- - 数据入仓是最常见的场景,数据入湖主要用于原始数据暂存或跨系统共享
205
-
206
- ---
207
-
208
- ## cz-cli 替代路径
209
-
210
- > 仅在 cz-cli 可用且 MCP 不可用时使用本节。
211
- > 本 Skill 是路由入口,cz-cli 路径的核心逻辑在各专项 Skill 的"cz-cli 替代路径"章节中。
212
-
213
- ### 路由说明
214
-
215
- 当 MCP 不可用时,各专项 Skill 均已提供 cz-cli 替代路径:
76
+ ```bash
77
+ # SQL INSERT (small data volume)
78
+ cz-cli sql "INSERT INTO schema_name.table_name (col1, col2, col3) VALUES ('val1', 'val2', 'val3')" --sync
216
79
 
217
- | 数据源 | 推荐方式 | 对应 Skill 的 cz-cli 路径 |
218
- |--------|---------|--------------------------|
219
- | Kafka | PIPE 持续导入 | `clickzetta-kafka-ingest-pipeline` → cz-cli 替代路径 |
220
- | 对象存储 (OSS/S3/COS) | PIPE 持续导入 | `clickzetta-oss-ingest-pipeline` → cz-cli 替代路径 |
221
- | MySQL/PostgreSQL/SQL Server(实时单表) | Studio 实时同步 | `clickzetta-realtime-sync-pipeline` → cz-cli 替代路径 |
222
- | MySQL/PostgreSQL/SQL Server(实时多表/整库) | Studio 多表实时同步 | `clickzetta-cdc-sync-pipeline` → cz-cli 替代路径 |
223
- | MySQL/PostgreSQL/SQL Server(离线批量) | Studio 离线同步 | `clickzetta-batch-sync-pipeline` → cz-cli 替代路径 |
80
+ # COPY INTO from Volume
81
+ cz-cli sql "COPY INTO schema_name.table_name FROM VOLUME volume_name USING CSV OPTIONS('header'='true') FILES('data.csv')" --sync
82
+ ```
224
83
 
225
- ### 简单场景直接执行(cz-cli 版)
84
+ ## Error Handling
226
85
 
227
- 对于无需专项 Skill 的简单场景,可直接用 cz-cli agent 完成:
86
+ | Scenario | Resolution |
87
+ |----------|-----------|
88
+ | User cannot identify data source type | Ask where data currently lives (which system/service) |
89
+ | Multiple data sources | Split into separate tasks, route each independently |
90
+ | Cloud environment doesn't support a connection | `cz-cli sql "SHOW CONNECTIONS" --sync` to check available types |
91
+ | TB-scale data volume | Prefer PIPE or Studio sync tasks (support checkpoint resume) |
92
+ | No Sync VCluster available | Studio sync tasks (batch/realtime/CDC) require Sync VCluster — check with `cz-cli sql "SHOW VCLUSTERS" --sync` |
228
93
 
229
- ```bash
230
- # SQL INSERT 导入(小数据量)
231
- cz-cli agent run "向表 <schema_name>.<table_name> 插入数据:<col1>=<val1>, <col2>=<val2>" \
232
- --format a2a --dangerously-skip-permissions
94
+ ## Notes
233
95
 
234
- # COPY INTO 快速导入(从 Volume)
235
- cz-cli agent run "从 Volume <volume_name> CSV 格式(有 header)将数据导入表 <schema_name>.<table_name>" \
236
- --format a2a --dangerously-skip-permissions
237
- ```
96
+ - This skill is a routing entry point — it does not execute complex pipeline setup
97
+ - Data warehousing (into managed tables) is the default; data lake loading (into Volume/object storage) is for staging or cross-system sharing
98
+ - Consider user's cloud environment (Alibaba Cloud/Tencent Cloud/AWS) — cross-cloud ingestion is not supported
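The new decision matrix in this hunk reads as a lookup from (data source, latency, scope) to a skill name. The sketch below encodes a subset of those rows as a plain function to make the routing explicit. The function `route`, its parameter names, and the string values accepted for `latency` and `scope` are hypothetical; the skill itself is a Markdown prompt and ships no such code. The Python/ZettaPark and manual `INSERT` rows are left out for brevity.

```python
# Source groups as they appear in the matrix (lowercased for comparison).
RDBMS = {"mysql", "postgresql", "sql server", "tidb"}
OBJECT_STORES = {"oss", "s3", "cos"}


def route(source: str, latency: str = "batch", scope: str = "single",
          via_studio: bool = False) -> str:
    """Map a requirement triple to the skill named in the decision matrix.

    latency: "real-time", "near-real-time", or "batch" (hypothetical labels)
    scope:   "single", "multiple", or "entire-db" (hypothetical labels)
    """
    src = source.lower()
    if src == "kafka":
        # SQL Pipe by default; the Studio task is the UI-based alternative.
        if via_studio:
            return "clickzetta-realtime-sync-pipeline"
        return "clickzetta-kafka-ingest-pipeline"
    if src in OBJECT_STORES:
        # Continuous and one-time batch imports both route to the same skill.
        return "clickzetta-oss-ingest-pipeline"
    if src in {"local file", "url"}:
        return "clickzetta-file-import-pipeline"
    if src in RDBMS:
        if latency == "real-time":
            if scope == "single":
                return "clickzetta-realtime-sync-pipeline"
            return "clickzetta-cdc-sync-pipeline"
        if src == "tidb":
            # The matrix lists TiDB only in the real-time CDC rows.
            raise ValueError("no offline batch route listed for TiDB")
        return "clickzetta-batch-sync-pipeline"
    if src == "java":
        return "clickzetta-java-sdk"
    raise ValueError(f"no route listed for source: {source}")


if __name__ == "__main__":
    print(route("MySQL", latency="real-time", scope="entire-db"))
```

Writing the table this way also surfaces a gap the prose does not call out: TiDB appears in the real-time CDC rows but not in the offline batch rows, so the sketch raises an error for that combination instead of guessing a route.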
@@ -1,50 +1,51 @@
1
1
  ---
2
2
  name: clickzetta-data-science
3
3
  description: |
4
- 数据科学家使用 ClickZetta Lakehouse 的端到端工作流指南。按工作阶段组织:
5
- 开发环境准备(Python 3.10+ 检查/搭建)、Jupyter Notebook 配置与使用、
6
- 项目结构规范(Cookiecutter DS 标准)、数据发现、数据质量评估、
7
- 数据清洗与整合、数据集构建、EDA 探索分析、
8
- 特征工程(SQL + ZettaPark)、模型推理上线(BITMAP 用户画像/UDF 批量推理/向量检索)。
9
- 当用户说"数据科学"、"机器学习"、"特征工程"、"EDA"、"数据探索"、
10
- "ZettaPark 机器学习""Jupyter 连接 Lakehouse""notebook"、"ipynb"、
11
- "jupyter kernel""%%sql""magic command""pandas 读取数据"
12
- "数据质量检查""数据采样""TABLESAMPLE""approx_percentile"
13
- "BITMAP 用户画像""人群圈选""批量推理"、"Python 3.10"
14
- "scikit-learn""项目目录结构""config.json"".env"时触发。
4
+ End-to-end data science workflow guide for ClickZetta Lakehouse. Organized by
5
+ work stage: environment setup (Python 3.10+ check/install), Jupyter Notebook
6
+ configuration, project structure (Cookiecutter DS standard), data discovery,
7
+ data quality assessment, data cleaning and integration, dataset construction,
8
+ EDA, feature engineering (SQL + ZettaPark), and model inference deployment
9
+ (BITMAP user profiling / UDF batch inference / vector search).
10
+ Trigger when user mentions: "data science", "machine learning", "feature engineering",
11
+ "EDA", "data exploration", "ZettaPark ML", "Jupyter connect Lakehouse", "notebook",
12
+ "ipynb", "jupyter kernel", "%%sql", "magic command", "pandas read data",
13
+ "data quality check", "data sampling", "TABLESAMPLE", "approx_percentile",
14
+ "BITMAP user profile", "audience segmentation", "batch inference", "Python 3.10",
15
+ "scikit-learn", "project directory structure", "config.json", ".env".
15
16
  Keywords: data science, Jupyter, EDA, feature engineering, ML, pandas, notebook
16
17
  ---
17
18
 
18
- # ClickZetta Lakehouse 数据科学工作流
19
+ # ClickZetta Lakehouse Data Science Workflow
19
20
 
20
- ## 工作流全景
21
+ ## Workflow Overview
21
22
 
22
23
  ```
23
- 环境准备 → Jupyter 配置 → 项目结构 → 数据发现 → 数据质量评估 → 数据清洗整合
24
-                                                                  ↓
25
- 模型推理上线 ← 特征工程 ← EDA ← 数据集构建
24
+ Environment Setup → Jupyter Config → Project Structure → Data Discovery → Data Quality → Data Cleaning
25
+                                                                                             ↓
26
+ Model Inference ← Feature Engineering ← EDA ← Dataset Build
26
27
  ```
27
28
 
28
29
  ---
29
30
 
30
- ## 硬性前提条件
31
+ ## Hard Prerequisite
31
32
 
32
- **Python 3.10+**(ZettaPark 硬性要求)。用户环境是 3.9 或更低时,先给升级方案再继续:
33
+ **Python 3.10+** (required by ZettaPark). If the user's environment is 3.9 or lower, provide an upgrade path before continuing:
33
34
 
34
35
  ```bash
35
36
  brew install pyenv && pyenv install 3.12.9 && pyenv local 3.12.9
36
37
  python -m venv .venv && source .venv/bin/activate
37
38
  ```
38
39
 
39
- 详细搭建步骤见 [references/setup.md](references/setup.md)
40
+ See [references/setup.md](references/setup.md) for detailed setup steps.
40
41
 
41
42
  ---
42
43
 
43
- ## 项目结构
44
+ ## Project Structure
44
45
 
45
46
  ```
46
47
  my-ds-project/
47
- ├── notebooks/ # 00-env-check.ipynb 必须是第一个
48
+ ├── notebooks/ # 00-env-check.ipynb must be first
48
49
  │ ├── 00-env-check.ipynb
49
50
  │ ├── 01-data-discovery.ipynb
50
51
  │ ├── 02-data-quality.ipynb
@@ -52,74 +53,74 @@ my-ds-project/
52
53
  │ ├── 04-feature-engineering.ipynb
53
54
  │ └── 05-modeling.ipynb
54
55
  ├── src/
55
- │ ├── config.py # 连接配置,见 references/setup.md
56
+ │ ├── config.py # connection config, see references/setup.md
56
57
  │ ├── data/
57
58
  │ └── features/
58
59
  ├── sql/
59
- ├── data/ # 全部加入 .gitignore
60
- ├── models/ # 全部加入 .gitignore
61
- ├── .env # 绝不入 git
62
- └── .env.example # 入 git
60
+ ├── data/ # all in .gitignore
61
+ ├── models/ # all in .gitignore
62
+ ├── .env # never commit to git
63
+ └── .env.example # commit to git
63
64
  ```
64
65
 
65
- 环境变量命名规范:`CLICKZETTA_SERVICE` / `CLICKZETTA_INSTANCE` / `CLICKZETTA_WORKSPACE` / `CLICKZETTA_USERNAME` / `CLICKZETTA_PASSWORD` / `CLICKZETTA_VCLUSTER` / `CLICKZETTA_SCHEMA`。
66
+ Environment variable naming: `CLICKZETTA_SERVICE` / `CLICKZETTA_INSTANCE` / `CLICKZETTA_WORKSPACE` / `CLICKZETTA_USERNAME` / `CLICKZETTA_PASSWORD` / `CLICKZETTA_VCLUSTER` / `CLICKZETTA_SCHEMA`.
66
67
 
67
68
  ---
68
69
 
69
- ## 数据写入规则(禁止事项)
70
+ ## Data Write Rules
70
71
 
71
- | 方式 | 结论 |
72
+ | Method | Verdict |
72
73
  |------|------|
73
- | `session.create_dataframe(df).write.save_as_table()` | ✅ 推荐 |
74
- | `cursor` 批量 INSERT(每批 500 行) | ✅ Python 3.9 / ZettaPark 不可用时的 fallback |
75
- | `df.to_sql(conn, ...)` | ❌ 禁止,报 `'list' object has no attribute 'keys'` |
76
- | SQLAlchemy `clickzetta://...` | ❌ 禁止,dialect 不可靠 |
74
+ | `session.create_dataframe(df).write.save_as_table()` | ✅ Recommended |
75
+ | `cursor` batch INSERT (500 rows per batch) | ✅ Fallback when Python 3.9 / ZettaPark unavailable |
76
+ | `df.to_sql(conn, ...)` | ❌ Forbidden — raises `'list' object has no attribute 'keys'` |
77
+ | SQLAlchemy `clickzetta://...` | ❌ Forbidden — dialect is unreliable |
77
78
 
78
- 代码模板见 [references/write-and-infer.md](references/write-and-infer.md)
79
+ See [references/write-and-infer.md](references/write-and-infer.md) for code templates.
79
80
 
80
81
  ---
81
82
 
82
- ## 数据查看规则
83
+ ## Data Viewing Rules
83
84
 
84
- - 快速查看用 `.show()`,不需要 pandas 时不要 `.to_pandas()`
85
- - 大表操作默认加 `TABLESAMPLE ROW(10)` 采样,避免 OOM
85
+ - Use `.show()` for quick inspection; avoid `.to_pandas()` when pandas is not needed
86
+ - Always add `TABLESAMPLE ROW(10)` when working with large tables to prevent OOM
86
87
 
87
88
  ---
88
89
 
89
- ## 数据验证规则
90
+ ## Data Validation Rules
90
91
 
91
- 导入数据后,**立即用已知基准值验证统计结果**,再进行后续分析。
92
+ After loading data, **immediately validate statistics against known baseline values** before proceeding with analysis.
92
93
 
93
- 常见陷阱:运动员/用户级别的原始数据,团体项目每个参与者各有一条记录,直接 SUM 会重复计算。正确做法:先 `SELECT DISTINCT event, medal, ...` 去重,再聚合。
94
+ Common pitfall: raw athlete/user-level data where each participant in a team event has one row — a direct `SUM` will double-count. Correct approach: `SELECT DISTINCT event, medal, ...` to deduplicate first, then aggregate.
94
95
 
95
96
  ---
96
97
 
97
- ## ClickZetta SQL 不支持的语法
98
+ ## Unsupported ClickZetta SQL Syntax
98
99
 
99
- | 不支持 | 替代方案 |
100
+ | Not Supported | Alternative |
100
101
  |--------|---------|
101
- | `CREATE OR REPLACE TABLE` | `CREATE TABLE IF NOT EXISTS`(普通表不支持 OR REPLACE) |
102
- | `ARRAY_AGG(col IGNORE NULLS)` | `MAX(col)` 或 `COALESCE()` |
103
- | `QUALIFY` 子句 | 子查询 + `WHERE rn = 1` |
104
- | `UNION` / `INTERSECT` / `EXCEPT` | JOIN + 应用层合并 |
105
- | `BEGIN; COMMIT; ROLLBACK;` | MERGE 实现原子操作 |
102
+ | `CREATE OR REPLACE TABLE` | `CREATE TABLE IF NOT EXISTS` (regular tables don't support OR REPLACE) |
103
+ | `ARRAY_AGG(col IGNORE NULLS)` | `MAX(col)` or `COALESCE()` |
104
+ | `QUALIFY` clause | Subquery + `WHERE rn = 1` |
105
+ | `UNION` / `INTERSECT` / `EXCEPT` | JOIN + application-layer merge |
106
+ | `BEGIN; COMMIT; ROLLBACK;` | Use MERGE for atomic operations |
106
107
  | `NOW()` | `CURRENT_TIMESTAMP()` |
107
108
 
108
- 遇到其他语法报错,加载 `clickzetta-sql-syntax-guide` skill
109
+ For other syntax errors, load the `clickzetta-sql-migration` skill to see Snowflake/Databricks/Spark vs. ClickZetta syntax differences.
109
110
 
110
111
  ---
111
112
 
112
- ## Schema 上下文
113
+ ## Schema Context
113
114
 
114
- Python 代码中 SQL 语句始终使用完整表名 `schema.table`,不依赖当前 schema 上下文。
115
+ Always use fully qualified table names (`schema.table`) in SQL within Python code — do not rely on the current schema context.
115
116
 
116
117
  ---
117
118
 
118
- ## 参考文档
119
+ ## References
119
120
 
120
- - [环境搭建与项目配置](references/setup.md) — 环境搭建、config.py 模板、Jupyter 配置
121
- - [数据发现/质量/清洗/EDA 示例](references/data-patterns.md)
122
- - [数据写入/特征工程/模型推理示例](references/write-and-infer.md)
121
+ - [Environment Setup & Project Config](references/setup.md) — environment setup, config.py template, Jupyter configuration
122
+ - [Data Discovery / Quality / Cleaning / EDA Examples](references/data-patterns.md)
123
+ - [Data Write / Feature Engineering / Model Inference Examples](references/write-and-infer.md)
123
124
  - [ZettaPark API](references/zettapark-api.md)
124
- - [统计分析函数](references/stats-functions.md)
125
- - [BITMAP 用户画像](references/bitmap-profile.md)
125
+ - [Statistical Analysis Functions](references/stats-functions.md)
126
+ - [BITMAP User Profiling](references/bitmap-profile.md)
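The "Data Validation Rules" section in this hunk warns that participant-level rows double-count team events and recommends `SELECT DISTINCT event, medal, ...` before aggregating. The sketch below reproduces that pitfall on a toy table. It uses Python's built-in `sqlite3` as a stand-in engine, and the table name, columns, and rows are invented for illustration; it is not ClickZetta or ZettaPark code, and it counts rows rather than summing a column.

```python
import sqlite3


def medal_counts() -> tuple[int, int]:
    """Return (naive_count, deduplicated_count) of gold medals in a toy dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (athlete TEXT, event TEXT, medal TEXT)")
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?)",
        [
            # One relay gold is stored once per team member.
            ("A", "4x100m relay", "Gold"),
            ("B", "4x100m relay", "Gold"),
            ("C", "4x100m relay", "Gold"),
            ("D", "4x100m relay", "Gold"),
            # An individual event has a single row.
            ("E", "100m", "Gold"),
        ],
    )
    # Naive aggregate: counts every participant row.
    naive = conn.execute(
        "SELECT COUNT(*) FROM results WHERE medal = 'Gold'"
    ).fetchone()[0]
    # Deduplicate on (event, medal) first, then aggregate.
    deduped = conn.execute(
        "SELECT COUNT(*) FROM ("
        "SELECT DISTINCT event, medal FROM results WHERE medal = 'Gold'"
        ") AS t"
    ).fetchone()[0]
    conn.close()
    return naive, deduped


if __name__ == "__main__":
    naive, deduped = medal_counts()
    print(f"naive={naive} deduplicated={deduped}")  # naive=5 deduplicated=2
```

Comparing the two numbers against a known baseline (here, two gold medals) is the validation step the skill asks for before any further analysis.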