@clickzetta/cz-cli-linux-x64 0.3.4 → 0.3.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (118)
  1. package/bin/cz-cli +0 -0
  2. package/package.json +1 -1
  3. package/bin/skills/clickzetta-access-control/SKILL.md +0 -243
  4. package/bin/skills/clickzetta-access-control/references/dynamic-masking.md +0 -86
  5. package/bin/skills/clickzetta-access-control/references/grant-revoke.md +0 -103
  6. package/bin/skills/clickzetta-access-control/references/role-management.md +0 -66
  7. package/bin/skills/clickzetta-access-control/references/user-management.md +0 -61
  8. package/bin/skills/clickzetta-ai-vector-search/SKILL.md +0 -160
  9. package/bin/skills/clickzetta-ai-vector-search/references/vector-search.md +0 -155
  10. package/bin/skills/clickzetta-app-python-sdk/SKILL.md +0 -153
  11. package/bin/skills/clickzetta-app-python-sdk/references/bulkload.md +0 -196
  12. package/bin/skills/clickzetta-app-python-sdk/references/connector.md +0 -143
  13. package/bin/skills/clickzetta-app-python-sdk/references/realtime.md +0 -122
  14. package/bin/skills/clickzetta-batch-sync-pipeline/SKILL.md +0 -293
  15. package/bin/skills/clickzetta-bi-connect/SKILL.md +0 -176
  16. package/bin/skills/clickzetta-bi-connect/references/bi-tools.md +0 -170
  17. package/bin/skills/clickzetta-cdc-sync-pipeline/SKILL.md +0 -457
  18. package/bin/skills/clickzetta-concepts/SKILL.md +0 -282
  19. package/bin/skills/clickzetta-concepts/references/brands-and-endpoints.md +0 -79
  20. package/bin/skills/clickzetta-concepts/references/object-model.md +0 -311
  21. package/bin/skills/clickzetta-data-ingest-pipeline/SKILL.md +0 -165
  22. package/bin/skills/clickzetta-data-lifecycle/SKILL.md +0 -211
  23. package/bin/skills/clickzetta-data-lifecycle/references/lifecycle-reference.md +0 -175
  24. package/bin/skills/clickzetta-data-recovery/SKILL.md +0 -215
  25. package/bin/skills/clickzetta-data-recovery/evals/evals.json +0 -35
  26. package/bin/skills/clickzetta-data-science/SKILL.md +0 -125
  27. package/bin/skills/clickzetta-data-science/references/bitmap-profile.md +0 -146
  28. package/bin/skills/clickzetta-data-science/references/data-patterns.md +0 -110
  29. package/bin/skills/clickzetta-data-science/references/setup.md +0 -160
  30. package/bin/skills/clickzetta-data-science/references/stats-functions.md +0 -195
  31. package/bin/skills/clickzetta-data-science/references/write-and-infer.md +0 -122
  32. package/bin/skills/clickzetta-data-science/references/zettapark-api.md +0 -156
  33. package/bin/skills/clickzetta-data-sharing/SKILL.md +0 -160
  34. package/bin/skills/clickzetta-data-sharing/references/share-ddl.md +0 -134
  35. package/bin/skills/clickzetta-dba-guide/SKILL.md +0 -540
  36. package/bin/skills/clickzetta-dw-modeling/SKILL.md +0 -259
  37. package/bin/skills/clickzetta-dw-modeling/references/modeling-patterns.md +0 -100
  38. package/bin/skills/clickzetta-dynamic-table/SKILL.md +0 -112
  39. package/bin/skills/clickzetta-dynamic-table/best-practices/dimension-table-join-guide.md +0 -257
  40. package/bin/skills/clickzetta-dynamic-table/best-practices/medallion-and-stream-patterns.md +0 -124
  41. package/bin/skills/clickzetta-dynamic-table/best-practices/non-partitioned-merge-into-warning.md +0 -96
  42. package/bin/skills/clickzetta-dynamic-table/best-practices/performance-optimization.md +0 -109
  43. package/bin/skills/clickzetta-dynamic-table/dt-creator/SKILL.md +0 -15
  44. package/bin/skills/clickzetta-dynamic-table/dt-creator/references/dt-declaration-strategy.md +0 -185
  45. package/bin/skills/clickzetta-dynamic-table/dt-creator/references/incremental-config-reference.md +0 -429
  46. package/bin/skills/clickzetta-dynamic-table/dt-creator/references/refresh-history-guide.md +0 -268
  47. package/bin/skills/clickzetta-dynamic-table/dt-creator/references/sql-limitations.md +0 -80
  48. package/bin/skills/clickzetta-dynamic-table/dynamic-table-alter/SKILL.md +0 -190
  49. package/bin/skills/clickzetta-external-catalog/SKILL.md +0 -120
  50. package/bin/skills/clickzetta-external-catalog/references/external-catalog-ddl.md +0 -130
  51. package/bin/skills/clickzetta-external-function/SKILL.md +0 -203
  52. package/bin/skills/clickzetta-external-function/references/external-function-ddl.md +0 -171
  53. package/bin/skills/clickzetta-file-import-pipeline/SKILL.md +0 -156
  54. package/bin/skills/clickzetta-index-manager/SKILL.md +0 -140
  55. package/bin/skills/clickzetta-index-manager/references/bloomfilter-index.md +0 -67
  56. package/bin/skills/clickzetta-index-manager/references/index-management.md +0 -73
  57. package/bin/skills/clickzetta-index-manager/references/inverted-index.md +0 -80
  58. package/bin/skills/clickzetta-index-manager/references/vector-index.md +0 -81
  59. package/bin/skills/clickzetta-information-schema/SKILL.md +0 -367
  60. package/bin/skills/clickzetta-information-schema/references/instance-views-reference.md +0 -276
  61. package/bin/skills/clickzetta-information-schema/references/metering-views-reference.md +0 -137
  62. package/bin/skills/clickzetta-information-schema/references/views-reference.md +0 -271
  63. package/bin/skills/clickzetta-java-sdk/SKILL.md +0 -186
  64. package/bin/skills/clickzetta-java-sdk/references/bulkload.md +0 -163
  65. package/bin/skills/clickzetta-java-sdk/references/realtime.md +0 -212
  66. package/bin/skills/clickzetta-kafka-ingest-pipeline/SKILL.md +0 -639
  67. package/bin/skills/clickzetta-kafka-ingest-pipeline/references/kafka-pipe-syntax.md +0 -324
  68. package/bin/skills/clickzetta-lakehouse-connect/SKILL.md +0 -218
  69. package/bin/skills/clickzetta-lakehouse-connect/evals/evals.json +0 -35
  70. package/bin/skills/clickzetta-lakehouse-connect/references/config-file.md +0 -435
  71. package/bin/skills/clickzetta-lakehouse-connect/references/jdbc.md +0 -478
  72. package/bin/skills/clickzetta-lakehouse-connect/references/python-sdk.md +0 -225
  73. package/bin/skills/clickzetta-lakehouse-connect/references/sqlalchemy.md +0 -468
  74. package/bin/skills/clickzetta-lakehouse-connect/references/zettapark-session.md +0 -445
  75. package/bin/skills/clickzetta-manage-comments/SKILL.md +0 -219
  76. package/bin/skills/clickzetta-metadata-query/SKILL.md +0 -298
  77. package/bin/skills/clickzetta-metadata-query/references/show-desc-reference.md +0 -326
  78. package/bin/skills/clickzetta-monitoring/SKILL.md +0 -199
  79. package/bin/skills/clickzetta-monitoring/references/job-history-analysis.md +0 -97
  80. package/bin/skills/clickzetta-monitoring/references/show-jobs.md +0 -48
  81. package/bin/skills/clickzetta-oss-ingest-pipeline/SKILL.md +0 -427
  82. package/bin/skills/clickzetta-query-optimizer/SKILL.md +0 -156
  83. package/bin/skills/clickzetta-query-optimizer/references/explain.md +0 -56
  84. package/bin/skills/clickzetta-query-optimizer/references/hints-and-sortkey.md +0 -78
  85. package/bin/skills/clickzetta-query-optimizer/references/optimize.md +0 -65
  86. package/bin/skills/clickzetta-query-optimizer/references/result-cache.md +0 -49
  87. package/bin/skills/clickzetta-query-optimizer/references/show-jobs.md +0 -42
  88. package/bin/skills/clickzetta-realtime-sync-pipeline/SKILL.md +0 -197
  89. package/bin/skills/clickzetta-semantic-view/SKILL.md +0 -207
  90. package/bin/skills/clickzetta-semantic-view/references/semantic-view-reference.md +0 -167
  91. package/bin/skills/clickzetta-spark-flink-connector/SKILL.md +0 -92
  92. package/bin/skills/clickzetta-spark-flink-connector/references/flink.md +0 -147
  93. package/bin/skills/clickzetta-spark-flink-connector/references/spark.md +0 -132
  94. package/bin/skills/clickzetta-sql-pipeline-manager/SKILL.md +0 -379
  95. package/bin/skills/clickzetta-sql-pipeline-manager/evals/evals.json +0 -166
  96. package/bin/skills/clickzetta-sql-pipeline-manager/references/dynamic-table.md +0 -185
  97. package/bin/skills/clickzetta-sql-pipeline-manager/references/materialized-view.md +0 -129
  98. package/bin/skills/clickzetta-sql-pipeline-manager/references/pipe.md +0 -222
  99. package/bin/skills/clickzetta-sql-pipeline-manager/references/table-stream.md +0 -125
  100. package/bin/skills/clickzetta-sql-syntax-guide/SKILL.md +0 -172
  101. package/bin/skills/clickzetta-sql-syntax-guide/references/ddl-reference.md +0 -350
  102. package/bin/skills/clickzetta-sql-syntax-guide/references/dml-reference.md +0 -279
  103. package/bin/skills/clickzetta-sql-syntax-guide/references/dql-reference.md +0 -504
  104. package/bin/skills/clickzetta-sql-syntax-guide/references/functions-reference.md +0 -372
  105. package/bin/skills/clickzetta-sql-syntax-guide/references/migration-databricks.md +0 -260
  106. package/bin/skills/clickzetta-sql-syntax-guide/references/migration-snowflake.md +0 -382
  107. package/bin/skills/clickzetta-sql-syntax-guide/references/vs-snowflake.md +0 -346
  108. package/bin/skills/clickzetta-sql-syntax-guide/references/vs-spark.md +0 -229
  109. package/bin/skills/clickzetta-studio-overview/SKILL.md +0 -170
  110. package/bin/skills/clickzetta-studio-overview/references/studio-modules.md +0 -173
  111. package/bin/skills/clickzetta-table-stream-pipeline/SKILL.md +0 -206
  112. package/bin/skills/clickzetta-vcluster-manager/SKILL.md +0 -212
  113. package/bin/skills/clickzetta-vcluster-manager/references/vc-cache.md +0 -54
  114. package/bin/skills/clickzetta-vcluster-manager/references/vcluster-ddl.md +0 -150
  115. package/bin/skills/clickzetta-volume-manager/SKILL.md +0 -292
  116. package/bin/skills/clickzetta-volume-manager/references/volume-ddl.md +0 -199
  117. package/bin/skills/clickzetta-zettapark/SKILL.md +0 -248
  118. package/bin/skills/clickzetta-zettapark/references/zettapark-api.md +0 -283
@@ -1,143 +0,0 @@
- # clickzetta-connector-python Detailed Reference
-
- ## Installation
-
- ```bash
- pip install clickzetta-connector-python -U
- ```
-
- ## Establishing a Connection
-
- ```python
- from clickzetta import connect
-
- conn = connect(
-     username='your_username',
-     password='your_password',
-     service='api.clickzetta.com',
-     instance='your_instance',
-     workspace='your_workspace',
-     schema='public',
-     vcluster='default'
- )
- ```
-
- ## Basic Queries
-
- ```python
- cursor = conn.cursor()
- cursor.execute('SELECT * FROM orders LIMIT 10')
- results = cursor.fetchall()
- for row in results:
-     print(row)
- cursor.close()
- conn.close()
- ```
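The query example above closes the cursor and the connection by hand, so an exception raised in between would skip the cleanup. Below is a minimal sketch of the same flow using `contextlib.closing`. It uses the standard-library `sqlite3` module purely as a stand-in DB-API connection so that it runs anywhere, and it assumes the ClickZetta cursor and connection expose the same `close()` methods shown above. The helper `fetch_rows` and the sample `orders` rows are made up for illustration.

```python
import sqlite3
from contextlib import closing


def fetch_rows(conn, sql):
    """Run a query and always close the cursor, even if execute() raises."""
    with closing(conn.cursor()) as cursor:
        cursor.execute(sql)
        return cursor.fetchall()


# sqlite3 is only a stand-in for a PEP 249 connection (assumption for the demo).
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE orders (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO orders VALUES (1, 'a'), (2, 'b')")
    rows = fetch_rows(conn, "SELECT * FROM orders ORDER BY id LIMIT 10")
    for row in rows:
        print(row)
```

With a real connection, the object returned by `connect(...)` would take the place of the `sqlite3` connection, provided it behaves the same way on `close()`.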
-
- ## Parameter Binding
-
- Two parameter styles are supported, following PEP 249:
-
- ### qmark style (recommended)
-
- ```python
- # Single-row insert
- cursor.execute('INSERT INTO test (id, name) VALUES (?, ?)', binding_params=[1, 'test'])
-
- # Batch insert (executemany)
- data = [
-     (1, 'test1'),
-     (2, 'test2'),
-     (3, 'test3')
- ]
- cursor.executemany('INSERT INTO test (id, name) VALUES (?, ?)', data)
- ```
-
- ### pyformat style
-
- ```python
- data = {'id': 1, 'name': 'test'}
- cursor.execute('INSERT INTO test (id, name) VALUES (%(id)s, %(name)s)', data)
- ```
-
- ## SQL Hints (timeout control, etc.)
-
- ```python
- my_param = {
-     'hints': {
-         'sdk.job.timeout': 30  # query timeout in seconds
-     }
- }
- cursor.execute('SELECT * FROM large_table', my_param)
- ```
-
- ## Asynchronous Execution (long-running queries)
-
- ```python
- import time
-
- cursor.execute_async('SELECT * FROM large_table')
-
- while not cursor.is_job_finished():
-     print("Query running...")
-     time.sleep(1)
-
- results = cursor.fetchall()
- ```
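The polling loop above runs until the job finishes, however long that takes. Below is a minimal sketch of the same loop with an overall deadline and a growing poll interval. Only `is_job_finished()` comes from the example above; `wait_for_job`, its parameters, and the `FakeCursor` test double are hypothetical names introduced here so the snippet runs without a live connection.

```python
import time


def wait_for_job(cursor, timeout_s=300.0, poll_s=1.0, max_poll_s=10.0):
    """Poll cursor.is_job_finished() until done or until timeout_s elapses.

    Returns True if the job finished, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    delay = poll_s
    while not cursor.is_job_finished():
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(delay, remaining))
        delay = min(delay * 2, max_poll_s)  # back off to reduce polling load
    return True


class FakeCursor:
    """Hypothetical stand-in that reports 'finished' after N polls."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done

    def is_job_finished(self):
        self._remaining -= 1
        return self._remaining < 0


finished = wait_for_job(FakeCursor(3), timeout_s=5.0, poll_s=0.01)
timed_out = wait_for_job(FakeCursor(10**9), timeout_s=0.05, poll_s=0.01)
print(finished, timed_out)
```

With a real cursor, `fetchall()` would be called only when `wait_for_job` returns True. What to do on timeout (for example, cancelling the job) depends on connector features not shown in this reference.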
-
- ## Saving Results to CSV
-
- ```python
- import csv
-
- cursor.execute('SELECT * FROM orders LIMIT 1000')
- results = cursor.fetchall()
-
- with open('output.csv', 'w', newline='', encoding='utf-8') as f:
-     writer = csv.writer(f)
-     writer.writerow([col[0] for col in cursor.description])
-     writer.writerows(results)
-
- cursor.close()
- conn.close()
- ```
-
- ## SQLAlchemy Integration
-
- ```python
- from sqlalchemy import create_engine, text
-
- engine = create_engine(
-     "clickzetta://username:password@instance.api.clickzetta.com/workspace"
-     "?schema=public&vcluster=default"
- )
-
- with engine.connect() as conn:
-     result = conn.execute(text("SELECT * FROM orders LIMIT 10"))
-     for row in result:
-         print(row)
- ```
-
- ### SQLAlchemy + pandas
-
- ```python
- import pandas as pd
- from sqlalchemy import create_engine, text
-
- engine = create_engine(
-     "clickzetta://username:password@instance.api.clickzetta.com/workspace"
-     "?schema=public&vcluster=default"
- )
-
- with engine.connect() as conn:
-     result = conn.execute(text("SELECT * FROM orders"))
-     df = pd.DataFrame(result.fetchall(), columns=result.keys())
-
- print(df.head())
- ```
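The connection URLs above embed the username and password directly, which breaks when a credential contains characters such as `@` or `/`. Below is a minimal sketch that assembles a URL of the same shape and percent-encodes the credentials with the standard library. The function name `build_clickzetta_url` and the sample values are hypothetical; the URL layout is copied from the examples above, not from a dialect specification.

```python
from urllib.parse import quote_plus, urlencode


def build_clickzetta_url(username, password, instance, service, workspace,
                         schema="public", vcluster="default"):
    """Assemble a SQLAlchemy URL in the shape used above, escaping credentials."""
    query = urlencode({"schema": schema, "vcluster": vcluster})
    return (
        f"clickzetta://{quote_plus(username)}:{quote_plus(password)}"
        f"@{instance}.{service}/{workspace}?{query}"
    )


url = build_clickzetta_url(
    username="alice",
    password="p@ss/word",  # contains characters that must be escaped
    instance="your_instance",
    service="api.clickzetta.com",
    workspace="your_workspace",
)
print(url)
```

The resulting string can be passed to `create_engine` in place of the literal URL.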
-
- ## Notes
-
- - `commit()` and `rollback()` are not supported
- - Parameter binding and asynchronous execution require `clickzetta-connector-python >= 0.8.82`
- - The legacy `clickzetta-connector` package is no longer maintained; migrate to `clickzetta-connector-python`
@@ -1,122 +0,0 @@
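- # clickzetta-ingestion-python-v2 Real-Time Ingestion Detailed Reference
-
- ## Installation
-
- ```bash
- pip install clickzetta-ingestion-python-v2
- ```
-
- > Note: this is a standalone package, distinct from `clickzetta-ingestion-python` (BulkLoad).
-
- ## Differences from BulkLoad
-
- | Feature | BulkLoad (`ingestion-python`) | Real-time ingestion (`ingestion-python-v2`) |
- |---|---|---|
- | Write latency | Minutes (visible after commit) | Queryable within seconds |
- | Suitable frequency | Intervals ≥ 5 minutes | High-frequency small batches (under 5 minutes) |
- | Primary-key table support | ❌ Not supported | ✅ Supported (CDC mode) |
- | UPDATE/DELETE | ❌ | ✅ UPSERT / DELETE_IGNORE |
- | Write mechanism | Upload to object storage → import on commit | Writes directly to the Ingestion Service |
-
- ## How It Works
-
- ```
- [SDK write] → [Ingestion Service] → [hybrid table (queryable within seconds)] → [automatic commit after ~1 minute] → [regular table (visible to stream/DT)]
- ```
-
- - Data can be queried with SELECT within seconds of being written
- - Table streams, materialized views, and dynamic tables see the data only after the commit, roughly 1 minute later
- - After the commit, the data is merged in the background into a regular table, after which UPDATE/MERGE/DELETE can be executed
-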
- ## Writing to Regular Tables (APPEND_ONLY)
-
- ```python
- from clickzetta.connector.v0.connection import connect
- from clickzetta.connector.v0.enums import RealtimeOperation
- from clickzetta_ingestion.realtime.realtime_options import RealtimeOptionsBuilder, FlushMode
- from clickzetta_ingestion.realtime.arrow_stream import RowOperator
-
- with connect(
-     username='your_username',
-     password='your_password',
-     service='your_service_endpoint',
-     instance='your_instance',
-     workspace='your_workspace',
-     schema='your_schema',
-     vcluster='default'
- ) as conn:
-     stream = conn.get_realtime_stream(
-         schema='your_schema',
-         table='your_table',
-         operate=RealtimeOperation.APPEND_ONLY,
-         options=RealtimeOptionsBuilder()
-             .with_flush_mode(FlushMode.AUTO_FLUSH_BACKGROUND)
-             .build()
-     )
-
-     for i in range(1000):
-         row = stream.create_row(RowOperator.INSERT)
-         row.set_value('id', str(i))
-         row.set_value('name', f'user_{i}')
-         stream.apply(row)
-
-     stream.close()
- ```
-
- ## Writing to Primary-Key Tables (CDC mode)
-
- Primary-key tables **must** use `RealtimeOperation.CDC` together with `FlushMode.AUTO_FLUSH_SYNC`.
-
- ```python
- from clickzetta.connector.v0.connection import connect
- from clickzetta.connector.v0.enums import RealtimeOperation
- from clickzetta_ingestion.realtime.realtime_options import RealtimeOptionsBuilder, FlushMode
- from clickzetta_ingestion.realtime.arrow_stream import RowOperator
-
- with connect(...) as conn:  # same connection arguments as in the APPEND_ONLY example
-     stream = conn.get_realtime_stream(
-         schema='your_schema',
-         table='your_pk_table',
-         operate=RealtimeOperation.CDC,
-         options=RealtimeOptionsBuilder()
-             .with_flush_mode(FlushMode.AUTO_FLUSH_SYNC)  # synchronous flush is mandatory for PK tables
-             .build()
-     )
-
-     # UPSERT: update the row if it exists, insert it otherwise
-     row = stream.create_row(RowOperator.UPSERT)
-     row.set_value('id', 'id_1')
-     row.set_value('name', 'alice')
-     row.set_value('age', 30)
-     stream.apply(row)
-
-     # DELETE_IGNORE: delete the row; silently ignored if the target row does not exist
-     row = stream.create_row(RowOperator.DELETE_IGNORE)
-     row.set_value('id', 'id_1')
-     stream.apply(row)
-
-     stream.close()
- ```
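-
- ## Operation Type Reference
-
- | Stream mode | Available RowOperator values | Applicable table type |
- |---|---|---|
- | `APPEND_ONLY` | `INSERT` | Regular tables |
- | `CDC` | `UPSERT`, `DELETE_IGNORE` | Primary-key tables (required) |
-
- ## FlushMode Reference
-
- | Mode | Description | When to use |
- |---|---|---|
- | `AUTO_FLUSH_BACKGROUND` | Asynchronous flush, high throughput | Regular tables with no ordering requirement |
- | `AUTO_FLUSH_SYNC` | Synchronous, blocking flush | Primary-key tables (mandatory); when ordering must be preserved |
- | `MANUAL_FLUSH` | Flush by calling `stream.flush()` manually | Precise control over flush timing |
-
- > ⚠️ Primary-key tables do not support `AUTO_FLUSH_BACKGROUND`; it is automatically reset to `AUTO_FLUSH_SYNC`.
-
- ## Key Caveats
-
- - Stop real-time ingestion tasks before changing a table's schema, and restart them about 90 minutes after the change (the Flink Connector's schema change sink is exempt)
- - Partition columns must be a subset of the primary key
- - Avoid calling `flush()` too frequently; it produces a large number of small files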
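One way to keep manual flushes infrequent is to batch them by row count and elapsed time. The sketch below wraps a stream in a small helper that calls `flush()` only once a threshold is reached. Only `apply()` and `flush()` are taken from this reference; `BatchedWriter`, its thresholds, and the `FakeStream` test double are hypothetical, and suitable thresholds for a real workload are not stated in the source.

```python
import time


class BatchedWriter:
    """Flush the stream every max_rows rows or max_interval_s seconds."""

    def __init__(self, stream, max_rows=1000, max_interval_s=5.0, clock=time.monotonic):
        self._stream = stream
        self._max_rows = max_rows
        self._max_interval_s = max_interval_s
        self._clock = clock
        self._pending = 0
        self._last_flush = clock()

    def apply(self, row):
        self._stream.apply(row)
        self._pending += 1
        too_many = self._pending >= self._max_rows
        too_old = self._clock() - self._last_flush >= self._max_interval_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._pending:
            self._stream.flush()
        self._pending = 0
        self._last_flush = self._clock()


class FakeStream:
    """Hypothetical stand-in that records calls instead of writing data."""

    def __init__(self):
        self.applied = 0
        self.flushes = 0

    def apply(self, row):
        self.applied += 1

    def flush(self):
        self.flushes += 1


fake = FakeStream()
writer = BatchedWriter(fake, max_rows=100, max_interval_s=3600.0)
for i in range(250):
    writer.apply({"id": str(i)})
writer.flush()  # push the final partial batch
print(fake.applied, fake.flushes)
```

In real use the wrapped object would be a stream opened with `FlushMode.MANUAL_FLUSH`, and the rows would come from `stream.create_row(...)` rather than plain dictionaries.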
@@ -1,293 +0,0 @@
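- ---
- name: clickzetta-batch-sync-pipeline
- description: |
-   Create and manage ClickZetta Lakehouse offline sync (batch sync) tasks, in two modes: single-table offline sync and multi-table offline sync.
-   Single-table mode suits simple source → target table syncs; multi-table mode supports three sync methods: whole-database mirroring, multi-table mirroring, and multi-table merging.
-   Triggered when the user says "offline sync", "batch sync", "sync a database to Lakehouse", "whole-database migration",
-   "multi-table sync", "scheduled sync", "periodic data sync", "merging sharded databases and tables", or "offline data migration".
-   Covers ClickZetta Studio-specific logic: creating single-table and multi-table offline sync tasks, data source configuration,
-   field mapping, sync rules, scheduling and deployment, and task operations.
-   Keywords: batch sync, offline sync, full load, mirror, multi-table sync
- ---
-
- # Offline Sync (Batch Sync) Pipeline Workflow
-
- ## Use Cases
-
- - Periodically sync data from external databases (MySQL / PostgreSQL / SQL Server, etc.) into Lakehouse
- - Single-table offline sync: simple periodic source table → target table syncs
- - Multi-table offline sync: whole-database migration, bulk multi-table sync, merging sharded databases and tables
- - Data freshness requirements are modest; data is refreshed in batches on a daily, hourly, or similar cycle
- - Periodic automatic execution through the Studio scheduling system is required
- - Keywords: offline sync, batch sync, whole-database migration, multi-table sync, scheduled sync
-
- ## Prerequisites
-
- - A ClickZetta Lakehouse Studio account with permission to create sync tasks and target tables
- - The source data source is already configured in Studio (with SELECT permission)
- - The target Lakehouse data source is available (with CREATE and INSERT permissions)
- - The clickzetta-studio-mcp tools are available (`create_task`, `save_integration_task`, `save_task_configuration`, `publish_task`, `list_data_sources`, etc.)
-
- ## Choosing a Mode
-
- | Dimension | Single-table offline sync | Multi-table offline sync |
- |------|------------|------------|
- | Task type ID | `10` (offline sync) | `291` (multi-table offline sync) |
- | Sync granularity | One source table → one target table | Whole database / multiple tables → multiple target tables |
- | Use cases | Simple syncs, fine-grained control over a single table | Whole-database migration, bulk sync, merging sharded databases and tables |
- | Field mapping | Adjusted manually by drag and drop | Auto-detected, configured in bulk |
- | Schema Evolution | Not supported | Supported (new fields are adapted automatically) |
- | Automatic table creation | Target table must be created manually or via quick create | Created automatically when the target table does not exist |
- | Write mode | Determined by the data source | overwrite or upsert, selectable |
-
- ## Data Sources Supported by Multi-Table Offline Sync
-
- **Source**: MySQL, PostgreSQL, SQL Server, Aurora MySQL, Aurora PostgreSQL, PolarDB MySQL, PolarDB PostgreSQL
-
- **Target**: Lakehouse
-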
- ## Workflow
-
- ### Mode A: Single-Table Offline Sync
-
- #### Step 1: Find available data sources
-
- ```
- Use list_data_sources to view the list of configured data sources.
- To filter by type, specify the ds_type parameter (5=MySQL, 7=PostgreSQL, 8=SQL Server).
- Record the source datasource_name and the target datasource_name.
- ```
-
- #### Step 2: Create the offline sync task
-
- ```
- Use create_task to create the task:
- - task_type: 10 (offline sync)
- - task_name: a task name of your choice
- - data_folder_id: target folder ID (obtainable via list_folders)
- ```
-
- #### Step 3: Configure what to sync
-
- ```
- Use save_integration_task to configure the sync:
- - task_id: the task ID returned in step 2
- - source_datasource_name: source data source name
- - source_schema: source database/schema
- - source_table: source table name
- - source_ds_type: source type (5=MySQL, 7=PostgreSQL, 8=SQL Server, etc.)
- - sink_datasource_name: target Lakehouse data source name
- - sink_schema: target schema (default: public)
- - sink_table: target table name (optional; defaults to the source table name)
- - sink_ds_type: 1 (Lakehouse)
- ```
-
- > **Note**: The system automatically retrieves the metadata of the source and target tables and generates the field mapping. If the target table does not exist, it is created automatically.
-
- #### Step 4: Configure scheduling and deploy
-
- ```
- Use save_task_configuration to configure scheduling:
- - task_id: task ID
- - cron_express: cron expression (e.g. '0 0 2 * * ? *' means 2:00 AM every day)
- - schedule_start_time: schedule start time (e.g. '02:00')
-
- Note: offline sync tasks (task_type=10) require a Sync VCluster.
- The tool automatically checks for and assigns an available Sync VCluster. If none is available, create one first.
- ```
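The scheduling step pairs a seven-field cron expression with a start time. Below is a minimal sketch that derives both from an hour and minute, assuming a Quartz-style field order (second, minute, hour, day of month, month, day of week, year), which is consistent with the '0 0 2 * * ? *' example but not confirmed by this document. The helpers `daily_cron` and `schedule_args` and the task ID are hypothetical, and treating `schedule_start_time` as the same time of day as the cron trigger is also an assumption.

```python
def daily_cron(hour, minute=0):
    """Build a 7-field cron expression that fires once a day,
    matching the '0 0 2 * * ? *' shape shown above."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    return f"0 {minute} {hour} * * ? *"


def schedule_args(task_id, hour, minute=0):
    """Assemble the three save_task_configuration parameters listed above."""
    return {
        "task_id": task_id,
        "cron_express": daily_cron(hour, minute),
        "schedule_start_time": f"{hour:02d}:{minute:02d}",
    }


args = schedule_args(task_id=12345, hour=2)  # 12345 is a made-up task ID
print(args)
```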
-
- #### Step 5: Submit the task
-
- ```
- Use publish_task to submit the task to the scheduling system:
- - task_id: task ID
- - task_version: current version number (obtained via get_task_detail)
- ```
-
- #### Step 6: Verify and monitor
-
- ```
- Use get_task_detail to view task details and status.
- Use list_task_run to view the task's run records.
- If a run fails, use list_executions + get_execution_log to inspect the logs and troubleshoot.
- ```
-
- ---
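-
- ### Mode B: Multi-Table Offline Sync
-
- #### Three sync methods
-
- | Method | Description | Use case |
- |------|------|---------|
- | Whole-database mirroring | Syncs every table in the source database | Whole-database migration |
- | Multi-table mirroring | Syncs a selected set of tables individually, keeping each table's structure independent | Selecting a subset of tables as needed |
- | Multi-table merging | Merges multiple source tables into one or more target tables | Merging sharded databases and tables |
-
- #### Step 1: Create the multi-table offline sync task
-
- ```
- Use create_task to create the task:
- - task_type: 291 (multi-table offline sync)
- - task_name: a task name of your choice
- - data_folder_id: target folder ID
- ```
-
- #### Step 2: Configure the sync in the Studio UI
-
- > **Important**: The detailed configuration of multi-table offline sync (source data selection, target settings, mappings, sync rules, etc.)
- > currently has to be completed in the Studio Web UI. The studio_url returned by create_task opens the configuration page directly.
-
- Key configuration points:
-
- **Source data configuration**
- - Select the source data source type and data source connection
- - Choose according to the sync method: whole database / select multiple tables / configure merge rules
-
- **Target settings**
- - Select the target Lakehouse data source and workspace
- - Configure the namespace rule: mirror the source / select a specific one / custom (supports the `{SOURCE_DATABASE}` variable)
- - Configure the target table naming rule: mirror the source / custom (supports the `{SOURCE_DATABASE}`, `{SOURCE_SCHEMA}`, and `{SOURCE_TABLE}` variables)
- - Optional: configure partitioning (partition field + partition value expression)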
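To illustrate how a custom naming rule built from the three source variables might expand, here is a minimal sketch using plain string substitution. How Studio actually resolves these variables is not described in this document, so the function `render_target_name` and the `ods_` prefix rule are purely hypothetical.

```python
def render_target_name(template, database, schema, table):
    """Substitute the three source variables into a naming template."""
    return (
        template.replace("{SOURCE_DATABASE}", database)
        .replace("{SOURCE_SCHEMA}", schema)
        .replace("{SOURCE_TABLE}", table)
    )


# Hypothetical rule: prefix each target table with its source database.
name = render_target_name(
    "ods_{SOURCE_DATABASE}_{SOURCE_TABLE}", "ecommerce", "public", "orders"
)
print(name)
```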
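-
- **Sync rules**
- - Schema Evolution: field deleted at the source → Null is written; field added at the source → adapted automatically; table deleted at the source → ignored
- - Grouping strategy: smart grouping (automatic) or static grouping (specify the number of tables per group, default 4)
- - Concurrency control: maximum source connections per group (default 4), number of groups executed concurrently (default 2)
- - Write mode: non-primary-key tables → overwrite; primary-key tables → overwrite or upsert
-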
- #### Step 3: Debug run
-
- In the Studio UI, click the "Run" button to do a debug run and verify that the data source connection and configuration are correct.
- View run details under "Run History".
-
- #### Step 4: Configure scheduling
-
- ```
- Use save_task_configuration to configure scheduling:
- - task_id: task ID
- - cron_express: cron expression
- - schedule_start_time: schedule start time
-
- A Sync VCluster is required here as well (task_type=291 is an offline sync type).
- ```
-
- #### Step 5: Submit the task
-
- ```
- Use publish_task to submit the task:
- - task_id: task ID
- - task_version: current version number
- ```
-
- #### Step 6: Task operations
-
- ```
- Multi-table offline sync tasks are managed under "Task Operations" → "Periodic Tasks".
-
- View task details: get_task_detail
- View run records: list_task_run
- View execution logs: list_executions + get_execution_log
-
- The Studio UI shows:
- - Task Details tab: upstream/downstream DAG, configuration
- - Task Instances tab: instance list, rows read/written and sync rate per table
- - Sync Objects tab: mappings between all source and target tables
- - Operation Log tab: audit trail of operations
- ```
-
- ---
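-
- ## Task Operations
-
- ### Common operations
-
- | Operation | Description |
- |------|------|
- | Pause/Resume | Pause or resume periodic scheduling |
- | Take offline | Stops the task, removes it from the scheduling system, and reverts it to the unsubmitted state |
- | Take offline (with downstream) | Takes the current task and its downstream tasks offline together (a task with downstream dependencies cannot be taken offline on its own) |
- | Backfill | Re-run past schedule periods to backfill data |
- | Edit | Jump to the development view to modify the configuration |
-
- ### Instance operations (multi-table offline sync)
-
- | Operation | Description |
- |------|------|
- | Rerun (all objects) | Re-sync all tables |
- | Rerun (failed objects only) | Rerun only the tables that failed to sync |
- | Set succeeded / set failed | Manually set the instance's final status |
- | Cancel run | Force-terminate a running instance |
- | Re-sync single table | Re-sync a single table from the Sync Objects tab |
- | Force-stop single table | Terminate the sync of a single table |
-
- ---
-
- ## Examples
-
- ### Example 1: Daily sync of a single MySQL table
-
- The user says: "Sync the MySQL orders table to Lakehouse at 2 AM every day"
-
- Steps:
- 1. `list_data_sources` to find the MySQL data source name (e.g. `mysql_prod`) and the Lakehouse data source name
- 2. `create_task(task_type=10, task_name="sync_orders_daily")`
- 3. `save_integration_task(source_datasource_name="mysql_prod", source_table="orders", sink_schema="public", sink_table="orders")`
- 4. `save_task_configuration(cron_express="0 0 2 * * ? *", schedule_start_time="02:00")`
- 5. `publish_task(task_id=..., task_version=...)`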
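The single-table steps above can be sketched as an ordered list of tool names and argument dictionaries. The snippet below only assembles that plan and calls nothing: the tool and parameter names are taken from the workflow in this document, while `single_table_sync_plan`, the `None` placeholders for values known only at run time, and the sink data source name `lakehouse_default` are hypothetical.

```python
def single_table_sync_plan(task_name, source_ds, source_table, sink_ds,
                           sink_schema="public", cron="0 0 2 * * ? *",
                           start_time="02:00", source_ds_type=5):
    """Return the ordered (tool, arguments) pairs for a single-table sync.

    Values only known at run time (task_id, task_version) are left as None
    placeholders to be filled in from earlier responses.
    """
    return [
        ("create_task", {"task_type": 10, "task_name": task_name}),
        ("save_integration_task", {
            "task_id": None,
            "source_datasource_name": source_ds,
            "source_table": source_table,
            "source_ds_type": source_ds_type,  # 5 = MySQL
            "sink_datasource_name": sink_ds,
            "sink_schema": sink_schema,
            "sink_table": source_table,
            "sink_ds_type": 1,  # 1 = Lakehouse
        }),
        ("save_task_configuration", {
            "task_id": None,
            "cron_express": cron,
            "schedule_start_time": start_time,
        }),
        ("publish_task", {"task_id": None, "task_version": None}),
    ]


plan = single_table_sync_plan("sync_orders_daily", "mysql_prod", "orders",
                              sink_ds="lakehouse_default")  # sink name is made up
for tool, tool_args in plan:
    print(tool, tool_args)
```

Keeping the plan as data makes it easy to review the arguments before any task is created or published.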
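-
- ### Example 2: Migrating an entire MySQL database to Lakehouse
-
- The user says: "Sync the whole MySQL ecommerce database to Lakehouse"
-
- Steps:
- 1. `create_task(task_type=291, task_name="sync_ecommerce_db")` → obtain the studio_url
- 2. In the Studio UI: choose whole-database mirroring → select the ecommerce database → configure the target workspace → set the write mode to upsert (primary-key tables)
- 3. Click "Run" to debug and verify
- 4. `save_task_configuration(cron_express="0 0 1 * * ? *")` to schedule a daily run at 1 AM
- 5. `publish_task(...)` to submit
-
- ## Troubleshooting
-
- | Problem | What to check |
- |------|---------|
- | Task creation fails | Check whether a Sync VCluster is available (use `LH_show_object_list` to view VCLUSTERS) |
- | Source connection fails | Check the connection details in the data source configuration, network reachability, and account permissions |
- | Field mapping fails | Check that the source and target field types are compatible |
- | Sync is slow | Adjust the concurrency (maximum 10) and sync rate; check the load on the source database |
- | Schema Evolution fails | Modifying primary-key fields is not supported; field types can only be widened within the same type family (int8→int16→int32→int64); cross-type conversion is not supported |
- | Some tables fail in a multi-table sync | Check each table's status in the "Sync Objects" tab of the instance details; failed tables can be rerun individually |
- | Data inconsistency in upsert mode | Confirm the target table has the correct primary-key definition; check the source data for primary-key conflicts |
-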
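- ## Notes
-
- ### Permission requirements
-
- - Source: the account configured on the data source needs SELECT permission (to read metadata and table data)
- - Target: the task owner needs CREATE and INSERT permissions
-
- ### Performance considerations
-
- - Set the concurrency sensibly to avoid putting excessive load on the source database
- - The first run has to initialize every sync object and may take a long time
- - Keep the number of tables in a single multi-table sync task within a reasonable range; too many tables hurt execution efficiency
- - Schedule runs in a time window when the source database is under light load
-
- ### Data consistency
-
- - overwrite mode: each run fully refreshes the target table's data
- - upsert mode: incremental updates based on the primary key
- - Changes made to the data between two syncs show up at the next sync
-
- ### Schema Evolution limitations (multi-table offline sync)
-
- - Modifying primary-key fields is not supported (a Lakehouse primary-key table restriction)
- - Field type changes are limited to widening within the same type family (int8 → int16 → int32 → int64)
- - Cross-type conversion is not supported (e.g. int → double)
- - Best enabled when the source schema is relatively stable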
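The type-change rule above can be expressed as a small check. The sketch below models only the integer widening chain named in the list (int8 → int16 → int32 → int64) and rejects everything else. The function name `is_supported_type_change` is hypothetical, and whether other type families have widening chains of their own is not stated here.

```python
INT_WIDENING_ORDER = ["int8", "int16", "int32", "int64"]


def is_supported_type_change(old_type, new_type):
    """True only for a same-family change that does not narrow the type."""
    if old_type not in INT_WIDENING_ORDER or new_type not in INT_WIDENING_ORDER:
        return False  # cross-family changes such as int -> double are rejected
    return INT_WIDENING_ORDER.index(new_type) >= INT_WIDENING_ORDER.index(old_type)


print(is_supported_type_change("int8", "int32"))
print(is_supported_type_change("int64", "int16"))
print(is_supported_type_change("int32", "double"))
```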
-
- ### Sync VCluster requirements
-
- - Offline sync tasks (task_type=10 and 291) must use a Sync VCluster
- - Confirm that a Sync VCluster is available before creating or scheduling a task
- - Use `LH_show_object_list` (object_type='VCLUSTERS') to list clusters, then filter for those whose vcluster_type contains SYNC