@clickzetta/cz-cli-darwin-x64 0.5.16 → 0.5.18

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (243)
  1. package/bin/cz-cli +0 -0
  2. package/bin/skills/lakehouse-doc-en/SKILL.md +6 -11
  3. package/bin/skills/lakehouse-doc-en/references/AIGateway.md +58 -13
  4. package/bin/skills/lakehouse-doc-en/references/Computation.md +1 -1
  5. package/bin/skills/lakehouse-doc-en/references/DataSource_Amazon_DocumentDB.md +3 -1
  6. package/bin/skills/lakehouse-doc-en/references/Foreach.md +14 -14
  7. package/bin/skills/lakehouse-doc-en/references/JDBC-Driver.md +0 -1
  8. package/bin/skills/lakehouse-doc-en/references/LakehouseAI-overview.md +21 -8
  9. package/bin/skills/lakehouse-doc-en/references/LakehouseDataGPT-tour.md +4 -9
  10. package/bin/skills/lakehouse-doc-en/references/LakehouseStudio-tour.md +14 -19
  11. package/bin/skills/lakehouse-doc-en/references/Lakehouse_Zilliz_MakeDataReadyforBIandAI.md +1 -1
  12. package/bin/skills/lakehouse-doc-en/references/Logstash.md +3 -3
  13. package/bin/skills/lakehouse-doc-en/references/Migrate_Spark_DataEngineeringBestPractices_Project_to_Lakehouse.md +1 -1
  14. package/bin/skills/lakehouse-doc-en/references/Notebook.md +17 -17
  15. package/bin/skills/lakehouse-doc-en/references/RemoteFunction-as-udf.md +14 -14
  16. package/bin/skills/lakehouse-doc-en/references/SQL_External_Catalog_Guide.md +1 -9
  17. package/bin/skills/lakehouse-doc-en/references/SUMMARY.md +59 -29
  18. package/bin/skills/lakehouse-doc-en/references/WINDOWFUNCTION.md +99 -57
  19. package/bin/skills/lakehouse-doc-en/references/Zettapark_Data_Engineering_Demo.md +1 -1
  20. package/bin/skills/lakehouse-doc-en/references/access-control-configuration.md +1 -8
  21. package/bin/skills/lakehouse-doc-en/references/aigw-2026-2-5-1.0.md +16 -0
  22. package/bin/skills/lakehouse-doc-en/references/aigw-2026-3-29-1.0.2.md +14 -0
  23. package/bin/skills/lakehouse-doc-en/references/aigw-2026-3-8-1.0.1.md +16 -0
  24. package/bin/skills/lakehouse-doc-en/references/aigw-2026-4-28-1.1.md +29 -0
  25. package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-12-1.1.1.md +18 -0
  26. package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-15-1.2.md +9 -0
  27. package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-21-1.3.md +9 -0
  28. package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-28-1.4.md +10 -0
  29. package/bin/skills/lakehouse-doc-en/references/aigw-2026-6-3-1.5.md +9 -0
  30. package/bin/skills/lakehouse-doc-en/references/alicloud-arn-externalid.md +0 -5
  31. package/bin/skills/lakehouse-doc-en/references/answer-accuracy-improve.md +120 -103
  32. package/bin/skills/lakehouse-doc-en/references/application-list.md +1 -3
  33. package/bin/skills/lakehouse-doc-en/references/approval-list.md +16 -17
  34. package/bin/skills/lakehouse-doc-en/references/batch-load-parquet-file-into-lakehouse.md +1 -1
  35. package/bin/skills/lakehouse-doc-en/references/batch_sync.md +9 -9
  36. package/bin/skills/lakehouse-doc-en/references/batch_sync_Sop.md +2 -2
  37. package/bin/skills/lakehouse-doc-en/references/batchloadparquetfileintoLakehouse.md +1 -1
  38. package/bin/skills/lakehouse-doc-en/references/bulkloadv1-python-sdk.md +3 -3
  39. package/bin/skills/lakehouse-doc-en/references/chart-auto-refresh-guide.md +12 -6
  40. package/bin/skills/lakehouse-doc-en/references/clickzetta-sample-data.md +3 -3
  41. package/bin/skills/lakehouse-doc-en/references/code_approval.md +1 -5
  42. package/bin/skills/lakehouse-doc-en/references/composite_task.md +31 -42
  43. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_environment_and_data_generate.md +6 -9
  44. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_javasdk_bulkload_realtime.md +4 -10
  45. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_kafka_realtime_sync.md +1 -10
  46. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_local_file_into_table_by_studio.md +0 -6
  47. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_batchload_public_network.md +0 -5
  48. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_python_node.md +2 -7
  49. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_realtime_cdc_public_network.md +13 -18
  50. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_sql_insert.md +0 -1
  51. package/bin/skills/lakehouse-doc-en/references/concepts.md +1 -1
  52. package/bin/skills/lakehouse-doc-en/references/config-datasource.md +5 -7
  53. package/bin/skills/lakehouse-doc-en/references/connect-with-cli.md +116 -72
  54. package/bin/skills/lakehouse-doc-en/references/connect-with-cz-cli.md +151 -0
  55. package/bin/skills/lakehouse-doc-en/references/continue-job.md +9 -17
  56. package/bin/skills/lakehouse-doc-en/references/create-api-connection.md +315 -286
  57. package/bin/skills/lakehouse-doc-en/references/create-catalog-connection.md +1 -0
  58. package/bin/skills/lakehouse-doc-en/references/create-dynamic-table.md +4 -4
  59. package/bin/skills/lakehouse-doc-en/references/create-external-catalog.md +85 -22
  60. package/bin/skills/lakehouse-doc-en/references/create-table-ddl.md +45 -0
  61. package/bin/skills/lakehouse-doc-en/references/creating_alicloud_privatelinkendpoint.md +4 -6
  62. package/bin/skills/lakehouse-doc-en/references/creating_alicloud_privatelinkservice.md +4 -7
  63. package/bin/skills/lakehouse-doc-en/references/creating_tencentcloud_privatelinkendpoint.md +2 -7
  64. package/bin/skills/lakehouse-doc-en/references/creating_tencentcloud_privatelinkservice.md +1 -5
  65. package/bin/skills/lakehouse-doc-en/references/cz-cli-agent.md +15 -10
  66. package/bin/skills/lakehouse-doc-en/references/cz-cli-datasource.md +0 -8
  67. package/bin/skills/lakehouse-doc-en/references/cz-cli-sql.md +2 -45
  68. package/bin/skills/lakehouse-doc-en/references/cz-cli.md +53 -42
  69. package/bin/skills/lakehouse-doc-en/references/dashboard-version-management-guide.md +12 -4
  70. package/bin/skills/lakehouse-doc-en/references/data-integration-intro.md +1 -1
  71. package/bin/skills/lakehouse-doc-en/references/data-integration.md +29 -27
  72. package/bin/skills/lakehouse-doc-en/references/data-load-summary.md +3 -3
  73. package/bin/skills/lakehouse-doc-en/references/data-quality.md +25 -25
  74. package/bin/skills/lakehouse-doc-en/references/data-sharing.md +31 -54
  75. package/bin/skills/lakehouse-doc-en/references/data-sources.md +45 -45
  76. package/bin/skills/lakehouse-doc-en/references/data_catalog.md +23 -25
  77. package/bin/skills/lakehouse-doc-en/references/data_privacy.md +5 -2
  78. package/bin/skills/lakehouse-doc-en/references/data_sharing_between_accounts_guide.md +0 -4
  79. package/bin/skills/lakehouse-doc-en/references/data_visualization.md +4 -15
  80. package/bin/skills/lakehouse-doc-en/references/dataagent.md +39 -7
  81. package/bin/skills/lakehouse-doc-en/references/databricks-delta-to-lakehouse-migration.md +168 -0
  82. package/bin/skills/lakehouse-doc-en/references/databricks-dlt-to-lakehouse-migration.md +331 -0
  83. package/bin/skills/lakehouse-doc-en/references/databricks-external-catalog-practice.md +367 -0
  84. package/bin/skills/lakehouse-doc-en/references/databricks-jobs-to-studio-migration.md +199 -0
  85. package/bin/skills/lakehouse-doc-en/references/databricks-notebook-to-studio-migration.md +350 -0
  86. package/bin/skills/lakehouse-doc-en/references/databricks-uc-governance-to-lakehouse-migration.md +327 -0
  87. package/bin/skills/lakehouse-doc-en/references/datagpt-model-config.md +34 -0
  88. package/bin/skills/lakehouse-doc-en/references/datagpt_data_source.md +50 -37
  89. package/bin/skills/lakehouse-doc-en/references/datagpt_introduction.md +55 -79
  90. package/bin/skills/lakehouse-doc-en/references/datagpt_quickstart.md +50 -64
  91. package/bin/skills/lakehouse-doc-en/references/datalake-acceleration.md +75 -2
  92. package/bin/skills/lakehouse-doc-en/references/dbt-databricks-to-clickzetta-migration.md +242 -0
  93. package/bin/skills/lakehouse-doc-en/references/dynamic-mask.md +30 -30
  94. package/bin/skills/lakehouse-doc-en/references/dynamic-table-bestpractice.md +1 -1
  95. package/bin/skills/lakehouse-doc-en/references/dynamic-table-introduce.md +1 -1
  96. package/bin/skills/lakehouse-doc-en/references/dynamic_table_summary.md +1 -1
  97. package/bin/skills/lakehouse-doc-en/references/eco_integration/streamlit.md +1 -1
  98. package/bin/skills/lakehouse-doc-en/references/eco_integration/superset.md +1 -1
  99. package/bin/skills/lakehouse-doc-en/references/ecosystem-all.md +1 -3
  100. package/bin/skills/lakehouse-doc-en/references/ecosystem.md +145 -0
  101. package/bin/skills/lakehouse-doc-en/references/external-catalog-summary.md +33 -38
  102. package/bin/skills/lakehouse-doc-en/references/external-function-combo-practice.md +466 -0
  103. package/bin/skills/lakehouse-doc-en/references/f6fc6447ee.md +7 -9
  104. package/bin/skills/lakehouse-doc-en/references/federation-query.md +56 -6
  105. package/bin/skills/lakehouse-doc-en/references/finebi-mysql.md +2 -0
  106. package/bin/skills/lakehouse-doc-en/references/get-started-with-sample-data.md +10 -11
  107. package/bin/skills/lakehouse-doc-en/references/gitfolder.md +2 -3
  108. package/bin/skills/lakehouse-doc-en/references/grant-privileges.md +2 -0
  109. package/bin/skills/lakehouse-doc-en/references/iceberg-rest-catalog-databricks.md +166 -0
  110. package/bin/skills/lakehouse-doc-en/references/ide.md +1 -1
  111. package/bin/skills/lakehouse-doc-en/references/if_else_task.md +59 -57
  112. package/bin/skills/lakehouse-doc-en/references/input_output.md +10 -7
  113. package/bin/skills/lakehouse-doc-en/references/jobprofile-bestpractices.md +60 -64
  114. package/bin/skills/lakehouse-doc-en/references/kafka-connection.md +0 -1
  115. package/bin/skills/lakehouse-doc-en/references/key-concepts.md +146 -117
  116. package/bin/skills/lakehouse-doc-en/references/lakehouse-ai-gateway-cz-cli.md +317 -0
  117. package/bin/skills/lakehouse-doc-en/references/lakehouse-ai-sql-analysis.md +345 -0
  118. package/bin/skills/lakehouse-doc-en/references/lakehouse-dqc-guide.md +300 -0
  119. package/bin/skills/lakehouse-doc-en/references/lakehouse-medallion-sql-dt-guide.md +543 -0
  120. package/bin/skills/lakehouse-doc-en/references/lakehouse-multi-cloud-acceleration.md +274 -0
  121. package/bin/skills/lakehouse-doc-en/references/lakehouse-multimodal-ai-pipeline.md +198 -0
  122. package/bin/skills/lakehouse-doc-en/references/lakehouse-quick-experience_guide.md +49 -52
  123. package/bin/skills/lakehouse-doc-en/references/lakehouse-volume-pipe-acceleration-guide.md +380 -0
  124. package/bin/skills/lakehouse-doc-en/references/langchain-plug-installation.md +1 -1
  125. package/bin/skills/lakehouse-doc-en/references/management.md +4 -9
  126. package/bin/skills/lakehouse-doc-en/references/medallion-lakehouse-from-scratch.md +2 -1
  127. package/bin/skills/lakehouse-doc-en/references/metrics_answer_build.md +58 -21
  128. package/bin/skills/lakehouse-doc-en/references/migrate-spark-data-engineering-best-practices-to-lakehouse.md +1 -1
  129. package/bin/skills/lakehouse-doc-en/references/mindsdb.md +1 -1
  130. package/bin/skills/lakehouse-doc-en/references/monitoring_and_alerting.md +65 -60
  131. package/bin/skills/lakehouse-doc-en/references/monitoring_item_specification.md +33 -33
  132. package/bin/skills/lakehouse-doc-en/references/multitable_batch_sync.md +16 -16
  133. package/bin/skills/lakehouse-doc-en/references/multitable_realtime_sync.md +65 -72
  134. package/bin/skills/lakehouse-doc-en/references/multitable_realtime_sync_sop.md +54 -52
  135. package/bin/skills/lakehouse-doc-en/references/navicat-mysql.md +2 -0
  136. package/bin/skills/lakehouse-doc-en/references/om-dynamic-table.md +71 -66
  137. package/bin/skills/lakehouse-doc-en/references/om-vcluster.md +2 -0
  138. package/bin/skills/lakehouse-doc-en/references/open-api-create-session.md +79 -0
  139. package/bin/skills/lakehouse-doc-en/references/open-api-generate-auth-token.md +63 -0
  140. package/bin/skills/lakehouse-doc-en/references/open-api-overview.md +96 -0
  141. package/bin/skills/lakehouse-doc-en/references/open-api-quick-start.md +286 -0
  142. package/bin/skills/lakehouse-doc-en/references/open-api-response-guide.md +264 -0
  143. package/bin/skills/lakehouse-doc-en/references/open-api-safe-question-poll.md +201 -0
  144. package/bin/skills/lakehouse-doc-en/references/open-api-text2insight-query.md +99 -0
  145. package/bin/skills/lakehouse-doc-en/references/open-api-text2insight-stop.md +74 -0
  146. package/bin/skills/lakehouse-doc-en/references/overview.md +6 -7
  147. package/bin/skills/lakehouse-doc-en/references/permission-application.md +5 -5
  148. package/bin/skills/lakehouse-doc-en/references/pipe-introduction.md +1 -0
  149. package/bin/skills/lakehouse-doc-en/references/pipe-kafka-table-stream.md +72 -70
  150. package/bin/skills/lakehouse-doc-en/references/pipe-kafka.md +105 -110
  151. package/bin/skills/lakehouse-doc-en/references/pipe-overview.md +40 -40
  152. package/bin/skills/lakehouse-doc-en/references/pipe-storage-object.md +43 -48
  153. package/bin/skills/lakehouse-doc-en/references/pipe-summary.md +14 -4
  154. package/bin/skills/lakehouse-doc-en/references/pipe-syntax.md +58 -151
  155. package/bin/skills/lakehouse-doc-en/references/practice_python_task.md +4 -4
  156. package/bin/skills/lakehouse-doc-en/references/pricing-ai-gateway.md +181 -0
  157. package/bin/skills/lakehouse-doc-en/references/pricing-lakehouse.md +316 -0
  158. package/bin/skills/lakehouse-doc-en/references/pricing.md +44 -288
  159. package/bin/skills/lakehouse-doc-en/references/private-link-general.md +0 -2
  160. package/bin/skills/lakehouse-doc-en/references/pyspark-to-zettapark-migration-f1.md +1 -1
  161. package/bin/skills/lakehouse-doc-en/references/python-igs.md +7 -3
  162. package/bin/skills/lakehouse-doc-en/references/python-sample-put-github-rt-events.md +1 -1
  163. package/bin/skills/lakehouse-doc-en/references/python-task.md +1 -1
  164. package/bin/skills/lakehouse-doc-en/references/python_reference/connector.md +3 -3
  165. package/bin/skills/lakehouse-doc-en/references/python_reference/connector_advanced.md +2 -2
  166. package/bin/skills/lakehouse-doc-en/references/python_reference/connector_examples.md +2 -2
  167. package/bin/skills/lakehouse-doc-en/references/python_sdk_guide.md +1 -1
  168. package/bin/skills/lakehouse-doc-en/references/python_shell_datasource.md +11 -9
  169. package/bin/skills/lakehouse-doc-en/references/quick_start_batch_sync_data.md +9 -18
  170. package/bin/skills/lakehouse-doc-en/references/quick_start_bi_analysis.md +8 -25
  171. package/bin/skills/lakehouse-doc-en/references/quick_start_create_workspace.md +4 -6
  172. package/bin/skills/lakehouse-doc-en/references/quick_start_data_quality.md +8 -8
  173. package/bin/skills/lakehouse-doc-en/references/quick_start_etl.md +16 -20
  174. package/bin/skills/lakehouse-doc-en/references/quick_start_monitoring_and_alerting.md +10 -18
  175. package/bin/skills/lakehouse-doc-en/references/quick_start_sql_query.md +7 -10
  176. package/bin/skills/lakehouse-doc-en/references/quick_start_upload_data.md +5 -7
  177. package/bin/skills/lakehouse-doc-en/references/quick_start_user_management.md +8 -8
  178. package/bin/skills/lakehouse-doc-en/references/quick_start_workspace.md +0 -5
  179. package/bin/skills/lakehouse-doc-en/references/quick_start_workspace_user.md +8 -8
  180. package/bin/skills/lakehouse-doc-en/references/quickstart.md +69 -56
  181. package/bin/skills/lakehouse-doc-en/references/quickstart_datashare_between_companies.md +0 -5
  182. package/bin/skills/lakehouse-doc-en/references/quickstart_envirment_for_team.md +0 -24
  183. package/bin/skills/lakehouse-doc-en/references/realtime-pipeline-selection-guide.md +1 -2
  184. package/bin/skills/lakehouse-doc-en/references/realtime-sales-dashboard-with-dynamic-table.md +3 -3
  185. package/bin/skills/lakehouse-doc-en/references/realtime_sync.md +0 -1
  186. package/bin/skills/lakehouse-doc-en/references/release-note-2026-05-19.md +5 -3
  187. package/bin/skills/lakehouse-doc-en/references/revoke-privileges.md +3 -1
  188. package/bin/skills/lakehouse-doc-en/references/roles.md +2 -3
  189. package/bin/skills/lakehouse-doc-en/references/row-filter.md +165 -0
  190. package/bin/skills/lakehouse-doc-en/references/row_level_permission.md +30 -19
  191. package/bin/skills/lakehouse-doc-en/references/scheduled_task.md +28 -21
  192. package/bin/skills/lakehouse-doc-en/references/security_overview.md +99 -21
  193. package/bin/skills/lakehouse-doc-en/references/set-command.md +1 -1
  194. package/bin/skills/lakehouse-doc-en/references/setup.md +13 -15
  195. package/bin/skills/lakehouse-doc-en/references/show-grants.md +1 -1
  196. package/bin/skills/lakehouse-doc-en/references/snowflake-dynamic-tables-to-lakehouse.md +2 -2
  197. package/bin/skills/lakehouse-doc-en/references/spark-connector-summary.md +1 -1
  198. package/bin/skills/lakehouse-doc-en/references/sql_functions/context_functions/current_vcluster.md +1 -1
  199. package/bin/skills/lakehouse-doc-en/references/sso-configuration.md +2 -2
  200. package/bin/skills/lakehouse-doc-en/references/streaming_pipeline_with_dynamic_table.md +0 -1
  201. package/bin/skills/lakehouse-doc-en/references/studio-incremental-sync-practice.md +27 -23
  202. package/bin/skills/lakehouse-doc-en/references/studio-shell-task.md +1 -1
  203. package/bin/skills/lakehouse-doc-en/references/supported-cloud-platforms.md +32 -0
  204. package/bin/skills/lakehouse-doc-en/references/table_rendering.md +18 -12
  205. package/bin/skills/lakehouse-doc-en/references/task-develop.md +89 -91
  206. package/bin/skills/lakehouse-doc-en/references/task_development.md +19 -17
  207. package/bin/skills/lakehouse-doc-en/references/task_group.md +16 -14
  208. package/bin/skills/lakehouse-doc-en/references/task_instance.md +21 -21
  209. package/bin/skills/lakehouse-doc-en/references/task_param.md +38 -35
  210. package/bin/skills/lakehouse-doc-en/references/task_param_reference.md +81 -79
  211. package/bin/skills/lakehouse-doc-en/references/task_scheduling_dependency.md +20 -21
  212. package/bin/skills/lakehouse-doc-en/references/tencentcloud_arn_and_externalid.md +1 -5
  213. package/bin/skills/lakehouse-doc-en/references/trial-account-quotas-and-limits.md +1 -3
  214. package/bin/skills/lakehouse-doc-en/references/tutorial_connect_to_lakehouse.md +69 -0
  215. package/bin/skills/lakehouse-doc-en/references/tutorials.md +4 -1
  216. package/bin/skills/lakehouse-doc-en/references/unique-key.md +167 -0
  217. package/bin/skills/lakehouse-doc-en/references/usageandbillingview.md +138 -0
  218. package/bin/skills/lakehouse-doc-en/references/use-dbt-dev.md +3 -3
  219. package/bin/skills/lakehouse-doc-en/references/use-java-sdk-realtime-uploaddata.md +1 -1
  220. package/bin/skills/lakehouse-doc-en/references/use-java-sdk-upload-data-local.md +3 -3
  221. package/bin/skills/lakehouse-doc-en/references/use-models.md +128 -0
  222. package/bin/skills/lakehouse-doc-en/references/use-mysql-client.md +81 -81
  223. package/bin/skills/lakehouse-doc-en/references/use-python-sdk-upload-data.md +10 -12
  224. package/bin/skills/lakehouse-doc-en/references/user-identification.md +2 -3
  225. package/bin/skills/lakehouse-doc-en/references/user_permission_grand_guide.md +1 -1
  226. package/bin/skills/lakehouse-doc-en/references/using-udf-in-dynamic-table.md +1 -1
  227. package/bin/skills/lakehouse-doc-en/references/vc_cache.md +18 -22
  228. package/bin/skills/lakehouse-doc-en/references/vcluster_size_description.md +33 -31
  229. package/bin/skills/lakehouse-doc-en/references/virtual-cluster.md +43 -45
  230. package/bin/skills/lakehouse-doc-en/references/web-job-history.md +94 -108
  231. package/bin/skills/lakehouse-doc-en/references/web_search.md +16 -7
  232. package/bin/skills/lakehouse-doc-en/references/zettapark-data-engineering-demo.md +1 -1
  233. package/bin/skills/lakehouse-doc-en/references/zettapark-dataframe-guide.md +144 -70
  234. package/bin/skills/lakehouse-doc-en/references/zettapark-dynamic-table-guide.md +2 -2
  235. package/bin/skills/lakehouse-doc-en/references/zettapark-etl-guide.md +73 -33
  236. package/bin/skills/lakehouse-doc-en/references/zettapark-feature-engineering.md +2 -2
  237. package/bin/skills/lakehouse-doc-en/references/zettapark-functions-guide.md +75 -46
  238. package/bin/skills/lakehouse-doc-en/references/zettapark-quick-start.md +2 -2
  239. package/bin/skills/lakehouse-doc-en/references/zettapark-stream-guide.md +4 -4
  240. package/bin/skills/lakehouse-doc-en/references/zettapark-volume-guide.md +93 -29
  241. package/package.json +1 -1
  242. package/bin/skills/lakehouse-doc-en/references/CLAUDE.md +0 -606
  243. package/bin/skills/lakehouse-doc-en/references/modelprice.md +0 -155
@@ -30,7 +30,6 @@ The core working principle of Data Share is:
30
30
 
31
31
  **Data Sharing Flow**:
32
32
 
33
- :-: ![](.topwrite/assets/image_1747735731291.png =322)
34
33
 
35
34
  The data provider retains full control and can update or remove shared data at any time. When source data is updated, the consumer immediately obtains the latest data without additional data sync operations.
36
35
 
@@ -69,7 +68,6 @@ ALTER SHARE taxi_data_share ADD INSTANCE target_instance_name;
69
68
  5. Under **Receiving Instances**, click **Add** and enter the consumer's service instance name.
70
69
  6. Click **Confirm** to complete creation.
71
70
 
72
- :-: ![](.topwrite/assets/image_1747742254855.png =699)
73
71
 
74
72
  #### 4.1.3 Creating a View for Partial Data Sharing
75
73
 
@@ -140,7 +138,6 @@ DESC SHARE source_instance_name.taxi_data_share;
140
138
  2. In the left navigation, select **Data Management** -> **Data Share**.
141
139
  3. Switch to the **Shared with Me** tab to see all received shares.
142
140
 
143
- :-: ![](.topwrite/assets/image_1747742432405.png =757)
144
141
 
145
142
  ### 5.2 Creating a Local Schema Linked to Shared Data
146
143
 
@@ -165,7 +162,6 @@ Where:
165
162
  3. In the popup, select the **Source Schema** and specify the schema name to be created locally.
166
163
  4. Click **Confirm** to complete the extraction.
167
164
 
168
- :-: ![](.topwrite/assets/image_1747742473685.png =373)
169
165
 
170
166
  ### 5.3 Using Shared Data
171
167
 
@@ -21,8 +21,7 @@ After clicking "Chart" to visualize the worksheet results, you need to select th
21
21
 
22
22
  Hover over the chart to view detailed information for each data point. For example, you can view results as a line chart:
23
23
 
24
- ![](.topwrite/assets/image_1742200992399.png =757)
25
-
24
+ ^
26
25
  ^
27
26
 
28
27
  In the "Settings" panel on the right side of the visualization area, configure what to display:
@@ -35,7 +34,7 @@ In the "Settings" panel on the right side of the visualization area, configure w
35
34
 
36
35
  3. Y-axis field:
37
36
 
38
- ![](.topwrite/assets/image_1742201014009.png =347)
37
+ ^
39
38
 
40
39
  The Y-axis supports aggregate functions to derive a single value from multiple data points. The available aggregation methods are:
41
40
 
@@ -59,8 +58,7 @@ When there are many X-axis values and you need to view a specific data point on
59
58
 
60
59
  For example, in the chart below, the accurate value is shown only when hovering over the visualization and the specific timestamp information appears.
61
60
 
62
- ![](.topwrite/assets/image_1742201080540.png =387)
63
-
61
+ ^
64
62
  ^
65
63
 
66
64
  ## Use Cases
@@ -75,16 +73,7 @@ For example, in the chart below, the accurate value is shown only when hovering
75
73
  select order_date, count(*) as c from big_data_table group by order_date;
76
74
  ```
77
75
 
78
- ![](.topwrite/assets/20250317-190632.jpeg =617)
79
-
80
76
  ^
81
-
82
- ![](.topwrite/assets/20250317-190714.jpeg =607)
83
-
84
- ^
85
-
86
- ![](.topwrite/assets/20250317-190759.jpeg =608)
87
-
88
77
  ^
89
78
 
90
79
  **Scenario 2**: To **ignore** time span differences and keep only the specific result data points, cast the result to a string (`order_date::string`):
@@ -93,4 +82,4 @@ select order_date, count(*) as c from big_data_table group by order_date;
93
82
  select order_date::string, count(*) as c from big_data_table group by order_date;
94
83
  ```
95
84
 
96
- ![](.topwrite/assets/20250317-190854.jpeg =654)
85
+ ^
@@ -1,9 +1,11 @@
1
- ## What is Data Agent
1
+ ## What is Data Engineering Agent
2
2
 
3
- Data Agent is an AI-powered agent built on top of Singdata Lakehouse and Studio. It covers the full lifecycle of "development, operations, and governance" and implements intelligent platform upgrades through an Agentic AIOps philosophy — transforming data development from "people operating the platform" to "people directing the agent."
3
+ Data Engineering Agent is an AI-powered agent built on top of Singdata Lakehouse and Studio. It covers the full lifecycle of "development, operations, and governance" and implements intelligent platform upgrades through an Agentic AIOps philosophy — transforming data development from "people operating the platform" to "people directing the agent."
4
4
 
5
5
  Data Agent is not just a tool that makes data teams more productive. It is a **data intelligence collaboration system** that enables everyone in the company to work with data.
6
6
 
7
+ ^
8
+
7
9
  ## User Value
8
10
 
9
11
  * **Higher productivity: reclaim 80% of your time for what truly matters**
@@ -42,18 +44,18 @@ For example:
42
44
  **High cost of understanding standards** Each business domain has its own layering rules, naming conventions, and field standards, scattered across various documents. Engineers must "catch up" before taking on any new requirement, and even minor oversights get flagged in reviews, keeping rework costs high.
43
45
 
44
46
  > Example prompt:
45
- > I need to design a Medallion architecture data warehouse based on this metric requirements spec to support GMV analysis. I've already planned the tables for each layer: [Bronze layer] xxx [Silver layer] xxx [Gold layer] xxx. Based on this table list, please generate a data warehouse modeling standards document.
47
+ > I need to design a Medallion architecture data warehouse based on this metric requirements spec to support GMV analysis. I've already planned the tables for each layer: \[Bronze layer] xxx \[Silver layer] xxx \[Gold layer] xxx. Based on this table list, please generate a data warehouse modeling standards document.
46
48
 
47
- ![](/.topwrite/assets/image_1779715425568.png)
49
+ ^
48
50
 
49
51
  ### Scenario 2: Ad-hoc Data Retrieval
50
52
 
51
53
  **Everything waits in the queue** Exploratory analysis, market research, and other ad-hoc requests are naturally lower priority and get perpetually pushed aside by formal requests. By the time the data finally arrives, the decision window has often closed and the business has already fallen behind the market. The core problem: ad-hoc analysis has no self-service path — it must go through the data team, which simply doesn't have the bandwidth to continuously handle low-priority requests.
52
54
 
53
55
  > Example prompt:
54
- > Query brazilianecommerce.olist_orders and count orders by day.
56
+ > Query brazilianecommerce.olist\_orders and count orders by day.
55
57
 
56
- ![](/.topwrite/assets/image_1779715858474.png)
58
+ ^
57
59
 
58
60
  ### Scenario 3: Day-to-day Operations
59
61
 
@@ -77,4 +79,34 @@ Daily task operations are the most critical routine work on an enterprise data p
77
79
  > Please help me analyze which instances failed in the past week.
78
80
  > For the task with instance ID xxx, what was the failure reason and which downstream tasks were affected?
79
81
 
80
- ![](/.topwrite/assets/image_1779715663695.png)
82
+ ^
83
+
84
+ ### Scenario 4: Studio Task Development and Management
85
+
86
+ Studio tasks are the core scheduling unit of the Lakehouse data pipeline, but traditional management is cumbersome: creating a task requires entering the IDE and configuring step by step, modifying a schedule requires locating the task and opening the schedule panel, and dependency configuration requires manually maintaining task IDs.
87
+
88
+ Data Agent operates the full lifecycle of Studio tasks directly through natural language:
89
+
90
+ * **Create tasks**: describe the task logic, and the agent automatically generates SQL or Python code, creates the task, and configures the Virtual Cluster
91
+ * **Schedule configuration**: tell the agent "run at 2 AM every day" and it automatically converts this to a cron expression and applies it
92
+ * **Dependency orchestration**: describe the dependencies between tasks, and the agent automatically configures upstream and downstream dependency chains, avoiding manual task ID lookups
93
+ * **Batch operations**: publish multiple tasks or bulk-update the retry strategy for a category of tasks in a single instruction
94
+
95
+ > Example prompt:
96
+ > Create a Python task that runs at 3 AM every day, triggers after the ods\_order\_load task completes, and aggregates yesterday's order data into dws\_order\_daily.
97
+
98
+ ### Scenario 5: Data Source Management
99
+
100
+ Onboarding enterprise data is the first mile of data engineering, involving connection configuration, sync strategy, status monitoring, and more — manual operations are error-prone and hard to track.
101
+
102
+ Data Agent supports the following data source management operations:
103
+
104
+ * **Quick onboarding**: describe the database type and connection details, and the agent automatically creates the data source and tests connectivity
105
+ * **Sync configuration**: specify the source and target tables, and the agent selects a full or incremental (CDC) sync strategy based on business needs
106
+ * **Status queries**: ask in a single sentence to get sync delays, recent failure records, and data volume statistics for all data sources
107
+ * **Troubleshooting**: when a sync task fails, the agent automatically pulls error logs, analyzes the root cause, and recommends fix steps
108
+
109
+ > Example prompt:
110
+ > Help me check which data sources currently have sync delays exceeding 30 minutes, and which ones had sync failures in the last 24 hours.
111
+
112
+ ^
@@ -0,0 +1,168 @@
1
+ # Databricks Delta Tables → Lakehouse Migration Guide
2
+
3
+ Data on Databricks can almost entirely be migrated to Singdata Lakehouse. There are two complementary paths: **federated direct read** (data stays in place, query Databricks tables directly from Lakehouse) and **Studio built-in sync** (move various table types into Lakehouse native format). All 7 table types have been tested and pass, with row counts and field values fully consistent.
4
+
5
+ Full code on GitHub: [databricks2lakehouse-delta](https://github.com/clickzetta/databricks2lakehouse-delta)
6
+
7
+ ---
8
+
9
+ ## Conclusion First
10
+
11
+ **Two paths** — selection rule: if federated read is available (External Delta), use federation + CTAS; for everything else, use Studio built-in sync.
12
+
13
+ | Table Type | Federated Read | Studio Sync | Recommended Path |
14
+ |---|:---:|:---:|---|
15
+ | **External Delta** | ✅ Tested | ✅ Tested | Federated read (no data movement) or CTAS landing |
16
+ | **Managed Delta** | ❌ | ✅ Tested | Studio built-in sync |
17
+ | **Parquet External** | ❌ | ✅ Tested | Studio built-in sync |
18
+ | **CSV External** | ❌ | ✅ Tested | Studio built-in sync |
19
+ | **JSON External** | ❌ | ✅ Tested | Studio built-in sync |
20
+ | **Managed Iceberg** | ❌ | ✅ Tested | Studio built-in sync |
21
+ | **View** | ❌ | ✅ Tested (materialized as table) | Studio built-in sync |
22
+
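The selection rule above can be sketched as a small lookup. This is a minimal illustration rather than part of any migration tooling: the function name `recommend_path` and the set names are hypothetical, and the table-type labels are copied from the table above.

```python
# Sketch of the path-selection rule. Names here are illustrative only.
FEDERATED_TYPES = {"External Delta"}
STUDIO_SYNC_TYPES = {
    "External Delta", "Managed Delta", "Parquet External", "CSV External",
    "JSON External", "Managed Iceberg", "View",
}


def recommend_path(table_type: str) -> str:
    """Return the recommended migration path for a Databricks table type."""
    if table_type not in STUDIO_SYNC_TYPES:
        raise ValueError(f"table type not covered by this guide: {table_type}")
    if table_type in FEDERATED_TYPES:
        return "federated read (optionally followed by CTAS)"
    return "Studio built-in sync"


if __name__ == "__main__":
    for table_type in sorted(STUDIO_SYNC_TYPES):
        print(f"{table_type:18} -> {recommend_path(table_type)}")
```

Raising on unknown types is a deliberate choice in this sketch: a table type outside the tested set should be reviewed by hand rather than silently routed to one path.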
23
+ ---
24
+
25
+ ## Technical Background
26
+
27
+ Databricks uses Unity Catalog three-level naming (`catalog.schema.table`), with data stored in two categories:
28
+ - **External tables**: Data files reside in the customer's own S3/ADLS; Databricks only manages metadata
29
+ - **Managed tables**: Data files reside in Databricks managed storage, inaccessible externally
30
+
31
+ This difference determines the scope of federated reads — only External Delta tables can be directly queried via Lakehouse External Catalog federation. Managed tables must be migrated via Studio sync jobs.
32
+
33
+ ---
34
+
35
+ ![](.topwrite/assets/anim-33-databricks-delta-migration.svg)
36
+
37
+ ---
38
+
39
+ ## Path A: Federated Direct Read (External Delta Only)
40
+
41
+ Federated read queries Databricks tables directly via Lakehouse External Catalog with **no data movement**, suitable for parallel access during transition or PoC validation.
42
+
43
+ ### Configuration (one-time)
44
+
45
+ ```sql
46
+ -- Step 1: Create a Catalog Connection to Databricks
47
+ CREATE CATALOG CONNECTION IF NOT EXISTS databricks_conn
48
+ TYPE = DATABRICKS_UNITY_CATALOG
49
+ PROPERTIES (
50
+ 'host' = 'https://dbc-xxxx.cloud.databricks.com',
51
+ 'catalog' = 'workspace'
52
+ -- Authentication parameters configured via Studio UI
53
+ );
54
+
55
+ -- Step 2: Create External Catalog (federation entry point)
56
+ CREATE EXTERNAL CATALOG IF NOT EXISTS databricks_new_catalog
57
+ USING CONNECTION databricks_conn;
58
+ ```
59
+
60
+ ### Federated Query
61
+
62
+ ```sql
63
+ -- Query Databricks tables directly (cross-region/cross-cloud supported, via public internet)
64
+ SELECT * FROM databricks_new_catalog.table_types_demo.customers_external;
65
+ SELECT COUNT(*) FROM databricks_new_catalog.table_types_demo.orders_external;
66
+
67
+ -- CTAS: land the federated table as a Lakehouse native table (optional)
68
+ CREATE TABLE delta_migration.customers AS
69
+ SELECT * FROM databricks_new_catalog.table_types_demo.customers_external;
70
+ ```
71
+
72
+ ### Federated Read Limitations
73
+
74
+ Federated read **only supports External Delta format**. Test results:
75
+
76
+ | Table Tested | Format | Result |
77
+ |---|---|---|
78
+ | `customers_external` | External Delta | ✅ 7 rows, schema/types correct |
79
+ | `orders_external` | External Delta | ✅ 8 rows, aggregation correct |
80
+ | `inventory_delta` | External Delta | ✅ 4 rows |
81
+ | `customers_managed` | Managed Delta | ❌ Data in Databricks managed storage, no external access |
82
+ | `products_parquet` | External Parquet | ❌ `unsupported databricks table format [PARQUET]` |
83
+ | `suppliers_csv` | External CSV | ❌ `unsupported databricks table format [CSV]` |
84
+ | `shipments_iceberg` | Managed Iceberg | ❌ S3 400 (data in Databricks managed bucket) |
85
+ | `customer_orders_view` | View | ❌ Not supported |
86
+
87
+ ---
88
+
89
+ ## Path B: Studio Built-in Sync (Handles All Table Types)
90
+
91
+ Studio has a built-in Databricks data source with visual sync task configuration to move all table types into Lakehouse native format.
92
+
93
+ ### Configuration Steps
94
+
95
+ 1. Studio UI → Data Integration → New Data Source → Select Databricks
96
+ 2. Enter Databricks Workspace URL + authentication info (Service Principal)
97
+ 3. New sync task → Select source tables → Configure target schema → Execute
98
+
99
+ ### Test Results (All 7 Table Types SUCCEED)
100
+
101
+ Sync tasks were run against all table types in the `table_types_demo` schema in a real Databricks environment (AWS us-east-1, Unity Catalog). All succeeded:
102
+
103
+ | Source Table | Type | Result | Duration | Source Rows | Target Rows | Consistent |
104
+ |---|---|:---:|:---:|:---:|:---:|:---:|
105
+ | `customers_managed` | Managed Delta | ✅ | ~88s | 7 | 7 | ✅ |
106
+ | `customers_external` | External Delta | ✅ | ~84s | 7 | 7 | ✅ |
107
+ | `products_parquet` | Parquet | ✅ | ~87s | 5 | 5 | ✅ |
108
+ | `suppliers_csv` | CSV | ✅ | ~93s | 3 | 3 | ✅ |
109
+ | `product_reviews_json` | JSON | ✅ | ~109s | 3 | 3 | ✅ |
110
+ | `shipments_iceberg` | Managed Iceberg | ✅ | ~92s | 5 | 5 | ✅ |
111
+ | `customer_orders_view` | View | ✅ | ~86s | 8 | 8 | ✅ |
112
+
113
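+ > In these tests a single task took 84–109 seconds (including sync cluster cold start). With only 3 to 8 rows per table, that time is dominated by fixed startup overhead rather than data volume. For similarly small tables, estimate total migration time linearly based on table count.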
114
+
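The linear estimate can be written down as a one-line calculation. The sketch below is a rough planning aid under stated assumptions: the default of 109 seconds is the slowest task observed in the table above, and the `parallel_tasks` parameter is hypothetical, since this guide does not state whether sync tasks can run concurrently.

```python
import math


def estimate_migration_seconds(
    table_count: int,
    per_task_seconds: float = 109.0,  # slowest observed task in the test table
    parallel_tasks: int = 1,          # assumption: concurrency is not confirmed by this guide
) -> float:
    """Estimate wall-clock time for small tables, where startup overhead dominates."""
    if table_count < 0:
        raise ValueError("table_count must be non-negative")
    if parallel_tasks < 1:
        raise ValueError("parallel_tasks must be at least 1")
    waves = math.ceil(table_count / parallel_tasks)
    return waves * per_task_seconds


if __name__ == "__main__":
    print(estimate_migration_seconds(7))          # the 7 tables tested above, run one by one
    print(estimate_migration_seconds(7, 84.0))    # same tables at the fastest observed duration
```

For large tables the per-task time is likely to grow with data volume, so treat the result as a lower bound in that case.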
115
+ ### Data Consistency Verification
116
+
117
+ **Field-level row-by-row comparison** was performed on 5 tables (Databricks SDK reads source, cz-cli reads target). Findings:
118
+
119
+ - `products_parquet`, `product_reviews_json`, `shipments_iceberg`: **All fields fully consistent**
120
+ - `customers_external`, `suppliers_csv`: all other fields consistent; the `email` / `contact_email` fields show masked values on the Lakehouse side (e.g., `a***@example.com`)
121
+
122
+ > Masking is the Lakehouse workspace's **read-time column masking policy** automatically matching by column name. The data itself is migrated intact. To see plaintext values, use a role with appropriate permissions or query in a schema without a masking policy bound.
123
+
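A field-level comparison of this kind can be sketched in plain Python once both sides have been fetched as lists of dictionaries. This is an illustrative sketch, not the script used for the tests above: the function name `compare_tables`, the `masked_columns` parameter, and the sample rows are assumptions, and fetching the rows (Databricks SDK on the source side, cz-cli on the target side) is left out.

```python
from typing import Any, Dict, FrozenSet, List


def compare_tables(
    source_rows: List[Dict[str, Any]],
    target_rows: List[Dict[str, Any]],
    key: str,
    masked_columns: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return human-readable mismatches between source and target rows, matched by key."""
    problems: List[str] = []
    src = {row[key]: row for row in source_rows}
    tgt = {row[key]: row for row in target_rows}

    for k in sorted(src.keys() - tgt.keys()):
        problems.append(f"missing in target: {key}={k}")
    for k in sorted(tgt.keys() - src.keys()):
        problems.append(f"unexpected in target: {key}={k}")

    for k in sorted(src.keys() & tgt.keys()):
        for column, expected in src[k].items():
            if column in masked_columns:
                continue  # masked at read time, so the displayed value legitimately differs
            actual = tgt[k].get(column)
            if actual != expected:
                problems.append(f"{key}={k} column {column}: {expected!r} != {actual!r}")
    return problems


if __name__ == "__main__":
    source = [{"customer_id": 1, "name": "Alice", "email": "alice@example.com"}]
    target = [{"customer_id": 1, "name": "Alice", "email": "a***@example.com"}]
    print(compare_tables(source, target, "customer_id"))
    print(compare_tables(source, target, "customer_id", frozenset({"email"})))
```

Passing the masked column names explicitly keeps the masking policy from being reported as a data mismatch while still checking every other field.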
124
+ ---
125
+
126
+ ## Delta-Specific Feature Handling
127
+
128
+ | Feature | Migration Impact |
129
+ |---|---|
130
+ | **Deletion Vectors (DV)** | ✅ Federated read correctly recognizes DVs; deleted rows are not returned. Data state is consistent after sync to Lakehouse |
131
+ | **Change Data Feed (CDF)** | Current data syncs normally. CDF incremental consumption interface (`table_changes()`) has no equivalent in Lakehouse → rebuild incremental pipelines using Lakehouse Table Stream |
132
+ | **Liquid Clustering** | Does not affect data content. After migration to Lakehouse, rebuild layout optimization using Lakehouse `CLUSTER BY` or indexing mechanisms |
133
+
134
+ ---
135
+
136
+ ## Connectivity Notes
137
+
138
+ - **Cross-region/cross-cloud supported**: Both federated read and Studio sync use the public internet. Tested configuration: Lakehouse AWS Singapore ↔ Databricks AWS us-east-1 (cross-region), connectivity confirmed
139
+ - **Only hard limitation**: `COPY INTO`/`Pipe` object storage connections cannot cross cloud providers (e.g., an Alibaba Cloud instance cannot connect to AWS S3), but this affects direct S3 file reads, not reading tables via Databricks External Catalog
140
+
141
+ ---
142
+
143
+ ## Notes
144
+
145
+ - **Managed tables cannot be federated**: Managed Delta/Iceberg data resides in Databricks managed storage with no external access — must use Studio sync
146
+ - **Parquet/CSV/JSON External tables cannot be federated**: External Catalog currently only supports Delta format; use Studio sync for other formats
147
+ - **Array columns not yet supported for sync**: Columns with `ARRAY<...>` types are not yet supported by Studio sync tasks. Convert them to STRING using `to_json()` on the source side before syncing
148
+ - **Configure sync tasks table by table**: Studio's built-in Databricks data source does not yet support bulk schema-wide sync; configure each table individually. Scripts can be used to batch-generate configurations
149
+
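The `to_json()` workaround for array columns lends itself to script generation. The sketch below builds a source-side `SELECT` that wraps `ARRAY<...>` columns in `to_json()`, which could back a Databricks view that is then synced instead of the raw table. The function name `build_source_select` and the example column names are hypothetical, and identifiers are assumed to need no quoting.

```python
from typing import Dict


def build_source_select(table: str, columns: Dict[str, str]) -> str:
    """Build a SELECT that converts ARRAY columns to JSON strings.

    columns maps column name -> Databricks type string, in column order.
    """
    expressions = []
    for name, dtype in columns.items():
        if dtype.strip().upper().startswith("ARRAY<"):
            # Studio sync does not yet handle ARRAY types, so ship them as STRING.
            expressions.append(f"to_json({name}) AS {name}")
        else:
            expressions.append(name)
    return f"SELECT {', '.join(expressions)} FROM {table}"


if __name__ == "__main__":
    # Hypothetical schema for illustration; the real column list is not given in this guide.
    print(build_source_select(
        "table_types_demo.product_reviews_json",
        {"review_id": "BIGINT", "tags": "ARRAY<STRING>", "rating": "INT"},
    ))
```

The same loop over a list of tables is one way to batch-generate per-table definitions while bulk schema-wide sync is unavailable.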
150
+ ## Related Documentation
151
+
152
+ ### Federated Query
153
+
154
+ - [External Catalog Overview](external-catalog-concept.md): External Catalog federated query principles
155
+ - [Federated Query Guide](SQL_External_Catalog_Guide.md): SQL syntax and usage examples
156
+
157
+ ### Data Ingestion
158
+
159
+ - [Data Ingestion Overview](streaming_data_pipeline_overview.md): Full ingestion solution landscape
160
+ - [COPY INTO](copy-into.md): Bulk load from object storage
161
+ - [Pipe (Continuous Ingestion)](pipe-overview.md): Continuously monitor object storage for new files
162
+
163
+ ### Other Migration Guides
164
+
165
+ - [Databricks Notebook → Lakehouse Migration Guide](databricks-notebook-to-studio-migration.md)
166
+ - [Databricks DLT → Lakehouse Migration Guide](databricks-dlt-to-lakehouse-migration.md)
167
+ - [Databricks Jobs → Lakehouse Studio Migration Guide](databricks-jobs-to-studio-migration.md)
168
+ - [Databricks Unity Catalog → Lakehouse Migration Guide](databricks-uc-governance-to-lakehouse-migration.md)
@@ -0,0 +1,331 @@
1
+ # Databricks DLT → Lakehouse Migration Guide: Apparel Retail Streaming Pipeline
2
+
3
+ If your data pipeline runs on Databricks Delta Live Tables, the core migration effort to Singdata Lakehouse is lower than you might expect — **most PySpark DataFrame code can be reused directly**. DLT uses decorators (`@dlt.table`, `@dlt.expect_or_drop`) to wrap Python functions into pipeline nodes. ZettaPark removes the decorators and replaces them with `df.write.saveAsTable()` — business logic is unchanged, line for line.
4
+
5
+ This article validates this with a real project: a Databricks DLT-based apparel retail streaming pipeline (Bronze ingestion → Silver SCD2 cleansing → Gold aggregation analytics) fully migrated to Singdata Lakehouse, passing all 16 automated validations of the core pipeline (20 of 20 once the 4 Dynamic Table checks are included).
6
+
7
+ Full code on GitHub: [databricks2lakehouse-dlt-apparel](https://github.com/clickzetta/databricks2lakehouse-dlt-apparel)
8
+
9
+ ---
10
+
11
+ ## Source Project
12
+
13
+ [databricks2lakehouse-dlt-apparel](https://github.com/clickzetta/databricks2lakehouse-dlt-apparel) is forked from [jrlasak/databricks_apparel_streaming](https://github.com/jrlasak/databricks_apparel_streaming) (⭐45). The original tech stack is Databricks DLT + Auto Loader + Delta Lake. The project implements a complete data pipeline for an apparel retailer across 4 dimensions (customers, products, stores, transactions), covering streaming ingestion, SCD Type 2 history tracking, data quality constraints, and Gold layer aggregation analytics.
14
+
15
+ Migrated code is in the `03_lakehouse/` directory, with three available approaches that can be compared file-by-file with `01_source/dlt/`.
16
+
17
+ ## Conclusion First
18
+
19
+ **Option C (Dynamic Table) is the closest equivalent for DLT pipelines**: declarative definition plus automatic scheduled refresh, consistent with DLT's "define and execute" philosophy. If the existing team is comfortable with PySpark DataFrame, Option A (ZettaPark) requires only the mechanical substitutions in the change list below, with all business logic preserved.
20
+
21
+ | Option | Files | DLT Equivalent | Characteristics |
22
+ |------|------|---------|------|
23
+ | **A. ZettaPark Python** | `03_lakehouse/*.py` | `@dlt.table` → `df.write.saveAsTable()` | Minimal code changes, retains PySpark skills |
24
+ | **B. Pure SQL** | `03_lakehouse/sql/` | SQL equivalent implementation | SQL-first teams |
25
+ | **C. Dynamic Table (GIC)** | `03_lakehouse/dynamic_tables/` | **Native equivalent of @dlt.table** | Declarative + auto-refresh, closest to DLT semantics |
26
+
27
+ **Option A change list** (ZettaPark main path):
28
+
29
+ | Change | Effort | Notes |
30
+ |--------|--------|------|
31
+ | `import dlt` / `from pyspark...` | Very low | `from clickzetta.zettapark import functions as F`, replace package name |
32
+ | `spark` global | Very low | `session = Session.builder.configs({}).create()` |
33
+ | `@dlt.table(name=X)` | Very low | Add `df.write.mode("overwrite").saveAsTable(X)` as last line of function |
34
+ | `@dlt.expect_or_drop("msg","cond")` | Very low | `df.filter(F.col(...)...)`, semantically equivalent |
35
+ | `dlt.read_stream("LIVE.X")` / `dlt.read("LIVE.X")` | Very low | `session.table("X")`, same as PySpark |
36
+ | `dlt.create_auto_cdc_flow(stored_as_scd_type=2)` | Low | `F.lead().over(Window.partitionBy(key).orderBy(seq))`, ZettaPark Window is identical to PySpark |
37
+ | `F.window("event_time","1 day")` | Very low | `F.to_date(F.col("event_time"))`, ZettaPark has no F.window |
38
+ | Auto Loader (`cloudFiles`) | Low | `session.read.csv("vol://...")` batch; use Pipe for streaming in production |
39
+
40
+ JOIN logic, aggregation functions (`F.sum/count/avg`), `F.when/coalesce/lead/lag`, `Window` — all require no changes, fully consistent with PySpark API.
41
+
42
+ ---
43
+
44
+ ## Tech Stack Comparison
45
+
46
+ | | Databricks DLT | ZettaPark (after migration) |
47
+ |---|---|---|
48
+ | Pipeline definition | Python decorator `@dlt.table` | `df.write.mode("overwrite").saveAsTable(X)` |
49
+ | Data quality constraints | `@dlt.expect_or_drop("msg","cond")` | `df.filter(condition)` (same semantics) |
50
+ | Streaming read | `dlt.read_stream("LIVE.X")` | `session.table("X")` (DT auto-incremental) |
51
+ | Static read | `dlt.read("LIVE.X")` | `session.table("X")` |
52
+ | SCD Type 2 | `dlt.create_auto_cdc_flow(stored_as_scd_type=2)` | `F.lead().over(Window.partitionBy(key).orderBy(seq))` |
53
+ | Time window aggregation | `F.window("event_time","1 day")` | `F.to_date(F.col("event_time"))` |
54
+ | File ingestion | Auto Loader (`cloudFiles`) | `session.read.csv("vol://...")` / Pipe (streaming) |
55
+ | DataFrame API | PySpark | **Fully consistent** (F.col/when/lead/Window etc.) |
56
+
57
+ ---
58
+
59
+ ![](.topwrite/assets/anim-30-databricks-dlt-migration.svg)
60
+
61
+ ---
62
+
63
+ ## Project Background
64
+
65
+ Apparel retailer with 4 data streams + Bronze/Silver/Gold three-layer architecture:
66
+
67
+ | Data Domain | Raw Events | Row Count | Notes |
68
+ |--------|---------|------|------|
69
+ | Customers | `raw_customers` | 150 | Includes SCD2 update records (30 historical) |
70
+ | Products | `raw_products` | 50 | Includes category/brand/size/color |
71
+ | Stores | `raw_stores` | 5 | 5 stores |
72
+ | Transactions | `raw_sales` | 500 | Includes discount, tax, payment method |
73
+
74
+ DLT pipeline file structure:
75
+
76
+ ```
77
+ 01_bronze.py — Auto Loader streaming ingestion → 4 Bronze tables
78
+ 02A_silver.py — @dlt.view + @dlt.expect_or_drop (data quality filtering)
79
+ 02B_silver.py — dlt.create_auto_cdc_flow (SCD Type 2 dimension tables)
80
+ 02C_silver.py — Sales transaction cleansing
81
+ 03_gold.py — 4 Gold aggregation tables
82
+ ```
83
+
84
+ ---
85
+
86
+ ## Migration Steps
87
+
88
+ ### Step 1: `@dlt.table` / Auto Loader → ZettaPark
89
+
90
+ DLT uses decorators to declare tables + `spark.readStream` for streaming ingestion. ZettaPark uses `session.read.csv("vol://...")` to load files, with `df.write.saveAsTable()` replacing decorators:
91
+
92
+ ```python
93
+ # Databricks DLT (01_source/dlt/01_bronze.py)
94
+ @dlt.table(name="bronze_sales")
95
+ def bronze_sales():
96
+     return spark.readStream.format("cloudFiles").option("cloudFiles.format","csv").load(VOL_PATH)
97
+ ```
98
+
99
+ ```python
100
+ # ZettaPark (03_lakehouse/01_bronze.py) — minimal rewrite
101
+ from clickzetta.zettapark import Session # ← import pyspark → clickzetta.zettapark
102
+ session = Session.builder.configs({...}).create() # ← spark global injection → explicit creation
103
+
104
+ df = session.read.option("header","true").csv("vol://apparel_bronze.raw_data/sales.csv")
105
+ # readStream.format("cloudFiles")... → session.read.csv("vol://...")
106
+ df.write.mode("overwrite").saveAsTable("apparel_bronze.raw_sales")
107
+ # @dlt.table(name=X) → df.write.saveAsTable(X) (last line of function body)
108
+ ```
109
+
110
+ > 💡 **Note**: Auto Loader continuously monitors new files (streaming). The Lakehouse equivalent is **Pipe** (automatically ingests after configuration). Use Pipe instead of COPY INTO in production environments.
111
+
112
+ ### Step 2: `@dlt.expect_or_drop` → `df.filter()`
113
+
114
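+ `@dlt.expect_or_drop` translates directly to ZettaPark `.filter()` with the same semantics. `@dlt.expect` only warns in DLT and keeps the row, so translating it to `.filter()` as in the example below tightens it to drop semantics; omit that filter to keep the original behavior: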
115
+
116
+ ```python
117
+ # Databricks DLT (02A_silver.py)
118
+ @dlt.view(name="customers_cleaned_stream")
119
+ @dlt.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")
120
+ @dlt.expect("realistic_age", "age >= 18 AND age <= 100")
121
+ def customers_cleaned_stream():
122
+     return dlt.read_stream(f"LIVE.{BRONZE_CUSTOMERS}")
123
+ ```
124
+
125
+ ```python
126
+ # ZettaPark (03_lakehouse/02B_silver.py)
127
+ customers = session.table("apparel_bronze.raw_customers") # dlt.read_stream → session.table()
128
+ customers = (customers
129
+     .filter(F.col("customer_id").isNotNull()) # @dlt.expect_or_drop("valid_customer_id",...)
130
+     .filter((F.col("age") >= 18) & (F.col("age") <= 100)) # @dlt.expect("realistic_age",...) only warns in DLT; this filter drops the row
131
+ )
132
+ ```
133
+
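The difference between the two expectation levels can be modeled in a few lines of plain Python. This is a toy model of the semantics only, not DLT or ZettaPark code: the helper names `expect` and `expect_or_drop` mirror the decorators for readability, and the sample rows are made up.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def expect(rows: List[Row], predicate: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Warn level: keep every row and report how many violate the rule."""
    violations = sum(1 for row in rows if not predicate(row))
    return list(rows), violations


def expect_or_drop(rows: List[Row], predicate: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Drop level: keep only rows that satisfy the rule and report how many were dropped."""
    kept = [row for row in rows if predicate(row)]
    return kept, len(rows) - len(kept)


def realistic_age(row: Row) -> bool:
    # A missing age counts as a violation in this sketch.
    return row["age"] is not None and 18 <= row["age"] <= 100


if __name__ == "__main__":
    customers = [{"customer_id": 1, "age": 34}, {"customer_id": 2, "age": 7}]
    print(expect(customers, realistic_age))          # both rows kept, 1 violation
    print(expect_or_drop(customers, realistic_age))  # 1 row kept, 1 dropped
```

A `.filter()` call corresponds to `expect_or_drop` here, which is why row counts can shrink after migration if a warn-level rule is translated into a filter.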
134
+ ### Step 3: SCD Type 2 — `dlt.create_auto_cdc_flow` → `LEAD()` Window Function
135
+
136
+ For a source that already contains every version of each key, the SCD Type 2 output of `create_auto_cdc_flow(stored_as_scd_type=2)` can be reproduced with a LEAD window function. ZettaPark implements the equivalent logic directly; the `F.lead()` and `Window` APIs are consistent between the two:
137
+
138
+ ```python
139
+ # Databricks DLT (02B_silver.py)
140
+ dlt.create_auto_cdc_flow(
141
+ target=SILVER_CUSTOMERS,
142
+ source=f"live.{CUSTOMERS_CLEANED_STREAM}",
143
+ keys=["customer_id"],
144
+ sequence_by=F.col("last_update_time"),
145
+ stored_as_scd_type=2,
146
+ )
147
+ ```
148
+
149
+ ```python
150
+ # ZettaPark (03_lakehouse/02B_silver.py)
151
+ from clickzetta.zettapark.window import Window # ← pyspark.sql.window → clickzetta.zettapark.window
152
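+ from clickzetta.zettapark.types import TimestampType # needed for the cast below; module path assumed to mirror pyspark.sql.types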
153
+ w = Window.partitionBy("customer_id").orderBy("last_update_time")
154
+ # F.lead() and Window API are fully consistent, no changes needed
155
+ silver_customers = customers.withColumn(
156
+ "__end_at", F.lead("last_update_time").over(w) # SCD2 end timestamp
157
+ ).withColumn(
158
+ "__is_current", F.lead("last_update_time").over(w).isNull()
159
+ ).withColumn(
160
+ "__end_at", F.coalesce(F.col("__end_at"), F.lit("9999-12-31 23:59:59").cast(TimestampType()))
161
+ )
162
+ silver_customers.write.mode("overwrite").saveAsTable("apparel_silver.silver_customers")
163
+ ```
164
+
165
+ **Result**: 150 raw records → 150 rows (including 30 historical) → 120 current snapshot records (`__is_current = TRUE`).
166
+
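To see why a LEAD over each key reproduces the SCD2 columns, the same logic can be traced in plain Python without any engine. This is a minimal sketch under stated assumptions: rows are dictionaries, timestamps are ISO-formatted strings (so they sort correctly as text), and the function name `add_scd2_columns` and sample data are illustrative.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Dict, List

OPEN_END = "9999-12-31 23:59:59"  # open-ended marker used for current rows in this guide


def add_scd2_columns(
    rows: List[Dict[str, Any]],
    key: str = "customer_id",
    seq: str = "last_update_time",
) -> List[Dict[str, Any]]:
    """Emulate LEAD(seq) OVER (PARTITION BY key ORDER BY seq) and derive SCD2 columns."""
    ordered = sorted(rows, key=itemgetter(key, seq))
    result = []
    for _, group in groupby(ordered, key=itemgetter(key)):
        versions = list(group)
        for i, row in enumerate(versions):
            # LEAD: the next version's timestamp, or None for the last version.
            next_ts = versions[i + 1][seq] if i + 1 < len(versions) else None
            result.append({
                **row,
                "__end_at": next_ts if next_ts is not None else OPEN_END,
                "__is_current": next_ts is None,
            })
    return result


if __name__ == "__main__":
    history = [
        {"customer_id": 1, "last_update_time": "2024-02-01 00:00:00", "city": "Berlin"},
        {"customer_id": 1, "last_update_time": "2024-01-01 00:00:00", "city": "Munich"},
        {"customer_id": 2, "last_update_time": "2024-01-15 00:00:00", "city": "Hamburg"},
    ]
    for row in add_scd2_columns(history):
        print(row)
```

Each key has exactly one row with `__is_current` set, which is the property behind the 120 current records out of 150 reported above.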
167
+ ### Step 4: Time Window Aggregation — `F.window()` → `F.to_date()`
168
+
169
+ `F.window("event_time","1 day")` is not implemented in ZettaPark. Use `F.to_date()` instead (functionally equivalent, results consistent):
170
+
171
+ ```python
172
+ # Databricks DLT (03_gold.py)
173
+ df.groupBy(F.window("event_time","1 day").alias("sale_window"), "store_id","store_name") \
174
+ .agg(F.round(F.sum("total_amount"),2).alias("total_revenue")) \
175
+ .select(F.col("sale_window.start").cast("date").alias("sale_date"), ...)
176
+ ```
177
+
178
+ ```python
179
+ # ZettaPark (03_lakehouse/03_gold.py)
180
+ (facts.withColumn("sale_date", F.to_date(F.col("event_time"))) # F.window → F.to_date
180
+     .groupBy("sale_date","store_id","store_name")
181
+     .agg(F.round(F.sum("total_amount"),2).alias("total_revenue"),
182
+          F.count("transaction_id").alias("total_transactions"), ...))
184
+ ```
185
+
186
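+ > 💡 `F.to_date()` truncates to the calendar day, which produces the same grouping as `F.window("event_time","1 day")` when the session time zone is UTC. Day windows from `F.window` are aligned to the Unix epoch in UTC, so the two can differ under other session time zones.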
187
+
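The equivalence between a 1-day tumbling window and date truncation, and the time zone condition it depends on, can be checked with the standard library alone. This is a sketch of the arithmetic, assuming windows are aligned to the Unix epoch in UTC (Spark's default when no `startTime` offset is given); the helper names are illustrative.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
DAY = timedelta(days=1)


def day_window_start(ts: datetime) -> datetime:
    """Start of the 1-day tumbling window containing ts, aligned to the Unix epoch (UTC)."""
    return EPOCH + ((ts - EPOCH) // DAY) * DAY


def to_date_utc(ts: datetime) -> date:
    """Calendar date of ts when evaluated in UTC."""
    return ts.astimezone(timezone.utc).date()


if __name__ == "__main__":
    utc_event = datetime(2024, 3, 5, 23, 59, 59, tzinfo=timezone.utc)
    # In UTC the window start and the truncated date agree.
    print(day_window_start(utc_event).date(), to_date_utc(utc_event))

    # The same instant viewed in UTC+8 falls on the next local calendar day,
    # so truncating in local time would put it in a different group.
    local_event = datetime(2024, 3, 6, 2, 0, tzinfo=timezone(timedelta(hours=8)))
    print(day_window_start(local_event).date(), local_event.date())
```

If the source pipeline ran with a non-UTC session time zone, comparing a few boundary timestamps this way before cutting over is a cheap sanity check.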
188
+ ---
189
+
190
+ ## Pure SQL Alternative
191
+
192
+ If the team prefers SQL, `03_lakehouse/sql/` provides equivalent SQL scripts:
193
+
194
+ ```bash
195
+ cz-cli sql --file 03_lakehouse/sql/02_silver.sql --profile aws_singapore_prod --sync --write
196
+ cz-cli sql --file 03_lakehouse/sql/03_gold.sql --profile aws_singapore_prod --sync --write
197
+ ```
198
+
199
+ | DLT Python | SQL Equivalent |
200
+ |---|---|
201
+ | `@dlt.expect_or_drop` | `WHERE condition` |
202
+ | `create_auto_cdc_flow(stored_as_scd_type=2)` | `LEAD() OVER (PARTITION BY key ORDER BY seq)` |
203
+ | `F.window("event_time","1 day")` | `DATE_TRUNC('day', event_time)` |
204
+
205
+ ZettaPark is the recommended primary path (minimal code changes, retains PySpark skills); SQL is suitable for SQL-first teams or rapid validation.
206
+
207
+ ---
208
+
209
+ ## Option C: Dynamic Table (GIC) — Native Equivalent of DLT
210
+
211
+ Dynamic Table is the most direct equivalent of DLT's `@dlt.table`: define the table declaratively in SQL, and the platform refreshes it automatically. DLT triggers on new data, while Dynamic Table refreshes on a `REFRESH INTERVAL` schedule, so the results are equivalent up to the refresh interval.
212
+
213
+ ```python
214
+ # Databricks DLT — declarative pipeline
215
+ @dlt.table(name="gold_daily_sales_by_store")
216
+ def gold_daily_sales_by_store():
217
+     return df.groupBy(F.window("event_time","1 day"), "store_id","store_name") \
218
+         .agg(F.sum("total_amount").alias("total_revenue"))
219
+ ```
220
+
221
+ ```sql
222
+ -- Lakehouse Dynamic Table — direct equivalent (03_lakehouse/dynamic_tables/gold_dynamic_tables.sql)
223
+ CREATE OR REPLACE DYNAMIC TABLE apparel_gold.dt_daily_sales_by_store
224
+ REFRESH INTERVAL 10 MINUTE -- DLT triggers on new data; DT triggers on schedule
225
+ VCLUSTER DEFAULT
226
+ AS
227
+ SELECT CAST(event_time AS DATE) AS sale_date, store_id, store_name,
228
+ ROUND(SUM(total_amount), 2) AS total_revenue, ...
229
+ FROM apparel_gold.denormalized_sales_facts
230
+ GROUP BY CAST(event_time AS DATE), store_id, store_name;
231
+
232
+ -- Manual trigger (equivalent to DLT pipeline trigger)
233
+ REFRESH DYNAMIC TABLE apparel_gold.dt_daily_sales_by_store;
234
+ ```
235
+
236
+ | DLT | Dynamic Table | Notes |
237
+ |---|---|---|
238
+ | `@dlt.table(name=X)` | `CREATE DYNAMIC TABLE X ... AS SELECT ...` | Direct equivalent |
239
+ | Auto-refresh on new data | `REFRESH INTERVAL N MINUTE` | Different scheduling method, semantically equivalent |
240
+ | `F.window("1 day")` | `CAST(event_time AS DATE)` | DT uses SQL |
241
+ | `create_auto_cdc_flow(stored_as_scd_type=2)` | `LEAD() OVER Window` embedded in DT definition | Standard window function |
242
+ | DLT pipeline DAG | DT dependencies (downstream DT references upstream DT) | Declarative dependencies |
243
+
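When Dynamic Tables are refreshed manually, upstream tables need to be refreshed before the tables that read from them. The sketch below derives such an order with the standard library's `graphlib`. Only the edge from `denormalized_sales_facts` to `dt_daily_sales_by_store` comes from the SQL above; the other dependencies and the `apparel_silver.silver_sales_transactions` name are assumptions for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Maps each table to the upstream tables it reads from.
# Only the dt_daily_sales_by_store edge is taken from this guide; the rest is hypothetical.
DEPENDENCIES: Dict[str, Set[str]] = {
    "apparel_gold.denormalized_sales_facts": {
        "apparel_silver.silver_sales_transactions",
        "apparel_silver.silver_customers",
    },
    "apparel_gold.dt_daily_sales_by_store": {"apparel_gold.denormalized_sales_facts"},
}


def refresh_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return tables ordered so that every table appears after its upstream tables."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for table in refresh_order(DEPENDENCIES):
        print(table)
```

`TopologicalSorter` raises `CycleError` if the dependency map contains a cycle, which is a useful guard when the map is maintained by hand.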
244
+ Tested in AWS Singapore: 4 Dynamic Tables created and refreshed, all passed e2e validation: `dt_customers_current` (150 rows / 120 current), `dt_daily_sales_by_store` (367), `dt_product_performance` (50), `dt_customer_lifetime_value` (96).
245
+
246
+ ## About Pipe (Auto Loader Equivalent)
247
+
248
+ **Pipe** is the Databricks Auto Loader equivalent — continuously monitors object storage (OSS/S3/COS) for new files, automatically triggering COPY INTO ingestion. Tested in AWS Singapore: after uploading a new CSV to S3, Pipe auto-ingested within 15 seconds.
249
+
250
+ ```sql
251
+ -- Step 1: Create External Volume (connected to S3/OSS/COS)
252
+ CREATE EXTERNAL VOLUME apparel_bronze.s3_sales_landing
253
+ LOCATION 's3://your-bucket/apparel_landing/'
254
+ USING CONNECTION s3_conn
255
+ DIRECTORY = (enable=true, auto_refresh=true)
256
+ RECURSIVE = true;
257
+
258
+ -- Step 2: Create table (External Volume does not support inferSchema, explicit definition required)
259
+ CREATE TABLE apparel_bronze.raw_sales (
260
+ transaction_id BIGINT, store_id BIGINT, event_time STRING,
261
+ customer_id BIGINT, product_id BIGINT, quantity BIGINT,
262
+ unit_price DOUBLE, total_amount DOUBLE, payment_method STRING,
263
+ discount_applied DOUBLE, tax_amount DOUBLE
264
+ );
265
+
266
+ -- Step 3: Create Pipe
267
+ -- DLT: spark.readStream.format("cloudFiles").option("cloudFiles.format","csv").load(PATH)
268
+ CREATE OR REPLACE PIPE apparel_bronze.pipe_sales
269
+ VIRTUAL_CLUSTER = 'DEFAULT'
270
+ INGEST_MODE = 'LIST_PURGE' -- Poll for new files, comparable to Auto Loader's directory listing mode
271
+ AS COPY INTO apparel_bronze.raw_sales
272
+ FROM VOLUME apparel_bronze.s3_sales_landing
273
+ USING CSV OPTIONS ('header'='true')
274
+ PURGE = TRUE -- Delete landing zone files after processing
275
+ ON_ERROR = CONTINUE;
276
+ ```
277
+
278
+ > 💡 **Pipe vs COPY INTO**: COPY INTO is a one-time batch load; Pipe continuously monitors and auto-triggers when new files arrive — a direct equivalent to Auto Loader. Internal Volumes only support COPY INTO; Pipe requires an External Volume (OSS/S3/COS).
279
+
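The list-then-purge behavior described above can be pictured as a polling cycle: list the landing zone, load each file, then delete it so the next poll only sees new arrivals. The following is a toy model of that cycle on a local directory, not how Pipe is implemented; the function name `poll_once` and the in-memory "table" are illustrative assumptions.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def poll_once(landing: Path, table: List[Dict[str, str]], purge: bool = True) -> int:
    """Load every CSV currently in the landing directory and return how many files were ingested."""
    ingested = 0
    for path in sorted(landing.glob("*.csv")):
        with path.open(newline="") as handle:
            table.extend(csv.DictReader(handle))  # header row becomes the column names
        ingested += 1
        if purge:
            path.unlink()  # purge: remove the file so it is not loaded again
    return ingested


def demo() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        landing = Path(tmp)
        (landing / "sales_001.csv").write_text("transaction_id,total_amount\n1,10.5\n2,20.0\n")
        table: List[Dict[str, str]] = []
        print(poll_once(landing, table), "file(s) ingested,", len(table), "row(s) loaded")
        print(poll_once(landing, table), "file(s) ingested on the second poll")


if __name__ == "__main__":
    demo()
```

Without the purge step, a listing-based poll would need separate bookkeeping of already-loaded files to avoid duplicates, which is the trade-off that purging sidesteps.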
280
+ ---
281
+
282
+ ## E2E Validation Results
283
+
284
+ Tested on AWS Singapore instance (`aws_singapore_prod`), **20/20 all passed** (including Dynamic Table validation):
285
+
286
+ | Check | Expected | Result |
287
+ |--------|--------|------|
288
+ | bronze.raw_customers | 150 | ✅ |
289
+ | bronze.raw_products | 50 | ✅ |
290
+ | bronze.raw_stores | 5 | ✅ |
291
+ | bronze.raw_sales | 500 | ✅ |
292
+ | silver_customers (including history) | 150 | ✅ |
293
+ | silver_products | 50 | ✅ |
294
+ | silver_stores | 5 | ✅ |
295
+ | silver_sales_transactions | 500 | ✅ |
296
+ | silver_customers_current | 120 | ✅ |
297
+ | silver_products_current | 50 | ✅ |
298
+ | silver_stores_current | 5 | ✅ |
299
+ | gold.denormalized_sales_facts | 500 | ✅ |
300
+ | gold.gold_product_performance | 50 | ✅ |
301
+ | Total sales amount | 281,490 | ✅ |
302
+ | Customers with purchases | 96 | ✅ |
303
+ | SCD2 historical records | 30 | ✅ |
304
+ | **dt_customers_current** (Dynamic Table) | 150 | ✅ |
305
+ | **dt_daily_sales_by_store** (Dynamic Table) | ≥100 | ✅ 367 |
306
+ | **dt_product_performance** (Dynamic Table) | 50 | ✅ |
307
+ | **dt_customer_lifetime_value** (Dynamic Table) | 96 | ✅ |
309
+
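A validation table like the one above maps naturally onto a small check runner. The sketch below is illustrative and not the project's actual test script: `run_checks` is a hypothetical helper, the expected values are copied from the table, and the `actual` counts would in practice come from `SELECT COUNT(*)` queries against the migrated tables.

```python
from typing import Dict, Tuple, Union

Rule = Union[int, Tuple[str, int]]  # an exact count, or (">=", lower_bound)


def run_checks(expected: Dict[str, Rule], actual: Dict[str, int]) -> Dict[str, bool]:
    """Compare actual counts against expected rules and return pass/fail per check."""
    results: Dict[str, bool] = {}
    for name, rule in expected.items():
        value = actual.get(name)
        if value is None:
            results[name] = False  # a missing measurement counts as a failure
        elif isinstance(rule, tuple):
            operator, bound = rule
            if operator != ">=":
                raise ValueError(f"unsupported operator: {operator}")
            results[name] = value >= bound
        else:
            results[name] = value == rule
    return results


if __name__ == "__main__":
    expected: Dict[str, Rule] = {
        "bronze.raw_customers": 150,
        "silver_customers_current": 120,
        "dt_daily_sales_by_store": (">=", 100),
    }
    actual = {
        "bronze.raw_customers": 150,
        "silver_customers_current": 120,
        "dt_daily_sales_by_store": 367,
    }
    outcome = run_checks(expected, actual)
    print(f"{sum(outcome.values())}/{len(outcome)} passed")
```

Keeping exact counts and lower bounds in one structure makes it easy to rerun the same checks after each migration step.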
310
+ ---
311
+
312
+ ## Notes
313
+
314
+ - **Streaming vs batch**: Auto Loader continuously monitors new files; COPY INTO is a one-time batch. Use Lakehouse **Pipe** instead of Auto Loader in production — automatically ingests new files after configuration, no manual triggering required.
315
+ - **`@dlt.expect` warn semantics**: `expect` (warn level) in DLT records violations in the pipeline's data quality metrics but does not drop records. When migrating, you can either skip the filter (keeping every row, as DLT does) or add one (`WHERE` in SQL, `.filter()` in ZettaPark), which is equivalent to upgrading to `expect_or_drop`. Decide based on business requirements.
316
+ - **DT auto-incremental**: Lakehouse Dynamic Tables automatically refresh incrementally after definition (similar to DLT streaming updates), without needing to explicitly write `read_stream`.
317
+ - **SCD2 `__is_current` field**: `LEAD()` returning NULL means there is no subsequent version, i.e., this is the current record. `COALESCE(__end_at, '9999-12-31')` is a common SCD2 convention for open-ended rows. DLT itself leaves `__END_AT` as NULL on current records, so keep the NULL instead if downstream queries filter on `__END_AT IS NULL`.
318
+
319
+ ## Related Documentation
320
+
321
+ ### Dynamic Table (GIC)
322
+
323
+ - [Dynamic Table Overview](dynamic-table-overview.md): Declarative incremental computation principles and architecture
324
+ - [Dynamic Table SQL Reference](dynamic-table-sql.md): Full CREATE DYNAMIC TABLE syntax
325
+ - [Dynamic Table Usage Guide](SQL_DynamicTable_Guide.md): Unified batch-streaming pipeline examples
326
+
327
+ ### Other Migration Guides
328
+
329
+ - [Databricks Notebook → Lakehouse Migration Guide (Retail Medallion Pipeline)](databricks-notebook-to-studio-migration.md)
330
+ - [dbt-databricks → dbt-clickzetta Migration Guide (Financial Payment Pipeline)](dbt-databricks-to-clickzetta-migration.md)
331
+ - [Spark Migration Guide](spark-migration-guide.md): Spark ecosystem migration overview