@clickzetta/cz-cli-darwin-x64 0.5.15 → 0.5.17

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (243)
  1. package/bin/cz-cli +0 -0
  2. package/bin/skills/lakehouse-doc-en/SKILL.md +6 -11
  3. package/bin/skills/lakehouse-doc-en/references/AIGateway.md +58 -13
  4. package/bin/skills/lakehouse-doc-en/references/Computation.md +1 -1
  5. package/bin/skills/lakehouse-doc-en/references/DataSource_Amazon_DocumentDB.md +3 -1
  6. package/bin/skills/lakehouse-doc-en/references/Foreach.md +14 -14
  7. package/bin/skills/lakehouse-doc-en/references/JDBC-Driver.md +0 -1
  8. package/bin/skills/lakehouse-doc-en/references/LakehouseAI-overview.md +21 -8
  9. package/bin/skills/lakehouse-doc-en/references/LakehouseDataGPT-tour.md +4 -9
  10. package/bin/skills/lakehouse-doc-en/references/LakehouseStudio-tour.md +14 -19
  11. package/bin/skills/lakehouse-doc-en/references/Lakehouse_Zilliz_MakeDataReadyforBIandAI.md +1 -1
  12. package/bin/skills/lakehouse-doc-en/references/Logstash.md +3 -3
  13. package/bin/skills/lakehouse-doc-en/references/Migrate_Spark_DataEngineeringBestPractices_Project_to_Lakehouse.md +1 -1
  14. package/bin/skills/lakehouse-doc-en/references/Notebook.md +17 -17
  15. package/bin/skills/lakehouse-doc-en/references/RemoteFunction-as-udf.md +14 -14
  16. package/bin/skills/lakehouse-doc-en/references/SQL_External_Catalog_Guide.md +1 -9
  17. package/bin/skills/lakehouse-doc-en/references/SUMMARY.md +59 -29
  18. package/bin/skills/lakehouse-doc-en/references/WINDOWFUNCTION.md +99 -57
  19. package/bin/skills/lakehouse-doc-en/references/Zettapark_Data_Engineering_Demo.md +1 -1
  20. package/bin/skills/lakehouse-doc-en/references/access-control-configuration.md +1 -8
  21. package/bin/skills/lakehouse-doc-en/references/aigw-2026-2-5-1.0.md +16 -0
  22. package/bin/skills/lakehouse-doc-en/references/aigw-2026-3-29-1.0.2.md +14 -0
  23. package/bin/skills/lakehouse-doc-en/references/aigw-2026-3-8-1.0.1.md +16 -0
  24. package/bin/skills/lakehouse-doc-en/references/aigw-2026-4-28-1.1.md +29 -0
  25. package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-12-1.1.1.md +18 -0
  26. package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-15-1.2.md +9 -0
  27. package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-21-1.3.md +9 -0
  28. package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-28-1.4.md +10 -0
  29. package/bin/skills/lakehouse-doc-en/references/aigw-2026-6-3-1.5.md +9 -0
  30. package/bin/skills/lakehouse-doc-en/references/alicloud-arn-externalid.md +0 -5
  31. package/bin/skills/lakehouse-doc-en/references/answer-accuracy-improve.md +120 -103
  32. package/bin/skills/lakehouse-doc-en/references/application-list.md +1 -3
  33. package/bin/skills/lakehouse-doc-en/references/approval-list.md +16 -17
  34. package/bin/skills/lakehouse-doc-en/references/batch-load-parquet-file-into-lakehouse.md +1 -1
  35. package/bin/skills/lakehouse-doc-en/references/batch_sync.md +9 -9
  36. package/bin/skills/lakehouse-doc-en/references/batch_sync_Sop.md +2 -2
  37. package/bin/skills/lakehouse-doc-en/references/batchloadparquetfileintoLakehouse.md +1 -1
  38. package/bin/skills/lakehouse-doc-en/references/bulkloadv1-python-sdk.md +3 -3
  39. package/bin/skills/lakehouse-doc-en/references/chart-auto-refresh-guide.md +12 -6
  40. package/bin/skills/lakehouse-doc-en/references/clickzetta-sample-data.md +3 -3
  41. package/bin/skills/lakehouse-doc-en/references/code_approval.md +1 -5
  42. package/bin/skills/lakehouse-doc-en/references/composite_task.md +31 -42
  43. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_environment_and_data_generate.md +6 -9
  44. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_javasdk_bulkload_realtime.md +4 -10
  45. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_kafka_realtime_sync.md +1 -10
  46. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_local_file_into_table_by_studio.md +0 -6
  47. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_batchload_public_network.md +0 -5
  48. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_python_node.md +2 -7
  49. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_realtime_cdc_public_network.md +13 -18
  50. package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_sql_insert.md +0 -1
  51. package/bin/skills/lakehouse-doc-en/references/concepts.md +1 -1
  52. package/bin/skills/lakehouse-doc-en/references/config-datasource.md +5 -7
  53. package/bin/skills/lakehouse-doc-en/references/connect-with-cli.md +116 -72
  54. package/bin/skills/lakehouse-doc-en/references/connect-with-cz-cli.md +151 -0
  55. package/bin/skills/lakehouse-doc-en/references/continue-job.md +9 -17
  56. package/bin/skills/lakehouse-doc-en/references/create-api-connection.md +315 -286
  57. package/bin/skills/lakehouse-doc-en/references/create-catalog-connection.md +1 -0
  58. package/bin/skills/lakehouse-doc-en/references/create-dynamic-table.md +4 -4
  59. package/bin/skills/lakehouse-doc-en/references/create-external-catalog.md +85 -22
  60. package/bin/skills/lakehouse-doc-en/references/create-table-ddl.md +45 -0
  61. package/bin/skills/lakehouse-doc-en/references/creating_alicloud_privatelinkendpoint.md +4 -6
  62. package/bin/skills/lakehouse-doc-en/references/creating_alicloud_privatelinkservice.md +4 -7
  63. package/bin/skills/lakehouse-doc-en/references/creating_tencentcloud_privatelinkendpoint.md +2 -7
  64. package/bin/skills/lakehouse-doc-en/references/creating_tencentcloud_privatelinkservice.md +1 -5
  65. package/bin/skills/lakehouse-doc-en/references/cz-cli-agent.md +15 -10
  66. package/bin/skills/lakehouse-doc-en/references/cz-cli-datasource.md +0 -8
  67. package/bin/skills/lakehouse-doc-en/references/cz-cli-sql.md +2 -45
  68. package/bin/skills/lakehouse-doc-en/references/cz-cli.md +53 -42
  69. package/bin/skills/lakehouse-doc-en/references/dashboard-version-management-guide.md +12 -4
  70. package/bin/skills/lakehouse-doc-en/references/data-integration-intro.md +1 -1
  71. package/bin/skills/lakehouse-doc-en/references/data-integration.md +29 -27
  72. package/bin/skills/lakehouse-doc-en/references/data-load-summary.md +3 -3
  73. package/bin/skills/lakehouse-doc-en/references/data-quality.md +25 -25
  74. package/bin/skills/lakehouse-doc-en/references/data-sharing.md +31 -54
  75. package/bin/skills/lakehouse-doc-en/references/data-sources.md +45 -45
  76. package/bin/skills/lakehouse-doc-en/references/data_catalog.md +23 -25
  77. package/bin/skills/lakehouse-doc-en/references/data_privacy.md +5 -2
  78. package/bin/skills/lakehouse-doc-en/references/data_sharing_between_accounts_guide.md +0 -4
  79. package/bin/skills/lakehouse-doc-en/references/data_visualization.md +4 -15
  80. package/bin/skills/lakehouse-doc-en/references/dataagent.md +39 -7
  81. package/bin/skills/lakehouse-doc-en/references/databricks-delta-to-lakehouse-migration.md +168 -0
  82. package/bin/skills/lakehouse-doc-en/references/databricks-dlt-to-lakehouse-migration.md +331 -0
  83. package/bin/skills/lakehouse-doc-en/references/databricks-external-catalog-practice.md +367 -0
  84. package/bin/skills/lakehouse-doc-en/references/databricks-jobs-to-studio-migration.md +199 -0
  85. package/bin/skills/lakehouse-doc-en/references/databricks-notebook-to-studio-migration.md +350 -0
  86. package/bin/skills/lakehouse-doc-en/references/databricks-uc-governance-to-lakehouse-migration.md +327 -0
  87. package/bin/skills/lakehouse-doc-en/references/datagpt-model-config.md +34 -0
  88. package/bin/skills/lakehouse-doc-en/references/datagpt_data_source.md +50 -37
  89. package/bin/skills/lakehouse-doc-en/references/datagpt_introduction.md +55 -79
  90. package/bin/skills/lakehouse-doc-en/references/datagpt_quickstart.md +50 -64
  91. package/bin/skills/lakehouse-doc-en/references/datalake-acceleration.md +75 -2
  92. package/bin/skills/lakehouse-doc-en/references/dbt-databricks-to-clickzetta-migration.md +242 -0
  93. package/bin/skills/lakehouse-doc-en/references/dynamic-mask.md +30 -30
  94. package/bin/skills/lakehouse-doc-en/references/dynamic-table-bestpractice.md +1 -1
  95. package/bin/skills/lakehouse-doc-en/references/dynamic-table-introduce.md +1 -1
  96. package/bin/skills/lakehouse-doc-en/references/dynamic_table_summary.md +1 -1
  97. package/bin/skills/lakehouse-doc-en/references/eco_integration/streamlit.md +1 -1
  98. package/bin/skills/lakehouse-doc-en/references/eco_integration/superset.md +1 -1
  99. package/bin/skills/lakehouse-doc-en/references/ecosystem-all.md +1 -3
  100. package/bin/skills/lakehouse-doc-en/references/ecosystem.md +145 -0
  101. package/bin/skills/lakehouse-doc-en/references/external-catalog-summary.md +33 -38
  102. package/bin/skills/lakehouse-doc-en/references/external-function-combo-practice.md +466 -0
  103. package/bin/skills/lakehouse-doc-en/references/f6fc6447ee.md +7 -9
  104. package/bin/skills/lakehouse-doc-en/references/federation-query.md +56 -6
  105. package/bin/skills/lakehouse-doc-en/references/finebi-mysql.md +2 -0
  106. package/bin/skills/lakehouse-doc-en/references/get-started-with-sample-data.md +10 -11
  107. package/bin/skills/lakehouse-doc-en/references/gitfolder.md +2 -3
  108. package/bin/skills/lakehouse-doc-en/references/grant-privileges.md +2 -0
  109. package/bin/skills/lakehouse-doc-en/references/iceberg-rest-catalog-databricks.md +166 -0
  110. package/bin/skills/lakehouse-doc-en/references/ide.md +1 -1
  111. package/bin/skills/lakehouse-doc-en/references/if_else_task.md +59 -57
  112. package/bin/skills/lakehouse-doc-en/references/input_output.md +10 -7
  113. package/bin/skills/lakehouse-doc-en/references/jobprofile-bestpractices.md +60 -64
  114. package/bin/skills/lakehouse-doc-en/references/kafka-connection.md +0 -1
  115. package/bin/skills/lakehouse-doc-en/references/key-concepts.md +146 -117
  116. package/bin/skills/lakehouse-doc-en/references/lakehouse-ai-gateway-cz-cli.md +317 -0
  117. package/bin/skills/lakehouse-doc-en/references/lakehouse-ai-sql-analysis.md +345 -0
  118. package/bin/skills/lakehouse-doc-en/references/lakehouse-dqc-guide.md +300 -0
  119. package/bin/skills/lakehouse-doc-en/references/lakehouse-medallion-sql-dt-guide.md +543 -0
  120. package/bin/skills/lakehouse-doc-en/references/lakehouse-multi-cloud-acceleration.md +274 -0
  121. package/bin/skills/lakehouse-doc-en/references/lakehouse-multimodal-ai-pipeline.md +198 -0
  122. package/bin/skills/lakehouse-doc-en/references/lakehouse-quick-experience_guide.md +49 -52
  123. package/bin/skills/lakehouse-doc-en/references/lakehouse-volume-pipe-acceleration-guide.md +380 -0
  124. package/bin/skills/lakehouse-doc-en/references/langchain-plug-installation.md +1 -1
  125. package/bin/skills/lakehouse-doc-en/references/management.md +4 -9
  126. package/bin/skills/lakehouse-doc-en/references/medallion-lakehouse-from-scratch.md +2 -1
  127. package/bin/skills/lakehouse-doc-en/references/metrics_answer_build.md +58 -21
  128. package/bin/skills/lakehouse-doc-en/references/migrate-spark-data-engineering-best-practices-to-lakehouse.md +1 -1
  129. package/bin/skills/lakehouse-doc-en/references/mindsdb.md +1 -1
  130. package/bin/skills/lakehouse-doc-en/references/monitoring_and_alerting.md +65 -60
  131. package/bin/skills/lakehouse-doc-en/references/monitoring_item_specification.md +33 -33
  132. package/bin/skills/lakehouse-doc-en/references/multitable_batch_sync.md +16 -16
  133. package/bin/skills/lakehouse-doc-en/references/multitable_realtime_sync.md +65 -72
  134. package/bin/skills/lakehouse-doc-en/references/multitable_realtime_sync_sop.md +54 -52
  135. package/bin/skills/lakehouse-doc-en/references/navicat-mysql.md +2 -0
  136. package/bin/skills/lakehouse-doc-en/references/om-dynamic-table.md +71 -66
  137. package/bin/skills/lakehouse-doc-en/references/om-vcluster.md +2 -0
  138. package/bin/skills/lakehouse-doc-en/references/open-api-create-session.md +79 -0
  139. package/bin/skills/lakehouse-doc-en/references/open-api-generate-auth-token.md +63 -0
  140. package/bin/skills/lakehouse-doc-en/references/open-api-overview.md +96 -0
  141. package/bin/skills/lakehouse-doc-en/references/open-api-quick-start.md +286 -0
  142. package/bin/skills/lakehouse-doc-en/references/open-api-response-guide.md +264 -0
  143. package/bin/skills/lakehouse-doc-en/references/open-api-safe-question-poll.md +201 -0
  144. package/bin/skills/lakehouse-doc-en/references/open-api-text2insight-query.md +99 -0
  145. package/bin/skills/lakehouse-doc-en/references/open-api-text2insight-stop.md +74 -0
  146. package/bin/skills/lakehouse-doc-en/references/overview.md +6 -7
  147. package/bin/skills/lakehouse-doc-en/references/permission-application.md +5 -5
  148. package/bin/skills/lakehouse-doc-en/references/pipe-introduction.md +1 -0
  149. package/bin/skills/lakehouse-doc-en/references/pipe-kafka-table-stream.md +72 -70
  150. package/bin/skills/lakehouse-doc-en/references/pipe-kafka.md +105 -110
  151. package/bin/skills/lakehouse-doc-en/references/pipe-overview.md +40 -40
  152. package/bin/skills/lakehouse-doc-en/references/pipe-storage-object.md +43 -48
  153. package/bin/skills/lakehouse-doc-en/references/pipe-summary.md +14 -4
  154. package/bin/skills/lakehouse-doc-en/references/pipe-syntax.md +58 -151
  155. package/bin/skills/lakehouse-doc-en/references/practice_python_task.md +4 -4
  156. package/bin/skills/lakehouse-doc-en/references/pricing-ai-gateway.md +181 -0
  157. package/bin/skills/lakehouse-doc-en/references/pricing-lakehouse.md +316 -0
  158. package/bin/skills/lakehouse-doc-en/references/pricing.md +44 -288
  159. package/bin/skills/lakehouse-doc-en/references/private-link-general.md +0 -2
  160. package/bin/skills/lakehouse-doc-en/references/pyspark-to-zettapark-migration-f1.md +1 -1
  161. package/bin/skills/lakehouse-doc-en/references/python-igs.md +7 -3
  162. package/bin/skills/lakehouse-doc-en/references/python-sample-put-github-rt-events.md +1 -1
  163. package/bin/skills/lakehouse-doc-en/references/python-task.md +1 -1
  164. package/bin/skills/lakehouse-doc-en/references/python_reference/connector.md +3 -3
  165. package/bin/skills/lakehouse-doc-en/references/python_reference/connector_advanced.md +2 -2
  166. package/bin/skills/lakehouse-doc-en/references/python_reference/connector_examples.md +2 -2
  167. package/bin/skills/lakehouse-doc-en/references/python_sdk_guide.md +1 -1
  168. package/bin/skills/lakehouse-doc-en/references/python_shell_datasource.md +11 -9
  169. package/bin/skills/lakehouse-doc-en/references/quick_start_batch_sync_data.md +9 -18
  170. package/bin/skills/lakehouse-doc-en/references/quick_start_bi_analysis.md +8 -25
  171. package/bin/skills/lakehouse-doc-en/references/quick_start_create_workspace.md +4 -6
  172. package/bin/skills/lakehouse-doc-en/references/quick_start_data_quality.md +8 -8
  173. package/bin/skills/lakehouse-doc-en/references/quick_start_etl.md +16 -20
  174. package/bin/skills/lakehouse-doc-en/references/quick_start_monitoring_and_alerting.md +10 -18
  175. package/bin/skills/lakehouse-doc-en/references/quick_start_sql_query.md +7 -10
  176. package/bin/skills/lakehouse-doc-en/references/quick_start_upload_data.md +5 -7
  177. package/bin/skills/lakehouse-doc-en/references/quick_start_user_management.md +8 -8
  178. package/bin/skills/lakehouse-doc-en/references/quick_start_workspace.md +0 -5
  179. package/bin/skills/lakehouse-doc-en/references/quick_start_workspace_user.md +8 -8
  180. package/bin/skills/lakehouse-doc-en/references/quickstart.md +69 -56
  181. package/bin/skills/lakehouse-doc-en/references/quickstart_datashare_between_companies.md +0 -5
  182. package/bin/skills/lakehouse-doc-en/references/quickstart_envirment_for_team.md +0 -24
  183. package/bin/skills/lakehouse-doc-en/references/realtime-pipeline-selection-guide.md +1 -2
  184. package/bin/skills/lakehouse-doc-en/references/realtime-sales-dashboard-with-dynamic-table.md +3 -3
  185. package/bin/skills/lakehouse-doc-en/references/realtime_sync.md +0 -1
  186. package/bin/skills/lakehouse-doc-en/references/release-note-2026-05-19.md +5 -3
  187. package/bin/skills/lakehouse-doc-en/references/revoke-privileges.md +3 -1
  188. package/bin/skills/lakehouse-doc-en/references/roles.md +2 -3
  189. package/bin/skills/lakehouse-doc-en/references/row-filter.md +165 -0
  190. package/bin/skills/lakehouse-doc-en/references/row_level_permission.md +30 -19
  191. package/bin/skills/lakehouse-doc-en/references/scheduled_task.md +28 -21
  192. package/bin/skills/lakehouse-doc-en/references/security_overview.md +99 -21
  193. package/bin/skills/lakehouse-doc-en/references/set-command.md +1 -1
  194. package/bin/skills/lakehouse-doc-en/references/setup.md +13 -15
  195. package/bin/skills/lakehouse-doc-en/references/show-grants.md +1 -1
  196. package/bin/skills/lakehouse-doc-en/references/snowflake-dynamic-tables-to-lakehouse.md +2 -2
  197. package/bin/skills/lakehouse-doc-en/references/spark-connector-summary.md +1 -1
  198. package/bin/skills/lakehouse-doc-en/references/sql_functions/context_functions/current_vcluster.md +1 -1
  199. package/bin/skills/lakehouse-doc-en/references/sso-configuration.md +2 -2
  200. package/bin/skills/lakehouse-doc-en/references/streaming_pipeline_with_dynamic_table.md +0 -1
  201. package/bin/skills/lakehouse-doc-en/references/studio-incremental-sync-practice.md +27 -23
  202. package/bin/skills/lakehouse-doc-en/references/studio-shell-task.md +1 -1
  203. package/bin/skills/lakehouse-doc-en/references/supported-cloud-platforms.md +32 -0
  204. package/bin/skills/lakehouse-doc-en/references/table_rendering.md +18 -12
  205. package/bin/skills/lakehouse-doc-en/references/task-develop.md +89 -91
  206. package/bin/skills/lakehouse-doc-en/references/task_development.md +19 -17
  207. package/bin/skills/lakehouse-doc-en/references/task_group.md +16 -14
  208. package/bin/skills/lakehouse-doc-en/references/task_instance.md +21 -21
  209. package/bin/skills/lakehouse-doc-en/references/task_param.md +38 -35
  210. package/bin/skills/lakehouse-doc-en/references/task_param_reference.md +81 -79
  211. package/bin/skills/lakehouse-doc-en/references/task_scheduling_dependency.md +20 -21
  212. package/bin/skills/lakehouse-doc-en/references/tencentcloud_arn_and_externalid.md +1 -5
  213. package/bin/skills/lakehouse-doc-en/references/trial-account-quotas-and-limits.md +1 -3
  214. package/bin/skills/lakehouse-doc-en/references/tutorial_connect_to_lakehouse.md +69 -0
  215. package/bin/skills/lakehouse-doc-en/references/tutorials.md +4 -1
  216. package/bin/skills/lakehouse-doc-en/references/unique-key.md +167 -0
  217. package/bin/skills/lakehouse-doc-en/references/usageandbillingview.md +138 -0
  218. package/bin/skills/lakehouse-doc-en/references/use-dbt-dev.md +3 -3
  219. package/bin/skills/lakehouse-doc-en/references/use-java-sdk-realtime-uploaddata.md +1 -1
  220. package/bin/skills/lakehouse-doc-en/references/use-java-sdk-upload-data-local.md +3 -3
  221. package/bin/skills/lakehouse-doc-en/references/use-models.md +128 -0
  222. package/bin/skills/lakehouse-doc-en/references/use-mysql-client.md +81 -81
  223. package/bin/skills/lakehouse-doc-en/references/use-python-sdk-upload-data.md +10 -12
  224. package/bin/skills/lakehouse-doc-en/references/user-identification.md +2 -3
  225. package/bin/skills/lakehouse-doc-en/references/user_permission_grand_guide.md +1 -1
  226. package/bin/skills/lakehouse-doc-en/references/using-udf-in-dynamic-table.md +1 -1
  227. package/bin/skills/lakehouse-doc-en/references/vc_cache.md +18 -22
  228. package/bin/skills/lakehouse-doc-en/references/vcluster_size_description.md +33 -31
  229. package/bin/skills/lakehouse-doc-en/references/virtual-cluster.md +43 -45
  230. package/bin/skills/lakehouse-doc-en/references/web-job-history.md +94 -108
  231. package/bin/skills/lakehouse-doc-en/references/web_search.md +16 -7
  232. package/bin/skills/lakehouse-doc-en/references/zettapark-data-engineering-demo.md +1 -1
  233. package/bin/skills/lakehouse-doc-en/references/zettapark-dataframe-guide.md +144 -70
  234. package/bin/skills/lakehouse-doc-en/references/zettapark-dynamic-table-guide.md +2 -2
  235. package/bin/skills/lakehouse-doc-en/references/zettapark-etl-guide.md +73 -33
  236. package/bin/skills/lakehouse-doc-en/references/zettapark-feature-engineering.md +2 -2
  237. package/bin/skills/lakehouse-doc-en/references/zettapark-functions-guide.md +75 -46
  238. package/bin/skills/lakehouse-doc-en/references/zettapark-quick-start.md +2 -2
  239. package/bin/skills/lakehouse-doc-en/references/zettapark-stream-guide.md +4 -4
  240. package/bin/skills/lakehouse-doc-en/references/zettapark-volume-guide.md +93 -29
  241. package/package.json +1 -1
  242. package/bin/skills/lakehouse-doc-en/references/CLAUDE.md +0 -606
  243. package/bin/skills/lakehouse-doc-en/references/modelprice.md +0 -155
@@ -0,0 +1,350 @@
1
+ # Databricks Notebook → Lakehouse Migration Guide: Retail Data Medallion Pipeline
2
+
3
+ If your data engineering pipeline runs on Databricks Notebooks, migrating to Singdata Lakehouse Studio takes less effort than you might expect. The PySpark DataFrame API used on Databricks (`select`, `filter`, `join`, `withColumn`, `when`, Window functions) has identical syntax in ZettaPark. The changes come down to 5 mechanical substitutions: import paths, session acquisition, table path prefixes, one API casing difference, and replacing `dbutils.notebook.run()` with Studio task dependencies.
4
+
5
+ This article validates that claim with a real project: a Databricks-based retail data Medallion pipeline (Bronze → Silver → Gold three-layer architecture, 14 Notebooks, 81 code cells) fully migrated to Singdata Lakehouse. The project offers three migration options, and all 20 automated validations pass.
6
+
7
+ Full code on GitHub: [databricks2lakehouse-bootcamp](https://github.com/clickzetta/databricks2lakehouse-bootcamp)
8
+
9
+ ---
10
+
11
+ ## Source Project
12
+
13
+ [databricks2lakehouse-bootcamp](https://github.com/clickzetta/databricks2lakehouse-bootcamp) is forked from [DataWithBaraa/databricks_bootcamp_2026](https://github.com/DataWithBaraa/databricks_bootcamp_2026) (⭐335). The original tech stack is Databricks + PySpark + Delta Lake + Unity Catalog. The project implements a complete retail data warehouse from dual-source CRM/ERP ingestion to a star schema, covering 18,484 customers, 397 products, and 60,398 sales records, with complete data cleansing, type conversion, code mapping, and dimensional modeling.
14
+
15
+ Migrated code is in the `03_lakehouse/` directory, comparable file-by-file with `01_source/`.
16
+
17
+ ## Conclusion First
18
+
19
+ You don't need to rewrite any business logic or retrain your team. All 5 changes are mechanical substitutions.
20
+
21
+ | Change | Effort | Notes |
22
+ |--------|--------|------|
23
+ | Import path replacement | Very low | `pyspark.sql` → `clickzetta.zettapark`, global search-replace |
24
+ | Session acquisition method | Very low | `spark` (global injection) → `clickzetta_dbutils` in Studio tasks, `Session.builder.configs({...})` for local use |
25
+ | Table path prefix | Very low | `workspace.bronze.X` → `bronze.X`, remove catalog prefix |
26
+ | StructField casing | Very low | `field.dataType` → `field.datatype` (ZettaPark API difference) |
27
+ | Orchestration method | Low | `dbutils.notebook.run(nb)` → Studio task dependencies (DAG) |
28
+
29
+ `select`, `filter`, `join`, `withColumn`, `when`, `coalesce`, `trim`, `regexp_replace`, `to_date`, `cast`, `isNotNull`, Window functions, `ROW_NUMBER()` — these core data engineering operations have identical syntax and require no changes.
30
+
31
+ ---
32
+
33
+ ## Tech Stack Comparison
34
+
35
+ | | Databricks Notebook | Lakehouse Studio Task |
36
+ |---|---|---|
37
+ | Compute engine | Apache Spark (Databricks) | Singdata Lakehouse |
38
+ | DataFrame API | PySpark (`pyspark.sql`) | ZettaPark (`clickzetta.zettapark`) |
39
+ | Session acquisition | `spark` (Databricks global injection) | `clickzetta_dbutils.get_active_lakehouse_engine()` |
40
+ | Table naming | `workspace.bronze.crm_cust_info` (3-level) | `bronze.crm_cust_info` (2-level) |
41
+ | File path | `/Volumes/workspace/bronze/raw_sources/...` | `vol://bronze.raw_sources/...` |
42
+ | StructField | `field.dataType` | `field.datatype` |
43
+ | SQL execution | `spark.sql(q)` executes immediately | `session.sql(q).collect()` triggers execution |
44
+ | Notebook chaining | `dbutils.notebook.run(nb, timeout_seconds=0)` | Studio task dependencies (`--deps` parameter) |
45
+ | Scheduling orchestration | Databricks Jobs | Studio task DAG |
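The file path row in the table above can be applied mechanically. The following is a minimal sketch, assuming the mapping is exactly `/Volumes/<catalog>/<schema>/<volume>/<rest>` → `vol://<schema>.<volume>/<rest>` as that row suggests. The helper name `to_volume_uri` and the sample file path are hypothetical and do not come from the project.

```python
def to_volume_uri(databricks_path: str) -> str:
    """Convert a Unity Catalog volume path to the vol:// form shown in the table.

    Assumption: the catalog segment is simply dropped and the schema and
    volume are joined with a dot.
    """
    prefix = "/Volumes/"
    if not databricks_path.startswith(prefix):
        raise ValueError(f"not a Unity Catalog volume path: {databricks_path!r}")

    # Split into catalog, schema, volume and the (optional) remaining path.
    parts = databricks_path[len(prefix):].split("/", 3)
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"expected /Volumes/<catalog>/<schema>/<volume>/...: {databricks_path!r}")

    _catalog, schema, volume = parts[:3]
    rest = parts[3] if len(parts) == 4 else ""
    return f"vol://{schema}.{volume}/{rest}"


# Hypothetical file name, used only for illustration.
print(to_volume_uri("/Volumes/workspace/bronze/raw_sources/source_crm/cust_info.csv"))
```

Keeping the conversion in one function means a path that does not match the expected layout fails loudly instead of being rewritten incorrectly.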
46
+
47
+ ---
48
+
49
+ ![](.topwrite/assets/anim-28-databricks-notebook-migration.svg)
50
+
51
+ ---
52
+
53
+ ## Project Background
54
+
55
+ The data comes from a bicycle retailer's two source systems, CRM and ERP, delivered as 6 CSV files:
56
+
57
+ | Data Source | Table Name | Row Count | Notes |
58
+ |--------|------|------|------|
59
+ | CRM | `crm_cust_info` | 18,494 | Customer info (includes dirty data and inconsistent casing) |
60
+ | CRM | `crm_prd_info` | 397 | Product info (includes category ID encoded in product key) |
61
+ | CRM | `crm_sales_details` | 60,398 | Sales transactions (includes yyyyMMdd format dates) |
62
+ | ERP | `erp_cust_az12` | 18,484 | Customer master data (includes NAS prefix, future dates) |
63
+ | ERP | `erp_loc_a101` | 18,484 | Customer addresses (includes hyphens and country code abbreviations) |
64
+ | ERP | `erp_px_cat_g1v2` | 37 | Product categories (includes YES/NO maintenance flags) |
65
+
66
+ Medallion architecture in three layers:
67
+
68
+ - **Bronze**: Raw CSV → 6 Delta tables (no transformation)
69
+ - **Silver**: Cleansing + normalization → 6 wide tables (trim, type conversion, enum mapping, column renaming)
70
+ - **Gold**: Star schema → dim_customers + dim_products + fact_sales
71
+
72
+ The original project has 14 Databricks Notebooks (81 code cells), chained via `dbutils.notebook.run()`.
73
+
74
+ ---
75
+
76
+ ## Migration Steps
77
+
78
+ ### Step 1: Replace Import Paths
79
+
80
+ Mechanical global replacement, no logic changes:
81
+
82
+ ```python
83
+ # Databricks
84
+ import pyspark.sql.functions as F
85
+ from pyspark.sql.types import StringType, DateType
86
+ from pyspark.sql.functions import trim, col, length
87
+ from pyspark.sql.window import Window
88
+
89
+ # ZettaPark (only the package name changes)
90
+ from clickzetta.zettapark import functions as F
91
+ from clickzetta.zettapark.types import StringType, DateType
92
+ from clickzetta.zettapark.functions import trim, col, length
93
+ from clickzetta.zettapark.window import Window
94
+ ```
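The global replacement in Step 1 could be scripted rather than done by hand. The sketch below assumes that only the two import shapes shown above occur (`import pyspark.sql.<module> as <alias>` and `from pyspark.sql[.<module>] import ...`). The function name `rewrite_imports` is hypothetical, and session-related imports still need the manual change described in Step 2.

```python
import re

# `import pyspark.sql.functions as F` -> `from clickzetta.zettapark import functions as F`
_IMPORT_AS = re.compile(
    r"^([ \t]*)import[ \t]+pyspark\.sql\.(\w+)[ \t]+as[ \t]+(\w+)[ \t]*$",
    re.MULTILINE,
)

# `from pyspark.sql.types import ...` -> `from clickzetta.zettapark.types import ...`
_FROM_IMPORT = re.compile(
    r"^([ \t]*)from[ \t]+pyspark\.sql((?:\.\w+)*)[ \t]+import\b",
    re.MULTILINE,
)


def rewrite_imports(source: str) -> str:
    """Rewrite PySpark import lines to their ZettaPark equivalents.

    Only import statements are touched; DataFrame code is left as-is.
    """
    source = _IMPORT_AS.sub(r"\1from clickzetta.zettapark import \2 as \3", source)
    source = _FROM_IMPORT.sub(r"\1from clickzetta.zettapark\2 import", source)
    return source


notebook_cell = (
    "import pyspark.sql.functions as F\n"
    "from pyspark.sql.types import StringType, DateType\n"
    "from pyspark.sql.window import Window\n"
)
print(rewrite_imports(notebook_cell))
```

Anchoring the patterns to the start of a line keeps the rewrite away from strings or comments that merely mention `pyspark.sql`.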
95
+
96
+ ### Step 2: Replace Session Acquisition Method
97
+
98
+ Databricks injects `spark` into every Notebook, so no explicit creation is required. Studio tasks obtain a session through `clickzetta_dbutils`, likewise without managing passwords or connection parameters:
99
+
100
+ ```python
101
+ # Databricks: spark is globally available, use directly
102
+ df = spark.table("workspace.bronze.crm_cust_info")
103
+
104
+ # Studio task: platform-injected, get session via clickzetta_dbutils
105
+ from clickzetta_dbutils import get_active_lakehouse_engine
106
+ from clickzetta.zettapark.session import Session
107
+ from urllib.parse import urlparse, parse_qs
108
+
109
+ engine = get_active_lakehouse_engine(schema="quick_start")
110
+ url_str = str(engine.url)
111
+ parsed = urlparse(url_str.replace('clickzetta://', 'https://'))
112
+ params = parse_qs(parsed.query)
113
+ parts = parsed.hostname.split('.', 1)
114
+
115
+ session = Session.builder.configs({
116
+ "service": parts[1],
117
+ "instance": parts[0],
118
+ "magic_token": params['magic_token'][0],
119
+ "workspace": parsed.path.lstrip('/'),
120
+ "schema": params.get('schema', ['quick_start'])[0],
121
+ "vcluster": params.get('virtualcluster', ['DEFAULT'])[0],
122
+ }).getOrCreate()
123
+
124
+ # After this, session usage is identical to spark
125
+ df = session.table("bronze.crm_cust_info")
126
+ ```
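Because the URL parsing in Step 2 is repeated in every task, it may be worth isolating in a small pure function that can be tested without a Lakehouse connection. The sketch below reuses the same parsing logic as the block above. The function name `config_from_engine_url` and the sample URL are hypothetical, and the URL layout (`clickzetta://<instance>.<service>/<workspace>?magic_token=...`) is inferred from that block rather than from official documentation.

```python
from urllib.parse import urlparse, parse_qs


def config_from_engine_url(url_str: str,
                           default_schema: str = "quick_start",
                           default_vcluster: str = "DEFAULT") -> dict:
    """Build the dict passed to Session.builder.configs() from an engine URL."""
    # Swap the scheme so urlparse treats the URL like an ordinary https URL.
    parsed = urlparse(url_str.replace("clickzetta://", "https://", 1))
    params = parse_qs(parsed.query)

    if not parsed.hostname or "." not in parsed.hostname:
        raise ValueError("expected a host of the form <instance>.<service>")
    if "magic_token" not in params:
        raise ValueError("engine URL carries no magic_token parameter")

    # The first host label is the instance; the remainder is the service endpoint.
    instance, service = parsed.hostname.split(".", 1)
    return {
        "service": service,
        "instance": instance,
        "magic_token": params["magic_token"][0],
        "workspace": parsed.path.lstrip("/"),
        "schema": params.get("schema", [default_schema])[0],
        "vcluster": params.get("virtualcluster", [default_vcluster])[0],
    }


# Made-up URL for illustration only.
cfg = config_from_engine_url(
    "clickzetta://demoinst.api.example.com/my_ws?magic_token=tok123&schema=bronze"
)
print(cfg)
```

Inside a Studio task the result would be passed on as `Session.builder.configs(cfg)`, with the URL taken from `str(engine.url)` as in the block above.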
127
+
128
+ > 💡 **Note**: For local development and debugging, explicitly pass credentials with `Session.builder.configs({"instance":..., "password":..., ...}).create()`, without relying on `clickzetta_dbutils`. DataFrame code is identical after either acquisition method.
129
+
130
+ ### Step 3: Remove Catalog Prefix from Table Paths
131
+
132
+ Unity Catalog uses three-level naming (catalog.schema.table). Lakehouse only requires two levels within a single workspace:
133
+
134
+ ```python
135
+ # Databricks (Unity Catalog three-level)
136
+ df = spark.table("workspace.bronze.crm_cust_info")
137
+ df.write.mode("overwrite").format("delta").saveAsTable("workspace.silver.crm_customers")
138
+
139
+ # ZettaPark (two-level, remove catalog prefix)
140
+ df = session.table("bronze.crm_cust_info")
141
+ df.write.mode("overwrite").saveAsTable("silver.crm_customers") # .format("delta") not needed either
142
+ ```
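When table names are built from variables rather than written as literals, the prefix removal in Step 3 can be centralized. This is a minimal sketch, assuming the catalog is always `workspace` as in the examples above and that table names contain no quoted dots. The helper name `strip_catalog` is hypothetical.

```python
def strip_catalog(table_name: str, catalog: str = "workspace") -> str:
    """Turn `catalog.schema.table` into `schema.table`.

    Names that are already two-level, or that use a different catalog,
    are returned unchanged so the helper is safe to apply everywhere.
    """
    parts = table_name.split(".")
    if len(parts) == 3 and parts[0] == catalog:
        return ".".join(parts[1:])
    return table_name


print(strip_catalog("workspace.bronze.crm_cust_info"))  # bronze.crm_cust_info
print(strip_catalog("silver.crm_customers"))            # silver.crm_customers
```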
143
+
144
+ ### Step 4: Fix StructField Casing
145
+
146
+ This is an API difference discovered in testing, affecting all code that iterates field types with `df.schema.fields`:
147
+
148
+ ```python
149
+ # Databricks / PySpark
150
+ for field in df.schema.fields:
151
+     if isinstance(field.dataType, StringType): # uppercase T
152
+         df = df.withColumn(field.name, trim(col(field.name)))
153
+
154
+ # ZettaPark (datatype all lowercase)
155
+ for field in df.schema.fields:
156
+     if isinstance(field.datatype, StringType): # lowercase t
157
+         df = df.withColumn(field.name, trim(col(field.name)))
158
+ ```
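If the same cleansing code has to run on both platforms during the transition, the casing difference in Step 4 can be hidden behind a small accessor. The sketch below assumes the attribute name is the only difference between the two field objects. `field_type` is a hypothetical helper, and `SimpleNamespace` objects stand in for real `StructField` instances so the example runs without either library installed.

```python
from types import SimpleNamespace


def field_type(field):
    """Return a schema field's type under either attribute spelling."""
    for attr in ("datatype", "dataType"):  # ZettaPark spelling first, then PySpark
        if hasattr(field, attr):
            return getattr(field, attr)
    raise AttributeError("field has neither 'datatype' nor 'dataType'")


# Stand-ins for StructField objects from each library (illustration only).
zettapark_like = SimpleNamespace(name="cst_firstname", datatype="StringType")
pyspark_like = SimpleNamespace(name="cst_firstname", dataType="StringType")

print(field_type(zettapark_like), field_type(pyspark_like))
```

In the loop above, the condition would then read `isinstance(field_type(field), StringType)` on either side.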
159
+
160
+ ### Step 5: Replace dbutils.notebook.run() with Studio Task Dependencies
161
+
162
+ The original project chains 14 Notebooks via `dbutils.notebook.run()`, which maps directly to a Studio task DAG:
163
+
164
+ ```python
165
+ # Databricks: orchestration notebook
166
+ notebooks = [
167
+ "./silver_crm_cust_info",
168
+ "./silver_crm_prd_info",
169
+ "./silver_crm_sales_details",
170
+ "./silver_erp_cust_az12",
171
+ "./silver_erp_loc_a101",
172
+ "./silver_erp_px_cat_g1v2"
173
+ ]
174
+ for nb in notebooks:
175
+     dbutils.notebook.run(nb, timeout_seconds=0)
176
+ ```
177
+
178
+ ```bash
179
+ # Studio: set task dependencies with cz-cli, one-time configuration, platform handles scheduling
180
+ cz-cli task save-config bootcamp/silver_crm_cust_info \
181
+ --deps bootcamp/bronze_ingestion --profile aws_singapore_prod
182
+
183
+ cz-cli task save-config bootcamp/gold_dim_customers \
184
+ --deps bootcamp/silver_crm_cust_info,bootcamp/silver_erp_cust_az12,bootcamp/silver_erp_loc_a101 \
185
+ --profile aws_singapore_prod
186
+ ```
187
+
188
+ Execute the DAG (equivalent to the original orchestration notebook's chained calls):
189
+
190
+ ```bash
191
+ cz-cli task execute bootcamp/init_lakehouse --profile aws_singapore_prod
192
+ # → automatically triggers bronze → silver (parallel) → gold (in dependency order)
193
+ ```
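With more than a handful of tasks, the `save-config` commands can be generated from a dependency map instead of typed one by one. The sketch below only builds command strings and does not run them. It uses just the subcommand and flags that appear in the example above (`task save-config`, `--deps`, `--profile`). The function name `save_config_commands` is hypothetical, and the map holds only the two tasks whose dependencies are spelled out in that example.

```python
import shlex

# task -> upstream tasks, copied from the two cz-cli examples above
DEPS = {
    "silver_crm_cust_info": ["bronze_ingestion"],
    "gold_dim_customers": [
        "silver_crm_cust_info",
        "silver_erp_cust_az12",
        "silver_erp_loc_a101",
    ],
}


def save_config_commands(deps, folder="bootcamp", profile="aws_singapore_prod"):
    """Return one `cz-cli task save-config` command line per task with upstreams."""
    commands = []
    for task, upstream in deps.items():
        if not upstream:
            continue  # root tasks need no --deps entry
        dep_arg = ",".join(f"{folder}/{name}" for name in upstream)
        commands.append(shlex.join([
            "cz-cli", "task", "save-config", f"{folder}/{task}",
            "--deps", dep_arg,
            "--profile", profile,
        ]))
    return commands


for line in save_config_commands(DEPS):
    print(line)
```

Printing the commands first makes it easy to review them before pasting them into a shell.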
194
+
195
+ ---
196
+
197
+ ## Studio Task DAG After Migration
198
+
199
+ The original Databricks project used 2 orchestration notebooks chained in series. After migration, the Studio platform automatically manages dependencies and parallelism:
200
+
201
+ ```
202
+ init_lakehouse (SQL task)
203
+   └── bronze_ingestion (Python task)
204
+         ├── silver_crm_cust_info ← parallel execution
205
+         ├── silver_crm_prd_info ← parallel execution
206
+         ├── silver_crm_sales_details ← parallel execution
207
+         ├── silver_erp_cust_az12 ← parallel execution
208
+         ├── silver_erp_loc_a101 ← parallel execution
209
+         └── silver_erp_px_cat_g1v2 ← parallel execution
210
+               ├── gold_dim_customers ← depends on silver_crm_cust_info + erp_cust + erp_loc
211
+               ├── gold_dim_products ← depends on silver_crm_prd_info + erp_px_cat
212
+               └── gold_fact_sales ← depends on dim_customers + dim_products + silver_crm_sales
213
+ ```
214
+
215
+ The original Databricks project ran the 6 silver notebooks sequentially; Studio runs the 6 silver tasks in parallel. The logic is the same, but the silver stage now takes roughly as long as its slowest task instead of the sum of all six.
216
+
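The parallelism follows from the dependency graph alone. As a minimal sketch (plain Python with the standard-library `graphlib`, not the Studio scheduler itself), the dependencies listed in the tree above can be grouped into waves of tasks that are free to run at the same time. The `deps` dictionary is transcribed from the tree; treating every silver task as depending only on `bronze_ingestion` is an assumption.

```python
from graphlib import TopologicalSorter

SILVER = [
    "silver_crm_cust_info", "silver_crm_prd_info", "silver_crm_sales_details",
    "silver_erp_cust_az12", "silver_erp_loc_a101", "silver_erp_px_cat_g1v2",
]

# task -> set of upstream tasks it waits for (transcribed from the tree above;
# the silver -> bronze edges are an assumption)
deps = {
    "init_lakehouse": set(),
    "bronze_ingestion": {"init_lakehouse"},
    **{task: {"bronze_ingestion"} for task in SILVER},
    "gold_dim_customers": {"silver_crm_cust_info", "silver_erp_cust_az12", "silver_erp_loc_a101"},
    "gold_dim_products": {"silver_crm_prd_info", "silver_erp_px_cat_g1v2"},
    "gold_fact_sales": {"gold_dim_customers", "gold_dim_products", "silver_crm_sales_details"},
}


def execution_waves(graph):
    """Group tasks into waves; every task in a wave has all its upstreams finished."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)                 # pretend the whole wave finished
    return waves


for number, wave in enumerate(execution_waves(deps), start=1):
    print(number, wave)
```

Under these assumptions the graph resolves into five waves: `init_lakehouse`, `bronze_ingestion`, the six silver tasks together, the two gold dimension tasks together, and finally `gold_fact_sales`.
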
217
+ ---
218
+
219
+ ## Fully Compatible Parts
220
+
221
+ The following code has identical syntax on both sides, apart from the `field.datatype` spelling used in the first snippet (see Notes). It was validated by testing and needs no other changes:
222
+
223
+ ```python
224
+ # String cleansing (ZettaPark spells the attribute `field.datatype`; PySpark spells it `field.dataType`)
225
+ for field in df.schema.fields:
226
+     if isinstance(field.datatype, StringType):
227
+         df = df.withColumn(field.name, trim(col(field.name)))
228
+
229
+ # Enum mapping (conditional replacement)
230
+ df = df.withColumn("cst_marital_status",
231
+ F.when(F.upper(F.col("cst_marital_status")) == "S", "Single")
232
+ .when(F.upper(F.col("cst_marital_status")) == "M", "Married")
233
+ .otherwise("n/a"))
234
+
235
+ # Composite string parsing (extract category ID from product key)
236
+ df = df.withColumn("cat_id", F.regexp_replace(F.substring(col("prd_key"), 1, 5), "-", "_"))
237
+ df = df.withColumn("prd_key", F.substring(col("prd_key"), 7, F.length(col("prd_key"))))
238
+
239
+ # Date format conversion (yyyyMMdd integer → DATE)
240
+ df = df.withColumn("sls_order_dt",
241
+ F.when(
242
+ (col("sls_order_dt") == 0) | (length(col("sls_order_dt")) != 8), None
243
+ ).otherwise(F.to_date(col("sls_order_dt").cast("string"), "yyyyMMdd")))
244
+
245
+ # Conditional price fix (derive from sales/quantity when quantity != 0)
246
+ df = df.withColumn("sls_price",
247
+ F.when(
248
+ (col("sls_price").isNull()) | (col("sls_price") <= 0),
249
+ F.when(col("sls_quantity") != 0, col("sls_sales") / col("sls_quantity")).otherwise(None)
250
+ ).otherwise(col("sls_price")))
251
+
252
+ # Prefix cleansing (remove invalid NAS prefix)
253
+ df = df.withColumn("cid",
254
+ F.when(col("cid").startswith("NAS"), F.substring(col("cid"), 4, F.length(col("cid"))))
255
+ .otherwise(col("cid")))
256
+
257
+ # Future date filtering (dirty data cleansing)
258
+ df = df.withColumn("bdate",
259
+ F.when(col("bdate") > F.current_date(), None).otherwise(col("bdate")))
260
+
261
+ # Multi-table LEFT JOIN + ROW_NUMBER for dimension table
262
+ df = session.sql("""
263
+ SELECT
264
+ ROW_NUMBER() OVER (ORDER BY ci.customer_id) AS customer_key,
265
+ ci.customer_id, ci.customer_number, ci.first_name, ci.last_name,
266
+ COALESCE(la.country, 'n/a') AS country, ci.marital_status,
267
+ CASE WHEN ci.gender <> 'n/a' THEN ci.gender ELSE COALESCE(ca.gender, 'n/a') END AS gender,
268
+ ca.birth_date, ci.created_date
269
+ FROM silver.crm_customers ci
270
+ LEFT JOIN silver.erp_customer_location la ON ci.customer_number = la.customer_number
271
+ LEFT JOIN silver.erp_customers ca ON ci.customer_number = ca.customer_number
272
+ """)
273
+ ```
274
+
275
+ ---
276
+
277
+ ## E2E Validation Results
278
+
279
+ Tested on the AWS Singapore instance (`aws_singapore_prod`); all 20 checks passed:
280
+
281
+ | Check | Expected | Result |
282
+ |--------|--------|------|
283
+ | bronze.crm_cust_info | 18,494 | ✅ |
284
+ | bronze.crm_prd_info | 397 | ✅ |
285
+ | bronze.crm_sales_details | 60,398 | ✅ |
286
+ | bronze.erp_cust_az12 | 18,484 | ✅ |
287
+ | bronze.erp_loc_a101 | 18,484 | ✅ |
288
+ | bronze.erp_px_cat_g1v2 | 37 | ✅ |
289
+ | silver.crm_customers | 18,490 | ✅ |
290
+ | silver.crm_products | 397 | ✅ |
291
+ | silver.crm_sales | 60,398 | ✅ |
292
+ | silver.erp_customers | 18,484 | ✅ |
293
+ | silver.erp_customer_location | 18,484 | ✅ |
294
+ | silver.erp_product_category | 37 | ✅ |
295
+ | gold.dim_customers | 18,490 | ✅ |
296
+ | gold.dim_products | 397 | ✅ |
297
+ | gold.fact_sales | 89,833 | ✅ |
298
+ | Total sales amount | 43,538,800 | ✅ |
299
+ | Rows with negative sales | 5 (raw data) | ✅ |
300
+ | Distinct customer count | 18,484 | ✅ |
301
+ | Distinct product SKUs | 295 | ✅ |
302
+ | Rows with null order date | 24 (raw data format issue) | ✅ |
303
+
304
+ > Bronze 18,494 → Silver 18,490: 4 records with NULL customer_id were filtered during Silver cleansing, consistent with original Databricks behavior.
305
+
306
+ ---
307
+
308
+ ## Three Migration Options Compared
309
+
310
+ This project provides three options suited to different team backgrounds and deployment needs:
311
+
312
+ | Option | File Location | Session Method | Change Volume | Use Case |
313
+ |------|----------|-------------|--------|----------|
314
+ | **A. ZettaPark Local** | `03_lakehouse/{bronze,silver,gold}/` | `Session.builder.configs({}).create()` | ~5% | Local development, CI/CD debugging |
315
+ | **B. Pure SQL** | `03_lakehouse/sql/` | Not needed | Full rewrite (logic unchanged) | SQL-first teams, Studio SQL tasks |
316
+ | **C. Studio Task** | `03_lakehouse/tasks/` | `clickzetta_dbutils` | ~5% | **Production deployment (recommended)** |
317
+
318
+ All three options produce identical results — row counts, metrics, and data are consistent. Option C is the recommended path for production, directly mapping to the Databricks Notebook + Jobs architecture.
319
+
320
+ ## Notes
321
+
322
+ - **`field.datatype` casing**: ZettaPark's `StructField` property is `datatype` (all lowercase); PySpark uses `dataType` (camelCase). This is the only non-mechanical API difference. Search for all uses of `schema.fields` and update individually.
323
+ - **`clickzetta_dbutils` template for Studio tasks**: The `get_active_lakehouse_engine()` call returns the current task's connection info; parse the URL to build the Session. This template code is fixed and can be extracted to a shared module for reuse.
324
+ - **Volume path format**: Databricks `/Volumes/catalog/schema/volume/path` → ZettaPark `vol://schema.volume/path`, note the removal of the catalog level.
325
+ - **`.format("delta")` not needed**: ZettaPark's `saveAsTable()` writes in Lakehouse native format by default; no explicit format specification required.
326
+ - **Silver row count difference (18,494 → 18,490)**: The CRM source data contains 4 records with NULL customer_id, filtered during Silver cleansing. This is expected behavior.
327
+
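The volume path rule from the notes above can be applied mechanically. The helper below is a small sketch of that mapping, not part of ZettaPark; the function name `to_vol_uri` and the example path are hypothetical.

```python
def to_vol_uri(databricks_path: str) -> str:
    """Map /Volumes/<catalog>/<schema>/<volume>/<path> to vol://<schema>.<volume>/<path>."""
    parts = databricks_path.strip("/").split("/")
    if len(parts) < 4 or parts[0] != "Volumes":
        raise ValueError(f"not a Unity Catalog volume path: {databricks_path!r}")
    # The catalog level is dropped; schema and volume are joined with a dot.
    _root, _catalog, schema, volume, *rest = parts
    uri = f"vol://{schema}.{volume}"
    return f"{uri}/{'/'.join(rest)}" if rest else uri


# Hypothetical example path
print(to_vol_uri("/Volumes/main/bronze/landing/crm/cust_info.csv"))
# → vol://bronze.landing/crm/cust_info.csv
```
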
328
+ ## Related Documentation
329
+
330
+ ### ZettaPark DataFrame API
331
+
332
+ - [ZettaPark Quick Start](zettapark-quick-start.md): Session creation, DataFrame basics
333
+ - [ZettaPark DataFrame API Guide](zettapark-dataframe-guide.md): Full reference for select, filter, join, withColumn, etc.
334
+ - [ZettaPark Common Functions Reference](zettapark-functions-guide.md): Usage of trim, when, regexp_replace, Window, etc.
335
+ - [ZettaPark Data Engineering in Practice](zettapark-etl-guide.md): Complete ETL pipeline examples
336
+
337
+ ### Studio Tasks and Orchestration
338
+
339
+ - [Lakehouse Studio Introduction](lakehouse-studio-concept.md): Studio platform concepts and architecture
340
+ - [Task Development and Scheduling](task-develop.md): Creating, developing, and scheduling Studio tasks
341
+ - [Task Scheduling Dependencies](task_scheduling_dependency.md): `--deps` parameter for configuring task DAG
342
+ - [Python Task](python-task.md): Creating and running Python tasks in Studio
343
+ - [Studio Task Development and Operations (cz-cli)](cz-cli-studio-tasks.md): Managing Studio tasks via cz-cli command line
344
+
345
+ ### Other Migration Guides
346
+
347
+ - [PySpark → ZettaPark Migration Guide (Formula 1)](pyspark-to-zettapark-migration-f1.md): PySpark DataFrame API query-by-query migration
348
+ - [Snowpark → ZettaPark Migration Guide](snowflake-snowpark-to-zettapark-migration.md): Snowflake Python DataFrame migration
349
+ - [Spark Migration Guide](spark-migration-guide.md): Spark ecosystem migration overview and common issues
350
+ - [SQL Compatibility Reference](migration-sql-compatibility.md): Cross-platform SQL syntax differences
@@ -0,0 +1,327 @@
1
+ # Databricks Unity Catalog → Lakehouse Migration Guide: Permissions and Governance
2
+
3
+ If your data platform uses Databricks Unity Catalog to manage permissions (RBAC roles, column-level masking, row-level access control, data auditing), the migration effort to Singdata Lakehouse is lower than you might expect. GRANT/REVOKE syntax is fully consistent, role management requires no changes, and the `SET MASK` and `SET ROW FILTER` statements are identical. Changes are concentrated in 3 areas: replacing the catalog prefix with the workspace (in practice, dropping it and using two-level names inside the workspace), swapping one API in masking and row-filter functions (`is_account_group_member` → `array_contains(current_roles(), ...)`), and adapting audit queries to a different system table and column names.
4
+
5
+ This article validates this with a financial payment scenario: a user table with PII fields (email/phone/card), an orders table with region-based isolation, and a balance/accounts table — complete migration of RBAC, column masking, row-level security, and audit logs, passing all 16 automated validations.
6
+
7
+ Full code on GitHub: [databricks2lakehouse-governance](https://github.com/clickzetta/databricks2lakehouse-governance)
8
+
9
+ ---
10
+
11
+ ## Source Project
12
+
13
+ Demo data is a financial payment scenario with 3 tables:
14
+
15
+ | Table | Rows | Sensitive Fields | Access Control Requirements |
16
+ |----|------|---------|------------|
17
+ | `users` | 100 | `email`, `phone`, `ssn_last4` | Masking: non-admin sees masked values |
18
+ | `orders` | 300 | `amount`, `region` | Row-level: analyst only sees North America |
19
+ | `accounts` | 50 | `balance`, `card_number` | Masking: last 4 digits of card visible |
20
+
21
+ Migrated code is in the `03_lakehouse/sql/` directory (6 SQL files), comparable file-by-file with `01_source/sql/` (original UC SQL).
22
+
23
+ ## Conclusion First
24
+
25
+ Most permission SQL can be reused directly with **very few or zero changes**.
26
+
27
+ | Change | Effort | Notes |
28
+ |--------|--------|------|
29
+ | Membership check in masking and row-filter functions | Very low | `is_account_group_member('g')` → `array_contains(current_roles(), 'ws.g')` |
30
+ | Audit table and column names | Low | `system.access.audit` → `sys.information_schema.job_history`, different column names |
31
+
32
+ **No changes needed**: `CREATE ROLE`, `GRANT SELECT`, `GRANT ALL PRIVILEGES`, `WITH GRANT OPTION`, `SHOW GRANTS`, `SET MASK`, `SET ROW FILTER`, and **three-level naming (workspace.schema.table)**. The syntax is fully consistent. For row-level security, the `ALTER TABLE ... SET ROW FILTER` statement migrates unchanged; only the filter function body needs the membership-check swap listed above.
33
+
34
+ > 💡 Namespacing: UC uses `catalog.schema.table`; Lakehouse uses `workspace.schema.table` — the structure is identical, just replacing the catalog name with the workspace name (a connection parameter change, no SQL code changes needed).
35
+
36
+ Lakehouse also has additional capabilities that UC lacks: `mask_inner`/`mask_outer` (built-in string masking functions), `AI_MASK` (AI model semantic masking), `CREATE SHARE` (data sharing, SQL DDL simpler than Delta Sharing).
37
+
38
+ ---
39
+
40
+ ## Tech Stack Comparison
41
+
42
+ | | Databricks Unity Catalog | Singdata Lakehouse |
43
+ |---|---|---|
44
+ | Namespace levels | Three-level: `catalog.schema.table` | Three-level: `workspace.schema.table` (same structure) |
45
+ | Role management | `CREATE ROLE` (metastore-level) | `CREATE ROLE` (workspace-level) |
46
+ | GRANT syntax | `GRANT ... ON TABLE cat.s.t TO ROLE r` | `GRANT ... ON TABLE s.t TO ROLE r` |
47
+ | Column masking role check | `is_account_group_member('group')` | `array_contains(current_roles(), 'ws.role')` |
48
+ | Column masking application | `ALTER TABLE ... ALTER COLUMN c SET MASK f` | `ALTER TABLE ... ALTER COLUMN c SET MASK f` |
49
+ | Row-level security | `ALTER TABLE ... SET ROW FILTER f ON (col)` | `ALTER TABLE ... SET ROW FILTER f ON (col)` |
50
+ | Data sharing | Delta Sharing (UC + REST API) | `CREATE SHARE` / `GRANT select, read metadata ... TO SHARE` |
51
+ | String masking | Hand-written REGEXP_REPLACE | `mask_inner()` / `mask_outer()` (built-in functions) |
52
+ | AI intelligent masking | — | `AI_MASK()` (Lakehouse exclusive) |
53
+ | Tag / ABAC | UC Tag policies (cross-table tagging) | Not yet supported → use role naming + Schema isolation as substitute |
54
+ | Audit source | `system.access.audit` (account-level) | `sys.information_schema.job_history` (workspace-level) |
55
+
56
+ ---
57
+
58
+ ![](.topwrite/assets/anim-31-databricks-uc-governance-migration.svg)
59
+
60
+ ---
61
+
62
+ ## Project Background
63
+
64
+ Three tables covering typical financial scenarios:
65
+
66
+ - `users`: User registration info with PII fields like email/phone/ssn_last4, requiring column-level masking
67
+ - `orders`: Sales orders partitioned by region (North America/Europe/Asia Pacific), requiring row-level isolation
68
+ - `accounts`: Account balances and bank card info; balance and card_number are highly sensitive
69
+
70
+ The original UC design defines 3 roles:
71
+ - `payments_admin`: Full access to real data
72
+ - `payments_analyst`: Masked data + only North America orders
73
+ - `payments_viewer`: Masked data + only Asia orders
74
+
75
+ ---
76
+
77
+ ## Migration Steps
78
+
79
+ ### Step 1: Remove Catalog Prefix
80
+
81
+ UC uses three-level naming. In Lakehouse the workspace takes the place of the catalog and is normally supplied by the connection, so SQL run inside a workspace needs only two levels:
82
+
83
+ ```sql
84
+ -- UC (three-level)
85
+ GRANT SELECT ON TABLE payments_catalog.raw.orders TO ROLE payments_analyst;
86
+
87
+ -- Lakehouse (two-level) — remove catalog prefix
88
+ GRANT SELECT ON TABLE gov_raw.orders TO ROLE payments_analyst;
89
+ ```
90
+
91
+ ### Step 2: RBAC — Syntax Fully Consistent
92
+
93
+ GRANT/REVOKE/SHOW GRANTS syntax requires zero changes:
94
+
95
+ ```sql
96
+ -- Identical on both sides
97
+ CREATE ROLE IF NOT EXISTS payments_analyst;
98
+ CREATE ROLE IF NOT EXISTS payments_admin;
+ CREATE ROLE IF NOT EXISTS payments_viewer;
99
+
100
+ GRANT SELECT ON TABLE gov_raw.orders TO ROLE payments_analyst;
101
+ GRANT ALL PRIVILEGES ON SCHEMA gov_raw TO ROLE payments_admin;
102
+ GRANT SELECT ON ALL TABLES IN SCHEMA gov_raw TO ROLE payments_viewer;
103
+ GRANT SELECT ON TABLE gov_raw.users TO ROLE payments_admin WITH GRANT OPTION;
104
+
105
+ SHOW GRANTS TO ROLE payments_analyst;
106
+ ```
107
+
108
+ ### Step 3: Column Masking — `SET MASK` Identical, One Function API Change
109
+
110
+ UC uses `is_account_group_member()`; Lakehouse uses `array_contains(current_roles(), 'workspace.role')`:
111
+
112
+ ```sql
113
+ -- UC masking function
114
+ CREATE FUNCTION payments_catalog.raw.mask_email(email STRING)
115
+ RETURN CASE
116
+ WHEN is_account_group_member('payments_admin') THEN email
117
+ ELSE CONCAT(LEFT(email, 2), '***@***.***')
118
+ END;
119
+ ```
120
+
121
+ ```sql
122
+ -- Lakehouse masking function (only the role-membership check changes; the masking logic is identical)
123
+ CREATE OR REPLACE FUNCTION gov_raw.mask_email(email STRING)
124
+ RETURNS STRING
125
+ RETURN CASE
126
+ WHEN array_contains(current_roles(), 'quick_start.payments_admin') -- ← current_roles()
127
+ OR array_contains(current_roles(), 'workspace_admin')
128
+ THEN email
129
+ ELSE CONCAT(LEFT(email, 2), '***@***.***')
130
+ END;
131
+
132
+ -- SET MASK application syntax is identical
133
+ ALTER TABLE gov_raw.users ALTER COLUMN email SET MASK gov_raw.mask_email;
134
+ ALTER TABLE gov_raw.users ALTER COLUMN phone SET MASK gov_raw.mask_phone;
135
+ ALTER TABLE gov_raw.accounts ALTER COLUMN card_number SET MASK gov_raw.mask_card;
136
+ ```
137
+
138
+ Tested masking results (seen by non-admin users):
139
+
140
+ | Field | Original Value | After Masking |
141
+ |------|--------|--------|
142
+ | email | `user001@example.com` | `us***@***.***` |
143
+ | phone | `+12345678901` | `****8901` |
144
+ | card_number | `4866524041574` | `****-****-****-1574` |
145
+
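When many masking functions need the same swap, the replacement can be scripted. The following is a minimal sketch using a regular expression; the helper name `rewrite_group_checks` is hypothetical, and it assumes each UC group name maps to a Lakehouse role of the same name in a single workspace. It does not add the extra `workspace_admin` branch shown in the Lakehouse function above, so review the output before applying it.

```python
import re

# Matches is_account_group_member('<group>') with optional whitespace inside the parentheses.
_GROUP_CHECK = re.compile(r"is_account_group_member\(\s*'([^']+)'\s*\)", re.IGNORECASE)


def rewrite_group_checks(sql: str, workspace: str) -> str:
    """Replace each UC group check with the workspace-prefixed current_roles() form."""
    return _GROUP_CHECK.sub(
        lambda m: f"array_contains(current_roles(), '{workspace}.{m.group(1)}')",
        sql,
    )


uc_line = "WHEN is_account_group_member('payments_admin') THEN email"
print(rewrite_group_checks(uc_line, "quick_start"))
# → WHEN array_contains(current_roles(), 'quick_start.payments_admin') THEN email
```
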
146
+ ### Step 4: Row-Level Security — `SET ROW FILTER` Syntax Fully Consistent
147
+
148
+ Lakehouse natively supports `SET ROW FILTER`. The `ALTER TABLE ... SET ROW FILTER` statement is **identical to UC**; the only change is the membership check inside the filter function:
149
+
150
+ ```sql
151
+ -- UC (Databricks)
152
+ CREATE FUNCTION payments_catalog.raw.filter_orders_by_region(region STRING)
153
+ RETURN CASE
154
+ WHEN is_account_group_member('payments_admin') THEN TRUE
155
+ WHEN is_account_group_member('payments_analyst')
156
+ AND region = 'North America' THEN TRUE
157
+ ELSE FALSE
158
+ END;
159
+
160
+ ALTER TABLE payments_catalog.raw.orders
161
+ SET ROW FILTER payments_catalog.raw.filter_orders_by_region ON (region);
162
+ ```
163
+
164
+ ```sql
165
+ -- Lakehouse — syntax fully consistent, only change is is_account_group_member in function body
166
+ CREATE OR REPLACE FUNCTION gov_raw.filter_orders_by_role(region STRING)
167
+ RETURNS BOOLEAN
168
+ AS
169
+ array_contains(current_roles(), 'quick_start.payments_admin')
170
+ OR array_contains(current_roles(), 'workspace_admin')
171
+ OR (array_contains(current_roles(), 'quick_start.payments_analyst')
172
+ AND region = 'North America')
173
+ OR (array_contains(current_roles(), 'quick_start.payments_viewer')
174
+ AND region = 'Asia');
175
+
176
+ -- SET ROW FILTER syntax is identical
177
+ ALTER TABLE gov_raw.orders SET ROW FILTER gov_raw.filter_orders_by_role ON (region);
178
+
179
+ -- Verify binding
180
+ DESC EXTENDED gov_raw.orders;
181
+ -- Output includes # Row Filter section at the end
182
+
183
+ -- Remove
184
+ ALTER TABLE gov_raw.orders DROP ROW FILTER;
185
+ ```
186
+
187
+ ROW FILTER applies to SELECT, COUNT, UPDATE, and DELETE. In testing, the current user (workspace_admin) sees all 300 rows, while a user with the `payments_analyst` role sees only the 52 North America rows.
188
+
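To reason about which rows each role sees before deploying the filter, the predicate can be mirrored in plain Python. This is only an illustration of the boolean logic in `gov_raw.filter_orders_by_role` above, not Lakehouse code; the `current_roles` argument stands in for the array returned by `current_roles()`.

```python
WORKSPACE = "quick_start"  # workspace name used in the article's examples


def filter_orders_by_role(current_roles, region):
    """Python mirror of the SQL filter: True means the row is visible."""
    def has(role):
        return role in current_roles

    return (
        has(f"{WORKSPACE}.payments_admin")
        or has("workspace_admin")
        or (has(f"{WORKSPACE}.payments_analyst") and region == "North America")
        or (has(f"{WORKSPACE}.payments_viewer") and region == "Asia")
    )


regions = ["North America", "Europe", "Asia"]
analyst = ["quick_start.payments_analyst"]
print([r for r in regions if filter_orders_by_role(analyst, r)])
# → ['North America']
```

Passing a role name without the workspace prefix (for example `["payments_admin"]`) makes every branch false, which is the behavior the Notes section warns about.
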
189
+ ### Step 5: Auditing — Table and Column Name Adaptation
190
+
191
+ UC uses `system.access.audit` (account-level); Lakehouse uses `sys.information_schema.job_history` (workspace-level):
192
+
193
+ ```sql
194
+ -- UC auditing
195
+ SELECT event_time, user_name, action_name, request_params
196
+ FROM system.access.audit
197
+ WHERE event_time >= current_timestamp() - INTERVAL 7 DAYS
198
+ AND action_name IN ('createTable','grantPermission','revokePermission');
199
+ ```
200
+
201
+ ```sql
202
+ -- Lakehouse auditing (different column names, date uses literal string)
203
+ SELECT
204
+ job_type,
205
+ job_creator, -- UC: user_name
206
+ LEFT(job_text, 100) AS sql_preview, -- UC: request_params
207
+ status,
208
+ LEFT(start_time, 19) AS time
209
+ FROM sys.information_schema.job_history
210
+ WHERE start_time >= '2026-06-07' -- must be literal, cannot use NOW() - INTERVAL
211
+ AND (UPPER(job_text) LIKE 'GRANT%'
212
+ OR UPPER(job_text) LIKE 'REVOKE%'
213
+ OR UPPER(job_text) LIKE 'CREATE TABLE%')
214
+ ORDER BY start_time DESC;
215
+ ```
216
+
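Because the `start_time` condition has to be a literal, a rolling window such as "the last 7 days" must be computed on the client before the query is sent. One way to do that is sketched below; the helper name `job_history_audit_sql` is hypothetical, and the column list is taken from the query above.

```python
from datetime import date, timedelta
from typing import Optional


def job_history_audit_sql(days_back: int = 7, today: Optional[date] = None) -> str:
    """Build the audit query with the cutoff rendered as a 'YYYY-MM-DD' literal."""
    cutoff = (today or date.today()) - timedelta(days=days_back)
    return (
        "SELECT job_type, job_creator, LEFT(job_text, 100) AS sql_preview, status\n"
        "FROM sys.information_schema.job_history\n"
        f"WHERE start_time >= '{cutoff.isoformat()}'\n"
        "  AND (UPPER(job_text) LIKE 'GRANT%'\n"
        "       OR UPPER(job_text) LIKE 'REVOKE%'\n"
        "       OR UPPER(job_text) LIKE 'CREATE TABLE%')\n"
        "ORDER BY start_time DESC;"
    )


# Fixed "today" so the example is reproducible
print(job_history_audit_sql(days_back=7, today=date(2026, 6, 14)))
```

With `today` fixed to 2026-06-14, the generated filter is `start_time >= '2026-06-07'`, the same literal used in the query above.
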
217
+ ---
218
+
219
+ ## Data Sharing
220
+
221
+ Lakehouse's **Share** is the equivalent of Databricks Delta Sharing — share tables or views with other workspaces or external consumers, with syntax highly similar to UC:
222
+
223
+ ```sql
224
+ -- Step 1: Create Share object
225
+ CREATE SHARE IF NOT EXISTS analytics_share
226
+ COMMENT 'Analytics data share for partners';
227
+
228
+ -- Step 2: Add tables/views to Share (dual permissions: select + read metadata)
229
+ GRANT select, read metadata ON VIEW gov_marts.orders_by_region TO SHARE analytics_share;
230
+ GRANT select, read metadata ON TABLE gov_raw.orders TO SHARE analytics_share;
231
+
232
+ -- Bulk add all tables in a schema
233
+ GRANT SELECT, READ METADATA ON ALL TABLES IN SCHEMA gov_raw TO SHARE analytics_share;
234
+
235
+ -- Step 3: View Share contents
236
+ DESC SHARE analytics_share;
237
+
238
+ -- Step 4: Consumer access
239
+ CREATE SCHEMA partner_schema FROM SHARE analytics_share;
240
+ ```
241
+
242
+ The `SELECT, READ METADATA` pair works as follows: `select` allows querying data, and `read metadata` lets consumers discover that the tables or views exist. The two are typically granted together. UC Delta Sharing configuration is done via the Databricks UI/REST API; Lakehouse Share uses SQL DDL directly, which is more concise.
243
+
244
+ ---
245
+
246
+ ## Lakehouse-Exclusive Capabilities
247
+
248
+ **`mask_inner` / `mask_outer`**: Built-in string masking functions not available in UC, usable directly in masking functions without an AI connection:
249
+
250
+ ```sql
251
+ -- mask_inner: mask middle characters (preserve first n and last m characters)
252
+ SELECT mask_inner('user001@example.com', 2, 7);
253
+ -- → usXXXXXXXXXXXXX.com (preserve first 2, last 7, mask middle)
254
+
255
+ -- mask_outer: mask outer characters
256
+ SELECT mask_outer('+12345678901', 1, 4);
257
+ -- → X1234567XXXX
258
+
259
+ -- More concise in masking functions than hand-written CONCAT+REGEXP_REPLACE
260
+ CREATE OR REPLACE FUNCTION gov_raw.mask_email_v2(email STRING)
261
+ RETURNS STRING
262
+ RETURN CASE
263
+ WHEN array_contains(current_roles(), 'quick_start.payments_admin') THEN email
264
+ ELSE mask_inner(email, 2, 7)
265
+ END;
266
+ ```
267
+
268
+ **`AI_MASK`**: AI model-based semantic masking, automatically identifies PII types (names, phones, ID numbers, etc.). UC has no equivalent:
269
+
270
+ ```sql
271
+ -- Requires AI connection configuration (Bailian/Tongyi models etc.)
272
+ SELECT AI_MASK('conn_bailian:qwen3.5-plus',
273
+ 'User Wang Xiaoming, phone: 13800138000',
274
+ ARRAY('Name', 'Phone'));
275
+ -- → 'User ***, phone: 138****8000'
276
+ ```
277
+
278
+ ---
279
+
280
+ ## Validation Results
281
+
282
+ Tested on the AWS Singapore instance (`aws_singapore_prod`); all 16 checks passed:
283
+
284
+ | Check | Expected | Result |
285
+ |--------|--------|------|
286
+ | gov_raw.users | 100 | ✅ |
287
+ | gov_raw.orders | 300 | ✅ |
288
+ | gov_raw.accounts | 50 | ✅ |
289
+ | analyst has GRANT records | ≥2 | ✅ |
290
+ | admin has GRANT records | ≥1 | ✅ |
291
+ | viewer has GRANT records | ≥1 | ✅ |
292
+ | email is masked | True | ✅ |
293
+ | phone is masked | True | ✅ |
294
+ | card_number is masked | True | ✅ |
295
+ | mask_email function exists | True | ✅ |
296
+ | mask_phone function exists | True | ✅ |
297
+ | mask_card function exists | True | ✅ |
298
+ | orders_by_region admin sees 300 rows | 300 | ✅ |
299
+ | orders_analyst_view only sees North America | 52 | ✅ |
300
+ | analyst_view has only 1 region | 1 | ✅ |
301
+ | job_history has records | ≥1 | ✅ |
302
+
303
+ ---
304
+
305
+ ## Notes
306
+
307
+ - **`current_roles()` includes workspace prefix**: Lakehouse's `current_roles()` returns role names with workspace prefix (e.g., `quick_start.payments_admin`). When matching, the full prefix is required — `payments_admin` alone will not work.
308
+ - **`SET MASK` becomes invalid after table recreation**: After DROP TABLE + CREATE, column masks are not inherited automatically. `ALTER TABLE ... ALTER COLUMN c SET MASK` must be re-applied.
309
+ - **COPY INTO type inference**: External Volume COPY INTO does not support `inferSchema`, so column types must be declared explicitly when creating tables. Phone and card columns in particular should be STRING: a numeric type such as BIGINT cannot preserve a leading `+` or leading zeros, so the original values are lost.
310
+ - **job_history date condition**: `WHERE start_time >= '2026-06-07'` must be a literal string. `CURRENT_DATE()` or `INTERVAL` expressions are not supported.
311
+ - **Membership check in row-level security functions**: The ROW FILTER function body also requires replacing `is_account_group_member('g')` with `array_contains(current_roles(), 'ws.g')`. This is the only change. The `ALTER TABLE ... SET ROW FILTER` / `DROP ROW FILTER` syntax is identical.
312
+ - **Tag / ABAC (honest limitation)**: UC supports fine-grained attribute-based access control (ABAC) via Tags — tagging tables/columns and then writing policies. Lakehouse does not currently support Tag-driven ABAC. The recommended approach is to use role naming conventions (e.g., `dept_finance_analyst`) + Schema isolation to simulate ABAC, or keep tag-based permissions managed on the Databricks side.
313
+
314
+ ## Related Documentation
315
+
316
+ ### Permissions and Security
317
+
318
+ - [GRANT](GRANT.md): Full GRANT syntax reference
319
+ - [CREATE ROLE](create-role.md): Role creation and management
320
+ - [SET MASK](set-mask.md): Column masking syntax
321
+ - [SHOW GRANTS](show-grants.md): View authorization relationships
322
+
323
+ ### Other Migration Guides
324
+
325
+ - [Databricks Notebook → Lakehouse Migration Guide](databricks-notebook-to-studio-migration.md)
326
+ - [Databricks DLT → Lakehouse Migration Guide](databricks-dlt-to-lakehouse-migration.md)
327
+ - [dbt-databricks → dbt-clickzetta Migration Guide](dbt-databricks-to-clickzetta-migration.md)