@clickzetta/cz-cli-darwin-x64 0.5.16 → 0.5.17
- package/bin/cz-cli +0 -0
- package/bin/skills/lakehouse-doc-en/SKILL.md +6 -11
- package/bin/skills/lakehouse-doc-en/references/AIGateway.md +58 -13
- package/bin/skills/lakehouse-doc-en/references/Computation.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/DataSource_Amazon_DocumentDB.md +3 -1
- package/bin/skills/lakehouse-doc-en/references/Foreach.md +14 -14
- package/bin/skills/lakehouse-doc-en/references/JDBC-Driver.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/LakehouseAI-overview.md +21 -8
- package/bin/skills/lakehouse-doc-en/references/LakehouseDataGPT-tour.md +4 -9
- package/bin/skills/lakehouse-doc-en/references/LakehouseStudio-tour.md +14 -19
- package/bin/skills/lakehouse-doc-en/references/Lakehouse_Zilliz_MakeDataReadyforBIandAI.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/Logstash.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/Migrate_Spark_DataEngineeringBestPractices_Project_to_Lakehouse.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/Notebook.md +17 -17
- package/bin/skills/lakehouse-doc-en/references/RemoteFunction-as-udf.md +14 -14
- package/bin/skills/lakehouse-doc-en/references/SQL_External_Catalog_Guide.md +1 -9
- package/bin/skills/lakehouse-doc-en/references/SUMMARY.md +59 -29
- package/bin/skills/lakehouse-doc-en/references/WINDOWFUNCTION.md +99 -57
- package/bin/skills/lakehouse-doc-en/references/Zettapark_Data_Engineering_Demo.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/access-control-configuration.md +1 -8
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-2-5-1.0.md +16 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-3-29-1.0.2.md +14 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-3-8-1.0.1.md +16 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-4-28-1.1.md +29 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-12-1.1.1.md +18 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-15-1.2.md +9 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-21-1.3.md +9 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-28-1.4.md +10 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-6-3-1.5.md +9 -0
- package/bin/skills/lakehouse-doc-en/references/alicloud-arn-externalid.md +0 -5
- package/bin/skills/lakehouse-doc-en/references/answer-accuracy-improve.md +120 -103
- package/bin/skills/lakehouse-doc-en/references/application-list.md +1 -3
- package/bin/skills/lakehouse-doc-en/references/approval-list.md +16 -17
- package/bin/skills/lakehouse-doc-en/references/batch-load-parquet-file-into-lakehouse.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/batch_sync.md +9 -9
- package/bin/skills/lakehouse-doc-en/references/batch_sync_Sop.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/batchloadparquetfileintoLakehouse.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/bulkloadv1-python-sdk.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/chart-auto-refresh-guide.md +12 -6
- package/bin/skills/lakehouse-doc-en/references/clickzetta-sample-data.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/code_approval.md +1 -5
- package/bin/skills/lakehouse-doc-en/references/composite_task.md +31 -42
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_environment_and_data_generate.md +6 -9
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_javasdk_bulkload_realtime.md +4 -10
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_kafka_realtime_sync.md +1 -10
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_local_file_into_table_by_studio.md +0 -6
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_batchload_public_network.md +0 -5
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_python_node.md +2 -7
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_realtime_cdc_public_network.md +13 -18
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_sql_insert.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/concepts.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/config-datasource.md +5 -7
- package/bin/skills/lakehouse-doc-en/references/connect-with-cli.md +116 -72
- package/bin/skills/lakehouse-doc-en/references/connect-with-cz-cli.md +151 -0
- package/bin/skills/lakehouse-doc-en/references/continue-job.md +9 -17
- package/bin/skills/lakehouse-doc-en/references/create-api-connection.md +315 -286
- package/bin/skills/lakehouse-doc-en/references/create-catalog-connection.md +1 -0
- package/bin/skills/lakehouse-doc-en/references/create-dynamic-table.md +4 -4
- package/bin/skills/lakehouse-doc-en/references/create-external-catalog.md +85 -22
- package/bin/skills/lakehouse-doc-en/references/create-table-ddl.md +45 -0
- package/bin/skills/lakehouse-doc-en/references/creating_alicloud_privatelinkendpoint.md +4 -6
- package/bin/skills/lakehouse-doc-en/references/creating_alicloud_privatelinkservice.md +4 -7
- package/bin/skills/lakehouse-doc-en/references/creating_tencentcloud_privatelinkendpoint.md +2 -7
- package/bin/skills/lakehouse-doc-en/references/creating_tencentcloud_privatelinkservice.md +1 -5
- package/bin/skills/lakehouse-doc-en/references/cz-cli-agent.md +15 -10
- package/bin/skills/lakehouse-doc-en/references/cz-cli-datasource.md +0 -8
- package/bin/skills/lakehouse-doc-en/references/cz-cli-sql.md +2 -45
- package/bin/skills/lakehouse-doc-en/references/cz-cli.md +53 -42
- package/bin/skills/lakehouse-doc-en/references/dashboard-version-management-guide.md +12 -4
- package/bin/skills/lakehouse-doc-en/references/data-integration-intro.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/data-integration.md +29 -27
- package/bin/skills/lakehouse-doc-en/references/data-load-summary.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/data-quality.md +25 -25
- package/bin/skills/lakehouse-doc-en/references/data-sharing.md +31 -54
- package/bin/skills/lakehouse-doc-en/references/data-sources.md +45 -45
- package/bin/skills/lakehouse-doc-en/references/data_catalog.md +23 -25
- package/bin/skills/lakehouse-doc-en/references/data_privacy.md +5 -2
- package/bin/skills/lakehouse-doc-en/references/data_sharing_between_accounts_guide.md +0 -4
- package/bin/skills/lakehouse-doc-en/references/data_visualization.md +4 -15
- package/bin/skills/lakehouse-doc-en/references/dataagent.md +39 -7
- package/bin/skills/lakehouse-doc-en/references/databricks-delta-to-lakehouse-migration.md +168 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-dlt-to-lakehouse-migration.md +331 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-external-catalog-practice.md +367 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-jobs-to-studio-migration.md +199 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-notebook-to-studio-migration.md +350 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-uc-governance-to-lakehouse-migration.md +327 -0
- package/bin/skills/lakehouse-doc-en/references/datagpt-model-config.md +34 -0
- package/bin/skills/lakehouse-doc-en/references/datagpt_data_source.md +50 -37
- package/bin/skills/lakehouse-doc-en/references/datagpt_introduction.md +55 -79
- package/bin/skills/lakehouse-doc-en/references/datagpt_quickstart.md +50 -64
- package/bin/skills/lakehouse-doc-en/references/datalake-acceleration.md +75 -2
- package/bin/skills/lakehouse-doc-en/references/dbt-databricks-to-clickzetta-migration.md +242 -0
- package/bin/skills/lakehouse-doc-en/references/dynamic-mask.md +30 -30
- package/bin/skills/lakehouse-doc-en/references/dynamic-table-bestpractice.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/dynamic-table-introduce.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/dynamic_table_summary.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/eco_integration/streamlit.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/eco_integration/superset.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/ecosystem-all.md +1 -3
- package/bin/skills/lakehouse-doc-en/references/ecosystem.md +145 -0
- package/bin/skills/lakehouse-doc-en/references/external-catalog-summary.md +33 -38
- package/bin/skills/lakehouse-doc-en/references/external-function-combo-practice.md +466 -0
- package/bin/skills/lakehouse-doc-en/references/f6fc6447ee.md +7 -9
- package/bin/skills/lakehouse-doc-en/references/federation-query.md +56 -6
- package/bin/skills/lakehouse-doc-en/references/finebi-mysql.md +2 -0
- package/bin/skills/lakehouse-doc-en/references/get-started-with-sample-data.md +10 -11
- package/bin/skills/lakehouse-doc-en/references/gitfolder.md +2 -3
- package/bin/skills/lakehouse-doc-en/references/grant-privileges.md +2 -0
- package/bin/skills/lakehouse-doc-en/references/iceberg-rest-catalog-databricks.md +166 -0
- package/bin/skills/lakehouse-doc-en/references/ide.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/if_else_task.md +59 -57
- package/bin/skills/lakehouse-doc-en/references/input_output.md +10 -7
- package/bin/skills/lakehouse-doc-en/references/jobprofile-bestpractices.md +60 -64
- package/bin/skills/lakehouse-doc-en/references/kafka-connection.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/key-concepts.md +146 -117
- package/bin/skills/lakehouse-doc-en/references/lakehouse-ai-gateway-cz-cli.md +317 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-ai-sql-analysis.md +345 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-dqc-guide.md +300 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-medallion-sql-dt-guide.md +543 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-multi-cloud-acceleration.md +274 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-multimodal-ai-pipeline.md +198 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-quick-experience_guide.md +49 -52
- package/bin/skills/lakehouse-doc-en/references/lakehouse-volume-pipe-acceleration-guide.md +380 -0
- package/bin/skills/lakehouse-doc-en/references/langchain-plug-installation.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/management.md +4 -9
- package/bin/skills/lakehouse-doc-en/references/medallion-lakehouse-from-scratch.md +2 -1
- package/bin/skills/lakehouse-doc-en/references/metrics_answer_build.md +58 -21
- package/bin/skills/lakehouse-doc-en/references/migrate-spark-data-engineering-best-practices-to-lakehouse.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/mindsdb.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/monitoring_and_alerting.md +65 -60
- package/bin/skills/lakehouse-doc-en/references/monitoring_item_specification.md +33 -33
- package/bin/skills/lakehouse-doc-en/references/multitable_batch_sync.md +16 -16
- package/bin/skills/lakehouse-doc-en/references/multitable_realtime_sync.md +65 -72
- package/bin/skills/lakehouse-doc-en/references/multitable_realtime_sync_sop.md +54 -52
- package/bin/skills/lakehouse-doc-en/references/navicat-mysql.md +2 -0
- package/bin/skills/lakehouse-doc-en/references/om-dynamic-table.md +71 -66
- package/bin/skills/lakehouse-doc-en/references/om-vcluster.md +2 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-create-session.md +79 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-generate-auth-token.md +63 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-overview.md +96 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-quick-start.md +286 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-response-guide.md +264 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-safe-question-poll.md +201 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-text2insight-query.md +99 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-text2insight-stop.md +74 -0
- package/bin/skills/lakehouse-doc-en/references/overview.md +6 -7
- package/bin/skills/lakehouse-doc-en/references/permission-application.md +5 -5
- package/bin/skills/lakehouse-doc-en/references/pipe-introduction.md +1 -0
- package/bin/skills/lakehouse-doc-en/references/pipe-kafka-table-stream.md +72 -70
- package/bin/skills/lakehouse-doc-en/references/pipe-kafka.md +105 -110
- package/bin/skills/lakehouse-doc-en/references/pipe-overview.md +40 -40
- package/bin/skills/lakehouse-doc-en/references/pipe-storage-object.md +43 -48
- package/bin/skills/lakehouse-doc-en/references/pipe-summary.md +14 -4
- package/bin/skills/lakehouse-doc-en/references/pipe-syntax.md +58 -151
- package/bin/skills/lakehouse-doc-en/references/practice_python_task.md +4 -4
- package/bin/skills/lakehouse-doc-en/references/pricing-ai-gateway.md +181 -0
- package/bin/skills/lakehouse-doc-en/references/pricing-lakehouse.md +316 -0
- package/bin/skills/lakehouse-doc-en/references/pricing.md +44 -288
- package/bin/skills/lakehouse-doc-en/references/private-link-general.md +0 -2
- package/bin/skills/lakehouse-doc-en/references/pyspark-to-zettapark-migration-f1.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/python-igs.md +7 -3
- package/bin/skills/lakehouse-doc-en/references/python-sample-put-github-rt-events.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/python-task.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/python_reference/connector.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/python_reference/connector_advanced.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/python_reference/connector_examples.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/python_sdk_guide.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/python_shell_datasource.md +11 -9
- package/bin/skills/lakehouse-doc-en/references/quick_start_batch_sync_data.md +9 -18
- package/bin/skills/lakehouse-doc-en/references/quick_start_bi_analysis.md +8 -25
- package/bin/skills/lakehouse-doc-en/references/quick_start_create_workspace.md +4 -6
- package/bin/skills/lakehouse-doc-en/references/quick_start_data_quality.md +8 -8
- package/bin/skills/lakehouse-doc-en/references/quick_start_etl.md +16 -20
- package/bin/skills/lakehouse-doc-en/references/quick_start_monitoring_and_alerting.md +10 -18
- package/bin/skills/lakehouse-doc-en/references/quick_start_sql_query.md +7 -10
- package/bin/skills/lakehouse-doc-en/references/quick_start_upload_data.md +5 -7
- package/bin/skills/lakehouse-doc-en/references/quick_start_user_management.md +8 -8
- package/bin/skills/lakehouse-doc-en/references/quick_start_workspace.md +0 -5
- package/bin/skills/lakehouse-doc-en/references/quick_start_workspace_user.md +8 -8
- package/bin/skills/lakehouse-doc-en/references/quickstart.md +69 -56
- package/bin/skills/lakehouse-doc-en/references/quickstart_datashare_between_companies.md +0 -5
- package/bin/skills/lakehouse-doc-en/references/quickstart_envirment_for_team.md +0 -24
- package/bin/skills/lakehouse-doc-en/references/realtime-pipeline-selection-guide.md +1 -2
- package/bin/skills/lakehouse-doc-en/references/realtime-sales-dashboard-with-dynamic-table.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/realtime_sync.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/release-note-2026-05-19.md +5 -3
- package/bin/skills/lakehouse-doc-en/references/revoke-privileges.md +3 -1
- package/bin/skills/lakehouse-doc-en/references/roles.md +2 -3
- package/bin/skills/lakehouse-doc-en/references/row-filter.md +165 -0
- package/bin/skills/lakehouse-doc-en/references/row_level_permission.md +30 -19
- package/bin/skills/lakehouse-doc-en/references/scheduled_task.md +28 -21
- package/bin/skills/lakehouse-doc-en/references/security_overview.md +99 -21
- package/bin/skills/lakehouse-doc-en/references/set-command.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/setup.md +13 -15
- package/bin/skills/lakehouse-doc-en/references/show-grants.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/snowflake-dynamic-tables-to-lakehouse.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/spark-connector-summary.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/sql_functions/context_functions/current_vcluster.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/sso-configuration.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/streaming_pipeline_with_dynamic_table.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/studio-incremental-sync-practice.md +27 -23
- package/bin/skills/lakehouse-doc-en/references/studio-shell-task.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/supported-cloud-platforms.md +32 -0
- package/bin/skills/lakehouse-doc-en/references/table_rendering.md +18 -12
- package/bin/skills/lakehouse-doc-en/references/task-develop.md +89 -91
- package/bin/skills/lakehouse-doc-en/references/task_development.md +19 -17
- package/bin/skills/lakehouse-doc-en/references/task_group.md +16 -14
- package/bin/skills/lakehouse-doc-en/references/task_instance.md +21 -21
- package/bin/skills/lakehouse-doc-en/references/task_param.md +38 -35
- package/bin/skills/lakehouse-doc-en/references/task_param_reference.md +81 -79
- package/bin/skills/lakehouse-doc-en/references/task_scheduling_dependency.md +20 -21
- package/bin/skills/lakehouse-doc-en/references/tencentcloud_arn_and_externalid.md +1 -5
- package/bin/skills/lakehouse-doc-en/references/trial-account-quotas-and-limits.md +1 -3
- package/bin/skills/lakehouse-doc-en/references/tutorial_connect_to_lakehouse.md +69 -0
- package/bin/skills/lakehouse-doc-en/references/tutorials.md +4 -1
- package/bin/skills/lakehouse-doc-en/references/unique-key.md +167 -0
- package/bin/skills/lakehouse-doc-en/references/usageandbillingview.md +138 -0
- package/bin/skills/lakehouse-doc-en/references/use-dbt-dev.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/use-java-sdk-realtime-uploaddata.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/use-java-sdk-upload-data-local.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/use-models.md +128 -0
- package/bin/skills/lakehouse-doc-en/references/use-mysql-client.md +81 -81
- package/bin/skills/lakehouse-doc-en/references/use-python-sdk-upload-data.md +10 -12
- package/bin/skills/lakehouse-doc-en/references/user-identification.md +2 -3
- package/bin/skills/lakehouse-doc-en/references/user_permission_grand_guide.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/using-udf-in-dynamic-table.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/vc_cache.md +18 -22
- package/bin/skills/lakehouse-doc-en/references/vcluster_size_description.md +33 -31
- package/bin/skills/lakehouse-doc-en/references/virtual-cluster.md +43 -45
- package/bin/skills/lakehouse-doc-en/references/web-job-history.md +94 -108
- package/bin/skills/lakehouse-doc-en/references/web_search.md +16 -7
- package/bin/skills/lakehouse-doc-en/references/zettapark-data-engineering-demo.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/zettapark-dataframe-guide.md +144 -70
- package/bin/skills/lakehouse-doc-en/references/zettapark-dynamic-table-guide.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/zettapark-etl-guide.md +73 -33
- package/bin/skills/lakehouse-doc-en/references/zettapark-feature-engineering.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/zettapark-functions-guide.md +75 -46
- package/bin/skills/lakehouse-doc-en/references/zettapark-quick-start.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/zettapark-stream-guide.md +4 -4
- package/bin/skills/lakehouse-doc-en/references/zettapark-volume-guide.md +93 -29
- package/package.json +1 -1
- package/bin/skills/lakehouse-doc-en/references/CLAUDE.md +0 -606
- package/bin/skills/lakehouse-doc-en/references/modelprice.md +0 -155
|
@@ -0,0 +1,350 @@
|
|
|
1
|
+
# Databricks Notebook → Lakehouse Migration Guide: Retail Data Medallion Pipeline
|
|
2
|
+
|
|
3
|
+
If your data engineering pipeline runs on Databricks Notebooks, the migration effort to Singdata Lakehouse Studio is lower than you might expect. Databricks' PySpark DataFrame API — `select`, `filter`, `join`, `withColumn`, `when`, Window functions — has identical syntax in ZettaPark. Changes are limited to just 5 mechanical substitutions: import paths, session acquisition method, table path prefix, one API casing difference, and replacing `dbutils.notebook.run()` with Studio task dependencies.
|
|
4
|
+
|
|
5
|
+
This article validates that claim with a real project: a Databricks-based retail data Medallion pipeline (Bronze → Silver → Gold three-layer architecture, 14 Notebooks, 81 code cells) was fully migrated to Singdata Lakehouse. The project offers three migration options, and all 20 automated validations pass.
|
|
6
|
+
|
|
7
|
+
Full code on GitHub: [databricks2lakehouse-bootcamp](https://github.com/clickzetta/databricks2lakehouse-bootcamp)
|
|
8
|
+
|
|
9
|
+
---
|
|
10
|
+
|
|
11
|
+
## Source Project
|
|
12
|
+
|
|
13
|
+
[databricks2lakehouse-bootcamp](https://github.com/clickzetta/databricks2lakehouse-bootcamp) is forked from [DataWithBaraa/databricks_bootcamp_2026](https://github.com/DataWithBaraa/databricks_bootcamp_2026) (⭐335). The original tech stack is Databricks + PySpark + Delta Lake + Unity Catalog. The project implements a complete retail data warehouse from dual-source CRM/ERP ingestion to a star schema, covering 18,484 customers, 397 products, and 60,398 sales records, with complete data cleansing, type conversion, code mapping, and dimensional modeling.
|
|
14
|
+
|
|
15
|
+
Migrated code is in the `03_lakehouse/` directory, comparable file-by-file with `01_source/`.
|
|
16
|
+
|
|
17
|
+
## Conclusion First
|
|
18
|
+
|
|
19
|
+
You don't need to rewrite any business logic or retrain your team. All 5 changes are mechanical substitutions.
|
|
20
|
+
|
|
21
|
+
| Change | Effort | Notes |
|
|
22
|
+
|--------|--------|------|
|
|
23
|
+
| Import path replacement | Very low | `pyspark.sql` → `clickzetta.zettapark`, global search-replace |
|
|
24
|
+
| Session acquisition method | Very low | `spark` (globally injected) → Studio tasks use `clickzetta_dbutils`; local development uses `Session.builder.configs({})` |
|
|
25
|
+
| Table path prefix | Very low | `workspace.bronze.X` → `bronze.X`, remove catalog prefix |
|
|
26
|
+
| StructField casing | Very low | `field.dataType` → `field.datatype` (ZettaPark API difference) |
|
|
27
|
+
| Orchestration method | Low | `dbutils.notebook.run(nb)` → Studio task dependencies (DAG) |
|
|
28
|
+
|
|
29
|
+
`select`, `filter`, `join`, `withColumn`, `when`, `coalesce`, `trim`, `regexp_replace`, `to_date`, `cast`, `isNotNull`, Window functions, `ROW_NUMBER()` — these core data engineering operations have identical syntax and require no changes.
|
|
30
|
+
|
|
31
|
+
---
|
|
32
|
+
|
|
33
|
+
## Tech Stack Comparison
|
|
34
|
+
|
|
35
|
+
| | Databricks Notebook | Lakehouse Studio Task |
|
|
36
|
+
|---|---|---|
|
|
37
|
+
| Compute engine | Apache Spark (Databricks) | Singdata Lakehouse |
|
|
38
|
+
| DataFrame API | PySpark (`pyspark.sql`) | ZettaPark (`clickzetta.zettapark`) |
|
|
39
|
+
| Session acquisition | `spark` (Databricks global injection) | `clickzetta_dbutils.get_active_lakehouse_engine()` |
|
|
40
|
+
| Table naming | `workspace.bronze.crm_cust_info` (3-level) | `bronze.crm_cust_info` (2-level) |
|
|
41
|
+
| File path | `/Volumes/workspace/bronze/raw_sources/...` | `vol://bronze.raw_sources/...` |
|
|
42
|
+
| StructField | `field.dataType` | `field.datatype` |
|
|
43
|
+
| SQL execution | `spark.sql(q)` executes DDL/DML statements immediately | `session.sql(q).collect()` triggers execution |
|
|
44
|
+
| Notebook chaining | `dbutils.notebook.run(nb, timeout_seconds=0)` | Studio task dependencies (`--deps` parameter) |
|
|
45
|
+
| Scheduling orchestration | Databricks Jobs | Studio task DAG |
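
The file path row can be applied mechanically as well. The following is a minimal sketch, assuming the mapping is `/Volumes/<catalog>/<schema>/<volume>/<rest>` → `vol://<schema>.<volume>/<rest>`, which is inferred from the single example in the table and not from a documented rule. The helper name `to_vol_uri` and the CSV file name are hypothetical.

```python
def to_vol_uri(path: str) -> str:
    """Convert a Unity Catalog volume path to a vol:// URI (assumed mapping)."""
    prefix = "/Volumes/"
    if not path.startswith(prefix):
        raise ValueError(f"not a Unity Catalog volume path: {path!r}")
    # Split into catalog, schema, volume, and the remainder of the path.
    parts = path[len(prefix):].split("/", 3)
    if len(parts) < 3:
        raise ValueError(f"expected /Volumes/<catalog>/<schema>/<volume>/...: {path!r}")
    _catalog, schema, volume = parts[:3]
    rest = parts[3] if len(parts) > 3 else ""
    # The catalog level is dropped, mirroring the two-level table naming.
    return f"vol://{schema}.{volume}/{rest}"


# Hypothetical file name, used only to exercise the helper.
print(to_vol_uri("/Volumes/workspace/bronze/raw_sources/crm/cust_info.csv"))
```

Paths that do not start with `/Volumes/` raise an error instead of being passed through, so unconverted locations surface during migration rather than at run time.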
|
|
46
|
+
|
|
47
|
+
---
|
|
48
|
+
|
|
49
|
+

|
|
50
|
+
|
|
51
|
+
---
|
|
52
|
+
|
|
53
|
+
## Project Background
|
|
54
|
+
|
|
55
|
+
The data comes from a bicycle retailer's dual-source CRM + ERP system and consists of 6 CSV files:
|
|
56
|
+
|
|
57
|
+
| Data Source | Table Name | Row Count | Notes |
|
|
58
|
+
|--------|------|------|------|
|
|
59
|
+
| CRM | `crm_cust_info` | 18,494 | Customer info (includes dirty data and inconsistent casing) |
|
|
60
|
+
| CRM | `crm_prd_info` | 397 | Product info (includes category ID encoded in product key) |
|
|
61
|
+
| CRM | `crm_sales_details` | 60,398 | Sales transactions (includes yyyyMMdd format dates) |
|
|
62
|
+
| ERP | `erp_cust_az12` | 18,484 | Customer master data (includes NAS prefix, future dates) |
|
|
63
|
+
| ERP | `erp_loc_a101` | 18,484 | Customer addresses (includes hyphens and country code abbreviations) |
|
|
64
|
+
| ERP | `erp_px_cat_g1v2` | 37 | Product categories (includes YES/NO maintenance flags) |
|
|
65
|
+
|
|
66
|
+
Medallion architecture in three layers:
|
|
67
|
+
|
|
68
|
+
- **Bronze**: Raw CSV → 6 Delta tables (no transformation)
|
|
69
|
+
- **Silver**: Cleansing + normalization → 6 wide tables (trim, type conversion, enum mapping, column renaming)
|
|
70
|
+
- **Gold**: Star schema → dim_customers + dim_products + fact_sales
|
|
71
|
+
|
|
72
|
+
The original project has 14 Databricks Notebooks (81 code cells), chained via `dbutils.notebook.run()`.
|
|
73
|
+
|
|
74
|
+
---
|
|
75
|
+
|
|
76
|
+
## Migration Steps
|
|
77
|
+
|
|
78
|
+
### Step 1: Replace Import Paths
|
|
79
|
+
|
|
80
|
+
Mechanical global replacement, no logic changes:
|
|
81
|
+
|
|
82
|
+
```python
|
|
83
|
+
# Databricks
|
|
84
|
+
import pyspark.sql.functions as F
|
|
85
|
+
from pyspark.sql.types import StringType, DateType
|
|
86
|
+
from pyspark.sql.functions import trim, col, length
|
|
87
|
+
from pyspark.sql.window import Window
|
|
88
|
+
|
|
89
|
+
# ZettaPark (only the package name changes)
|
|
90
|
+
from clickzetta.zettapark import functions as F
|
|
91
|
+
from clickzetta.zettapark.types import StringType, DateType
|
|
92
|
+
from clickzetta.zettapark.functions import trim, col, length
|
|
93
|
+
from clickzetta.zettapark.window import Window
|
|
94
|
+
```
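
The global replacement can be scripted rather than done by hand. The following is a minimal sketch, assuming the only change needed is the package prefix `pyspark.sql` → `clickzetta.zettapark`; the function name `rewrite_imports` is hypothetical. It rewrites `import pyspark.sql.functions as F` to `import clickzetta.zettapark.functions as F` rather than the `from ... import functions as F` form shown above, on the assumption that `functions` is importable as a submodule.

```python
import re

# Match the package prefix only at a module-path boundary, so that
# look-alike identifiers such as "my_pyspark.sqlite" are left untouched.
_PYSPARK_PREFIX = re.compile(r"(?<![\w.])pyspark\.sql(?![\w])")


def rewrite_imports(source: str) -> str:
    """Swap the pyspark.sql package prefix for clickzetta.zettapark."""
    return _PYSPARK_PREFIX.sub("clickzetta.zettapark", source)


databricks_cell = """\
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, DateType
from pyspark.sql.functions import trim, col, length
from pyspark.sql.window import Window
"""

migrated_cell = rewrite_imports(databricks_cell)
print(migrated_cell)
```

Session creation is not covered by this replacement; that part is handled separately in Step 2.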
|
|
95
|
+
|
|
96
|
+
### Step 2: Replace Session Acquisition Method
|
|
97
|
+
|
|
98
|
+
Databricks injects `spark` into every Notebook without requiring explicit creation. Studio tasks obtain a session via `clickzetta_dbutils`, likewise without managing passwords or connection parameters:
|
|
99
|
+
|
|
100
|
+
```python
|
|
101
|
+
# Databricks: spark is globally available, use directly
|
|
102
|
+
df = spark.table("workspace.bronze.crm_cust_info")
|
|
103
|
+
|
|
104
|
+
# Studio task: platform-injected, get session via clickzetta_dbutils
|
|
105
|
+
from clickzetta_dbutils import get_active_lakehouse_engine
|
|
106
|
+
from clickzetta.zettapark.session import Session
|
|
107
|
+
from urllib.parse import urlparse, parse_qs
|
|
108
|
+
|
|
109
|
+
engine = get_active_lakehouse_engine(schema="quick_start")
|
|
110
|
+
url_str = str(engine.url)
|
|
111
|
+
parsed = urlparse(url_str.replace('clickzetta://', 'https://'))
|
|
112
|
+
params = parse_qs(parsed.query)
|
|
113
|
+
parts = parsed.hostname.split('.', 1)
|
|
114
|
+
|
|
115
|
+
session = Session.builder.configs({
|
|
116
|
+
"service": parts[1],
|
|
117
|
+
"instance": parts[0],
|
|
118
|
+
"magic_token": params['magic_token'][0],
|
|
119
|
+
"workspace": parsed.path.lstrip('/'),
|
|
120
|
+
"schema": params.get('schema', ['quick_start'])[0],
|
|
121
|
+
"vcluster": params.get('virtualcluster', ['DEFAULT'])[0],
|
|
122
|
+
}).getOrCreate()
|
|
123
|
+
|
|
124
|
+
# After this, session usage is identical to spark
|
|
125
|
+
df = session.table("bronze.crm_cust_info")
|
|
126
|
+
```
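
To make the parsing steps concrete, the sketch below runs the same `urlparse` / `parse_qs` logic against a made-up URL. The URL value, host name, and token are hypothetical, shaped only to match what the parsing code above expects; the real format of `engine.url` is not shown in this guide.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL: <instance>.<service host>/<workspace>?<query parameters>
example_url = (
    "clickzetta://myinstance.api.example.com/my_workspace"
    "?magic_token=abc123&schema=quick_start&virtualcluster=DEFAULT"
)

# Same scheme swap as in the task code above before parsing.
parsed = urlparse(example_url.replace("clickzetta://", "https://"))
params = parse_qs(parsed.query)

# The first host label is the instance; the remainder is the service endpoint.
instance, service = parsed.hostname.split(".", 1)

config = {
    "service": service,
    "instance": instance,
    "magic_token": params["magic_token"][0],
    "workspace": parsed.path.lstrip("/"),
    "schema": params.get("schema", ["quick_start"])[0],
    "vcluster": params.get("virtualcluster", ["DEFAULT"])[0],
}
print(config)
```

`parse_qs` returns a list per key, which is why every lookup takes element `[0]` and every default is wrapped in a one-item list.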
|
|
127
|
+
|
|
128
|
+
> 💡 **Note**: For local development and debugging, explicitly pass credentials with `Session.builder.configs({"instance":..., "password":..., ...}).create()`, without relying on `clickzetta_dbutils`. DataFrame code is identical after either acquisition method.
|
|
129
|
+
|
|
130
|
+
### Step 3: Remove Catalog Prefix from Table Paths
|
|
131
|
+
|
|
132
|
+
Unity Catalog uses three-level naming (catalog.schema.table). Lakehouse only requires two levels within a single workspace:
|
|
133
|
+
|
|
134
|
+
```python
|
|
135
|
+
# Databricks (Unity Catalog three-level)
|
|
136
|
+
df = spark.table("workspace.bronze.crm_cust_info")
|
|
137
|
+
df.write.mode("overwrite").format("delta").saveAsTable("workspace.silver.crm_customers")
|
|
138
|
+
|
|
139
|
+
# ZettaPark (two-level, remove catalog prefix)
|
|
140
|
+
df = session.table("bronze.crm_cust_info")
|
|
141
|
+
df.write.mode("overwrite").saveAsTable("silver.crm_customers") # .format("delta") not needed either
|
|
142
|
+
```
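
Where table names are built from variables or configuration rather than written as literals, the prefix can be removed with a small helper. This is a minimal sketch, assuming plain dot-separated names with no backtick-quoted parts; `strip_catalog` is a hypothetical name, and the default catalog `workspace` is taken from the examples above.

```python
def strip_catalog(table_name: str, catalog: str = "workspace") -> str:
    """Drop a leading Unity Catalog name from a three-level table name."""
    parts = table_name.split(".")
    if len(parts) == 3 and parts[0] == catalog:
        return ".".join(parts[1:])
    # Already two-level, or qualified with a different catalog: leave as-is.
    return table_name


print(strip_catalog("workspace.bronze.crm_cust_info"))
print(strip_catalog("silver.crm_customers"))
```

Names that belong to a different catalog are returned unchanged so they can be reviewed by hand instead of being silently shortened.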
|
|
143
|
+
|
|
144
|
+
### Step 4: Fix StructField Casing
|
|
145
|
+
|
|
146
|
+
This is an API difference discovered in testing, affecting all code that iterates field types with `df.schema.fields`:
|
|
147
|
+
|
|
148
|
+
```python
|
|
149
|
+
# Databricks / PySpark
|
|
150
|
+
for field in df.schema.fields:
|
|
151
|
+
    if isinstance(field.dataType, StringType):  # uppercase T
|
|
152
|
+
        df = df.withColumn(field.name, trim(col(field.name)))
|
|
153
|
+
|
|
154
|
+
# ZettaPark (datatype all lowercase)
|
|
155
|
+
for field in df.schema.fields:
|
|
156
|
+
    if isinstance(field.datatype, StringType):  # lowercase t
|
|
157
|
+
        df = df.withColumn(field.name, trim(col(field.name)))
|
|
158
|
+
```
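
If the same code has to run on both platforms during a transition period, the attribute lookup can be wrapped once. The following is a minimal sketch; `field_type` is a hypothetical helper, and the stand-in classes exist only so the example runs without PySpark or ZettaPark installed.

```python
def field_type(field):
    """Return a StructField's type under either attribute spelling."""
    if hasattr(field, "datatype"):  # ZettaPark spelling
        return field.datatype
    return field.dataType  # PySpark spelling


# Stand-ins for the real library classes (illustration only).
class StringType:
    pass


class ZettaParkLikeField:
    def __init__(self, name, dtype):
        self.name = name
        self.datatype = dtype


class PySparkLikeField:
    def __init__(self, name, dtype):
        self.name = name
        self.dataType = dtype


fields = [
    ZettaParkLikeField("cst_firstname", StringType()),
    PySparkLikeField("cst_lastname", StringType()),
]
string_columns = [f.name for f in fields if isinstance(field_type(f), StringType)]
print(string_columns)
```

With such a helper, the loop body becomes `isinstance(field_type(field), StringType)` on both sides, at the cost of one extra indirection.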
|
|
159
|
+
|
|
160
|
+
### Step 5: Replace dbutils.notebook.run() with Studio Task Dependencies
|
|
161
|
+
|
|
162
|
+
The original project chains 14 Notebooks via `dbutils.notebook.run()`, which maps directly to a Studio task DAG:
|
|
163
|
+
|
|
164
|
+
```python
|
|
165
|
+
# Databricks: orchestration notebook
|
|
166
|
+
notebooks = [
|
|
167
|
+
"./silver_crm_cust_info",
|
|
168
|
+
"./silver_crm_prd_info",
|
|
169
|
+
"./silver_crm_sales_details",
|
|
170
|
+
"./silver_erp_cust_az12",
|
|
171
|
+
"./silver_erp_loc_a101",
|
|
172
|
+
"./silver_erp_px_cat_g1v2"
|
|
173
|
+
]
|
|
174
|
+
for nb in notebooks:
|
|
175
|
+
    dbutils.notebook.run(nb, timeout_seconds=0)
|
|
176
|
+
```
|
|
177
|
+
|
|
178
|
+
```bash
|
|
179
|
+
# Studio: set task dependencies once with cz-cli; the platform handles scheduling
|
|
180
|
+
cz-cli task save-config bootcamp/silver_crm_cust_info \
|
|
181
|
+
--deps bootcamp/bronze_ingestion --profile aws_singapore_prod
|
|
182
|
+
|
|
183
|
+
cz-cli task save-config bootcamp/gold_dim_customers \
|
|
184
|
+
--deps bootcamp/silver_crm_cust_info,bootcamp/silver_erp_cust_az12,bootcamp/silver_erp_loc_a101 \
|
|
185
|
+
--profile aws_singapore_prod
|
|
186
|
+
```
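
With more than a handful of tasks, the `save-config` commands can be generated from a dependency map instead of typed one by one. This is a minimal sketch that only builds command strings and does not run them. It uses just the subcommand and the `--deps` / `--profile` flags shown above; the `deps` dictionary and the function name are illustrative.

```python
import shlex

PROJECT = "bootcamp"
PROFILE = "aws_singapore_prod"

# task -> upstream tasks, limited to the two examples shown above
deps = {
    "silver_crm_cust_info": ["bronze_ingestion"],
    "gold_dim_customers": [
        "silver_crm_cust_info",
        "silver_erp_cust_az12",
        "silver_erp_loc_a101",
    ],
}


def save_config_command(task: str, upstream: list[str]) -> str:
    """Build one `cz-cli task save-config` command line as a string."""
    argv = [
        "cz-cli", "task", "save-config", f"{PROJECT}/{task}",
        "--deps", ",".join(f"{PROJECT}/{u}" for u in upstream),
        "--profile", PROFILE,
    ]
    return shlex.join(argv)  # quotes arguments safely for a POSIX shell


commands = [save_config_command(task, upstream) for task, upstream in deps.items()]
for command in commands:
    print(command)
```

Printing the commands first leaves room to review them before anything is executed against a production profile.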
|
|
187
|
+
|
|
188
|
+
Execute the DAG (equivalent to the original orchestration notebook's chained calls):
|
|
189
|
+
|
|
190
|
+
```bash
|
|
191
|
+
cz-cli task execute bootcamp/init_lakehouse --profile aws_singapore_prod
|
|
192
|
+
# → automatically triggers bronze → silver (parallel) → gold (in dependency order)
|
|
193
|
+
```
|
|
194
|
+
|
|
195
|
+
---
|
|
196
|
+
|
|
197
|
+
## Studio Task DAG After Migration
|
|
198
|
+
|
|
199
|
+
The original Databricks project used 2 orchestration notebooks chained in series. After migration, the Studio platform automatically manages dependencies and parallelism:
|
|
200
|
+
|
|
201
|
+
```
|
|
202
|
+
init_lakehouse (SQL task)
|
|
203
|
+
└── bronze_ingestion (Python task)
|
|
204
|
+
    ├── silver_crm_cust_info ← parallel execution
|
|
205
|
+
    ├── silver_crm_prd_info ← parallel execution
|
|
206
|
+
    ├── silver_crm_sales_details ← parallel execution
|
|
207
|
+
    ├── silver_erp_cust_az12 ← parallel execution
|
|
208
|
+
    ├── silver_erp_loc_a101 ← parallel execution
|
|
209
|
+
    └── silver_erp_px_cat_g1v2 ← parallel execution
|
|
210
|
+
        ├── gold_dim_customers ← depends on silver_crm_cust_info + erp_cust + erp_loc
|
|
211
|
+
        ├── gold_dim_products ← depends on silver_crm_prd_info + erp_px_cat
|
|
212
|
+
        └── gold_fact_sales ← depends on dim_customers + dim_products + silver_crm_sales
|
|
213
|
+
```
|
|
214
|
+
|
|
215
|
+
The original Databricks project ran the 6 silver notebooks sequentially; Studio runs the 6 silver tasks in parallel. The logic is the same, and the overall runtime is shorter.
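
To make the parallelism concrete, the dependency graph above can be grouped into execution "waves", where every task in a wave has all of its upstream tasks finished. The sketch below is only an illustration in plain Python using the standard-library `graphlib` module; it does not call Studio or cz-cli, and the `DEPS` dictionary and `execution_waves` helper are hypothetical names that restate the dependencies listed in the tree.

```python
from graphlib import TopologicalSorter

# Task -> upstream tasks, restated from the DAG shown above (illustrative only).
DEPS = {
    "bronze_ingestion": {"init_lakehouse"},
    "silver_crm_cust_info": {"bronze_ingestion"},
    "silver_crm_prd_info": {"bronze_ingestion"},
    "silver_crm_sales_details": {"bronze_ingestion"},
    "silver_erp_cust_az12": {"bronze_ingestion"},
    "silver_erp_loc_a101": {"bronze_ingestion"},
    "silver_erp_px_cat_g1v2": {"bronze_ingestion"},
    "gold_dim_customers": {
        "silver_crm_cust_info", "silver_erp_cust_az12", "silver_erp_loc_a101",
    },
    "gold_dim_products": {"silver_crm_prd_info", "silver_erp_px_cat_g1v2"},
    "gold_fact_sales": {
        "gold_dim_customers", "gold_dim_products", "silver_crm_sales_details",
    },
}


def execution_waves(deps):
    """Group tasks into waves; tasks within one wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose upstreams are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for index, wave in enumerate(execution_waves(DEPS)):
        print(index, wave)
```

Under these dependencies the six silver tasks fall into a single wave, which is where a dependency-aware scheduler can save time compared with a sequential loop.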
|
|
216
|
+
|
|
217
|
+
---
|
|
218
|
+
|
|
219
|
+
## Fully Compatible Parts
|
|
220
|
+
|
|
221
|
+
The following code was validated by testing and needs no changes on the Lakehouse side, apart from the `field.datatype` casing described in Notes:
|
|
222
|
+
|
|
223
|
+
```python
|
|
224
|
+
# String cleansing
|
|
225
|
+
for field in df.schema.fields:
|
|
226
|
+
    if isinstance(field.datatype, StringType):
|
|
227
|
+
        df = df.withColumn(field.name, trim(col(field.name)))
|
|
228
|
+
|
|
229
|
+
# Enum mapping (conditional replacement)
|
|
230
|
+
df = df.withColumn("cst_marital_status",
|
|
231
|
+
F.when(F.upper(F.col("cst_marital_status")) == "S", "Single")
|
|
232
|
+
.when(F.upper(F.col("cst_marital_status")) == "M", "Married")
|
|
233
|
+
.otherwise("n/a"))
|
|
234
|
+
|
|
235
|
+
# Composite string parsing (extract category ID from product key)
|
|
236
|
+
df = df.withColumn("cat_id", F.regexp_replace(F.substring(col("prd_key"), 1, 5), "-", "_"))
|
|
237
|
+
df = df.withColumn("prd_key", F.substring(col("prd_key"), 7, F.length(col("prd_key"))))
|
|
238
|
+
|
|
239
|
+
# Date format conversion (yyyyMMdd integer → DATE)
|
|
240
|
+
df = df.withColumn("sls_order_dt",
|
|
241
|
+
F.when(
|
|
242
|
+
(col("sls_order_dt") == 0) | (length(col("sls_order_dt")) != 8), None
|
|
243
|
+
).otherwise(F.to_date(col("sls_order_dt").cast("string"), "yyyyMMdd")))
|
|
244
|
+
|
|
245
|
+
# Conditional price fix (derive from sales/quantity when quantity != 0)
|
|
246
|
+
df = df.withColumn("sls_price",
|
|
247
|
+
F.when(
|
|
248
|
+
(col("sls_price").isNull()) | (col("sls_price") <= 0),
|
|
249
|
+
F.when(col("sls_quantity") != 0, col("sls_sales") / col("sls_quantity")).otherwise(None)
|
|
250
|
+
).otherwise(col("sls_price")))
|
|
251
|
+
|
|
252
|
+
# Prefix cleansing (remove invalid NAS prefix)
|
|
253
|
+
df = df.withColumn("cid",
|
|
254
|
+
F.when(col("cid").startswith("NAS"), F.substring(col("cid"), 4, F.length(col("cid"))))
|
|
255
|
+
.otherwise(col("cid")))
|
|
256
|
+
|
|
257
|
+
# Future date filtering (dirty data cleansing)
|
|
258
|
+
df = df.withColumn("bdate",
|
|
259
|
+
F.when(col("bdate") > F.current_date(), None).otherwise(col("bdate")))
|
|
260
|
+
|
|
261
|
+
# Multi-table LEFT JOIN + ROW_NUMBER for dimension table
|
|
262
|
+
df = session.sql("""
|
|
263
|
+
SELECT
|
|
264
|
+
ROW_NUMBER() OVER (ORDER BY ci.customer_id) AS customer_key,
|
|
265
|
+
ci.customer_id, ci.customer_number, ci.first_name, ci.last_name,
|
|
266
|
+
COALESCE(la.country, 'n/a') AS country, ci.marital_status,
|
|
267
|
+
CASE WHEN ci.gender <> 'n/a' THEN ci.gender ELSE COALESCE(ca.gender, 'n/a') END AS gender,
|
|
268
|
+
ca.birth_date, ci.created_date
|
|
269
|
+
FROM silver.crm_customers ci
|
|
270
|
+
LEFT JOIN silver.erp_customer_location la ON ci.customer_number = la.customer_number
|
|
271
|
+
LEFT JOIN silver.erp_customers ca ON ci.customer_number = ca.customer_number
|
|
272
|
+
""")
|
|
273
|
+
```
|
|
274
|
+
|
|
275
|
+
---
|
|
276
|
+
|
|
277
|
+
## E2E Validation Results
|
|
278
|
+
|
|
279
|
+
Tested on the AWS Singapore instance (`aws_singapore_prod`); all 20 checks passed:
|
|
280
|
+
|
|
281
|
+
| Check | Expected | Result |
|
|
282
|
+
|--------|--------|------|
|
|
283
|
+
| bronze.crm_cust_info | 18,494 | ✅ |
|
|
284
|
+
| bronze.crm_prd_info | 397 | ✅ |
|
|
285
|
+
| bronze.crm_sales_details | 60,398 | ✅ |
|
|
286
|
+
| bronze.erp_cust_az12 | 18,484 | ✅ |
|
|
287
|
+
| bronze.erp_loc_a101 | 18,484 | ✅ |
|
|
288
|
+
| bronze.erp_px_cat_g1v2 | 37 | ✅ |
|
|
289
|
+
| silver.crm_customers | 18,490 | ✅ |
|
|
290
|
+
| silver.crm_products | 397 | ✅ |
|
|
291
|
+
| silver.crm_sales | 60,398 | ✅ |
|
|
292
|
+
| silver.erp_customers | 18,484 | ✅ |
|
|
293
|
+
| silver.erp_customer_location | 18,484 | ✅ |
|
|
294
|
+
| silver.erp_product_category | 37 | ✅ |
|
|
295
|
+
| gold.dim_customers | 18,490 | ✅ |
|
|
296
|
+
| gold.dim_products | 397 | ✅ |
|
|
297
|
+
| gold.fact_sales | 89,833 | ✅ |
|
|
298
|
+
| Total sales amount | 43,538,800 | ✅ |
|
|
299
|
+
| Rows with negative sales | 5 (raw data) | ✅ |
|
|
300
|
+
| Distinct customer count | 18,484 | ✅ |
|
|
301
|
+
| Distinct product SKUs | 295 | ✅ |
|
|
302
|
+
| Rows with null order date | 24 (raw data format issue) | ✅ |
|
|
303
|
+
|
|
304
|
+
> Bronze 18,494 → Silver 18,490: 4 records with NULL customer_id were filtered during Silver cleansing, consistent with original Databricks behavior.
|
|
305
|
+
|
|
306
|
+
---
|
|
307
|
+
|
|
308
|
+
## Three Migration Options Compared
|
|
309
|
+
|
|
310
|
+
This project provides three options suited to different team backgrounds and deployment needs:
|
|
311
|
+
|
|
312
|
+
| Option | File Location | Session Method | Change Volume | Use Case |
|
|
313
|
+
|------|----------|-------------|--------|----------|
|
|
314
|
+
| **A. ZettaPark Local** | `03_lakehouse/{bronze,silver,gold}/` | `Session.builder.configs({}).create()` | ~5% | Local development, CI/CD debugging |
|
|
315
|
+
| **B. Pure SQL** | `03_lakehouse/sql/` | Not needed | Full rewrite (logic unchanged) | SQL-first teams, Studio SQL tasks |
|
|
316
|
+
| **C. Studio Task** | `03_lakehouse/tasks/` | `clickzetta_dbutils` | ~5% | **Production deployment (recommended)** |
|
|
317
|
+
|
|
318
|
+
All three options produce identical results — row counts, metrics, and data are consistent. Option C is the recommended path for production, directly mapping to the Databricks Notebook + Jobs architecture.
|
|
319
|
+
|
|
320
|
+
## Notes
|
|
321
|
+
|
|
322
|
+
- **`field.datatype` casing**: ZettaPark's `StructField` property is `datatype` (all lowercase); PySpark uses `dataType` (camelCase). This is the only non-mechanical API difference. Search for all uses of `schema.fields` and update individually.
|
|
323
|
+
- **`clickzetta_dbutils` template for Studio tasks**: The `get_active_lakehouse_engine()` call returns the current task's connection info; parse the URL to build the Session. This template code is fixed and can be extracted to a shared module for reuse.
|
|
324
|
+
- **Volume path format**: Databricks `/Volumes/catalog/schema/volume/path` → ZettaPark `vol://schema.volume/path`, note the removal of the catalog level.
|
|
325
|
+
- **`.format("delta")` not needed**: ZettaPark's `saveAsTable()` writes in Lakehouse native format by default; no explicit format specification required.
|
|
326
|
+
- **Silver row count difference (18,494 → 18,490)**: The CRM source data contains 4 records with NULL customer_id, filtered during Silver cleansing. This is expected behavior.
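
The volume path note above can be sketched as a small helper. This is a minimal illustration that assumes only the mapping stated in the note (`/Volumes/catalog/schema/volume/path` → `vol://schema.volume/path`); the function name `to_zettapark_volume_path` and the sample path are hypothetical and not part of ZettaPark.

```python
def to_zettapark_volume_path(databricks_path):
    """Convert /Volumes/<catalog>/<schema>/<volume>/<path> to vol://<schema>.<volume>/<path>."""
    parts = databricks_path.strip("/").split("/")
    if len(parts) < 4 or parts[0] != "Volumes":
        raise ValueError(f"not a Databricks volume path: {databricks_path!r}")
    # The catalog level is dropped; schema and volume are joined with a dot.
    _catalog, schema, volume, *rest = parts[1:]
    base = f"vol://{schema}.{volume}"
    return base + "/" + "/".join(rest) if rest else base


if __name__ == "__main__":
    # Hypothetical example path
    print(to_zettapark_volume_path("/Volumes/main/bronze/landing/crm/cust_info.csv"))
```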
|
|
327
|
+
|
|
328
|
+
## Related Documentation
|
|
329
|
+
|
|
330
|
+
### ZettaPark DataFrame API
|
|
331
|
+
|
|
332
|
+
- [ZettaPark Quick Start](zettapark-quick-start.md): Session creation, DataFrame basics
|
|
333
|
+
- [ZettaPark DataFrame API Guide](zettapark-dataframe-guide.md): Full reference for select, filter, join, withColumn, etc.
|
|
334
|
+
- [ZettaPark Common Functions Reference](zettapark-functions-guide.md): Usage of trim, when, regexp_replace, Window, etc.
|
|
335
|
+
- [ZettaPark Data Engineering in Practice](zettapark-etl-guide.md): Complete ETL pipeline examples
|
|
336
|
+
|
|
337
|
+
### Studio Tasks and Orchestration
|
|
338
|
+
|
|
339
|
+
- [Lakehouse Studio Introduction](lakehouse-studio-concept.md): Studio platform concepts and architecture
|
|
340
|
+
- [Task Development and Scheduling](task-develop.md): Creating, developing, and scheduling Studio tasks
|
|
341
|
+
- [Task Scheduling Dependencies](task_scheduling_dependency.md): `--deps` parameter for configuring task DAG
|
|
342
|
+
- [Python Task](python-task.md): Creating and running Python tasks in Studio
|
|
343
|
+
- [Studio Task Development and Operations (cz-cli)](cz-cli-studio-tasks.md): Managing Studio tasks via cz-cli command line
|
|
344
|
+
|
|
345
|
+
### Other Migration Guides
|
|
346
|
+
|
|
347
|
+
- [PySpark → ZettaPark Migration Guide (Formula 1)](pyspark-to-zettapark-migration-f1.md): PySpark DataFrame API query-by-query migration
|
|
348
|
+
- [Snowpark → ZettaPark Migration Guide](snowflake-snowpark-to-zettapark-migration.md): Snowflake Python DataFrame migration
|
|
349
|
+
- [Spark Migration Guide](spark-migration-guide.md): Spark ecosystem migration overview and common issues
|
|
350
|
+
- [SQL Compatibility Reference](migration-sql-compatibility.md): Cross-platform SQL syntax differences
|
package/bin/skills/lakehouse-doc-en/references/databricks-uc-governance-to-lakehouse-migration.md
ADDED
|
@@ -0,0 +1,327 @@
|
|
|
1
|
+
# Databricks Unity Catalog → Lakehouse Migration Guide: Permissions and Governance
|
|
2
|
+
|
|
3
|
+
If your data platform uses Databricks Unity Catalog to manage permissions (RBAC roles, column-level masking, row-level access control, data auditing), the migration effort to Singdata Lakehouse is lower than you might expect. GRANT/REVOKE syntax is fully consistent, role management requires no changes, and the `SET MASK` and `SET ROW FILTER` keywords are identical. Changes are concentrated in 3 areas: object naming (the catalog name is replaced by the workspace, which is set on the connection, so SQL can use `schema.table`), swapping one API in masking and row filter functions (`is_account_group_member` → `array_contains(current_roles(),...)`), and adapting audit queries to a different system table and column names.
|
|
4
|
+
|
|
5
|
+
This article validates this with a financial payment scenario: a users table with PII fields (email/phone/ssn_last4), an orders table with region-based isolation, and an accounts table with balances and card numbers. It covers the complete migration of RBAC, column masking, row-level security, and audit logs, and passes all 16 automated validations.
|
|
6
|
+
|
|
7
|
+
Full code on GitHub: [databricks2lakehouse-governance](https://github.com/clickzetta/databricks2lakehouse-governance)
|
|
8
|
+
|
|
9
|
+
---
|
|
10
|
+
|
|
11
|
+
## Source Project
|
|
12
|
+
|
|
13
|
+
Demo data is a financial payment scenario with 3 tables:
|
|
14
|
+
|
|
15
|
+
| Table | Rows | Sensitive Fields | Access Control Requirements |
|
|
16
|
+
|----|------|---------|------------|
|
|
17
|
+
| `users` | 100 | `email`, `phone`, `ssn_last4` | Masking: non-admin sees masked values |
|
|
18
|
+
| `orders` | 300 | `amount`, `region` | Row-level: analyst only sees North America |
|
|
19
|
+
| `accounts` | 50 | `balance`, `card_number` | Masking: last 4 digits of card visible |
|
|
20
|
+
|
|
21
|
+
Migrated code is in the `03_lakehouse/sql/` directory (6 SQL files), comparable file-by-file with `01_source/sql/` (original UC SQL).
|
|
22
|
+
|
|
23
|
+
## Conclusion First
|
|
24
|
+
|
|
25
|
+
Most permission SQL can be reused directly with **very few or zero changes**.
|
|
26
|
+
|
|
27
|
+
| Change | Effort | Notes |
|
|
28
|
+
|--------|--------|------|
|
|
29
|
+
| Masking function judgment API | Very low | `is_account_group_member('g')` → `array_contains(current_roles(), 'ws.g')` |
|
|
30
|
+
| Audit table and column names | Low | `system.access.audit` → `sys.information_schema.job_history`, different column names |
|
|
31
|
+
|
|
32
|
+
**No changes needed**: `CREATE ROLE`, `GRANT SELECT`, `GRANT ALL PRIVILEGES`, `WITH GRANT OPTION`, `SHOW GRANTS`, `SET MASK`, `SET ROW FILTER`, **three-level naming (workspace.schema.table)**: the syntax is fully consistent. The row-level security statements migrate unchanged; only the role check inside the filter function body is swapped, as in masking functions.
|
|
33
|
+
|
|
34
|
+
> 💡 Namespacing: UC uses `catalog.schema.table`; Lakehouse uses `workspace.schema.table` — the structure is identical, just replacing the catalog name with the workspace name (a connection parameter change, no SQL code changes needed).
|
|
35
|
+
|
|
36
|
+
Lakehouse also has additional capabilities that UC lacks: `mask_inner`/`mask_outer` (built-in string masking functions), `AI_MASK` (AI model semantic masking), `CREATE SHARE` (data sharing, SQL DDL simpler than Delta Sharing).
|
|
37
|
+
|
|
38
|
+
---
|
|
39
|
+
|
|
40
|
+
## Tech Stack Comparison
|
|
41
|
+
|
|
42
|
+
| | Databricks Unity Catalog | Singdata Lakehouse |
|
|
43
|
+
|---|---|---|
|
|
44
|
+
| Namespace levels | Three-level: `catalog.schema.table` | Three-level: `workspace.schema.table` (same structure) |
|
|
45
|
+
| Role management | `CREATE ROLE` (metastore-level) | `CREATE ROLE` (workspace-level) |
|
|
46
|
+
| GRANT syntax | `GRANT ... ON TABLE cat.s.t TO ROLE r` | `GRANT ... ON TABLE s.t TO ROLE r` |
|
|
47
|
+
| Column masking judgment | `is_account_group_member('group')` | `array_contains(current_roles(), 'ws.role')` |
|
|
48
|
+
| Column masking application | `ALTER TABLE ... ALTER COLUMN c SET MASK f` | `ALTER TABLE ... ALTER COLUMN c SET MASK f` |
|
|
49
|
+
| Row-level security | `ALTER TABLE ... SET ROW FILTER f ON (col)` | `ALTER TABLE ... SET ROW FILTER f ON (col)` |
|
|
50
|
+
| Data sharing | Delta Sharing (UC + REST API) | `CREATE SHARE` / `GRANT select, read metadata ... TO SHARE` |
|
|
51
|
+
| String masking | Hand-written REGEXP_REPLACE | `mask_inner()` / `mask_outer()` (built-in functions) |
|
|
52
|
+
| AI intelligent masking | — | `AI_MASK()` (Lakehouse exclusive) |
|
|
53
|
+
| Tag / ABAC | UC Tag policies (cross-table tagging) | Not yet supported → use role naming + Schema isolation as substitute |
|
|
54
|
+
| Audit source | `system.access.audit` (account-level) | `sys.information_schema.job_history` (workspace-level) |
|
|
55
|
+
|
|
56
|
+
---
|
|
57
|
+
|
|
58
|
+

|
|
59
|
+
|
|
60
|
+
---
|
|
61
|
+
|
|
62
|
+
## Project Background
|
|
63
|
+
|
|
64
|
+
Three tables covering typical financial scenarios:
|
|
65
|
+
|
|
66
|
+
- `users`: User registration info with PII fields like email/phone/ssn_last4, requiring column-level masking
|
|
67
|
+
- `orders`: Sales orders partitioned by region (North America/Europe/Asia Pacific), requiring row-level isolation
|
|
68
|
+
- `accounts`: Account balances and bank card info; balance and card_number are highly sensitive
|
|
69
|
+
|
|
70
|
+
The original UC design defines 3 roles:
|
|
71
|
+
- `payments_admin`: Full access to real data
|
|
72
|
+
- `payments_analyst`: Masked data + only North America orders
|
|
73
|
+
- `payments_viewer`: Masked data + only Asia orders
|
|
74
|
+
|
|
75
|
+
---
|
|
76
|
+
|
|
77
|
+
## Migration Steps
|
|
78
|
+
|
|
79
|
+
### Step 1: Remove Catalog Prefix
|
|
80
|
+
|
|
81
|
+
UC uses three-level `catalog.schema.table` names. In Lakehouse the workspace takes the place of the catalog and is set by the connection, so statements can use two-level `schema.table` names:
|
|
82
|
+
|
|
83
|
+
```sql
|
|
84
|
+
-- UC (three-level)
|
|
85
|
+
GRANT SELECT ON TABLE payments_catalog.raw.orders TO ROLE payments_analyst;
|
|
86
|
+
|
|
87
|
+
-- Lakehouse (two-level) — remove catalog prefix
|
|
88
|
+
GRANT SELECT ON TABLE gov_raw.orders TO ROLE payments_analyst;
|
|
89
|
+
```
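
For a larger set of SQL files, the renaming above could be scripted. The following is a rough sketch under stated assumptions: `SCHEMA_MAP` and `drop_catalog_prefix` are hypothetical names, the mapping from `payments_catalog.raw` to `gov_raw` is taken from the example above, and the regex is naive (it does not skip string literals or comments), so the output should be reviewed by hand.

```python
import re

# Hypothetical mapping: (UC catalog, UC schema) -> Lakehouse schema.
SCHEMA_MAP = {("payments_catalog", "raw"): "gov_raw"}

# Matches three-part dotted identifiers such as catalog.schema.table.
_THREE_PART = re.compile(r"\b(\w+)\.(\w+)\.(\w+)\b")


def drop_catalog_prefix(sql, schema_map):
    """Rewrite catalog.schema.object to schema.object for mapped schemas only."""

    def replace(match):
        catalog, schema, obj = match.groups()
        target = schema_map.get((catalog, schema))
        if target is None:
            return match.group(0)  # leave unknown names untouched
        return f"{target}.{obj}"

    return _THREE_PART.sub(replace, sql)


if __name__ == "__main__":
    uc_sql = "GRANT SELECT ON TABLE payments_catalog.raw.orders TO ROLE payments_analyst;"
    print(drop_catalog_prefix(uc_sql, SCHEMA_MAP))
```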
|
|
90
|
+
|
|
91
|
+
### Step 2: RBAC — Syntax Fully Consistent
|
|
92
|
+
|
|
93
|
+
GRANT/REVOKE/SHOW GRANTS syntax requires zero changes:
|
|
94
|
+
|
|
95
|
+
```sql
|
|
96
|
+
-- Identical on both sides
|
|
97
|
+
CREATE ROLE IF NOT EXISTS payments_analyst;
|
|
98
|
+
CREATE ROLE IF NOT EXISTS payments_admin;
|
|
99
|
+
|
|
100
|
+
GRANT SELECT ON TABLE gov_raw.orders TO ROLE payments_analyst;
|
|
101
|
+
GRANT ALL PRIVILEGES ON SCHEMA gov_raw TO ROLE payments_admin;
|
|
102
|
+
GRANT SELECT ON ALL TABLES IN SCHEMA gov_raw TO ROLE payments_viewer;
|
|
103
|
+
GRANT SELECT ON TABLE gov_raw.users TO ROLE payments_admin WITH GRANT OPTION;
|
|
104
|
+
|
|
105
|
+
SHOW GRANTS TO ROLE payments_analyst;
|
|
106
|
+
```
|
|
107
|
+
|
|
108
|
+
### Step 3: Column Masking — `SET MASK` Identical, One Function API Change
|
|
109
|
+
|
|
110
|
+
UC uses `is_account_group_member()`; Lakehouse uses `array_contains(current_roles(), 'workspace.role')`:
|
|
111
|
+
|
|
112
|
+
```sql
|
|
113
|
+
-- UC masking function
|
|
114
|
+
CREATE FUNCTION payments_catalog.raw.mask_email(email STRING)
|
|
115
|
+
RETURN CASE
|
|
116
|
+
WHEN is_account_group_member('payments_admin') THEN email
|
|
117
|
+
ELSE CONCAT(LEFT(email, 2), '***@***.***')
|
|
118
|
+
END;
|
|
119
|
+
```
|
|
120
|
+
|
|
121
|
+
```sql
|
|
122
|
+
-- Lakehouse masking function (only the judgment API changes, logic identical)
|
|
123
|
+
CREATE OR REPLACE FUNCTION gov_raw.mask_email(email STRING)
|
|
124
|
+
RETURNS STRING
|
|
125
|
+
RETURN CASE
|
|
126
|
+
WHEN array_contains(current_roles(), 'quick_start.payments_admin') -- ← current_roles()
|
|
127
|
+
OR array_contains(current_roles(), 'workspace_admin')
|
|
128
|
+
THEN email
|
|
129
|
+
ELSE CONCAT(LEFT(email, 2), '***@***.***')
|
|
130
|
+
END;
|
|
131
|
+
|
|
132
|
+
-- SET MASK application syntax is identical
|
|
133
|
+
ALTER TABLE gov_raw.users ALTER COLUMN email SET MASK gov_raw.mask_email;
|
|
134
|
+
ALTER TABLE gov_raw.users ALTER COLUMN phone SET MASK gov_raw.mask_phone;
|
|
135
|
+
ALTER TABLE gov_raw.accounts ALTER COLUMN card_number SET MASK gov_raw.mask_card;
|
|
136
|
+
```
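
When many functions contain the group check, the API swap can be applied mechanically. The sketch below is an illustration only: `rewrite_group_checks` is a hypothetical helper, it handles just the single-quoted literal form shown above, and the workspace name `quick_start` is the one used in this article's examples.

```python
import re

# Matches is_account_group_member('<group>') with a single-quoted literal argument.
_GROUP_CHECK = re.compile(r"is_account_group_member\(\s*'([^']+)'\s*\)")


def rewrite_group_checks(sql, workspace):
    """Replace UC group checks with Lakehouse role checks using workspace-prefixed names."""
    return _GROUP_CHECK.sub(
        lambda m: f"array_contains(current_roles(), '{workspace}.{m.group(1)}')",
        sql,
    )


if __name__ == "__main__":
    uc_line = "WHEN is_account_group_member('payments_admin') THEN email"
    print(rewrite_group_checks(uc_line, "quick_start"))
```

The additional `workspace_admin` branch in the Lakehouse function above is not produced by this rewrite and would still be added by hand.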
|
|
137
|
+
|
|
138
|
+
Tested masking results (seen by non-admin users):
|
|
139
|
+
|
|
140
|
+
| Field | Original Value | After Masking |
|
|
141
|
+
|------|--------|--------|
|
|
142
|
+
| email | `user001@example.com` | `us***@***.***` |
|
|
143
|
+
| phone | `+12345678901` | `****8901` |
|
|
144
|
+
| card_number | `4866524041574` | `****-****-****-1574` |
|
|
145
|
+
|
|
146
|
+
### Step 4: Row-Level Security — `SET ROW FILTER` Syntax Fully Consistent
|
|
147
|
+
|
|
148
|
+
Lakehouse natively supports `SET ROW FILTER`, with syntax **fully consistent with UC, zero changes**:
|
|
149
|
+
|
|
150
|
+
```sql
|
|
151
|
+
-- UC (Databricks)
|
|
152
|
+
CREATE FUNCTION payments_catalog.raw.filter_orders_by_region(region STRING)
|
|
153
|
+
RETURN CASE
|
|
154
|
+
WHEN is_account_group_member('payments_admin') THEN TRUE
|
|
155
|
+
WHEN is_account_group_member('payments_analyst')
|
|
156
|
+
AND region = 'North America' THEN TRUE
|
|
157
|
+
ELSE FALSE
|
|
158
|
+
END;
|
|
159
|
+
|
|
160
|
+
ALTER TABLE payments_catalog.raw.orders
|
|
161
|
+
SET ROW FILTER payments_catalog.raw.filter_orders_by_region ON (region);
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
```sql
|
|
165
|
+
-- Lakehouse — syntax fully consistent, only change is is_account_group_member in function body
|
|
166
|
+
CREATE OR REPLACE FUNCTION gov_raw.filter_orders_by_role(region STRING)
|
|
167
|
+
RETURNS BOOLEAN
|
|
168
|
+
AS
|
|
169
|
+
array_contains(current_roles(), 'quick_start.payments_admin')
|
|
170
|
+
OR array_contains(current_roles(), 'workspace_admin')
|
|
171
|
+
OR (array_contains(current_roles(), 'quick_start.payments_analyst')
|
|
172
|
+
AND region = 'North America')
|
|
173
|
+
OR (array_contains(current_roles(), 'quick_start.payments_viewer')
|
|
174
|
+
AND region = 'Asia');
|
|
175
|
+
|
|
176
|
+
-- SET ROW FILTER syntax is identical
|
|
177
|
+
ALTER TABLE gov_raw.orders SET ROW FILTER gov_raw.filter_orders_by_role ON (region);
|
|
178
|
+
|
|
179
|
+
-- Verify binding
|
|
180
|
+
DESC EXTENDED gov_raw.orders;
|
|
181
|
+
-- Output includes # Row Filter section at the end
|
|
182
|
+
|
|
183
|
+
-- Remove
|
|
184
|
+
ALTER TABLE gov_raw.orders DROP ROW FILTER;
|
|
185
|
+
```
|
|
186
|
+
|
|
187
|
+
ROW FILTER takes full effect for SELECT, COUNT, UPDATE, DELETE — the current user (workspace_admin) queries 300 rows; a `payments_analyst` role user only sees the 52 North America rows.
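
The predicate in `filter_orders_by_role` can be reasoned about outside the database. The following sketch re-implements the same boolean logic in plain Python over a tiny made-up sample; it is an emulation for illustration, not Lakehouse code, and the `ORDERS` rows and `visible_orders` helper are hypothetical.

```python
def filter_orders_by_role(current_roles, region):
    """Python emulation of the row filter predicate shown above."""
    roles = set(current_roles)
    return (
        "quick_start.payments_admin" in roles
        or "workspace_admin" in roles
        or ("quick_start.payments_analyst" in roles and region == "North America")
        or ("quick_start.payments_viewer" in roles and region == "Asia")
    )


# Illustrative rows only; the real orders table has 300 rows.
ORDERS = [
    {"order_id": 1, "region": "North America"},
    {"order_id": 2, "region": "Europe"},
    {"order_id": 3, "region": "Asia"},
]


def visible_orders(current_roles, rows):
    """Return the rows a user with the given roles would see."""
    return [row for row in rows if filter_orders_by_role(current_roles, row["region"])]


if __name__ == "__main__":
    print(visible_orders(["quick_start.payments_analyst"], ORDERS))
    # A role name without the workspace prefix matches nothing.
    print(visible_orders(["payments_analyst"], ORDERS))
```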
|
|
188
|
+
|
|
189
|
+
### Step 5: Auditing — Table and Column Name Adaptation
|
|
190
|
+
|
|
191
|
+
UC uses `system.access.audit` (account-level); Lakehouse uses `sys.information_schema.job_history` (workspace-level):
|
|
192
|
+
|
|
193
|
+
```sql
|
|
194
|
+
-- UC auditing
|
|
195
|
+
SELECT event_time, user_name, action_name, request_params
|
|
196
|
+
FROM system.access.audit
|
|
197
|
+
WHERE event_time >= current_timestamp() - INTERVAL 7 DAYS
|
|
198
|
+
AND action_name IN ('createTable','grantPermission','revokePermission');
|
|
199
|
+
```
|
|
200
|
+
|
|
201
|
+
```sql
|
|
202
|
+
-- Lakehouse auditing (different column names, date uses literal string)
|
|
203
|
+
SELECT
|
|
204
|
+
job_type,
|
|
205
|
+
job_creator, -- UC: user_name
|
|
206
|
+
LEFT(job_text, 100) AS sql_preview, -- UC: request_params
|
|
207
|
+
status,
|
|
208
|
+
LEFT(start_time, 19) AS time
|
|
209
|
+
FROM sys.information_schema.job_history
|
|
210
|
+
WHERE start_time >= '2026-06-07' -- must be literal, cannot use NOW() - INTERVAL
|
|
211
|
+
AND (UPPER(job_text) LIKE 'GRANT%'
|
|
212
|
+
OR UPPER(job_text) LIKE 'REVOKE%'
|
|
213
|
+
OR UPPER(job_text) LIKE 'CREATE TABLE%')
|
|
214
|
+
ORDER BY start_time DESC;
|
|
215
|
+
```
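
Because the `start_time` condition has to be a literal, a rolling window such as "last 7 days" can be computed on the client and formatted into the statement. This is a minimal sketch assuming the query shape shown above; `build_audit_query` is a hypothetical helper, and executing the resulting SQL is left to whatever client you already use.

```python
from datetime import date, timedelta


def build_audit_query(today, lookback_days=7):
    """Build the job_history audit query with a literal start date."""
    since = (today - timedelta(days=lookback_days)).isoformat()  # e.g. '2026-06-07'
    return f"""
SELECT job_type, job_creator, LEFT(job_text, 100) AS sql_preview, status,
       LEFT(start_time, 19) AS time
FROM sys.information_schema.job_history
WHERE start_time >= '{since}'
  AND (UPPER(job_text) LIKE 'GRANT%'
       OR UPPER(job_text) LIKE 'REVOKE%'
       OR UPPER(job_text) LIKE 'CREATE TABLE%')
ORDER BY start_time DESC;
""".strip()


if __name__ == "__main__":
    print(build_audit_query(date.today()))
```

The date comes from `datetime.date`, not from user input, so formatting it directly into the string is safe here.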
|
|
216
|
+
|
|
217
|
+
---
|
|
218
|
+
|
|
219
|
+
## Data Sharing
|
|
220
|
+
|
|
221
|
+
Lakehouse's **Share** is the equivalent of Databricks Delta Sharing — share tables or views with other workspaces or external consumers, with syntax highly similar to UC:
|
|
222
|
+
|
|
223
|
+
```sql
|
|
224
|
+
-- Step 1: Create Share object
|
|
225
|
+
CREATE SHARE IF NOT EXISTS analytics_share
|
|
226
|
+
COMMENT 'Analytics data share for partners';
|
|
227
|
+
|
|
228
|
+
-- Step 2: Add tables/views to Share (dual permissions: select + read metadata)
|
|
229
|
+
GRANT select, read metadata ON VIEW gov_marts.orders_by_region TO SHARE analytics_share;
|
|
230
|
+
GRANT select, read metadata ON TABLE gov_raw.orders TO SHARE analytics_share;
|
|
231
|
+
|
|
232
|
+
-- Bulk add all tables in a schema
|
|
233
|
+
GRANT SELECT, READ METADATA ON ALL TABLES IN SCHEMA gov_raw TO SHARE analytics_share;
|
|
234
|
+
|
|
235
|
+
-- Step 3: View Share contents
|
|
236
|
+
DESC SHARE analytics_share;
|
|
237
|
+
|
|
238
|
+
-- Step 4: Consumer access
|
|
239
|
+
CREATE SCHEMA partner_schema FROM SHARE analytics_share;
|
|
240
|
+
```
|
|
241
|
+
|
|
242
|
+
`SELECT, READ METADATA` dual permission combination: `select` allows querying data; `read metadata` allows consumers to discover the existence of tables/views. Both are typically granted together. UC Delta Sharing configuration is done via the Databricks UI/REST API; Lakehouse Share uses SQL DDL directly, which is more concise.
|
|
243
|
+
|
|
244
|
+
---
|
|
245
|
+
|
|
246
|
+
## Lakehouse-Exclusive Capabilities
|
|
247
|
+
|
|
248
|
+
**`mask_inner` / `mask_outer`**: Built-in string masking functions not available in UC, usable directly in masking functions without AI connection:
|
|
249
|
+
|
|
250
|
+
```sql
|
|
251
|
+
-- mask_inner: mask middle characters (preserve first n and last m characters)
|
|
252
|
+
SELECT mask_inner('user001@example.com', 2, 7);
|
|
253
|
+
-- → usXXXXXXXXXXple.com (preserve first 2, last 7, mask middle)
|
|
254
|
+
|
|
255
|
+
-- mask_outer: mask outer characters
|
|
256
|
+
SELECT mask_outer('+12345678901', 1, 4);
|
|
257
|
+
-- → X1234567XXXX
|
|
258
|
+
|
|
259
|
+
-- More concise in masking functions than hand-written CONCAT+REGEXP_REPLACE
|
|
260
|
+
CREATE OR REPLACE FUNCTION gov_raw.mask_email_v2(email STRING)
|
|
261
|
+
RETURNS STRING
|
|
262
|
+
RETURN CASE
|
|
263
|
+
WHEN array_contains(current_roles(), 'quick_start.payments_admin') THEN email
|
|
264
|
+
ELSE mask_inner(email, 2, 7)
|
|
265
|
+
END;
|
|
266
|
+
```
|
|
267
|
+
|
|
268
|
+
**`AI_MASK`**: AI model-based semantic masking, automatically identifies PII types (names, phones, ID numbers, etc.). UC has no equivalent:
|
|
269
|
+
|
|
270
|
+
```sql
|
|
271
|
+
-- Requires AI connection configuration (Bailian/Tongyi models etc.)
|
|
272
|
+
SELECT AI_MASK('conn_bailian:qwen3.5-plus',
|
|
273
|
+
'User Wang Xiaoming, phone: 13800138000',
|
|
274
|
+
ARRAY('Name', 'Phone'));
|
|
275
|
+
-- → 'User ***, phone: 138****8000'
|
|
276
|
+
```
|
|
277
|
+
|
|
278
|
+
---
|
|
279
|
+
|
|
280
|
+
## Validation Results
|
|
281
|
+
|
|
282
|
+
Tested on the AWS Singapore instance (`aws_singapore_prod`); all 16 checks passed:
|
|
283
|
+
|
|
284
|
+
| Check | Expected | Result |
|
|
285
|
+
|--------|--------|------|
|
|
286
|
+
| gov_raw.users | 100 | ✅ |
|
|
287
|
+
| gov_raw.orders | 300 | ✅ |
|
|
288
|
+
| gov_raw.accounts | 50 | ✅ |
|
|
289
|
+
| analyst has GRANT records | ≥2 | ✅ |
|
|
290
|
+
| admin has GRANT records | ≥1 | ✅ |
|
|
291
|
+
| viewer has GRANT records | ≥1 | ✅ |
|
|
292
|
+
| email is masked | True | ✅ |
|
|
293
|
+
| phone is masked | True | ✅ |
|
|
294
|
+
| card_number is masked | True | ✅ |
|
|
295
|
+
| mask_email function exists | True | ✅ |
|
|
296
|
+
| mask_phone function exists | True | ✅ |
|
|
297
|
+
| mask_card function exists | True | ✅ |
|
|
298
|
+
| orders_by_region admin sees 300 rows | 300 | ✅ |
|
|
299
|
+
| orders_analyst_view only sees North America | 52 | ✅ |
|
|
300
|
+
| analyst_view has only 1 region | 1 | ✅ |
|
|
301
|
+
| job_history has records | ≥1 | ✅ |
|
|
302
|
+
|
|
303
|
+
---
|
|
304
|
+
|
|
305
|
+
## Notes
|
|
306
|
+
|
|
307
|
+
- **`current_roles()` includes workspace prefix**: Lakehouse's `current_roles()` returns role names with workspace prefix (e.g., `quick_start.payments_admin`). When matching, the full prefix is required — `payments_admin` alone will not work.
|
|
308
|
+
- **`SET MASK` becomes invalid after table recreation**: After DROP TABLE + CREATE, column masks are not inherited automatically. `ALTER TABLE ... ALTER COLUMN c SET MASK` must be re-applied.
|
|
309
|
+
- **COPY INTO type inference**: External Volume COPY INTO does not support `inferSchema`. Column types must be explicitly declared when creating tables (especially phone/card should use STRING; otherwise inferred as BIGINT causing data loss).
|
|
310
|
+
- **job_history date condition**: `WHERE start_time >= '2026-06-07'` must be a literal string. `CURRENT_DATE()` or `INTERVAL` expressions are not supported.
|
|
311
|
+
- **Row-level security function judgment API**: The ROW FILTER function body also requires replacing `is_account_group_member('g')` with `array_contains(current_roles(), 'ws.g')`. This is the only change point. `ALTER TABLE ... SET ROW FILTER` / `DROP ROW FILTER` syntax is identical.
|
|
312
|
+
- **Tag / ABAC (honest limitation)**: UC supports fine-grained attribute-based access control (ABAC) via Tags — tagging tables/columns and then writing policies. Lakehouse does not currently support Tag-driven ABAC. The recommended approach is to use role naming conventions (e.g., `dept_finance_analyst`) + Schema isolation to simulate ABAC, or keep tag-based permissions managed on the Databricks side.
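
The COPY INTO note above recommends STRING for phone and card columns. The short sketch below shows, in plain Python, the kind of information that is lost when a digit string is stored as an integer and read back. It is an analogy for illustration, not a reproduction of Lakehouse type inference; `roundtrip_as_bigint` and the zero-padded sample value are hypothetical.

```python
def roundtrip_as_bigint(raw):
    """Emulate writing a digit string to an integer column and reading it back as text."""
    return str(int(raw))


if __name__ == "__main__":
    phone = "+12345678901"   # sample phone value used earlier in this article
    padded = "0012345678"    # hypothetical identifier with leading zeros

    for value in (phone, padded):
        restored = roundtrip_as_bigint(value)
        print(f"{value!r} -> {restored!r} (lossless: {value == restored})")
```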
|
|
313
|
+
|
|
314
|
+
## Related Documentation
|
|
315
|
+
|
|
316
|
+
### Permissions and Security
|
|
317
|
+
|
|
318
|
+
- [GRANT](GRANT.md): Full GRANT syntax reference
|
|
319
|
+
- [CREATE ROLE](create-role.md): Role creation and management
|
|
320
|
+
- [SET MASK](set-mask.md): Column masking syntax
|
|
321
|
+
- [SHOW GRANTS](show-grants.md): View authorization relationships
|
|
322
|
+
|
|
323
|
+
### Other Migration Guides
|
|
324
|
+
|
|
325
|
+
- [Databricks Notebook → Lakehouse Migration Guide](databricks-notebook-to-studio-migration.md)
|
|
326
|
+
- [Databricks DLT → Lakehouse Migration Guide](databricks-dlt-to-lakehouse-migration.md)
|
|
327
|
+
- [dbt-databricks → dbt-clickzetta Migration Guide](dbt-databricks-to-clickzetta-migration.md)
|