@clickzetta/cz-cli-darwin-x64 0.5.16 → 0.5.17
- package/bin/cz-cli +0 -0
- package/bin/skills/lakehouse-doc-en/SKILL.md +6 -11
- package/bin/skills/lakehouse-doc-en/references/AIGateway.md +58 -13
- package/bin/skills/lakehouse-doc-en/references/Computation.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/DataSource_Amazon_DocumentDB.md +3 -1
- package/bin/skills/lakehouse-doc-en/references/Foreach.md +14 -14
- package/bin/skills/lakehouse-doc-en/references/JDBC-Driver.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/LakehouseAI-overview.md +21 -8
- package/bin/skills/lakehouse-doc-en/references/LakehouseDataGPT-tour.md +4 -9
- package/bin/skills/lakehouse-doc-en/references/LakehouseStudio-tour.md +14 -19
- package/bin/skills/lakehouse-doc-en/references/Lakehouse_Zilliz_MakeDataReadyforBIandAI.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/Logstash.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/Migrate_Spark_DataEngineeringBestPractices_Project_to_Lakehouse.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/Notebook.md +17 -17
- package/bin/skills/lakehouse-doc-en/references/RemoteFunction-as-udf.md +14 -14
- package/bin/skills/lakehouse-doc-en/references/SQL_External_Catalog_Guide.md +1 -9
- package/bin/skills/lakehouse-doc-en/references/SUMMARY.md +59 -29
- package/bin/skills/lakehouse-doc-en/references/WINDOWFUNCTION.md +99 -57
- package/bin/skills/lakehouse-doc-en/references/Zettapark_Data_Engineering_Demo.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/access-control-configuration.md +1 -8
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-2-5-1.0.md +16 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-3-29-1.0.2.md +14 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-3-8-1.0.1.md +16 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-4-28-1.1.md +29 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-12-1.1.1.md +18 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-15-1.2.md +9 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-21-1.3.md +9 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-28-1.4.md +10 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-6-3-1.5.md +9 -0
- package/bin/skills/lakehouse-doc-en/references/alicloud-arn-externalid.md +0 -5
- package/bin/skills/lakehouse-doc-en/references/answer-accuracy-improve.md +120 -103
- package/bin/skills/lakehouse-doc-en/references/application-list.md +1 -3
- package/bin/skills/lakehouse-doc-en/references/approval-list.md +16 -17
- package/bin/skills/lakehouse-doc-en/references/batch-load-parquet-file-into-lakehouse.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/batch_sync.md +9 -9
- package/bin/skills/lakehouse-doc-en/references/batch_sync_Sop.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/batchloadparquetfileintoLakehouse.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/bulkloadv1-python-sdk.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/chart-auto-refresh-guide.md +12 -6
- package/bin/skills/lakehouse-doc-en/references/clickzetta-sample-data.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/code_approval.md +1 -5
- package/bin/skills/lakehouse-doc-en/references/composite_task.md +31 -42
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_environment_and_data_generate.md +6 -9
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_javasdk_bulkload_realtime.md +4 -10
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_kafka_realtime_sync.md +1 -10
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_local_file_into_table_by_studio.md +0 -6
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_batchload_public_network.md +0 -5
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_python_node.md +2 -7
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_realtime_cdc_public_network.md +13 -18
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_sql_insert.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/concepts.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/config-datasource.md +5 -7
- package/bin/skills/lakehouse-doc-en/references/connect-with-cli.md +116 -72
- package/bin/skills/lakehouse-doc-en/references/connect-with-cz-cli.md +151 -0
- package/bin/skills/lakehouse-doc-en/references/continue-job.md +9 -17
- package/bin/skills/lakehouse-doc-en/references/create-api-connection.md +315 -286
- package/bin/skills/lakehouse-doc-en/references/create-catalog-connection.md +1 -0
- package/bin/skills/lakehouse-doc-en/references/create-dynamic-table.md +4 -4
- package/bin/skills/lakehouse-doc-en/references/create-external-catalog.md +85 -22
- package/bin/skills/lakehouse-doc-en/references/create-table-ddl.md +45 -0
- package/bin/skills/lakehouse-doc-en/references/creating_alicloud_privatelinkendpoint.md +4 -6
- package/bin/skills/lakehouse-doc-en/references/creating_alicloud_privatelinkservice.md +4 -7
- package/bin/skills/lakehouse-doc-en/references/creating_tencentcloud_privatelinkendpoint.md +2 -7
- package/bin/skills/lakehouse-doc-en/references/creating_tencentcloud_privatelinkservice.md +1 -5
- package/bin/skills/lakehouse-doc-en/references/cz-cli-agent.md +15 -10
- package/bin/skills/lakehouse-doc-en/references/cz-cli-datasource.md +0 -8
- package/bin/skills/lakehouse-doc-en/references/cz-cli-sql.md +2 -45
- package/bin/skills/lakehouse-doc-en/references/cz-cli.md +53 -42
- package/bin/skills/lakehouse-doc-en/references/dashboard-version-management-guide.md +12 -4
- package/bin/skills/lakehouse-doc-en/references/data-integration-intro.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/data-integration.md +29 -27
- package/bin/skills/lakehouse-doc-en/references/data-load-summary.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/data-quality.md +25 -25
- package/bin/skills/lakehouse-doc-en/references/data-sharing.md +31 -54
- package/bin/skills/lakehouse-doc-en/references/data-sources.md +45 -45
- package/bin/skills/lakehouse-doc-en/references/data_catalog.md +23 -25
- package/bin/skills/lakehouse-doc-en/references/data_privacy.md +5 -2
- package/bin/skills/lakehouse-doc-en/references/data_sharing_between_accounts_guide.md +0 -4
- package/bin/skills/lakehouse-doc-en/references/data_visualization.md +4 -15
- package/bin/skills/lakehouse-doc-en/references/dataagent.md +39 -7
- package/bin/skills/lakehouse-doc-en/references/databricks-delta-to-lakehouse-migration.md +168 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-dlt-to-lakehouse-migration.md +331 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-external-catalog-practice.md +367 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-jobs-to-studio-migration.md +199 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-notebook-to-studio-migration.md +350 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-uc-governance-to-lakehouse-migration.md +327 -0
- package/bin/skills/lakehouse-doc-en/references/datagpt-model-config.md +34 -0
- package/bin/skills/lakehouse-doc-en/references/datagpt_data_source.md +50 -37
- package/bin/skills/lakehouse-doc-en/references/datagpt_introduction.md +55 -79
- package/bin/skills/lakehouse-doc-en/references/datagpt_quickstart.md +50 -64
- package/bin/skills/lakehouse-doc-en/references/datalake-acceleration.md +75 -2
- package/bin/skills/lakehouse-doc-en/references/dbt-databricks-to-clickzetta-migration.md +242 -0
- package/bin/skills/lakehouse-doc-en/references/dynamic-mask.md +30 -30
- package/bin/skills/lakehouse-doc-en/references/dynamic-table-bestpractice.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/dynamic-table-introduce.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/dynamic_table_summary.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/eco_integration/streamlit.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/eco_integration/superset.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/ecosystem-all.md +1 -3
- package/bin/skills/lakehouse-doc-en/references/ecosystem.md +145 -0
- package/bin/skills/lakehouse-doc-en/references/external-catalog-summary.md +33 -38
- package/bin/skills/lakehouse-doc-en/references/external-function-combo-practice.md +466 -0
- package/bin/skills/lakehouse-doc-en/references/f6fc6447ee.md +7 -9
- package/bin/skills/lakehouse-doc-en/references/federation-query.md +56 -6
- package/bin/skills/lakehouse-doc-en/references/finebi-mysql.md +2 -0
- package/bin/skills/lakehouse-doc-en/references/get-started-with-sample-data.md +10 -11
- package/bin/skills/lakehouse-doc-en/references/gitfolder.md +2 -3
- package/bin/skills/lakehouse-doc-en/references/grant-privileges.md +2 -0
- package/bin/skills/lakehouse-doc-en/references/iceberg-rest-catalog-databricks.md +166 -0
- package/bin/skills/lakehouse-doc-en/references/ide.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/if_else_task.md +59 -57
- package/bin/skills/lakehouse-doc-en/references/input_output.md +10 -7
- package/bin/skills/lakehouse-doc-en/references/jobprofile-bestpractices.md +60 -64
- package/bin/skills/lakehouse-doc-en/references/kafka-connection.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/key-concepts.md +146 -117
- package/bin/skills/lakehouse-doc-en/references/lakehouse-ai-gateway-cz-cli.md +317 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-ai-sql-analysis.md +345 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-dqc-guide.md +300 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-medallion-sql-dt-guide.md +543 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-multi-cloud-acceleration.md +274 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-multimodal-ai-pipeline.md +198 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-quick-experience_guide.md +49 -52
- package/bin/skills/lakehouse-doc-en/references/lakehouse-volume-pipe-acceleration-guide.md +380 -0
- package/bin/skills/lakehouse-doc-en/references/langchain-plug-installation.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/management.md +4 -9
- package/bin/skills/lakehouse-doc-en/references/medallion-lakehouse-from-scratch.md +2 -1
- package/bin/skills/lakehouse-doc-en/references/metrics_answer_build.md +58 -21
- package/bin/skills/lakehouse-doc-en/references/migrate-spark-data-engineering-best-practices-to-lakehouse.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/mindsdb.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/monitoring_and_alerting.md +65 -60
- package/bin/skills/lakehouse-doc-en/references/monitoring_item_specification.md +33 -33
- package/bin/skills/lakehouse-doc-en/references/multitable_batch_sync.md +16 -16
- package/bin/skills/lakehouse-doc-en/references/multitable_realtime_sync.md +65 -72
- package/bin/skills/lakehouse-doc-en/references/multitable_realtime_sync_sop.md +54 -52
- package/bin/skills/lakehouse-doc-en/references/navicat-mysql.md +2 -0
- package/bin/skills/lakehouse-doc-en/references/om-dynamic-table.md +71 -66
- package/bin/skills/lakehouse-doc-en/references/om-vcluster.md +2 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-create-session.md +79 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-generate-auth-token.md +63 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-overview.md +96 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-quick-start.md +286 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-response-guide.md +264 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-safe-question-poll.md +201 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-text2insight-query.md +99 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-text2insight-stop.md +74 -0
- package/bin/skills/lakehouse-doc-en/references/overview.md +6 -7
- package/bin/skills/lakehouse-doc-en/references/permission-application.md +5 -5
- package/bin/skills/lakehouse-doc-en/references/pipe-introduction.md +1 -0
- package/bin/skills/lakehouse-doc-en/references/pipe-kafka-table-stream.md +72 -70
- package/bin/skills/lakehouse-doc-en/references/pipe-kafka.md +105 -110
- package/bin/skills/lakehouse-doc-en/references/pipe-overview.md +40 -40
- package/bin/skills/lakehouse-doc-en/references/pipe-storage-object.md +43 -48
- package/bin/skills/lakehouse-doc-en/references/pipe-summary.md +14 -4
- package/bin/skills/lakehouse-doc-en/references/pipe-syntax.md +58 -151
- package/bin/skills/lakehouse-doc-en/references/practice_python_task.md +4 -4
- package/bin/skills/lakehouse-doc-en/references/pricing-ai-gateway.md +181 -0
- package/bin/skills/lakehouse-doc-en/references/pricing-lakehouse.md +316 -0
- package/bin/skills/lakehouse-doc-en/references/pricing.md +44 -288
- package/bin/skills/lakehouse-doc-en/references/private-link-general.md +0 -2
- package/bin/skills/lakehouse-doc-en/references/pyspark-to-zettapark-migration-f1.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/python-igs.md +7 -3
- package/bin/skills/lakehouse-doc-en/references/python-sample-put-github-rt-events.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/python-task.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/python_reference/connector.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/python_reference/connector_advanced.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/python_reference/connector_examples.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/python_sdk_guide.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/python_shell_datasource.md +11 -9
- package/bin/skills/lakehouse-doc-en/references/quick_start_batch_sync_data.md +9 -18
- package/bin/skills/lakehouse-doc-en/references/quick_start_bi_analysis.md +8 -25
- package/bin/skills/lakehouse-doc-en/references/quick_start_create_workspace.md +4 -6
- package/bin/skills/lakehouse-doc-en/references/quick_start_data_quality.md +8 -8
- package/bin/skills/lakehouse-doc-en/references/quick_start_etl.md +16 -20
- package/bin/skills/lakehouse-doc-en/references/quick_start_monitoring_and_alerting.md +10 -18
- package/bin/skills/lakehouse-doc-en/references/quick_start_sql_query.md +7 -10
- package/bin/skills/lakehouse-doc-en/references/quick_start_upload_data.md +5 -7
- package/bin/skills/lakehouse-doc-en/references/quick_start_user_management.md +8 -8
- package/bin/skills/lakehouse-doc-en/references/quick_start_workspace.md +0 -5
- package/bin/skills/lakehouse-doc-en/references/quick_start_workspace_user.md +8 -8
- package/bin/skills/lakehouse-doc-en/references/quickstart.md +69 -56
- package/bin/skills/lakehouse-doc-en/references/quickstart_datashare_between_companies.md +0 -5
- package/bin/skills/lakehouse-doc-en/references/quickstart_envirment_for_team.md +0 -24
- package/bin/skills/lakehouse-doc-en/references/realtime-pipeline-selection-guide.md +1 -2
- package/bin/skills/lakehouse-doc-en/references/realtime-sales-dashboard-with-dynamic-table.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/realtime_sync.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/release-note-2026-05-19.md +5 -3
- package/bin/skills/lakehouse-doc-en/references/revoke-privileges.md +3 -1
- package/bin/skills/lakehouse-doc-en/references/roles.md +2 -3
- package/bin/skills/lakehouse-doc-en/references/row-filter.md +165 -0
- package/bin/skills/lakehouse-doc-en/references/row_level_permission.md +30 -19
- package/bin/skills/lakehouse-doc-en/references/scheduled_task.md +28 -21
- package/bin/skills/lakehouse-doc-en/references/security_overview.md +99 -21
- package/bin/skills/lakehouse-doc-en/references/set-command.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/setup.md +13 -15
- package/bin/skills/lakehouse-doc-en/references/show-grants.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/snowflake-dynamic-tables-to-lakehouse.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/spark-connector-summary.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/sql_functions/context_functions/current_vcluster.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/sso-configuration.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/streaming_pipeline_with_dynamic_table.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/studio-incremental-sync-practice.md +27 -23
- package/bin/skills/lakehouse-doc-en/references/studio-shell-task.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/supported-cloud-platforms.md +32 -0
- package/bin/skills/lakehouse-doc-en/references/table_rendering.md +18 -12
- package/bin/skills/lakehouse-doc-en/references/task-develop.md +89 -91
- package/bin/skills/lakehouse-doc-en/references/task_development.md +19 -17
- package/bin/skills/lakehouse-doc-en/references/task_group.md +16 -14
- package/bin/skills/lakehouse-doc-en/references/task_instance.md +21 -21
- package/bin/skills/lakehouse-doc-en/references/task_param.md +38 -35
- package/bin/skills/lakehouse-doc-en/references/task_param_reference.md +81 -79
- package/bin/skills/lakehouse-doc-en/references/task_scheduling_dependency.md +20 -21
- package/bin/skills/lakehouse-doc-en/references/tencentcloud_arn_and_externalid.md +1 -5
- package/bin/skills/lakehouse-doc-en/references/trial-account-quotas-and-limits.md +1 -3
- package/bin/skills/lakehouse-doc-en/references/tutorial_connect_to_lakehouse.md +69 -0
- package/bin/skills/lakehouse-doc-en/references/tutorials.md +4 -1
- package/bin/skills/lakehouse-doc-en/references/unique-key.md +167 -0
- package/bin/skills/lakehouse-doc-en/references/usageandbillingview.md +138 -0
- package/bin/skills/lakehouse-doc-en/references/use-dbt-dev.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/use-java-sdk-realtime-uploaddata.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/use-java-sdk-upload-data-local.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/use-models.md +128 -0
- package/bin/skills/lakehouse-doc-en/references/use-mysql-client.md +81 -81
- package/bin/skills/lakehouse-doc-en/references/use-python-sdk-upload-data.md +10 -12
- package/bin/skills/lakehouse-doc-en/references/user-identification.md +2 -3
- package/bin/skills/lakehouse-doc-en/references/user_permission_grand_guide.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/using-udf-in-dynamic-table.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/vc_cache.md +18 -22
- package/bin/skills/lakehouse-doc-en/references/vcluster_size_description.md +33 -31
- package/bin/skills/lakehouse-doc-en/references/virtual-cluster.md +43 -45
- package/bin/skills/lakehouse-doc-en/references/web-job-history.md +94 -108
- package/bin/skills/lakehouse-doc-en/references/web_search.md +16 -7
- package/bin/skills/lakehouse-doc-en/references/zettapark-data-engineering-demo.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/zettapark-dataframe-guide.md +144 -70
- package/bin/skills/lakehouse-doc-en/references/zettapark-dynamic-table-guide.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/zettapark-etl-guide.md +73 -33
- package/bin/skills/lakehouse-doc-en/references/zettapark-feature-engineering.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/zettapark-functions-guide.md +75 -46
- package/bin/skills/lakehouse-doc-en/references/zettapark-quick-start.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/zettapark-stream-guide.md +4 -4
- package/bin/skills/lakehouse-doc-en/references/zettapark-volume-guide.md +93 -29
- package/package.json +1 -1
- package/bin/skills/lakehouse-doc-en/references/CLAUDE.md +0 -606
- package/bin/skills/lakehouse-doc-en/references/modelprice.md +0 -155
@@ -0,0 +1,367 @@

# Databricks Unity Catalog Federation Query Practice

Singdata Lakehouse queries tables in Databricks Unity Catalog directly through an External Catalog. The data stays in Databricks' S3 storage and is not moved; Lakehouse handles SQL execution and result delivery. This guide uses an AWS environment as an example and walks through the complete configuration from scratch.

![Architecture Diagram](.topwrite/assets/uc-federation-architecture.png)

---

## Prerequisites

- Databricks workspace: Unity Catalog support is required (the Free Edition supports it)
- Singdata Lakehouse instance: must run on the **same cloud platform** as the Databricks data storage (in this guide, both are on AWS and the data is in S3)
- Tools: cz-cli with the corresponding AWS profile pre-configured

---

## SQL Commands Involved

| Command | Purpose |
|---------|---------|
| `CREATE CATALOG CONNECTION` | Store Databricks OAuth M2M authentication credentials |
| `CREATE EXTERNAL CATALOG` | Create an external catalog pointing to Databricks Unity Catalog |
| `SHOW SCHEMAS IN` | List schemas in a Databricks Catalog |
| `SHOW TABLES IN` | List tables in a schema |
| `SELECT` | Query Databricks table data |

---

## Databricks Configuration

> 💡 **Account Console vs Workspace**: Databricks has two different entry points. `https://accounts.cloud.databricks.com` is the **Account Console**, which manages the organization, users, service principals, and other account-level settings. `https://dbc-xxx.cloud.databricks.com` is the **Workspace**, used for data development. The service principal (SP) described below is created in the Account Console; its permissions are granted in the Workspace.

### Create a Service Principal

Open `https://accounts.cloud.databricks.com` → **User management** → **Service principals** → **Add service principal**, and give it a recognizable name (e.g., `lakehouse_connector`).

> If `https://accounts.cloud.databricks.com` redirects directly into a Workspace, you may be using the Community Edition, which does not support Unity Catalog, so you cannot proceed. The Free Edition supports Unity Catalog and works with this guide.

After creating the SP, click its name to open the details page and complete the following:

1. **Roles** tab → Enable **Account admin**
2. **Principal information** tab → Record the **Application ID** (this is the `CLIENT_ID` used later, in UUID format)
3. **Credentials & secrets** tab → **Generate secret** → Record the complete **Secret** (this is the `CLIENT_SECRET` used later)

> **Note**: The Application ID is visible on both the SP list page and the Principal information tab. The Secret is shown in full only at generation time, so save it immediately. Each secret has an `Expires at` time; after it expires you must generate a new one and update the Catalog Connection.

### Add the SP to the Workspace

In the Databricks Workspace, go to **Settings** → **Identity and access** → **Service Principals** → **Add service principal**, and add the SP you just created.

> **Note**: Authentication succeeds only if the SP satisfies three conditions at the same time: it has the Account Admin role, it has been added to the Workspace, and it holds the Catalog/Schema permissions. Missing any one of these results in an `invalid_client` error.

### Enable Metastore External Data Access

In the Databricks Workspace → **Catalog** → gear icon → **Metastore** → **Details** tab → find **External data access** → enable the toggle.

If this option is not enabled, queries fail with:

```
PermissionDenied: External Data Access from non Databricks Compute environment is disabled for metastore
```

### Grant Catalog and Schema Permissions

In Databricks Catalog Explorer, grant the following permissions to the SP:

- **Catalog level**: `USE CATALOG`
- **Schema level**: `USE SCHEMA`, `SELECT`, `EXTERNAL USE SCHEMA`

> `EXTERNAL USE SCHEMA` is a required permission for federation queries. It can be granted in the regular Permissions panel; no SQL execution is needed.

If you have a SQL execution environment (Notebook or SQL Warehouse), you can also use GRANT commands (replace `<application-id>` with the SP's Application ID in UUID format):

```sql
GRANT USE CATALOG ON CATALOG workspace
  TO `<application-id>`;

GRANT USE SCHEMA ON SCHEMA workspace.table_types_demo
  TO `<application-id>`;

GRANT SELECT ON SCHEMA workspace.table_types_demo
  TO `<application-id>`;

-- Required permission for federation queries
GRANT EXTERNAL USE SCHEMA ON SCHEMA workspace.table_types_demo
  TO `<application-id>`;
```
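
To confirm that the grants took effect, the SP's privileges can be listed from the same Databricks SQL environment. The following is a minimal sketch, assuming the Databricks `SHOW GRANTS` statement and the same example catalog and schema names (`workspace`, `table_types_demo`) used in the GRANT commands; the privileges noted in the comments are what this guide expects, not captured output:

```sql
-- Run in Databricks (Notebook or SQL Warehouse), not in Lakehouse.

-- Privileges held by the SP on the catalog; USE CATALOG should be listed
SHOW GRANTS `<application-id>` ON CATALOG workspace;

-- Privileges held by the SP on the schema;
-- USE SCHEMA, SELECT, and EXTERNAL USE SCHEMA should be listed
SHOW GRANTS `<application-id>` ON SCHEMA workspace.table_types_demo;
```

If `EXTERNAL USE SCHEMA` is missing from the schema-level result, grant it before continuing with the Lakehouse-side setup.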

---

## Create a Catalog Connection

```sql
CREATE CATALOG CONNECTION IF NOT EXISTS databricks_conn
  TYPE databricks
  HOST = 'https://<workspace-url>.cloud.databricks.com'
  CLIENT_ID = '<application-id>'
  CLIENT_SECRET = '<oauth-secret>'
  ACCESS_REGION = '<s3-bucket-region>';
```

> **`ACCESS_REGION` must be the region of the S3 bucket, not the Databricks workspace region.**
>
> The two often differ: the region selected when creating the Databricks workspace may not match the region of the S3 bucket where the data is actually stored. Using the wrong value causes query timeouts or `PermanentRedirect` errors.
>
> **How to confirm the S3 bucket region**: In the Databricks Workspace, open **Catalog** in the left sidebar and click any table. In the right panel, open the **Details** tab (or expand the properties panel after clicking the table name) and check the **Storage Location** field, e.g., `s3://my-bucket/path/`. Then open the AWS S3 Console, search for this bucket, and check its **AWS Region** attribute.

Verify that the connection was created:

```sql
SHOW CATALOG CONNECTIONS;
```

---

## Create an External Catalog

```sql
CREATE EXTERNAL CATALOG IF NOT EXISTS databricks_catalog
  CONNECTION databricks_conn
  OPTIONS ('catalog' = '<databricks-catalog-name>');
```

The value of `catalog` is the name of the catalog in Databricks Unity Catalog. To find it, click the **Catalog** icon in the left sidebar of the Databricks Workspace and expand the left panel; the catalogs listed under **My organization** are the available ones (e.g., `workspace`, `main`, `hive_metastore`).

Verify:

```sql
SHOW SCHEMAS IN databricks_catalog;
```

Sample output:

```
schema_name
-----------
default
information_schema
table_types_demo
```

---

## Querying Data

```sql
-- View table list
SHOW TABLES IN databricks_catalog.table_types_demo;

-- Query data
SELECT * FROM databricks_catalog.table_types_demo.orders_external LIMIT 10;
```

---

## Federation Queries: Cross-Platform SQL Analysis

The core value of federation queries is that **Lakehouse can directly query and join Databricks data, and can also write Databricks data into Lakehouse local tables**, all without a separate data migration step.

### Scenario 1: JOIN Between Databricks Tables

`orders_external` (orders) and `customers_external` (customers) are joined via `customer_id` to calculate the order count and total spend per customer. Both tables have a `price` field of type `DECIMAL(10,2)`, requiring no conversion:

```sql
SELECT
    c.customer_name,
    c.country,
    c.loyalty_tier,
    COUNT(o.order_id) AS order_count,
    SUM(o.price) AS total_revenue
FROM databricks_catalog.table_types_demo.orders_external o
JOIN databricks_catalog.table_types_demo.customers_external c
    ON o.customer_id = c.customer_id
GROUP BY c.customer_name, c.country, c.loyalty_tier
ORDER BY total_revenue DESC;
```

Query results:

| customer_name | country | loyalty_tier | order_count | total_revenue |
|---------------|-----------|--------------|-------------|---------------|
| Alice Chen | China | Gold | 2 | 1698.99 |
| Frank Liu | China | Silver | 1 | 299.99 |
| Carol Zhang | China | Platinum | 2 | 249.98 |
| David Lee | Singapore | Bronze | 1 | 129.99 |
| Emma Wang | China | Silver | 1 | 89.99 |

### Scenario 2: Low Inventory Alert

`inventory_delta` records inventory by warehouse (fields: `product_id`, `warehouse_location`, `quantity_available`). The query below selects products whose stock is below a threshold (200 in this example) and summarizes them by warehouse:

```sql
SELECT
    warehouse_location,
    COUNT(*) AS low_stock_products,
    MIN(quantity_available) AS min_stock,
    AVG(quantity_available) AS avg_stock
FROM databricks_catalog.table_types_demo.inventory_delta
WHERE quantity_available < 200
GROUP BY warehouse_location
ORDER BY min_stock;
```

Query results:

| warehouse_location | low_stock_products | min_stock | avg_stock |
|--------------------|--------------------|-----------|-----------|
| Warehouse A | 1 | 50 | 50.0 |
| Warehouse B | 2 | 75 | 112.5 |
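
The summary shows how many products are low in each warehouse, but not which ones. A row-level variant is sketched below; it assumes only the three fields named above and reuses the same threshold of 200:

```sql
-- List each low-stock product instead of aggregating by warehouse
SELECT
    product_id,
    warehouse_location,
    quantity_available
FROM databricks_catalog.table_types_demo.inventory_delta
WHERE quantity_available < 200
ORDER BY warehouse_location, quantity_available;
```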

### Scenario 3: Writing Databricks Data to Lakehouse

Federation query results can be written directly into Lakehouse local tables for data consolidation. In the example below, `public` is a Lakehouse local schema (not a Databricks schema); replace it with your actual local schema name:

```sql
-- Extract completed orders from Databricks into a Lakehouse local table
CREATE TABLE public.orders_from_databricks AS
SELECT
    order_id,
    customer_id,
    order_date,
    product_name,
    CAST(price AS DECIMAL(10,2)) AS price,
    status
FROM databricks_catalog.table_types_demo.orders_external
WHERE status = 'Delivered';
```

Queries against the local table no longer read from Databricks storage:

```sql
SELECT * FROM public.orders_from_databricks;
```

| order_id | customer_id | order_date | product_name | price | status |
|----------|-------------|------------|-----------------|---------|-----------|
| 2006 | 1005 | 2026-05-20 | Webcam HD | 89.99 | Delivered |
| 2001 | 1001 | 2026-05-15 | Laptop Pro | 1299.99 | Delivered |
| 2004 | 1001 | 2026-05-18 | Monitor 27inch | 399.00 | Delivered |

### Scenario 4: Dynamic Table Consuming Databricks Data

Databricks data can serve as the upstream source of a Dynamic Table that periodically aggregates it into Lakehouse. `public` is a Lakehouse local schema; replace it with your actual local schema name:

```sql
CREATE OR REPLACE DYNAMIC TABLE public.orders_daily_summary
REFRESH INTERVAL 1 HOUR VCLUSTER DEFAULT
COMMENT 'Daily order summary aggregated from Databricks'
AS
SELECT
    order_date,
    COUNT(*) AS order_count,
    SUM(CAST(price AS DECIMAL(10,2))) AS total_revenue,
    AVG(CAST(price AS DECIMAL(10,2))) AS avg_order_value
FROM databricks_catalog.table_types_demo.orders_external
GROUP BY order_date;
```

Run a refresh, then query the table:

```sql
REFRESH DYNAMIC TABLE public.orders_daily_summary;
SELECT * FROM public.orders_daily_summary ORDER BY order_date;
```

| order_date | order_count | total_revenue | avg_order_value |
|------------|-------------|---------------|-----------------|
| 2026-05-15 | 1 | 1299.99 | 1299.99 |
| 2026-05-17 | 1 | 49.99 | 49.99 |
| 2026-05-18 | 1 | 399.00 | 399.00 |
| 2026-05-19 | 1 | 129.99 | 129.99 |
| 2026-05-20 | 1 | 89.99 | 89.99 |
| 2026-06-04 | 2 | 499.98 | 249.99 |

> **Note**: When a Dynamic Table references an External Catalog table, every refresh is a full scan because Databricks tables do not support Table Stream. For large data volumes, first snapshot the data into a local table with `CREATE TABLE ... AS SELECT`, then build the Dynamic Table on that local table.

---

## Supported Table Types

Not all Databricks tables can be queried from Lakehouse. Support depends on the table's storage type:

| Table Type | Format | Supported | Notes |
|------------|--------|-----------|-------|
| `TABLE_DELTA_EXTERNAL` | Delta | Yes | Fully supported, recommended |
| `TABLE_DELTA` | Delta | Yes | Fully supported |
| `TABLE_EXTERNAL` (Delta format) | Delta | Yes | Supported |
| `TABLE_EXTERNAL` (Parquet/CSV/JSON) | Non-Delta | No | Reports `unsupported databricks table format` |
| `TABLE_DB_STORAGE` | Managed Delta | No | Cross-platform access not supported |
| `VIEW` | N/A | No | Driver compatibility issue |

**Key conclusion**: Federation queries work only on **Delta format** tables, whether External or regular Delta tables. Databricks-managed tables (`TABLE_DB_STORAGE`) and views are excluded, and External tables in Parquet, CSV, or JSON format are currently not supported.
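
The support matrix above can be written as a small lookup, for example to pre-check tables before pointing a federation query at them. This is a minimal Python sketch: the function name `is_queryable` and the choice to pass the storage format as a separate argument are assumptions made for illustration, not part of any Lakehouse or Databricks API.

```python
# Support matrix transcribed from the table above.
# TABLE_EXTERNAL is handled separately because its support depends on the format.
SUPPORTED_BY_TYPE = {
    "TABLE_DELTA_EXTERNAL": True,
    "TABLE_DELTA": True,
    "TABLE_DB_STORAGE": False,  # managed storage, no cross-platform access
    "VIEW": False,              # driver compatibility issue
}


def is_queryable(table_type: str, data_format: str = "DELTA") -> bool:
    """Return True if the table type/format pair is listed as supported."""
    if table_type == "TABLE_EXTERNAL":
        return data_format.upper() == "DELTA"
    # Unknown table types are treated as unsupported (a conservative assumption).
    return SUPPORTED_BY_TYPE.get(table_type, False)


print(is_queryable("TABLE_EXTERNAL", "parquet"))
print(is_queryable("TABLE_DELTA_EXTERNAL"))
```

Treating unknown table types as unsupported is a deliberate choice in this sketch, so that a new or misspelled type fails the pre-check instead of failing at query time.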

When creating tables in Databricks, prefer Delta format:

```sql
-- Recommended: Delta External Table
CREATE TABLE catalog.schema.my_table
USING DELTA
LOCATION 's3://my-bucket/my-table/';

-- Not recommended (Lakehouse cannot query this)
CREATE TABLE catalog.schema.my_table
USING PARQUET
LOCATION 's3://my-bucket/my-table/';
```

---

## Common Error Troubleshooting

### `invalid_client`

OAuth authentication failed. Check in order:

1. Does the SP have **Account admin** enabled under Account Console → **Roles**?
2. Has the SP been added under Workspace → **Settings** → **Service Principals**?
3. Is the `CLIENT_SECRET` the complete value (not a masked value like `50db****7f61`)? If not, go to Account Console → SP → **Credentials & secrets** and generate a new one.
4. Has the Secret expired? Check the `Expires at` field on the **Credentials & secrets** page. If it has expired, generate a new Secret and re-run `CREATE CATALOG CONNECTION`.

### `PermissionDenied: External Data Access ... is disabled`

External data access is not enabled on the Metastore. In the Databricks Workspace, go to Catalog → gear icon → Metastore and enable **External data access**.

### `PermissionDenied: User does not have USE CATALOG`

The SP does not have access to the Catalog. In Databricks Catalog Explorer, open the corresponding Catalog → Permissions → Grant and add `USE CATALOG` for the SP.

### `PermissionDenied: User does not have EXTERNAL USE SCHEMA`

The SP does not have external access permission on the Schema. In Databricks Catalog Explorer, open the Schema → Permissions → Grant and add `USE SCHEMA`, `SELECT`, and `EXTERNAL USE SCHEMA` for the SP.

### `NotFound: Catalog 'main' does not exist`

The catalog name specified in `OPTIONS ('catalog' = ...)` does not exist. Open the Catalog panel in the Databricks Workspace to check the actual catalog names.

### Query Timeout (300 seconds) or `PermanentRedirect`

`ACCESS_REGION` is incorrect, so S3 requests are being redirected. Check the table's actual storage location (Catalog Explorer → table details → Storage Location), confirm the S3 bucket's region, and recreate the Catalog Connection.

### `unsupported databricks table format {} [PARQUET/CSV/JSON]`

The table uses a non-Delta format, which currently cannot be queried from Lakehouse. Recreate the table in Databricks using Delta format, or convert the data to a Delta table.

### `Table cannot be accessed from outside of Databricks Compute Environment ... kind being TABLE_DB_STORAGE`

The table is a Databricks Managed Table: its data is stored in Databricks-controlled storage, which does not support direct cross-platform access. Convert the table to an External Table in Databricks before querying it.
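
For scripted setups, the error-to-fix mapping in this section can be condensed into a lookup keyed on a distinctive substring of each message. The sketch below is illustrative only: `suggest_fix` and the `REMEDIES` list are hypothetical names, the remedy texts are condensed from the sections above, and real error strings may carry additional context around the matched fragment.

```python
from typing import Optional

# (substring to look for, suggested fix), in the order the errors are documented.
REMEDIES = [
    ("invalid_client",
     "Check the SP's Account admin role, workspace membership, and CLIENT_SECRET."),
    ("External Data Access",
     "Enable External data access on the Metastore."),
    ("USE CATALOG",
     "Grant USE CATALOG on the Catalog to the SP."),
    ("EXTERNAL USE SCHEMA",
     "Grant USE SCHEMA, SELECT, and EXTERNAL USE SCHEMA on the Schema to the SP."),
    ("NotFound: Catalog",
     "Check the catalog name passed in OPTIONS ('catalog' = ...)."),
    ("PermanentRedirect",
     "Set ACCESS_REGION to the S3 bucket's region and recreate the Catalog Connection."),
    ("unsupported databricks table format",
     "Recreate or convert the table as a Delta table in Databricks."),
    ("TABLE_DB_STORAGE",
     "Convert the Managed Table to an External Table in Databricks."),
]


def suggest_fix(error_message: str) -> Optional[str]:
    """Return the first remedy whose substring appears in the message, else None."""
    for needle, remedy in REMEDIES:
        if needle in error_message:
            return remedy
    return None


print(suggest_fix("PermissionDenied: User does not have USE CATALOG"))
```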

---

## Important Notes

| Note | Description |
|------|-------------|
| **Cloud platform restriction** | Databricks' S3 storage must be on the same cloud platform as the Lakehouse instance (both AWS). Databricks on GCP/Azure cannot be interconnected with an AWS Lakehouse. |
| **Region consistency** | `ACCESS_REGION` must match the S3 bucket's region, not the workspace region. |
| **Read-only restriction** | External Catalogs are read-only: writing data from Lakehouse to Databricks (INSERT/UPDATE/DELETE) is not supported. The reverse (writing Databricks data into Lakehouse) is fully supported; see Scenario 3. |
| **Version requirement** | Requires a Databricks edition that supports Unity Catalog. Free Edition is supported; Community Edition is not. |
| **Save your Secret** | The OAuth Secret is shown in full only at generation time, so save it immediately. If it is lost, you must generate a new one. |

---

## Related Documentation

- [Create Catalog Connection](create-catalog-connection.md): Complete DDL syntax and parameter descriptions
- [Create Databricks External Catalog](create-external-catalog.md): External Catalog DDL syntax
- [Federation Query Usage Guide](SQL_External_Catalog_Guide.md): Complete federation query examples
@@ -0,0 +1,199 @@
# Databricks Jobs → Lakehouse Studio Migration Guide: E-Commerce ETL Pipeline

If your data pipeline runs on Databricks Jobs (multi-task DAGs, task dependencies, scheduled triggers), migrating it to Singdata Lakehouse Studio takes little effort. Task content (PySpark/SQL code) needs only four mechanical substitutions to run on ZettaPark. Task orchestration (DAG dependencies, cron scheduling) is rebuilt with a few `cz-cli` commands in a single pass.

This article demonstrates the migration on a real e-commerce ETL pipeline: Bronze ingestion → Silver cleansing and joining → Gold aggregation. The pipeline has 3 tasks, DAG dependencies, and a daily 02:00 schedule; after migration to Lakehouse Studio it passes all 8 automated validation checks.

Full code on GitHub: [databricks2lakehouse-jobs](https://github.com/clickzetta/databricks2lakehouse-jobs)

---

## Source Project

`01_source/jobs/ecommerce_etl_job.json` is the original Databricks Jobs definition: 3 notebook tasks, dependency chain 01→02→03, daily trigger at 02:00:

```json
{
  "name": "ecommerce_etl_pipeline",
  "schedule": {"quartz_cron_expression": "0 0 2 * * ?"},
  "tasks": [
    {"task_key": "ingest_raw", "notebook_task": {...}},
    {"task_key": "transform_silver", "depends_on": [{"task_key": "ingest_raw"}]},
    {"task_key": "aggregate_gold", "depends_on": [{"task_key": "transform_silver"}]}
  ],
  "email_notifications": {"on_failure": ["oncall@company.com"]}
}
```
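
To see the dependency chain that has to be recreated in Studio, the `depends_on` entries of a Job definition can be resolved into an execution order. The following is a minimal Python sketch under stated assumptions: it uses a trimmed copy of the JSON above in which the elided `notebook_task` bodies are simply left out, and `execution_order` is a hypothetical helper name, not part of `cz-cli` or the Databricks API.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Trimmed copy of the Job definition: only the fields needed for ordering.
JOB_JSON = """
{
  "name": "ecommerce_etl_pipeline",
  "tasks": [
    {"task_key": "ingest_raw"},
    {"task_key": "transform_silver", "depends_on": [{"task_key": "ingest_raw"}]},
    {"task_key": "aggregate_gold", "depends_on": [{"task_key": "transform_silver"}]}
  ]
}
"""


def execution_order(job: dict) -> list:
    """Return task keys ordered so that every task follows its upstream tasks."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in job["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(json.loads(JOB_JSON))
print(order)
```

Creating the Studio tasks in this order means each upstream task already exists, and has a task ID, by the time a downstream task needs to reference it.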

The pipeline processes e-commerce clickstream data: 500 events and 30 products are reduced to a daily sales summary across 5 categories.

Migrated code is in `03_lakehouse/tasks/` and can be compared file by file with `01_source/notebooks/`.

## Conclusion First

| Change | Effort | Notes |
|--------|--------|------|
| Task content (Python code) | Very low | 4 ZettaPark substitutions (import, session, table path, saveAsTable) |
| Task creation | Low | `cz-cli task create --type PYTHON --folder <id>` |
| Dependencies | Low | `--dep-tasks '[{"taskId":N,"taskName":"x"}]'` |
| Quartz cron → standard cron | Very low | `"0 0 2 * * ?"` → `"0 2 * * *"` |
| Alert notifications | Low | Databricks email → Studio monitoring rules (email/DingTalk/Feishu) |

`dbutils.notebook.run(nb)` calls and Job cluster configuration need no migration: Studio DAG dependencies replace chained notebook invocation, and the Virtual Cluster is managed automatically.

---

## Tech Stack Comparison

| Aspect | Databricks Jobs | Lakehouse Studio |
|---|---|---|
| Pipeline definition | Job JSON (`tasks: [...]`) | `cz-cli task create/save-config` |
| Task dependencies | `depends_on: [{task_key}]` | `--dep-tasks '[{"taskId":N,"taskName":"x"}]'` |
| Task content | Databricks Notebook (PySpark) | Studio Python task (ZettaPark) |
| Session | `spark` (global injection) | `clickzetta_dbutils.get_active_lakehouse_engine()` |
| Schedule cron | Quartz `"0 0 2 * * ?"` | Standard `"0 2 * * *"` |
| Cluster configuration | `job_clusters: [{...EC2 config}]` | Virtual Cluster auto-managed, no configuration needed |
| `dbutils.notebook.run(nb)` | Chained invocation | Replaced by Studio DAG dependencies |
| Failure alerts | `email_notifications.on_failure` | Studio monitoring rules (email/DingTalk/Feishu) |

---

## Migration Steps

### Step 1: Task Content (4 ZettaPark Substitutions)

Each notebook requires minimal changes; all business logic is preserved:

```python
# Databricks notebook (original)
from pyspark.sql import functions as F                        # ← pyspark

df = spark.read.csv("/Volumes/ecommerce/landing/events/")     # ← spark global
events = spark.table("ecommerce.bronze.raw_events")           # ← 3-level naming

df.write.saveAsTable("ecommerce.silver.events_enriched")
```

```python
# Studio Python task (03_lakehouse/tasks/02_transform_silver.py)
from clickzetta.zettapark import functions as F               # ① import
# session injected by platform (via clickzetta_dbutils)       # ② session

events = session.table("jobs_bronze.raw_events")              # ③ table path
df.write.saveAsTable("jobs_silver.events_enriched")           # ④ saveAsTable
```

DataFrame logic (join/filter/groupBy/agg/withColumn) is completely unchanged.
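
The table path substitution is mechanical enough to script when many notebooks are involved. The sketch below assumes the naming pattern visible in the example above (the catalog level is dropped and the schema gains a `jobs_` prefix); that prefix is specific to this sample project, and `to_lakehouse_table` is a hypothetical helper, not a ZettaPark function.

```python
def to_lakehouse_table(name: str, prefix: str = "jobs_") -> str:
    """Map a 3-level Databricks name (catalog.schema.table) to schema.table.

    The default prefix mirrors the sample project; adjust it to your own schemas.
    """
    parts = name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    _catalog, schema, table = parts
    return f"{prefix}{schema}.{table}"


print(to_lakehouse_table("ecommerce.bronze.raw_events"))
print(to_lakehouse_table("ecommerce.silver.events_enriched"))
```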

### Step 2: Task Creation

```bash
# "task_key" in Databricks Jobs JSON corresponds to Studio task name
# --type is required (SQL / PYTHON / SHELL etc.)
# --folder specifies task folder ID (query via cz-cli task list-folders)

cz-cli task create etl_01_ingest_raw --type PYTHON --folder 91047 --profile aws_singapore_prod
cz-cli task create etl_02_transform_silver --type PYTHON --folder 91047 --profile aws_singapore_prod
cz-cli task create etl_03_aggregate_gold --type PYTHON --folder 91047 --profile aws_singapore_prod

# Upload task scripts
cz-cli task save-content <task_id> --file 03_lakehouse/tasks/01_ingest_raw.py
```
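
For Jobs with many tasks, the `cz-cli task create` lines can be generated from the list of `task_key` values instead of being typed by hand. This is a minimal sketch that only builds command strings using the flags shown above; the `etl_NN_` naming scheme is inferred from the example, the folder ID and profile are the sample values from this guide, and `create_commands` is a hypothetical helper.

```python
import shlex


def create_commands(task_keys, folder_id, profile, task_type="PYTHON"):
    """Build one `cz-cli task create` command string per Databricks task_key."""
    commands = []
    for index, key in enumerate(task_keys, start=1):
        name = f"etl_{index:02d}_{key}"  # naming scheme assumed from the example
        commands.append(shlex.join([
            "cz-cli", "task", "create", name,
            "--type", task_type,
            "--folder", str(folder_id),
            "--profile", profile,
        ]))
    return commands


commands = create_commands(
    ["ingest_raw", "transform_silver", "aggregate_gold"],
    folder_id=91047,
    profile="aws_singapore_prod",
)
for command in commands:
    print(command)
```

The sketch prints the commands without running them, so they can be reviewed before execution.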

### Step 3: Set DAG Dependencies

Databricks Jobs uses `depends_on: [{task_key}]`; Studio uses `--dep-tasks` (requires both taskId and taskName):

```bash
# Databricks Job JSON:
# {"task_key": "transform_silver", "depends_on": [{"task_key": "ingest_raw"}]}

# Studio equivalent (requires both taskId and taskName):
cz-cli task save-config <id_02> \
  --deps replace \
  --dep-tasks '[{"taskId":10143594,"taskName":"etl_01_ingest_raw"}]'

cz-cli task save-config <id_03> \
  --deps replace \
  --dep-tasks '[{"taskId":10144488,"taskName":"etl_02_transform_silver"}]'
```
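
Because `--dep-tasks` takes a JSON array in which every entry carries both `taskId` and `taskName`, it can be safer to serialize the value than to hand-write it. The sketch below is an illustration under assumptions: `dep_tasks_arg` is a hypothetical helper, and the task IDs are the sample values from the commands above.

```python
import json
import shlex


def dep_tasks_arg(upstreams):
    """Serialize (task_id, task_name) pairs into the JSON expected by --dep-tasks."""
    entries = [{"taskId": task_id, "taskName": name} for task_id, name in upstreams]
    # Compact separators reproduce the spacing used in the examples above.
    return json.dumps(entries, separators=(",", ":"))


arg = dep_tasks_arg([(10143594, "etl_01_ingest_raw")])
print(arg)

# shlex.join adds the shell quoting that the JSON value needs.
command = shlex.join([
    "cz-cli", "task", "save-config", "10144488",
    "--deps", "replace",
    "--dep-tasks", arg,
])
print(command)
```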

### Step 4: Schedule Cron

Databricks uses the Quartz 6-field format; Studio uses standard 5-field cron. To convert, drop the leading seconds field and replace `?` with `*`:

```bash
# Databricks: "quartz_cron_expression": "0 0 2 * * ?"  (seconds minutes hours day month weekday)
# Studio:     standard cron "0 2 * * *"                (minutes hours day month weekday)

cz-cli task save-cron <id_01> --cron "0 2 * * *"
```
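
The conversion can be sketched in Python for simple schedules. This is not a general Quartz translator: `quartz_to_cron` is a hypothetical helper that rejects anything it cannot convert safely, including non-zero seconds, a specific year, the Quartz-only `L`, `W`, and `#` modifiers, and numeric weekdays (Quartz numbers weekdays 1-7 starting at Sunday, while standard cron uses 0-6).

```python
def quartz_to_cron(expr: str) -> str:
    """Convert a simple Quartz expression to 5-field cron, or raise ValueError."""
    fields = expr.split()
    if len(fields) not in (6, 7):
        raise ValueError(f"expected 6 or 7 Quartz fields, got {len(fields)}")
    seconds, minutes, hours, day_of_month, month, day_of_week = fields[:6]

    if seconds != "0":
        raise ValueError("standard cron has no seconds field; seconds must be 0")
    if len(fields) == 7 and fields[6] != "*":
        raise ValueError("a specific year cannot be expressed in standard cron")
    if any(ch in day_of_month for ch in "LW") or any(ch in day_of_week for ch in "L#"):
        raise ValueError("Quartz-only modifiers (L, W, #) are not supported")
    if any(ch.isdigit() for ch in day_of_week):
        raise ValueError("numeric weekdays differ between Quartz and cron")

    # '?' means "no specific value" in Quartz; standard cron uses '*'.
    day_of_month = "*" if day_of_month == "?" else day_of_month
    day_of_week = "*" if day_of_week == "?" else day_of_week
    return " ".join([minutes, hours, day_of_month, month, day_of_week])


print(quartz_to_cron("0 0 2 * * ?"))
```

Failing loudly on unsupported input is intentional here, since a silently wrong schedule is harder to notice than a conversion error.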

### Step 5: Deploy

```bash
cz-cli task deploy <id_01>   # Must deploy before task can be scheduled
cz-cli task deploy <id_02>
cz-cli task deploy <id_03>

# Manual trigger (equivalent to Databricks "Run now")
cz-cli task execute <id_01>
```

### Step 6: Failure Alert Configuration

Databricks configures `email_notifications` directly in the Job JSON; Studio configures alerts through monitoring rules:

| Databricks | Studio |
|---|---|
| Job JSON `email_notifications.on_failure` | Studio UI: Alert Monitoring → New Monitoring Rule |
| Email only | Supports email, SMS, phone (high severity), Webhook (DingTalk/Feishu) |
| Task-level configuration | Notification policies can be reused across tasks |

**Configuration path**: Studio UI → Operations Monitoring → Alert Monitoring → New Monitoring Rule → Select "Task Instance Failure" event → Configure notification method (email/DingTalk/Feishu Webhook)

---

## E2E Validation Results

Tested on an AWS Singapore instance; all 8 checks passed:

| Check | Expected | Result |
|--------|--------|------|
| jobs_bronze.raw_events | 500 | ✅ |
| jobs_bronze.products | 30 | ✅ |
| jobs_silver.events_enriched | 500 | ✅ |
| jobs_gold.daily_sales rows | 115 | ✅ |
| Total sales amount | 12,814.84 | ✅ |
| Total order count | 119 | ✅ |
| Category count | 5 | ✅ |
| Studio tasks all ONLINE | 3/3 | ✅ |

---

## Notes

- **`--dep-tasks` requires both taskId and taskName**: Passing only taskId will return `taskName is required`. Both fields are mandatory.
- **`--folder` takes folder ID, not name**: Query the ID with `cz-cli task list-folders`.
- **Task names cannot contain slashes**: The `folder/taskname` format will be parsed as a path in the CLI. Use the `--folder <id>` parameter to specify the folder, and only write the task name in the name field.
- **`--type` is required**: `cz-cli task create` will error without `--type`. Common types: `PYTHON`, `SQL`, `SHELL`.
- **Cron format conversion**: Quartz 6-field (seconds minutes hours day month weekday) → standard 5-field (minutes hours day month weekday). `"0 0 2 * * ?"` → `"0 2 * * *"`.

## Related Documentation

### Studio Task Development

- [Task Development and Scheduling](task-develop.md): Creating, editing content, and scheduling Studio tasks
- [Task Scheduling Dependencies](task_scheduling_dependency.md): DAG dependency configuration details
- [Studio Python Task Development Guide (ZettaPark)](studio-python-task-zettapark.md)
- [Studio Task Development and Operations (cz-cli)](cz-cli-studio-tasks.md)

### Other Migration Guides

- [Databricks Notebook → Lakehouse Migration Guide](databricks-notebook-to-studio-migration.md): Single Notebook → Studio task
- [Databricks DLT → Lakehouse Migration Guide](databricks-dlt-to-lakehouse-migration.md): Declarative pipeline migration
- [Databricks Unity Catalog → Lakehouse Migration Guide](databricks-uc-governance-to-lakehouse-migration.md): Permissions and governance