@clickzetta/cz-cli-darwin-arm64 0.5.16 → 0.5.18
- package/bin/cz-cli +0 -0
- package/bin/skills/lakehouse-doc-en/SKILL.md +6 -11
- package/bin/skills/lakehouse-doc-en/references/AIGateway.md +58 -13
- package/bin/skills/lakehouse-doc-en/references/Computation.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/DataSource_Amazon_DocumentDB.md +3 -1
- package/bin/skills/lakehouse-doc-en/references/Foreach.md +14 -14
- package/bin/skills/lakehouse-doc-en/references/JDBC-Driver.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/LakehouseAI-overview.md +21 -8
- package/bin/skills/lakehouse-doc-en/references/LakehouseDataGPT-tour.md +4 -9
- package/bin/skills/lakehouse-doc-en/references/LakehouseStudio-tour.md +14 -19
- package/bin/skills/lakehouse-doc-en/references/Lakehouse_Zilliz_MakeDataReadyforBIandAI.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/Logstash.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/Migrate_Spark_DataEngineeringBestPractices_Project_to_Lakehouse.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/Notebook.md +17 -17
- package/bin/skills/lakehouse-doc-en/references/RemoteFunction-as-udf.md +14 -14
- package/bin/skills/lakehouse-doc-en/references/SQL_External_Catalog_Guide.md +1 -9
- package/bin/skills/lakehouse-doc-en/references/SUMMARY.md +59 -29
- package/bin/skills/lakehouse-doc-en/references/WINDOWFUNCTION.md +99 -57
- package/bin/skills/lakehouse-doc-en/references/Zettapark_Data_Engineering_Demo.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/access-control-configuration.md +1 -8
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-2-5-1.0.md +16 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-3-29-1.0.2.md +14 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-3-8-1.0.1.md +16 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-4-28-1.1.md +29 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-12-1.1.1.md +18 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-15-1.2.md +9 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-21-1.3.md +9 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-5-28-1.4.md +10 -0
- package/bin/skills/lakehouse-doc-en/references/aigw-2026-6-3-1.5.md +9 -0
- package/bin/skills/lakehouse-doc-en/references/alicloud-arn-externalid.md +0 -5
- package/bin/skills/lakehouse-doc-en/references/answer-accuracy-improve.md +120 -103
- package/bin/skills/lakehouse-doc-en/references/application-list.md +1 -3
- package/bin/skills/lakehouse-doc-en/references/approval-list.md +16 -17
- package/bin/skills/lakehouse-doc-en/references/batch-load-parquet-file-into-lakehouse.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/batch_sync.md +9 -9
- package/bin/skills/lakehouse-doc-en/references/batch_sync_Sop.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/batchloadparquetfileintoLakehouse.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/bulkloadv1-python-sdk.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/chart-auto-refresh-guide.md +12 -6
- package/bin/skills/lakehouse-doc-en/references/clickzetta-sample-data.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/code_approval.md +1 -5
- package/bin/skills/lakehouse-doc-en/references/composite_task.md +31 -42
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_environment_and_data_generate.md +6 -9
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_javasdk_bulkload_realtime.md +4 -10
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_kafka_realtime_sync.md +1 -10
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_local_file_into_table_by_studio.md +0 -6
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_batchload_public_network.md +0 -5
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_python_node.md +2 -7
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_realtime_cdc_public_network.md +13 -18
- package/bin/skills/lakehouse-doc-en/references/comprehensive_guide_to_ingesting_studio_sql_insert.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/concepts.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/config-datasource.md +5 -7
- package/bin/skills/lakehouse-doc-en/references/connect-with-cli.md +116 -72
- package/bin/skills/lakehouse-doc-en/references/connect-with-cz-cli.md +151 -0
- package/bin/skills/lakehouse-doc-en/references/continue-job.md +9 -17
- package/bin/skills/lakehouse-doc-en/references/create-api-connection.md +315 -286
- package/bin/skills/lakehouse-doc-en/references/create-catalog-connection.md +1 -0
- package/bin/skills/lakehouse-doc-en/references/create-dynamic-table.md +4 -4
- package/bin/skills/lakehouse-doc-en/references/create-external-catalog.md +85 -22
- package/bin/skills/lakehouse-doc-en/references/create-table-ddl.md +45 -0
- package/bin/skills/lakehouse-doc-en/references/creating_alicloud_privatelinkendpoint.md +4 -6
- package/bin/skills/lakehouse-doc-en/references/creating_alicloud_privatelinkservice.md +4 -7
- package/bin/skills/lakehouse-doc-en/references/creating_tencentcloud_privatelinkendpoint.md +2 -7
- package/bin/skills/lakehouse-doc-en/references/creating_tencentcloud_privatelinkservice.md +1 -5
- package/bin/skills/lakehouse-doc-en/references/cz-cli-agent.md +15 -10
- package/bin/skills/lakehouse-doc-en/references/cz-cli-datasource.md +0 -8
- package/bin/skills/lakehouse-doc-en/references/cz-cli-sql.md +2 -45
- package/bin/skills/lakehouse-doc-en/references/cz-cli.md +53 -42
- package/bin/skills/lakehouse-doc-en/references/dashboard-version-management-guide.md +12 -4
- package/bin/skills/lakehouse-doc-en/references/data-integration-intro.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/data-integration.md +29 -27
- package/bin/skills/lakehouse-doc-en/references/data-load-summary.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/data-quality.md +25 -25
- package/bin/skills/lakehouse-doc-en/references/data-sharing.md +31 -54
- package/bin/skills/lakehouse-doc-en/references/data-sources.md +45 -45
- package/bin/skills/lakehouse-doc-en/references/data_catalog.md +23 -25
- package/bin/skills/lakehouse-doc-en/references/data_privacy.md +5 -2
- package/bin/skills/lakehouse-doc-en/references/data_sharing_between_accounts_guide.md +0 -4
- package/bin/skills/lakehouse-doc-en/references/data_visualization.md +4 -15
- package/bin/skills/lakehouse-doc-en/references/dataagent.md +39 -7
- package/bin/skills/lakehouse-doc-en/references/databricks-delta-to-lakehouse-migration.md +168 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-dlt-to-lakehouse-migration.md +331 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-external-catalog-practice.md +367 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-jobs-to-studio-migration.md +199 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-notebook-to-studio-migration.md +350 -0
- package/bin/skills/lakehouse-doc-en/references/databricks-uc-governance-to-lakehouse-migration.md +327 -0
- package/bin/skills/lakehouse-doc-en/references/datagpt-model-config.md +34 -0
- package/bin/skills/lakehouse-doc-en/references/datagpt_data_source.md +50 -37
- package/bin/skills/lakehouse-doc-en/references/datagpt_introduction.md +55 -79
- package/bin/skills/lakehouse-doc-en/references/datagpt_quickstart.md +50 -64
- package/bin/skills/lakehouse-doc-en/references/datalake-acceleration.md +75 -2
- package/bin/skills/lakehouse-doc-en/references/dbt-databricks-to-clickzetta-migration.md +242 -0
- package/bin/skills/lakehouse-doc-en/references/dynamic-mask.md +30 -30
- package/bin/skills/lakehouse-doc-en/references/dynamic-table-bestpractice.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/dynamic-table-introduce.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/dynamic_table_summary.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/eco_integration/streamlit.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/eco_integration/superset.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/ecosystem-all.md +1 -3
- package/bin/skills/lakehouse-doc-en/references/ecosystem.md +145 -0
- package/bin/skills/lakehouse-doc-en/references/external-catalog-summary.md +33 -38
- package/bin/skills/lakehouse-doc-en/references/external-function-combo-practice.md +466 -0
- package/bin/skills/lakehouse-doc-en/references/f6fc6447ee.md +7 -9
- package/bin/skills/lakehouse-doc-en/references/federation-query.md +56 -6
- package/bin/skills/lakehouse-doc-en/references/finebi-mysql.md +2 -0
- package/bin/skills/lakehouse-doc-en/references/get-started-with-sample-data.md +10 -11
- package/bin/skills/lakehouse-doc-en/references/gitfolder.md +2 -3
- package/bin/skills/lakehouse-doc-en/references/grant-privileges.md +2 -0
- package/bin/skills/lakehouse-doc-en/references/iceberg-rest-catalog-databricks.md +166 -0
- package/bin/skills/lakehouse-doc-en/references/ide.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/if_else_task.md +59 -57
- package/bin/skills/lakehouse-doc-en/references/input_output.md +10 -7
- package/bin/skills/lakehouse-doc-en/references/jobprofile-bestpractices.md +60 -64
- package/bin/skills/lakehouse-doc-en/references/kafka-connection.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/key-concepts.md +146 -117
- package/bin/skills/lakehouse-doc-en/references/lakehouse-ai-gateway-cz-cli.md +317 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-ai-sql-analysis.md +345 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-dqc-guide.md +300 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-medallion-sql-dt-guide.md +543 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-multi-cloud-acceleration.md +274 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-multimodal-ai-pipeline.md +198 -0
- package/bin/skills/lakehouse-doc-en/references/lakehouse-quick-experience_guide.md +49 -52
- package/bin/skills/lakehouse-doc-en/references/lakehouse-volume-pipe-acceleration-guide.md +380 -0
- package/bin/skills/lakehouse-doc-en/references/langchain-plug-installation.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/management.md +4 -9
- package/bin/skills/lakehouse-doc-en/references/medallion-lakehouse-from-scratch.md +2 -1
- package/bin/skills/lakehouse-doc-en/references/metrics_answer_build.md +58 -21
- package/bin/skills/lakehouse-doc-en/references/migrate-spark-data-engineering-best-practices-to-lakehouse.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/mindsdb.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/monitoring_and_alerting.md +65 -60
- package/bin/skills/lakehouse-doc-en/references/monitoring_item_specification.md +33 -33
- package/bin/skills/lakehouse-doc-en/references/multitable_batch_sync.md +16 -16
- package/bin/skills/lakehouse-doc-en/references/multitable_realtime_sync.md +65 -72
- package/bin/skills/lakehouse-doc-en/references/multitable_realtime_sync_sop.md +54 -52
- package/bin/skills/lakehouse-doc-en/references/navicat-mysql.md +2 -0
- package/bin/skills/lakehouse-doc-en/references/om-dynamic-table.md +71 -66
- package/bin/skills/lakehouse-doc-en/references/om-vcluster.md +2 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-create-session.md +79 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-generate-auth-token.md +63 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-overview.md +96 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-quick-start.md +286 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-response-guide.md +264 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-safe-question-poll.md +201 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-text2insight-query.md +99 -0
- package/bin/skills/lakehouse-doc-en/references/open-api-text2insight-stop.md +74 -0
- package/bin/skills/lakehouse-doc-en/references/overview.md +6 -7
- package/bin/skills/lakehouse-doc-en/references/permission-application.md +5 -5
- package/bin/skills/lakehouse-doc-en/references/pipe-introduction.md +1 -0
- package/bin/skills/lakehouse-doc-en/references/pipe-kafka-table-stream.md +72 -70
- package/bin/skills/lakehouse-doc-en/references/pipe-kafka.md +105 -110
- package/bin/skills/lakehouse-doc-en/references/pipe-overview.md +40 -40
- package/bin/skills/lakehouse-doc-en/references/pipe-storage-object.md +43 -48
- package/bin/skills/lakehouse-doc-en/references/pipe-summary.md +14 -4
- package/bin/skills/lakehouse-doc-en/references/pipe-syntax.md +58 -151
- package/bin/skills/lakehouse-doc-en/references/practice_python_task.md +4 -4
- package/bin/skills/lakehouse-doc-en/references/pricing-ai-gateway.md +181 -0
- package/bin/skills/lakehouse-doc-en/references/pricing-lakehouse.md +316 -0
- package/bin/skills/lakehouse-doc-en/references/pricing.md +44 -288
- package/bin/skills/lakehouse-doc-en/references/private-link-general.md +0 -2
- package/bin/skills/lakehouse-doc-en/references/pyspark-to-zettapark-migration-f1.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/python-igs.md +7 -3
- package/bin/skills/lakehouse-doc-en/references/python-sample-put-github-rt-events.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/python-task.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/python_reference/connector.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/python_reference/connector_advanced.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/python_reference/connector_examples.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/python_sdk_guide.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/python_shell_datasource.md +11 -9
- package/bin/skills/lakehouse-doc-en/references/quick_start_batch_sync_data.md +9 -18
- package/bin/skills/lakehouse-doc-en/references/quick_start_bi_analysis.md +8 -25
- package/bin/skills/lakehouse-doc-en/references/quick_start_create_workspace.md +4 -6
- package/bin/skills/lakehouse-doc-en/references/quick_start_data_quality.md +8 -8
- package/bin/skills/lakehouse-doc-en/references/quick_start_etl.md +16 -20
- package/bin/skills/lakehouse-doc-en/references/quick_start_monitoring_and_alerting.md +10 -18
- package/bin/skills/lakehouse-doc-en/references/quick_start_sql_query.md +7 -10
- package/bin/skills/lakehouse-doc-en/references/quick_start_upload_data.md +5 -7
- package/bin/skills/lakehouse-doc-en/references/quick_start_user_management.md +8 -8
- package/bin/skills/lakehouse-doc-en/references/quick_start_workspace.md +0 -5
- package/bin/skills/lakehouse-doc-en/references/quick_start_workspace_user.md +8 -8
- package/bin/skills/lakehouse-doc-en/references/quickstart.md +69 -56
- package/bin/skills/lakehouse-doc-en/references/quickstart_datashare_between_companies.md +0 -5
- package/bin/skills/lakehouse-doc-en/references/quickstart_envirment_for_team.md +0 -24
- package/bin/skills/lakehouse-doc-en/references/realtime-pipeline-selection-guide.md +1 -2
- package/bin/skills/lakehouse-doc-en/references/realtime-sales-dashboard-with-dynamic-table.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/realtime_sync.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/release-note-2026-05-19.md +5 -3
- package/bin/skills/lakehouse-doc-en/references/revoke-privileges.md +3 -1
- package/bin/skills/lakehouse-doc-en/references/roles.md +2 -3
- package/bin/skills/lakehouse-doc-en/references/row-filter.md +165 -0
- package/bin/skills/lakehouse-doc-en/references/row_level_permission.md +30 -19
- package/bin/skills/lakehouse-doc-en/references/scheduled_task.md +28 -21
- package/bin/skills/lakehouse-doc-en/references/security_overview.md +99 -21
- package/bin/skills/lakehouse-doc-en/references/set-command.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/setup.md +13 -15
- package/bin/skills/lakehouse-doc-en/references/show-grants.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/snowflake-dynamic-tables-to-lakehouse.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/spark-connector-summary.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/sql_functions/context_functions/current_vcluster.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/sso-configuration.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/streaming_pipeline_with_dynamic_table.md +0 -1
- package/bin/skills/lakehouse-doc-en/references/studio-incremental-sync-practice.md +27 -23
- package/bin/skills/lakehouse-doc-en/references/studio-shell-task.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/supported-cloud-platforms.md +32 -0
- package/bin/skills/lakehouse-doc-en/references/table_rendering.md +18 -12
- package/bin/skills/lakehouse-doc-en/references/task-develop.md +89 -91
- package/bin/skills/lakehouse-doc-en/references/task_development.md +19 -17
- package/bin/skills/lakehouse-doc-en/references/task_group.md +16 -14
- package/bin/skills/lakehouse-doc-en/references/task_instance.md +21 -21
- package/bin/skills/lakehouse-doc-en/references/task_param.md +38 -35
- package/bin/skills/lakehouse-doc-en/references/task_param_reference.md +81 -79
- package/bin/skills/lakehouse-doc-en/references/task_scheduling_dependency.md +20 -21
- package/bin/skills/lakehouse-doc-en/references/tencentcloud_arn_and_externalid.md +1 -5
- package/bin/skills/lakehouse-doc-en/references/trial-account-quotas-and-limits.md +1 -3
- package/bin/skills/lakehouse-doc-en/references/tutorial_connect_to_lakehouse.md +69 -0
- package/bin/skills/lakehouse-doc-en/references/tutorials.md +4 -1
- package/bin/skills/lakehouse-doc-en/references/unique-key.md +167 -0
- package/bin/skills/lakehouse-doc-en/references/usageandbillingview.md +138 -0
- package/bin/skills/lakehouse-doc-en/references/use-dbt-dev.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/use-java-sdk-realtime-uploaddata.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/use-java-sdk-upload-data-local.md +3 -3
- package/bin/skills/lakehouse-doc-en/references/use-models.md +128 -0
- package/bin/skills/lakehouse-doc-en/references/use-mysql-client.md +81 -81
- package/bin/skills/lakehouse-doc-en/references/use-python-sdk-upload-data.md +10 -12
- package/bin/skills/lakehouse-doc-en/references/user-identification.md +2 -3
- package/bin/skills/lakehouse-doc-en/references/user_permission_grand_guide.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/using-udf-in-dynamic-table.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/vc_cache.md +18 -22
- package/bin/skills/lakehouse-doc-en/references/vcluster_size_description.md +33 -31
- package/bin/skills/lakehouse-doc-en/references/virtual-cluster.md +43 -45
- package/bin/skills/lakehouse-doc-en/references/web-job-history.md +94 -108
- package/bin/skills/lakehouse-doc-en/references/web_search.md +16 -7
- package/bin/skills/lakehouse-doc-en/references/zettapark-data-engineering-demo.md +1 -1
- package/bin/skills/lakehouse-doc-en/references/zettapark-dataframe-guide.md +144 -70
- package/bin/skills/lakehouse-doc-en/references/zettapark-dynamic-table-guide.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/zettapark-etl-guide.md +73 -33
- package/bin/skills/lakehouse-doc-en/references/zettapark-feature-engineering.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/zettapark-functions-guide.md +75 -46
- package/bin/skills/lakehouse-doc-en/references/zettapark-quick-start.md +2 -2
- package/bin/skills/lakehouse-doc-en/references/zettapark-stream-guide.md +4 -4
- package/bin/skills/lakehouse-doc-en/references/zettapark-volume-guide.md +93 -29
- package/package.json +1 -1
- package/bin/skills/lakehouse-doc-en/references/CLAUDE.md +0 -606
- package/bin/skills/lakehouse-doc-en/references/modelprice.md +0 -155
@@ -30,7 +30,6 @@ The core working principle of Data Share is:
 
 **Data Sharing Flow**:
 
-:-:
 
 The data provider retains full control and can update or remove shared data at any time. When source data is updated, the consumer immediately obtains the latest data without additional data sync operations.
 
@@ -69,7 +68,6 @@ ALTER SHARE taxi_data_share ADD INSTANCE target_instance_name;
 5. Under **Receiving Instances**, click **Add** and enter the consumer's service instance name.
 6. Click **Confirm** to complete creation.
 
-:-:
 
 #### 4.1.3 Creating a View for Partial Data Sharing
 
@@ -140,7 +138,6 @@ DESC SHARE source_instance_name.taxi_data_share;
 2. In the left navigation, select **Data Management** -> **Data Share**.
 3. Switch to the **Shared with Me** tab to see all received shares.
 
-:-:
 
 ### 5.2 Creating a Local Schema Linked to Shared Data
 
@@ -165,7 +162,6 @@ Where:
 3. In the popup, select the **Source Schema** and specify the schema name to be created locally.
 4. Click **Confirm** to complete the extraction.
 
-:-:
 
 ### 5.3 Using Shared Data
 
@@ -21,8 +21,7 @@ After clicking "Chart" to visualize the worksheet results, you need to select th
 
 Hover over the chart to view detailed information for each data point. For example, you can view results as a line chart:
 
-
-
+^
 ^
 
 In the "Settings" panel on the right side of the visualization area, configure what to display:
@@ -35,7 +34,7 @@ In the "Settings" panel on the right side of the visualization area, configure w
 
 3. Y-axis field:
 
-
+^
 
 The Y-axis supports aggregate functions to derive a single value from multiple data points. The available aggregation methods are:
 
@@ -59,8 +58,7 @@ When there are many X-axis values and you need to view a specific data point on
 
 For example, in the chart below, the accurate value is shown only when hovering over the visualization and the specific timestamp information appears.
 
-
-
+^
 ^
 
 ## Use Cases
@@ -75,16 +73,7 @@ For example, in the chart below, the accurate value is shown only when hovering
 select order_date, count(*) as c from big_data_table group by order_date;
 ```
 
-
-
 ^
-
-
-
-^
-
-
-
 ^
 
 **Scenario 2**: To **ignore** time span differences and keep only the specific result data points, cast the result to a string (`order_date::string`):
@@ -93,4 +82,4 @@ select order_date, count(*) as c from big_data_table group by order_date;
 select order_date::string, count(*) as c from big_data_table group by order_date;
 ```
 
-
+^
@@ -1,9 +1,11 @@
-## What is Data Agent
+## What is Data Engineering Agent
 
-Data Agent is an AI-powered agent built on top of Singdata Lakehouse and Studio. It covers the full lifecycle of "development, operations, and governance" and implements intelligent platform upgrades through an Agentic AIOps philosophy — transforming data development from "people operating the platform" to "people directing the agent."
+Data Engineering Agent is an AI-powered agent built on top of Singdata Lakehouse and Studio. It covers the full lifecycle of "development, operations, and governance" and implements intelligent platform upgrades through an Agentic AIOps philosophy — transforming data development from "people operating the platform" to "people directing the agent."
 
 Data Agent is not just a tool that makes data teams more productive. It is a **data intelligence collaboration system** that enables everyone in the company to work with data.
 
+^
+
 ## User Value
 
 * **Higher productivity: reclaim 80% of your time for what truly matters**
@@ -42,18 +44,18 @@ For example:
 **High cost of understanding standards** Each business domain has its own layering rules, naming conventions, and field standards, scattered across various documents. Engineers must "catch up" before taking on any new requirement, and even minor oversights get flagged in reviews, keeping rework costs high.
 
 > Example prompt:
-> I need to design a Medallion architecture data warehouse based on this metric requirements spec to support GMV analysis. I've already planned the tables for each layer: [Bronze layer] xxx [Silver layer] xxx [Gold layer] xxx. Based on this table list, please generate a data warehouse modeling standards document.
+> I need to design a Medallion architecture data warehouse based on this metric requirements spec to support GMV analysis. I've already planned the tables for each layer: \[Bronze layer] xxx \[Silver layer] xxx \[Gold layer] xxx. Based on this table list, please generate a data warehouse modeling standards document.
 
-
+^
 
 ### Scenario 2: Ad-hoc Data Retrieval
 
 **Everything waits in the queue** Exploratory analysis, market research, and other ad-hoc requests are naturally lower priority and get perpetually pushed aside by formal requests. By the time the data finally arrives, the decision window has often closed and the business has already fallen behind the market. The core problem: ad-hoc analysis has no self-service path — it must go through the data team, which simply doesn't have the bandwidth to continuously handle low-priority requests.
 
 > Example prompt:
-> Query brazilianecommerce.
+> Query brazilianecommerce.olist\_orders and count orders by day.
 
-
+^
 
 ### Scenario 3: Day-to-day Operations
 
@@ -77,4 +79,34 @@ Daily task operations are the most critical routine work on an enterprise data p
 > Please help me analyze which instances failed in the past week.
 > For the task with instance ID xxx, what was the failure reason and which downstream tasks were affected?
 
-
+^
+
+### Scenario 4: Studio Task Development and Management
+
+Studio tasks are the core scheduling unit of the Lakehouse data pipeline, but traditional management is cumbersome: creating a task requires entering the IDE and configuring step by step, modifying a schedule requires locating the task and opening the schedule panel, and dependency configuration requires manually maintaining task IDs.
+
+Data Agent operates the full lifecycle of Studio tasks directly through natural language:
+
+* **Create tasks**: describe the task logic, and the agent automatically generates SQL or Python code, creates the task, and configures the Virtual Cluster
+* **Schedule configuration**: tell the agent "run at 2 AM every day" and it automatically converts this to a cron expression and applies it
+* **Dependency orchestration**: describe the dependencies between tasks, and the agent automatically configures upstream and downstream dependency chains, avoiding manual task ID lookups
+* **Batch operations**: publish multiple tasks or bulk-update the retry strategy for a category of tasks in a single instruction
+
+> Example prompt:
+> Create a Python task that runs at 3 AM every day, triggers after the ods\_order\_load task completes, and aggregates yesterday's order data into dws\_order\_daily.
+
+### Scenario 5: Data Source Management
+
+Onboarding enterprise data is the first mile of data engineering, involving connection configuration, sync strategy, status monitoring, and more — manual operations are error-prone and hard to track.
+
+Data Agent supports the following data source management operations:
+
+* **Quick onboarding**: describe the database type and connection details, and the agent automatically creates the data source and tests connectivity
+* **Sync configuration**: specify the source and target tables, and the agent selects a full or incremental (CDC) sync strategy based on business needs
+* **Status queries**: ask in a single sentence to get sync delays, recent failure records, and data volume statistics for all data sources
+* **Troubleshooting**: when a sync task fails, the agent automatically pulls error logs, analyzes the root cause, and recommends fix steps
+
+> Example prompt:
+> Help me check which data sources currently have sync delays exceeding 30 minutes, and which ones had sync failures in the last 24 hours.
+
+^
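Reviewer note on the hunk above: the "Schedule configuration" bullet added in Scenario 4 says the agent turns a phrase such as "run at 2 AM every day" into a cron expression. A minimal sketch of that mapping follows. It assumes the standard five-field cron layout (minute, hour, day of month, month, day of week); the diff does not say which cron dialect Studio uses, and `daily_cron` is a hypothetical helper written for illustration, not part of cz-cli or Studio.

```python
def daily_cron(hour: int, minute: int = 0) -> str:
    """Build a five-field cron expression that fires once a day.

    Hypothetical helper for illustration only. Field order is assumed to be
    the standard one: minute, hour, day of month, month, day of week.
    """
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    if not 0 <= minute <= 59:
        raise ValueError(f"minute must be in 0..59, got {minute}")
    # "*" in the last three fields means every day of every month.
    return f"{minute} {hour} * * *"


if __name__ == "__main__":
    # "run at 2 AM every day", the phrase used in the Scenario 4 bullet
    print(daily_cron(2))
    # "runs at 3 AM every day", the phrase used in the Scenario 4 example prompt
    print(daily_cron(3))
```

If Studio expects a six-field (seconds-first) dialect instead, the string layout would differ, so treat the output as a shape to check against the Studio schedule panel rather than a value to paste in.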
@@ -0,0 +1,168 @@
|
|
|
1
|
+
# Databricks Delta Tables → Lakehouse Migration Guide
|
|
2
|
+
|
|
3
|
+
Data on Databricks can almost entirely be migrated to Singdata Lakehouse. There are two complementary paths: **federated direct read** (data stays in place, query Databricks tables directly from Lakehouse) and **Studio built-in sync** (move various table types into Lakehouse native format). All 7 table types have been tested and pass, with row counts and field values fully consistent.
|
|
4
|
+
|
|
5
|
+
Full code on GitHub: [databricks2lakehouse-delta](https://github.com/clickzetta/databricks2lakehouse-delta)
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## Conclusion First
|
|
10
|
+
|
|
11
|
+
**Two paths** — selection rule: if federated read is available (External Delta), use federation + CTAS; for everything else, use Studio built-in sync.
|
|
12
|
+
|
|
13
|
+
| Table Type | Federated Read | Studio Sync | Recommended Path |
|
|
14
|
+
|---|:---:|:---:|---|
|
|
15
|
+
| **External Delta** | ✅ Tested | ✅ Tested | Federated read (no data movement) or CTAS landing |
|
|
16
|
+
| **Managed Delta** | ❌ | ✅ Tested | Studio built-in sync |
|
|
17
|
+
| **Parquet External** | ❌ | ✅ Tested | Studio built-in sync |
|
|
18
|
+
| **CSV External** | ❌ | ✅ Tested | Studio built-in sync |
|
|
19
|
+
| **JSON External** | ❌ | ✅ Tested | Studio built-in sync |
|
|
20
|
+
| **Managed Iceberg** | ❌ | ✅ Tested | Studio built-in sync |
|
|
21
|
+
| **View** | ❌ | ✅ Tested (materialized as table) | Studio built-in sync |
|
|
22
|
+
|
|
23
|
+
---
|
|
24
|
+
|
|
25
|
+
## Technical Background
|
|
26
|
+
|
|
27
|
+
Databricks uses Unity Catalog three-level naming (`catalog.schema.table`), with data stored in two categories:
|
|
28
|
+
- **External tables**: Data files reside in the customer's own S3/ADLS; Databricks only manages metadata
|
|
29
|
+
- **Managed tables**: Data files reside in Databricks managed storage, inaccessible externally
|
|
30
|
+
|
|
31
|
+
This difference determines the scope of federated reads — only External Delta tables can be directly queried via Lakehouse External Catalog federation. Managed tables must be migrated via Studio sync jobs.
|
|
32
|
+
|
|
33
|
+
---
|
|
34
|
+
|
|
35
|
+

|
|
36
|
+
|
|
37
|
+
---
|
|
38
|
+
|
|
39
|
+
## Path A: Federated Direct Read (External Delta Only)
|
|
40
|
+
|
|
41
|
+
Federated read queries Databricks tables directly via Lakehouse External Catalog with **no data movement**, suitable for parallel access during transition or PoC validation.
|
|
42
|
+
|
|
43
|
+
### Configuration (one-time)
|
|
44
|
+
|
|
45
|
+
```sql
|
|
46
|
+
-- Step 1: Create a Catalog Connection to Databricks
|
|
47
|
+
CREATE CATALOG CONNECTION IF NOT EXISTS databricks_conn
|
|
48
|
+
TYPE = DATABRICKS_UNITY_CATALOG
|
|
49
|
+
PROPERTIES (
|
|
50
|
+
'host' = 'https://dbc-xxxx.cloud.databricks.com',
|
|
51
|
+
'catalog' = 'workspace'
|
|
52
|
+
-- Authentication parameters configured via Studio UI
|
|
53
|
+
);
|
|
54
|
+
|
|
55
|
+
-- Step 2: Create External Catalog (federation entry point)
|
|
56
|
+
CREATE EXTERNAL CATALOG IF NOT EXISTS databricks_new_catalog
|
|
57
|
+
USING CONNECTION databricks_conn;
|
|
58
|
+
```
|
|
59
|
+
|
|
60
|
+
### Federated Query
|
|
61
|
+
|
|
62
|
+
```sql
|
|
63
|
+
-- Query Databricks tables directly (cross-region/cross-cloud supported, via public internet)
|
|
64
|
+
SELECT * FROM databricks_new_catalog.table_types_demo.customers_external;
|
|
65
|
+
SELECT COUNT(*) FROM databricks_new_catalog.table_types_demo.orders_external;
|
|
66
|
+
|
|
67
|
+
-- CTAS: land the federated table as a Lakehouse native table (optional)
|
|
68
|
+
CREATE TABLE delta_migration.customers AS
|
|
69
|
+
SELECT * FROM databricks_new_catalog.table_types_demo.customers_external;
|
|
70
|
+
```
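
When several External Delta tables need to be landed, the CTAS statement can be generated instead of typed by hand. The sketch below only builds SQL strings. The catalog and schema names come from the example above, while `ctas_statement`, the `TABLES` mapping, and the target table names are illustrative assumptions.

```python
SOURCE_CATALOG = "databricks_new_catalog"  # External Catalog from the configuration step
SOURCE_SCHEMA = "table_types_demo"
TARGET_SCHEMA = "delta_migration"


def ctas_statement(source_table: str, target_table: str) -> str:
    """Build one CTAS statement that lands a federated table as a native table."""
    return (
        f"CREATE TABLE {TARGET_SCHEMA}.{target_table} AS\n"
        f"SELECT * FROM {SOURCE_CATALOG}.{SOURCE_SCHEMA}.{source_table};"
    )


# Source table -> target table. The target names are illustrative choices.
TABLES = {
    "customers_external": "customers",
    "orders_external": "orders",
    "inventory_delta": "inventory",
}

if __name__ == "__main__":
    for src, dst in TABLES.items():
        print(ctas_statement(src, dst), end="\n\n")
```

The table names are interpolated without quoting, so this assumes they come from a trusted list rather than user input.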
|
|
71
|
+
|
|
72
|
+
### Federated Read Limitations
|
|
73
|
+
|
|
74
|
+
Federated read **only supports External Delta format**. Test results:
|
|
75
|
+
|
|
76
|
+
| Table Tested | Format | Result |
|
|
77
|
+
|---|---|---|
|
|
78
|
+
| `customers_external` | External Delta | ✅ 7 rows, schema/types correct |
|
|
79
|
+
| `orders_external` | External Delta | ✅ 8 rows, aggregation correct |
|
|
80
|
+
| `inventory_delta` | External Delta | ✅ 4 rows |
|
|
81
|
+
| `customers_managed` | Managed Delta | ❌ Data in Databricks managed storage, no external access |
|
|
82
|
+
| `products_parquet` | External Parquet | ❌ `unsupported databricks table format [PARQUET]` |
|
|
83
|
+
| `suppliers_csv` | External CSV | ❌ `unsupported databricks table format [CSV]` |
|
|
84
|
+
| `shipments_iceberg` | Managed Iceberg | ❌ S3 400 (data in Databricks managed bucket) |
|
|
85
|
+
| `customer_orders_view` | View | ❌ Not supported |
|
|
86
|
+
|
|
87
|
+
---
|
|
88
|
+
|
|
89
|
+
## Path B: Studio Built-in Sync (Handles All Table Types)
|
|
90
|
+
|
|
91
|
+
Studio has a built-in Databricks data source with visual sync task configuration to move all table types into Lakehouse native format.
|
|
92
|
+
|
|
93
|
+
### Configuration Steps
|
|
94
|
+
|
|
95
|
+
1. Studio UI → Data Integration → New Data Source → Select Databricks
|
|
96
|
+
2. Enter Databricks Workspace URL + authentication info (Service Principal)
|
|
97
|
+
3. New sync task → Select source tables → Configure target schema → Execute
|
|
98
|
+
|
|
99
|
+
### Test Results (All 7 Table Types SUCCEED)
|
|
100
|
+
|
|
101
|
+
Sync tasks were run against all table types in the `table_types_demo` schema in a real Databricks environment (AWS us-east-1, Unity Catalog). All succeeded:
|
|
102
|
+
|
|
103
|
+
| Source Table | Type | Result | Duration | Source Rows | Target Rows | Consistent |
|
|
104
|
+
|---|---|:---:|:---:|:---:|:---:|:---:|
|
|
105
|
+
| `customers_managed` | Managed Delta | ✅ | ~88s | 7 | 7 | ✅ |
|
|
106
|
+
| `customers_external` | External Delta | ✅ | ~84s | 7 | 7 | ✅ |
|
|
107
|
+
| `products_parquet` | Parquet | ✅ | ~87s | 5 | 5 | ✅ |
|
|
108
|
+
| `suppliers_csv` | CSV | ✅ | ~93s | 3 | 3 | ✅ |
|
|
109
|
+
| `product_reviews_json` | JSON | ✅ | ~109s | 3 | 3 | ✅ |
|
|
110
|
+
| `shipments_iceberg` | Managed Iceberg | ✅ | ~92s | 5 | 5 | ✅ |
|
|
111
|
+
| `customer_orders_view` | View | ✅ | ~86s | 8 | 8 | ✅ |
|
|
112
|
+
|
|
113
|
+
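> Single task duration is 84–109 seconds (including sync cluster cold start). For these small test tables (3–8 rows) the duration is dominated by startup overhead rather than data volume, so for tables of similar size total migration time can be estimated linearly from the table count.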
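
The linear estimate can be sketched as simple arithmetic. This assumes tasks run one after another and that every table behaves like the small test tables above; `estimate_total_seconds` is a hypothetical helper, and larger tables may well take longer than the measured range.

```python
from typing import Tuple


def estimate_total_seconds(table_count: int,
                           per_task_seconds: Tuple[int, int] = (84, 109)
                           ) -> Tuple[int, int]:
    """Return a (low, high) estimate in seconds for syncing table_count tables.

    per_task_seconds defaults to the 84-109 s range measured in the test above.
    """
    if table_count < 0:
        raise ValueError("table_count must be non-negative")
    low, high = per_task_seconds
    return table_count * low, table_count * high


if __name__ == "__main__":
    low, high = estimate_total_seconds(200)
    print(f"200 tables: {low / 3600:.1f} h to {high / 3600:.1f} h if run sequentially")
```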
|
|
114
|
+
|
|
115
|
+
### Data Consistency Verification
|
|
116
|
+
|
|
117
|
+
**Field-level row-by-row comparison** was performed on 5 tables (Databricks SDK reads source, cz-cli reads target). Findings:
|
|
118
|
+
|
|
119
|
+
- `products_parquet`, `product_reviews_json`, `shipments_iceberg`: **All fields fully consistent**
|
|
120
|
+
- `customers_external`, `suppliers_csv`: all other fields are consistent; the email/contact_email fields show masked values on the Lakehouse side (e.g., `a***@example.com`)
|
|
121
|
+
|
|
122
|
+
> Masking is the Lakehouse workspace's **read-time column masking policy** automatically matching by column name. The data itself is migrated intact. To see plaintext values, use a role with appropriate permissions or query in a schema without a masking policy bound.
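
A field-level comparison that tolerates masked values could look like the sketch below. It is not the comparison script used for the test. `compare_rows`, `is_masked_form_of`, and the sample rows are hypothetical, and the masking pattern is inferred only from the `a***@example.com` example above.

```python
import re

# Matches the masked form shown above, e.g. "a***@example.com".
MASKED_EMAIL = re.compile(r"^(?P<first>.)\*{3}@(?P<domain>.+)$")


def is_masked_form_of(source_value: str, target_value: str) -> bool:
    """True if target_value looks like source_value after email masking."""
    m = MASKED_EMAIL.match(target_value)
    if m is None or "@" not in source_value:
        return False
    local, domain = source_value.split("@", 1)
    return local[:1] == m["first"] and domain == m["domain"]


def compare_rows(source_rows, target_rows, key):
    """Compare source rows against target rows field by field.

    Returns (mismatches, masked): lists of (key value, column) pairs.
    Rows that exist only in the target are not reported.
    """
    target_by_key = {row[key]: row for row in target_rows}
    mismatches, masked = [], []
    for src in source_rows:
        tgt = target_by_key.get(src[key])
        if tgt is None:
            mismatches.append((src[key], "<missing row>"))
            continue
        for column, value in src.items():
            other = tgt.get(column)
            if other == value:
                continue
            if (isinstance(value, str) and isinstance(other, str)
                    and is_masked_form_of(value, other)):
                masked.append((src[key], column))
            else:
                mismatches.append((src[key], column))
    return mismatches, masked


if __name__ == "__main__":
    source = [{"customer_id": 1, "name": "Alice", "email": "alice@example.com"}]
    target = [{"customer_id": 1, "name": "Alice", "email": "a***@example.com"}]
    print(compare_rows(source, target, key="customer_id"))
```

Reporting masked fields separately from real mismatches keeps a read-time policy from being mistaken for data loss.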
|
|
123
|
+
|
|
124
|
+
---
|
|
125
|
+
|
|
126
|
+
## Delta-Specific Feature Handling
|
|
127
|
+
|
|
128
|
+
| Feature | Migration Impact |
|
|
129
|
+
|---|---|
|
|
130
|
+
| **Deletion Vectors (DV)** | ✅ Federated read correctly recognizes DVs; deleted rows are not returned. Data state is consistent after sync to Lakehouse |
|
|
131
|
+
| **Change Data Feed (CDF)** | Current data syncs normally. CDF incremental consumption interface (`table_changes()`) has no equivalent in Lakehouse → rebuild incremental pipelines using Lakehouse Table Stream |
|
|
132
|
+
| **Liquid Clustering** | Does not affect data content. After migration to Lakehouse, rebuild layout optimization using Lakehouse `CLUSTER BY` or indexing mechanisms |
|
|
133
|
+
|
|
134
|
+
---
|
|
135
|
+
|
|
136
|
+
## Connectivity Notes
|
|
137
|
+
|
|
138
|
+
- **Cross-region/cross-cloud supported**: Both federated read and Studio sync use the public internet. Tested: Lakehouse AWS Singapore ↔ Databricks us-east-1 cross-region, cross-cloud connectivity confirmed
|
|
139
|
+
- **Only hard limitation**: `COPY INTO`/`Pipe` object storage connections cannot cross cloud providers (e.g., an Alibaba Cloud instance cannot connect to AWS S3), but this affects direct S3 file reads, not reading tables via Databricks External Catalog
|
|
140
|
+
|
|
141
|
+
---
|
|
142
|
+
|
|
143
|
+
## Notes
|
|
144
|
+
|
|
145
|
+
- **Managed tables cannot be federated**: Managed Delta/Iceberg data resides in Databricks managed storage with no external access — must use Studio sync
|
|
146
|
+
- **Parquet/CSV/JSON External tables cannot be federated**: External Catalog currently only supports Delta format; use Studio sync for other formats
|
|
147
|
+
- **Array columns not yet supported for sync**: Columns with `ARRAY<...>` types are not yet supported by Studio sync tasks. Convert them to STRING using `to_json()` on the source side before syncing
|
|
148
|
+
- **Configure sync tasks table by table**: Studio's built-in Databricks data source does not yet support bulk schema-wide sync; configure each table individually. Scripts can be used to batch-generate configurations
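
For the `ARRAY<...>` limitation, one possible source-side workaround is a Databricks view that wraps array columns in `to_json()` and is then synced like any other view. The sketch below only generates that SQL. `array_safe_view_sql`, the table name, and the column names are hypothetical, and whether a view is the right vehicle depends on your environment.

```python
def array_safe_view_sql(source_table: str, view_name: str, columns: dict) -> str:
    """Build Databricks SQL for a view that serializes ARRAY columns to JSON strings.

    columns maps column name -> Databricks type string, e.g. "ARRAY<STRING>".
    """
    select_items = []
    for name, dtype in columns.items():
        if dtype.strip().upper().startswith("ARRAY<"):
            select_items.append(f"to_json({name}) AS {name}")
        else:
            select_items.append(name)
    select_list = ",\n  ".join(select_items)
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n  {select_list}\nFROM {source_table};"
    )


if __name__ == "__main__":
    # Hypothetical table with one array column.
    print(array_safe_view_sql(
        "workspace.table_types_demo.products_tagged",
        "workspace.table_types_demo.products_tagged_sync",
        {"product_id": "BIGINT", "tags": "ARRAY<STRING>"},
    ))
```

The same loop-over-tables pattern can be reused to batch-generate per-table inputs, since sync tasks currently have to be configured one table at a time.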
|
|
149
|
+
|
|
150
|
+
## Related Documentation
|
|
151
|
+
|
|
152
|
+
### Federated Query
|
|
153
|
+
|
|
154
|
+
- [External Catalog Overview](external-catalog-concept.md): External Catalog federated query principles
|
|
155
|
+
- [Federated Query Guide](SQL_External_Catalog_Guide.md): SQL syntax and usage examples
|
|
156
|
+
|
|
157
|
+
### Data Ingestion
|
|
158
|
+
|
|
159
|
+
- [Data Ingestion Overview](streaming_data_pipeline_overview.md): Full ingestion solution landscape
|
|
160
|
+
- [COPY INTO](copy-into.md): Bulk load from object storage
|
|
161
|
+
- [Pipe (Continuous Ingestion)](pipe-overview.md): Continuously monitor object storage for new files
|
|
162
|
+
|
|
163
|
+
### Other Migration Guides
|
|
164
|
+
|
|
165
|
+
- [Databricks Notebook → Lakehouse Migration Guide](databricks-notebook-to-studio-migration.md)
|
|
166
|
+
- [Databricks DLT → Lakehouse Migration Guide](databricks-dlt-to-lakehouse-migration.md)
|
|
167
|
+
- [Databricks Jobs → Lakehouse Studio Migration Guide](databricks-jobs-to-studio-migration.md)
|
|
168
|
+
- [Databricks Unity Catalog → Lakehouse Migration Guide](databricks-uc-governance-to-lakehouse-migration.md)
|
|
@@ -0,0 +1,331 @@
|
|
|
1
|
+
# Databricks DLT → Lakehouse Migration Guide: Apparel Retail Streaming Pipeline
|
|
2
|
+
|
|
3
|
+
If your data pipeline runs on Databricks Delta Live Tables, the core migration effort to Singdata Lakehouse is lower than you might expect — **most PySpark DataFrame code can be reused directly**. DLT uses decorators (`@dlt.table`, `@dlt.expect_or_drop`) to wrap Python functions into pipeline nodes. ZettaPark removes the decorators and replaces them with `df.write.saveAsTable()` — business logic is unchanged, line for line.
|
|
4
|
+
|
|
5
|
+
This article validates this with a real project: a Databricks DLT-based apparel retail streaming pipeline (Bronze ingestion → Silver SCD2 cleansing → Gold aggregation analytics) fully migrated to Singdata Lakehouse, passing all 16 automated validations of the core pipeline plus 4 Dynamic Table checks (20 in total).
|
|
6
|
+
|
|
7
|
+
Full code on GitHub: [databricks2lakehouse-dlt-apparel](https://github.com/clickzetta/databricks2lakehouse-dlt-apparel)
|
|
8
|
+
|
|
9
|
+
---
|
|
10
|
+
|
|
11
|
+
## Source Project
|
|
12
|
+
|
|
13
|
+
[databricks2lakehouse-dlt-apparel](https://github.com/clickzetta/databricks2lakehouse-dlt-apparel) is forked from [jrlasak/databricks_apparel_streaming](https://github.com/jrlasak/databricks_apparel_streaming) (⭐45). The original tech stack is Databricks DLT + Auto Loader + Delta Lake. The project implements a complete data pipeline for an apparel retailer across 4 data domains (customers, products, stores, transactions), covering streaming ingestion, SCD Type 2 history tracking, data quality constraints, and Gold layer aggregation analytics.
|
|
14
|
+
|
|
15
|
+
Migrated code is in the `03_lakehouse/` directory, with three available approaches that can be compared file-by-file with `01_source/dlt/`.
|
|
16
|
+
|
|
17
|
+
## Conclusion First
|
|
18
|
+
|
|
19
|
+
**Option C (Dynamic Table) is the closest equivalent for DLT pipelines**: declarative definition plus automatic scheduled refresh, consistent with DLT's "define and execute" philosophy. If the existing team is comfortable with PySpark DataFrame, Option A (ZettaPark) requires only the mechanical substitutions in the change list below, with all business logic preserved.
|
|
20
|
+
|
|
21
|
+
| Option | Files | DLT Equivalent | Characteristics |
|
|
22
|
+
|------|------|---------|------|
|
|
23
|
+
| **A. ZettaPark Python** | `03_lakehouse/*.py` | `@dlt.table` → `df.write.saveAsTable()` | Minimal code changes, retains PySpark skills |
|
|
24
|
+
| **B. Pure SQL** | `03_lakehouse/sql/` | SQL equivalent implementation | SQL-first teams |
|
|
25
|
+
| **C. Dynamic Table (GIC)** | `03_lakehouse/dynamic_tables/` | **Native equivalent of @dlt.table** | Declarative + auto-refresh, closest to DLT semantics |
|
|
26
|
+
|
|
27
|
+
**Option A change list** (ZettaPark main path):
|
|
28
|
+
|
|
29
|
+
| Change | Effort | Notes |
|
|
30
|
+
|--------|--------|------|
|
|
31
|
+
| `import dlt` / `from pyspark...` | Very low | `from clickzetta.zettapark import functions as F`, replace package name |
|
|
32
|
+
| `spark` global | Very low | `session = Session.builder.configs({}).create()` |
|
|
33
|
+
| `@dlt.table(name=X)` | Very low | Add `df.write.mode("overwrite").saveAsTable(X)` as last line of function |
|
|
34
|
+
| `@dlt.expect_or_drop("msg","cond")` | Very low | `df.filter(F.col(...)...)`, semantically equivalent |
|
|
35
|
+
| `dlt.read_stream("LIVE.X")` / `dlt.read("LIVE.X")` | Very low | `session.table("X")`, same as PySpark |
|
|
36
|
+
| `dlt.create_auto_cdc_flow(stored_as_scd_type=2)` | Low | `F.lead().over(Window.partitionBy(key).orderBy(seq))`, ZettaPark Window is identical to PySpark |
|
|
37
|
+
| `F.window("event_time","1 day")` | Very low | `F.to_date(F.col("event_time"))`, ZettaPark has no F.window |
|
|
38
|
+
| Auto Loader (`cloudFiles`) | Low | `session.read.csv("vol://...")` batch; use Pipe for streaming in production |
|
|
39
|
+
|
|
40
|
+
JOIN logic, aggregation functions (`F.sum/count/avg`), `F.when/coalesce/lead/lag`, `Window` — all require no changes, fully consistent with PySpark API.
|
|
41
|
+
|
|
42
|
+
---
|
|
43
|
+
|
|
44
|
+
## Tech Stack Comparison
|
|
45
|
+
|
|
46
|
+
| | Databricks DLT | ZettaPark (after migration) |
|
|
47
|
+
|---|---|---|
|
|
48
|
+
| Pipeline definition | Python decorator `@dlt.table` | `df.write.mode("overwrite").saveAsTable(X)` |
|
|
49
|
+
| Data quality constraints | `@dlt.expect_or_drop("msg","cond")` | `df.filter(condition)` (same semantics) |
|
|
50
|
+
| Streaming read | `dlt.read_stream("LIVE.X")` | `session.table("X")` (DT auto-incremental) |
|
|
51
|
+
| Static read | `dlt.read("LIVE.X")` | `session.table("X")` |
|
|
52
|
+
| SCD Type 2 | `dlt.create_auto_cdc_flow(stored_as_scd_type=2)` | `F.lead().over(Window.partitionBy(key).orderBy(seq))` |
|
|
53
|
+
| Time window aggregation | `F.window("event_time","1 day")` | `F.to_date(F.col("event_time"))` |
|
|
54
|
+
| File ingestion | Auto Loader (`cloudFiles`) | `session.read.csv("vol://...")` / Pipe (streaming) |
|
|
55
|
+
| DataFrame API | PySpark | **Fully consistent** (F.col/when/lead/Window etc.) |
|
|
56
|
+
|
|
57
|
+
---
|
|
58
|
+
|
|
59
|
+

|
|
60
|
+
|
|
61
|
+
---
|
|
62
|
+
|
|
63
|
+
## Project Background
|
|
64
|
+
|
|
65
|
+
Apparel retailer with 4 data streams + Bronze/Silver/Gold three-layer architecture:
|
|
66
|
+
|
|
67
|
+
| Data Domain | Raw Table | Row Count | Notes |
|
|
68
|
+
|--------|---------|------|------|
|
|
69
|
+
| Customers | `raw_customers` | 150 | Includes SCD2 update records (30 historical) |
|
|
70
|
+
| Products | `raw_products` | 50 | Includes category/brand/size/color |
|
|
71
|
+
| Stores | `raw_stores` | 5 | 5 stores |
|
|
72
|
+
| Transactions | `raw_sales` | 500 | Includes discount, tax, payment method |
|
|
73
|
+
|
|
74
|
+
DLT pipeline file structure:
|
|
75
|
+
|
|
76
|
+
```
|
|
77
|
+
01_bronze.py — Auto Loader streaming ingestion → 4 Bronze tables
|
|
78
|
+
02A_silver.py — @dlt.view + @dlt.expect_or_drop (data quality filtering)
|
|
79
|
+
02B_silver.py — dlt.create_auto_cdc_flow (SCD Type 2 dimension tables)
|
|
80
|
+
02C_silver.py — Sales transaction cleansing
|
|
81
|
+
03_gold.py — 4 Gold aggregation tables
|
|
82
|
+
```
|
|
83
|
+
|
|
84
|
+
---
|
|
85
|
+
|
|
86
|
+
## Migration Steps
|
|
87
|
+
|
|
88
|
+
### Step 1: `@dlt.table` / Auto Loader → ZettaPark
|
|
89
|
+
|
|
90
|
+
DLT uses decorators to declare tables + `spark.readStream` for streaming ingestion. ZettaPark uses `session.read.csv("vol://...")` to load files, with `df.write.saveAsTable()` replacing decorators:
|
|
91
|
+
|
|
92
|
+
```python
|
|
93
|
+
# Databricks DLT (01_source/dlt/01_bronze.py)
|
|
94
|
+
@dlt.table(name="bronze_sales")
|
|
95
|
+
def bronze_sales():
|
|
96
|
+
return spark.readStream.format("cloudFiles").option("cloudFiles.format","csv").load(VOL_PATH)
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
```python
|
|
100
|
+
# ZettaPark (03_lakehouse/01_bronze.py) — minimal rewrite
|
|
101
|
+
from clickzetta.zettapark import Session # ← import pyspark → clickzetta.zettapark
|
|
102
|
+
session = Session.builder.configs({...}).create() # ← spark global injection → explicit creation
|
|
103
|
+
|
|
104
|
+
df = session.read.option("header","true").csv("vol://apparel_bronze.raw_data/sales.csv")
|
|
105
|
+
# readStream.format("cloudFiles")... → session.read.csv("vol://...")
|
|
106
|
+
df.write.mode("overwrite").saveAsTable("apparel_bronze.raw_sales")
|
|
107
|
+
# @dlt.table(name=X) → df.write.saveAsTable(X) (last line of function body)
|
|
108
|
+
```
|
|
109
|
+
|
|
110
|
+
> 💡 **Note**: Auto Loader continuously monitors new files (streaming). The Lakehouse equivalent is **Pipe** (automatically ingests after configuration). Use Pipe instead of COPY INTO in production environments.
|
|
111
|
+
|
|
112
|
+
### Step 2: `@dlt.expect_or_drop` → `df.filter()`
|
|
113
|
+
|
|
114
|
+
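`@dlt.expect_or_drop` constraints translate directly to ZettaPark `.filter()` with the same semantics. A warn-level `@dlt.expect` only records violations, so rewriting it as `.filter()` (as done below for `realistic_age`) upgrades it to drop semantics: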
|
|
115
|
+
|
|
116
|
+
```python
|
|
117
|
+
# Databricks DLT (02A_silver.py)
|
|
118
|
+
@dlt.view(name="customers_cleaned_stream")
|
|
119
|
+
@dlt.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")
|
|
120
|
+
@dlt.expect("realistic_age", "age >= 18 AND age <= 100")
|
|
121
|
+
def customers_cleaned_stream():
|
|
122
|
+
return dlt.read_stream(f"LIVE.{BRONZE_CUSTOMERS}")
|
|
123
|
+
```
|
|
124
|
+
|
|
125
|
+
```python
|
|
126
|
+
# ZettaPark (03_lakehouse/02B_silver.py)
|
|
127
|
+
customers = session.table("apparel_bronze.raw_customers") # dlt.read_stream → session.table()
|
|
128
|
+
customers = (customers
|
|
129
|
+
.filter(F.col("customer_id").isNotNull()) # @dlt.expect_or_drop("valid_customer_id",...)
|
|
130
|
+
    .filter((F.col("age") >= 18) & (F.col("age") <= 100))  # @dlt.expect("realistic_age",...): filter() drops these rows, whereas expect only warns
|
|
131
|
+
)
|
|
132
|
+
```
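
The difference between drop-level and warn-level expectations can be modelled without any engine. The sketch below is a conceptual model only, not how DLT or ZettaPark implement it. `apply_expectations` and the sample rows are hypothetical.

```python
def apply_expectations(rows, drop_rules, warn_rules):
    """Apply drop-level and warn-level rules to a list of dict rows.

    drop_rules / warn_rules map a rule name to a predicate(row) -> bool.
    Returns (kept_rows, violation_counts).
    """
    violations = {name: 0 for name in (*drop_rules, *warn_rules)}
    kept = []
    for row in rows:
        dropped = False
        for name, is_valid in drop_rules.items():
            if not is_valid(row):
                violations[name] += 1
                dropped = True           # expect_or_drop: row is removed
        for name, is_valid in warn_rules.items():
            if not is_valid(row):
                violations[name] += 1    # expect: counted, but the row is kept
        if not dropped:
            kept.append(row)
    return kept, violations


if __name__ == "__main__":
    rows = [
        {"customer_id": 1, "age": 34},
        {"customer_id": None, "age": 40},   # fails the drop rule
        {"customer_id": 3, "age": 130},     # fails the warn rule only
    ]
    kept, counts = apply_expectations(
        rows,
        drop_rules={"valid_customer_id": lambda r: r["customer_id"] is not None},
        warn_rules={"realistic_age": lambda r: 18 <= r["age"] <= 100},
    )
    print(len(kept), counts)
```

Moving `realistic_age` from `warn_rules` to `drop_rules` reproduces what the `.filter()` rewrite does: the third row would no longer reach the Silver table.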
|
|
133
|
+
|
|
134
|
+
### Step 3: SCD Type 2 — `dlt.create_auto_cdc_flow` → `LEAD()` Window Function
|
|
135
|
+
|
|
136
|
+
For a full rebuild from the change history, the SCD Type 2 output of `create_auto_cdc_flow(stored_as_scd_type=2)` can be reproduced with a LEAD window function. ZettaPark implements the equivalent logic directly, and the `F.lead()` and `Window` APIs match PySpark:
|
|
137
|
+
|
|
138
|
+
```python
|
|
139
|
+
# Databricks DLT (02B_silver.py)
|
|
140
|
+
dlt.create_auto_cdc_flow(
|
|
141
|
+
target=SILVER_CUSTOMERS,
|
|
142
|
+
source=f"live.{CUSTOMERS_CLEANED_STREAM}",
|
|
143
|
+
keys=["customer_id"],
|
|
144
|
+
sequence_by=F.col("last_update_time"),
|
|
145
|
+
stored_as_scd_type=2,
|
|
146
|
+
)
|
|
147
|
+
```
|
|
148
|
+
|
|
149
|
+
```python
|
|
150
|
+
# ZettaPark (03_lakehouse/02B_silver.py)
|
|
151
|
+
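from clickzetta.zettapark import functions as F
from clickzetta.zettapark.types import TimestampType  # assumed to mirror pyspark.sql.types
from clickzetta.zettapark.window import Window  # ← pyspark.sql.window → clickzetta.zettapark.window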
|
|
152
|
+
|
|
153
|
+
w = Window.partitionBy("customer_id").orderBy("last_update_time")
|
|
154
|
+
# F.lead() and Window API are fully consistent, no changes needed
|
|
155
|
+
silver_customers = customers.withColumn(
|
|
156
|
+
"__end_at", F.lead("last_update_time").over(w) # SCD2 end timestamp
|
|
157
|
+
).withColumn(
|
|
158
|
+
"__is_current", F.lead("last_update_time").over(w).isNull()
|
|
159
|
+
).withColumn(
|
|
160
|
+
"__end_at", F.coalesce(F.col("__end_at"), F.lit("9999-12-31 23:59:59").cast(TimestampType()))
|
|
161
|
+
)
|
|
162
|
+
silver_customers.write.mode("overwrite").saveAsTable("apparel_silver.silver_customers")
|
|
163
|
+
```
|
|
164
|
+
|
|
165
|
+
**Result**: 150 raw records → 150 rows (including 30 historical) → 120 current snapshot records (`__is_current = TRUE`).
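
To see why `LEAD()` is enough for this SCD2 shape, the same logic can be traced on plain Python rows. This is a minimal sketch of the technique, not the ZettaPark code. `scd2_from_changes` and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

OPEN_END = datetime(9999, 12, 31, 23, 59, 59)  # same sentinel as the snippet above


def scd2_from_changes(rows, key, seq):
    """Add __end_at and __is_current to change rows.

    Mirrors LEAD(seq) OVER (PARTITION BY key ORDER BY seq): each version
    ends when the next version of the same key starts.
    """
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[key]].append(row)

    result = []
    for versions in by_key.values():
        versions.sort(key=lambda r: r[seq])
        for current, following in zip(versions, versions[1:] + [None]):
            result.append({
                **current,
                "__end_at": following[seq] if following else OPEN_END,
                "__is_current": following is None,
            })
    return result


if __name__ == "__main__":
    changes = [
        {"customer_id": 1, "city": "Austin", "last_update_time": datetime(2024, 1, 1)},
        {"customer_id": 1, "city": "Dallas", "last_update_time": datetime(2024, 2, 1)},
        {"customer_id": 2, "city": "Boston", "last_update_time": datetime(2024, 1, 15)},
    ]
    for row in scd2_from_changes(changes, "customer_id", "last_update_time"):
        print(row["customer_id"], row["city"], row["__end_at"], row["__is_current"])
```

With 120 distinct keys and 30 extra update rows, this produces 150 rows of which 120 are current, the same shape as the result reported above.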
|
|
166
|
+
|
|
167
|
+
### Step 4: Time Window Aggregation — `F.window()` → `F.to_date()`
|
|
168
|
+
|
|
169
|
+
`F.window("event_time","1 day")` is not implemented in ZettaPark. Use `F.to_date()` instead (functionally equivalent, results consistent):
|
|
170
|
+
|
|
171
|
+
```python
|
|
172
|
+
# Databricks DLT (03_gold.py)
|
|
173
|
+
df.groupBy(F.window("event_time","1 day").alias("sale_window"), "store_id","store_name") \
|
|
174
|
+
.agg(F.round(F.sum("total_amount"),2).alias("total_revenue")) \
|
|
175
|
+
.select(F.col("sale_window.start").cast("date").alias("sale_date"), ...)
|
|
176
|
+
```
|
|
177
|
+
|
|
178
|
+
```python
|
|
179
|
+
# ZettaPark (03_lakehouse/03_gold.py)
|
|
180
|
+
daily_sales = (
    facts.withColumn("sale_date", F.to_date(F.col("event_time")))  # F.window → F.to_date
    .groupBy("sale_date","store_id","store_name")
    .agg(F.round(F.sum("total_amount"),2).alias("total_revenue"),
         F.count("transaction_id").alias("total_transactions"), ...)
)
|
|
184
|
+
```
|
|
185
|
+
|
|
186
|
+
> 💡 `F.to_date()` truncates each timestamp to its calendar day, so it produces the same groups as `F.window("event_time","1 day")` as long as both are evaluated in the same time zone (1-day windows are aligned to midnight UTC).
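
The equivalence between a 1-day tumbling window and date truncation can be checked with the standard library. The sketch assumes windows are aligned to the Unix epoch in UTC, which is Spark's documented default for `window()` without a `startTime` offset. `tumbling_window_start` and `to_date_utc` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_DAY = timedelta(days=1)


def tumbling_window_start(ts: datetime, size: timedelta = ONE_DAY) -> datetime:
    """Start of the epoch-aligned tumbling window that contains ts."""
    return ts - (ts - EPOCH) % size


def to_date_utc(ts: datetime):
    """Calendar date of ts when evaluated in UTC."""
    return ts.astimezone(timezone.utc).date()


if __name__ == "__main__":
    utc_ts = datetime(2024, 3, 5, 17, 42, tzinfo=timezone.utc)
    print(tumbling_window_start(utc_ts).date() == to_date_utc(utc_ts))  # True

    # The same instant viewed from UTC+8 falls on the next local calendar day,
    # which is why the session time zone matters for the comparison.
    local_ts = utc_ts.astimezone(timezone(timedelta(hours=8)))
    print(local_ts.date(), to_date_utc(local_ts))
```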
|
|
187
|
+
|
|
188
|
+
---
|
|
189
|
+
|
|
190
|
+
## Pure SQL Alternative
|
|
191
|
+
|
|
192
|
+
If the team prefers SQL, `03_lakehouse/sql/` provides equivalent SQL scripts:
|
|
193
|
+
|
|
194
|
+
```bash
|
|
195
|
+
cz-cli sql --file 03_lakehouse/sql/02_silver.sql --profile aws_singapore_prod --sync --write
|
|
196
|
+
cz-cli sql --file 03_lakehouse/sql/03_gold.sql --profile aws_singapore_prod --sync --write
|
|
197
|
+
```
|
|
198
|
+
|
|
199
|
+
| DLT Python | SQL Equivalent |
|
|
200
|
+
|---|---|
|
|
201
|
+
| `@dlt.expect_or_drop` | `WHERE condition` |
|
|
202
|
+
| `create_auto_cdc_flow(stored_as_scd_type=2)` | `LEAD() OVER (PARTITION BY key ORDER BY seq)` |
|
|
203
|
+
| `F.window("event_time","1 day")` | `DATE_TRUNC('day', event_time)` |
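
The first two mappings in the table can be tried locally before touching a cluster. In the sketch below SQLite (via Python's `sqlite3` module) stands in for Lakehouse SQL purely as an assumption for illustration. The table contents are made up, and Lakehouse syntax or types may differ in detail.

```python
import sqlite3

SCD2_SQL = """
    SELECT customer_id,
           city,
           last_update_time AS __start_at,
           COALESCE(
               LEAD(last_update_time) OVER (PARTITION BY customer_id ORDER BY last_update_time),
               '9999-12-31 23:59:59') AS __end_at,
           (LEAD(last_update_time) OVER (PARTITION BY customer_id ORDER BY last_update_time))
               IS NULL AS __is_current
    FROM raw_customers
    WHERE customer_id IS NOT NULL          -- expect_or_drop -> WHERE
    ORDER BY customer_id, last_update_time
"""


def scd2_rows():
    """Load a tiny sample table and return its SCD2 view as a list of tuples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE raw_customers (customer_id INTEGER, city TEXT, last_update_time TEXT);
        INSERT INTO raw_customers VALUES
            (1,    'Austin',  '2024-01-01 09:00:00'),
            (1,    'Dallas',  '2024-02-01 09:00:00'),
            (2,    'Boston',  '2024-01-15 09:00:00'),
            (NULL, 'Nowhere', '2024-01-20 09:00:00');
    """)
    try:
        return conn.execute(SCD2_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in scd2_rows():
        print(row)
```

SQLite returns the boolean column as `0`/`1`; the row with a NULL `customer_id` is filtered out before the window function runs.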
|
|
204
|
+
|
|
205
|
+
ZettaPark is the recommended primary path (minimal code changes, retains PySpark skills); SQL is suitable for SQL-first teams or rapid validation.
|
|
206
|
+
|
|
207
|
+
---
|
|
208
|
+
|
|
209
|
+
## Option C: Dynamic Table (GIC) — Native Equivalent of DLT
|
|
210
|
+
|
|
211
|
+
Dynamic Table is the most direct equivalent of DLT's `@dlt.table`: declare the table in SQL and the platform refreshes it automatically. DLT triggers on new data, while a Dynamic Table refreshes on a `REFRESH INTERVAL` schedule; the resulting tables are logically equivalent.
|
|
212
|
+
|
|
213
|
+
```python
|
|
214
|
+
# Databricks DLT — declarative pipeline
|
|
215
|
+
@dlt.table(name="gold_daily_sales_by_store")
|
|
216
|
+
def gold_daily_sales_by_store():
|
|
217
|
+
return df.groupBy(F.window("event_time","1 day"), "store_id","store_name") \
|
|
218
|
+
.agg(F.sum("total_amount").alias("total_revenue"))
|
|
219
|
+
```
|
|
220
|
+
|
|
221
|
+
```sql
|
|
222
|
+
-- Lakehouse Dynamic Table — direct equivalent (03_lakehouse/dynamic_tables/gold_dynamic_tables.sql)
|
|
223
|
+
CREATE OR REPLACE DYNAMIC TABLE apparel_gold.dt_daily_sales_by_store
|
|
224
|
+
REFRESH INTERVAL 10 MINUTE -- DLT triggers on new data; DT triggers on schedule
|
|
225
|
+
VCLUSTER DEFAULT
|
|
226
|
+
AS
|
|
227
|
+
SELECT CAST(event_time AS DATE) AS sale_date, store_id, store_name,
|
|
228
|
+
ROUND(SUM(total_amount), 2) AS total_revenue, ...
|
|
229
|
+
FROM apparel_gold.denormalized_sales_facts
|
|
230
|
+
GROUP BY CAST(event_time AS DATE), store_id, store_name;
|
|
231
|
+
|
|
232
|
+
-- Manual trigger (equivalent to DLT pipeline trigger)
|
|
233
|
+
REFRESH DYNAMIC TABLE apparel_gold.dt_daily_sales_by_store;
|
|
234
|
+
```
|
|
235
|
+
|
|
236
|
+
| DLT | Dynamic Table | Notes |
|
|
237
|
+
|---|---|---|
|
|
238
|
+
| `@dlt.table(name=X)` | `CREATE DYNAMIC TABLE X ... AS SELECT ...` | Direct equivalent |
|
|
239
|
+
| Auto-refresh on new data | `REFRESH INTERVAL N MINUTE` | Different scheduling method, semantically equivalent |
|
|
240
|
+
| `F.window("1 day")` | `CAST(event_time AS DATE)` | DT uses SQL |
|
|
241
|
+
| `create_auto_cdc_flow(stored_as_scd_type=2)` | `LEAD() OVER Window` embedded in DT definition | Standard window function |
|
|
242
|
+
| DLT pipeline DAG | DT dependencies (downstream DT references upstream DT) | Declarative dependencies |
|
|
243
|
+
|
|
244
|
+
Tested in AWS Singapore: 4 Dynamic Tables created and refreshed, all passed e2e validation: `dt_customers_current` (150 rows / 120 current), `dt_daily_sales_by_store` (367), `dt_product_performance` (50), `dt_customer_lifetime_value` (96).
|
|
245
|
+
|
|
246
|
+
## About Pipe (Auto Loader Equivalent)
|
|
247
|
+
|
|
248
|
+
**Pipe** is the Databricks Auto Loader equivalent — continuously monitors object storage (OSS/S3/COS) for new files, automatically triggering COPY INTO ingestion. Tested in AWS Singapore: after uploading a new CSV to S3, Pipe auto-ingested within 15 seconds.
|
|
249
|
+
|
|
250
|
+
```sql
|
|
251
|
+
-- Step 1: Create External Volume (connected to S3/OSS/COS)
|
|
252
|
+
CREATE EXTERNAL VOLUME apparel_bronze.s3_sales_landing
|
|
253
|
+
LOCATION 's3://your-bucket/apparel_landing/'
|
|
254
|
+
USING CONNECTION s3_conn
|
|
255
|
+
DIRECTORY = (enable=true, auto_refresh=true)
|
|
256
|
+
RECURSIVE = true;
|
|
257
|
+
|
|
258
|
+
-- Step 2: Create table (External Volume does not support inferSchema, explicit definition required)
|
|
259
|
+
CREATE TABLE apparel_bronze.raw_sales (
|
|
260
|
+
transaction_id BIGINT, store_id BIGINT, event_time STRING,
|
|
261
|
+
customer_id BIGINT, product_id BIGINT, quantity BIGINT,
|
|
262
|
+
unit_price DOUBLE, total_amount DOUBLE, payment_method STRING,
|
|
263
|
+
discount_applied DOUBLE, tax_amount DOUBLE
|
|
264
|
+
);
|
|
265
|
+
|
|
266
|
+
-- Step 3: Create Pipe
|
|
267
|
+
-- DLT: spark.readStream.format("cloudFiles").option("cloudFiles.format","csv").load(PATH)
|
|
268
|
+
CREATE OR REPLACE PIPE apparel_bronze.pipe_sales
|
|
269
|
+
VIRTUAL_CLUSTER = 'DEFAULT'
|
|
270
|
+
  INGEST_MODE = 'LIST_PURGE'  -- Poll for new files, equivalent to Auto Loader directory listing mode
|
|
271
|
+
AS COPY INTO apparel_bronze.raw_sales
|
|
272
|
+
FROM VOLUME apparel_bronze.s3_sales_landing
|
|
273
|
+
USING CSV OPTIONS ('header'='true')
|
|
274
|
+
PURGE = TRUE -- Delete landing zone files after processing
|
|
275
|
+
ON_ERROR = CONTINUE;
|
|
276
|
+
```
|
|
277
|
+
|
|
278
|
+
> 💡 **Pipe vs COPY INTO**: COPY INTO is a one-time batch load; Pipe continuously monitors and auto-triggers when new files arrive — a direct equivalent to Auto Loader. Internal Volumes only support COPY INTO; Pipe requires an External Volume (OSS/S3/COS).
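
The list-and-purge behaviour can be pictured with a toy in-memory model. This is a conceptual sketch only and says nothing about how Pipe is actually implemented. `ListPurgePipe`, the dict-based landing zone, and the file names are all hypothetical.

```python
class ListPurgePipe:
    """Toy model of a polling ingest loop over a landing zone."""

    def __init__(self, landing_zone: dict, target_table: list):
        self.landing_zone = landing_zone   # file name -> list of rows
        self.target_table = target_table   # rows ingested so far

    def poll_once(self) -> list:
        """Ingest every file currently listed, then purge it.

        Returns the names of the files ingested in this poll.
        """
        ingested = []
        for name in sorted(self.landing_zone):
            self.target_table.extend(self.landing_zone[name])
            ingested.append(name)
        for name in ingested:
            del self.landing_zone[name]    # models PURGE = TRUE
        return ingested


if __name__ == "__main__":
    landing = {"sales_001.csv": [{"transaction_id": 1}]}
    table = []
    pipe = ListPurgePipe(landing, table)

    print(pipe.poll_once())                # first file is picked up
    landing["sales_002.csv"] = [{"transaction_id": 2}]
    print(pipe.poll_once())                # only the new file is picked up
    print(len(table))
```

A single `poll_once()` call corresponds to a one-time COPY INTO; running it repeatedly on a schedule is what makes it behave like continuous ingestion.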
|
|
279
|
+
|
|
280
|
+
---
|
|
281
|
+
|
|
282
|
+
## E2E Validation Results
|
|
283
|
+
|
|
284
|
+
Tested on AWS Singapore instance (`aws_singapore_prod`), **20/20 all passed** (including Dynamic Table validation):
|
|
285
|
+
|
|
286
|
+
| Check | Expected | Result |
|
|
287
|
+
|--------|--------|------|
|
|
288
|
+
| bronze.raw_customers | 150 | ✅ |
|
|
289
|
+
| bronze.raw_products | 50 | ✅ |
|
|
290
|
+
| bronze.raw_stores | 5 | ✅ |
|
|
291
|
+
| bronze.raw_sales | 500 | ✅ |
|
|
292
|
+
| silver_customers (including history) | 150 | ✅ |
|
|
293
|
+
| silver_products | 50 | ✅ |
|
|
294
|
+
| silver_stores | 5 | ✅ |
|
|
295
|
+
| silver_sales_transactions | 500 | ✅ |
|
|
296
|
+
| silver_customers_current | 120 | ✅ |
|
|
297
|
+
| silver_products_current | 50 | ✅ |
|
|
298
|
+
| silver_stores_current | 5 | ✅ |
|
|
299
|
+
| gold.denormalized_sales_facts | 500 | ✅ |
|
|
300
|
+
| gold.gold_product_performance | 50 | ✅ |
|
|
301
|
+
| Total sales amount | 281,490 | ✅ |
|
|
302
|
+
| Customers with purchases | 96 | ✅ |
|
|
303
|
+
| SCD2 historical records | 30 | ✅ |
|
|
304
|
+
| **dt_customers_current** (Dynamic Table) | 150 | ✅ |
|
|
305
|
+
| **dt_daily_sales_by_store** (Dynamic Table) | ≥100 | ✅ 367 |
|
|
306
|
+
| **dt_product_performance** (Dynamic Table) | 50 | ✅ |
|
|
307
|
+
| **dt_customer_lifetime_value** (Dynamic Table) | 96 | ✅ |
|
|
308
|
+
|
|
309
|
+
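A validation run of this kind boils down to comparing measured values with expectations. The sketch below shows one way to structure it and is not the project's actual test script. `run_checks`, `at_least`, and the check keys are hypothetical names; the expected values are copied from the table above, and the `actual` values would come from your own queries.

```python
def at_least(minimum):
    """Build a predicate for lower-bound checks such as '>=100'."""
    return lambda value: value >= minimum


# Expected values copied from the validation table.
CHECKS = {
    "bronze.raw_customers": 150,
    "bronze.raw_products": 50,
    "bronze.raw_stores": 5,
    "bronze.raw_sales": 500,
    "silver_customers": 150,
    "silver_products": 50,
    "silver_stores": 5,
    "silver_sales_transactions": 500,
    "silver_customers_current": 120,
    "silver_products_current": 50,
    "silver_stores_current": 5,
    "gold.denormalized_sales_facts": 500,
    "gold.gold_product_performance": 50,
    "total_sales_amount": 281490,
    "customers_with_purchases": 96,
    "scd2_historical_records": 30,
    "dt_customers_current": 150,
    "dt_daily_sales_by_store": at_least(100),
    "dt_product_performance": 50,
    "dt_customer_lifetime_value": 96,
}


def run_checks(actual: dict, checks: dict = CHECKS):
    """Compare measured values with expectations; return (passed, failed) names."""
    passed, failed = [], []
    for name, expected in checks.items():
        value = actual.get(name)
        if value is None:
            ok = False
        elif callable(expected):
            ok = expected(value)
        else:
            ok = value == expected
        (passed if ok else failed).append(name)
    return passed, failed


if __name__ == "__main__":
    # Stand-in measurements; replace with real query results.
    measured = {k: (367 if callable(v) else v) for k, v in CHECKS.items()}
    passed, failed = run_checks(measured)
    print(f"{len(passed)}/{len(CHECKS)} passed, failed: {failed}")
```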
|
|
310
|
+
---
|
|
311
|
+
|
|
312
|
+
## Notes
|
|
313
|
+
|
|
314
|
+
- **Streaming vs batch**: Auto Loader continuously monitors new files; COPY INTO is a one-time batch. In production, use Lakehouse **Pipe** as the Auto Loader replacement: once configured, it ingests new files automatically with no manual triggering.
|
|
315
|
+
- **`@dlt.expect` warn semantics**: `expect` (warn level) in DLT counts but does not drop records. When migrating to SQL, you can choose to not filter (ignore entirely) or add a `WHERE` filter (equivalent to upgrading to `expect_or_drop`). Decide based on business requirements.
|
|
316
|
+
- **DT auto-incremental**: Lakehouse Dynamic Tables automatically refresh incrementally after definition (similar to DLT streaming updates), without needing to explicitly write `read_stream`.
|
|
317
|
+
- **SCD2 `__is_current` field**: `LEAD()` returning NULL means there is no subsequent version, i.e., this is the current record. `COALESCE(__end_at, '9999-12-31')` is the standard SCD2 convention, with identical behavior on both sides.
|
|
318
|
+
|
|
319
|
+
## Related Documentation
|
|
320
|
+
|
|
321
|
+
### Dynamic Table (GIC)
|
|
322
|
+
|
|
323
|
+
- [Dynamic Table Overview](dynamic-table-overview.md): Declarative incremental computation principles and architecture
|
|
324
|
+
- [Dynamic Table SQL Reference](dynamic-table-sql.md): Full CREATE DYNAMIC TABLE syntax
|
|
325
|
+
- [Dynamic Table Usage Guide](SQL_DynamicTable_Guide.md): Unified batch-streaming pipeline examples
|
|
326
|
+
|
|
327
|
+
### Other Migration Guides
|
|
328
|
+
|
|
329
|
+
- [Databricks Notebook → Lakehouse Migration Guide (Retail Medallion Pipeline)](databricks-notebook-to-studio-migration.md)
|
|
330
|
+
- [dbt-databricks → dbt-clickzetta Migration Guide (Financial Payment Pipeline)](dbt-databricks-to-clickzetta-migration.md)
|
|
331
|
+
- [Spark Migration Guide](spark-migration-guide.md): Spark ecosystem migration overview
|