snowpark-checkpoints-collectors 0.1.0rc1__tar.gz → 0.1.0rc3__tar.gz
This diff shows the changes between publicly released versions of the package, as they appear in their respective public registries. It is provided for informational purposes only.
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/.gitignore +3 -0
- snowpark_checkpoints_collectors-0.1.0rc3/PKG-INFO +146 -0
- snowpark_checkpoints_collectors-0.1.0rc3/README.md +102 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/pyproject.toml +6 -5
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/__init__.py +3 -2
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/collection_common.py +10 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/collection_result/model/collection_point_result_manager.py +1 -1
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/column_collector_manager.py +18 -18
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/array_column_collector.py +22 -16
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/binary_column_collector.py +17 -11
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/boolean_column_collector.py +18 -11
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/column_collector_base.py +7 -7
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/date_column_collector.py +15 -8
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/day_time_interval_column_collector.py +15 -8
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/decimal_column_collector.py +22 -10
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/empty_column_collector.py +9 -7
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/map_column_collector.py +25 -17
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/null_column_collector.py +5 -5
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/numeric_column_collector.py +24 -11
- snowpark_checkpoints_collectors-0.1.0rc3/src/snowflake/snowpark_checkpoints_collector/column_collection/model/string_column_collector.py +59 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/struct_column_collector.py +10 -8
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/timestamp_column_collector.py +18 -8
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/timestamp_ntz_column_collector.py +18 -8
- snowpark_checkpoints_collectors-0.1.0rc3/src/snowflake/snowpark_checkpoints_collector/column_pandera_checks/pandera_column_checks_manager.py +212 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/summary_stats_collector.py +94 -45
- snowpark_checkpoints_collectors-0.1.0rc3/src/snowflake/snowpark_checkpoints_collector/utils/checkpoint_name_utils.py +50 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/telemetry_compare_utils.py +69 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_checkpoint_name.py +51 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/integ/test_collect_df_mode_1.py +40 -47
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_1_expected/test_dataframe_all_column_types_with_null_values.json +1 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_1_expected/test_dataframe_all_column_types_with_null_values_telemetry.json +18 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_1_expected/test_dataframe_with_unsupported_pandera_column_type.json +1 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_1_expected/test_dataframe_with_unsupported_pandera_column_type_telemetry.json +18 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_1_expected/test_df_with_null_values.json +1 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/integ/test_collect_df_mode_1_expected/test_df_with_null_values_telemetry.json +7 -6
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/integ/test_collect_df_mode_1_expected/test_df_with_only_null_values_telemetry.json +6 -5
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/integ/test_collect_df_mode_1_expected/test_empty_df_with_object_column_telemetry.json +6 -5
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_1_expected/test_empty_df_with_schema_telemetry.json +18 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_1_expected/test_full_df.json +1 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_1_expected/test_full_df_all_column_type.json +1 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/integ/test_collect_df_mode_1_expected/test_full_df_all_column_type_telemetry.json +6 -5
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_1_expected/test_full_df_telemetry.json +18 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/integ/test_collect_df_mode_2.py +56 -8
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_2_expected/test_collect_checkpoint_mode_2_parquet_directory _telemetry.json +18 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_2_expected/test_collect_checkpoint_mode_2_telemetry.json +18 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_2_expected/test_collect_empty_dataframe_with_schema_telemetry.json +18 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_2_expected/test_collect_invalid_mode_telemetry.json +18 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_2_expected/test_generate_parquet_for_spark_df_telemetry.json +18 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/integ/test_collect_df_mode_2_expected/test_spark_df_mode_dataframe_telemetry.json +18 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/integ/test_collection_result_file.py +1 -1
- snowpark_checkpoints_collectors-0.1.0rc3/test/unit/test_checkpoint_name_utils.py +47 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/unit/test_collection_point_result_manager.py +1 -1
- snowpark_checkpoints_collectors-0.1.0rc3/test/unit/test_column_collection.py +466 -0
- snowpark_checkpoints_collectors-0.1.0rc3/test/unit/test_pandera_column_check_manager.py +194 -0
- snowpark_checkpoints_collectors-0.1.0rc1/PKG-INFO +0 -276
- snowpark_checkpoints_collectors-0.1.0rc1/README.md +0 -31
- snowpark_checkpoints_collectors-0.1.0rc1/src/snowflake/snowpark_checkpoints_collector/column_collection/model/string_column_collector.py +0 -35
- snowpark_checkpoints_collectors-0.1.0rc1/src/snowflake/snowpark_checkpoints_collector/column_pandera_checks/pandera_column_checks_manager.py +0 -168
- snowpark_checkpoints_collectors-0.1.0rc1/test/integ/test_collect_df_mode_1_expected/test_dataframe_all_column_types_with_null_values.json +0 -1
- snowpark_checkpoints_collectors-0.1.0rc1/test/integ/test_collect_df_mode_1_expected/test_dataframe_all_column_types_with_null_values_telemetry.json +0 -17
- snowpark_checkpoints_collectors-0.1.0rc1/test/integ/test_collect_df_mode_1_expected/test_dataframe_with_unsupported_pandera_column_type.json +0 -1
- snowpark_checkpoints_collectors-0.1.0rc1/test/integ/test_collect_df_mode_1_expected/test_dataframe_with_unsupported_pandera_column_type_telemetry.json +0 -17
- snowpark_checkpoints_collectors-0.1.0rc1/test/integ/test_collect_df_mode_1_expected/test_df_with_null_values.json +0 -1
- snowpark_checkpoints_collectors-0.1.0rc1/test/integ/test_collect_df_mode_1_expected/test_empty_df_with_schema_telemetry.json +0 -17
- snowpark_checkpoints_collectors-0.1.0rc1/test/integ/test_collect_df_mode_1_expected/test_full_df.json +0 -1
- snowpark_checkpoints_collectors-0.1.0rc1/test/integ/test_collect_df_mode_1_expected/test_full_df_all_column_type.json +0 -1
- snowpark_checkpoints_collectors-0.1.0rc1/test/integ/test_collect_df_mode_1_expected/test_full_df_telemetry.json +0 -17
- snowpark_checkpoints_collectors-0.1.0rc1/test/unit/test_column_collection.py +0 -669
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/CHANGELOG.md +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/LICENSE +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/snowpark-testdf-schema.json +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/collection_result/model/__init__.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/collection_result/model/collection_point_result.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/__init__.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/__init__.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_pandera_checks/__init__.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/singleton.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/snow_connection_model/__init__.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/snow_connection_model/snow_connection.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/utils/extra_config.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/utils/file_utils.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/utils/telemetry.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/.coveragerc +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/integ/test_collect_df_mode_1_expected/test_df_with_only_null_values.json +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/integ/test_collect_df_mode_1_expected/test_empty_df_with_object_column.json +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/integ/test_collect_df_mode_1_expected/test_empty_df_with_schema.json +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/integ/test_snow_connection_int.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/unit/test_collection_point_result.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/unit/test_extra_config.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/unit/test_file_utils.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/unit/test_snow_connection.py +0 -0
- {snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/test/unit/test_summary_stats_collector.py +0 -0
snowpark_checkpoints_collectors-0.1.0rc3/PKG-INFO
ADDED
```diff
@@ -0,0 +1,146 @@
+Metadata-Version: 2.4
+Name: snowpark-checkpoints-collectors
+Version: 0.1.0rc3
+Summary: Snowpark column and table statistics collection
+Project-URL: Bug Tracker, https://github.com/snowflakedb/snowpark-checkpoints/issues
+Project-URL: Source code, https://github.com/snowflakedb/snowpark-checkpoints/
+Author-email: "Snowflake, Inc." <snowflake-python-libraries-dl@snowflake.com>
+License: Apache License, Version 2.0
+License-File: LICENSE
+Keywords: Snowflake,Snowpark,analytics,cloud,database,db
+Classifier: Development Status :: 4 - Beta
+Classifier: Environment :: Console
+Classifier: Environment :: Other Environment
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Education
+Classifier: Intended Audience :: Information Technology
+Classifier: Intended Audience :: System Administrators
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Programming Language :: SQL
+Classifier: Topic :: Database
+Classifier: Topic :: Scientific/Engineering :: Information Analysis
+Classifier: Topic :: Software Development
+Classifier: Topic :: Software Development :: Libraries
+Classifier: Topic :: Software Development :: Libraries :: Application Frameworks
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Python: <3.12,>=3.9
+Requires-Dist: pandera[io]==0.20.4
+Requires-Dist: pyspark
+Requires-Dist: snowflake-connector-python
+Requires-Dist: snowflake-snowpark-python
+Provides-Extra: development
+Requires-Dist: coverage>=7.6.7; extra == 'development'
+Requires-Dist: deepdiff>=8.0.0; extra == 'development'
+Requires-Dist: hatchling==1.25.0; extra == 'development'
+Requires-Dist: pre-commit>=4.0.1; extra == 'development'
+Requires-Dist: pyarrow>=18.0.0; extra == 'development'
+Requires-Dist: pytest-cov>=6.0.0; extra == 'development'
+Requires-Dist: pytest>=8.3.3; extra == 'development'
+Requires-Dist: setuptools>=70.0.0; extra == 'development'
+Requires-Dist: twine==5.1.1; extra == 'development'
+Description-Content-Type: text/markdown
+
+# snowpark-checkpoints-collectors
+
+---
+**NOTE**
+
+This package is on Private Preview.
+
+---
+
+**snowpark-checkpoints-collector** package offers a function for extracting information from PySpark dataframes. We can then use that data to validate against the converted Snowpark dataframes to ensure that behavioral equivalence has been achieved.
+## Features
+
+- Schema inference collected data mode (Schema): This is the default mode, which leverages Pandera schema inference to obtain the metadata and checks that will be evaluated for the specified dataframe. This mode also collects custom data from columns of the DataFrame based on the PySpark type.
+- DataFrame collected data mode (DataFrame): This mode collects the data of the PySpark dataframe. In this case, the mechanism saves all data of the given dataframe in parquet format. Using the default user Snowflake connection, it tries to upload the parquet files into the Snowflake temporal stage and create a table based on the information in the stage. The name of the file and the table is the same as the checkpoint.
+
+
+
+## Functionalities
+
+### Collect DataFrame Checkpoint
+
+
+
+```python
+from pyspark.sql import DataFrame as SparkDataFrame
+from snowflake.snowpark_checkpoints_collector.collection_common import CheckpointMode
+from typing import Optional
+
+# Signature of the function
+def collect_dataframe_checkpoint(
+    df: SparkDataFrame,
+    checkpoint_name: str,
+    sample: Optional[float] = None,
+    mode: Optional[CheckpointMode] = None,
+    output_path: Optional[str] = None,
+) -> None:
+    ...
+```
+
+- `df`: The input Spark dataframe to collect.
+- `checkpoint_name`: Name of the checkpoint schema file or dataframe.
+- `sample`: Fraction of DataFrame to sample for schema inference, defaults to 1.0.
+- `mode`: The mode to execution the collection (Schema or Dataframe), defaults to CheckpointMode.Schema.
+- `output_path`: The output path to save the checkpoint, defaults to current working directory.
+
+
+## Usage Example
+
+### Schema mode
+
+```python
+from pyspark.sql import SparkSession
+from snowflake.snowpark_checkpoints_collector import collect_dataframe_checkpoint
+from snowflake.snowpark_checkpoints_collector.collection_common import CheckpointMode
+
+spark_session = SparkSession.builder.getOrCreate()
+sample_size = 1.0
+
+pyspark_df = spark_session.createDataFrame(
+    [("apple", 21), ("lemon", 34), ("banana", 50)], schema="fruit string, age integer"
+)
+
+collect_dataframe_checkpoint(
+    pyspark_df,
+    checkpoint_name="collect_checkpoint_mode_1",
+    sample=sample_size,
+    mode=CheckpointMode.SCHEMA,
+)
+```
+
+
+### Dataframe mode
+
+```python
+from pyspark.sql import SparkSession
+from snowflake.snowpark_checkpoints_collector import collect_dataframe_checkpoint
+from snowflake.snowpark_checkpoints_collector.collection_common import CheckpointMode
+from pyspark.sql.types import StructType, StructField, ByteType, StringType, IntegerType
+
+spark_schema = StructType(
+    [
+        StructField("BYTE", ByteType(), True),
+        StructField("STRING", StringType(), True),
+        StructField("INTEGER", IntegerType(), True)
+    ]
+)
+
+data = [(1, "apple", 21), (2, "lemon", 34), (3, "banana", 50)]
+
+spark_session = SparkSession.builder.getOrCreate()
+pyspark_df = spark_session.createDataFrame(data, schema=spark_schema).orderBy(
+    "INTEGER"
+)
+
+collect_dataframe_checkpoint(
+    pyspark_df,
+    checkpoint_name="collect_checkpoint_mode_2",
+    mode=CheckpointMode.DATAFRAME,
+)
+```
+
+------
```
snowpark_checkpoints_collectors-0.1.0rc3/README.md
ADDED
```diff
@@ -0,0 +1,102 @@
+# snowpark-checkpoints-collectors
+
+---
+**NOTE**
+
+This package is on Private Preview.
+
+---
+
+**snowpark-checkpoints-collector** package offers a function for extracting information from PySpark dataframes. We can then use that data to validate against the converted Snowpark dataframes to ensure that behavioral equivalence has been achieved.
+## Features
+
+- Schema inference collected data mode (Schema): This is the default mode, which leverages Pandera schema inference to obtain the metadata and checks that will be evaluated for the specified dataframe. This mode also collects custom data from columns of the DataFrame based on the PySpark type.
+- DataFrame collected data mode (DataFrame): This mode collects the data of the PySpark dataframe. In this case, the mechanism saves all data of the given dataframe in parquet format. Using the default user Snowflake connection, it tries to upload the parquet files into the Snowflake temporal stage and create a table based on the information in the stage. The name of the file and the table is the same as the checkpoint.
+
+
+
+## Functionalities
+
+### Collect DataFrame Checkpoint
+
+
+
+```python
+from pyspark.sql import DataFrame as SparkDataFrame
+from snowflake.snowpark_checkpoints_collector.collection_common import CheckpointMode
+from typing import Optional
+
+# Signature of the function
+def collect_dataframe_checkpoint(
+    df: SparkDataFrame,
+    checkpoint_name: str,
+    sample: Optional[float] = None,
+    mode: Optional[CheckpointMode] = None,
+    output_path: Optional[str] = None,
+) -> None:
+    ...
+```
+
+- `df`: The input Spark dataframe to collect.
+- `checkpoint_name`: Name of the checkpoint schema file or dataframe.
+- `sample`: Fraction of DataFrame to sample for schema inference, defaults to 1.0.
+- `mode`: The mode to execution the collection (Schema or Dataframe), defaults to CheckpointMode.Schema.
+- `output_path`: The output path to save the checkpoint, defaults to current working directory.
+
+
+## Usage Example
+
+### Schema mode
+
+```python
+from pyspark.sql import SparkSession
+from snowflake.snowpark_checkpoints_collector import collect_dataframe_checkpoint
+from snowflake.snowpark_checkpoints_collector.collection_common import CheckpointMode
+
+spark_session = SparkSession.builder.getOrCreate()
+sample_size = 1.0
+
+pyspark_df = spark_session.createDataFrame(
+    [("apple", 21), ("lemon", 34), ("banana", 50)], schema="fruit string, age integer"
+)
+
+collect_dataframe_checkpoint(
+    pyspark_df,
+    checkpoint_name="collect_checkpoint_mode_1",
+    sample=sample_size,
+    mode=CheckpointMode.SCHEMA,
+)
+```
+
+
+### Dataframe mode
+
+```python
+from pyspark.sql import SparkSession
+from snowflake.snowpark_checkpoints_collector import collect_dataframe_checkpoint
+from snowflake.snowpark_checkpoints_collector.collection_common import CheckpointMode
+from pyspark.sql.types import StructType, StructField, ByteType, StringType, IntegerType
+
+spark_schema = StructType(
+    [
+        StructField("BYTE", ByteType(), True),
+        StructField("STRING", StringType(), True),
+        StructField("INTEGER", IntegerType(), True)
+    ]
+)
+
+data = [(1, "apple", 21), (2, "lemon", 34), (3, "banana", 50)]
+
+spark_session = SparkSession.builder.getOrCreate()
+pyspark_df = spark_session.createDataFrame(data, schema=spark_schema).orderBy(
+    "INTEGER"
+)
+
+collect_dataframe_checkpoint(
+    pyspark_df,
+    checkpoint_name="collect_checkpoint_mode_2",
+    mode=CheckpointMode.DATAFRAME,
+)
+```
+
+------
```
{snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/pyproject.toml
RENAMED
```diff
@@ -3,7 +3,9 @@ build-backend = "hatchling.build"
 requires = ["hatchling"]
 
 [project]
-authors = [
+authors = [
+    {name = "Snowflake, Inc.", email = "snowflake-python-libraries-dl@snowflake.com"},
+]
 classifiers = [
     "Development Status :: 4 - Beta",
     "Environment :: Console",
@@ -30,6 +32,7 @@ dependencies = [
     "pandera[io]==0.20.4",
 ]
 description = "Snowpark column and table statistics collection"
+dynamic = ['version']
 keywords = [
     'Snowflake',
     'analytics',
@@ -38,11 +41,10 @@ keywords = [
     'db',
     'Snowpark',
 ]
-license = {
+license = {text = "Apache License, Version 2.0"}
 name = "snowpark-checkpoints-collectors"
 readme = "README.md"
 requires-python = '>=3.9,<3.12'
-dynamic = ['version']
 
 [project.optional-dependencies]
 development = [
@@ -113,7 +115,6 @@ exclude_lines = [
     "if __name__ == .__main__.:",
 ]
 
-
 [tool.hatch.envs.linter.scripts]
 check = [
     'ruff check --fix .',
@@ -121,7 +122,7 @@ check = [
 
 [tool.hatch.envs.test.scripts]
 check = [
-    "pip install -e ../snowpark-checkpoints-configuration"
+    "pip install -e ../snowpark-checkpoints-configuration",
     'pytest -v --junitxml=test/outcome/test-results.xml --cov=. --cov-config=test/.coveragerc --cov-report=xml:test/outcome/coverage-{matrix:python:{env:PYTHON_VERSION:unset}}.xml {args:test} --cov-report=term --cov-report=json:test/outcome/coverage-{matrix:python:{env:PYTHON_VERSION:unset}}.json',
 ]
 
```
{snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/__init__.py
RENAMED
```diff
@@ -2,9 +2,10 @@
 # Copyright (c) 2012-2024 Snowflake Computing Inc. All rights reserved.
 #
 
-__all__ = ["collect_dataframe_checkpoint", "
+__all__ = ["collect_dataframe_checkpoint", "CheckpointMode"]
 
-from snowflake.snowpark_checkpoints_collector.singleton import Singleton
 from snowflake.snowpark_checkpoints_collector.summary_stats_collector import (
     collect_dataframe_checkpoint,
 )
+
+from snowflake.snowpark_checkpoints_collector.collection_common import CheckpointMode
```
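With this change, `CheckpointMode` is re-exported from the package root alongside `collect_dataframe_checkpoint`, so callers no longer need to reach into `collection_common`. A minimal sketch of the resulting import surface; the Spark session, dataframe, and checkpoint name below are illustrative, not part of the package:

```python
from pyspark.sql import SparkSession

# Both names appear in the new __all__ shown above.
from snowflake.snowpark_checkpoints_collector import (
    CheckpointMode,
    collect_dataframe_checkpoint,
)

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("apple", 21)], schema="fruit string, age integer")

# Hypothetical checkpoint name; Schema mode is the documented default.
collect_dataframe_checkpoint(df, checkpoint_name="demo_checkpoint", mode=CheckpointMode.SCHEMA)
```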
{snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/collection_common.py
RENAMED
```diff
@@ -8,8 +8,13 @@ from enum import IntEnum
 
 
 class CheckpointMode(IntEnum):
+
+    """Enum class representing the collection mode."""
+
     SCHEMA = 1
+    """Collect automatic schema inference"""
     DATAFRAME = 2
+    """Export DataFrame as Parquet file to Snowflake"""
 
 
 # CONSTANTS
@@ -76,11 +81,13 @@ COLUMN_IS_UNIQUE_SIZE_KEY = "is_unique_size"
 COLUMN_KEY_TYPE_KEY = "key_type"
 COLUMN_MARGIN_ERROR_KEY = "margin_error"
 COLUMN_MAX_KEY = "max"
+COLUMN_MAX_LENGTH_KEY = "max_length"
 COLUMN_MAX_SIZE_KEY = "max_size"
 COLUMN_MEAN_KEY = "mean"
 COLUMN_MEAN_SIZE_KEY = "mean_size"
 COLUMN_METADATA_KEY = "metadata"
 COLUMN_MIN_KEY = "min"
+COLUMN_MIN_LENGTH_KEY = "min_length"
 COLUMN_MIN_SIZE_KEY = "min_size"
 COLUMN_NAME_KEY = "name"
 COLUMN_NULL_COUNT_KEY = "null_count"
@@ -90,6 +97,7 @@ COLUMN_ROWS_NULL_COUNT_KEY = "rows_null_count"
 COLUMN_SIZE_KEY = "size"
 COLUMN_TRUE_COUNT_KEY = "true_count"
 COLUMN_TYPE_KEY = "type"
+COLUMN_VALUE_KEY = "value"
 COLUMN_VALUE_TYPE_KEY = "value_type"
 COLUMNS_KEY = "columns"
 
@@ -121,6 +129,8 @@ UNKNOWN_SOURCE_FILE = "unknown"
 UNKNOWN_LINE_OF_CODE = -1
 BACKSLASH_TOKEN = "\\"
 SLASH_TOKEN = "/"
+PYSPARK_NONE_SIZE_VALUE = -1
+PANDAS_LONG_TYPE = "Int64"
 
 # ENVIRONMENT VARIABLES
 SNOWFLAKE_CHECKPOINT_CONTRACT_FILE_PATH_ENV_VAR = (
```
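The new `COLUMN_MIN_LENGTH_KEY` / `COLUMN_MAX_LENGTH_KEY` keys line up with the replaced `string_column_collector.py` listed above (+59 lines). As a rough illustration of the kind of statistic those keys name, here is a standalone PySpark snippet that computes minimum and maximum string lengths for a column; it is only a sketch of the idea, not the package's implementation, and the dataframe is made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, length
from pyspark.sql.functions import max as spark_max
from pyspark.sql.functions import min as spark_min

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("apple",), ("fig",), (None,)], schema="fruit string")

# min()/max() ignore NULL values, so only the non-null strings are measured.
row = df.select(
    spark_min(length(col("fruit"))).alias("min_length"),
    spark_max(length(col("fruit"))).alias("max_length"),
).collect()[0]

print(row["min_length"], row["max_length"])  # 3 5
```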
{snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/collection_result/model/collection_point_result_manager.py
RENAMED
```diff
@@ -5,10 +5,10 @@ import json
 
 from typing import Optional
 
-from snowflake.snowpark_checkpoints_collector import Singleton
 from snowflake.snowpark_checkpoints_collector.collection_result.model import (
     CollectionPointResult,
 )
+from snowflake.snowpark_checkpoints_collector.singleton import Singleton
 from snowflake.snowpark_checkpoints_collector.utils import file_utils
 
 
```
{snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/column_collector_manager.py
RENAMED
```diff
@@ -1,7 +1,7 @@
 #
 # Copyright (c) 2012-2024 Snowflake Computing Inc. All rights reserved.
 #
-from
+from pyspark.sql import DataFrame as SparkDataFrame
 from pyspark.sql.types import StructField
 
 from snowflake.snowpark_checkpoints_collector.collection_common import (
@@ -88,14 +88,14 @@ class ColumnCollectorManager:
     """Manage class for column collector based on type."""
 
     def collect_column(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         """Collect the data of the column based on the column type.
 
         Args:
             clm_name (str): the name of the column.
             struct_field (pyspark.sql.types.StructField): the struct field of the column type.
-            values (
+            values (pyspark.sql.DataFrame): the column values as PySpark DataFrame.
 
         Returns:
             dict[str, any]: The data collected.
@@ -112,7 +112,7 @@ class ColumnCollectorManager:
 
     @column_register(ARRAY_COLUMN_TYPE)
     def _collect_array_type_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         column_collector = ArrayColumnCollector(clm_name, struct_field, values)
         collected_data = column_collector.get_data()
@@ -120,7 +120,7 @@ class ColumnCollectorManager:
 
     @column_register(BINARY_COLUMN_TYPE)
     def _collect_binary_type_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         column_collector = BinaryColumnCollector(clm_name, struct_field, values)
         collected_data = column_collector.get_data()
@@ -128,7 +128,7 @@ class ColumnCollectorManager:
 
     @column_register(BOOLEAN_COLUMN_TYPE)
     def _collect_boolean_type_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         column_collector = BooleanColumnCollector(clm_name, struct_field, values)
         collected_data = column_collector.get_data()
@@ -136,7 +136,7 @@ class ColumnCollectorManager:
 
     @column_register(DATE_COLUMN_TYPE)
     def _collect_date_type_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         column_collector = DateColumnCollector(clm_name, struct_field, values)
         collected_data = column_collector.get_data()
@@ -144,7 +144,7 @@ class ColumnCollectorManager:
 
     @column_register(DAYTIMEINTERVAL_COLUMN_TYPE)
     def _collect_day_time_interval_type_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         column_collector = DayTimeIntervalColumnCollector(
             clm_name, struct_field, values
@@ -154,7 +154,7 @@ class ColumnCollectorManager:
 
     @column_register(DECIMAL_COLUMN_TYPE)
     def _collect_decimal_type_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         column_collector = DecimalColumnCollector(clm_name, struct_field, values)
         collected_data = column_collector.get_data()
@@ -162,7 +162,7 @@ class ColumnCollectorManager:
 
     @column_register(MAP_COLUMN_TYPE)
     def _collect_map_type_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         column_collector = MapColumnCollector(clm_name, struct_field, values)
         collected_data = column_collector.get_data()
@@ -170,7 +170,7 @@ class ColumnCollectorManager:
 
     @column_register(NULL_COLUMN_TYPE)
     def _collect_null_type_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         column_collector = NullColumnCollector(clm_name, struct_field, values)
         collected_data = column_collector.get_data()
@@ -185,7 +185,7 @@ class ColumnCollectorManager:
         DOUBLE_COLUMN_TYPE,
     )
     def _collect_numeric_type_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         column_collector = NumericColumnCollector(clm_name, struct_field, values)
         collected_data = column_collector.get_data()
@@ -193,7 +193,7 @@ class ColumnCollectorManager:
 
     @column_register(STRING_COLUMN_TYPE)
     def _collect_string_type_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         column_collector = StringColumnCollector(clm_name, struct_field, values)
         collected_data = column_collector.get_data()
@@ -201,7 +201,7 @@ class ColumnCollectorManager:
 
     @column_register(STRUCT_COLUMN_TYPE)
     def _collect_struct_type_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         column_collector = StructColumnCollector(clm_name, struct_field, values)
         collected_data = column_collector.get_data()
@@ -209,7 +209,7 @@ class ColumnCollectorManager:
 
     @column_register(TIMESTAMP_COLUMN_TYPE)
     def _collect_timestamp_type_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         column_collector = TimestampColumnCollector(clm_name, struct_field, values)
         collected_data = column_collector.get_data()
@@ -217,21 +217,21 @@ class ColumnCollectorManager:
 
     @column_register(TIMESTAMP_NTZ_COLUMN_TYPE)
     def _collect_timestampntz_type_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         column_collector = TimestampNTZColumnCollector(clm_name, struct_field, values)
         collected_data = column_collector.get_data()
         return collected_data
 
     def collect_empty_custom_data(
-        self, clm_name: str, struct_field: StructField, values:
+        self, clm_name: str, struct_field: StructField, values: SparkDataFrame
     ) -> dict[str, any]:
         """Collect the data of a empty column.
 
         Args:
             clm_name (str): the name of the column.
             struct_field (pyspark.sql.types.StructField): the struct field of the column type.
-            values (
+            values (pyspark.sql.DataFrame): the column values as PySpark DataFrame.
 
         Returns:
             dict[str, any]: The data collected.
```
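These hunks only change the type hints, but they also show how the manager works: each `_collect_*_type_custom_data` method is registered for one or more PySpark type names via the `@column_register(...)` decorator and then dispatched by column type. The decorator's implementation is not part of this diff, so the following is only a sketch of that general registration pattern; the registry name `_COLLECTORS` and the toy `collect_array_column` function are made up for illustration:

```python
from typing import Any, Callable

# Hypothetical registry mapping a column type name to its collector function.
_COLLECTORS: dict[str, Callable[..., dict[str, Any]]] = {}


def column_register(*column_types: str):
    """Register a collector function under one or more column type names."""

    def wrapper(func: Callable[..., dict[str, Any]]) -> Callable[..., dict[str, Any]]:
        for column_type in column_types:
            _COLLECTORS[column_type] = func
        return func

    return wrapper


@column_register("array")
def collect_array_column(clm_name: str) -> dict[str, Any]:
    # A real collector would build an ArrayColumnCollector and call get_data().
    return {"name": clm_name, "type": "array"}


# Dispatch by the column's type name, as collect_column() does in the manager.
print(_COLLECTORS["array"]("FRUIT_LIST"))
```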
{snowpark_checkpoints_collectors-0.1.0rc1 → snowpark_checkpoints_collectors-0.1.0rc3}/src/snowflake/snowpark_checkpoints_collector/column_collection/model/array_column_collector.py
RENAMED
```diff
@@ -3,7 +3,12 @@
 #
 from statistics import mean
 
-from
+from pyspark.sql import DataFrame as SparkDataFrame
+from pyspark.sql.functions import array as spark_array
+from pyspark.sql.functions import coalesce as spark_coalesce
+from pyspark.sql.functions import col as spark_col
+from pyspark.sql.functions import explode as spark_explode
+from pyspark.sql.functions import size as spark_size
 from pyspark.sql.types import StructField
 
 from snowflake.snowpark_checkpoints_collector.collection_common import (
@@ -13,6 +18,8 @@ from snowflake.snowpark_checkpoints_collector.collection_common import (
     COLUMN_MEAN_SIZE_KEY,
     COLUMN_MIN_SIZE_KEY,
     COLUMN_NULL_VALUE_PROPORTION_KEY,
+    COLUMN_SIZE_KEY,
+    COLUMN_VALUE_KEY,
     COLUMN_VALUE_TYPE_KEY,
     CONTAINS_NULL_KEY,
     ELEMENT_TYPE_KEY,
@@ -30,22 +37,22 @@ class ArrayColumnCollector(ColumnCollectorBase):
         name (str): the name of the column.
         type (str): the type of the column.
         struct_field (pyspark.sql.types.StructField): the struct field of the column type.
-
+        column_df (pyspark.sql.DataFrame): the column values as PySpark DataFrame.
 
     """
 
     def __init__(
-        self, clm_name: str, struct_field: StructField,
+        self, clm_name: str, struct_field: StructField, clm_df: SparkDataFrame
     ) -> None:
         """Init ArrayColumnCollector.
 
         Args:
             clm_name (str): the name of the column.
             struct_field (pyspark.sql.types.StructField): the struct field of the column type.
-
+            clm_df (pyspark.sql.DataFrame): the column values as PySpark DataFrame.
 
         """
-        super().__init__(clm_name, struct_field,
+        super().__init__(clm_name, struct_field, clm_df)
         self._array_size_collection = self._compute_array_size_collection()
 
     def get_custom_data(self) -> dict[str, any]:
@@ -73,23 +80,22 @@ class ArrayColumnCollector(ColumnCollectorBase):
         return custom_data_dict
 
     def _compute_array_size_collection(self) -> list[int]:
-
-
-
-
+        select_result = self.column_df.select(
+            spark_size(spark_coalesce(spark_col(self.name), spark_array([]))).alias(
+                COLUMN_SIZE_KEY
+            )
+        ).collect()
 
-
-            size_collection.append(length)
+        size_collection = [row[COLUMN_SIZE_KEY] for row in select_result]
 
         return size_collection
 
     def _compute_null_value_proportion(self) -> float:
-
-
-
-                continue
+        select_result = self.column_df.select(
+            spark_explode(spark_col(self.name)).alias(COLUMN_VALUE_KEY)
+        )
 
-
+        null_counter = select_result.where(spark_col(COLUMN_VALUE_KEY).isNull()).count()
 
         total_values = sum(self._array_size_collection)
         null_value_proportion = (null_counter / total_values) * 100
```