bizon 0.1.0.tar.gz → 0.1.1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (134)
  1. {bizon-0.1.0 → bizon-0.1.1}/PKG-INFO +22 -6
  2. {bizon-0.1.0 → bizon-0.1.1}/README.md +15 -4
  3. bizon-0.1.1/bizon/alerting/alerts.py +23 -0
  4. bizon-0.1.1/bizon/alerting/models.py +28 -0
  5. bizon-0.1.1/bizon/alerting/slack/__init__.py +0 -0
  6. bizon-0.1.1/bizon/alerting/slack/config.py +5 -0
  7. bizon-0.1.1/bizon/alerting/slack/handler.py +39 -0
  8. bizon-0.1.1/bizon/cli/__init__.py +0 -0
  9. {bizon-0.1.0 → bizon-0.1.1}/bizon/cli/main.py +7 -3
  10. {bizon-0.1.0 → bizon-0.1.1}/bizon/common/models.py +31 -7
  11. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/destinations/bigquery/config/bigquery.example.yml +3 -4
  12. bizon-0.1.1/bizon/connectors/destinations/bigquery/src/config.py +127 -0
  13. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/destinations/bigquery/src/destination.py +46 -25
  14. bizon-0.1.1/bizon/connectors/destinations/bigquery_streaming/src/config.py +56 -0
  15. bizon-0.1.1/bizon/connectors/destinations/bigquery_streaming/src/destination.py +372 -0
  16. bizon-0.1.1/bizon/connectors/destinations/bigquery_streaming_v2/src/config.py +52 -0
  17. bizon-0.1.1/bizon/connectors/destinations/bigquery_streaming_v2/src/destination.py +261 -0
  18. {bizon-0.1.0/bizon/destinations/bigquery_streaming → bizon-0.1.1/bizon/connectors/destinations/bigquery_streaming_v2}/src/proto_utils.py +32 -26
  19. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/destinations/file/src/config.py +8 -3
  20. bizon-0.1.1/bizon/connectors/destinations/file/src/destination.py +54 -0
  21. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/destinations/logger/src/config.py +1 -1
  22. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/destinations/logger/src/destination.py +15 -3
  23. bizon-0.1.1/bizon/connectors/sources/cycle/config/cycle.example.yml +15 -0
  24. bizon-0.1.1/bizon/connectors/sources/cycle/src/source.py +133 -0
  25. bizon-0.1.0/bizon/sources/periscope/tests/periscope_pipeline_dashboard.py → bizon-0.1.1/bizon/connectors/sources/cycle/tests/cycle_customers.py +1 -1
  26. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/dummy/config/dummy.example.yml +2 -2
  27. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/dummy/src/fake_api.py +6 -1
  28. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/dummy/src/source.py +18 -5
  29. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/dummy/tests/dummy_pipeline.py +2 -2
  30. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/dummy/tests/dummy_pipeline_bigquery_backend.py +2 -2
  31. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/dummy/tests/dummy_pipeline_kafka.py +2 -2
  32. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/dummy/tests/dummy_pipeline_rabbitmq.py +2 -2
  33. bizon-0.1.1/bizon/connectors/sources/dummy/tests/dummy_pipeline_unnest.py +29 -0
  34. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/dummy/tests/dummy_pipeline_write_data_bigquery.py +2 -2
  35. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/dummy/tests/dummy_pipeline_write_data_bigquery_through_kafka.py +2 -2
  36. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/gsheets/config/default_auth.example.yml +4 -2
  37. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/gsheets/config/service_account.example.yml +4 -2
  38. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/hubspot/config/api_key.example.yml +4 -2
  39. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/hubspot/config/oauth.example.yml +4 -2
  40. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/hubspot/src/hubspot_objects.py +1 -1
  41. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/kafka/config/kafka.example.yml +2 -2
  42. bizon-0.1.1/bizon/connectors/sources/kafka/config/kafka_debezium.example.yml +112 -0
  43. bizon-0.1.1/bizon/connectors/sources/kafka/src/callback.py +18 -0
  44. bizon-0.1.1/bizon/connectors/sources/kafka/src/config.py +75 -0
  45. bizon-0.1.1/bizon/connectors/sources/kafka/src/decode.py +88 -0
  46. bizon-0.1.1/bizon/connectors/sources/kafka/src/source.py +361 -0
  47. bizon-0.1.1/bizon/connectors/sources/kafka/tests/kafka_pipeline.py +7 -0
  48. bizon-0.1.1/bizon/connectors/sources/periscope/config/periscope_charts.example.yml +20 -0
  49. bizon-0.1.1/bizon/connectors/sources/periscope/config/periscope_dashboards.example.yml +20 -0
  50. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/periscope/src/source.py +137 -13
  51. bizon-0.1.1/bizon/connectors/sources/periscope/tests/periscope_pipeline_dashboard.py +9 -0
  52. bizon-0.1.1/bizon/connectors/sources/pokeapi/config/pokeapi_pokemon_to_json.example.yml +19 -0
  53. bizon-0.1.1/bizon/connectors/sources/pokeapi/config/pokeapi_pokemon_to_logger.example.yml +10 -0
  54. bizon-0.1.1/bizon/connectors/sources/pokeapi/src/source.py +79 -0
  55. {bizon-0.1.0/bizon/destinations → bizon-0.1.1/bizon/destination}/buffer.py +5 -0
  56. bizon-0.1.1/bizon/destination/config.py +74 -0
  57. {bizon-0.1.0/bizon/destinations → bizon-0.1.1/bizon/destination}/destination.py +71 -15
  58. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/backend/adapters/sqlalchemy/backend.py +3 -1
  59. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/engine.py +20 -1
  60. bizon-0.1.1/bizon/engine/pipeline/consumer.py +83 -0
  61. bizon-0.1.1/bizon/engine/pipeline/models.py +15 -0
  62. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/pipeline/producer.py +18 -9
  63. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/queue/adapters/kafka/consumer.py +2 -2
  64. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/queue/adapters/kafka/queue.py +3 -2
  65. bizon-0.1.1/bizon/engine/queue/adapters/python_queue/consumer.py +53 -0
  66. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/queue/adapters/python_queue/queue.py +19 -9
  67. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/queue/adapters/rabbitmq/consumer.py +3 -6
  68. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/queue/adapters/rabbitmq/queue.py +3 -2
  69. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/queue/config.py +16 -0
  70. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/queue/queue.py +17 -16
  71. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/runner/adapters/process.py +15 -2
  72. bizon-0.1.1/bizon/engine/runner/adapters/streaming.py +103 -0
  73. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/runner/adapters/thread.py +32 -9
  74. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/runner/config.py +28 -0
  75. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/runner/runner.py +107 -25
  76. bizon-0.1.1/bizon/monitoring/__init__.py +0 -0
  77. bizon-0.1.1/bizon/monitoring/config.py +29 -0
  78. bizon-0.1.1/bizon/monitoring/datadog/__init__.py +0 -0
  79. bizon-0.1.1/bizon/monitoring/datadog/monitor.py +69 -0
  80. bizon-0.1.1/bizon/monitoring/monitor.py +42 -0
  81. bizon-0.1.1/bizon/monitoring/noop/__init__.py +0 -0
  82. bizon-0.1.1/bizon/monitoring/noop/monitor.py +11 -0
  83. bizon-0.1.1/bizon/source/callback.py +24 -0
  84. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/config.py +3 -3
  85. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/cursor.py +1 -1
  86. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/discover.py +4 -3
  87. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/models.py +4 -2
  88. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/source.py +10 -2
  89. bizon-0.1.1/bizon/transform/config.py +8 -0
  90. bizon-0.1.1/bizon/transform/transform.py +48 -0
  91. {bizon-0.1.0 → bizon-0.1.1}/pyproject.toml +9 -1
  92. bizon-0.1.0/bizon/destinations/bigquery/src/config.py +0 -51
  93. bizon-0.1.0/bizon/destinations/bigquery_streaming/src/config.py +0 -43
  94. bizon-0.1.0/bizon/destinations/bigquery_streaming/src/destination.py +0 -154
  95. bizon-0.1.0/bizon/destinations/config.py +0 -47
  96. bizon-0.1.0/bizon/destinations/file/src/destination.py +0 -27
  97. bizon-0.1.0/bizon/engine/pipeline/consumer.py +0 -15
  98. bizon-0.1.0/bizon/engine/pipeline/models.py +0 -10
  99. bizon-0.1.0/bizon/engine/queue/adapters/python_queue/consumer.py +0 -36
  100. bizon-0.1.0/bizon/sources/kafka/src/source.py +0 -357
  101. bizon-0.1.0/bizon/sources/kafka/tests/kafka_pipeline.py +0 -9
  102. bizon-0.1.0/bizon/sources/periscope/config/periscope_charts.example.yml +0 -26
  103. bizon-0.1.0/bizon/sources/periscope/config/periscope_dashboards.example.yml +0 -26
  104. {bizon-0.1.0 → bizon-0.1.1}/LICENSE +0 -0
  105. {bizon-0.1.0 → bizon-0.1.1}/bizon/__main__.py +0 -0
  106. {bizon-0.1.0/bizon/cli → bizon-0.1.1/bizon/alerting}/__init__.py +0 -0
  107. {bizon-0.1.0 → bizon-0.1.1}/bizon/cli/utils.py +0 -0
  108. {bizon-0.1.0 → bizon-0.1.1}/bizon/common/errors/backoff.py +0 -0
  109. {bizon-0.1.0 → bizon-0.1.1}/bizon/common/errors/errors.py +0 -0
  110. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/gsheets/src/source.py +0 -0
  111. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/gsheets/tests/gsheets_pipeline.py +0 -0
  112. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/hubspot/src/hubspot_base.py +0 -0
  113. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/hubspot/src/models/hs_object.py +0 -0
  114. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/hubspot/tests/hubspot_pipeline.py +0 -0
  115. {bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/sources/periscope/tests/periscope_pipeline_charts.py +0 -0
  116. {bizon-0.1.0/bizon/destinations → bizon-0.1.1/bizon/destination}/models.py +0 -0
  117. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/backend/adapters/sqlalchemy/config.py +0 -0
  118. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/backend/backend.py +0 -0
  119. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/backend/config.py +0 -0
  120. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/backend/models.py +0 -0
  121. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/config.py +0 -0
  122. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/queue/adapters/kafka/config.py +0 -0
  123. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/queue/adapters/python_queue/config.py +0 -0
  124. {bizon-0.1.0 → bizon-0.1.1}/bizon/engine/queue/adapters/rabbitmq/config.py +0 -0
  125. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/auth/authenticators/abstract_oauth.py +0 -0
  126. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/auth/authenticators/abstract_token.py +0 -0
  127. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/auth/authenticators/basic.py +0 -0
  128. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/auth/authenticators/cookies.py +0 -0
  129. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/auth/authenticators/oauth.py +0 -0
  130. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/auth/authenticators/token.py +0 -0
  131. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/auth/builder.py +0 -0
  132. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/auth/config.py +0 -0
  133. {bizon-0.1.0 → bizon-0.1.1}/bizon/source/session.py +0 -0
  134. {bizon-0.1.0 → bizon-0.1.1}/bizon/utils.py +0 -0

{bizon-0.1.0 → bizon-0.1.1}/PKG-INFO
@@ -1,6 +1,6 @@
- Metadata-Version: 2.1
+ Metadata-Version: 2.3
  Name: bizon
- Version: 0.1.0
+ Version: 0.1.1
  Summary: Extract and load your data reliably from API Clients with native fault-tolerant and checkpointing mechanism.
  Author: Antoine Balliet
  Author-email: antoine.balliet@gmail.com
@@ -11,6 +11,7 @@ Classifier: Programming Language :: Python :: 3.10
  Classifier: Programming Language :: Python :: 3.11
  Classifier: Programming Language :: Python :: 3.12
  Provides-Extra: bigquery
+ Provides-Extra: datadog
  Provides-Extra: gsheets
  Provides-Extra: kafka
  Provides-Extra: postgres
@@ -19,6 +20,7 @@ Requires-Dist: avro (>=1.12.0,<2.0.0) ; extra == "kafka"
  Requires-Dist: backoff (>=2.2.1,<3.0.0)
  Requires-Dist: click (>=8.1.7,<9.0.0)
  Requires-Dist: confluent-kafka (>=2.6.0,<3.0.0) ; extra == "kafka"
+ Requires-Dist: datadog (>=0.50.2,<0.51.0) ; extra == "datadog"
  Requires-Dist: dpath (>=2.2.0,<3.0.0)
  Requires-Dist: fastavro (>=1.9.7,<2.0.0) ; extra == "kafka"
  Requires-Dist: google-cloud-bigquery (>=3.25.0,<4.0.0) ; extra == "bigquery"
@@ -27,6 +29,7 @@ Requires-Dist: google-cloud-storage (>=2.17.0,<3.0.0)
  Requires-Dist: gspread (>=6.1.2,<7.0.0) ; extra == "gsheets"
  Requires-Dist: kafka-python (>=2.0.2,<3.0.0) ; extra == "kafka"
  Requires-Dist: loguru (>=0.7.2,<0.8.0)
+ Requires-Dist: orjson (>=3.10.16,<4.0.0)
  Requires-Dist: pendulum (>=3.0.0,<4.0.0)
  Requires-Dist: pika (>=1.3.2,<2.0.0) ; extra == "rabbitmq"
  Requires-Dist: polars (>=1.16.0,<2.0.0)
@@ -39,8 +42,10 @@ Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
  Requires-Dist: pytz (>=2024.2,<2025.0)
  Requires-Dist: pyyaml (>=6.0.1,<7.0.0)
  Requires-Dist: requests (>=2.28.2,<3.0.0)
+ Requires-Dist: simplejson (>=3.20.1,<4.0.0)
  Requires-Dist: sqlalchemy (>=2.0.32,<3.0.0)
  Requires-Dist: sqlalchemy-bigquery (>=1.11.0,<2.0.0) ; extra == "bigquery"
+ Requires-Dist: tenacity (>=9.0.0,<10.0.0)
  Description-Content-Type: text/markdown

  # bizon ⚡️
@@ -48,13 +53,17 @@ Extract and load your largest data streams with a framework you can trust for bi

  ## Features
  - **Natively fault-tolerant**: Bizon uses a checkpointing mechanism to keep track of the progress and recover from the last checkpoint.
+
  - **High throughput**: Bizon is designed to handle high throughput and can process billions of records.
+
  - **Queue system agnostic**: Bizon is agnostic of the queuing system, you can use any queuing system among Python Queue, RabbitMQ, Kafka or Redpanda. Thanks to the `bizon.engine.queue.Queue` interface, adapters can be written for any queuing system.
- - **Pipeline metrics**: Bizon provides exhaustive pipeline metrics and implement OpenTelemetry for tracing. You can monitor:
+
+ - **Pipeline metrics**: Bizon provides exhaustive pipeline metrics and implement Datadog & OpenTelemetry for tracing. You can monitor:
    - ETAs for completion
    - Number of records processed
    - Completion percentage
    - Latency Source <> Destination
+
  - **Lightweight & lean**: Bizon is lightweight, minimal codebase and only uses few dependencies:
    - `requests` for HTTP requests
    - `pyyaml` for configuration
@@ -83,8 +92,8 @@ Create a file named `config.yml` in your working directory with the following co
  name: demo-creatures-pipeline

  source:
-   source_name: dummy
-   stream_name: creatures
+   name: dummy
+   stream: creatures
    authentication:
      type: api_key
      params:
@@ -105,7 +114,7 @@ bizon run config.yml

  Backend is the interface used by Bizon to store its state. It can be configured in the `backend` section of the configuration file. The following backends are supported:
  - `sqlite`: In-memory SQLite database, useful for testing and development.
- - `biguquery`: Google BigQuery backend, perfect for light setup & production.
+ - `bigquery`: Google BigQuery backend, perfect for light setup & production.
  - `postgres`: PostgreSQL backend, for production use and frequent cursor updates.

  ## Queue configuration
@@ -115,6 +124,13 @@ Queue is the interface used by Bizon to exchange data between `Source` and `Dest
  - `rabbitmq`: RabbitMQ, for production use and high throughput.
  - `kafka`: Apache Kafka, for production use and high throughput and strong persistence.

+ ## Runner configuration
+
+ Runner is the interface used by Bizon to run the pipeline. It can be configured in the `runner` section of the configuration file. The following runners are supported:
+ - `thread` (asynchronous)
+ - `process` (asynchronous)
+ - `stream` (synchronous)
+
  ## Start syncing your data 🚀

  ### Quick setup without any dependencies ✌️

{bizon-0.1.0 → bizon-0.1.1}/README.md
@@ -3,13 +3,17 @@ Extract and load your largest data streams with a framework you can trust for bi

  ## Features
  - **Natively fault-tolerant**: Bizon uses a checkpointing mechanism to keep track of the progress and recover from the last checkpoint.
+
  - **High throughput**: Bizon is designed to handle high throughput and can process billions of records.
+
  - **Queue system agnostic**: Bizon is agnostic of the queuing system, you can use any queuing system among Python Queue, RabbitMQ, Kafka or Redpanda. Thanks to the `bizon.engine.queue.Queue` interface, adapters can be written for any queuing system.
- - **Pipeline metrics**: Bizon provides exhaustive pipeline metrics and implement OpenTelemetry for tracing. You can monitor:
+
+ - **Pipeline metrics**: Bizon provides exhaustive pipeline metrics and implement Datadog & OpenTelemetry for tracing. You can monitor:
    - ETAs for completion
    - Number of records processed
    - Completion percentage
    - Latency Source <> Destination
+
  - **Lightweight & lean**: Bizon is lightweight, minimal codebase and only uses few dependencies:
    - `requests` for HTTP requests
    - `pyyaml` for configuration
@@ -38,8 +42,8 @@ Create a file named `config.yml` in your working directory with the following co
  name: demo-creatures-pipeline

  source:
-   source_name: dummy
-   stream_name: creatures
+   name: dummy
+   stream: creatures
    authentication:
      type: api_key
      params:
@@ -60,7 +64,7 @@ bizon run config.yml

  Backend is the interface used by Bizon to store its state. It can be configured in the `backend` section of the configuration file. The following backends are supported:
  - `sqlite`: In-memory SQLite database, useful for testing and development.
- - `biguquery`: Google BigQuery backend, perfect for light setup & production.
+ - `bigquery`: Google BigQuery backend, perfect for light setup & production.
  - `postgres`: PostgreSQL backend, for production use and frequent cursor updates.

  ## Queue configuration
@@ -70,6 +74,13 @@ Queue is the interface used by Bizon to exchange data between `Source` and `Dest
  - `rabbitmq`: RabbitMQ, for production use and high throughput.
  - `kafka`: Apache Kafka, for production use and high throughput and strong persistence.

+ ## Runner configuration
+
+ Runner is the interface used by Bizon to run the pipeline. It can be configured in the `runner` section of the configuration file. The following runners are supported:
+ - `thread` (asynchronous)
+ - `process` (asynchronous)
+ - `stream` (synchronous)
+
  ## Start syncing your data 🚀

  ### Quick setup without any dependencies ✌️

bizon-0.1.1/bizon/alerting/alerts.py
@@ -0,0 +1,23 @@
+ from abc import ABC, abstractmethod
+ from typing import Dict, List
+
+ from loguru import logger
+
+ from bizon.alerting.models import AlertingConfig, AlertMethod, LogLevel
+
+
+ class AbstractAlert(ABC):
+
+     def __init__(self, type: AlertMethod, config: AlertingConfig, log_levels: List[LogLevel] = [LogLevel.ERROR]):
+         self.type = type
+         self.config = config
+         self.log_levels = log_levels
+
+     @abstractmethod
+     def handler(self, message: Dict) -> None:
+         pass
+
+     def add_handlers(self) -> None:
+         levels = [level.value for level in self.log_levels]
+         for level in levels:
+             logger.add(self.handler, level=level, format="{message}")
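
The base class above wires any subclass's `handler` into loguru as a sink, one registration per configured level. A minimal illustrative subclass, not part of the package, assuming only the alerting modules shown in this diff:

```python
from typing import Dict

from bizon.alerting.alerts import AbstractAlert
from bizon.alerting.models import AlertMethod, LogLevel


class PrintAlert(AbstractAlert):
    """Hypothetical alert channel for illustration: prints instead of notifying."""

    def handler(self, message: Dict) -> None:
        # loguru passes the formatted log line; metadata is available on message.record
        print(f"[ALERT] {message}")


# AlertMethod currently only defines SLACK, so this sketch reuses it
alert = PrintAlert(type=AlertMethod.SLACK, config=None, log_levels=[LogLevel.ERROR])
alert.add_handlers()  # registers PrintAlert.handler as a loguru sink at ERROR level
```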

bizon-0.1.1/bizon/alerting/models.py
@@ -0,0 +1,28 @@
+ from enum import Enum
+ from typing import List, Optional, Union
+
+ from pydantic import BaseModel
+
+ from bizon.alerting.slack.config import SlackConfig
+
+
+ class LogLevel(str, Enum):
+     DEBUG = "DEBUG"
+     INFO = "INFO"
+     WARNING = "WARNING"
+     ERROR = "ERROR"
+     CRITICAL = "CRITICAL"
+
+
+ class AlertMethod(str, Enum):
+     """Alerting methods"""
+
+     SLACK = "slack"
+
+
+ class AlertingConfig(BaseModel):
+     """Alerting configuration model"""
+
+     type: AlertMethod
+     log_levels: Optional[List[LogLevel]] = [LogLevel.ERROR]
+     config: Union[SlackConfig]

bizon-0.1.1/bizon/alerting/slack/__init__.py: file without changes

bizon-0.1.1/bizon/alerting/slack/config.py
@@ -0,0 +1,5 @@
+ from pydantic import BaseModel
+
+
+ class SlackConfig(BaseModel):
+     webhook_url: str

bizon-0.1.1/bizon/alerting/slack/handler.py
@@ -0,0 +1,39 @@
+ import os
+ from typing import Dict, List
+
+ import requests
+ from loguru import logger
+
+ from bizon.alerting.alerts import AbstractAlert, AlertMethod
+ from bizon.alerting.models import LogLevel
+ from bizon.alerting.slack.config import SlackConfig
+
+
+ class SlackHandler(AbstractAlert):
+     def __init__(self, config: SlackConfig, log_levels: List[LogLevel] = [LogLevel.ERROR]):
+         super().__init__(type=AlertMethod.SLACK, config=config, log_levels=log_levels)
+         self.webhook_url = config.webhook_url
+
+     def handler(self, message: Dict) -> None:
+         """
+         Custom handler to send error logs to Slack, with additional context.
+         """
+         log_entry = message.record
+         error_message = (
+             f"*Sync*: `{os.environ.get('BIZON_SYNC_NAME', 'N/A')}`\n"
+             f"*Source*: `{os.environ.get('BIZON_SOURCE_NAME', 'N/A')}` - `{os.environ.get('BIZON_SOURCE_STREAM', 'N/A')}`\n"  # noqa
+             f"*Destination*: `{os.environ.get('BIZON_DESTINATION_NAME', 'N/A')}`\n\n"
+             f"*Message:*\n```{log_entry['message']}```\n"
+             f"*File:* `{log_entry['file'].path}:{log_entry['line']}`\n"
+             f"*Function:* `{log_entry['function']}`\n"
+             f"*Level:* `{log_entry['level'].name}`\n"
+         )
+
+         payload = {"text": f":rotating_light: *Bizon Pipeline Alert* :rotating_light:\n\n{error_message}"}
+
+         try:
+             response = requests.post(self.webhook_url, json=payload)
+             response.raise_for_status()
+         except requests.exceptions.RequestException as e:
+             logger.error(f"Failed to send log to Slack: {e}")
+             return None
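
For comparison, the shipped handler can be wired by hand using only the classes added above; the webhook URL below is a placeholder:

```python
from loguru import logger

from bizon.alerting.models import LogLevel
from bizon.alerting.slack.config import SlackConfig
from bizon.alerting.slack.handler import SlackHandler

slack = SlackHandler(
    config=SlackConfig(webhook_url="https://hooks.slack.com/services/T000/B000/XXXX"),  # placeholder URL
    log_levels=[LogLevel.ERROR, LogLevel.CRITICAL],
)
slack.add_handlers()

logger.error("Source connection lost")  # now also POSTed to the Slack webhook
```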

bizon-0.1.1/bizon/cli/__init__.py: file without changes

{bizon-0.1.0 → bizon-0.1.1}/bizon/cli/main.py
@@ -83,7 +83,7 @@ def destination():
  @click.option(
      "--runner",
      required=False,
-     type=click.Choice(["thread", "process"]),
+     type=click.Choice(["thread", "process", "stream"]),
      default="thread",
      show_default=True,
      help="Runner type to use. Thread or Process.",
@@ -117,9 +117,13 @@ def run(
      set_runner_in_config(config=config, runner=runner)

      runner = RunnerFactory.create_from_config_dict(config=config)
-     runner.run()
+     result = runner.run()

-     click.echo("Pipeline finished.")
+     if result.is_success:
+         click.secho("Pipeline finished successfully.", fg="green")
+
+     else:
+         raise click.exceptions.ClickException(result.to_string())


  if __name__ == "__main__":
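
The same result-driven exit path is usable outside the CLI; a sketch, assuming `RunnerFactory` is exposed by `bizon.engine.engine` (that module changes in this release but its body is not shown here):

```python
import yaml

from bizon.engine.engine import RunnerFactory  # assumed import location

with open("config.yml") as f:
    config = yaml.safe_load(f)

runner = RunnerFactory.create_from_config_dict(config=config)
result = runner.run()

if not result.is_success:
    raise SystemExit(result.to_string())  # mirrors the ClickException path above
```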

{bizon-0.1.0 → bizon-0.1.1}/bizon/common/models.py
@@ -1,13 +1,21 @@
- from typing import Union
+ from typing import Optional, Union

  from pydantic import BaseModel, ConfigDict, Field

- from bizon.destinations.bigquery.src.config import BigQueryConfig
- from bizon.destinations.bigquery_streaming.src.config import BigQueryStreamingConfig
- from bizon.destinations.file.src.config import FileDestinationConfig
- from bizon.destinations.logger.src.config import LoggerConfig
+ from bizon.alerting.models import AlertingConfig
+ from bizon.connectors.destinations.bigquery.src.config import BigQueryConfig
+ from bizon.connectors.destinations.bigquery_streaming.src.config import (
+     BigQueryStreamingConfig,
+ )
+ from bizon.connectors.destinations.bigquery_streaming_v2.src.config import (
+     BigQueryStreamingV2Config,
+ )
+ from bizon.connectors.destinations.file.src.config import FileDestinationConfig
+ from bizon.connectors.destinations.logger.src.config import LoggerConfig
  from bizon.engine.config import EngineConfig
+ from bizon.monitoring.config import MonitoringConfig
  from bizon.source.config import SourceConfig, SourceSyncModes
+ from bizon.transform.config import TransformModel


  class BizonConfig(BaseModel):
@@ -23,9 +31,15 @@ class BizonConfig(BaseModel):
          default=...,
      )

+     transforms: Optional[list[TransformModel]] = Field(
+         description="List of transformations to apply to the source data",
+         default=[],
+     )
+
      destination: Union[
          BigQueryConfig,
          BigQueryStreamingConfig,
+         BigQueryStreamingV2Config,
          LoggerConfig,
          FileDestinationConfig,
      ] = Field(
@@ -39,6 +53,16 @@ class BizonConfig(BaseModel):
          default=EngineConfig(),
      )

+     alerting: Optional[AlertingConfig] = Field(
+         description="Alerting configuration",
+         default=None,
+     )
+
+     monitoring: Optional[MonitoringConfig] = Field(
+         description="Monitoring configuration",
+         default=None,
+     )
+

  class SyncMetadata(BaseModel):
      """Model which stores general metadata around a sync.
@@ -57,8 +81,8 @@ class SyncMetadata(BaseModel):
          return cls(
              name=config.name,
              job_id=job_id,
-             source_name=config.source.source_name,
-             stream_name=config.source.stream_name,
+             source_name=config.source.name,
+             stream_name=config.source.stream,
              sync_mode=config.source.sync_mode,
              destination_name=config.destination.name,
          )

{bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/destinations/bigquery/config/bigquery.example.yml
@@ -1,6 +1,8 @@
+ name: hubspot contacts to bigquery
+
  source:
    name: hubspot
-   stream_name: contacts
+   stream: contacts
    properties:
      strategy: all
    authentication:
@@ -34,6 +36,3 @@ destination:
      "client_x509_cert_url": "",
      "universe_domain": "googleapis.com"
    }
-
- pipeline:
-   log_level: DEBUG

bizon-0.1.1/bizon/connectors/destinations/bigquery/src/config.py
@@ -0,0 +1,127 @@
+ from enum import Enum
+ from typing import Literal, Optional
+
+ import polars as pl
+ from pydantic import BaseModel, Field
+
+ from bizon.destination.config import (
+     AbstractDestinationConfig,
+     AbstractDestinationDetailsConfig,
+     DestinationColumn,
+     DestinationTypes,
+ )
+
+
+ class GCSBufferFormat(str, Enum):
+     PARQUET = "parquet"
+     CSV = "csv"
+
+
+ class TimePartitioning(str, Enum):
+     DAY = "DAY"
+     HOUR = "HOUR"
+     MONTH = "MONTH"
+     YEAR = "YEAR"
+
+
+ class BigQueryColumnType(str, Enum):
+     BOOLEAN = "BOOLEAN"
+     BYTES = "BYTES"
+     DATE = "DATE"
+     DATETIME = "DATETIME"
+     FLOAT = "FLOAT"
+     FLOAT64 = "FLOAT64"
+     GEOGRAPHY = "GEOGRAPHY"
+     INTEGER = "INTEGER"
+     INT64 = "INT64"
+     NUMERIC = "NUMERIC"
+     BIGNUMERIC = "BIGNUMERIC"
+     JSON = "JSON"
+     RECORD = "RECORD"
+     STRING = "STRING"
+     TIME = "TIME"
+     TIMESTAMP = "TIMESTAMP"
+
+
+ class BigQueryColumnMode(str, Enum):
+     NULLABLE = "NULLABLE"
+     REQUIRED = "REQUIRED"
+     REPEATED = "REPEATED"
+
+
+ BIGQUERY_TO_POLARS_TYPE_MAPPING = {
+     "STRING": pl.String,
+     "BYTES": pl.Binary,
+     "INTEGER": pl.Int64,
+     "INT64": pl.Int64,
+     "FLOAT": pl.Float64,
+     "FLOAT64": pl.Float64,
+     "NUMERIC": pl.Float64,  # Can be refined for precision with Decimal128 if needed
+     "BIGNUMERIC": pl.Float64,  # Similar to NUMERIC
+     "BOOLEAN": pl.Boolean,
+     "BOOL": pl.Boolean,
+     "TIMESTAMP": pl.String,  # We use BigQuery internal parsing to convert to datetime
+     "DATE": pl.String,  # We use BigQuery internal parsing to convert to datetime
+     "DATETIME": pl.String,  # We use BigQuery internal parsing to convert to datetime
+     "TIME": pl.Time,
+     "GEOGRAPHY": pl.Object,  # Polars doesn't natively support geography types
+     "ARRAY": pl.List,  # Requires additional handling for element types
+     "JSON": pl.String,
+ }
+
+
+ class BigQueryColumn(DestinationColumn):
+     name: str = Field(..., description="Name of the column")
+     type: BigQueryColumnType = Field(..., description="Type of the column")
+     mode: BigQueryColumnMode = Field(..., description="Mode of the column")
+     description: Optional[str] = Field(None, description="Description of the column")
+     default_value_expression: Optional[str] = Field(None, description="Default value expression")
+
+     @property
+     def polars_type(self):
+         return BIGQUERY_TO_POLARS_TYPE_MAPPING.get(self.type.upper())
+
+
+ class BigQueryAuthentication(BaseModel):
+     service_account_key: str = Field(
+         description="Service Account Key JSON string. If empty it will be infered",
+         default="",
+     )
+
+
+ class BigQueryRecordSchemaConfig(BaseModel):
+     destination_id: str = Field(..., description="Destination ID")
+     record_schema: list[BigQueryColumn] = Field(..., description="Record schema")
+
+     # BigQuery Clustering Keys
+     clustering_keys: Optional[list[str]] = Field(None, description="Clustering keys")
+
+
+ class BigQueryConfigDetails(AbstractDestinationDetailsConfig):
+
+     # Table details
+     project_id: str = Field(..., description="BigQuery Project ID")
+     dataset_id: str = Field(..., description="BigQuery Dataset ID")
+     dataset_location: str = Field(default="US", description="BigQuery Dataset location")
+
+     # GCS Buffer
+     gcs_buffer_bucket: str = Field(..., description="GCS Buffer bucket")
+     gcs_buffer_format: GCSBufferFormat = Field(default=GCSBufferFormat.PARQUET, description="GCS Buffer format")
+
+     # Time partitioning
+     time_partitioning: TimePartitioning = Field(
+         default=TimePartitioning.DAY, description="BigQuery Time partitioning type"
+     )
+
+     # Schema for unnesting
+     record_schemas: Optional[list[BigQueryRecordSchemaConfig]] = Field(
+         default=None, description="Schema for the records. Required if unnest is set to true."
+     )
+
+     authentication: Optional[BigQueryAuthentication] = None
+
+
+ class BigQueryConfig(AbstractDestinationConfig):
+     name: Literal[DestinationTypes.BIGQUERY]
+     buffer_size: Optional[int] = 400
+     config: BigQueryConfigDetails
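
To make the type mapping concrete: TIMESTAMP, DATE and DATETIME deliberately resolve to `pl.String`, so BigQuery performs the datetime parsing at load time. A small sketch using only the classes defined above:

```python
import polars as pl

from bizon.connectors.destinations.bigquery.src.config import (
    BigQueryColumn,
    BigQueryColumnMode,
    BigQueryColumnType,
)

col = BigQueryColumn(
    name="created_at",
    type=BigQueryColumnType.TIMESTAMP,
    mode=BigQueryColumnMode.NULLABLE,
    description="Creation time of the record",
)

# The enum value is looked up in BIGQUERY_TO_POLARS_TYPE_MAPPING
assert col.polars_type is pl.String
```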

{bizon-0.1.0/bizon → bizon-0.1.1/bizon/connectors}/destinations/bigquery/src/destination.py
@@ -1,5 +1,4 @@
  import io
- import json
  import os
  import tempfile
  import traceback
@@ -13,18 +12,24 @@ from google.cloud.bigquery import DatasetReference, TimePartitioning
  from loguru import logger

  from bizon.common.models import SyncMetadata
- from bizon.destinations.config import NormalizationType
- from bizon.destinations.destination import AbstractDestination
+ from bizon.destination.destination import AbstractDestination
  from bizon.engine.backend.backend import AbstractBackend
  from bizon.source.config import SourceSyncModes
+ from bizon.source.source import AbstractSourceCallback

- from .config import BigQueryConfigDetails
+ from .config import BigQueryColumn, BigQueryConfigDetails


  class BigQueryDestination(AbstractDestination):

-     def __init__(self, sync_metadata: SyncMetadata, config: BigQueryConfigDetails, backend: AbstractBackend):
-         super().__init__(sync_metadata, config, backend)
+     def __init__(
+         self,
+         sync_metadata: SyncMetadata,
+         config: BigQueryConfigDetails,
+         backend: AbstractBackend,
+         source_callback: AbstractSourceCallback,
+     ):
+         super().__init__(sync_metadata, config, backend, source_callback)
          self.config: BigQueryConfigDetails = config

          if config.authentication and config.authentication.service_account_key:
@@ -44,7 +49,7 @@ class BigQueryDestination(AbstractDestination):

      @property
      def table_id(self) -> str:
-         tabled_id = self.config.table_id or f"{self.sync_metadata.source_name}_{self.sync_metadata.stream_name}"
+         tabled_id = self.destination_id or f"{self.sync_metadata.source_name}_{self.sync_metadata.stream_name}"
          return f"{self.project_id}.{self.dataset_id}.{tabled_id}"

      @property
@@ -61,28 +66,24 @@ class BigQueryDestination(AbstractDestination):

      def get_bigquery_schema(self, df_destination_records: pl.DataFrame) -> List[bigquery.SchemaField]:

-         # we keep raw data in the column source_data
-         if self.config.normalization.type == NormalizationType.NONE:
+         # Case we unnest the data
+         if self.config.unnest:
              return [
-                 bigquery.SchemaField("_source_record_id", "STRING", mode="REQUIRED"),
-                 bigquery.SchemaField("_source_timestamp", "TIMESTAMP", mode="REQUIRED"),
-                 bigquery.SchemaField("_source_data", "STRING", mode="NULLABLE"),
-                 bigquery.SchemaField("_bizon_extracted_at", "TIMESTAMP", mode="REQUIRED"),
                  bigquery.SchemaField(
-                     "_bizon_loaded_at", "TIMESTAMP", mode="REQUIRED", default_value_expression="CURRENT_TIMESTAMP()"
-                 ),
-                 bigquery.SchemaField("_bizon_id", "STRING", mode="REQUIRED"),
+                     col.name,
+                     col.type,
+                     mode=col.mode,
+                     description=col.description,
+                 )
+                 for col in self.record_schemas[self.destination_id]
              ]

-         # If normalization is tabular, we parse key / value pairs to columns
-         elif self.config.normalization.type == NormalizationType.TABULAR:
-
-             # We use the first record to infer the schema of tabular data (key / value pairs)
-             source_data_keys = list(json.loads(df_destination_records["source_data"][0]).keys())
-
-             return [bigquery.SchemaField(key, "STRING", mode="NULLABLE") for key in source_data_keys] + [
+         # Case we don't unnest the data
+         else:
+             return [
                  bigquery.SchemaField("_source_record_id", "STRING", mode="REQUIRED"),
                  bigquery.SchemaField("_source_timestamp", "TIMESTAMP", mode="REQUIRED"),
+                 bigquery.SchemaField("_source_data", "STRING", mode="NULLABLE"),
                  bigquery.SchemaField("_bizon_extracted_at", "TIMESTAMP", mode="REQUIRED"),
                  bigquery.SchemaField(
                      "_bizon_loaded_at", "TIMESTAMP", mode="REQUIRED", default_value_expression="CURRENT_TIMESTAMP()"
@@ -90,8 +91,6 @@ class BigQueryDestination(AbstractDestination):
                  bigquery.SchemaField("_bizon_id", "STRING", mode="REQUIRED"),
              ]

-         raise NotImplementedError(f"Normalization type {self.config.normalization.type} is not supported")
-
      def check_connection(self) -> bool:
          dataset_ref = DatasetReference(self.project_id, self.dataset_id)

@@ -129,6 +128,28 @@ class BigQueryDestination(AbstractDestination):

          raise NotImplementedError(f"Buffer format {self.buffer_format} is not supported")

+     @staticmethod
+     def unnest_data(df_destination_records: pl.DataFrame, record_schema: list[BigQueryColumn]) -> pl.DataFrame:
+         """Unnest the source_data field into separate columns"""
+
+         # Check if the schema matches the expected schema
+         source_data_fields = (
+             pl.DataFrame(df_destination_records["source_data"].str.json_decode(infer_schema_length=None))
+             .schema["source_data"]
+             .fields
+         )
+
+         record_schema_fields = [col.name for col in record_schema]
+
+         for field in source_data_fields:
+             assert field.name in record_schema_fields, f"Column {field.name} not found in BigQuery schema"
+
+         # Parse the JSON and unnest the fields to polar type
+         return df_destination_records.select(
+             pl.col("source_data").str.json_path_match(f"$.{col.name}").cast(col.polars_type).alias(col.name)
+             for col in record_schema
+         )
+

      def load_to_bigquery(self, gcs_file: str, df_destination_records: pl.DataFrame):
          # We always partition by the loaded_at field
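
The new `unnest_data` helper can be exercised in isolation. A sketch with a hypothetical two-column record schema, where each JSON key of `source_data` becomes a typed Polars column:

```python
import polars as pl

from bizon.connectors.destinations.bigquery.src.config import (
    BigQueryColumn,
    BigQueryColumnMode,
    BigQueryColumnType,
)
from bizon.connectors.destinations.bigquery.src.destination import BigQueryDestination

# Hypothetical schema for illustration
record_schema = [
    BigQueryColumn(name="id", type=BigQueryColumnType.INT64, mode=BigQueryColumnMode.NULLABLE),
    BigQueryColumn(name="name", type=BigQueryColumnType.STRING, mode=BigQueryColumnMode.NULLABLE),
]

df = pl.DataFrame({"source_data": ['{"id": 1, "name": "pikachu"}', '{"id": 2, "name": "eevee"}']})

unnested = BigQueryDestination.unnest_data(df, record_schema)
print(unnested)  # shape (2, 2): id cast to Int64, name kept as String
```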

bizon-0.1.1/bizon/connectors/destinations/bigquery_streaming/src/config.py
@@ -0,0 +1,56 @@
+ from enum import Enum
+ from typing import Literal, Optional
+
+ from pydantic import BaseModel, Field
+
+ from bizon.connectors.destinations.bigquery.src.config import BigQueryRecordSchemaConfig
+ from bizon.destination.config import (
+     AbstractDestinationConfig,
+     AbstractDestinationDetailsConfig,
+     DestinationTypes,
+ )
+
+
+ class TimePartitioningWindow(str, Enum):
+     DAY = "DAY"
+     HOUR = "HOUR"
+     MONTH = "MONTH"
+     YEAR = "YEAR"
+
+
+ class TimePartitioning(BaseModel):
+     type: TimePartitioningWindow = Field(default=TimePartitioningWindow.DAY, description="Time partitioning type")
+     field: Optional[str] = Field(
+         "_bizon_loaded_at", description="Field to partition by. You can use a transformation to create this field."
+     )
+
+
+ class BigQueryAuthentication(BaseModel):
+     service_account_key: str = Field(
+         description="Service Account Key JSON string. If empty it will be infered",
+         default="",
+     )
+
+
+ class BigQueryStreamingConfigDetails(AbstractDestinationDetailsConfig):
+     project_id: str
+     dataset_id: str
+     dataset_location: Optional[str] = "US"
+     time_partitioning: Optional[TimePartitioning] = Field(
+         default=TimePartitioning(type=TimePartitioningWindow.DAY, field="_bizon_loaded_at"),
+         description="BigQuery Time partitioning type",
+     )
+     authentication: Optional[BigQueryAuthentication] = None
+     bq_max_rows_per_request: Optional[int] = Field(30000, description="Max rows per buffer streaming request.")
+     record_schemas: Optional[list[BigQueryRecordSchemaConfig]] = Field(
+         default=None, description="Schema for the records. Required if unnest is set to true."
+     )
+     use_legacy_streaming_api: bool = Field(
+         default=False,
+         description="[DEPRECATED] Use the legacy streaming API. This is required for some older BigQuery versions.",
+     )
+
+
+ class BigQueryStreamingConfig(AbstractDestinationConfig):
+     name: Literal[DestinationTypes.BIGQUERY_STREAMING]
+     config: BigQueryStreamingConfigDetails
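
For illustration, the partitioning sub-model above defaults to day partitioning on `_bizon_loaded_at`; overriding only the window keeps the default field:

```python
from bizon.connectors.destinations.bigquery_streaming.src.config import (
    TimePartitioning,
    TimePartitioningWindow,
)

# Hourly partitioning; the partition field falls back to the default
partitioning = TimePartitioning(type=TimePartitioningWindow.HOUR)
assert partitioning.field == "_bizon_loaded_at"
```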