airbyte-cdk 0.50.0__py3-none-any.whl → 0.50.2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (25)
  1. airbyte_cdk/entrypoint.py +7 -0
  2. airbyte_cdk/sources/declarative/declarative_component_schema.yaml +3 -3
  3. airbyte_cdk/sources/declarative/models/declarative_component_schema.py +3 -3
  4. airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py +9 -9
  5. airbyte_cdk/sources/file_based/config/csv_format.py +42 -6
  6. airbyte_cdk/sources/file_based/file_based_source.py +4 -5
  7. airbyte_cdk/sources/file_based/file_types/csv_parser.py +114 -59
  8. airbyte_cdk/sources/file_based/stream/cursor/__init__.py +2 -2
  9. airbyte_cdk/sources/file_based/stream/cursor/{file_based_cursor.py → abstract_file_based_cursor.py} +9 -1
  10. airbyte_cdk/sources/file_based/stream/cursor/default_file_based_cursor.py +10 -10
  11. airbyte_cdk/sources/file_based/stream/default_file_based_stream.py +15 -2
  12. {airbyte_cdk-0.50.0.dist-info → airbyte_cdk-0.50.2.dist-info}/METADATA +1 -1
  13. {airbyte_cdk-0.50.0.dist-info → airbyte_cdk-0.50.2.dist-info}/RECORD +25 -24
  14. unit_tests/sources/file_based/config/test_csv_format.py +23 -0
  15. unit_tests/sources/file_based/file_types/test_csv_parser.py +50 -18
  16. unit_tests/sources/file_based/helpers.py +5 -0
  17. unit_tests/sources/file_based/in_memory_files_source.py +11 -3
  18. unit_tests/sources/file_based/scenarios/csv_scenarios.py +1254 -47
  19. unit_tests/sources/file_based/scenarios/incremental_scenarios.py +6 -5
  20. unit_tests/sources/file_based/scenarios/scenario_builder.py +8 -7
  21. unit_tests/sources/file_based/stream/test_default_file_based_cursor.py +13 -12
  22. unit_tests/sources/file_based/test_scenarios.py +30 -0
  23. {airbyte_cdk-0.50.0.dist-info → airbyte_cdk-0.50.2.dist-info}/LICENSE.txt +0 -0
  24. {airbyte_cdk-0.50.0.dist-info → airbyte_cdk-0.50.2.dist-info}/WHEEL +0 -0
  25. {airbyte_cdk-0.50.0.dist-info → airbyte_cdk-0.50.2.dist-info}/top_level.txt +0 -0
airbyte_cdk/entrypoint.py CHANGED
@@ -181,6 +181,13 @@ class AirbyteEntrypoint(object):
             return parsed_args.catalog
         return None
 
+    @classmethod
+    def extract_config(cls, args: List[str]) -> Optional[Any]:
+        parsed_args = cls.parse_args(args)
+        if hasattr(parsed_args, "config"):
+            return parsed_args.config
+        return None
+
     def _emit_queued_messages(self, source: Source) -> Iterable[AirbyteMessage]:
         if hasattr(source, "message_repository") and source.message_repository:
            yield from source.message_repository.consume_queue()
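For orientation, the new classmethod mirrors the catalog counterpart whose tail appears in the context lines above: it parses the launch arguments and returns the config path if one was passed. A minimal usage sketch (the argument values are illustrative, not taken from this diff):

from airbyte_cdk.entrypoint import AirbyteEntrypoint

# Hypothetical launch arguments; any "--config <path>" pair will do.
args = ["read", "--config", "secrets/config.json", "--catalog", "configured_catalog.json"]
config_path = AirbyteEntrypoint.extract_config(args)    # -> "secrets/config.json"
catalog_path = AirbyteEntrypoint.extract_catalog(args)   # -> "configured_catalog.json"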

airbyte_cdk/sources/declarative/declarative_component_schema.yaml CHANGED
@@ -578,9 +578,9 @@ definitions:
           - "created_at"
           - "{{ config['record_cursor'] }}"
       datetime_format:
-        title: Cursor Field Datetime Format
+        title: Outgoing Datetime Format
         description: |
-          The datetime format of the Cursor Field. Use placeholders starting with "%" to describe the format the API is using. The following placeholders are available:
+          The datetime format used to format the datetime values that are sent in outgoing requests to the API. Use placeholders starting with "%" to describe the format the API is using. The following placeholders are available:
           * **%s**: Epoch unix timestamp - `1686218963`
           * **%a**: Weekday (abbreviated) - `Sun`
           * **%A**: Weekday (full) - `Sunday`
@@ -626,7 +626,7 @@ definitions:
           - "{{ config['start_time'] }}"
       cursor_datetime_formats:
         title: Cursor Datetime Formats
-        description: The possible formats for the cursor field
+        description: The possible formats for the cursor field, in order of preference. The first format that matches the cursor field value will be used to parse it. If not provided, the `datetime_format` will be used.
         type: array
         items:
           type: string

airbyte_cdk/sources/declarative/models/declarative_component_schema.py CHANGED
@@ -810,9 +810,9 @@ class DatetimeBasedCursor(BaseModel):
     )
     datetime_format: str = Field(
         ...,
-        description="The datetime format of the Cursor Field. Use placeholders starting with \"%\" to describe the format the API is using. The following placeholders are available:\n * **%s**: Epoch unix timestamp - `1686218963`\n * **%a**: Weekday (abbreviated) - `Sun`\n * **%A**: Weekday (full) - `Sunday`\n * **%w**: Weekday (decimal) - `0` (Sunday), `6` (Saturday)\n * **%d**: Day of the month (zero-padded) - `01`, `02`, ..., `31`\n * **%b**: Month (abbreviated) - `Jan`\n * **%B**: Month (full) - `January`\n * **%m**: Month (zero-padded) - `01`, `02`, ..., `12`\n * **%y**: Year (without century, zero-padded) - `00`, `01`, ..., `99`\n * **%Y**: Year (with century) - `0001`, `0002`, ..., `9999`\n * **%H**: Hour (24-hour, zero-padded) - `00`, `01`, ..., `23`\n * **%I**: Hour (12-hour, zero-padded) - `01`, `02`, ..., `12`\n * **%p**: AM/PM indicator\n * **%M**: Minute (zero-padded) - `00`, `01`, ..., `59`\n * **%S**: Second (zero-padded) - `00`, `01`, ..., `59`\n * **%f**: Microsecond (zero-padded to 6 digits) - `000000`\n * **%z**: UTC offset - `(empty)`, `+0000`, `-04:00`\n * **%Z**: Time zone name - `(empty)`, `UTC`, `GMT`\n * **%j**: Day of the year (zero-padded) - `001`, `002`, ..., `366`\n * **%U**: Week number of the year (starting Sunday) - `00`, ..., `53`\n * **%W**: Week number of the year (starting Monday) - `00`, ..., `53`\n * **%c**: Date and time - `Tue Aug 16 21:30:00 1988`\n * **%x**: Date standard format - `08/16/1988`\n * **%X**: Time standard format - `21:30:00`\n * **%%**: Literal '%' character\n\n Some placeholders depend on the locale of the underlying system - in most cases this locale is configured as en/US. For more information see the [Python documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes).\n",
+        description="The datetime format used to format the datetime values that are sent in outgoing requests to the API. Use placeholders starting with \"%\" to describe the format the API is using. The following placeholders are available:\n * **%s**: Epoch unix timestamp - `1686218963`\n * **%a**: Weekday (abbreviated) - `Sun`\n * **%A**: Weekday (full) - `Sunday`\n * **%w**: Weekday (decimal) - `0` (Sunday), `6` (Saturday)\n * **%d**: Day of the month (zero-padded) - `01`, `02`, ..., `31`\n * **%b**: Month (abbreviated) - `Jan`\n * **%B**: Month (full) - `January`\n * **%m**: Month (zero-padded) - `01`, `02`, ..., `12`\n * **%y**: Year (without century, zero-padded) - `00`, `01`, ..., `99`\n * **%Y**: Year (with century) - `0001`, `0002`, ..., `9999`\n * **%H**: Hour (24-hour, zero-padded) - `00`, `01`, ..., `23`\n * **%I**: Hour (12-hour, zero-padded) - `01`, `02`, ..., `12`\n * **%p**: AM/PM indicator\n * **%M**: Minute (zero-padded) - `00`, `01`, ..., `59`\n * **%S**: Second (zero-padded) - `00`, `01`, ..., `59`\n * **%f**: Microsecond (zero-padded to 6 digits) - `000000`\n * **%z**: UTC offset - `(empty)`, `+0000`, `-04:00`\n * **%Z**: Time zone name - `(empty)`, `UTC`, `GMT`\n * **%j**: Day of the year (zero-padded) - `001`, `002`, ..., `366`\n * **%U**: Week number of the year (starting Sunday) - `00`, ..., `53`\n * **%W**: Week number of the year (starting Monday) - `00`, ..., `53`\n * **%c**: Date and time - `Tue Aug 16 21:30:00 1988`\n * **%x**: Date standard format - `08/16/1988`\n * **%X**: Time standard format - `21:30:00`\n * **%%**: Literal '%' character\n\n Some placeholders depend on the locale of the underlying system - in most cases this locale is configured as en/US. For more information see the [Python documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes).\n",
         examples=["%Y-%m-%dT%H:%M:%S.%f%z", "%Y-%m-%d", "%s"],
-        title="Cursor Field Datetime Format",
+        title="Outgoing Datetime Format",
     )
     start_datetime: Union[str, MinMaxDatetime] = Field(
         ...,
@@ -822,7 +822,7 @@ class DatetimeBasedCursor(BaseModel):
     )
     cursor_datetime_formats: Optional[List[str]] = Field(
         None,
-        description="The possible formats for the cursor field",
+        description="The possible formats for the cursor field, in order of preference. The first format that matches the cursor field value will be used to parse it. If not provided, the `datetime_format` will be used.",
         title="Cursor Datetime Formats",
     )
     cursor_granularity: Optional[str] = Field(

airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py CHANGED
@@ -4,7 +4,7 @@
 
 import logging
 import traceback
-from typing import List, Optional, Tuple
+from typing import TYPE_CHECKING, List, Optional, Tuple
 
 from airbyte_cdk.sources import Source
 from airbyte_cdk.sources.file_based.availability_strategy import AbstractFileBasedAvailabilityStrategy
@@ -12,14 +12,16 @@ from airbyte_cdk.sources.file_based.exceptions import CheckAvailabilityError, Fi
 from airbyte_cdk.sources.file_based.file_based_stream_reader import AbstractFileBasedStreamReader
 from airbyte_cdk.sources.file_based.remote_file import RemoteFile
 from airbyte_cdk.sources.file_based.schema_helpers import conforms_to_schema
-from airbyte_cdk.sources.file_based.stream import AbstractFileBasedStream
+
+if TYPE_CHECKING:
+    from airbyte_cdk.sources.file_based.stream import AbstractFileBasedStream
 
 
 class DefaultFileBasedAvailabilityStrategy(AbstractFileBasedAvailabilityStrategy):
     def __init__(self, stream_reader: AbstractFileBasedStreamReader):
         self.stream_reader = stream_reader
 
-    def check_availability(self, stream: AbstractFileBasedStream, logger: logging.Logger, _: Optional[Source]) -> Tuple[bool, Optional[str]]:  # type: ignore[override]
+    def check_availability(self, stream: "AbstractFileBasedStream", logger: logging.Logger, _: Optional[Source]) -> Tuple[bool, Optional[str]]:  # type: ignore[override]
         """
         Perform a connection check for the stream (verify that we can list files from the stream).
 
@@ -33,7 +35,7 @@ class DefaultFileBasedAvailabilityStrategy(AbstractFileBasedAvailabilityStrategy
         return True, None
 
     def check_availability_and_parsability(
-        self, stream: AbstractFileBasedStream, logger: logging.Logger, _: Optional[Source]
+        self, stream: "AbstractFileBasedStream", logger: logging.Logger, _: Optional[Source]
     ) -> Tuple[bool, Optional[str]]:
         """
         Perform a connection check for the stream.
@@ -51,8 +53,6 @@ class DefaultFileBasedAvailabilityStrategy(AbstractFileBasedAvailabilityStrategy
         - If the user provided a schema in the config, check that a subset of records in
           one file conform to the schema via a call to stream.conforms_to_schema(schema).
         """
-        if not isinstance(stream, AbstractFileBasedStream):
-            raise ValueError(f"Stream {stream.name} is not a file-based stream.")
         try:
             files = self._check_list_files(stream)
             self._check_extensions(stream, files)
@@ -62,7 +62,7 @@ class DefaultFileBasedAvailabilityStrategy(AbstractFileBasedAvailabilityStrategy
 
         return True, None
 
-    def _check_list_files(self, stream: AbstractFileBasedStream) -> List[RemoteFile]:
+    def _check_list_files(self, stream: "AbstractFileBasedStream") -> List[RemoteFile]:
         try:
             files = stream.list_files()
         except Exception as exc:
@@ -73,12 +73,12 @@ class DefaultFileBasedAvailabilityStrategy(AbstractFileBasedAvailabilityStrategy
 
         return files
 
-    def _check_extensions(self, stream: AbstractFileBasedStream, files: List[RemoteFile]) -> None:
+    def _check_extensions(self, stream: "AbstractFileBasedStream", files: List[RemoteFile]) -> None:
         if not all(f.extension_agrees_with_file_type(stream.config.file_type) for f in files):
             raise CheckAvailabilityError(FileBasedSourceError.EXTENSION_MISMATCH, stream=stream.name)
         return None
 
-    def _check_parse_record(self, stream: AbstractFileBasedStream, file: RemoteFile, logger: logging.Logger) -> None:
+    def _check_parse_record(self, stream: "AbstractFileBasedStream", file: RemoteFile, logger: logging.Logger) -> None:
         parser = stream.get_parser(stream.config.file_type)
 
         try:
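The change above swaps a runtime import for a TYPE_CHECKING-only import and quotes the annotations, which breaks the circular import between the availability strategy and the stream module. A generic sketch of the pattern (the helper name here is hypothetical, not from this diff):

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by type checkers, so nothing is imported at runtime.
    from airbyte_cdk.sources.file_based.stream import AbstractFileBasedStream

def describe_stream(stream: "AbstractFileBasedStream") -> str:
    # The quoted annotation is resolved lazily; the object is used normally at runtime.
    return stream.name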

airbyte_cdk/sources/file_based/config/csv_format.py CHANGED
@@ -4,9 +4,9 @@
 
 import codecs
 from enum import Enum
-from typing import Optional
+from typing import Any, Mapping, Optional, Set
 
-from pydantic import BaseModel, Field, validator
+from pydantic import BaseModel, Field, root_validator, validator
 from typing_extensions import Literal
 
 
@@ -17,6 +17,10 @@ class QuotingBehavior(Enum):
     QUOTE_NONE = "Quote None"
 
 
+DEFAULT_TRUE_VALUES = ["y", "yes", "t", "true", "on", "1"]
+DEFAULT_FALSE_VALUES = ["n", "no", "f", "false", "off", "0"]
+
+
 class CsvFormat(BaseModel):
     filetype: Literal["csv"] = "csv"
     delimiter: str = Field(
@@ -46,10 +50,34 @@
         default=QuotingBehavior.QUOTE_SPECIAL_CHARACTERS,
         description="The quoting behavior determines when a value in a row should have quote marks added around it. For example, if Quote Non-numeric is specified, while reading, quotes are expected for row values that do not contain numbers. Or for Quote All, every row value will be expecting quotes.",
     )
-
-    # Noting that the existing S3 connector had a config option newlines_in_values. This was only supported by pyarrow and not
-    # the Python csv package. It has a little adoption, but long term we should ideally phase this out because of the drawbacks
-    # of using pyarrow
+    null_values: Set[str] = Field(
+        title="Null Values",
+        default=[],
+        description="A set of case-sensitive strings that should be interpreted as null values. For example, if the value 'NA' should be interpreted as null, enter 'NA' in this field.",
+    )
+    skip_rows_before_header: int = Field(
+        title="Skip Rows Before Header",
+        default=0,
+        description="The number of rows to skip before the header row. For example, if the header row is on the 3rd row, enter 2 in this field.",
+    )
+    skip_rows_after_header: int = Field(
+        title="Skip Rows After Header", default=0, description="The number of rows to skip after the header row."
+    )
+    autogenerate_column_names: bool = Field(
+        title="Autogenerate Column Names",
+        default=False,
+        description="Whether to autogenerate column names if column_names is empty. If true, column names will be of the form “f0”, “f1”… If false, column names will be read from the first CSV row after skip_rows_before_header.",
+    )
+    true_values: Set[str] = Field(
+        title="True Values",
+        default=DEFAULT_TRUE_VALUES,
+        description="A set of case-sensitive strings that should be interpreted as true values.",
+    )
+    false_values: Set[str] = Field(
+        title="False Values",
+        default=DEFAULT_FALSE_VALUES,
+        description="A set of case-sensitive strings that should be interpreted as false values.",
+    )
 
     @validator("delimiter")
     def validate_delimiter(cls, v: str) -> str:
@@ -78,3 +106,11 @@ class CsvFormat(BaseModel):
         except LookupError:
             raise ValueError(f"invalid encoding format: {v}")
         return v
+
+    @root_validator
+    def validate_option_combinations(cls, values: Mapping[str, Any]) -> Mapping[str, Any]:
+        skip_rows_before_header = values.get("skip_rows_before_header", 0)
+        auto_generate_column_names = values.get("autogenerate_column_names", False)
+        if skip_rows_before_header > 0 and auto_generate_column_names:
+            raise ValueError("Cannot skip rows before header and autogenerate column names at the same time.")
+        return values
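Taken together, the new CsvFormat fields can be exercised roughly as follows (a sketch with made-up values; the new root validator rejects combining skip_rows_before_header > 0 with autogenerate_column_names=True):

from airbyte_cdk.sources.file_based.config.csv_format import CsvFormat

fmt = CsvFormat(
    delimiter=",",
    null_values={"NA", "N/A", ""},   # these strings become None in emitted records
    skip_rows_before_header=2,       # header sits on the third physical row
    skip_rows_after_header=1,        # e.g. a units row directly under the header
    true_values={"yes", "1"},
    false_values={"no", "0"},
)
# CsvFormat(skip_rows_before_header=2, autogenerate_column_names=True) would raise a ValueError.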

airbyte_cdk/sources/file_based/file_based_source.py CHANGED
@@ -19,12 +19,11 @@ from airbyte_cdk.sources.file_based.file_types import default_parsers
 from airbyte_cdk.sources.file_based.file_types.file_type_parser import FileTypeParser
 from airbyte_cdk.sources.file_based.schema_validation_policies import DEFAULT_SCHEMA_VALIDATION_POLICIES, AbstractSchemaValidationPolicy
 from airbyte_cdk.sources.file_based.stream import AbstractFileBasedStream, DefaultFileBasedStream
+from airbyte_cdk.sources.file_based.stream.cursor import AbstractFileBasedCursor
 from airbyte_cdk.sources.file_based.stream.cursor.default_file_based_cursor import DefaultFileBasedCursor
 from airbyte_cdk.sources.streams import Stream
 from pydantic.error_wrappers import ValidationError
 
-DEFAULT_MAX_HISTORY_SIZE = 10_000
-
 
 class FileBasedSource(AbstractSource, ABC):
     def __init__(
@@ -36,7 +35,7 @@ class FileBasedSource(AbstractSource, ABC):
         discovery_policy: AbstractDiscoveryPolicy = DefaultDiscoveryPolicy(),
         parsers: Mapping[str, FileTypeParser] = default_parsers,
         validation_policies: Mapping[str, AbstractSchemaValidationPolicy] = DEFAULT_SCHEMA_VALIDATION_POLICIES,
-        max_history_size: int = DEFAULT_MAX_HISTORY_SIZE,
+        cursor_cls: Type[AbstractFileBasedCursor] = DefaultFileBasedCursor,
     ):
         self.stream_reader = stream_reader
         self.spec_class = spec_class
@@ -46,7 +45,7 @@ class FileBasedSource(AbstractSource, ABC):
         self.validation_policies = validation_policies
         catalog = self.read_catalog(catalog_path) if catalog_path else None
         self.stream_schemas = {s.stream.name: s.stream.json_schema for s in catalog.streams} if catalog else {}
-        self.max_history_size = max_history_size
+        self.cursor_cls = cursor_cls
         self.logger = logging.getLogger(f"airbyte.{self.name}")
 
     def check_connection(self, logger: logging.Logger, config: Mapping[str, Any]) -> Tuple[bool, Optional[Any]]:
@@ -104,7 +103,7 @@ class FileBasedSource(AbstractSource, ABC):
                     discovery_policy=self.discovery_policy,
                     parsers=self.parsers,
                     validation_policy=self._validate_and_get_validation_policy(stream_config),
-                    cursor=DefaultFileBasedCursor(self.max_history_size, stream_config.days_to_sync_if_history_is_full),
+                    cursor=self.cursor_cls(stream_config),
                 )
             )
         return streams
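The constructor change turns the cursor into an injection point: instead of tuning max_history_size, a source now passes a cursor class, and each stream receives cursor=self.cursor_cls(stream_config) as shown in the hunk above. A rough sketch of how a connector might use it (MyCustomCursor is hypothetical, and the elided constructor arguments are not spelled out in this diff):

from airbyte_cdk.sources.file_based.stream.cursor import DefaultFileBasedCursor

class MyCustomCursor(DefaultFileBasedCursor):
    """Hypothetical cursor overriding pieces of the default incremental behaviour."""

# Inside a concrete FileBasedSource subclass, the cursor class is forwarded unchanged:
# super().__init__(..., cursor_cls=MyCustomCursor)   # defaults to DefaultFileBasedCursor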

airbyte_cdk/sources/file_based/file_types/csv_parser.py CHANGED
@@ -5,12 +5,13 @@
 import csv
 import json
 import logging
-from distutils.util import strtobool
-from typing import Any, Dict, Iterable, Mapping, Optional
+from functools import partial
+from io import IOBase
+from typing import Any, Callable, Dict, Iterable, List, Mapping, Optional, Set
 
 from airbyte_cdk.sources.file_based.config.csv_format import CsvFormat, QuotingBehavior
 from airbyte_cdk.sources.file_based.config.file_based_stream_config import FileBasedStreamConfig
-from airbyte_cdk.sources.file_based.exceptions import FileBasedSourceError
+from airbyte_cdk.sources.file_based.exceptions import FileBasedSourceError, RecordParseError
 from airbyte_cdk.sources.file_based.file_based_stream_reader import AbstractFileBasedStreamReader, FileReadMode
 from airbyte_cdk.sources.file_based.file_types.file_type_parser import FileTypeParser
 from airbyte_cdk.sources.file_based.remote_file import RemoteFile
@@ -34,30 +35,25 @@ class CsvParser(FileTypeParser):
         stream_reader: AbstractFileBasedStreamReader,
         logger: logging.Logger,
     ) -> Dict[str, Any]:
-        config_format = config.format.get(config.file_type) if config.format else None
-        if config_format:
-            if not isinstance(config_format, CsvFormat):
-                raise ValueError(f"Invalid format config: {config_format}")
-            dialect_name = config.name + DIALECT_NAME
-            csv.register_dialect(
-                dialect_name,
-                delimiter=config_format.delimiter,
-                quotechar=config_format.quote_char,
-                escapechar=config_format.escape_char,
-                doublequote=config_format.double_quote,
-                quoting=config_to_quoting.get(config_format.quoting_behavior, csv.QUOTE_MINIMAL),
-            )
-            with stream_reader.open_file(file, self.file_read_mode, logger) as fp:
-                # todo: the existing InMemoryFilesSource.open_file() test source doesn't currently require an encoding, but actual
-                # sources will likely require one. Rather than modify the interface now we can wait until the real use case
-                reader = csv.DictReader(fp, dialect=dialect_name)  # type: ignore
-                schema = {field.strip(): {"type": "string"} for field in next(reader)}
-                csv.unregister_dialect(dialect_name)
-                return schema
-        else:
-            with stream_reader.open_file(file, self.file_read_mode, logger) as fp:
-                reader = csv.DictReader(fp)  # type: ignore
-                return {field.strip(): {"type": "string"} for field in next(reader)}
+        config_format = config.format.get(config.file_type) if config.format else CsvFormat()
+        if not isinstance(config_format, CsvFormat):
+            raise ValueError(f"Invalid format config: {config_format}")
+        dialect_name = config.name + DIALECT_NAME
+        csv.register_dialect(
+            dialect_name,
+            delimiter=config_format.delimiter,
+            quotechar=config_format.quote_char,
+            escapechar=config_format.escape_char,
+            doublequote=config_format.double_quote,
+            quoting=config_to_quoting.get(config_format.quoting_behavior, csv.QUOTE_MINIMAL),
+        )
+        with stream_reader.open_file(file, self.file_read_mode, logger) as fp:
+            # todo: the existing InMemoryFilesSource.open_file() test source doesn't currently require an encoding, but actual
+            # sources will likely require one. Rather than modify the interface now we can wait until the real use case
+            headers = self._get_headers(fp, config_format, dialect_name)
+            schema = {field.strip(): {"type": "string"} for field in headers}
+            csv.unregister_dialect(dialect_name)
+            return schema
 
     def parse_records(
         self,
@@ -67,30 +63,28 @@
         logger: logging.Logger,
     ) -> Iterable[Dict[str, Any]]:
         schema: Mapping[str, Any] = config.input_schema  # type: ignore
-        config_format = config.format.get(config.file_type) if config.format else None
-        if config_format:
-            if not isinstance(config_format, CsvFormat):
-                raise ValueError(f"Invalid format config: {config_format}")
-            # Formats are configured individually per-stream so a unique dialect should be registered for each stream.
-            # Wwe don't unregister the dialect because we are lazily parsing each csv file to generate records
-            dialect_name = config.name + DIALECT_NAME
-            csv.register_dialect(
-                dialect_name,
-                delimiter=config_format.delimiter,
-                quotechar=config_format.quote_char,
-                escapechar=config_format.escape_char,
-                doublequote=config_format.double_quote,
-                quoting=config_to_quoting.get(config_format.quoting_behavior, csv.QUOTE_MINIMAL),
-            )
-            with stream_reader.open_file(file, self.file_read_mode, logger) as fp:
-                # todo: the existing InMemoryFilesSource.open_file() test source doesn't currently require an encoding, but actual
-                # sources will likely require one. Rather than modify the interface now we can wait until the real use case
-                reader = csv.DictReader(fp, dialect=dialect_name)  # type: ignore
-                yield from self._read_and_cast_types(reader, schema, logger)
-        else:
-            with stream_reader.open_file(file, self.file_read_mode, logger) as fp:
-                reader = csv.DictReader(fp)  # type: ignore
-                yield from self._read_and_cast_types(reader, schema, logger)
+        config_format = config.format.get(config.file_type) if config.format else CsvFormat()
+        if not isinstance(config_format, CsvFormat):
+            raise ValueError(f"Invalid format config: {config_format}")
+        # Formats are configured individually per-stream so a unique dialect should be registered for each stream.
+        # We don't unregister the dialect because we are lazily parsing each csv file to generate records
+        # This will potentially be a problem if we ever process multiple streams concurrently
+        dialect_name = config.name + DIALECT_NAME
+        csv.register_dialect(
+            dialect_name,
+            delimiter=config_format.delimiter,
+            quotechar=config_format.quote_char,
+            escapechar=config_format.escape_char,
+            doublequote=config_format.double_quote,
+            quoting=config_to_quoting.get(config_format.quoting_behavior, csv.QUOTE_MINIMAL),
+        )
+        with stream_reader.open_file(file, self.file_read_mode, logger) as fp:
+            # todo: the existing InMemoryFilesSource.open_file() test source doesn't currently require an encoding, but actual
+            # sources will likely require one. Rather than modify the interface now we can wait until the real use case
+            self._skip_rows_before_header(fp, config_format.skip_rows_before_header)
+            field_names = self._auto_generate_headers(fp, config_format) if config_format.autogenerate_column_names else None
+            reader = csv.DictReader(fp, dialect=dialect_name, fieldnames=field_names)  # type: ignore
+            yield from self._read_and_cast_types(reader, schema, config_format, logger)
 
     @property
     def file_read_mode(self) -> FileReadMode:
@@ -98,7 +92,7 @@ class CsvParser(FileTypeParser):
 
     @staticmethod
     def _read_and_cast_types(
-        reader: csv.DictReader, schema: Optional[Mapping[str, Any]], logger: logging.Logger  # type: ignore
+        reader: csv.DictReader, schema: Optional[Mapping[str, Any]], config_format: CsvFormat, logger: logging.Logger  # type: ignore
     ) -> Iterable[Dict[str, Any]]:
         """
         If the user provided a schema, attempt to cast the record values to the associated type.
@@ -107,16 +101,65 @@
         cast it to a string. Downstream, the user's validation policy will determine whether the
         record should be emitted.
         """
-        if not schema:
-            yield from reader
+        cast_fn = CsvParser._get_cast_function(schema, config_format, logger)
+        for i, row in enumerate(reader):
+            if i < config_format.skip_rows_after_header:
+                continue
+            # The row was not properly parsed if any of the values are None
+            if any(val is None for val in row.values()):
+                raise RecordParseError(FileBasedSourceError.ERROR_PARSING_RECORD)
+            else:
+                yield CsvParser._to_nullable(cast_fn(row), config_format.null_values)
 
-        else:
+    @staticmethod
+    def _get_cast_function(
+        schema: Optional[Mapping[str, Any]], config_format: CsvFormat, logger: logging.Logger
+    ) -> Callable[[Mapping[str, str]], Mapping[str, str]]:
+        # Only cast values if the schema is provided
+        if schema:
             property_types = {col: prop["type"] for col, prop in schema["properties"].items()}
-            for row in reader:
-                yield cast_types(row, property_types, logger)
+            return partial(_cast_types, property_types=property_types, config_format=config_format, logger=logger)
+        else:
+            # If no schema is provided, yield the rows as they are
+            return _no_cast
+
+    @staticmethod
+    def _to_nullable(row: Mapping[str, str], null_values: Set[str]) -> Dict[str, Optional[str]]:
+        nullable = row | {k: None if v in null_values else v for k, v in row.items()}
+        return nullable
+
+    @staticmethod
+    def _skip_rows_before_header(fp: IOBase, rows_to_skip: int) -> None:
+        """
+        Skip rows before the header. This has to be done on the file object itself, not the reader
+        """
+        for _ in range(rows_to_skip):
+            fp.readline()
+
+    def _get_headers(self, fp: IOBase, config_format: CsvFormat, dialect_name: str) -> List[str]:
+        # Note that this method assumes the dialect has already been registered if we're parsing the headers
+        if config_format.autogenerate_column_names:
+            return self._auto_generate_headers(fp, config_format)
+        else:
+            # If we're not autogenerating column names, we need to skip the rows before the header
+            self._skip_rows_before_header(fp, config_format.skip_rows_before_header)
+            # Then read the header
+            reader = csv.DictReader(fp, dialect=dialect_name)  # type: ignore
+            return next(reader)  # type: ignore
 
+    def _auto_generate_headers(self, fp: IOBase, config_format: CsvFormat) -> List[str]:
+        """
+        Generates field names as [f0, f1, ...] in the same way as pyarrow's csv reader with autogenerate_column_names=True.
+        See https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html
+        """
+        next_line = next(fp).strip()
+        number_of_columns = len(next_line.split(config_format.delimiter))  # type: ignore
+        # Reset the file pointer to the beginning of the file so that the first row is not skipped
+        fp.seek(0)
+        return [f"f{i}" for i in range(number_of_columns)]
 
-def cast_types(row: Dict[str, str], property_types: Dict[str, Any], logger: logging.Logger) -> Dict[str, Any]:
+
+def _cast_types(row: Dict[str, str], property_types: Dict[str, Any], config_format: CsvFormat, logger: logging.Logger) -> Dict[str, Any]:
     """
     Casts the values in the input 'row' dictionary according to the types defined in the JSON schema.
 
@@ -142,7 +185,7 @@ def cast_types(row: Dict[str, str], property_types: Dict[str, Any], logger: logg
 
         elif python_type == bool:
             try:
-                cast_value = strtobool(value)
+                cast_value = _value_to_bool(value, config_format.true_values, config_format.false_values)
             except ValueError:
                 warnings.append(_format_warning(key, value, prop_type))
 
@@ -178,5 +221,17 @@ def cast_types(row: Dict[str, str], property_types: Dict[str, Any], logger: logg
     return result
 
 
+def _value_to_bool(value: str, true_values: Set[str], false_values: Set[str]) -> bool:
+    if value in true_values:
+        return True
+    if value in false_values:
+        return False
+    raise ValueError(f"Value {value} is not a valid boolean value")
+
+
 def _format_warning(key: str, value: str, expected_type: Optional[Any]) -> str:
     return f"{key}: value={value},expected_type={expected_type}"
+
+
+def _no_cast(row: Mapping[str, str]) -> Mapping[str, str]:
+    return row
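To make the behaviour change concrete: strtobool accepted a fixed set of strings, whereas the new helper only honours the strings configured on the CsvFormat, and configured null markers are mapped to None after casting. An illustrative snippet against the private helpers above (internal API, shown only to document the semantics; the values are examples):

from airbyte_cdk.sources.file_based.file_types.csv_parser import CsvParser, _value_to_bool

_value_to_bool("yes", true_values={"yes", "1"}, false_values={"no", "0"})   # True
_value_to_bool("0", true_values={"yes", "1"}, false_values={"no", "0"})     # False
# Anything outside both sets raises ValueError, which surfaces as a casting warning upstream.

CsvParser._to_nullable({"id": "1", "comment": "NA"}, null_values={"NA"})
# -> {"id": "1", "comment": None}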

airbyte_cdk/sources/file_based/stream/cursor/__init__.py CHANGED
@@ -1,4 +1,4 @@
+from .abstract_file_based_cursor import AbstractFileBasedCursor
 from .default_file_based_cursor import DefaultFileBasedCursor
-from .file_based_cursor import FileBasedCursor
 
-__all__ = ["FileBasedCursor", "DefaultFileBasedCursor"]
+__all__ = ["AbstractFileBasedCursor", "DefaultFileBasedCursor"]

airbyte_cdk/sources/file_based/stream/cursor/{file_based_cursor.py → abstract_file_based_cursor.py} RENAMED
@@ -7,15 +7,23 @@ from abc import ABC, abstractmethod
 from datetime import datetime
 from typing import Any, Iterable, MutableMapping
 
+from airbyte_cdk.sources.file_based.config.file_based_stream_config import FileBasedStreamConfig
 from airbyte_cdk.sources.file_based.remote_file import RemoteFile
 from airbyte_cdk.sources.file_based.types import StreamState
 
 
-class FileBasedCursor(ABC):
+class AbstractFileBasedCursor(ABC):
     """
     Abstract base class for cursors used by file-based streams.
     """
 
+    @abstractmethod
+    def __init__(self, stream_config: FileBasedStreamConfig, **kwargs: Any):
+        """
+        Common interface for all cursors.
+        """
+        ...
+
     @abstractmethod
     def add_file(self, file: RemoteFile) -> None:
         """

airbyte_cdk/sources/file_based/stream/cursor/default_file_based_cursor.py CHANGED
@@ -4,26 +4,26 @@
 
 import logging
 from datetime import datetime, timedelta
-from typing import Iterable, MutableMapping, Optional
+from typing import Any, Iterable, MutableMapping, Optional
 
+from airbyte_cdk.sources.file_based.config.file_based_stream_config import FileBasedStreamConfig
 from airbyte_cdk.sources.file_based.remote_file import RemoteFile
-from airbyte_cdk.sources.file_based.stream.cursor.file_based_cursor import FileBasedCursor
+from airbyte_cdk.sources.file_based.stream.cursor.abstract_file_based_cursor import AbstractFileBasedCursor
 from airbyte_cdk.sources.file_based.types import StreamState
 
 
-class DefaultFileBasedCursor(FileBasedCursor):
+class DefaultFileBasedCursor(AbstractFileBasedCursor):
     DEFAULT_DAYS_TO_SYNC_IF_HISTORY_IS_FULL = 3
+    DEFAULT_MAX_HISTORY_SIZE = 10_000
     DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"
 
-    def __init__(self, max_history_size: int, days_to_sync_if_history_is_full: Optional[int]):
+    def __init__(self, stream_config: FileBasedStreamConfig, **_: Any):
+        super().__init__(stream_config)
         self._file_to_datetime_history: MutableMapping[str, str] = {}
-        self._max_history_size = max_history_size
         self._time_window_if_history_is_full = timedelta(
-            days=days_to_sync_if_history_is_full or self.DEFAULT_DAYS_TO_SYNC_IF_HISTORY_IS_FULL
+            days=stream_config.days_to_sync_if_history_is_full or self.DEFAULT_DAYS_TO_SYNC_IF_HISTORY_IS_FULL
         )
 
-        if self._max_history_size <= 0:
-            raise ValueError(f"max_history_size must be a positive integer, got {self._max_history_size}")
         if self._time_window_if_history_is_full <= timedelta():
             raise ValueError(f"days_to_sync_if_history_is_full must be a positive timedelta, got {self._time_window_if_history_is_full}")
 
@@ -37,7 +37,7 @@ class DefaultFileBasedCursor(FileBasedCursor):
 
     def add_file(self, file: RemoteFile) -> None:
         self._file_to_datetime_history[file.uri] = file.last_modified.strftime(self.DATE_TIME_FORMAT)
-        if len(self._file_to_datetime_history) > self._max_history_size:
+        if len(self._file_to_datetime_history) > self.DEFAULT_MAX_HISTORY_SIZE:
             # Get the earliest file based on its last modified date and its uri
             oldest_file = self._compute_earliest_file_in_history()
             if oldest_file:
@@ -67,7 +67,7 @@ class DefaultFileBasedCursor(FileBasedCursor):
         """
         Returns true if the state's history is full, meaning new entries will start to replace old entries.
        """
-        return len(self._file_to_datetime_history) >= self._max_history_size
+        return len(self._file_to_datetime_history) >= self.DEFAULT_MAX_HISTORY_SIZE
 
     def _should_sync_file(self, file: RemoteFile, logger: logging.Logger) -> bool:
         if file.uri in self._file_to_datetime_history:
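The net effect on construction: the cursor is now built from the stream config alone, and the history cap is a class constant rather than a constructor argument. A before/after sketch (stream_config stands for a FileBasedStreamConfig instance and is not defined here):

# 0.50.0
# cursor = DefaultFileBasedCursor(max_history_size=10_000, days_to_sync_if_history_is_full=3)

# 0.50.2
cursor = DefaultFileBasedCursor(stream_config)
assert DefaultFileBasedCursor.DEFAULT_MAX_HISTORY_SIZE == 10_000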

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py CHANGED
@@ -15,13 +15,14 @@ from airbyte_cdk.sources.file_based.exceptions import (
     FileBasedSourceError,
     InvalidSchemaError,
     MissingSchemaError,
+    RecordParseError,
     SchemaInferenceError,
     StopSyncPerValidationPolicy,
 )
 from airbyte_cdk.sources.file_based.remote_file import RemoteFile
 from airbyte_cdk.sources.file_based.schema_helpers import merge_schemas, schemaless_schema
 from airbyte_cdk.sources.file_based.stream import AbstractFileBasedStream
-from airbyte_cdk.sources.file_based.stream.cursor import FileBasedCursor
+from airbyte_cdk.sources.file_based.stream.cursor import AbstractFileBasedCursor
 from airbyte_cdk.sources.file_based.types import StreamSlice
 from airbyte_cdk.sources.streams import IncrementalMixin
 from airbyte_cdk.sources.streams.core import JsonSchema
@@ -39,7 +40,7 @@ class DefaultFileBasedStream(AbstractFileBasedStream, IncrementalMixin):
     ab_file_name_col = "_ab_source_file_url"
     airbyte_columns = [ab_last_mod_col, ab_file_name_col]
 
-    def __init__(self, cursor: FileBasedCursor, **kwargs: Any):
+    def __init__(self, cursor: AbstractFileBasedCursor, **kwargs: Any):
         super().__init__(**kwargs)
         self._cursor = cursor
 
@@ -105,6 +106,18 @@ class DefaultFileBasedStream(AbstractFileBasedStream, IncrementalMixin):
                 )
                 break
 
+            except RecordParseError:
+                # Increment line_no because the exception was raised before we could increment it
+                line_no += 1
+                yield AirbyteMessage(
+                    type=MessageType.LOG,
+                    log=AirbyteLogMessage(
+                        level=Level.ERROR,
+                        message=f"{FileBasedSourceError.ERROR_PARSING_RECORD.value} stream={self.name} file={file.uri} line_no={line_no} n_skipped={n_skipped}",
+                        stack_trace=traceback.format_exc(),
+                    ),
+                )
+
             except Exception:
                 yield AirbyteMessage(
                     type=MessageType.LOG,

{airbyte_cdk-0.50.0.dist-info → airbyte_cdk-0.50.2.dist-info}/METADATA CHANGED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: airbyte-cdk
-Version: 0.50.0
+Version: 0.50.2
 Summary: A framework for writing Airbyte Connectors.
 Home-page: https://github.com/airbytehq/airbyte
 Author: Airbyte