ingestr 0.8.4__tar.gz → 0.9.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of ingestr might be problematic. Click here for more details.

Files changed (133) hide show
  1. {ingestr-0.8.4 → ingestr-0.9.1}/PKG-INFO +14 -3
  2. {ingestr-0.8.4 → ingestr-0.9.1}/README.md +10 -0
  3. {ingestr-0.8.4 → ingestr-0.9.1}/docs/.vitepress/config.mjs +2 -0
  4. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/gsheets.md +10 -5
  5. ingestr-0.9.1/docs/supported-sources/s3.md +39 -0
  6. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/stripe.md +5 -0
  7. ingestr-0.9.1/docs/supported-sources/zendesk.md +84 -0
  8. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/factory.py +6 -0
  9. ingestr-0.9.1/ingestr/src/filesystem/__init__.py +98 -0
  10. ingestr-0.9.1/ingestr/src/filesystem/helpers.py +100 -0
  11. ingestr-0.9.1/ingestr/src/filesystem/readers.py +131 -0
  12. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/shopify/__init__.py +3 -1
  13. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/sources.py +149 -2
  14. ingestr-0.9.1/ingestr/src/version.py +1 -0
  15. ingestr-0.9.1/ingestr/src/zendesk/__init__.py +460 -0
  16. ingestr-0.9.1/ingestr/src/zendesk/helpers/__init__.py +25 -0
  17. ingestr-0.9.1/ingestr/src/zendesk/helpers/api_helpers.py +105 -0
  18. ingestr-0.9.1/ingestr/src/zendesk/helpers/credentials.py +54 -0
  19. ingestr-0.9.1/ingestr/src/zendesk/helpers/talk_api.py +118 -0
  20. ingestr-0.9.1/ingestr/src/zendesk/settings.py +57 -0
  21. {ingestr-0.8.4 → ingestr-0.9.1}/pyproject.toml +9 -0
  22. {ingestr-0.8.4 → ingestr-0.9.1}/requirements.txt +2 -1
  23. ingestr-0.8.4/ingestr/src/version.py +0 -1
  24. {ingestr-0.8.4 → ingestr-0.9.1}/.dockerignore +0 -0
  25. {ingestr-0.8.4 → ingestr-0.9.1}/.github/workflows/deploy-docs.yml +0 -0
  26. {ingestr-0.8.4 → ingestr-0.9.1}/.github/workflows/tests.yml +0 -0
  27. {ingestr-0.8.4 → ingestr-0.9.1}/.gitignore +0 -0
  28. {ingestr-0.8.4 → ingestr-0.9.1}/.python-version +0 -0
  29. {ingestr-0.8.4 → ingestr-0.9.1}/Dockerfile +0 -0
  30. {ingestr-0.8.4 → ingestr-0.9.1}/LICENSE.md +0 -0
  31. {ingestr-0.8.4 → ingestr-0.9.1}/Makefile +0 -0
  32. {ingestr-0.8.4 → ingestr-0.9.1}/docs/.vitepress/theme/custom.css +0 -0
  33. {ingestr-0.8.4 → ingestr-0.9.1}/docs/.vitepress/theme/index.js +0 -0
  34. {ingestr-0.8.4 → ingestr-0.9.1}/docs/commands/example-uris.md +0 -0
  35. {ingestr-0.8.4 → ingestr-0.9.1}/docs/commands/ingest.md +0 -0
  36. {ingestr-0.8.4 → ingestr-0.9.1}/docs/getting-started/core-concepts.md +0 -0
  37. {ingestr-0.8.4 → ingestr-0.9.1}/docs/getting-started/incremental-loading.md +0 -0
  38. {ingestr-0.8.4 → ingestr-0.9.1}/docs/getting-started/quickstart.md +0 -0
  39. {ingestr-0.8.4 → ingestr-0.9.1}/docs/getting-started/telemetry.md +0 -0
  40. {ingestr-0.8.4 → ingestr-0.9.1}/docs/index.md +0 -0
  41. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/adjust.md +0 -0
  42. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/airtable.md +0 -0
  43. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/appsflyer.md +0 -0
  44. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/bigquery.md +0 -0
  45. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/chess.md +0 -0
  46. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/csv.md +0 -0
  47. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/databricks.md +0 -0
  48. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/duckdb.md +0 -0
  49. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/facebook-ads.md +0 -0
  50. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/gorgias.md +0 -0
  51. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/hubspot.md +0 -0
  52. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/kafka.md +0 -0
  53. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/klaviyo.md +0 -0
  54. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/mongodb.md +0 -0
  55. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/mssql.md +0 -0
  56. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/mysql.md +0 -0
  57. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/notion.md +0 -0
  58. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/oracle.md +0 -0
  59. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/postgres.md +0 -0
  60. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/redshift.md +0 -0
  61. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/sap-hana.md +0 -0
  62. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/shopify.md +0 -0
  63. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/slack.md +0 -0
  64. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/snowflake.md +0 -0
  65. {ingestr-0.8.4 → ingestr-0.9.1}/docs/supported-sources/sqlite.md +0 -0
  66. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/main.py +0 -0
  67. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/.gitignore +0 -0
  68. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/adjust/_init_.py +0 -0
  69. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/adjust/helpers.py +0 -0
  70. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/airtable/__init__.py +0 -0
  71. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/appsflyer/_init_.py +0 -0
  72. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/appsflyer/client.py +0 -0
  73. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/chess/__init__.py +0 -0
  74. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/chess/helpers.py +0 -0
  75. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/chess/settings.py +0 -0
  76. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/destinations.py +0 -0
  77. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/facebook_ads/__init__.py +0 -0
  78. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/facebook_ads/exceptions.py +0 -0
  79. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/facebook_ads/helpers.py +0 -0
  80. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/facebook_ads/settings.py +0 -0
  81. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/google_sheets/README.md +0 -0
  82. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/google_sheets/__init__.py +0 -0
  83. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/google_sheets/helpers/__init__.py +0 -0
  84. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/google_sheets/helpers/api_calls.py +0 -0
  85. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/google_sheets/helpers/data_processing.py +0 -0
  86. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/gorgias/__init__.py +0 -0
  87. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/gorgias/helpers.py +0 -0
  88. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/hubspot/__init__.py +0 -0
  89. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/hubspot/helpers.py +0 -0
  90. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/hubspot/settings.py +0 -0
  91. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/kafka/__init__.py +0 -0
  92. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/kafka/helpers.py +0 -0
  93. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/klaviyo/_init_.py +0 -0
  94. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/klaviyo/client.py +0 -0
  95. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/klaviyo/helpers.py +0 -0
  96. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/mongodb/__init__.py +0 -0
  97. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/mongodb/helpers.py +0 -0
  98. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/notion/__init__.py +0 -0
  99. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/notion/helpers/__init__.py +0 -0
  100. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/notion/helpers/client.py +0 -0
  101. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/notion/helpers/database.py +0 -0
  102. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/notion/settings.py +0 -0
  103. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/shopify/exceptions.py +0 -0
  104. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/shopify/helpers.py +0 -0
  105. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/shopify/settings.py +0 -0
  106. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/slack/__init__.py +0 -0
  107. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/slack/helpers.py +0 -0
  108. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/slack/settings.py +0 -0
  109. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/sql_database/__init__.py +0 -0
  110. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/sql_database/arrow_helpers.py +0 -0
  111. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/sql_database/helpers.py +0 -0
  112. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/sql_database/override.py +0 -0
  113. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/sql_database/schema_types.py +0 -0
  114. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/stripe_analytics/__init__.py +0 -0
  115. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/stripe_analytics/helpers.py +0 -0
  116. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/stripe_analytics/settings.py +0 -0
  117. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/table_definition.py +0 -0
  118. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/telemetry/event.py +0 -0
  119. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/src/testdata/fakebqcredentials.json +0 -0
  120. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/testdata/.gitignore +0 -0
  121. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/testdata/create_replace.csv +0 -0
  122. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/testdata/delete_insert_expected.csv +0 -0
  123. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/testdata/delete_insert_part1.csv +0 -0
  124. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/testdata/delete_insert_part2.csv +0 -0
  125. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/testdata/merge_expected.csv +0 -0
  126. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/testdata/merge_part1.csv +0 -0
  127. {ingestr-0.8.4 → ingestr-0.9.1}/ingestr/testdata/merge_part2.csv +0 -0
  128. {ingestr-0.8.4 → ingestr-0.9.1}/package-lock.json +0 -0
  129. {ingestr-0.8.4 → ingestr-0.9.1}/package.json +0 -0
  130. {ingestr-0.8.4 → ingestr-0.9.1}/requirements-dev.txt +0 -0
  131. {ingestr-0.8.4 → ingestr-0.9.1}/resources/demo.gif +0 -0
  132. {ingestr-0.8.4 → ingestr-0.9.1}/resources/demo.tape +0 -0
  133. {ingestr-0.8.4 → ingestr-0.9.1}/resources/ingestr.svg +0 -0
@@ -1,11 +1,10 @@
1
1
  Metadata-Version: 2.3
2
2
  Name: ingestr
3
- Version: 0.8.4
3
+ Version: 0.9.1
4
4
  Summary: ingestr is a command-line application that ingests data from various sources and stores them in any database.
5
5
  Project-URL: Homepage, https://github.com/bruin-data/ingestr
6
6
  Project-URL: Issues, https://github.com/bruin-data/ingestr/issues
7
7
  Author-email: Burak Karakan <burak.karakan@getbruin.com>
8
- License-File: LICENSE.md
9
8
  Classifier: Development Status :: 4 - Beta
10
9
  Classifier: Environment :: Console
11
10
  Classifier: Intended Audience :: Developers
@@ -15,7 +14,6 @@ Classifier: Programming Language :: Python :: 3
15
14
  Classifier: Topic :: Database
16
15
  Requires-Python: >=3.9
17
16
  Requires-Dist: confluent-kafka>=2.3.0
18
- Requires-Dist: cx-oracle==8.3.0
19
17
  Requires-Dist: databricks-sql-connector==2.9.3
20
18
  Requires-Dist: dlt==0.5.1
21
19
  Requires-Dist: duckdb-engine==0.11.5
@@ -35,6 +33,7 @@ Requires-Dist: pyrate-limiter==3.6.1
35
33
  Requires-Dist: redshift-connector==2.1.0
36
34
  Requires-Dist: rich==13.7.1
37
35
  Requires-Dist: rudder-sdk-python==2.1.0
36
+ Requires-Dist: s3fs==2024.9.0
38
37
  Requires-Dist: snowflake-sqlalchemy==1.5.3
39
38
  Requires-Dist: sqlalchemy-bigquery==1.11.0
40
39
  Requires-Dist: sqlalchemy-hana==2.0.0
@@ -45,6 +44,8 @@ Requires-Dist: stripe==10.7.0
45
44
  Requires-Dist: tqdm==4.66.2
46
45
  Requires-Dist: typer==0.12.3
47
46
  Requires-Dist: types-requests==2.32.0.20240907
47
+ Provides-Extra: oracle
48
+ Requires-Dist: cx-oracle==8.3.0; extra == 'oracle'
48
49
  Description-Content-Type: text/markdown
49
50
 
50
51
  <div align="center">
@@ -226,6 +227,11 @@ Join our Slack community [here](https://join.slack.com/t/bruindatacommunity/shar
226
227
  <td>Notion</td>
227
228
  <td>✅</td>
228
229
  <td>-</td>
230
+ </tr>
231
+ <tr>
232
+ <td>S3</td>
233
+ <td>✅</td>
234
+ <td>-</td>
229
235
  </tr>
230
236
  <tr>
231
237
  <td>Shopify</td>
@@ -242,6 +248,11 @@ Join our Slack community [here](https://join.slack.com/t/bruindatacommunity/shar
242
248
  <td>✅</td>
243
249
  <td>-</td>
244
250
  </tr>
251
+ <tr>
252
+ <td>Zendesk</td>
253
+ <td>✅</td>
254
+ <td>-</td>
255
+ </tr>
245
256
  </table>
246
257
 
247
258
  More to come soon!
@@ -177,6 +177,11 @@ Join our Slack community [here](https://join.slack.com/t/bruindatacommunity/shar
177
177
  <td>Notion</td>
178
178
  <td>✅</td>
179
179
  <td>-</td>
180
+ </tr>
181
+ <tr>
182
+ <td>S3</td>
183
+ <td>✅</td>
184
+ <td>-</td>
180
185
  </tr>
181
186
  <tr>
182
187
  <td>Shopify</td>
@@ -193,6 +198,11 @@ Join our Slack community [here](https://join.slack.com/t/bruindatacommunity/shar
193
198
  <td>✅</td>
194
199
  <td>-</td>
195
200
  </tr>
201
+ <tr>
202
+ <td>Zendesk</td>
203
+ <td>✅</td>
204
+ <td>-</td>
205
+ </tr>
196
206
  </table>
197
207
 
198
208
  More to come soon!
@@ -97,9 +97,11 @@ export default defineConfig({
97
97
  { text: "HubSpot", link: "/supported-sources/hubspot.md" },
98
98
  { text: "Klaviyo", link: "/supported-sources/klaviyo.md" },
99
99
  { text: "Notion", link: "/supported-sources/notion.md" },
100
+ { text: "S3", link: "/supported-sources/s3.md" },
100
101
  { text: "Shopify", link: "/supported-sources/shopify.md" },
101
102
  { text: "Slack", link: "/supported-sources/slack.md" },
102
103
  { text: "Stripe", link: "/supported-sources/stripe.md" },
104
+ { text: "Zendesk", link: "/supported-sources/zendesk.md" },
103
105
  ],
104
106
  },
105
107
  ],
@@ -1,9 +1,11 @@
1
1
  # Google Sheets
2
+
2
3
  [Google Sheets](https://www.google.com/sheets/about/) is a web-based spreadsheet program that is part of Google's free, web-based Google Docs Editors suite.
3
4
 
4
5
  ingestr supports Google Sheets as a source.
5
6
 
6
7
  ## URI Format
8
+
7
9
  The URI format for Google Sheets is as follows:
8
10
 
9
11
  ```
@@ -11,11 +13,13 @@ gsheets://?credentials_path=/path/to/service/account.json
11
13
  ```
12
14
 
13
15
  Alternatively, you can use base64 encoded credentials:
16
+
14
17
  ```
15
18
  gsheets://?credentials_base64=<base64_encoded_credentials>
16
19
  ```
17
20
 
18
21
  URI parameters:
22
+
19
23
  - `credentials_path`: the path to the service account JSON file
20
24
 
21
25
  The URI is used to connect to the Google Sheets API for extracting data.
@@ -24,15 +28,16 @@ The URI is used to connect to the Google Sheets API for extracting data.
24
28
 
25
29
  Google Sheets requires a few steps to set up an integration, please follow the guide dltHub [has built here](https://dlthub.com/docs/dlt-ecosystem/verified-sources/google_sheets#setup-guide).
26
30
 
27
- Once you complete the guide, you should have a service account JSON file, and the spreadsheet ID to connect to. Let's say:
28
- - you store your JSON file in path `/path/to/file.json`
29
- - the spreadsheet you'd like to connect to is `abcdxyz`,
30
- - the sheet inside the spreadsheet is `Sheet1`
31
+ Once you complete the guide, you should have a service account JSON file and the spreadsheet ID to connect to. Let's say:
32
+
33
+ - you store your JSON file at the path `/path/to/file.json`.
34
+ - the spreadsheet you'd like to connect to has the ID `fkdUQ2bjdNfUq2CA`. For example, if your spreadsheet URL is `https://docs.google.com/spreadsheets/d/fkdUQ2bjdNfUq2CA/edit?pli=1&gid=0#gid=0`, then the spreadsheet ID is `fkdUQ2bjdNfUq2CA`.
35
+ - the sheet inside the spreadsheet is `Sheet1`.
31
36
 
32
37
  Based on this assumption, here's a sample command that will copy the data from the Google Sheets spreadsheet into a duckdb database:
33
38
 
34
39
  ```sh
35
- ingestr ingest --source-uri 'gsheets://?credentials_path=/path/to/file.json' --source-table 'abcdxyz.Sheet1' --dest-uri duckdb:///gsheets.duckdb --dest-table 'gsheets.output'
40
+ ingestr ingest --source-uri 'gsheets://?credentials_path=/path/to/file.json' --source-table 'fkdUQ2bjdNfUq2CA.Sheet1' --dest-uri duckdb:///gsheets.duckdb --dest-table 'gsheets.output'
36
41
  ```
37
42
 
38
43
  The result of this command will be a table in the \`gsheets.duckdb\` database.
@@ -0,0 +1,39 @@
1
+ # S3
2
+
3
+ [S3](https://aws.amazon.com/s3/) is a bucket for storing data in Amazon's Simple Storage Service, a cloud-based storage solution provided by AWS. S3 buckets allow users to store and retrieve data at any time from anywhere on the web.
4
+
5
+ ingestr supports S3 as a source.
6
+
7
+ ## URI Format
8
+
9
+ The URI format for S3 is as follows:
10
+
11
+ ```plaintext
12
+ s3://<bucket_name>/<path_to_file>?access_key_id=<access_key_id>&secret_access_key=<secret_access_key>
13
+ ```
14
+
15
+ URI parameters:
16
+
17
+ - `bucket_name`: The name of the bucket
18
+ - `path_to_files`: The relative path from the root of the bucket. You can find this from the S3 URI. For example, if your S3 URI is `s3://mybucket/students/students_details.csv`, then your bucket name is `mybucket` and path_to_files is `students/students_details.csv`.
19
+ - `access_key_id` and `secret_access_key` : Used for accessing S3 bucket.
20
+
21
+ ## Setting up a S3 Integration
22
+
23
+ S3 requires access_key_id and secret_access_key. Please follow the guide on dltHub to [obtain credentials](https://dlthub.com/docs/dlt-ecosystem/verified-sources/filesystem/basic#get-credentials). Once you've completed the guide, you should have an `access_key_id` and `secret_access_key`. From the S3 URI, you can extract the `bucket_name` and `path_to_files`
24
+
25
+ For example, if your access_key_id is `AKC3YOW7E`, secret_access_key is `XCtkpL5B`, bucket name is `my_bucket`, and path_to_files is `students/students_details.csv`, here's a sample command that will copy the data from the S3 bucket into a DuckDB database:
26
+
27
+ ```sh
28
+ ingestr ingest --source-uri 's3://my_bucket/students/students_details.csv?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' --source-table 'students_details' --dest-uri duckdb:///s3.duckdb --dest-table 'dest.students_details'
29
+ ```
30
+
31
+ The result of this command will be a table in the s3.duckdb database.
32
+
33
+ Below are some examples of file path patterns, each path pattern is a reference from the root of the bucket:
34
+
35
+ - `**/*.csv`: Retrieves all .csv files, regardless of how deep they are within the folder structure.
36
+ - `*.csv`: Retrieves all .csv files from the first level of a folder.
37
+ - `myFolder/**/*.jsonl`: Retrieves all .jsonl files from anywhere under myFolder.
38
+ - `myFolder/mySubFolder/users.parquet`: Retrieves the users.parquet file from mySubFolder.
39
+ - `employees.jsonl`: Retrieves the employees.jsonl file from the root level of bucket.
@@ -1,9 +1,11 @@
1
1
  # Stripe
2
+
2
3
  [Stripe](https://www.stripe.com/) is a technology company that builds economic infrastructure for the internet, providing payment processing software and APIs for e-commerce websites and mobile applications.
3
4
 
4
5
  ingestr supports Stripe as a source.
5
6
 
6
7
  ## URI Format
8
+
7
9
  The URI format for Stripe is as follows:
8
10
 
9
11
  ```plaintext
@@ -11,6 +13,7 @@ stripe://?api_key=<api-key-here>
11
13
  ```
12
14
 
13
15
  URI parameters:
16
+
14
17
  - `api_key`: the API key used for authentication with the Stripe API
15
18
 
16
19
  The URI is used to connect to the Stripe API for extracting data. More details on setting up Stripe integrations can be found [here](https://stripe.com/docs/api).
@@ -28,7 +31,9 @@ ingestr ingest --source-uri 'stripe://?api_key=sk_test_12345' --source-table 'ch
28
31
  The result of this command will be a table in the `stripe.duckdb` database with JSON columns.
29
32
 
30
33
  ## Available Tables
34
+
31
35
  Stripe source allows ingesting the following sources into separate tables:
36
+
32
37
  - `subscription`: Represents a customer's subscription to a recurring service, detailing billing cycles, plans, and status.
33
38
  - `account`: Contains information about a Stripe account, including balances, payouts, and account settings.
34
39
  - `coupon`: Stores data about discount codes or coupons that can be applied to invoices, subscriptions, or other charges.
@@ -0,0 +1,84 @@
1
+ # Zendesk
2
+
3
+ [Zendesk](https://www.zendesk.com/) is a cloud-based customer service and support platform. It offers a range of features including ticket management, self-service options, knowledgebase management, live chat, customer analytics, and conversations.
4
+
5
+ ingestr supports Zendesk as a source.
6
+
7
+ The Zendesk supports two authentication methods when connecting through ingestr:
8
+ - OAuth Token
9
+ - API Token
10
+
11
+ For all resources except chat resources, you can use either the [API Token](https://dlthub.com/docs/dlt-ecosystem/verified-sources/zendesk#grab-zendesk-support-api-token) or the Zendesk Support [OAuth Token](https://dlthub.com/docs/dlt-ecosystem/verified-sources/zendesk#zendesk-support-oauth-token) to fetch data. However, for chat resources, you must use the [OAuth Token](https://dlthub.com/docs/dlt-ecosystem/verified-sources/zendesk#zendesk-chat) specific to Zendesk Chat.
12
+
13
+ ## URI Format
14
+
15
+ The URI format for Zendesk based on the authentication method:
16
+ ### For OAuth Token Authentication:
17
+ ```plaintext
18
+ zendesk://:<oauth_token>@<sub-domain>
19
+ ```
20
+ ### For API Token Authentication:
21
+ ```plaintext
22
+ zendesk://<email>:<api_token>@<sub-domain>
23
+ ```
24
+
25
+ URI parameters:
26
+
27
+ - `subdomain`: Unique zendesk subdomain that can be found in account url. For example, if your account url is https://My_Company.zendesk.com/, then `My_Company` is your subdomain
28
+ - `email`: Email address of the user
29
+ - `api_token`: API token used for authentication with zendesk
30
+ - `oauth_token`: OAuth token used for authentication with zendesk
31
+
32
+ ## Setting up a Zendesk Integration
33
+
34
+ Zendesk requires a few steps to set up an integration, please follow the guide dltHub [has built here](https://dlthub.com/docs/dlt-ecosystem/verified-sources/zendesk#setup-guide).
35
+
36
+ Once you complete the guide, if you decide to use an OAuth Token, you should have a subdomain and an OAuth token. Let’s say your subdomain is `mycompany` and your OAuth token is `qVsbdiasVt`.
37
+
38
+ ```sh
39
+ ingestr ingest --source-uri "zendesk://:qVsbdiasVt@mycompany" \
40
+ --source-table 'tickets' \
41
+ --dest-uri 'duckdb:///zendesk.duckdb' \
42
+ --dest-table 'zendesk.tickets' \
43
+ --interval-start '2024-01-01'
44
+ ```
45
+
46
+ If you decide to use an API Token, you should have a subdomain, email, and API token. Let’s say your subdomain is `mycompany`, your email is `john@get.com`, and your API token is `nbs123`.
47
+
48
+ ```sh
49
+ ingestr ingest --source-uri "zendesk://john@get.com:nbs123@mycompany" \
50
+ --source-table 'tickets' \
51
+ --dest-uri 'duckdb:///zendesk.duckdb' \
52
+ --dest-table 'zendesk.tickets' \
53
+ --interval-start '2024-01-01'
54
+ ```
55
+
56
+ The result of this command will be a table in the `zendesk.duckdb` database.
57
+
58
+ ## Available Tables
59
+
60
+ Zendesk source allows ingesting the following sources into separate tables:
61
+
62
+ - [activities](https://developer.zendesk.com/api-reference/ticketing/tickets/activity_stream/): Retrieves ticket activities affecting the agent
63
+ - [addresses](https://developer.zendesk.com/api-reference/voice/talk-api/addresses/): Retrieves addresses information
64
+ - [agents_activity](https://developer.zendesk.com/api-reference/voice/talk-api/stats/#list-agents-activity): Retrieves activity information for agents
65
+ - [automations](https://developer.zendesk.com/api-reference/ticketing/business-rules/automations/): Retrives automations for the current account
66
+ - [brands](https://developer.zendesk.com/api-reference/ticketing/account-configuration/brands/): Retrieves all brands for your account
67
+ - [calls](https://developer.zendesk.com/api-reference/voice/talk-api/incremental_exports/#incremental-calls-export): Retrieves all calls specific to channels
68
+ - [chats](https://developer.zendesk.com/api-reference/live-chat/chat-api/incremental_export/): Retrieves available chats
69
+ - [greetings](https://developer.zendesk.com/api-reference/voice/talk-api/greetings/): Retrieves all default or customs greetings
70
+ - [groups](https://developer.zendesk.com/api-reference/ticketing/groups/groups/): Retrieves groups of support agents
71
+ - [legs_incremental](https://developer.zendesk.com/api-reference/voice/talk-api/incremental_exports/#incremental-call-legs-export): Retrieves detailed information about each agent involved in a call
72
+ - [lines](https://developer.zendesk.com/api-reference/voice/talk-api/lines/): Retrieves all available lines (phone numbers and digital lines) in your Zendesk voice account
73
+ - [organizations](https://developer.zendesk.com/api-reference/ticketing/organizations/organizations/) : Retrieves organizations (your customers can be grouped into organizations by their email domain)
74
+ - [phone_numbers](https://developer.zendesk.com/api-reference/voice/talk-api/phone_numbers/): Retrieves all available phone numbers
75
+ - [settings](https://developer.zendesk.com/api-reference/voice/talk-api/voice_settings/): Retrieves account settings related to Zendesk voice accounts
76
+ - [sla_policies](https://developer.zendesk.com/api-reference/ticketing/business-rules/sla_policies/): Retrives different sla policies.
77
+ - [targets](https://developer.zendesk.com/api-reference/ticketing/targets/targets/): Retrieves targets where as targets are data from Zendesk to external applications like Slack when a ticket is updated or created.
78
+ - [tickets](https://developer.zendesk.com/api-reference/ticketing/tickets/tickets/): Retrieves all tickets, which are the means through which customers communicate with agents
79
+ - [ticket_forms](https://developer.zendesk.com/api-reference/ticketing/tickets/ticket_forms/): Retrieves all ticket forms
80
+ - [ticket_metrics](https://developer.zendesk.com/api-reference/ticketing/tickets/ticket_metrics/): Retrieves various metrics about one or more tickets.
81
+ - [ticket_metric_events](https://developer.zendesk.com/api-reference/ticketing/tickets/ticket_metric_events/): Retrieves ticket metric events that occurred on or after the start time
82
+ - [users](https://developer.zendesk.com/api-reference/ticketing/users/users/): Retrieves all users
83
+
84
+ Use these as `--source-table` parameter in the `ingestr ingest` command.
@@ -28,10 +28,12 @@ from ingestr.src.sources import (
28
28
  LocalCsvSource,
29
29
  MongoDbSource,
30
30
  NotionSource,
31
+ S3Source,
31
32
  ShopifySource,
32
33
  SlackSource,
33
34
  SqlSource,
34
35
  StripeAnalyticsSource,
36
+ ZendeskSource,
35
37
  )
36
38
 
37
39
  SQL_SOURCE_SCHEMES = [
@@ -132,6 +134,10 @@ class SourceDestinationFactory:
132
134
  return KafkaSource()
133
135
  elif self.source_scheme == "adjust":
134
136
  return AdjustSource()
137
+ elif self.source_scheme == "zendesk":
138
+ return ZendeskSource()
139
+ elif self.source_scheme == "s3":
140
+ return S3Source()
135
141
  else:
136
142
  raise ValueError(f"Unsupported source scheme: {self.source_scheme}")
137
143
 
@@ -0,0 +1,98 @@
1
+ """Reads files in s3, gs or azure buckets using fsspec and provides convenience resources for chunked reading of various file formats"""
2
+
3
+ from typing import Iterator, List, Optional, Tuple, Union
4
+
5
+ import dlt
6
+ from dlt.sources import DltResource
7
+ from dlt.sources.credentials import FileSystemCredentials
8
+ from dlt.sources.filesystem import FileItem, FileItemDict, fsspec_filesystem, glob_files
9
+
10
+ from .helpers import (
11
+ AbstractFileSystem,
12
+ FilesystemConfigurationResource,
13
+ )
14
+ from .readers import (
15
+ ReadersSource,
16
+ _read_csv,
17
+ _read_csv_duckdb,
18
+ _read_jsonl,
19
+ _read_parquet,
20
+ )
21
+
22
+
23
+ @dlt.source(_impl_cls=ReadersSource, spec=FilesystemConfigurationResource)
24
+ def readers(
25
+ bucket_url: str,
26
+ credentials: Union[FileSystemCredentials, AbstractFileSystem],
27
+ file_glob: Optional[str] = "*",
28
+ ) -> Tuple[DltResource, ...]:
29
+ """This source provides a few resources that are chunked file readers. Readers can be further parametrized before use
30
+ read_csv(chunksize, **pandas_kwargs)
31
+ read_jsonl(chunksize)
32
+ read_parquet(chunksize)
33
+
34
+ Args:
35
+ bucket_url (str): The url to the bucket.
36
+ credentials (FileSystemCredentials | AbstractFilesystem): The credentials to the filesystem of fsspec `AbstractFilesystem` instance.
37
+ file_glob (str, optional): The filter to apply to the files in glob format. by default lists all files in bucket_url non-recursively
38
+ """
39
+ filesystem_resource = filesystem(bucket_url, credentials, file_glob=file_glob)
40
+ filesystem_resource.apply_hints(
41
+ incremental=dlt.sources.incremental("modification_date")
42
+ )
43
+ return (
44
+ filesystem_resource | dlt.transformer(name="read_csv")(_read_csv),
45
+ filesystem_resource | dlt.transformer(name="read_jsonl")(_read_jsonl),
46
+ filesystem_resource | dlt.transformer(name="read_parquet")(_read_parquet),
47
+ filesystem_resource | dlt.transformer(name="read_csv_duckdb")(_read_csv_duckdb),
48
+ )
49
+
50
+
51
+ @dlt.resource(
52
+ primary_key="file_url", spec=FilesystemConfigurationResource, standalone=True
53
+ )
54
+ def filesystem(
55
+ bucket_url: str = dlt.secrets.value,
56
+ credentials: Union[FileSystemCredentials, AbstractFileSystem] = dlt.secrets.value,
57
+ file_glob: Optional[str] = "*",
58
+ files_per_page: int = 100,
59
+ extract_content: bool = True,
60
+ ) -> Iterator[List[FileItem]]:
61
+ """This resource lists files in `bucket_url` using `file_glob` pattern. The files are yielded as FileItem which also
62
+ provide methods to open and read file data. It should be combined with transformers that further process (ie. load files)
63
+
64
+ Args:
65
+ bucket_url (str): The url to the bucket.
66
+ credentials (FileSystemCredentials | AbstractFilesystem): The credentials to the filesystem of fsspec `AbstractFilesystem` instance.
67
+ file_glob (str, optional): The filter to apply to the files in glob format. by default lists all files in bucket_url non-recursively
68
+ files_per_page (int, optional): The number of files to process at once, defaults to 100.
69
+ extract_content (bool, optional): If true, the content of the file will be extracted if
70
+ false it will return a fsspec file, defaults to False.
71
+
72
+ Returns:
73
+ Iterator[List[FileItem]]: The list of files.
74
+ """
75
+
76
+ if isinstance(credentials, AbstractFileSystem):
77
+ fs_client = credentials
78
+ else:
79
+ fs_client = fsspec_filesystem(bucket_url, credentials)[0]
80
+
81
+ files_chunk: List[FileItem] = []
82
+ for file_model in glob_files(fs_client, bucket_url, file_glob):
83
+ file_dict = FileItemDict(file_model, credentials)
84
+ if extract_content:
85
+ file_dict["file_content"] = file_dict.read_bytes()
86
+ files_chunk.append(file_dict) # type: ignore
87
+ # wait for the chunk to be full
88
+ if len(files_chunk) >= files_per_page:
89
+ yield files_chunk
90
+ files_chunk = []
91
+ if files_chunk:
92
+ yield files_chunk
93
+
94
+
95
+ read_csv = dlt.transformer(standalone=True)(_read_csv)
96
+ read_jsonl = dlt.transformer(standalone=True)(_read_jsonl)
97
+ read_parquet = dlt.transformer(standalone=True)(_read_parquet)
98
+ read_csv_duckdb = dlt.transformer(standalone=True)(_read_csv_duckdb)
@@ -0,0 +1,100 @@
1
+ """Helpers for the filesystem resource."""
2
+
3
+ from typing import Any, Dict, Iterable, List, Optional, Type, Union
4
+
5
+ import dlt
6
+ from dlt.common.configuration import resolve_type
7
+ from dlt.common.typing import TDataItem
8
+ from dlt.sources import DltResource
9
+ from dlt.sources.config import configspec, with_config
10
+ from dlt.sources.credentials import (
11
+ CredentialsConfiguration,
12
+ FilesystemConfiguration,
13
+ FileSystemCredentials,
14
+ )
15
+ from dlt.sources.filesystem import fsspec_filesystem
16
+ from fsspec import AbstractFileSystem # type: ignore
17
+
18
+
19
@configspec
class FilesystemConfigurationResource(FilesystemConfiguration):
    # Resource-level configuration for the filesystem source. Extends dlt's
    # base FilesystemConfiguration with listing/pagination options, and allows
    # an already-authorized fsspec AbstractFileSystem instance to be passed
    # directly in place of raw credentials.
    credentials: Union[FileSystemCredentials, AbstractFileSystem] = None
    file_glob: Optional[str] = "*"
    files_per_page: int = 100
    extract_content: bool = False

    @resolve_type("credentials")
    def resolve_credentials_type(self) -> Type[CredentialsConfiguration]:
        """Resolve the concrete credentials spec for the configured protocol.

        Uses the credentials class registered for ``self.protocol`` when one
        is known; otherwise falls back to an optional generic
        ``CredentialsConfiguration``. In both cases an ``AbstractFileSystem``
        instance is also accepted as-is.
        """
        # use known credentials or empty credentials for unknown protocol
        return Union[
            self.PROTOCOL_CREDENTIALS.get(self.protocol)
            or Optional[CredentialsConfiguration],
            AbstractFileSystem,
        ] # type: ignore[return-value]
34
+
35
+
36
def fsspec_from_resource(filesystem_instance: DltResource) -> AbstractFileSystem:
    """Extract authorized fsspec client from a filesystem resource.

    Args:
        filesystem_instance (DltResource): A `filesystem` resource instance
            whose resolved configuration (bucket_url and credentials) is used
            to build the client.

    Returns:
        AbstractFileSystem: An authorized fsspec filesystem client.
    """

    # Resolve config from the same secrets/config sections the resource itself
    # uses, so explicitly passed arguments and injected values combine the
    # same way they would during extraction.
    @with_config(
        spec=FilesystemConfiguration,
        sections=("sources", filesystem_instance.section, filesystem_instance.name),
    )
    def _get_fsspec(
        bucket_url: str, credentials: Optional[FileSystemCredentials]
    ) -> AbstractFileSystem:
        return fsspec_filesystem(bucket_url, credentials)[0]

    # Prefer arguments that were passed to the resource explicitly; fall back
    # to dlt config/secrets injection for anything not provided.
    return _get_fsspec(
        filesystem_instance.explicit_args.get("bucket_url", dlt.config.value),
        filesystem_instance.explicit_args.get("credentials", dlt.secrets.value),
    )
52
+
53
+
54
def add_columns(columns: List[str], rows: List[List[Any]]) -> List[Dict[str, Any]]:
    """Pair each row's values with the given column names.

    Args:
        columns (List[str]): The column names.
        rows (List[List[Any]]): The rows of raw values.

    Returns:
        List[Dict[str, Any]]: One dict per row, keyed by column name.
    """
    return [dict(zip(columns, values)) for values in rows]
69
+
70
+
71
def fetch_arrow(file_data, chunk_size: int) -> Iterable[TDataItem]:  # type: ignore
    """Stream the given DuckDB relation as arrow record batches.

    Args:
        file_data (DuckDBPyRelation): The relation created from the CSV file.
        chunk_size (int): The number of rows to read at once.

    Yields:
        Iterable[TDataItem]: Arrow record batches read from the relation.
    """
    reader = file_data.fetch_arrow_reader(batch_size=chunk_size)
    for record_batch in reader:
        yield record_batch
83
+
84
+
85
def fetch_json(file_data, chunk_size: int) -> Iterator[List[Dict[str, Any]]]:  # type: ignore
    """Fetch rows from the given DuckDB relation as lists of dicts.

    Fixes vs. previous version: this is a generator, so the return annotation
    is ``Iterator[...]`` (it was wrongly declared ``List[Dict[str, Any]]``),
    and the docstring no longer claims arrow-reader behavior.

    Args:
        file_data (DuckDBPyRelation): The relation created from the CSV file.
        chunk_size (int): The number of rows to read at once.

    Yields:
        List[Dict[str, Any]]: Batches of rows keyed by column name.
    """
    while True:
        batch = file_data.fetchmany(chunk_size)
        if not batch:
            break

        # pair each row's values with the relation's column names
        yield [dict(zip(file_data.columns, row)) for row in batch]
@@ -0,0 +1,131 @@
1
+ from typing import TYPE_CHECKING, Any, Iterator, Optional
2
+
3
+ from dlt.common import json
4
+ from dlt.common.typing import copy_sig
5
+ from dlt.sources import DltResource, DltSource, TDataItems
6
+ from dlt.sources.filesystem import FileItemDict
7
+
8
+ from .helpers import fetch_arrow, fetch_json
9
+
10
+
11
+ def _read_csv(
12
+ items: Iterator[FileItemDict], chunksize: int = 10000, **pandas_kwargs: Any
13
+ ) -> Iterator[TDataItems]:
14
+ """Reads csv file with Pandas chunk by chunk.
15
+
16
+ Args:
17
+ chunksize (int): Number of records to read in one chunk
18
+ **pandas_kwargs: Additional keyword arguments passed to Pandas.read_csv
19
+ Returns:
20
+ TDataItem: The file content
21
+ """
22
+ import pandas as pd
23
+
24
+ # apply defaults to pandas kwargs
25
+ kwargs = {**{"header": "infer", "chunksize": chunksize}, **pandas_kwargs}
26
+
27
+ for file_obj in items:
28
+ # Here we use pandas chunksize to read the file in chunks and avoid loading the whole file
29
+ # in memory.
30
+ with file_obj.open() as file:
31
+ for df in pd.read_csv(file, **kwargs):
32
+ yield df.to_dict(orient="records")
33
+
34
+
35
def _read_jsonl(
    items: Iterator[FileItemDict], chunksize: int = 1000
) -> Iterator[TDataItems]:
    """Read JSONL files line by line, yielding parsed objects in chunks.

    Args:
        items (Iterator[FileItemDict]): The file items to read.
        chunksize (int, optional): The number of JSON lines to load and yield
            at once, defaults to 1000.

    Yields:
        TDataItems: Lists of parsed JSON objects.
    """
    for file_obj in items:
        with file_obj.open() as f:
            buffered = []
            for raw_line in f:
                buffered.append(json.loadb(raw_line))
                if len(buffered) >= chunksize:
                    yield buffered
                    buffered = []
            # flush the final partial chunk, if any
            if buffered:
                yield buffered
56
+
57
+
58
def _read_parquet(
    items: Iterator[FileItemDict],
    chunksize: int = 10,
) -> Iterator[TDataItems]:
    """Read parquet files, yielding their rows in record batches.

    Args:
        items (Iterator[FileItemDict]): The file items to read.
        chunksize (int, optional): The number of rows per yielded batch,
            defaults to 10.

    Yields:
        TDataItems: Lists of row dicts from each record batch.
    """
    from pyarrow import parquet as pq

    for file_obj in items:
        with file_obj.open() as f:
            for batch in pq.ParquetFile(f).iter_batches(batch_size=chunksize):
                yield batch.to_pylist()
77
+
78
+
79
def _read_csv_duckdb(
    items: Iterator[FileItemDict],
    chunk_size: Optional[int] = 5000,
    use_pyarrow: bool = False,
    **duckdb_kwargs: Any,
) -> Iterator[TDataItems]:
    """A resource to extract data from the given CSV files.

    Uses DuckDB engine to import and cast CSV data.

    Args:
        items (Iterator[FileItemDict]): CSV files to read.
        chunk_size (Optional[int]):
            The number of rows to read at once. Defaults to 5000.
        use_pyarrow (bool):
            Whether to use `pyarrow` to read the data and designate
            data schema. If set to False (by default), JSON is used.
        duckdb_kwargs (Dict):
            Additional keyword arguments to pass to the `read_csv()`.

    Returns:
        Iterable[TDataItem]: Data items, read from the given CSV files.
    """
    import duckdb

    helper = fetch_arrow if use_pyarrow else fetch_json

    for item in items:
        with item.open() as f:
            file_data = duckdb.from_csv_auto(f, **duckdb_kwargs)  # type: ignore
            # Yield while the file handle is still open: duckdb reads from the
            # stream lazily, so fetching rows after the `with` block exits
            # could read from a closed file.
            yield from helper(file_data, chunk_size)
111
+
112
+
113
if TYPE_CHECKING:

    class ReadersSource(DltSource):
        """Typing stub that provides docstrings and signatures to the resources in the `readers` source."""

        # Signatures are copied from the private reader implementations so
        # IDEs and type checkers see the real parameters of each transformer.
        @copy_sig(_read_csv)
        def read_csv(self) -> DltResource: ...

        @copy_sig(_read_jsonl)
        def read_jsonl(self) -> DltResource: ...

        @copy_sig(_read_parquet)
        def read_parquet(self) -> DltResource: ...

        @copy_sig(_read_csv_duckdb)
        def read_csv_duckdb(self) -> DltResource: ...

else:
    # At runtime the stub adds nothing; plain DltSource is used directly.
    ReadersSource = DltSource
@@ -1,8 +1,9 @@
1
1
  """Fetches Shopify Orders and Products."""
2
2
 
3
- from typing import Iterable, Optional
3
+ from typing import Any, Dict, Iterable, Optional # noqa: F401
4
4
 
5
5
  import dlt
6
+ from dlt.common import jsonpath as jp # noqa: F401
6
7
  from dlt.common import pendulum
7
8
  from dlt.common.time import ensure_pendulum_datetime
8
9
  from dlt.common.typing import TAnyDateTime, TDataItem
@@ -12,6 +13,7 @@ from .helpers import ShopifyApi, ShopifyGraphQLApi, TOrderStatus
12
13
  from .settings import (
13
14
  DEFAULT_API_VERSION,
14
15
  DEFAULT_ITEMS_PER_PAGE,
16
+ DEFAULT_PARTNER_API_VERSION, # noqa: F401
15
17
  FIRST_DAY_OF_MILLENNIUM,
16
18
  )
17
19