ingestr 0.4.0__tar.gz → 0.5.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {ingestr-0.4.0 → ingestr-0.5.1}/Dockerfile +1 -1
- {ingestr-0.4.0 → ingestr-0.5.1}/PKG-INFO +7 -1
- {ingestr-0.4.0 → ingestr-0.5.1}/README.md +5 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/.vitepress/config.mjs +4 -1
- ingestr-0.5.1/docs/supported-sources/gsheets.md +41 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/overview.md +5 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/main.py +14 -1
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/factory.py +9 -1
- ingestr-0.5.1/ingestr/src/google_sheets/README.md +95 -0
- ingestr-0.5.1/ingestr/src/google_sheets/__init__.py +152 -0
- ingestr-0.5.1/ingestr/src/google_sheets/helpers/__init__.py +1 -0
- ingestr-0.5.1/ingestr/src/google_sheets/helpers/api_calls.py +146 -0
- ingestr-0.5.1/ingestr/src/google_sheets/helpers/data_processing.py +302 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/sources.py +46 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/sql_database/override.py +1 -0
- ingestr-0.5.1/ingestr/src/version.py +1 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/testdata/test_append.db +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/testdata/test_create_replace.db +0 -0
- ingestr-0.4.0/ingestr/testdata/test_merge_with_primary_key.db → ingestr-0.5.1/ingestr/testdata/test_delete_insert_with_timerange.db +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/testdata/test_delete_insert_without_primary_key.db +0 -0
- ingestr-0.4.0/ingestr/testdata/test_delete_insert_with_timerange.db → ingestr-0.5.1/ingestr/testdata/test_merge_with_primary_key.db +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/pyproject.toml +2 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/requirements.txt +1 -0
- ingestr-0.4.0/ingestr/src/version.py +0 -1
- {ingestr-0.4.0 → ingestr-0.5.1}/.dockerignore +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/.github/workflows/deploy-docs.yml +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/.github/workflows/docker.yml +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/.gitignore +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/LICENSE.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/Makefile +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/.vitepress/theme/custom.css +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/.vitepress/theme/index.js +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/commands/example-uris.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/commands/ingest.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/getting-started/core-concepts.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/getting-started/incremental-loading.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/getting-started/quickstart.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/getting-started/telemetry.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/index.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/bigquery.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/csv.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/databricks.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/duckdb.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/mongodb.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/mssql.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/mysql.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/notion.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/oracle.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/postgres.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/redshift.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/sap-hana.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/snowflake.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/sqlite.md +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/main_test.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/destinations.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/destinations_test.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/factory_test.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/mongodb/__init__.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/mongodb/helpers.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/notion/__init__.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/notion/helpers/__init__.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/notion/helpers/client.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/notion/helpers/database.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/notion/settings.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/sources_test.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/sql_database/__init__.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/sql_database/helpers.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/sql_database/schema_types.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/telemetry/event.py +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/testdata/fakebqcredentials.json +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/ingestr/testdata/.gitignore +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/package-lock.json +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/package.json +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/requirements-dev.txt +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/resources/demo.gif +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/resources/demo.tape +0 -0
- {ingestr-0.4.0 → ingestr-0.5.1}/resources/ingestr.svg +0 -0
{ingestr-0.4.0 → ingestr-0.5.1}/Dockerfile

@@ -10,7 +10,7 @@ ENV VIRTUAL_ENV=/usr/local
 ADD --chmod=755 https://astral.sh/uv/install.sh /install.sh
 RUN /install.sh && rm /install.sh
 
-RUN /root/.cargo/bin/uv pip install --no-cache -r requirements.txt
+RUN /root/.cargo/bin/uv pip install --system --no-cache -r requirements.txt
 
 COPY . /app
 
{ingestr-0.4.0 → ingestr-0.5.1}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.3
 Name: ingestr
-Version: 0.4.0
+Version: 0.5.1
 Summary: ingestr is a command-line application that ingests data from various sources and stores them in any database.
 Project-URL: Homepage, https://github.com/bruin-data/ingestr
 Project-URL: Issues, https://github.com/bruin-data/ingestr/issues
@@ -19,6 +19,7 @@ Requires-Dist: databricks-sql-connector==2.9.3
 Requires-Dist: dlt==0.4.8
 Requires-Dist: duckdb-engine==0.11.5
 Requires-Dist: duckdb==0.10.2
+Requires-Dist: google-api-python-client==2.130.0
 Requires-Dist: google-cloud-bigquery-storage==2.24.0
 Requires-Dist: pendulum==3.0.0
 Requires-Dist: psycopg2-binary==2.9.9
@@ -169,6 +170,11 @@ Join our Slack community [here](https://join.slack.com/t/bruindatacommunity/shar
 <tr>
     <td colspan="3" style='text-align:center;'><strong>Platforms</strong></td>
 </tr>
+<tr>
+    <td>Google Sheets</td>
+    <td>✅</td>
+    <td>❌</td>
+</tr>
 <tr>
     <td>Notion</td>
     <td>✅</td>
{ingestr-0.4.0 → ingestr-0.5.1}/README.md

@@ -128,6 +128,11 @@ Join our Slack community [here](https://join.slack.com/t/bruindatacommunity/shar
 <tr>
     <td colspan="3" style='text-align:center;'><strong>Platforms</strong></td>
 </tr>
+<tr>
+    <td>Google Sheets</td>
+    <td>✅</td>
+    <td>❌</td>
+</tr>
 <tr>
     <td>Notion</td>
     <td>✅</td>
{ingestr-0.4.0 → ingestr-0.5.1}/docs/.vitepress/config.mjs

@@ -67,7 +67,10 @@ export default defineConfig({
       {
         text: "Platforms",
         collapsed: false,
-        items: [
+        items: [
+          { text: "Google Sheets", link: "/supported-sources/gsheets.md" },
+          { text: "Notion", link: "/supported-sources/notion.md" },
+        ],
       },
     ],
   },
ingestr-0.5.1/docs/supported-sources/gsheets.md (new file)

@@ -0,0 +1,41 @@
+# Google Sheets
+[Google Sheets](https://www.google.com/sheets/about/) is a web-based spreadsheet program that is part of Google's free, web-based Google Docs Editors suite.
+
+ingestr supports Google Sheets as a source.
+
+## URI Format
+The URI format for Google Sheets is as follows:
+
+```
+gsheets://?credentials_path=/path/to/service/account.json
+```
+
+Alternatively, you can use base64 encoded credentials:
+```
+gsheets://?credentials_base64=<base64_encoded_credentials>
+```
+
+URI parameters:
+- `credentials_path`: the path to the service account JSON file
+
+The URI is used to connect to the Google Sheets API for extracting data.
+
+## Setting up a Google Sheets Integration
+
+Google Sheets requires a few steps to set up an integration; please follow the guide dltHub [has built here](https://dlthub.com/docs/dlt-ecosystem/verified-sources/google_sheets#setup-guide).
+
+Once you complete the guide, you should have a service account JSON file, and the spreadsheet ID to connect to. Let's say:
+- you store your JSON file in path `/path/to/file.json`
+- the spreadsheet you'd like to connect to is `abcdxyz`,
+- the sheet inside the spreadsheet is `Sheet1`
+
+Based on this assumption, here's a sample command that will copy the data from the Google Sheets spreadsheet into a duckdb database:
+
+```sh
+ingestr ingest --source-uri 'gsheets://?credentials_path=/path/to/file.json' --source-table 'abcdxyz.Sheet1' --dest-uri duckdb:///gsheets.duckdb --dest-table 'gsheets.output'
+```
+
+The result of this command will be a table in the `gsheets.duckdb` database.
+
+> [!CAUTION]
+> Google Sheets does not support incremental loading, which means every time you run the command, it will copy the entire spreadsheet from Google Sheets to the destination. This can be slow for large spreadsheets.
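The `credentials_base64` variant above expects the full service account JSON encoded as base64. A minimal sketch of producing that value, assuming a service account file at a placeholder path; URL-encoding the result keeps the query string intact if the encoded value contains `+` or `/`:

```python
import base64
import urllib.parse

# Placeholder path to the service account JSON file
with open("/path/to/file.json", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

# Embed the encoded value in the gsheets:// URI
uri = "gsheets://?credentials_base64=" + urllib.parse.quote(encoded)
print(uri)
```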
{ingestr-0.4.0 → ingestr-0.5.1}/docs/supported-sources/overview.md

@@ -79,6 +79,11 @@ ingestr supports the following sources and destinations:
 <tr>
     <td colspan="3" style='text-align:center;'><strong>Platforms</strong></td>
 </tr>
+<tr>
+    <td>Google Sheets</td>
+    <td>✅</td>
+    <td>❌</td>
+</tr>
 <tr>
     <td>Notion</td>
     <td>✅</td>
{ingestr-0.4.0 → ingestr-0.5.1}/ingestr/main.py

@@ -6,6 +6,7 @@ from typing import Optional
 import dlt
 import humanize
 import typer
+from dlt.common.pipeline import LoadInfo
 from dlt.common.runtime.collector import Collector, LogCollector
 from rich.console import Console
 from rich.status import Status
@@ -310,7 +311,7 @@
     ):
         loader_file_format = None
 
-    run_info = pipeline.run(
+    run_info: LoadInfo = pipeline.run(
         dlt_source,
         **destination.dlt_run_params(
             uri=dest_uri,
@@ -323,6 +324,18 @@
         else None, # type: ignore
     )
 
+    for load_package in run_info.load_packages:
+        failed_jobs = load_package.jobs["failed_jobs"]
+        if len(failed_jobs) > 0:
+            print()
+            print("[bold red]Failed jobs:[/bold red]")
+            print()
+            for job in failed_jobs:
+                print(f"[bold red] {job.job_file_info.job_id()}[/bold red]")
+                print(f" [bold yellow]Error:[/bold yellow] {job.failed_message}")
+
+            raise typer.Exit()
+
     destination.post_load()
 
     elapsedHuman = ""
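The new block above walks `run_info.load_packages` and aborts when any load job failed. The same check, factored into a hypothetical standalone helper for illustration (the field access mirrors exactly what the diff uses; this helper is not part of ingestr):

```python
from typing import List, Tuple

from dlt.common.pipeline import LoadInfo


def collect_failed_jobs(run_info: LoadInfo) -> List[Tuple[str, str]]:
    """Return (job id, error message) pairs for every failed job in a load."""
    failures: List[Tuple[str, str]] = []
    for load_package in run_info.load_packages:
        for job in load_package.jobs["failed_jobs"]:
            failures.append((job.job_file_info.job_id(), job.failed_message))
    return failures
```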
{ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/factory.py

@@ -14,7 +14,13 @@ from ingestr.src.destinations import (
     SnowflakeDestination,
     SynapseDestination,
 )
-from ingestr.src.sources import
+from ingestr.src.sources import (
+    GoogleSheetsSource,
+    LocalCsvSource,
+    MongoDbSource,
+    NotionSource,
+    SqlSource,
+)
 
 SQL_SOURCE_SCHEMES = [
     "bigquery",
@@ -83,6 +89,8 @@ class SourceDestinationFactory:
             return MongoDbSource()
         elif self.source_scheme == "notion":
             return NotionSource()
+        elif self.source_scheme == "gsheets":
+            return GoogleSheetsSource()
         else:
             raise ValueError(f"Unsupported source scheme: {self.source_scheme}")
 
ingestr-0.5.1/ingestr/src/google_sheets/README.md (new file)

@@ -0,0 +1,95 @@
+# Google Sheets
+
+## Prepare your data
+
+We recommend to use [Named Ranges](link to gsheets) to indicate which data should be extracted from a particular spreadsheet and this is how this source
+will work by default - when called without setting any other options. All the named ranges will be converted into tables named after them and stored in the
+destination.
+* You can let the spreadsheet users add and remove tables by just adding/removing the ranges, you do not need to configure the pipeline again.
+* You can indicate exactly the fragments of interest and only this data will be retrieved so it is the fastest.
+* You can name database tables by changing the range names.
+
+If you are not happy with the workflow above, you can:
+* Disable it by setting `get_named_ranges` option to False
+* Enable retrieving all sheets/tabs with `get_sheets` option set to True
+* Pass a list of ranges as supported by Google Sheets in `range_names`
+
+Note that hidden columns will be extracted.
+
+> 💡 You can load data from many spreadsheets and also rename the tables to which data is loaded. This is a standard part of `dlt`, see the `load_with_table_rename_and_multiple_spreadsheets` demo in `google_sheets_pipeline.py`
+
+### Make sure your data has headers and is a proper table
+**First row of any extracted range should contain headers**. Please make sure:
+1. The header names are strings and are unique.
+2. That all the columns that you intend to extract have a header.
+3. That data starts exactly at the origin of the range - otherwise source will remove padding but it is a waste of resources!
+
+When source detects any problems with headers or table layout **it will issue a WARNING in the log** so it makes sense to run your pipeline script manually/locally and fix all the problems.
+1. Columns without headers will be removed and not extracted!
+2. Columns with headers that do not contain any data will be removed.
+3. If there are any problems with reading headers (ie. a header is not a string, is empty or not unique): **the headers row will be extracted as data** and automatic header names will be used.
+4. Empty rows are ignored.
+5. `dlt` will normalize range names and headers into table and column names - so they may be different in the database than in google sheets. Prefer small cap names without special characters!
+
+### Data Types
+`dlt` normalizer will use the first row of data to infer types and will try to coerce following rows - creating variant columns if that is not possible. This is a standard behavior.
+**date time** and **date** types are also recognized and this happens via additional metadata that is retrieved for the first row.
+
+## Passing the spreadsheet id/url and explicit range names
+You can use either the url of your spreadsheet that you can copy from the browser ie.
+```
+https://docs.google.com/spreadsheets/d/1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4/edit?usp=sharing
+```
+or the spreadsheet id (which is a part of the url)
+```
+1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4
+```
+typically you pass it directly to the `google_spreadsheet` function
+
+**passing ranges**
+
+You can pass explicit ranges to the `google_spreadsheet`:
+1. sheet names
+2. named ranges
+3. any range in Google Sheet format ie. **sheet 1!A1:B7**
+
+
+## The `spreadsheet_info` table
+This table is repopulated after every load and keeps the information on loaded ranges:
+* id and title of the spreadsheet
+* name of the range as passed to the source
+* string representation of the loaded range
+* range above in parsed representation
+
+## Running on Airflow (and some under the hood information)
+Internally, the source loads all the data immediately in the `google_spreadsheet` before execution of the pipeline in `run`. No matter how many ranges you request, we make just two calls to the API to retrieve data. This works very well with typical scripts that create a dlt source with `google_spreadsheet` and then run it with `pipeline.run`.
+
+In case of Airflow, the source is created and executed separately. In a typical configuration where the runner is a separate machine, **this will load data twice**.
+
+**Moreover, you should not use `scc` decomposition in our Airflow helper**. It will create an instance of the source for each requested range in order to run a task that corresponds to it! Following our [Airflow deployment guide](https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer#2-modify-dag-file), this is how you should use `tasks.add_run` on `PipelineTasksGroup`:
+```python
+@dag(
+    schedule_interval='@daily',
+    start_date=pendulum.datetime(2023, 2, 1),
+    catchup=False,
+    max_active_runs=1,
+    default_args=default_task_args
+)
+def get_named_ranges():
+    tasks = PipelineTasksGroup("get_named_ranges", use_data_folder=False, wipe_local_data=True)
+
+    # import your source from pipeline script
+    from google_sheets import google_spreadsheet
+
+    pipeline = dlt.pipeline(
+        pipeline_name="get_named_ranges",
+        dataset_name="named_ranges_data",
+        destination='bigquery',
+    )
+
+    # do not use decompose to run `google_spreadsheet` in single task
+    tasks.add_run(pipeline, google_spreadsheet("1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580"), decompose="none", trigger_rule="all_done", retries=0, provide_context=True)
+```
+
+## Setup credentials
+[We recommend to use service account for any production deployments](https://dlthub.com/docs/dlt-ecosystem/verified-sources/google_sheets#google-sheets-api-authentication)
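For a non-Airflow run, a minimal local sketch of the workflow this README describes (the spreadsheet ID comes from the examples above; the range names, pipeline name and duckdb destination are placeholders, and Google credentials are assumed to be configured via dlt secrets):

```python
import dlt

from ingestr.src.google_sheets import google_spreadsheet

# Build the source with an explicit list of ranges; named-range discovery is disabled here.
source = google_spreadsheet(
    spreadsheet_url_or_id="1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4",
    range_names=["Sheet1", "sheet 1!A1:B7"],
    get_named_ranges=False,
)

pipeline = dlt.pipeline(
    pipeline_name="google_sheets_example",
    destination="duckdb",
    dataset_name="sheets_data",
)

# One table per requested range, plus the spreadsheet_info metadata table
print(pipeline.run(source))
```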
ingestr-0.5.1/ingestr/src/google_sheets/__init__.py (new file)

@@ -0,0 +1,152 @@
+"""Loads Google Sheets data from tabs, named and explicit ranges. Contains the main source functions."""
+
+from typing import Iterable, Sequence, Union
+
+import dlt
+from dlt.common import logger
+from dlt.sources import DltResource
+from dlt.sources.credentials import GcpOAuthCredentials, GcpServiceAccountCredentials
+
+from .helpers import api_calls
+from .helpers.api_calls import api_auth
+from .helpers.data_processing import (
+    get_data_types,
+    get_range_headers,
+    get_spreadsheet_id,
+    process_range,
+)
+
+
+@dlt.source
+def google_spreadsheet(
+    spreadsheet_url_or_id: str = dlt.config.value,
+    range_names: Sequence[str] = dlt.config.value,
+    credentials: Union[
+        GcpOAuthCredentials, GcpServiceAccountCredentials
+    ] = dlt.secrets.value,
+    get_sheets: bool = False,
+    get_named_ranges: bool = True,
+    max_api_retries: int = 5,
+) -> Iterable[DltResource]:
+    """
+    The source for the dlt pipeline. It returns the following resources:
+    - 1 dlt resource for every range in range_names.
+    - Optionally, dlt resources for all sheets inside the spreadsheet and all named ranges inside the spreadsheet.
+
+    Args:
+        spreadsheet_url_or_id (str): The ID or URL of the spreadsheet.
+        range_names (Sequence[str]): A list of ranges in the spreadsheet in the format used by Google Sheets. Accepts Named Ranges and Sheets (tabs) names.
+            These are the ranges to be converted into tables.
+        credentials (Union[GcpServiceAccountCredentials, GcpOAuthCredentials]): GCP credentials to the account
+            with Google Sheets API access, defined in dlt.secrets.
+        get_sheets (bool, optional): If True, load all the sheets inside the spreadsheet into the database.
+            Defaults to False.
+        get_named_ranges (bool, optional): If True, load all the named ranges inside the spreadsheet into the database.
+            Defaults to True.
+        max_api_retries (int, optional): Max number of retires to google sheets API. Actual behavior is internal to google client.
+
+    Yields:
+        Iterable[DltResource]: List of dlt resources.
+    """
+    # authenticate to the service using the helper function
+    service = api_auth(credentials, max_api_retries=max_api_retries)
+    # get spreadsheet id from url or id
+    spreadsheet_id = get_spreadsheet_id(spreadsheet_url_or_id)
+    all_range_names = set(range_names or [])
+    # if no explicit ranges, get sheets and named ranges from metadata
+    # get metadata with list of sheets and named ranges in the spreadsheet
+    sheet_names, named_ranges, spreadsheet_title = api_calls.get_known_range_names(
+        spreadsheet_id=spreadsheet_id, service=service
+    )
+    if not range_names:
+        if get_sheets:
+            all_range_names.update(sheet_names)
+        if get_named_ranges:
+            all_range_names.update(named_ranges)
+
+    # first we get all data for all the ranges (explicit or named)
+    all_range_data = api_calls.get_data_for_ranges(
+        service=service,
+        spreadsheet_id=spreadsheet_id,
+        range_names=list(all_range_names),
+    )
+    assert len(all_range_names) == len(
+        all_range_data
+    ), "Google Sheets API must return values for all requested ranges"
+
+    # get metadata for two first rows of each range
+    # first should contain headers
+    # second row contains data which we'll use to sample data types.
+    # google sheets return datetime and date types as lotus notes serial number. which is just a float so we cannot infer the correct types just from the data
+
+    # warn and remove empty ranges
+    range_data = []
+    metadata_table = []
+    for name, parsed_range, meta_range, values in all_range_data:
+        # # pass all ranges to spreadsheet info - including empty
+        # metadata_table.append(
+        #     {
+        #         "spreadsheet_id": spreadsheet_id,
+        #         "title": spreadsheet_title,
+        #         "range_name": name,
+        #         "range": str(parsed_range),
+        #         "range_parsed": parsed_range._asdict(),
+        #         "skipped": True,
+        #     }
+        # )
+        if values is None or len(values) == 0:
+            logger.warning(f"Range {name} does not contain any data. Skipping.")
+            continue
+        if len(values) == 1:
+            logger.warning(f"Range {name} contain only 1 row of data. Skipping.")
+            continue
+        if len(values[0]) == 0:
+            logger.warning(
+                f"First row of range {name} does not contain data. Skipping."
+            )
+            continue
+        # metadata_table[-1]["skipped"] = False
+        range_data.append((name, parsed_range, meta_range, values))
+
+    meta_values = api_calls.get_meta_for_ranges(
+        service, spreadsheet_id, [str(data[2]) for data in range_data]
+    )
+    for name, parsed_range, _, values in range_data:
+        logger.info(f"Processing range {parsed_range} with name {name}")
+        # here is a tricky part due to how Google Sheets API returns the metadata. We are not able to directly pair the input range names with returned metadata objects
+        # instead metadata objects are grouped by sheet names, still each group order preserves the order of input ranges
+        # so for each range we get a sheet name, we look for the metadata group for that sheet and then we consume first object on that list with pop
+        metadata = next(
+            sheet
+            for sheet in meta_values["sheets"]
+            if sheet["properties"]["title"] == parsed_range.sheet_name
+        )["data"].pop(0)
+
+        headers_metadata = metadata["rowData"][0]["values"]
+        headers = get_range_headers(headers_metadata, name)
+        if headers is None:
+            # generate automatic headers and treat the first row as data
+            headers = [f"col_{idx+1}" for idx in range(len(headers_metadata))]
+            data_row_metadata = headers_metadata
+            rows_data = values[0:]
+            logger.warning(
+                f"Using automatic headers. WARNING: first row of the range {name} will be used as data!"
+            )
+        else:
+            # first row contains headers and is skipped
+            data_row_metadata = metadata["rowData"][1]["values"]
+            rows_data = values[1:]
+
+        data_types = get_data_types(data_row_metadata)
+
+        yield dlt.resource(
+            process_range(rows_data, headers=headers, data_types=data_types),
+            name=name,
+            write_disposition="replace",
+        )
+    yield dlt.resource(
+        metadata_table,
+        write_disposition="merge",
+        name="spreadsheet_info",
+        merge_key="spreadsheet_id",
+    )
ingestr-0.5.1/ingestr/src/google_sheets/helpers/__init__.py (new file)

@@ -0,0 +1 @@
+"""Google Sheets source helpers"""
ingestr-0.5.1/ingestr/src/google_sheets/helpers/api_calls.py (new file)

@@ -0,0 +1,146 @@
+"""Contains helper functions to extract data from spreadsheet API"""
+
+from typing import Any, List, Tuple
+
+from dlt.common.exceptions import MissingDependencyException
+from dlt.common.typing import DictStrAny
+from dlt.sources.credentials import GcpCredentials, GcpOAuthCredentials
+from dlt.sources.helpers.requests.retry import DEFAULT_RETRY_STATUS
+from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential
+
+from .data_processing import ParsedRange, trim_range_top_left
+
+try:
+    from apiclient.discovery import Resource, build
+except ImportError:
+    raise MissingDependencyException("Google API Client", ["google-api-python-client"])
+
+
+def is_retry_status_code(exception: BaseException) -> bool:
+    """Retry condition on HttpError"""
+    from googleapiclient.errors import HttpError  # type: ignore
+
+    # print(f"RETRY ON {str(HttpError)} = {isinstance(exception, HttpError) and exception.resp.status in DEFAULT_RETRY_STATUS}")
+    # if isinstance(exception, HttpError):
+    #     print(exception.resp.status)
+    #     print(DEFAULT_RETRY_STATUS)
+    return (
+        isinstance(exception, HttpError)
+        and exception.resp.status in DEFAULT_RETRY_STATUS
+    )
+
+
+retry_deco = retry(
+    # Retry if it's a rate limit error (HTTP 429)
+    retry=retry_if_exception(is_retry_status_code),
+    # Use exponential backoff for the waiting time between retries, starting with 5 seconds
+    wait=wait_exponential(multiplier=1.5, min=5, max=120),
+    # Stop retrying after 10 attempts
+    stop=stop_after_attempt(10),
+    # Print out the retrying details
+    reraise=True,
+)
+
+
+def api_auth(credentials: GcpCredentials, max_api_retries: int) -> Resource:
+    """
+    Uses GCP credentials to authenticate with Google Sheets API.
+
+    Args:
+        credentials (GcpCredentials): Credentials needed to log in to GCP.
+        max_api_retries (int): Max number of retires to google sheets API. Actual behavior is internal to google client.
+
+    Returns:
+        Resource: Object needed to make API calls to Google Sheets API.
+    """
+    if isinstance(credentials, GcpOAuthCredentials):
+        credentials.auth("https://www.googleapis.com/auth/spreadsheets.readonly")
+    # Build the service object for Google sheets api.
+    service = build(
+        "sheets",
+        "v4",
+        credentials=credentials.to_native_credentials(),
+        num_retries=max_api_retries,
+    )
+    return service
+
+
+@retry_deco
+def get_meta_for_ranges(
+    service: Resource, spreadsheet_id: str, range_names: List[str]
+) -> Any:
+    """Retrieves `spreadsheet_id` cell metadata for `range_names`"""
+    return (
+        service.spreadsheets()
+        .get(
+            spreadsheetId=spreadsheet_id,
+            ranges=range_names,
+            includeGridData=True,
+        )
+        .execute()
+    )
+
+
+@retry_deco
+def get_known_range_names(
+    spreadsheet_id: str, service: Resource
+) -> Tuple[List[str], List[str], str]:
+    """
+    Retrieves spreadsheet metadata and extracts a list of sheet names and named ranges
+
+    Args:
+        spreadsheet_id (str): The ID of the spreadsheet.
+        service (Resource): Resource object used to make API calls to Google Sheets API.
+
+    Returns:
+        Tuple[List[str], List[str], str] sheet names, named ranges, spreadheet title
+    """
+    metadata = service.spreadsheets().get(spreadsheetId=spreadsheet_id).execute()
+    sheet_names: List[str] = [s["properties"]["title"] for s in metadata["sheets"]]
+    named_ranges: List[str] = [r["name"] for r in metadata.get("namedRanges", {})]
+    title: str = metadata["properties"]["title"]
+    return sheet_names, named_ranges, title
+
+
+@retry_deco
+def get_data_for_ranges(
+    service: Resource, spreadsheet_id: str, range_names: List[str]
+) -> List[Tuple[str, ParsedRange, ParsedRange, List[List[Any]]]]:
+    """
+    Calls Google Sheets API to get data in a batch. This is the most efficient way to get data for multiple ranges inside a spreadsheet.
+
+    Args:
+        service (Resource): Object to make API calls to Google Sheets.
+        spreadsheet_id (str): The ID of the spreadsheet.
+        range_names (List[str]): List of range names.
+
+    Returns:
+        List[DictStrAny]: A list of ranges with data in the same order as `range_names`
+    """
+    range_batch_resp = (
+        service.spreadsheets()
+        .values()
+        .batchGet(
+            spreadsheetId=spreadsheet_id,
+            ranges=range_names,
+            # un formatted returns typed values
+            valueRenderOption="UNFORMATTED_VALUE",
+            # will return formatted dates as a serial number
+            dateTimeRenderOption="SERIAL_NUMBER",
+        )
+        .execute()
+    )
+    # if there are not ranges to be loaded, there's no "valueRanges"
+    range_batch: List[DictStrAny] = range_batch_resp.get("valueRanges", [])
+    # trim the empty top rows and columns from the left
+    rv = []
+    for name, range_ in zip(range_names, range_batch):
+        parsed_range = ParsedRange.parse_range(range_["range"])
+        values: List[List[Any]] = range_.get("values", None)
+        if values:
+            parsed_range, values = trim_range_top_left(parsed_range, values)
+        # create a new range to get first two rows
+        meta_range = parsed_range._replace(end_row=parsed_range.start_row + 1)
+        # print(f"{name}:{parsed_range}:{meta_range}")
+        rv.append((name, parsed_range, meta_range, values))
+    return rv
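The `retry_deco` defined above only retries when the raised exception is an `HttpError` with a retryable status code. A standalone sketch of the same tenacity pattern, using a hypothetical error type in place of the Google client so it can run without the Sheets API:

```python
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class FakeHttpError(Exception):
    """Stand-in for googleapiclient.errors.HttpError in this sketch."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(exc: BaseException) -> bool:
    # Retry only on transient HTTP status codes
    return isinstance(exc, FakeHttpError) and exc.status in RETRYABLE_STATUS


@retry(
    retry=retry_if_exception(is_retryable),
    wait=wait_exponential(multiplier=1.5, min=5, max=120),
    stop=stop_after_attempt(10),
    reraise=True,
)
def flaky_call(responses: list) -> str:
    # Pops the next simulated status code; raises on anything but 200
    status = responses.pop(0)
    if status != 200:
        raise FakeHttpError(status)
    return "ok"
```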
ingestr-0.5.1/ingestr/src/google_sheets/helpers/data_processing.py (new file)

@@ -0,0 +1,302 @@
+"""This is a helper module that contains function which validate and process data"""
+
+import re
+from typing import Any, Iterator, List, NamedTuple, Tuple, Union
+
+import dlt
+from dlt.common import logger, pendulum
+from dlt.common.data_types import TDataType
+from dlt.common.typing import DictStrAny
+
+# this string comes before the id
+URL_ID_IDENTIFIER = "d"
+# time info
+SECONDS_PER_DAY = 86400
+# TIMEZONE info
+DLT_TIMEZONE = "UTC"
+# number of seconds from UNIX timestamp origin (1st Jan 1970) to serial number origin (30th Dec 1899)
+TIMESTAMP_CONST = -2209161600.0
+# compiled regex to extract ranges
+RE_PARSE_RANGE = re.compile(
+    r"^(?:(?P<sheet>[\'\w\s]+)!)?(?P<start_col>[A-Z]+)(?P<start_row>\d+):(?P<end_col>[A-Z]+)(?P<end_row>\d+)$"
+)
+
+
+class ParsedRange(NamedTuple):
+    sheet_name: str
+    start_col: str
+    start_row: int
+    end_col: str
+    end_row: int
+
+    @classmethod
+    def parse_range(cls, s: str) -> "ParsedRange":
+        match = RE_PARSE_RANGE.match(s)
+        if match:
+            parsed_dict = match.groupdict()
+            return ParsedRange(
+                parsed_dict["sheet"].strip("'"),
+                parsed_dict["start_col"],
+                int(parsed_dict["start_row"]),
+                parsed_dict["end_col"],
+                int(parsed_dict["end_row"]),
+            )
+        else:
+            raise ValueError(s)
+
+    def __str__(self) -> str:
+        return f"{self.sheet_name}!{self.start_col}{self.start_row}:{self.end_col}{self.end_row}"
+
+    @staticmethod
+    def shift_column(col: str, shift: int) -> str:
+        """
+        Shift a Google Sheets column string by a given number of positions.
+
+        Parameters:
+            col (str): The original column string.
+            shift (int): The number of positions to shift the column.
+
+        Returns:
+            str: The new column string after shifting.
+        """
+        # Convert column string to column index (1-indexed)
+        col_num = 0
+        for i, char in enumerate(reversed(col)):
+            col_num += (ord(char.upper()) - 65 + 1) * (26**i)
+
+        # Shift the column index
+        col_num += shift
+
+        # Convert back to column string
+        col_str = ""
+        while col_num > 0:
+            col_num, remainder = divmod(col_num - 1, 26)
+            col_str = chr(65 + remainder) + col_str
+
+        return col_str
+
+
+def get_spreadsheet_id(url_or_id: str) -> str:
+    """
+    Receives an ID or URL to a Google Spreadsheet and returns the spreadsheet ID as a string.
+
+    Args:
+        url_or_id (str): The ID or URL of the spreadsheet.
+
+    Returns:
+        str: The spreadsheet ID as a string.
+    """
+
+    # check if this is an url: http or https in it
+    if re.match(r"http://|https://", url_or_id):
+        # process url
+        spreadsheet_id = extract_spreadsheet_id_from_url(url_or_id)
+        return spreadsheet_id
+    else:
+        # just return id
+        return url_or_id
+
+
+def extract_spreadsheet_id_from_url(url: str) -> str:
+    """
+    Takes a URL to a Google spreadsheet and computes the spreadsheet ID from it according to the spreadsheet URL formula: https://docs.google.com/spreadsheets/d/<spreadsheet_id>/edit.
+    If the URL is not formatted correctly, a ValueError will be raised.
+
+    Args:
+        url (str): The URL to the spreadsheet.
+
+    Returns:
+        str: The spreadsheet ID as a string.
+
+    Raises:
+        ValueError: If the URL is not properly formatted.
+    """
+
+    # split on the '/'
+    parts = url.split("/")
+    # loop through parts
+    for i in range(len(parts)):
+        if parts[i] == URL_ID_IDENTIFIER and i + 1 < len(parts):
+            # if the id part is left empty then the url is not formatted correctly
+            if parts[i + 1] == "":
+                raise ValueError(f"Spreadsheet ID is an empty string in url: {url}")
+            else:
+                return parts[i + 1]
+    raise ValueError(f"Invalid URL. Cannot find spreadsheet ID in url: {url}")
+
+
+def get_range_headers(headers_metadata: List[DictStrAny], range_name: str) -> List[str]:
+    """
+    Retrieves the headers for columns from the metadata of a range.
+
+    Args:
+        headers_metadata (List[DictStrAny]): Metadata for the first 2 rows of a range.
+        range_name (str): The name of the range as appears in the metadata.
+
+    Returns:
+        List[str]: A list of headers.
+    """
+    headers = []
+    for idx, header in enumerate(headers_metadata):
+        header_val: str = None
+        if header:
+            if "stringValue" in header.get("effectiveValue", {}):
+                header_val = header["formattedValue"]
+            else:
+                header_val = header.get("formattedValue", None)
+                # if there's no formatted value then the cell is empty (no empty string as well!) in that case add auto name and move on
+                if header_val is None:
+                    header_val = str(f"col_{idx + 1}")
+                else:
+                    logger.warning(
+                        f"In range {range_name}, header value: {header_val} at position {idx+1} is not a string!"
+                    )
+                    return None
+        else:
+            logger.warning(
+                f"In range {range_name}, header at position {idx+1} is not missing!"
+            )
+            return None
+        headers.append(header_val)
+
+    # make sure that headers are unique, first normalize the headers
+    header_mappings = {
+        h: dlt.current.source_schema().naming.normalize_identifier(h) for h in headers
+    }
+    if len(set(header_mappings.values())) != len(headers):
+        logger.warning(
+            "Header names must be unique otherwise you risk that data in columns with duplicate header names to be lost. Note that several destinations require "
+            + "that column names are normalized ie. must be lower or upper case and without special characters. dlt normalizes those names for you but it may "
+            + f"result in duplicate column names. Headers in range {range_name} are mapped as follows: "
+            + ", ".join([f"{k}->{v}" for k, v in header_mappings.items()])
+            + ". Please use make your header names unique."
+        )
+        return None
+
+    return headers
+
+
+def get_data_types(data_row_metadata: List[DictStrAny]) -> List[TDataType]:
+    """
+    Determines if each column in the first line of a range contains datetime objects.
+
+    Args:
+        data_row_metadata (List[DictStrAny]): Metadata of the first row of data
+
+    Returns:
+        List[TDataType]: "timestamp" or "data" indicating the date/time type for a column, otherwise None
+    """
+
+    # get data for 1st column and process them, if empty just return an empty list
+    try:
+        data_types: List[TDataType] = [None] * len(data_row_metadata)
+        for idx, val_dict in enumerate(data_row_metadata):
+            try:
+                data_type = val_dict["effectiveFormat"]["numberFormat"]["type"]
+                if data_type in ["DATE_TIME", "TIME"]:
+                    data_types[idx] = "timestamp"
+                elif data_type == "DATE":
+                    data_types[idx] = "date"
+            except KeyError:
+                pass
+        return data_types
+    except IndexError:
+        return []
+
+
+def serial_date_to_datetime(
+    serial_number: Union[int, float], data_type: TDataType
+) -> Union[pendulum.DateTime, pendulum.Date]:
+    """
+    Converts a serial number to a datetime (if input is float) or date (if input is int).
+
+    Args:
+        serial_number (Union[int, float, str, bool]): The Lotus Notes serial number
+
+    Returns:
+        Union[pendulum.DateTime, str, bool]: The converted datetime object, or the original value if conversion fails.
+    """
+    # To get the seconds passed since the start date of serial numbers we round the product of the number of seconds in a day and the serial number
+    conv_datetime: pendulum.DateTime = pendulum.from_timestamp(
+        0, DLT_TIMEZONE
+    ) + pendulum.duration(
+        seconds=TIMESTAMP_CONST + round(SECONDS_PER_DAY * serial_number)
+    )
+    # int values are dates, float values are datetimes
+    if data_type == "date":
+        return conv_datetime.date()  # type: ignore[no-any-return]
+
+    return conv_datetime
+
+
+def process_range(
+    sheet_values: List[List[Any]], headers: List[str], data_types: List[TDataType]
+) -> Iterator[DictStrAny]:
+    """
+    Yields lists of values as dictionaries, converts data times and handles empty rows and cells. Please note:
+    1. empty rows get ignored
+    2. empty cells are converted to None (and then to NULL by dlt)
+    3. data in columns without headers will be dropped
+
+    Args:
+        sheet_val (List[List[Any]]): range values without the header row
+        headers (List[str]): names of the headers
+        data_types: List[TDataType]: "timestamp" and "date" or None for each column
+
+    Yields:
+        DictStrAny: A dictionary version of the table. It generates a dictionary of the type {header: value} for every row.
+    """
+
+    for row in sheet_values:
+        # empty row; skip
+        if not row:
+            continue
+        table_dict = {}
+        # process both rows and check for differences to spot dates
+        for val, header, data_type in zip(row, headers, data_types):
+            # 3 main cases: null cell value, datetime value, every other value
+            # handle null values properly. Null cell values are returned as empty strings, this will cause dlt to create new columns and fill them with empty strings
+            if val == "":
+                fill_val = None
+            elif data_type in ["timestamp", "date"]:
+                # the datetimes are inferred from first row of data. if next rows have inconsistent data types - pass the values to dlt to deal with them
+                if not isinstance(val, (int, float)) or isinstance(val, bool):
+                    fill_val = val
+                else:
+                    fill_val = serial_date_to_datetime(val, data_type)
+            else:
+                fill_val = val
+            table_dict[header] = fill_val
+        yield table_dict
+
+
+def trim_range_top_left(
+    parsed_range: ParsedRange, range_values: List[List[Any]]
+) -> Tuple[ParsedRange, List[List[Any]]]:
+    # skip empty rows and then empty columns
+    # skip empty rows
+    shift_x = 0
+    for row in range_values:
+        if row:
+            break
+        else:
+            shift_x += 1
+    if shift_x > 0:
+        range_values = range_values[shift_x:]
+    # skip empty columns
+    shift_y = 0
+    if len(range_values) > 0:
+        for col in range_values[0]:
+            if col == "":
+                shift_y += 1
+            else:
+                break
+        if shift_y > 0:
+            # skip all columns
+            for idx, row in enumerate(range_values):
+                range_values[idx] = row[shift_y:]
+            parsed_range = parsed_range._replace(
+                start_row=parsed_range.start_row + shift_x,
+                start_col=ParsedRange.shift_column(parsed_range.start_col, shift_y),
+            )
+    return parsed_range, range_values
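As a quick sanity check of the serial-number arithmetic in `serial_date_to_datetime` above (seconds from the 1970 epoch plus the offset back to the 1899-12-30 origin), a short worked example using the same constants; the serial number chosen here is arbitrary:

```python
import pendulum

SECONDS_PER_DAY = 86400
TIMESTAMP_CONST = -2209161600.0  # 1899-12-30 00:00 UTC expressed as a UNIX timestamp

# Serial number 45000.5 means 45000 days and 12 hours after 1899-12-30
serial_number = 45000.5
converted = pendulum.from_timestamp(0, "UTC") + pendulum.duration(
    seconds=TIMESTAMP_CONST + round(SECONDS_PER_DAY * serial_number)
)
print(converted)  # 2023-03-15 12:00:00+00:00
```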
{ingestr-0.4.0 → ingestr-0.5.1}/ingestr/src/sources.py

@@ -1,9 +1,12 @@
+import base64
 import csv
+import json
 from typing import Callable
 from urllib.parse import parse_qs, urlparse
 
 import dlt
 
+from ingestr.src.google_sheets import google_spreadsheet
 from ingestr.src.mongodb import mongodb_collection
 from ingestr.src.notion import notion_databases
 from ingestr.src.sql_database import sql_table
@@ -129,3 +132,46 @@ class NotionSource:
             database_ids=[{"id": table}],
             api_key=api_key[0],
         )
+
+
+class GoogleSheetsSource:
+    table_builder: Callable
+
+    def __init__(self, table_builder=google_spreadsheet) -> None:
+        self.table_builder = table_builder
+
+    def dlt_source(self, uri: str, table: str, **kwargs):
+        if kwargs.get("incremental_key"):
+            raise ValueError("Incremental loads are not supported for Google Sheets")
+
+        source_fields = urlparse(uri)
+        source_params = parse_qs(source_fields.query)
+
+        cred_path = source_params.get("credentials_path")
+        credentials_base64 = source_params.get("credentials_base64")
+        if not cred_path and not credentials_base64:
+            raise ValueError(
+                "credentials_path or credentials_base64 is required in the URI to get data from Google Sheets"
+            )
+
+        credentials = {}
+        if cred_path:
+            with open(cred_path[0], "r") as f:
+                credentials = json.load(f)
+        elif credentials_base64:
+            credentials = json.loads(
+                base64.b64decode(credentials_base64[0]).decode("utf-8")
+            )
+
+        table_fields = table.split(".")
+        if len(table_fields) != 2:
+            raise ValueError(
+                "Table name must be in the format <spreadsheet_id>.<sheet_name>"
+            )
+
+        return self.table_builder(
+            credentials=credentials,
+            spreadsheet_url_or_id=table_fields[0],
+            range_names=[table_fields[1]],
+            get_named_ranges=False,
+        )
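A small sketch of how the new `GoogleSheetsSource` is driven, mirroring what the CLI does with the `gsheets://` URI and the `--source-table` value; the credentials path, spreadsheet ID and sheet name below are placeholders:

```python
from ingestr.src.sources import GoogleSheetsSource

# Placeholder credentials path and <spreadsheet_id>.<sheet_name> table reference
source = GoogleSheetsSource()
dlt_source = source.dlt_source(
    uri="gsheets://?credentials_path=/path/to/file.json",
    table="abcdxyz.Sheet1",
)

# The returned object is a dlt source exposing one resource for the requested sheet;
# ingestr then hands it to pipeline.run() together with the destination settings.
```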
ingestr-0.5.1/ingestr/src/version.py (new file)

@@ -0,0 +1 @@
+__version__ = "0.5.1"
Binary files (the five ingestr/testdata/*.db fixtures): contents not shown.
{ingestr-0.4.0 → ingestr-0.5.1}/pyproject.toml

@@ -65,12 +65,14 @@ exclude = [
     'venv',
     'src/sql_database/.*',
     'src/mongodb/.*',
+    'src/google_sheets/.*',
 ]
 
 [[tool.mypy.overrides]]
 module = [
     "ingestr.src.sql_database.*",
     "ingestr.src.mongodb.*",
+    "ingestr.src.google_sheets.*",
 ]
 follow_imports = "skip"
 

ingestr-0.4.0/ingestr/src/version.py (removed)

@@ -1 +0,0 @@
-__version__ = "0.4.0"

All other files listed above are unchanged between 0.4.0 and 0.5.1.