dlt-iceberg 0.1.3__tar.gz → 0.1.4__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/PKG-INFO +40 -5
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/README.md +39 -4
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/examples/README.md +30 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/examples/incremental_load.py +8 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/examples/merge_load.py +9 -0
- dlt_iceberg-0.1.4/examples/usgs_earthquakes.py +234 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/pyproject.toml +1 -1
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/uv.lock +1 -1
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/.github/workflows/publish.yml +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/.github/workflows/test.yml +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/.gitignore +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/.python-version +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/LICENSE +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/TESTING.md +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/docker-compose.yml +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/examples/data/customers_initial.csv +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/examples/data/customers_updates.csv +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/examples/data/events_batch1.csv +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/examples/data/events_batch2.csv +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/src/dlt_iceberg/__init__.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/src/dlt_iceberg/destination.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/src/dlt_iceberg/destination_client.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/src/dlt_iceberg/error_handling.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/src/dlt_iceberg/partition_builder.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/src/dlt_iceberg/schema_casting.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/src/dlt_iceberg/schema_converter.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/src/dlt_iceberg/schema_evolution.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/tests/test_class_based_atomic.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/tests/test_destination_e2e.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/tests/test_destination_rest_catalog.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/tests/test_e2e_sqlite_catalog.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/tests/test_error_handling.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/tests/test_merge_disposition.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/tests/test_partition_builder.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/tests/test_partitioning_e2e.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/tests/test_pyiceberg_append.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/tests/test_schema_casting.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/tests/test_schema_converter.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/tests/test_schema_evolution.py +0 -0
- {dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/tests/test_smoke.py +0 -0

{dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: dlt-iceberg
-Version: 0.1.3
+Version: 0.1.4
 Summary: dlt destination for Apache Iceberg with atomic multi-file commits via REST catalogs
 Project-URL: Homepage, https://github.com/sidequery/dlt-iceberg
 Project-URL: Repository, https://github.com/sidequery/dlt-iceberg

@@ -47,9 +47,13 @@ A [dlt](https://dlthub.com/) destination for [Apache Iceberg](https://iceberg.ap
 ## Installation

 ```bash
-
-
-
+pip install dlt-iceberg
+```
+
+Or with uv:
+
+```bash
+uv add dlt-iceberg
 ```

 ## Quick Start

@@ -95,7 +99,38 @@ def generate_users():
 pipeline.run(generate_users())
 ```

-## Configuration
+## Configuration Options
+
+All configuration options can be passed to `iceberg_rest()`:
+
+```python
+iceberg_rest(
+    catalog_uri="...",            # Required: REST catalog URI
+    namespace="...",              # Required: Iceberg namespace (database)
+    warehouse="...",              # Optional: Warehouse location
+
+    # Authentication
+    credential="...",             # OAuth2 client credentials
+    oauth2_server_uri="...",      # OAuth2 token endpoint
+    token="...",                  # Bearer token
+
+    # AWS SigV4
+    sigv4_enabled=True,
+    signing_region="us-east-1",
+
+    # S3 configuration
+    s3_endpoint="...",
+    s3_access_key_id="...",
+    s3_secret_access_key="...",
+    s3_region="...",
+
+    # Performance tuning
+    max_retries=5,                # Retry attempts for transient failures
+    retry_backoff_base=2.0,       # Exponential backoff multiplier
+    merge_batch_size=100000,      # Rows per batch for merge operations
+    strict_casting=False,         # Fail on potential data loss
+)
+```

 ### Nessie (Docker)


{dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/README.md

@@ -15,9 +15,13 @@ A [dlt](https://dlthub.com/) destination for [Apache Iceberg](https://iceberg.ap
 ## Installation

 ```bash
-
-
-
+pip install dlt-iceberg
+```
+
+Or with uv:
+
+```bash
+uv add dlt-iceberg
 ```

 ## Quick Start

@@ -63,7 +67,38 @@ def generate_users():
 pipeline.run(generate_users())
 ```

-## Configuration
+## Configuration Options
+
+All configuration options can be passed to `iceberg_rest()`:
+
+```python
+iceberg_rest(
+    catalog_uri="...",            # Required: REST catalog URI
+    namespace="...",              # Required: Iceberg namespace (database)
+    warehouse="...",              # Optional: Warehouse location
+
+    # Authentication
+    credential="...",             # OAuth2 client credentials
+    oauth2_server_uri="...",      # OAuth2 token endpoint
+    token="...",                  # Bearer token
+
+    # AWS SigV4
+    sigv4_enabled=True,
+    signing_region="us-east-1",
+
+    # S3 configuration
+    s3_endpoint="...",
+    s3_access_key_id="...",
+    s3_secret_access_key="...",
+    s3_region="...",
+
+    # Performance tuning
+    max_retries=5,                # Retry attempts for transient failures
+    retry_backoff_base=2.0,       # Exponential backoff multiplier
+    merge_batch_size=100000,      # Rows per batch for merge operations
+    strict_casting=False,         # Fail on potential data loss
+)
+```

 ### Nessie (Docker)

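
Not part of the diff, but for orientation: a minimal sketch of how these options slot into a pipeline, following the Quick Start pattern the hunk context refers to (`generate_users()` / `pipeline.run(...)`). The resource, dataset name, and connection values below are placeholders borrowed from the repository's Docker-based examples, not values mandated by the library.

```python
# Hypothetical sketch: wiring iceberg_rest() into a dlt pipeline.
# Catalog URI, namespace, and S3 settings are assumptions matching the
# docker-compose Nessie + MinIO setup used by the examples.
import dlt
from dlt_iceberg import iceberg_rest


@dlt.resource(name="users", write_disposition="append")
def generate_users():
    # Toy data; a real resource would yield rows from an API or database.
    yield from ({"id": i, "name": f"user_{i}"} for i in range(100))


pipeline = dlt.pipeline(
    pipeline_name="users_to_iceberg",
    destination=iceberg_rest(
        catalog_uri="http://localhost:19120/iceberg/main",  # Nessie REST catalog
        namespace="examples",
        s3_endpoint="http://localhost:9000",                # MinIO
        s3_access_key_id="minioadmin",
        s3_secret_access_key="minioadmin",
        s3_region="us-east-1",
    ),
    dataset_name="quickstart",
)

pipeline.run(generate_users())
```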

{dlt_iceberg-0.1.3 → dlt_iceberg-0.1.4}/examples/README.md

@@ -2,6 +2,8 @@

 This directory contains example scripts demonstrating how to use dlt-iceberg with REST catalogs.

+All examples use [PEP 723](https://peps.python.org/pep-0723/) inline script metadata, allowing them to be run directly with `uv run` without installing dependencies separately.
+
 ## Prerequisites

 Start the Docker services (Nessie REST catalog and MinIO):

@@ -48,6 +50,32 @@ Demonstrates merging two CSV files with overlapping customer IDs using merge dis
 uv run examples/merge_load.py
 ```

+### USGS Earthquake Data (`usgs_earthquakes.py`)
+
+Demonstrates loading real-world GeoJSON data from the USGS Earthquake API from 2010 through current date.
+
+- Fetches earthquake data from USGS API (2010-present, ~190 months)
+- Loads data in monthly batches with automatic weekly splitting for high-volume months
+- Handles API's 20,000 result limit by automatically splitting large months into weekly chunks
+- Uses partitioning by month on timestamp column
+- Transforms GeoJSON features into flat records
+- Loads ~2.5 million earthquakes with complete metadata
+
+**Features demonstrated:**
+- Dynamic date range generation using `date.today()`
+- Automatic handling of API result limits with recursive splitting
+- Retry logic with exponential backoff
+- Rate limiting between API requests
+- Large-scale data ingestion (190+ API calls)
+
+**Run:**
+
+```bash
+uv run examples/usgs_earthquakes.py
+```
+
+**Note:** This script takes approximately 15-20 minutes to complete as it makes 190+ API calls with rate limiting.
+
 ## Data Files

 Sample CSV files are in the `data/` directory:

@@ -63,4 +91,6 @@ Sample CSV files are in the `data/` directory:
 - **Incremental loads**: Append new data to existing tables
 - **Merge/Upsert**: Update existing records and insert new ones based on primary key
 - **REST catalog**: All examples use Nessie REST catalog with MinIO storage
+- **Partitioning**: Partition tables by timestamp (month transform)
+- **API integration**: Fetch and transform data from external APIs
 - **Querying**: Direct PyIceberg queries to verify loaded data

dlt_iceberg-0.1.4/examples/usgs_earthquakes.py (new file)

@@ -0,0 +1,234 @@
+#!/usr/bin/env -S uv run
+# /// script
+# dependencies = [
+#     "dlt",
+#     "dlt-iceberg",
+#     "pyiceberg",
+#     "requests",
+#     "python-dateutil",
+# ]
+# ///
+"""
+USGS Earthquake Data Example
+
+Loads earthquake data from USGS GeoJSON API from 2010 through current date
+into an Iceberg table.
+"""
+
+import dlt
+import requests
+import time
+from datetime import datetime, date
+from dateutil.relativedelta import relativedelta
+from dlt_iceberg import iceberg_rest
+
+
+def fetch_earthquakes(start_date: str, end_date: str, split_on_error: bool = True):
+    """
+    Fetch earthquake data from USGS API for a date range.
+
+    If a 400 error occurs (likely due to >20k result limit), automatically
+    splits the range into smaller chunks and retries.
+
+    Args:
+        start_date: Start date in YYYY-MM-DD format
+        end_date: End date in YYYY-MM-DD format
+        split_on_error: If True, split the date range on 400 errors
+    """
+    url = "https://earthquake.usgs.gov/fdsnws/event/1/query"
+    params = {
+        "format": "geojson",
+        "starttime": start_date,
+        "endtime": end_date,
+    }
+
+    print(f"Fetching earthquakes from {start_date} to {end_date}...")
+
+    # Retry logic for rate limiting
+    max_retries = 3
+    for attempt in range(max_retries):
+        try:
+            response = requests.get(url, params=params, timeout=30)
+            response.raise_for_status()
+            break
+        except requests.exceptions.HTTPError as e:
+            if e.response.status_code == 400 and split_on_error:
+                # Split the date range and retry with smaller chunks
+                print(f"  Too many results, splitting range into weekly chunks...")
+                from datetime import timedelta
+
+                start_dt = datetime.fromisoformat(start_date)
+                end_dt = datetime.fromisoformat(end_date)
+                current = start_dt
+
+                while current < end_dt:
+                    next_week = min(current + timedelta(days=7), end_dt)
+                    # Recursively fetch with split_on_error=False to avoid infinite recursion
+                    yield from fetch_earthquakes(
+                        current.date().isoformat(),
+                        next_week.date().isoformat(),
+                        split_on_error=False
+                    )
+                    current = next_week
+                return
+            elif e.response.status_code == 400:
+                print(f"  Warning: API returned 400 error for {start_date} to {end_date}, skipping...")
+                return
+            if attempt < max_retries - 1:
+                wait_time = 2 ** attempt
+                print(f"  Request failed, retrying in {wait_time}s...")
+                time.sleep(wait_time)
+            else:
+                raise
+        except requests.exceptions.RequestException as e:
+            if attempt < max_retries - 1:
+                wait_time = 2 ** attempt
+                print(f"  Request failed, retrying in {wait_time}s...")
+                time.sleep(wait_time)
+            else:
+                raise
+
+    data = response.json()
+    features = data.get("features", [])
+    print(f"Retrieved {len(features)} earthquakes")
+
+    # Rate limiting: sleep briefly between requests
+    time.sleep(0.5)
+
+    # Transform GeoJSON features into flat records
+    for feature in features:
+        props = feature["properties"]
+        geom = feature["geometry"]
+
+        yield {
+            "earthquake_id": feature["id"],
+            "magnitude": props.get("mag"),
+            "place": props.get("place"),
+            "time": datetime.fromtimestamp(props["time"] / 1000) if props.get("time") else None,
+            "updated": datetime.fromtimestamp(props["updated"] / 1000) if props.get("updated") else None,
+            "url": props.get("url"),
+            "detail": props.get("detail"),
+            "felt": props.get("felt"),
+            "cdi": props.get("cdi"),
+            "mmi": props.get("mmi"),
+            "alert": props.get("alert"),
+            "status": props.get("status"),
+            "tsunami": props.get("tsunami"),
+            "sig": props.get("sig"),
+            "net": props.get("net"),
+            "code": props.get("code"),
+            "ids": props.get("ids"),
+            "sources": props.get("sources"),
+            "types": props.get("types"),
+            "nst": props.get("nst"),
+            "dmin": props.get("dmin"),
+            "rms": props.get("rms"),
+            "gap": props.get("gap"),
+            "magType": props.get("magType"),
+            "type": props.get("type"),
+            "title": props.get("title"),
+            "longitude": geom["coordinates"][0] if geom and geom.get("coordinates") else None,
+            "latitude": geom["coordinates"][1] if geom and geom.get("coordinates") else None,
+            "depth": geom["coordinates"][2] if geom and geom.get("coordinates") and len(geom["coordinates"]) > 2 else None,
+        }
+
+
+def main():
+    # Create dlt pipeline with Nessie REST catalog
+    pipeline = dlt.pipeline(
+        pipeline_name="usgs_earthquakes",
+        destination=iceberg_rest(
+            catalog_uri="http://localhost:19120/iceberg/main",
+            namespace="examples",
+            s3_endpoint="http://localhost:9000",
+            s3_access_key_id="minioadmin",
+            s3_secret_access_key="minioadmin",
+            s3_region="us-east-1",
+        ),
+        dataset_name="usgs_data",
+    )
+
+    # Load earthquakes from 2010 through current date
+    # Breaking into monthly batches to avoid overwhelming the API
+    # Note: USGS endtime is exclusive, so we use the first day of next month
+
+    start_date = date(2010, 1, 1)
+    end_date = date.today()
+
+    date_ranges = []
+    current = start_date
+    while current <= end_date:
+        next_month = current + relativedelta(months=1)
+        date_ranges.append((current.isoformat(), next_month.isoformat()))
+        current = next_month
+
+    print(f"Loading {len(date_ranges)} months of earthquake data from {start_date} to {end_date}...")
+    print()
+
+    for i, (start, end) in enumerate(date_ranges, 1):
+        @dlt.resource(
+            name="earthquakes",
+            write_disposition="append",
+            columns={
+                "time": {
+                    "data_type": "timestamp",
+                    "x-partition": True,
+                    "x-partition-transform": "month",
+                }
+            }
+        )
+        def earthquakes_batch():
+            return fetch_earthquakes(start, end)
+
+        load_info = pipeline.run(earthquakes_batch())
+        print(f"[{i}/{len(date_ranges)}] Loaded {start} to {end}")
+        print()
+
+    # Query the table to verify
+    from pyiceberg.catalog import load_catalog
+
+    catalog = load_catalog(
+        "query",
+        type="rest",
+        uri="http://localhost:19120/iceberg/main",
+        **{
+            "s3.endpoint": "http://localhost:9000",
+            "s3.access-key-id": "minioadmin",
+            "s3.secret-access-key": "minioadmin",
+            "s3.region": "us-east-1",
+        },
+    )
+
+    table = catalog.load_table("examples.earthquakes")
+    result = table.scan().to_arrow()
+
+    print(f"\n{'='*60}")
+    print(f"Total earthquakes loaded: {len(result)}")
+
+    import pyarrow.compute as pc
+    print(f"Date range: {pc.min(result['time']).as_py()} to {pc.max(result['time']).as_py()}")
+    print(f"Magnitude range: {pc.min(result['magnitude']).as_py()} to {pc.max(result['magnitude']).as_py()}")
+
+    # Show some sample records
+    import pandas as pd
+    df = result.to_pandas()
+    print(f"\nSample earthquakes:")
+    print(df[["time", "magnitude", "place", "depth"]].head(10).to_string(index=False))
+
+    # Show distribution by month
+    df["month"] = pd.to_datetime(df["time"]).dt.to_period("M")
+    monthly_counts = df.groupby("month").size().sort_index()
+    print(f"\nEarthquakes by month:")
+    for month, count in monthly_counts.items():
+        print(f"  {month}: {count:,}")
+
+    # Show magnitude distribution
+    print(f"\nMagnitude distribution:")
+    print(df["magnitude"].describe())
+
+    print(f"\n{'='*60}")
+    print("USGS earthquake data load complete!")
+
+
+if __name__ == "__main__":
+    main()
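
A possible follow-up query, not part of the diff: because the table is partitioned by month on `time` (via the `x-partition` hints above), downstream reads can push filters into the scan rather than materializing the whole table the way the verification step does. A sketch, assuming the same local Nessie/MinIO setup and the `examples.earthquakes` table created by the script:

```python
# Sketch only: a filtered PyIceberg read against the month-partitioned table.
# The time filter lets Iceberg prune monthly partitions; the magnitude filter
# is applied on top. Endpoint and credential values assume the docker-compose setup.
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, GreaterThanOrEqual

catalog = load_catalog(
    "query",
    type="rest",
    uri="http://localhost:19120/iceberg/main",
    **{
        "s3.endpoint": "http://localhost:9000",
        "s3.access-key-id": "minioadmin",
        "s3.secret-access-key": "minioadmin",
        "s3.region": "us-east-1",
    },
)

table = catalog.load_table("examples.earthquakes")

# Large quakes since 2020, only the columns we care about.
large_recent = table.scan(
    row_filter=And(
        GreaterThanOrEqual("time", "2020-01-01T00:00:00"),
        GreaterThanOrEqual("magnitude", 7.0),
    ),
    selected_fields=("time", "magnitude", "place", "depth"),
).to_arrow()

print(large_recent.to_pandas().sort_values("magnitude", ascending=False).head(20))
```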