dlt-iceberg 0.1.4__py3-none-any.whl → 0.2.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,314 +0,0 @@
- Metadata-Version: 2.4
- Name: dlt-iceberg
- Version: 0.1.4
- Summary: dlt destination for Apache Iceberg with atomic multi-file commits via REST catalogs
- Project-URL: Homepage, https://github.com/sidequery/dlt-iceberg
- Project-URL: Repository, https://github.com/sidequery/dlt-iceberg
- Project-URL: Issues, https://github.com/sidequery/dlt-iceberg/issues
- Author-email: Sidequery <hello@sidequery.com>
- License: MIT
- License-File: LICENSE
- Keywords: data-engineering,data-pipeline,dlt,elt,etl,iceberg
- Classifier: Development Status :: 4 - Beta
- Classifier: Intended Audience :: Developers
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
- Classifier: Programming Language :: Python :: 3.13
- Classifier: Topic :: Database
- Classifier: Topic :: Software Development :: Libraries :: Python Modules
- Requires-Python: >=3.11
- Requires-Dist: boto3>=1.40.50
- Requires-Dist: dlt>=1.17.1
- Requires-Dist: pandas>=2.3.3
- Requires-Dist: pyarrow>=21.0.0
- Requires-Dist: pydantic<2.11
- Requires-Dist: pyiceberg[pyiceberg-core]>=0.10.0
- Requires-Dist: requests>=2.32.5
- Requires-Dist: s3fs>=0.4.2
- Requires-Dist: sqlalchemy>=2.0.44
- Description-Content-Type: text/markdown
32
-
33
- # dlt-iceberg
34
-
35
- A [dlt](https://dlthub.com/) destination for [Apache Iceberg](https://iceberg.apache.org/) tables using REST catalogs.
36
-
37
- ## Features
38
-
39
- - **Atomic Multi-File Commits**: Multiple parquet files committed as single Iceberg snapshot per table
40
- - **REST Catalog Support**: Works with Nessie, Polaris, AWS Glue, Unity Catalog
41
- - **Partitioning**: Full support for Iceberg partition transforms (temporal, bucket, truncate, identity)
42
- - **Authentication**: OAuth2, Bearer token, AWS SigV4
43
- - **Write Dispositions**: Append, replace, merge (upsert)
44
- - **Schema Evolution**: Automatic schema updates when adding columns
45
- - **Retry Logic**: Exponential backoff for transient failures
46
-
47
- ## Installation
48
-
49
- ```bash
50
- pip install dlt-iceberg
51
- ```
52
-
53
- Or with uv:
54
-
55
- ```bash
56
- uv add dlt-iceberg
57
- ```
58
-
59
- ## Quick Start
60
-
61
- See [examples/](examples/) directory for working examples.
62
-
63
- ### Incremental Load
64
-
- ```python
- import dlt
- from dlt_iceberg import iceberg_rest
-
- @dlt.resource(name="events", write_disposition="append")
- def generate_events():
-     yield {"event_id": 1, "value": 100}
-
- pipeline = dlt.pipeline(
-     pipeline_name="my_pipeline",
-     destination=iceberg_rest(
-         catalog_uri="http://localhost:19120/iceberg/main",
-         namespace="analytics",
-         s3_endpoint="http://localhost:9000",
-         s3_access_key_id="minioadmin",
-         s3_secret_access_key="minioadmin",
-         s3_region="us-east-1",
-     ),
- )
-
- pipeline.run(generate_events())
- ```
-
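The append example above reloads whatever the resource yields. For a cursor-based incremental load against the same destination, dlt's generic `dlt.sources.incremental` hint can track the highest key seen so far; a minimal sketch, where the `ALL_EVENTS` list is a stand-in for a real source such as an API or database query:

```python
import dlt

# Stand-in for a real source; each row carries the cursor column "event_id".
ALL_EVENTS = [{"event_id": i, "value": i * 10} for i in range(1, 101)]

@dlt.resource(name="events", write_disposition="append", primary_key="event_id")
def events(cursor=dlt.sources.incremental("event_id", initial_value=0)):
    # cursor.last_value is the highest event_id committed by previous runs,
    # so re-running the pipeline only appends rows that are actually new.
    for row in ALL_EVENTS:
        if row["event_id"] > cursor.last_value:
            yield row
```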
- ### Merge/Upsert
-
- ```python
- @dlt.resource(
-     name="users",
-     write_disposition="merge",
-     primary_key="user_id"
- )
- def generate_users():
-     yield {"user_id": 1, "name": "Alice", "status": "active"}
-
- pipeline.run(generate_users())
- ```
-
- ## Configuration Options
-
- All configuration options can be passed to `iceberg_rest()`:
-
- ```python
- iceberg_rest(
-     catalog_uri="...",            # Required: REST catalog URI
-     namespace="...",              # Required: Iceberg namespace (database)
-     warehouse="...",              # Optional: Warehouse location
-
-     # Authentication
-     credential="...",             # OAuth2 client credentials
-     oauth2_server_uri="...",      # OAuth2 token endpoint
-     token="...",                  # Bearer token
-
-     # AWS SigV4
-     sigv4_enabled=True,
-     signing_region="us-east-1",
-
-     # S3 configuration
-     s3_endpoint="...",
-     s3_access_key_id="...",
-     s3_secret_access_key="...",
-     s3_region="...",
-
-     # Performance tuning
-     max_retries=5,                # Retry attempts for transient failures
-     retry_backoff_base=2.0,       # Exponential backoff multiplier
-     merge_batch_size=100000,      # Rows per batch for merge operations
-     strict_casting=False,         # If True, fail on potential data loss
- )
- ```
-
- ### Nessie (Docker)
-
- ```python
- iceberg_rest(
-     catalog_uri="http://localhost:19120/iceberg/main",
-     namespace="my_namespace",
-     s3_endpoint="http://localhost:9000",
-     s3_access_key_id="minioadmin",
-     s3_secret_access_key="minioadmin",
-     s3_region="us-east-1",
- )
- ```
-
- Start services: `docker compose up -d`
-
- ### AWS Glue
-
- ```python
- iceberg_rest(
-     catalog_uri="https://glue.us-east-1.amazonaws.com/iceberg",
-     warehouse="<account-id>:s3tablescatalog/<bucket>",
-     namespace="my_database",
-     sigv4_enabled=True,
-     signing_region="us-east-1",
- )
- ```
-
- AWS credentials are picked up from environment variables.
-
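With SigV4 enabled, the requests are typically signed with whatever credentials boto3's default chain resolves, so no secrets need to appear in the destination config. For example, using the standard AWS variable names (set from Python here purely for illustration; in practice they usually come from the shell or an IAM role):

```python
import os

# Standard AWS credential environment variables read by boto3's default credential chain.
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
```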
164
- ### Polaris
165
-
166
- ```python
167
- iceberg_rest(
168
- catalog_uri="https://polaris.example.com/api/catalog",
169
- warehouse="s3://bucket/warehouse",
170
- namespace="production",
171
- credential="client-id:client-secret",
172
- oauth2_server_uri="https://polaris.example.com/api/catalog/v1/oauth/tokens",
173
- )
174
- ```
175
-
176
- ### Unity Catalog
177
-
178
- ```python
179
- iceberg_rest(
180
- catalog_uri="https://<workspace>.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest",
181
- warehouse="<catalog-name>",
182
- namespace="<schema-name>",
183
- token="<databricks-token>",
184
- )
185
- ```
186
-
187
- ## Partitioning
188
-
189
- Mark columns for partitioning using dlt column hints:
190
-
191
- ```python
192
- @dlt.resource(
193
- name="events",
194
- columns={
195
- "event_date": {
196
- "data_type": "date",
197
- "partition": True,
198
- "partition_transform": "day", # Optional: year, month, day, hour
199
- },
200
- "region": {
201
- "data_type": "text",
202
- "partition": True, # Uses identity transform for strings
203
- },
204
- "user_id": {
205
- "data_type": "bigint",
206
- "partition": True,
207
- "partition_transform": "bucket[10]", # Hash into 10 buckets
208
- }
209
- }
210
- )
211
- def events():
212
- ...
213
- ```
214
-
215
- ### Available Transforms
216
-
217
- - **Temporal**: `year`, `month`, `day`, `hour` (for timestamp/date columns)
218
- - **Identity**: No transformation (default for string/integer)
219
- - **Bucket**: `bucket[N]` - Hash-based partitioning into N buckets
220
- - **Truncate**: `truncate[N]` - Truncate strings/integers to N width
221
-
222
- ### Default Behavior
223
-
224
- If `partition_transform` is not specified:
225
- - Timestamp/date columns default to `month`
226
- - String/integer columns default to `identity`
227
-
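In Iceberg terms, these hints and defaults describe an ordinary partition spec. A rough PyIceberg equivalent of the defaults for the earlier `events` example (field IDs are invented for illustration; the destination derives the real ones from the table schema):

```python
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.transforms import IdentityTransform, MonthTransform

# A date column with no explicit transform defaults to `month`,
# a string column to `identity`.
default_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=MonthTransform(), name="event_date_month"),
    PartitionField(source_id=2, field_id=1001, transform=IdentityTransform(), name="region"),
)
```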
228
- ## Write Dispositions
229
-
230
- ### Append
231
- ```python
232
- write_disposition="append"
233
- ```
234
- Adds new data without modifying existing rows.
235
-
236
- ### Replace
237
- ```python
238
- write_disposition="replace"
239
- ```
240
- Truncates table and inserts new data.
241
-
242
- ### Merge
243
- ```python
244
- write_disposition="merge"
245
- primary_key="user_id"
246
- ```
247
- Updates existing rows by primary key, inserts new rows.
248
-
249
- ## Development
250
-
251
- ### Run Tests
252
-
253
- ```bash
254
- # Start Docker services
255
- docker compose up -d
256
-
257
- # Run all tests
258
- uv run pytest tests/ -v
259
-
260
- # Run only unit tests
261
- uv run pytest tests/ -v -m "not integration"
262
-
263
- # Run only integration tests
264
- uv run pytest tests/ -v -m integration
265
- ```
266
-
267
- ### Project Structure
268
-
269
- ```
270
- dlt-iceberg/
271
- ├── src/dlt_iceberg/
272
- │ ├── __init__.py # Public API
273
- │ ├── destination_client.py # Class-based destination (atomic commits)
274
- │ ├── destination.py # Function-based destination (legacy)
275
- │ ├── schema_converter.py # dlt → Iceberg schema conversion
276
- │ ├── schema_casting.py # Arrow table casting
277
- │ ├── schema_evolution.py # Schema updates
278
- │ ├── partition_builder.py # Partition specs
279
- │ └── error_handling.py # Retry logic
280
- ├── tests/
281
- │ ├── test_destination_rest_catalog.py # Integration tests (Docker)
282
- │ ├── test_class_based_atomic.py # Atomic commit tests
283
- │ ├── test_merge_disposition.py
284
- │ ├── test_schema_evolution.py
285
- │ └── ...
286
- ├── examples/
287
- │ ├── incremental_load.py # CSV incremental loading
288
- │ ├── merge_load.py # CSV merge/upsert
289
- │ └── data/ # Sample CSV files
290
- └── docker-compose.yml # Nessie + MinIO for testing
291
- ```
292
-
293
- ## How It Works
294
-
295
- The class-based destination uses dlt's `JobClientBase` interface to accumulate parquet files during a load and commit them atomically in `complete_load()`:
296
-
297
- 1. dlt extracts data and writes parquet files
298
- 2. Each file is registered in module-level global state
299
- 3. After all files complete, `complete_load()` is called
300
- 4. All files for a table are combined and committed as single Iceberg snapshot
301
- 5. Each table gets one snapshot per load
302
-
303
- This ensures atomic commits even though dlt creates multiple client instances.
304
-
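A stripped-down sketch of that flow using PyIceberg directly; the names `PENDING_FILES`, `register_file`, and `commit_pending_files` are illustrative stand-ins, not this package's actual internals:

```python
from collections import defaultdict
from pyiceberg.catalog import load_catalog

# Illustrative module-level state: parquet file paths collected per table during one load.
PENDING_FILES: dict[str, list[str]] = defaultdict(list)

def register_file(table_name: str, parquet_path: str) -> None:
    # Step 2: each finished load job only records its file; no catalog I/O yet.
    PENDING_FILES[table_name].append(parquet_path)

def commit_pending_files(catalog_uri: str, namespace: str) -> None:
    # Steps 3-5: called once at the end of the load; one commit (one snapshot) per table.
    catalog = load_catalog("rest", uri=catalog_uri)
    for table_name, files in PENDING_FILES.items():
        table = catalog.load_table(f"{namespace}.{table_name}")
        table.add_files(files)  # all parquet files land in a single Iceberg snapshot
    PENDING_FILES.clear()
```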
305
- ## License
306
-
307
- MIT License - see LICENSE file
308
-
309
- ## Resources
310
-
311
- - [dlt Documentation](https://dlthub.com/docs)
312
- - [Apache Iceberg](https://iceberg.apache.org/)
313
- - [PyIceberg](https://py.iceberg.apache.org/)
314
- - [Iceberg REST Spec](https://iceberg.apache.org/rest-catalog-spec/)
@@ -1,12 +0,0 @@
- dlt_iceberg/__init__.py,sha256=ONy6E-sGcCvvqia8_fGaYp8da4n4wdjox9W42tmQPK0,780
- dlt_iceberg/destination.py,sha256=F8QJXsQeosOA32Xm1140DL485WQmxbuhiA2QZ6zpVSU,15737
- dlt_iceberg/destination_client.py,sha256=l1q8GYvIJ_tBgoQ979IS3VtUQNmg2-hYv80XZkAVFKs,23786
- dlt_iceberg/error_handling.py,sha256=k6Kkldi9BDRsXQ63VEBMMSw1xx2-b1BMjsgRFKI2iB0,7852
- dlt_iceberg/partition_builder.py,sha256=l9YNAh2t6gk2xqsPSOs8ymTDLk9BOEZWVOtVni7ONNU,10081
- dlt_iceberg/schema_casting.py,sha256=oSQrnOcCMFcinMS65N8YQ1uzrqnQmN50mCCuQyE3794,15247
- dlt_iceberg/schema_converter.py,sha256=e_eqXQz2cpABOGEAxVwcGbiOdVmv9kaZanRnU83lzXk,5619
- dlt_iceberg/schema_evolution.py,sha256=ieOkCA9ngQdJ5lbZLYQ09deTLZEW8whxDn2arpoH-aM,8326
- dlt_iceberg-0.1.4.dist-info/METADATA,sha256=2FnlwL3bwN-aItvx0Y1dUQ3iTfjLr9AJUQ0hLrJZql8,8777
- dlt_iceberg-0.1.4.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
- dlt_iceberg-0.1.4.dist-info/licenses/LICENSE,sha256=0amGlcH0msYju3WUhlsuUxO4aj3ZODkkIZ0MKOq9fQ4,1066
- dlt_iceberg-0.1.4.dist-info/RECORD,,