dlt-iceberg 0.1.1__py3-none-any.whl → 0.1.2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

--- /dev/null
+++ dlt_iceberg-0.1.2.dist-info/METADATA
@@ -0,0 +1,279 @@
Metadata-Version: 2.4
Name: dlt-iceberg
Version: 0.1.2
Summary: dlt destination for Apache Iceberg with atomic multi-file commits via REST catalogs
Project-URL: Homepage, https://github.com/sidequery/dlt-iceberg
Project-URL: Repository, https://github.com/sidequery/dlt-iceberg
Project-URL: Issues, https://github.com/sidequery/dlt-iceberg/issues
Author-email: Sidequery <hello@sidequery.com>
License: MIT
License-File: LICENSE
Keywords: data-engineering,data-pipeline,dlt,elt,etl,iceberg
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Requires-Dist: boto3>=1.40.50
Requires-Dist: dlt>=1.17.1
Requires-Dist: pandas>=2.3.3
Requires-Dist: pyarrow>=21.0.0
Requires-Dist: pydantic<2.11
Requires-Dist: pyiceberg[pyiceberg-core]>=0.10.0
Requires-Dist: requests>=2.32.5
Requires-Dist: s3fs>=0.4.2
Requires-Dist: sqlalchemy>=2.0.44
Description-Content-Type: text/markdown

# dlt-iceberg

A [dlt](https://dlthub.com/) destination for [Apache Iceberg](https://iceberg.apache.org/) tables using REST catalogs.

## Features

- **Atomic Multi-File Commits**: Multiple parquet files are committed as a single Iceberg snapshot per table
- **REST Catalog Support**: Works with Nessie, Polaris, AWS Glue, and Unity Catalog
- **Partitioning**: Full support for Iceberg partition transforms (temporal, bucket, truncate, identity)
- **Authentication**: OAuth2, Bearer token, AWS SigV4
- **Write Dispositions**: Append, replace, merge (upsert)
- **Schema Evolution**: Automatic schema updates when new columns are added
- **Retry Logic**: Exponential backoff for transient failures

## Installation

```bash
git clone https://github.com/sidequery/dlt-iceberg.git
cd dlt-iceberg
uv sync
```

## Quick Start

See the [examples/](examples/) directory for working examples.

### Incremental Load

```python
import dlt
from dlt_iceberg import iceberg_rest

@dlt.resource(name="events", write_disposition="append")
def generate_events():
    yield {"event_id": 1, "value": 100}

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination=iceberg_rest(
        catalog_uri="http://localhost:19120/iceberg/main",
        namespace="analytics",
        s3_endpoint="http://localhost:9000",
        s3_access_key_id="minioadmin",
        s3_secret_access_key="minioadmin",
        s3_region="us-east-1",
    ),
)

pipeline.run(generate_events())
```

### Merge/Upsert

```python
@dlt.resource(
    name="users",
    write_disposition="merge",
    primary_key="user_id",
)
def generate_users():
    yield {"user_id": 1, "name": "Alice", "status": "active"}

pipeline.run(generate_users())
```

## Configuration

### Nessie (Docker)

```python
iceberg_rest(
    catalog_uri="http://localhost:19120/iceberg/main",
    namespace="my_namespace",
    s3_endpoint="http://localhost:9000",
    s3_access_key_id="minioadmin",
    s3_secret_access_key="minioadmin",
    s3_region="us-east-1",
)
```

Start services: `docker compose up -d`

### AWS Glue

```python
iceberg_rest(
    catalog_uri="https://glue.us-east-1.amazonaws.com/iceberg",
    warehouse="<account-id>:s3tablescatalog/<bucket>",
    namespace="my_database",
    sigv4_enabled=True,
    signing_region="us-east-1",
)
```

AWS credentials are supplied via the standard environment variables.

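For instance, boto3 resolves credentials from the standard AWS environment variables, so setting them before running the pipeline is enough (a minimal sketch; the values are placeholders):

```python
import os

# Standard AWS SDK environment variables; boto3 picks these up for the
# SigV4-signed catalog requests and for S3 writes.
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
```
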
### Polaris

```python
iceberg_rest(
    catalog_uri="https://polaris.example.com/api/catalog",
    warehouse="s3://bucket/warehouse",
    namespace="production",
    credential="client-id:client-secret",
    oauth2_server_uri="https://polaris.example.com/api/catalog/v1/oauth/tokens",
)
```

### Unity Catalog

```python
iceberg_rest(
    catalog_uri="https://<workspace>.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest",
    warehouse="<catalog-name>",
    namespace="<schema-name>",
    token="<databricks-token>",
)
```

## Partitioning

Mark columns for partitioning using dlt column hints:

```python
@dlt.resource(
    name="events",
    columns={
        "event_date": {
            "data_type": "date",
            "partition": True,
            "partition_transform": "day",  # Optional: year, month, day, hour
        },
        "region": {
            "data_type": "text",
            "partition": True,  # Uses identity transform for strings
        },
        "user_id": {
            "data_type": "bigint",
            "partition": True,
            "partition_transform": "bucket[10]",  # Hash into 10 buckets
        },
    },
)
def events():
    ...
```

180
+ ### Available Transforms
181
+
182
+ - **Temporal**: `year`, `month`, `day`, `hour` (for timestamp/date columns)
183
+ - **Identity**: No transformation (default for string/integer)
184
+ - **Bucket**: `bucket[N]` - Hash-based partitioning into N buckets
185
+ - **Truncate**: `truncate[N]` - Truncate strings/integers to N width
186
+
187
+ ### Default Behavior
188
+
189
+ If `partition_transform` is not specified:
190
+ - Timestamp/date columns default to `month`
191
+ - String/integer columns default to `identity`
192
+
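As an illustration of these defaults (the resource and column names here are hypothetical), a timestamp column marked for partitioning with no explicit transform would be partitioned by month:

```python
@dlt.resource(
    name="orders",  # hypothetical resource
    columns={
        # No "partition_transform" given, so per the rules above a
        # timestamp column defaults to the "month" transform.
        "created_at": {"data_type": "timestamp", "partition": True},
    },
)
def orders():
    yield {"created_at": "2024-01-15T10:00:00Z", "amount": 42}
```
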
## Write Dispositions

### Append
```python
write_disposition="append"
```
Adds new data without modifying existing rows.

### Replace
```python
write_disposition="replace"
```
Truncates the table and inserts new data.

### Merge
```python
write_disposition="merge"
primary_key="user_id"
```
Updates existing rows by primary key and inserts new ones.

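Since merge matches on the primary key, dlt's standard `primary_key` hint also accepts a list of columns for composite keys. A minimal sketch with hypothetical names:

```python
@dlt.resource(
    name="order_items",  # hypothetical resource
    write_disposition="merge",
    primary_key=["order_id", "line_number"],  # composite merge key
)
def order_items():
    yield {"order_id": 1, "line_number": 1, "quantity": 2}
```
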
## Development

### Run Tests

```bash
# Start Docker services
docker compose up -d

# Run all tests
uv run pytest tests/ -v

# Run only unit tests
uv run pytest tests/ -v -m "not integration"

# Run only integration tests
uv run pytest tests/ -v -m integration
```

### Project Structure

```
dlt-iceberg/
├── src/dlt_iceberg/
│   ├── __init__.py            # Public API
│   ├── destination_client.py  # Class-based destination (atomic commits)
│   ├── destination.py         # Function-based destination (legacy)
│   ├── schema_converter.py    # dlt → Iceberg schema conversion
│   ├── schema_casting.py      # Arrow table casting
│   ├── schema_evolution.py    # Schema updates
│   ├── partition_builder.py   # Partition specs
│   └── error_handling.py      # Retry logic
├── tests/
│   ├── test_destination_rest_catalog.py  # Integration tests (Docker)
│   ├── test_class_based_atomic.py        # Atomic commit tests
│   ├── test_merge_disposition.py
│   ├── test_schema_evolution.py
│   └── ...
├── examples/
│   ├── incremental_load.py    # CSV incremental loading
│   ├── merge_load.py          # CSV merge/upsert
│   └── data/                  # Sample CSV files
└── docker-compose.yml         # Nessie + MinIO for testing
```

## How It Works

The class-based destination uses dlt's `JobClientBase` interface to accumulate parquet files during a load and commit them atomically in `complete_load()`:

1. dlt extracts data and writes parquet files
2. Each file is registered in module-level global state
3. After all files complete, `complete_load()` is called
4. All files for a table are combined and committed as a single Iceberg snapshot
5. Each table gets one snapshot per load

This ensures atomic commits even though dlt creates multiple client instances. A simplified sketch of the pattern is shown below.

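A minimal sketch of this accumulate-then-commit pattern, assuming pyiceberg's `Catalog.load_table` and `Table.append`; the registry name `_PENDING_FILES` and both helper functions are illustrative stand-ins, not the package's actual internals:

```python
from collections import defaultdict

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical module-level registry keyed by (load_id, table_name);
# dlt may create several client instances, but all of them see this state.
_PENDING_FILES: dict[tuple[str, str], list[str]] = defaultdict(list)

def register_file(load_id: str, table_name: str, parquet_path: str) -> None:
    # Step 2: record each completed parquet file in shared state.
    _PENDING_FILES[(load_id, table_name)].append(parquet_path)

def complete_load(catalog, namespace: str, load_id: str) -> None:
    # Steps 3-5: combine all files per table and commit them with a single
    # pyiceberg append, yielding one snapshot per table per load.
    for (lid, table_name), paths in list(_PENDING_FILES.items()):
        if lid != load_id:
            continue
        table = catalog.load_table(f"{namespace}.{table_name}")
        combined = pa.concat_tables(pq.read_table(p) for p in paths)
        table.append(combined)  # one atomic snapshot commit
        del _PENDING_FILES[(lid, table_name)]
```
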
## License

MIT License - see the LICENSE file.

## Resources

- [dlt Documentation](https://dlthub.com/docs)
- [Apache Iceberg](https://iceberg.apache.org/)
- [PyIceberg](https://py.iceberg.apache.org/)
- [Iceberg REST Spec](https://iceberg.apache.org/rest-catalog-spec/)
--- dlt_iceberg-0.1.1.dist-info/RECORD
+++ dlt_iceberg-0.1.2.dist-info/RECORD
@@ -6,7 +6,7 @@ dlt_iceberg/partition_builder.py,sha256=l9YNAh2t6gk2xqsPSOs8ymTDLk9BOEZWVOtVni7O
  dlt_iceberg/schema_casting.py,sha256=Qn4sarRnyJM04lKvKonEjvlvVdizUOGI65J_AmzbEAs,12997
  dlt_iceberg/schema_converter.py,sha256=e_eqXQz2cpABOGEAxVwcGbiOdVmv9kaZanRnU83lzXk,5619
  dlt_iceberg/schema_evolution.py,sha256=ieOkCA9ngQdJ5lbZLYQ09deTLZEW8whxDn2arpoH-aM,8326
- dlt_iceberg-0.1.1.dist-info/METADATA,sha256=hhtEkMwpG_rQBUULTeyoMsSevGGEIhCqOjTJJgCw8qY,466
- dlt_iceberg-0.1.1.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
- dlt_iceberg-0.1.1.dist-info/licenses/LICENSE,sha256=0amGlcH0msYju3WUhlsuUxO4aj3ZODkkIZ0MKOq9fQ4,1066
- dlt_iceberg-0.1.1.dist-info/RECORD,,
+ dlt_iceberg-0.1.2.dist-info/METADATA,sha256=fR4s357aPlxbCqkVF8l6-tc2qn4fX5DQeDrs3wFszNg,7797
+ dlt_iceberg-0.1.2.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+ dlt_iceberg-0.1.2.dist-info/licenses/LICENSE,sha256=0amGlcH0msYju3WUhlsuUxO4aj3ZODkkIZ0MKOq9fQ4,1066
+ dlt_iceberg-0.1.2.dist-info/RECORD,,
--- dlt_iceberg-0.1.1.dist-info/METADATA
+++ /dev/null
@@ -1,15 +0,0 @@
- Metadata-Version: 2.4
- Name: dlt-iceberg
- Version: 0.1.1
- Summary: dlt custom destination for Apache Iceberg with REST catalog support
- License-File: LICENSE
- Requires-Python: >=3.11
- Requires-Dist: boto3>=1.40.50
- Requires-Dist: dlt>=1.17.1
- Requires-Dist: pandas>=2.3.3
- Requires-Dist: pyarrow>=21.0.0
- Requires-Dist: pydantic<2.11
- Requires-Dist: pyiceberg[pyiceberg-core]>=0.10.0
- Requires-Dist: requests>=2.32.5
- Requires-Dist: s3fs>=0.4.2
- Requires-Dist: sqlalchemy>=2.0.44