lakekeeper 0.0.1__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 ab2dridi
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,451 @@
1
+ Metadata-Version: 2.4
2
+ Name: lakekeeper
3
+ Version: 0.0.1
4
+ Summary: Safe compaction of Hive external tables on on-premises Kerberized Hadoop clusters
5
+ Author-email: ab2dridi <a-d13@hotmail.fr>
6
+ License-Expression: MIT
7
+ Project-URL: Homepage, https://github.com/ab2dridi/Lakekeeper
8
+ Project-URL: Repository, https://github.com/ab2dridi/Lakekeeper
9
+ Project-URL: Changelog, https://github.com/ab2dridi/Lakekeeper/blob/main/CHANGELOG.md
10
+ Project-URL: Bug Tracker, https://github.com/ab2dridi/Lakekeeper/issues
11
+ Keywords: hive,hdfs,spark,compaction,hadoop,kerberos,pyspark,data-engineering,small-files,iceberg,data-lake
12
+ Classifier: Development Status :: 3 - Alpha
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Intended Audience :: System Administrators
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.9
17
+ Classifier: Programming Language :: Python :: 3.10
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: Topic :: Database
21
+ Classifier: Topic :: System :: Filesystems
22
+ Classifier: Topic :: Utilities
23
+ Requires-Python: >=3.9
24
+ Description-Content-Type: text/markdown
25
+ License-File: LICENSE
26
+ Requires-Dist: click>=8.0
27
+ Requires-Dist: pyyaml>=6.0
28
+ Provides-Extra: dev
29
+ Requires-Dist: pytest>=7.0; extra == "dev"
30
+ Requires-Dist: pytest-cov>=4.0; extra == "dev"
31
+ Requires-Dist: ruff>=0.4; extra == "dev"
32
+ Dynamic: license-file
33
+
34
+ # Lakekeeper
35
+
36
+ > Safe compaction of Hive external tables on on-premises Kerberized Hadoop clusters.
37
+
38
+ [![PyPI version](https://img.shields.io/pypi/v/lakekeeper.svg)](https://pypi.org/project/lakekeeper/)
39
+ [![Python](https://img.shields.io/pypi/pyversions/lakekeeper.svg)](https://pypi.org/project/lakekeeper/)
40
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
41
+
42
+ ---
43
+
44
+ ## The problem
45
+
46
+ On Hadoop clusters using Hive external tables with PySpark, data pipelines
47
+ accumulate thousands of small files over time (e.g. 65,000 files for 3 GB of
48
+ data). This pattern degrades read performance, overloads the HDFS NameNode,
49
+ and slows down all downstream queries.
50
+
51
+ The root cause is that Spark writes one file per partition task by default,
52
+ and incremental pipelines append rather than rewrite. Common tools like
53
+ `INSERT OVERWRITE` or `saveAsTable` solve the small-file problem but destroy
54
+ metadata cataloging properties (Apache Atlas lineage, table location in the
55
+ Hive Metastore), making them unsuitable for production use on managed clusters.
56
+
57
+ Lakekeeper solves this **without touching the table's Metastore location**.
58
+
59
+ ---
60
+
61
+ ## Solution
62
+
63
+ Lakekeeper compacts Hive external tables safely:
64
+
65
+ - **No `saveAsTable`** — the table's Metastore location never changes, preserving lineage and catalog properties (Apache Atlas and compatible systems)
66
+ - **Zero-copy backups** — `CREATE EXTERNAL TABLE LIKE` pointing to the original location, no data duplication
67
+ - **Per-partition compaction** — only compacts partitions that exceed the small-file threshold, untouched partitions are skipped
68
+ - **Dynamic target file count** — computed from actual data size and configured HDFS block size
69
+ - **Row count verification** — aborts and rolls back automatically if counts do not match after compaction
70
+
71
+ ---
72
+
73
+ ## Requirements
74
+
75
+ - Python >= 3.9
76
+ - Apache Spark (PySpark) accessible on the cluster
77
+ - Hive Metastore with external table support
78
+ - HDFS as the underlying storage
79
+
80
+ Tested on Cloudera CDP 7.1.9. It should also work on other on-premises Hadoop
81
+ distributions (Hortonworks HDP, Ambari-managed clusters, vanilla Hadoop) that
82
+ expose a standard Hive Metastore and HDFS filesystem.
83
+
84
+ ---
85
+
86
+ ## Installation
87
+
88
+ ```bash
89
+ pip install lakekeeper
90
+ ```
91
+
92
+ For development:
93
+
94
+ ```bash
95
+ git clone https://github.com/ab2dridi/Lakekeeper.git
96
+ cd Lakekeeper
97
+ pip install ".[dev]"
98
+ ```
99
+
100
+ ---
101
+
102
+ ## End-to-end usage
103
+
104
+ ### Scenario 1 — Local cluster (no Kerberos)
105
+
106
+ Suitable for development environments or clusters without Kerberos authentication.
107
+
108
+ ```bash
109
+ # 1. Install
110
+ pip install lakekeeper
111
+
112
+ # 2. Analyze — see which tables need compaction (no writes, safe to run anytime)
113
+ lakekeeper analyze --database mydb
114
+
115
+ # 3. Compact a specific table
116
+ lakekeeper compact --table mydb.events
117
+
118
+ # 4. If something went wrong, roll back to the original state
119
+ lakekeeper rollback --table mydb.events
120
+
121
+ # 5. Once you're confident, remove the backup to free up disk space
122
+ lakekeeper cleanup --table mydb.events
123
+ ```
124
+
125
+ ### Scenario 2 — On-premises Kerberized cluster (YAML config)
126
+
127
+ On a Kerberized cluster, configure `spark_submit` in a YAML file. The
128
+ `lakekeeper` CLI automatically builds and executes the `spark-submit` command —
129
+ no need to write it manually.
130
+
131
+ **Step 1 — Create the Python environment to ship to the cluster**
132
+
133
+ ```bash
134
+ conda create -n lakekeeper_env python=3.9 -y
135
+ conda activate lakekeeper_env
136
+ pip install lakekeeper conda-pack  # conda-pack provides the packing command used below
137
+ conda-pack -o lakekeeper_env.tar.gz
138
+ ```
139
+
140
+ **Step 2 — Write a config file**
141
+
142
+ ```yaml
143
+ # lakekeeper.yaml
144
+ block_size_mb: 128
145
+ compaction_ratio_threshold: 10.0
146
+ log_level: INFO
147
+
148
+ spark_submit:
149
+ enabled: true
150
+ master: yarn
151
+ deploy_mode: client
152
+ principal: myuser@MY.REALM.COM
153
+ keytab: /etc/security/keytabs/myuser.keytab
154
+ queue: data-engineering
155
+ archives: /opt/lakekeeper_env.tar.gz#lakekeeper_env
156
+ python_env: ./lakekeeper_env/bin/python
157
+ executor_memory: 4g
158
+ num_executors: 10
159
+ executor_cores: 2
160
+ driver_memory: 2g
161
+ script_path: /opt/lakekeeper/run_lakekeeper.py
162
+ extra_conf:
163
+ spark.yarn.kerberos.relogin.period: 1h
164
+ ```
165
+
166
+ **Step 3 — Run**
167
+
168
+ ```bash
169
+ # Analyze (dry-run, no writes)
170
+ lakekeeper --config-file lakekeeper.yaml analyze --database mydb
171
+
172
+ # Compact a single table
173
+ lakekeeper --config-file lakekeeper.yaml compact --table mydb.events
174
+
175
+ # Compact multiple tables
176
+ lakekeeper --config-file lakekeeper.yaml compact --tables mydb.events,mydb.users
177
+
178
+ # Compact an entire database
179
+ lakekeeper --config-file lakekeeper.yaml compact --database mydb
180
+
181
+ # Roll back if needed
182
+ lakekeeper --config-file lakekeeper.yaml rollback --table mydb.events
183
+
184
+ # Cleanup backups older than 7 days
185
+ lakekeeper --config-file lakekeeper.yaml cleanup --database mydb --older-than 7d
186
+ ```
187
+
188
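+ Under the hood, Lakekeeper builds and executes a command of this form (abridged: `executor_cores`, `driver_memory`, and `extra_conf` from the config above also map to flags):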
189
+
190
+ ```bash
191
+ spark-submit --master yarn --deploy-mode client \
192
+ --principal myuser@MY.REALM.COM \
193
+ --keytab /etc/security/keytabs/myuser.keytab \
194
+ --conf spark.yarn.queue=data-engineering \
195
+ --archives /opt/lakekeeper_env.tar.gz#lakekeeper_env \
196
+ --conf spark.pyspark.python=./lakekeeper_env/bin/python \
197
+ --executor-memory 4g --num-executors 10 \
198
+ /opt/lakekeeper/run_lakekeeper.py compact --table mydb.events
199
+ ```
200
+
201
+ ### Scenario 3 — spark-submit manually
202
+
203
+ For one-off runs or when Lakekeeper is not installed on the edge node.
204
+
205
+ ```bash
206
+ spark-submit \
207
+ --master yarn \
208
+ --deploy-mode client \
209
+ --principal myuser@MY.REALM.COM \
210
+ --keytab /etc/security/keytabs/myuser.keytab \
211
+ --conf spark.yarn.queue=my-queue \
212
+ --archives lakekeeper_env.tar.gz#lakekeeper_env \
213
+ --conf spark.pyspark.python=./lakekeeper_env/bin/python \
214
+ run_lakekeeper.py compact --database mydb --block-size 128
215
+ ```
216
+
217
+ ---
218
+
219
+ ## CLI reference
220
+
221
+ ```
222
+ lakekeeper [OPTIONS] COMMAND [ARGS]...
223
+
224
+ Options:
225
+ --version Show version and exit.
226
+ --help Show help and exit.
227
+
228
+ Commands:
229
+ analyze Analyze tables and report compaction needs (dry-run, no writes).
230
+ compact Compact Hive external tables.
231
+ rollback Rollback a table to its pre-compaction state.
232
+ cleanup Remove backup tables and reclaim HDFS space.
233
+ ```
234
+
235
+ ### analyze
236
+
237
+ ```bash
238
+ lakekeeper analyze --database mydb
239
+ lakekeeper analyze --table mydb.events
240
+ lakekeeper analyze --tables mydb.events,mydb.users
241
+ lakekeeper analyze --table mydb.events --block-size 256 --ratio-threshold 5
242
+ ```
243
+
244
+ ### compact
245
+
246
+ ```bash
247
+ lakekeeper compact --database mydb
248
+ lakekeeper compact --table mydb.events
249
+ lakekeeper compact --tables mydb.events,mydb.users
250
+ lakekeeper compact --database mydb --block-size 256 --ratio-threshold 5
251
+ lakekeeper compact --database mydb --dry-run # analyze only, no writes
252
+ ```
253
+
254
+ ### rollback
255
+
256
+ ```bash
257
+ lakekeeper rollback --table mydb.events
258
+ ```
259
+
260
+ ### cleanup
261
+
262
+ ```bash
263
+ lakekeeper cleanup --table mydb.events # remove all backups for a table
264
+ lakekeeper cleanup --database mydb --older-than 7d # remove backups older than 7 days
265
+ ```
266
+
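The `--older-than 7d` filter can be pictured with the sketch below. It is an illustration under stated assumptions, not Lakekeeper's implementation: the helper names are hypothetical, only the day suffix shown above is handled, and backup age is assumed to be read from the `YYYYMMDD_HHMMSS` timestamp embedded in the backup table name.

```python
import re
from datetime import datetime, timedelta


def parse_older_than(value):
    """Parse a day count such as '7d' into a timedelta."""
    match = re.fullmatch(r"(\d+)d", value)
    if match is None:
        raise ValueError(f"expected a value like '7d', got {value!r}")
    return timedelta(days=int(match.group(1)))


def backups_older_than(backup_names, older_than, now):
    """Return backups whose name-embedded timestamp is older than the cutoff."""
    cutoff = now - parse_older_than(older_than)
    expired = []
    for name in backup_names:
        # Assumed naming scheme: <prefix>_<table>_YYYYMMDD_HHMMSS
        match = re.search(r"_(\d{8}_\d{6})$", name)
        if match and datetime.strptime(match.group(1), "%Y%m%d_%H%M%S") < cutoff:
            expired.append(name)
    return expired


names = ["__bkp_events_20240301_020000", "__bkp_events_20240309_020000"]
print(backups_older_than(names, "7d", now=datetime(2024, 3, 10)))
# ['__bkp_events_20240301_020000']
```

Passing `now` explicitly keeps the function deterministic, which makes the cutoff logic easy to test.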
267
+ ---
268
+
269
+ ## Configuration reference
270
+
271
+ ### Lakekeeper parameters
272
+
273
+ | Parameter | Default | CLI flag | Description |
274
+ |---|---|---|---|
275
+ | `block_size_mb` | `128` | `--block-size` | Target HDFS block size in MB |
276
+ | `compaction_ratio_threshold` | `10.0` | `--ratio-threshold` | Compact if avg file size < block_size / ratio |
277
+ | `backup_prefix` | `__bkp` | — | Prefix for backup table names |
278
+ | `dry_run` | `false` | `--dry-run` | Analyze only, no writes |
279
+ | `log_level` | `INFO` | `--log-level` | `DEBUG`, `INFO`, `WARNING`, `ERROR` |
280
+
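The threshold rule in the table, together with the dynamic target file count, can be sketched as follows. This is a minimal illustration rather than Lakekeeper's actual code: the function names are hypothetical, and the target count is assumed to be the total data size divided by the block size, rounded up.

```python
import math

MB = 1024 * 1024


def needs_compaction(file_count, total_bytes, block_size_mb=128, ratio_threshold=10.0):
    """Return True when the average file size is below block_size / ratio."""
    if file_count == 0:
        return False  # nothing to compact
    avg_file_size = total_bytes / file_count
    return avg_file_size < (block_size_mb * MB) / ratio_threshold


def target_file_count(total_bytes, block_size_mb=128):
    """Assumed rule: one output file per block of data, and at least one file."""
    return max(1, math.ceil(total_bytes / (block_size_mb * MB)))


# 65,000 files holding 3 GB in total, the example from "The problem"
total = 3 * 1024 * MB
print(needs_compaction(65_000, total))  # True: the average file is about 48 KB
print(target_file_count(total))         # 24
```

With the defaults, a table is flagged once its average file size drops below 12.8 MB.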
281
+ ### spark_submit parameters
282
+
283
+ | Parameter | Default | Description |
284
+ |---|---|---|
285
+ | `enabled` | `false` | Enable automatic spark-submit launch |
286
+ | `master` | `yarn` | Spark master URL |
287
+ | `deploy_mode` | `client` | `client` or `cluster` |
288
+ | `principal` | — | Kerberos principal (e.g. `user@REALM.COM`) |
289
+ | `keytab` | — | Path to the Kerberos keytab file |
290
+ | `queue` | — | YARN queue name (`spark.yarn.queue`) |
291
+ | `archives` | — | `--archives` for the conda-packed Python env |
292
+ | `python_env` | — | Python path inside the archive (`spark.pyspark.python`) |
293
+ | `executor_memory` | — | `--executor-memory` (e.g. `4g`) |
294
+ | `num_executors` | — | `--num-executors` |
295
+ | `executor_cores` | — | `--executor-cores` |
296
+ | `driver_memory` | — | `--driver-memory` |
297
+ | `script_path` | `run_lakekeeper.py` | Path to the entry-point script passed to spark-submit |
298
+ | `extra_conf` | `{}` | Additional `--conf key=value` pairs |
299
+
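To make the mapping in the table concrete, here is a rough sketch of how a `spark_submit` config could be turned into a command line. The function `build_spark_submit_command` is hypothetical and the flag order is an assumption; only the flag and property names come from the table above.

```python
import shlex


def build_spark_submit_command(cfg, lakekeeper_args):
    """Translate a spark_submit config mapping into an argv list."""
    cmd = ["spark-submit",
           "--master", cfg.get("master", "yarn"),
           "--deploy-mode", cfg.get("deploy_mode", "client")]
    # Options that map to a dedicated spark-submit flag
    for key, flag in [("principal", "--principal"), ("keytab", "--keytab"),
                      ("archives", "--archives"),
                      ("executor_memory", "--executor-memory"),
                      ("num_executors", "--num-executors"),
                      ("executor_cores", "--executor-cores"),
                      ("driver_memory", "--driver-memory")]:
        if cfg.get(key) is not None:
            cmd += [flag, str(cfg[key])]
    # Options that map to a Spark property passed with --conf
    conf = {}
    if cfg.get("queue"):
        conf["spark.yarn.queue"] = cfg["queue"]
    if cfg.get("python_env"):
        conf["spark.pyspark.python"] = cfg["python_env"]
    conf.update(cfg.get("extra_conf") or {})
    for name, value in conf.items():
        cmd += ["--conf", f"{name}={value}"]
    # Entry-point script, followed by the Lakekeeper sub-command and its options
    cmd.append(cfg.get("script_path", "run_lakekeeper.py"))
    return cmd + list(lakekeeper_args)


cfg = {"principal": "myuser@MY.REALM.COM",
       "keytab": "/etc/security/keytabs/myuser.keytab",
       "queue": "data-engineering",
       "executor_memory": "4g",
       "num_executors": 10}
print(shlex.join(build_spark_submit_command(cfg, ["compact", "--table", "mydb.events"])))
```

Building an argv list rather than a shell string avoids quoting problems when the command is passed to `subprocess.run`.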
300
+ ---
301
+
302
+ ## How it works
303
+
304
+ ### Compaction strategy — HDFS rename swap
305
+
306
+ Lakekeeper uses HDFS directory renames rather than `ALTER TABLE SET LOCATION`
307
+ to swap data. The table's Metastore location never changes — only the contents
308
+ of the HDFS directory are replaced in place. Lineage and cataloging properties
309
+ (Apache Atlas and compatible systems) are fully preserved.
310
+
311
+ #### Non-partitioned table
312
+
313
+ Given a table `mydb.events` at `hdfs:///warehouse/mydb/events/`:
314
+
315
+ ```
316
+ Step 1 — Backup
317
+ Metastore: mydb.__bkp_events_20240301_020000 → hdfs:///warehouse/mydb/events/
318
+ (external.table.purge=false)
319
+ HDFS: events/ (original files, untouched)
320
+
321
+ Step 2 — Write compacted data to a temp sibling directory
322
+ HDFS: events/ ← original, still live
323
+ events__compact_tmp_1709257200/ ← Spark writes here
324
+
325
+ Step 3 — Verify row count
326
+ Counts differ → delete events__compact_tmp_1709257200/ and abort.
327
+ Original data at events/ is never touched.
328
+
329
+ Step 4 — Atomic HDFS rename swap
330
+ rename events/ → events__old_1709257200/
331
+ rename events__compact_tmp_1709257200/ → events/
332
+
333
+ Final state:
334
+ events/ ← compacted files (table still points here)
335
+ events__old_1709257200/ ← original files (kept for rollback)
336
+ __bkp_events_20240301_020000 ← backup table in Metastore, now pointing to events__old_1709257200/
337
+ ```
338
+
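The steps above can be tried out on a local filesystem with the sketch below. It is a simplified stand-in, not Lakekeeper's code: `os.rename` plays the role of the HDFS rename, counting lines replaces the Spark row count, the compacted output is a single file, and the helper names are hypothetical.

```python
import os
import shutil
import tempfile
import time


def count_rows(directory):
    """Count lines across all files in a directory (stand-in for a Spark count)."""
    total = 0
    for name in os.listdir(directory):
        with open(os.path.join(directory, name)) as handle:
            total += sum(1 for _ in handle)
    return total


def compact_with_swap(table_dir):
    """Rewrite table_dir into a single file, then swap it in with two renames."""
    ts = int(time.time())
    tmp_dir = f"{table_dir}__compact_tmp_{ts}"
    old_dir = f"{table_dir}__old_{ts}"

    # Step 2: write compacted data to a sibling directory
    os.mkdir(tmp_dir)
    with open(os.path.join(tmp_dir, "part-00000"), "w") as out:
        for name in sorted(os.listdir(table_dir)):
            with open(os.path.join(table_dir, name)) as src:
                out.write(src.read())

    # Step 3: verify row count; on mismatch discard the temp copy and abort
    if count_rows(tmp_dir) != count_rows(table_dir):
        shutil.rmtree(tmp_dir)
        raise RuntimeError("row count mismatch, original data left untouched")

    # Step 4: rename swap; the original files survive in old_dir for rollback
    os.rename(table_dir, old_dir)
    os.rename(tmp_dir, table_dir)
    return old_dir


root = tempfile.mkdtemp()
events = os.path.join(root, "events")
os.mkdir(events)
for i in range(5):
    with open(os.path.join(events, f"part-{i}"), "w") as f:
        f.write(f"row-{i}\n")

old = compact_with_swap(events)
print(len(os.listdir(events)), "file(s) after compaction,", count_rows(events), "rows")
print("original kept at", os.path.basename(old))
```

Each rename is a single metadata operation, but the pair is not one transaction, so the table path is briefly absent between the two calls.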
339
+ #### Partitioned table
340
+
341
+ The same rename swap is applied **partition by partition**, only for partitions
342
+ that exceed the compaction threshold:
343
+
344
+ ```
345
+ Before:
346
+ events/year=2024/month=01/ 10 000 files, 1 GB ← needs compaction
347
+ events/year=2024/month=02/ 3 files, 300 MB ← skipped
348
+
349
+ After:
350
+ events/year=2024/month=01/ ← 8 compacted files
351
+ events/year=2024/month=01__old_TS/ ← original (kept for rollback)
352
+ events/year=2024/month=02/ ← untouched
353
+ ```
354
+
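The per-partition decision in this example can be reproduced with a short sketch. The function name is hypothetical, and the target file count is assumed to be the partition size divided by the block size, rounded up, which matches the 8 files shown for the 1 GB partition.

```python
import math

MB = 1024 * 1024


def plan_partitions(partitions, block_size_mb=128, ratio_threshold=10.0):
    """Map each partition needing compaction to an assumed target file count."""
    block = block_size_mb * MB
    plan = {}
    for path, (file_count, total_bytes) in partitions.items():
        # Compact only when the average file size is below block_size / ratio
        if file_count and total_bytes / file_count < block / ratio_threshold:
            plan[path] = max(1, math.ceil(total_bytes / block))
    return plan


partitions = {
    "year=2024/month=01": (10_000, 1024 * MB),  # 10 000 files, 1 GB
    "year=2024/month=02": (3, 300 * MB),        # 3 files, 300 MB
}
print(plan_partitions(partitions))
# {'year=2024/month=01': 8}
```

Partitions missing from the returned plan are the ones left untouched.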
355
+ Readers of already-compacted partitions see the new files immediately while
356
+ readers of not-yet-processed partitions still see the original data. All reads
357
+ remain consistent throughout the operation.
358
+
359
+ ### Rollback
360
+
361
+ ```bash
362
+ lakekeeper rollback --table mydb.events
363
+ ```
364
+
365
+ 1. Finds the most recent backup table (`__bkp_events_*`)
366
+ 2. Reads its Metastore location → `events__old_TS/` (the original data)
367
+ 3. Deletes `events/` (the compacted data)
368
+ 4. Renames `events__old_TS/` back to `events/`
369
+ 5. Drops the backup table
370
+
371
+ The table is restored to exactly its pre-compaction state.
372
+
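Step 1 of the rollback, picking the most recent backup, could look like the sketch below. It assumes backup tables are named `<backup_prefix>_<table>_YYYYMMDD_HHMMSS`, as in the examples above; `latest_backup` is a hypothetical helper, not part of Lakekeeper's API.

```python
import re


def latest_backup(table_names, table, prefix="__bkp"):
    """Pick the most recent backup table for `table`, or None if there is none."""
    pattern = re.compile(rf"{re.escape(prefix)}_{re.escape(table)}_(\d{{8}}_\d{{6}})")
    # Keep (timestamp, name) pairs; the timestamp format sorts chronologically as text
    stamped = [(m.group(1), name) for name in table_names
               if (m := pattern.fullmatch(name))]
    return max(stamped)[1] if stamped else None


tables = ["events", "__bkp_events_20240301_020000",
          "__bkp_events_20240315_020000", "__bkp_users_20240401_020000"]
print(latest_backup(tables, "events"))  # __bkp_events_20240315_020000
```

Matching the full name, including the timestamp shape, keeps a backup of `events_archive` from being mistaken for a backup of `events`.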
373
+ ### Cleanup
374
+
375
+ ```bash
376
+ lakekeeper cleanup --table mydb.events
377
+ ```
378
+
379
+ 1. Finds all `__bkp_events_*` backup tables
380
+ 2. For each: deletes the `__old_*` HDFS directory it points to, then drops the backup table
381
+
382
+ **Cleanup is irreversible.** Once run, rollback is no longer possible for the cleaned backups.
383
+
384
+ ---
385
+
386
+ ## Important considerations
387
+
388
+ ### ⚠ Run during a maintenance window
389
+
390
+ Lakekeeper reads the table twice (once to count rows, once to write). Any rows
391
+ written by an active pipeline **between those two reads** will not appear in
392
+ the compacted output and will be lost after the rename swap.
393
+
394
+ **Always run Lakekeeper while source pipelines are stopped**, or schedule it
395
+ in a maintenance window.
396
+
397
+ ### ⚠ 2× disk space required
398
+
399
+ During compaction, both the original and compacted data exist on HDFS simultaneously:
400
+ - `events/` — original files (until the rename swap)
401
+ - `events__compact_tmp_TS/` — compacted files being written
402
+
403
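+ Ensure the HDFS parent directory quota allows **at least 2× the table size** before starting. The extra space stays in use after the swap, because the original files are kept in `events__old_TS/` until `lakekeeper cleanup` runs.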
404
+
405
+ ### ⚠ Do not delete `__old_*` directories manually
406
+
407
+ After a successful compaction, `events__old_TS/` is the rollback safety net.
408
+ Deleting it manually makes rollback impossible. Use `lakekeeper cleanup` instead.
409
+
410
+ ### ⚠ Do not drop backup tables manually
411
+
412
+ Backup tables are created with `TBLPROPERTIES ('external.table.purge'='false')`
413
+ to prevent the Hive Metastore setting `external.table.purge=true` from deleting
414
+ the underlying HDFS data on `DROP TABLE`. Dropping a backup table manually
415
+ removes the Metastore pointer to `events__old_TS/` and prevents rollback.
416
+
417
+ > **Cloudera CDP note:** CDP clusters commonly set `external.table.purge=true`
418
+ > globally. The `purge=false` property on backup tables overrides this default.
419
+
420
+ ### ⚠ Leftover staging directories block the next run
421
+
422
+ If a previous compaction crashed, it may have left an `events__compact_tmp_TS/`
423
+ or `events__old_TS/` directory behind. Lakekeeper **refuses to start** if either
424
+ path already exists. Resolve manually before retrying:
425
+
426
+ 1. Inspect the leftover directory contents.
427
+ 2. If it contains valid compacted data, check whether the rename swap completed and restore accordingly.
428
+ 3. If it is stale or incomplete, delete it: `hdfs dfs -rm -r <path>`.
429
+
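For the inspection in step 1, a small helper can list candidate leftovers next to a table directory. The sketch below works on a local filesystem as a stand-in for HDFS (on a cluster, `hdfs dfs -ls` on the parent directory gives the same view), and `find_leftovers` is a hypothetical name. A match is not necessarily stale: an `events__old_TS/` directory also exists after every successful compaction until cleanup runs.

```python
import os
import tempfile


def find_leftovers(table_dir):
    """List sibling directories that look like staging or rollback leftovers."""
    parent, name = os.path.split(table_dir.rstrip("/"))
    prefixes = (f"{name}__compact_tmp_", f"{name}__old_")
    return sorted(
        os.path.join(parent, entry)
        for entry in os.listdir(parent)
        if entry.startswith(prefixes) and os.path.isdir(os.path.join(parent, entry))
    )


# Demo layout: one table, one stale staging directory, one unrelated table
root = tempfile.mkdtemp()
for name in ("events", "events__compact_tmp_1709257200", "events_archive"):
    os.mkdir(os.path.join(root, name))

for path in find_leftovers(os.path.join(root, "events")):
    print("leftover:", os.path.basename(path))
```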
430
+ ---
431
+
432
+ ## Development
433
+
434
+ ```bash
435
+ git clone https://github.com/ab2dridi/Lakekeeper.git
436
+ cd Lakekeeper
437
+ pip install ".[dev]"
438
+
439
+ # Lint
440
+ ruff check src/ tests/
441
+ ruff format --check src/ tests/
442
+
443
+ # Tests with coverage
444
+ pytest tests/ -v --cov=lakekeeper --cov-report=term-missing
445
+ ```
446
+
447
+ ---
448
+
449
+ ## License
450
+
451
+ MIT — see [LICENSE](LICENSE) for details.