posixlake 0.1.6__cp311-cp311-win_amd64.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,1010 @@
1
+ Metadata-Version: 2.1
2
+ Name: posixlake
3
+ Version: 0.1.6
4
+ Summary: High-performance Delta Lake database with POSIX interface and Python bindings
5
+ Home-page: https://github.com/npiesco/posixlake
6
+ Author: posixlake Contributors
7
+ Author-email:
8
+ License: MIT
9
+ Project-URL: Bug Tracker, https://github.com/npiesco/posixlake/issues
10
+ Project-URL: Documentation, https://github.com/npiesco/posixlake#readme
11
+ Project-URL: Source Code, https://github.com/npiesco/posixlake
12
+ Keywords: database,delta-lake,sql,parquet,rust,datafusion,time-travel,acid,analytics
13
+ Classifier: Development Status :: 4 - Beta
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: License :: OSI Approved :: MIT License
16
+ Classifier: Programming Language :: Python :: 3
17
+ Classifier: Programming Language :: Python :: 3.8
18
+ Classifier: Programming Language :: Python :: 3.9
19
+ Classifier: Programming Language :: Python :: 3.10
20
+ Classifier: Programming Language :: Python :: 3.11
21
+ Classifier: Programming Language :: Python :: 3.12
22
+ Classifier: Programming Language :: Rust
23
+ Classifier: Topic :: Database
24
+ Classifier: Topic :: Software Development :: Libraries
25
+ Classifier: Operating System :: OS Independent
26
+ Requires-Python: >=3.8
27
+ Description-Content-Type: text/markdown
28
+
29
+ <div align="center">
30
+ <h1>posixlake Python Bindings</h1>
31
+ <p><strong>High-performance Delta Lake database with Python API and POSIX interface</strong></p>
32
+
33
+ <p><em>Python API for posixlake (File Store Database) - run Delta Lake operations, SQL queries, and time travel from Python, and use Unix commands (`cat`, `grep`, `awk`, `wc`, `head`, `tail`, `sort`, `cut`, `echo >>`, `sed -i`, `vim`, `mkdir`, `mv`, `cp`, `rmdir`, `rm`) to query data and trigger Delta Lake transactions. Mount databases as POSIX filesystems where standard Unix tools execute ACID operations. Works with local filesystem directories and S3-compatible object storage. Built on Rust for performance.</em></p>
34
+
35
+ [![Python](https://img.shields.io/badge/Python-3.11+-3776AB?logo=python&logoColor=white)](https://www.python.org)
36
+ [![PyPI](https://img.shields.io/badge/PyPI-posixlake-3776AB?logo=pypi&logoColor=white)](https://pypi.org/project/posixlake/)
37
+ [![Delta Lake](https://img.shields.io/badge/Delta%20Lake-Native%20Format-00ADD8?logo=delta&logoColor=white)](https://delta.io)
38
+ [![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](../../LICENSE.md)
39
+ [![Rust](https://img.shields.io/badge/Powered%20by-Rust-orange.svg)](https://www.rust-lang.org)
40
+
41
+ [![Arrow](https://img.shields.io/badge/Arrow-56.2-red?logo=apache)](https://arrow.apache.org)
42
+ [![DataFusion](https://img.shields.io/badge/DataFusion-50.3-purple?logo=apache)](https://datafusion.apache.org)
43
+ [![S3 Compatible](https://img.shields.io/badge/S3-Compatible-569A31?logo=amazons3&logoColor=white)](.)
44
+ [![NFS Server](https://img.shields.io/badge/NFS-Pure%20Rust-orange)](.)
45
+ </div>
46
+
47
+ ---
48
+
49
+ **Key Features:**
50
+ - **Delta Lake Native**: Full ACID transactions with native `_delta_log/` format
51
+ - **SQL Queries**: DataFusion-powered SQL engine embedded in Python
52
+ - **Time Travel**: Query historical versions and timestamps
53
+ - **CSV/Parquet Import**: Create databases from CSV (auto schema inference) or Parquet files
54
+ - **Buffered Inserts**: 10x performance improvement for small batch writes
55
+ - **NFS Server**: Mount Delta Lake as POSIX filesystem - standard Unix tools work directly
56
+ - **Storage Backends**: Works with local filesystem and S3/MinIO - same unified API
57
+ - **Performance**: Rust-powered engine with buffered inserts (~10x faster for small batches)
58
+ - **No Special Drivers**: Uses OS built-in NFS client - zero installation
59
+ - **Delta Lake Compatible**: Tables readable by Spark, Databricks, and Athena immediately
60
+
61
+ ---
62
+
63
+ ## Installation
64
+
65
+ ### From PyPI (Recommended)
66
+
67
+ ```bash
68
+ pip install posixlake
69
+ ```
70
+
71
+ **Requirements:**
72
+ - **Python 3.11+** (required for prebuilt wheels with native library)
73
+ - For other Python versions, install from source (see below)
74
+
75
+ **PyPI Package:** https://pypi.org/project/posixlake/
76
+
77
+ ### From Source
78
+
79
+ ```bash
80
+ # 1. Clone the repository
81
+ git clone https://github.com/npiesco/posixlake.git
82
+ cd posixlake
83
+
84
+ # 2. Build Rust library
85
+ cargo build --release
86
+
87
+ # 3. Generate Python API (libposixlake.dylib on macOS; the built library is typically libposixlake.so on Linux and posixlake.dll on Windows)
88
+ cargo run --bin uniffi-bindgen -- generate \
89
+ --library target/release/libposixlake.dylib \
90
+ --language python \
91
+ --out-dir bindings/python
92
+
93
+ # 4. Copy library
94
+ cp target/release/libposixlake.dylib bindings/python/
95
+
96
+ # 5. Install Python package
97
+ cd bindings/python
98
+ pip install -e .
99
+ ```
100
+
101
+ **Prerequisites:**
102
+ - Python 3.8+ (3.11+ recommended for prebuilt wheels)
103
+ - Rust 1.70+ (for building from source)
104
+ - NFS client (built-in on macOS/Linux/Windows Pro)
105
+
106
+ ---
107
+
108
+ ## Quick Start
109
+
110
+ ### Example 1: Basic Database Operations
111
+
112
+ ```python
113
+ from posixlake import DatabaseOps, Schema, Field, PosixLakeError
114
+
115
+ # Create a schema
116
+ schema = Schema(fields=[
117
+ Field(name="id", data_type="Int32", nullable=False),
118
+ Field(name="name", data_type="String", nullable=False),
119
+ Field(name="age", data_type="Int32", nullable=True),
120
+ Field(name="salary", data_type="Float64", nullable=True),
121
+ ])
122
+
123
+ # Create database on local filesystem
124
+ try:
125
+ db = DatabaseOps.create("/path/to/db", schema)
126
+ print("✓ Database created")
127
+ except PosixLakeError as e:
128
+ print(f"✗ Error: {e}")
129
+
130
+ # Insert data (JSON format)
131
+ data = '[{"id": 1, "name": "Alice", "age": 30, "salary": 75000.0}]'
132
+ db.insert_json(data)
133
+
134
+ # Query with SQL
135
+ results = db.query_json("SELECT * FROM data WHERE age > 25")
136
+ print(results)
137
+ # [{"id": 1, "name": "Alice", "age": 30, "salary": 75000.0}]
138
+
139
+ # Delete rows
140
+ db.delete_rows_where("id = 1")
141
+ print("✓ Row deleted")
142
+ ```
143
+
144
+ ### Example 2: Buffered Insert (High Performance)
145
+
146
+ ```python
147
+ from posixlake import DatabaseOps, Schema, Field
148
+ import json
149
+
150
+ schema = Schema(fields=[
151
+ Field(name="id", data_type="Int32", nullable=False),
152
+ Field(name="name", data_type="String", nullable=False),
153
+ Field(name="email", data_type="String", nullable=False),
154
+ ])
155
+
156
+ db = DatabaseOps.create("/path/to/db", schema)
157
+
158
+ # Insert many small batches efficiently (buffers up to 1000 rows)
159
+ print("Inserting 100 small batches using buffered insert...")
160
+ for i in range(100):
161
+ db.insert_buffered_json(json.dumps([{
162
+ "id": i,
163
+ "name": f"User_{i}",
164
+ "email": f"user{i}@example.com"
165
+ }]))
166
+ if (i + 1) % 20 == 0:
167
+ print(f" Buffered {i + 1}/100 batches...")
168
+
169
+ # Flush buffer to commit all data
170
+ print("\nFlushing write buffer...")
171
+ db.flush_write_buffer()
172
+ print("✓ All buffered data committed to Delta Lake")
173
+
174
+ # Result: ~1-2 Delta Lake transactions instead of 100!
175
+ # Performance improvement: ~10x faster for small batches
176
+ ```
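+
+ To sanity-check how many transactions the buffered path actually produced, a quick follow-up sketch (it assumes `get_version_history()` returns one entry per committed version, as the API reference below suggests):
+
+ ```python
+ # Count committed Delta Lake versions after the buffered load
+ history = db.get_version_history()
+ print(f"Committed versions: {len(history)}")  # expect far fewer than 100
+ ```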
177
+
178
+ ### Example 3: S3 / Object Storage Backend
179
+
180
+ ```python
181
+ from posixlake import DatabaseOps, Schema, Field, S3Config
182
+
183
+ schema = Schema(fields=[
184
+ Field(name="id", data_type="Int32", nullable=False),
185
+ Field(name="name", data_type="String", nullable=False),
186
+ Field(name="value", data_type="Float64", nullable=True),
187
+ ])
188
+
189
+ # Create database on S3/MinIO
190
+ s3_config = S3Config(
191
+ endpoint="http://localhost:9000", # MinIO or AWS S3 endpoint
192
+ access_key_id="minioadmin",
193
+ secret_access_key="minioadmin",
194
+ region="us-east-1"
195
+ )
196
+
197
+ db = DatabaseOps.create_with_s3("s3://bucket-name/db-path", schema, s3_config)
198
+
199
+ # Same API works with S3!
200
+ db.insert_json('[{"id": 1, "name": "Alice", "value": 123.45}]')
201
+ results = db.query_json("SELECT * FROM data WHERE value > 100")
202
+ print(results)
203
+
204
+ # All data stored in S3 with Delta Lake ACID transactions
205
+ ```
206
+
207
+ ### Example 4: POSIX Access via NFS Server
208
+
209
+ ```python
210
+ from posixlake import DatabaseOps, Schema, Field, NfsServer
211
+ import time
212
+ import subprocess
213
+
214
+ # Create database
215
+ schema = Schema(fields=[
216
+ Field(name="id", data_type="Int32", nullable=False),
217
+ Field(name="name", data_type="String", nullable=False),
218
+ Field(name="age", data_type="Int32", nullable=True),
219
+ ])
220
+ db = DatabaseOps.create("/path/to/db", schema)
221
+
222
+ # Insert data
223
+ db.insert_json('[{"id": 1, "name": "Alice", "age": 30}, {"id": 2, "name": "Bob", "age": 25}]')
224
+
225
+ # Start NFS server on port 12049
226
+ nfs_port = 12049
227
+ nfs_server = NfsServer(db, nfs_port)
228
+ print(f"✓ NFS server started on port {nfs_port}")
229
+
230
+ # Wait for server to be ready
231
+ time.sleep(0.5)
232
+ if nfs_server.is_ready():
233
+ print("✓ NFS server is ready!")
234
+ else:
235
+ print("⚠ NFS server not ready, POSIX operations may fail")
236
+
237
+ # Mount filesystem (requires sudo - run this in terminal)
238
+ # sudo mount_nfs -o nolocks,vers=3,tcp,port=12049,mountport=12049 localhost:/ /mnt/posixlake
239
+
240
+ # Now use standard Unix tools to query and trigger Delta Lake operations:
241
+ # $ cat /mnt/posixlake/data/data.csv # Queries Parquet data, converts to CSV
242
+ # id,name,age
243
+ # 1,Alice,30
244
+ # 2,Bob,25
245
+ #
246
+ # $ grep "Alice" /mnt/posixlake/data/data.csv | awk -F',' '{print $2}' # Search and process
247
+ # Alice
248
+ #
249
+ # $ wc -l /mnt/posixlake/data/data.csv # Count records
250
+ # 3 /mnt/posixlake/data/data.csv
251
+ #
252
+ # $ echo "3,Charlie,28" >> /mnt/posixlake/data/data.csv # Triggers Delta Lake INSERT transaction!
253
+ #
254
+ # $ sed -i 's/Alice,30/Alice,31/' /mnt/posixlake/data/data.csv # Triggers Delta Lake MERGE (UPDATE) transaction!
255
+ #
256
+ # $ grep -v "Bob" /mnt/posixlake/data/data.csv > /tmp/temp && cat /tmp/temp > /mnt/posixlake/data/data.csv # Triggers MERGE (DELETE) transaction!
257
+
258
+ # Shutdown NFS server when done
259
+ # nfs_server.shutdown()
260
+ ```
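+
+ The `subprocess` import above hints at driving the same Unix commands from Python; a minimal sketch, assuming the share has already been mounted at `/mnt/posixlake` with the `mount_nfs` command shown in the comments:
+
+ ```python
+ import subprocess
+
+ # Read the whole table as CSV through the NFS mount
+ out = subprocess.run(
+     ["cat", "/mnt/posixlake/data/data.csv"],
+     capture_output=True, text=True, check=True,
+ )
+ print(out.stdout)
+
+ # Append a row; the NFS server turns this into a Delta Lake INSERT transaction
+ with open("/mnt/posixlake/data/data.csv", "a") as f:
+     f.write("3,Charlie,28\n")
+ ```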
261
+
262
+ ### Example 5: Time Travel Queries
263
+
264
+ ```python
265
+ from posixlake import DatabaseOps, Schema, Field
266
+
267
+ schema = Schema(fields=[
268
+ Field(name="id", data_type="Int32", nullable=False),
269
+ Field(name="name", data_type="String", nullable=False),
270
+ ])
271
+
272
+ db = DatabaseOps.create("/path/to/db", schema)
273
+
274
+ # Insert initial data
275
+ db.insert_json('[{"id": 1, "name": "Alice"}]')
276
+ version_1 = db.get_current_version()
277
+ print(f"Version 1: {version_1}")
278
+
279
+ # Insert more data
280
+ db.insert_json('[{"id": 2, "name": "Bob"}]')
281
+ version_2 = db.get_current_version()
282
+ print(f"Version 2: {version_2}")
283
+
284
+ # Query by version (historical data)
285
+ results_v1 = db.query_json_at_version("SELECT * FROM data", version_1)
286
+ print(f"Data at version {version_1}: {results_v1}")
287
+ # [{"id": 1, "name": "Alice"}]
288
+
289
+ results_v2 = db.query_json_at_version("SELECT * FROM data", version_2)
290
+ print(f"Data at version {version_2}: {results_v2}")
291
+ # [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
292
+
293
+ # Query by timestamp
294
+ import time
295
+ timestamp = int(time.time())
296
+ results = db.query_json_at_timestamp("SELECT * FROM data", timestamp)
297
+ print(f"Data at timestamp {timestamp}: {results}")
298
+ ```
299
+
300
+ ### Example 6: Import from CSV (Auto Schema Inference)
301
+
302
+ ```python
303
+ from posixlake import DatabaseOps
304
+ import json
305
+
306
+ # Create database by importing CSV - schema is automatically inferred!
307
+ # Column types detected: Int64, Float64, Boolean, String
308
+ db = DatabaseOps.create_from_csv("/path/to/new_db", "/path/to/data.csv")
309
+
310
+ # Query the imported data
311
+ results = db.query_json("SELECT * FROM data LIMIT 5")
312
+ print(json.loads(results))
313
+
314
+ # Check inferred schema
315
+ schema = db.get_schema()
316
+ for field in schema.fields:
317
+ print(f" {field.name}: {field.data_type} (nullable={field.nullable})")
318
+ ```
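+
+ For an end-to-end run without a pre-existing file, a small sketch that first writes a CSV with the standard library and then imports it (the paths are illustrative):
+
+ ```python
+ import csv
+ import json
+ from posixlake import DatabaseOps
+
+ # Write a tiny CSV to import (illustrative path)
+ with open("/tmp/scores.csv", "w", newline="") as f:
+     writer = csv.writer(f)
+     writer.writerow(["id", "name", "score"])
+     writer.writerow([1, "Alice", 91.5])
+     writer.writerow([2, "Bob", 78.0])
+
+ # Column types (Int64, String, Float64 here) are inferred from the file
+ db = DatabaseOps.create_from_csv("/tmp/scores_db", "/tmp/scores.csv")
+ print(json.loads(db.query_json("SELECT name FROM data WHERE score > 80")))
+ ```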
319
+
320
+ ### Example 7: Import from Parquet
321
+
322
+ ```python
323
+ from posixlake import DatabaseOps
324
+ import json
325
+
326
+ # Create database from existing Parquet file(s)
327
+ # Schema is read directly from Parquet metadata
328
+ db = DatabaseOps.create_from_parquet("/path/to/new_db", "/path/to/data.parquet")
329
+
330
+ # Supports glob patterns for multiple files
331
+ db = DatabaseOps.create_from_parquet("/path/to/db", "/data/*.parquet")
332
+
333
+ # Query the imported data
334
+ results = db.query_json("SELECT COUNT(*) as total FROM data")
335
+ print(json.loads(results))
336
+ ```
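+
+ If you do not already have a Parquet file handy, one can be produced with `pyarrow` (an extra dependency used here only to create the sample file; paths are illustrative):
+
+ ```python
+ import json
+ import pyarrow as pa
+ import pyarrow.parquet as pq
+ from posixlake import DatabaseOps
+
+ # Write a small Parquet file to import
+ table = pa.table({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
+ pq.write_table(table, "/tmp/sample.parquet")
+
+ # Schema comes straight from the Parquet metadata
+ db = DatabaseOps.create_from_parquet("/tmp/sample_db", "/tmp/sample.parquet")
+ print(json.loads(db.query_json("SELECT COUNT(*) AS total FROM data")))
+ ```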
337
+
338
+ ### Example 8: Delta Lake Operations
339
+
340
+ ```python
341
+ from posixlake import DatabaseOps, Schema, Field
342
+
343
+ db = DatabaseOps.open("/path/to/db")
344
+
345
+ # OPTIMIZE: Compact small Parquet files into larger ones
346
+ optimize_result = db.optimize()
347
+ print(f"✓ OPTIMIZE completed: {optimize_result}")
348
+
349
+ # VACUUM: Remove old files (retention period in hours)
350
+ vacuum_result = db.vacuum(retention_hours=168) # 7 days
351
+ print(f"✓ VACUUM completed: {vacuum_result}")
352
+
353
+ # Z-ORDER: Multi-dimensional clustering for better query performance
354
+ zorder_result = db.zorder(columns=["id", "name"])
355
+ print(f"✓ Z-ORDER completed: {zorder_result}")
356
+
357
+ # Get data skipping statistics
358
+ stats = db.get_data_skipping_stats()
359
+ print(f"Data skipping stats: {stats}")
360
+ ```
361
+
362
+ ---
363
+
364
+ ## Core Features
365
+
366
+ ### Database Operations
367
+
368
+ #### Creating and Opening Databases
369
+
370
+ ```python
371
+ from posixlake import DatabaseOps, Schema, Field, S3Config
372
+
373
+ # Local filesystem with explicit schema
374
+ schema = Schema(fields=[
375
+ Field(name="id", data_type="Int32", nullable=False),
376
+ Field(name="name", data_type="String", nullable=False),
377
+ ])
378
+ db = DatabaseOps.create("/path/to/db", schema)
379
+ db = DatabaseOps.open("/path/to/db")
380
+
381
+ # Import from CSV (auto schema inference)
382
+ db = DatabaseOps.create_from_csv("/path/to/db", "/path/to/data.csv")
383
+
384
+ # Import from Parquet (schema from metadata)
385
+ db = DatabaseOps.create_from_parquet("/path/to/db", "/path/to/data.parquet")
386
+ db = DatabaseOps.create_from_parquet("/path/to/db", "/data/*.parquet") # glob pattern
387
+
388
+ # With authentication
389
+ db = DatabaseOps.create_with_auth("/path/to/db", schema, auth_enabled=True)
390
+ db = DatabaseOps.open_with_credentials("/path/to/db", credentials)
391
+
392
+ # S3 backend
393
+ s3_config = S3Config(
394
+ endpoint="http://localhost:9000",
395
+ access_key_id="minioadmin",
396
+ secret_access_key="minioadmin",
397
+ region="us-east-1"
398
+ )
399
+ db = DatabaseOps.create_with_s3("s3://bucket/db-path", schema, s3_config)
400
+ db = DatabaseOps.open_with_s3("s3://bucket/db-path", s3_config)
401
+ ```
402
+
403
+ #### Data Insertion
404
+
405
+ ```python
406
+ import json
+
+ # Regular insert (one transaction per call)
407
+ db.insert_json('[{"id": 1, "name": "Alice"}]')
408
+
409
+ # Buffered insert (batches multiple writes)
410
+ db.insert_buffered_json('[{"id": 2, "name": "Bob"}]')
411
+ db.insert_buffered_json('[{"id": 3, "name": "Charlie"}]')
412
+ db.flush_write_buffer() # Commit all buffered data
413
+
414
+ # MERGE (UPSERT) operation
415
+ merge_data = [
416
+ {"id": 1, "name": "Alice Updated", "_op": "UPDATE"},
417
+ {"id": 4, "name": "David", "_op": "INSERT"},
418
+ {"id": 2, "_op": "DELETE"}
419
+ ]
420
+ result = db.merge_json(json.dumps(merge_data), "id")
421
+ # Returns: {"rows_inserted": 1, "rows_updated": 1, "rows_deleted": 1}
422
+ ```
423
+
424
+ #### SQL Queries
425
+
426
+ ```python
427
+ # Basic query
428
+ results = db.query_json("SELECT * FROM data WHERE id > 0")
429
+
430
+ # Aggregations
431
+ results = db.query_json("SELECT COUNT(*) as count, AVG(age) as avg_age FROM data")
432
+
433
+ # Joins (if multiple tables)
434
+ results = db.query_json("""
435
+ SELECT a.id, a.name, b.value
436
+ FROM data a
437
+ JOIN other_table b ON a.id = b.id
438
+ """)
439
+
440
+ # Time travel queries
441
+ results = db.query_json_at_version("SELECT * FROM data", version=5)
442
+ results = db.query_json_at_timestamp("SELECT * FROM data", timestamp=1234567890)
443
+ ```
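+
+ `query_json` returns its results as a JSON string (see the API reference below), so turning them into Python objects is a one-liner; a small sketch:
+
+ ```python
+ import json
+
+ rows = json.loads(db.query_json("SELECT id, name FROM data ORDER BY id"))
+ for row in rows:
+     print(row["id"], row["name"])
+ ```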
444
+
445
+ #### Row Deletion
446
+
447
+ ```python
448
+ # Delete by condition
449
+ db.delete_rows_where("id = 5")
450
+ db.delete_rows_where("age < 18")
451
+ db.delete_rows_where("name LIKE '%test%'")
452
+
453
+ # Delete all rows (truncate)
454
+ db.delete_rows_where("1=1")
455
+ ```
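+
+ Per the API reference below, `delete_rows_where` returns the number of rows it removed, which is handy for logging; a small sketch:
+
+ ```python
+ # Delete and report how many rows were affected
+ deleted = db.delete_rows_where("age < 18")
+ print(f"Deleted {deleted} rows")
+ ```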
456
+
457
+ ### Time Travel
458
+
459
+ posixlake supports Delta Lake's time travel feature, allowing you to query historical versions of your data:
460
+
461
+ ```python
462
+ # Get current version
463
+ current_version = db.get_current_version()
464
+ print(f"Current version: {current_version}")
465
+
466
+ # Query by version
467
+ results = db.query_json_at_version("SELECT * FROM data", version=10)
468
+
469
+ # Query by timestamp
470
+ import time
471
+ timestamp = int(time.time()) - 3600 # 1 hour ago
472
+ results = db.query_json_at_timestamp("SELECT * FROM data", timestamp)
473
+
474
+ # Get version history
475
+ history = db.get_version_history()
476
+ for entry in history:
477
+ print(f"Version {entry['version']}: {entry['timestamp']} - {entry['operation']}")
478
+ ```
479
+
480
+ ### Delta Lake Operations
481
+
482
+ #### OPTIMIZE (File Compaction)
483
+
484
+ ```python
485
+ # Compact small Parquet files into larger ones for better query performance
486
+ result = db.optimize()
487
+ print(f"Files compacted: {result}")
488
+ ```
489
+
490
+ #### VACUUM (Cleanup Old Files)
491
+
492
+ ```python
493
+ # Remove old files (retention period in hours)
494
+ # Default: 168 hours (7 days)
495
+ result = db.vacuum(retention_hours=168)
496
+ print(f"Files removed: {result}")
497
+ ```
498
+
499
+ #### Z-ORDER (Multi-dimensional Clustering)
500
+
501
+ ```python
502
+ # Cluster data by multiple columns for better query performance
503
+ result = db.zorder(columns=["id", "name", "age"])
504
+ print(f"Z-ORDER completed: {result}")
505
+ ```
506
+
507
+ #### Data Skipping Statistics
508
+
509
+ ```python
510
+ # Get statistics for query optimization
511
+ stats = db.get_data_skipping_stats()
512
+ print(f"Data skipping stats: {stats}")
513
+ ```
514
+
515
+ ### NFS Server (POSIX Filesystem Access)
516
+
517
+ The NFS server allows you to mount your Delta Lake database as a standard POSIX filesystem. **Unix commands don't just read data - they trigger Delta Lake operations**: `cat` queries Parquet data, `grep` searches, `echo >>` triggers INSERT transactions, `sed -i` triggers MERGE (UPDATE/DELETE) transactions. All operations are ACID-compliant Delta Lake transactions.
518
+
519
+ #### Starting the NFS Server
520
+
521
+ ```python
522
+ from posixlake import DatabaseOps, Schema, Field, NfsServer
523
+ import time
524
+
525
+ # Create/open database
526
+ db = DatabaseOps.open("/path/to/db")
527
+
528
+ # Start NFS server on port 12049
529
+ nfs = NfsServer(db, 12049)
530
+
531
+ # Wait for server to be ready
532
+ time.sleep(0.5)
533
+ if nfs.is_ready():
534
+ print("✓ NFS server ready")
535
+ else:
536
+ print("⚠ NFS server not ready")
537
+ ```
538
+
539
+ #### Mounting the Filesystem
540
+
541
+ ```bash
542
+ # Mount command (requires sudo)
543
+ sudo mount_nfs -o nolocks,vers=3,tcp,port=12049,mountport=12049 localhost:/ /mnt/posixlake
+ # (macOS syntax shown; on Linux the equivalent is typically:
+ #  sudo mount -t nfs -o nolock,vers=3,tcp,port=12049,mountport=12049 localhost:/ /mnt/posixlake)
544
+
545
+ # Verify mount
546
+ ls -la /mnt/posixlake/
547
+ # data/
548
+ # schema.sql
549
+ # .query
550
+ ```
551
+
552
+ #### Using POSIX Commands
553
+
554
+ Once mounted, your Delta Lake table is accessible like any other directory:
555
+
556
+ ```bash
557
+ # 1. List directory contents
558
+ ls -la /mnt/posixlake/data/
559
+
560
+ # 2. Read all data as CSV
561
+ cat /mnt/posixlake/data/data.csv
562
+ # id,name,age
563
+ # 1,Alice,30
564
+ # 2,Bob,25
565
+
566
+ # 3. Search for specific records with grep
567
+ grep "Alice" /mnt/posixlake/data/data.csv
568
+ # 1,Alice,30
569
+
570
+ # 4. Process columns with awk
571
+ awk -F',' '{print $2, $3}' /mnt/posixlake/data/data.csv
572
+ # name age
573
+ # Alice 30
574
+ # Bob 25
575
+
576
+ # 5. Count lines/records with wc
577
+ wc -l /mnt/posixlake/data/data.csv
578
+ # 3 /mnt/posixlake/data/data.csv (includes header)
579
+
580
+ # 6. Sort data by a column
581
+ sort -t',' -k2 /mnt/posixlake/data/data.csv # Sort by name
582
+
583
+ # 7. Append new data (triggers Delta Lake INSERT transaction!)
584
+ echo "3,Charlie,28" >> /mnt/posixlake/data/data.csv
585
+ # → Executes: Delta Lake INSERT transaction with ACID guarantees
586
+ cat /mnt/posixlake/data/data.csv
587
+ # id,name,age
588
+ # 1,Alice,30
589
+ # 2,Bob,25
590
+ # 3,Charlie,28
591
+
592
+ # 8. Edit data (triggers Delta Lake MERGE transaction - atomic INSERT/UPDATE/DELETE!)
593
+ # Example: Update Alice's age to 31
594
+ sed -i 's/Alice,30/Alice,31/' /mnt/posixlake/data/data.csv
+ # (GNU sed shown; BSD/macOS sed needs an empty suffix: sed -i '' 's/Alice,30/Alice,31/' /mnt/posixlake/data/data.csv)
595
+ # → Executes: Delta Lake MERGE transaction (UPDATE operation)
596
+ cat /mnt/posixlake/data/data.csv
597
+ # id,name,age
598
+ # 1,Alice,31
599
+ # 2,Bob,25
600
+ # 3,Charlie,28
601
+
602
+ # Example: Delete Bob (id=2)
603
+ grep -v "2,Bob" /mnt/posixlake/data/data.csv > /tmp/temp_data.csv
604
+ cat /tmp/temp_data.csv > /mnt/posixlake/data/data.csv
605
+ # → Executes: Delta Lake MERGE transaction (DELETE operation)
606
+ cat /mnt/posixlake/data/data.csv
607
+ # id,name,age
608
+ # 1,Alice,31
609
+ # 3,Charlie,28
610
+
611
+ # 9. Truncate table (triggers Delta Lake DELETE ALL transaction!)
612
+ rm /mnt/posixlake/data/data.csv
613
+ # → Executes: Delta Lake DELETE ALL transaction
614
+ cat /mnt/posixlake/data/data.csv
615
+ # id,name,age
616
+ ```
617
+
618
+ #### Unmounting and Shutdown
619
+
620
+ ```bash
621
+ # Unmount filesystem
622
+ sudo umount /mnt/posixlake
623
+ ```
624
+
625
+ ```python
626
+ # Shutdown NFS server
627
+ nfs.shutdown()
628
+ ```
629
+
630
+ **How It Works:**
631
+ - **Read Operations** (`cat`, `grep`, `awk`, `wc`): NFS server queries Parquet files → converts to CSV on-demand → caches result
632
+ - **Append Operations** (`echo >>`): NFS server parses CSV → converts to RecordBatch → Delta Lake INSERT transaction
633
+ - **Overwrite Operations** (`sed -i`, `cat > file`): Detects INSERT/UPDATE/DELETE by comparing old vs new CSV → executes MERGE transaction (atomic INSERT/UPDATE/DELETE)
634
+ - **Delete Operations** (`rm file`): Triggers Delta Lake DELETE ALL transaction
635
+ - **No Special Drivers**: Uses OS built-in NFS client - works everywhere
636
+
637
+ ### Authentication & Security
638
+
639
+ ```python
640
+ from posixlake import DatabaseOps, Schema, Field, Credentials
641
+
642
+ # Create database with authentication enabled
643
+ schema = Schema(fields=[...])
644
+ db = DatabaseOps.create_with_auth("/path/to/db", schema, auth_enabled=True)
645
+
646
+ # Open with credentials
647
+ credentials = Credentials(username="admin", password="secret")
648
+ db = DatabaseOps.open_with_credentials("/path/to/db", credentials)
649
+
650
+ # User management
651
+ db.create_user("alice", "password123", role="admin")
652
+ db.delete_user("alice")
653
+
654
+ # Role-based access control
655
+ # Permissions checked automatically on all operations
656
+ ```
657
+
658
+ ### Backup & Restore
659
+
660
+ ```python
661
+ # Full backup
662
+ backup_path = db.backup("/path/to/backup")
663
+ print(f"Backup created: {backup_path}")
664
+
665
+ # Incremental backup
666
+ backup_path = db.backup_incremental("/path/to/backup")
667
+ print(f"Incremental backup created: {backup_path}")
668
+
669
+ # Restore
670
+ db.restore("/path/to/backup")
671
+ print("✓ Database restored")
672
+ ```
673
+
674
+ ### Monitoring
675
+
676
+ ```python
677
+ # Get real-time metrics
678
+ metrics = db.get_metrics()
679
+ print(f"Metrics: {metrics}")
680
+
681
+ # Health check
682
+ is_healthy = db.health_check()
683
+ print(f"Database healthy: {is_healthy}")
684
+
685
+ # Data skipping statistics
686
+ stats = db.get_data_skipping_stats()
687
+ print(f"Data skipping stats: {stats}")
688
+ ```
689
+
690
+ ---
691
+
692
+ ## API Reference
693
+
694
+ ### DatabaseOps
695
+
696
+ Main class for database operations.
697
+
698
+ #### Methods
699
+
700
+ | Method | Description | Returns |
701
+ |--------|-------------|---------|
702
+ | `create(path, schema)` | Create new database | `DatabaseOps` |
703
+ | `create_from_csv(db_path, csv_path)` | Create from CSV (auto schema) | `DatabaseOps` |
704
+ | `create_from_parquet(db_path, parquet_path)` | Create from Parquet | `DatabaseOps` |
705
+ | `open(path)` | Open existing database | `DatabaseOps` |
706
+ | `create_with_auth(path, schema, auth_enabled)` | Create with authentication | `DatabaseOps` |
707
+ | `open_with_credentials(path, credentials)` | Open with credentials | `DatabaseOps` |
708
+ | `create_with_s3(s3_path, schema, s3_config)` | Create on S3 | `DatabaseOps` |
709
+ | `open_with_s3(s3_path, s3_config)` | Open from S3 | `DatabaseOps` |
710
+ | `insert_json(json_data)` | Insert data from JSON | `u64` (rows inserted) |
711
+ | `insert_buffered_json(json_data)` | Buffered insert | `u64` (rows inserted) |
712
+ | `flush_write_buffer()` | Flush buffered writes | `None` |
713
+ | `merge_json(json_data, key_column)` | MERGE (UPSERT) operation | `str` (JSON metrics) |
714
+ | `query_json(sql)` | Execute SQL query | `str` (JSON results) |
715
+ | `query_json_at_version(sql, version)` | Time travel query by version | `str` (JSON results) |
716
+ | `query_json_at_timestamp(sql, timestamp)` | Time travel query by timestamp | `str` (JSON results) |
717
+ | `delete_rows_where(condition)` | Delete rows by condition | `u64` (rows deleted) |
718
+ | `optimize()` | Compact Parquet files | `str` (result) |
719
+ | `vacuum(retention_hours)` | Remove old files | `str` (result) |
720
+ | `zorder(columns)` | Multi-dimensional clustering | `str` (result) |
721
+ | `get_current_version()` | Get current version | `i64` |
722
+ | `get_version_history()` | Get version history | `list` |
723
+ | `get_data_skipping_stats()` | Get skipping statistics | `str` (JSON) |
724
+ | `get_metrics()` | Get real-time metrics | `str` (JSON) |
725
+ | `health_check()` | Health check | `bool` |
726
+ | `backup(path)` | Full backup | `str` (backup path) |
727
+ | `backup_incremental(path)` | Incremental backup | `str` (backup path) |
728
+ | `restore(path)` | Restore from backup | `None` |
729
+
730
+ ### Schema
731
+
732
+ Database schema definition.
733
+
734
+ ```python
735
+ from posixlake import Schema, Field
736
+
737
+ schema = Schema(fields=[
738
+ Field(name="id", data_type="Int32", nullable=False),
739
+ Field(name="name", data_type="String", nullable=False),
740
+ Field(name="age", data_type="Int32", nullable=True),
741
+ Field(name="salary", data_type="Float64", nullable=True),
742
+ ])
743
+ ```
744
+
745
+ #### Supported Data Types
746
+
747
+ **Primitive Types:**
748
+ - `Int8`, `Int16`, `Int32`, `Int64`
749
+ - `UInt8`, `UInt16`, `UInt32`, `UInt64`
750
+ - `Float32`, `Float64`
751
+ - `String`, `LargeUtf8`, `Binary`, `LargeBinary`
752
+ - `Boolean`
753
+ - `Date32`, `Date64`
754
+ - `Timestamp`
755
+
756
+ **Complex Types:**
757
+ - `Decimal128(precision,scale)` - e.g., `Decimal128(10,2)` for currency
758
+ - `List<ElementType>` - e.g., `List<Int32>`, `List<String>`
759
+ - `Map<KeyType,ValueType>` - e.g., `Map<String,Int64>`
760
+ - `Struct<field1:Type1,field2:Type2>` - e.g., `Struct<x:Int32,y:Int32>`
761
+
762
+ ### Field
763
+
764
+ Schema field definition.
765
+
766
+ ```python
767
+ # Simple types
768
+ Field(name="id", data_type="Int32", nullable=False)
769
+ Field(name="price", data_type="Decimal128(10,2)", nullable=False)
770
+
771
+ # Complex types
772
+ Field(name="tags", data_type="List<String>", nullable=True)
773
+ Field(name="metadata", data_type="Map<String,String>", nullable=True)
774
+ Field(name="address", data_type="Struct<city:String,zip:Int32>", nullable=True)
775
+ ```
776
+
777
+ ### NfsServer
778
+
779
+ NFS server for POSIX filesystem access.
780
+
781
+ ```python
782
+ nfs = NfsServer(db, port=12049)
783
+ nfs.is_ready() # Check if server is ready
784
+ nfs.shutdown() # Shutdown server
785
+ ```
786
+
787
+ ### S3Config
788
+
789
+ S3 configuration for object storage backend.
790
+
791
+ ```python
792
+ s3_config = S3Config(
793
+ endpoint="http://localhost:9000",
794
+ access_key_id="minioadmin",
795
+ secret_access_key="minioadmin",
796
+ region="us-east-1"
797
+ )
798
+ ```
799
+
800
+ ### PosixLakeError
801
+
802
+ Exception class for all posixlake errors.
803
+
804
+ ```python
805
+ from posixlake import PosixLakeError
806
+
807
+ try:
808
+ db.insert_json(data)
809
+ except PosixLakeError as e:
810
+ print(f"Error: {e}")
811
+ ```
812
+
813
+ #### Error Types
814
+
815
+ - `PosixLakeError.IoError` - I/O operations
816
+ - `PosixLakeError.SerializationError` - JSON/Arrow serialization
817
+ - `PosixLakeError.DeltaLakeError` - Delta Lake operations
818
+ - `PosixLakeError.InvalidOperation` - Invalid operations
819
+ - `PosixLakeError.QueryError` - SQL query errors
820
+ - `PosixLakeError.AuthenticationError` - Authentication failures
821
+ - `PosixLakeError.PermissionDenied` - Permission errors
822
+ - `PosixLakeError.SchemaError` - Schema-related errors
823
+ - `PosixLakeError.VersionError` - Version conflicts
824
+ - `PosixLakeError.StorageError` - Storage backend errors
825
+ - `PosixLakeError.NetworkError` - Network operations
826
+ - `PosixLakeError.TimeoutError` - Operation timeouts
827
+ - `PosixLakeError.NotFound` - Resource not found
828
+ - `PosixLakeError.AlreadyExists` - Resource already exists
829
+
830
+ ---
831
+
832
+ ## Performance
833
+
834
+ ### Buffered Inserts
835
+
836
+ **10x performance improvement** for small batch writes:
837
+
838
+ ```python
839
+ # Regular insert: 100 separate Delta Lake transactions
840
+ for i in range(100):
841
+ db.insert_json(f'[{{"id": {i}, "name": "User_{i}"}}]')
842
+ # Time: ~5-10 seconds (50-100ms per transaction)
843
+
844
+ # Buffered insert: ~1-2 batched transactions
845
+ for i in range(100):
846
+ db.insert_buffered_json(f'[{{"id": {i}, "name": "User_{i}"}}]')
847
+ db.flush_write_buffer()
848
+ # Time: ~0.5-1 second (10x faster!)
849
+ ```
850
+
851
+ **How It Works:**
852
+ - Buffers multiple small writes in memory
853
+ - Auto-flushes at 1000 rows (configurable in Rust; see the sketch below)
854
+ - Batches all buffered data into fewer Delta Lake transactions
855
+ - Reduces transaction overhead significantly
856
+
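+ A minimal sketch of what the 1000-row threshold means in practice (row counts are illustrative; the exact flush points depend on the Rust-side threshold):
+
+ ```python
+ import json
+
+ # 2,500 single-row batches: the buffer auto-flushes roughly every 1,000 rows,
+ # so only a handful of Delta Lake transactions are committed along the way.
+ for i in range(2500):
+     db.insert_buffered_json(json.dumps([{"id": i, "name": f"User_{i}"}]))
+
+ # Commit whatever is still sitting in the buffer (< 1,000 rows)
+ db.flush_write_buffer()
+ ```
+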
857
+ ### Efficient Operations
858
+
859
+ - Optimized data transfer between Rust and Python
860
+ - Arrow RecordBatches shared efficiently
861
+ - Minimal memory copying for large datasets
862
+
863
+ ### Async Operations
864
+
865
+ - Operations run on async runtime
866
+ - Synchronous Python API for ease of use
867
+ - Optimal concurrency for I/O-bound workloads
868
+
869
+ ---
870
+
871
+ ## Error Handling
872
+
873
+ All Rust errors are properly mapped to Python exceptions:
874
+
875
+ ```python
876
+ from posixlake import PosixLakeError
877
+
878
+ try:
879
+ db = DatabaseOps.create("/path/to/db", schema)
880
+ db.insert_json(data)
881
+ results = db.query_json("SELECT * FROM data")
882
+ except PosixLakeError.IoError as e:
883
+ print(f"I/O error: {e}")
884
+ except PosixLakeError.SerializationError as e:
885
+ print(f"Serialization error: {e}")
886
+ except PosixLakeError.DeltaLakeError as e:
887
+ print(f"Delta Lake error: {e}")
888
+ except PosixLakeError.InvalidOperation as e:
889
+ print(f"Invalid operation: {e}")
890
+ except PosixLakeError as e:
891
+ print(f"posixlake error: {e}")
892
+ ```
893
+
894
+ **Error Types:**
895
+ - All errors inherit from `PosixLakeError`
896
+ - Specific error types for different failure modes
897
+ - Comprehensive error messages with context
898
+ - Stack traces preserved from Rust
899
+
900
+ ---
901
+
902
+ ## Architecture
903
+
904
+ ### System Overview
905
+
906
+ ```
907
+ ┌─────────────────────────────────────────┐
908
+ │ Python Application │
909
+ │ from posixlake import DatabaseOps │
910
+ └──────────────┬──────────────────────────┘
911
+
912
+ ┌──────────────▼──────────────────────────┐
913
+ │ Python API Layer │
914
+ │ • Type conversion │
915
+ │ • Error handling │
916
+ │ • Async runtime bridge │
917
+ └──────────────┬──────────────────────────┘
918
+
919
+ ┌──────────────▼──────────────────────────┐
920
+ │ Rust Library (libposixlake.dylib) │
921
+ │ • DatabaseOps │
922
+ │ • Delta Lake operations │
923
+ │ • DataFusion SQL engine │
924
+ │ • NFS server │
925
+ └──────────────┬──────────────────────────┘
926
+
927
+ ┌──────────────▼──────────────────────────┐
928
+ │ Delta Lake Protocol │
929
+ │ • ACID transactions │
930
+ │ • Time travel │
931
+ │ • Parquet storage │
932
+ └─────────────────────────────────────────┘
933
+ ```
934
+
935
+ **Key Features:**
936
+ - **Type Safety**: Automatic type conversion between Rust and Python
937
+ - **Error Handling**: Comprehensive error mapping to Python exceptions
938
+ - **Efficient Data Transfer**: Optimized data sharing via Arrow
939
+ - **Async Support**: Async runtime for optimal performance
940
+ - **Memory Safety**: Rust's memory safety guarantees
941
+
942
+ ### Storage Backends
943
+
944
+ posixlake Python bindings support multiple storage backends:
945
+
946
+ - **Local Filesystem**: Standard directory paths
947
+ - **S3/MinIO**: Object storage with S3-compatible API
948
+ - **Unified API**: Same Python code works with both (see the sketch below)
949
+
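+ A minimal sketch of the unified API: the same helper runs unchanged against either backend, only the way the database handle is opened differs (paths and credentials are illustrative):
+
+ ```python
+ from posixlake import DatabaseOps, S3Config
+
+ def top_rows(db, limit=5):
+     """Backend-agnostic query helper (illustrative)."""
+     return db.query_json(f"SELECT * FROM data LIMIT {limit}")
+
+ # Local filesystem
+ local_db = DatabaseOps.open("/path/to/db")
+ print(top_rows(local_db))
+
+ # S3/MinIO - same helper, different handle
+ s3_db = DatabaseOps.open_with_s3(
+     "s3://bucket/db-path",
+     S3Config(
+         endpoint="http://localhost:9000",
+         access_key_id="minioadmin",
+         secret_access_key="minioadmin",
+         region="us-east-1",
+     ),
+ )
+ print(top_rows(s3_db))
+ ```
+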
950
+ ---
951
+
952
+ ## What Makes This Awesome
953
+
954
+ 1. **Performance**: Rust-powered engine with buffered inserts (~10x faster for small batches)
955
+ 2. **No Special Drivers**: NFS server uses OS built-in NFS client - zero installation
956
+ 3. **Unix Commands Trigger Delta Operations**: `cat` queries data, `grep` searches, `echo >>` triggers INSERT, `sed -i` triggers MERGE (UPDATE/DELETE) - all as ACID transactions
957
+ 4. **Standard Tools**: `grep`, `awk`, `sed`, `wc`, `sort` work on your data lake and trigger Delta Lake operations - no special libraries needed
958
+ 5. **Smart Batching**: Auto-flushes at 1000 rows, reducing transaction overhead
959
+ 6. **Delta Lake Compatible**: Tables readable by Spark, Databricks, and Athena immediately
960
+ 7. **Robust**: Comprehensive error handling, async support, and testing
961
+ 8. **Type Safety**: Complete type hints and comprehensive error handling
962
+ 9. **Efficient**: Optimized data transfer with minimal overhead
963
+ 10. **Unified Storage**: Same API works with local filesystem and S3
964
+
965
+ **Use Unix commands to query and trigger Delta Lake operations** - `cat` queries Parquet data, `grep` searches, `echo >>` triggers INSERT transactions, `sed -i` triggers MERGE (UPDATE/DELETE) transactions. No special libraries, no drivers, just mount and use standard Unix tools. Plus buffered inserts for 10x performance when loading many small batches.
966
+
967
+ ---
968
+
969
+ ## License
970
+
971
+ **Apache License 2.0**
972
+
973
+ Copyright 2025 posixlake Contributors
974
+
975
+ Licensed under the Apache License, Version 2.0 (the "License");
976
+ you may not use this file except in compliance with the License.
977
+
978
+ See [LICENSE.md](../../LICENSE.md) for the full license text.
979
+
980
+ ---
981
+
982
+ ## Contributing
983
+
984
+ Contributions welcome! Please follow these guidelines:
985
+
986
+ 1. **Write tests first** - TDD approach for all features
987
+ 2. **Run full suite** - Ensure all tests pass
988
+ 3. **Update documentation** - Keep README and docs up to date
989
+ 4. **Commit messages** - Use conventional commits
990
+
991
+ ---
992
+
993
+ ## Acknowledgments
994
+
995
+ Built with:
996
+
997
+ - [Rust](https://www.rust-lang.org/) - Systems programming language
998
+ - [Apache Arrow](https://arrow.apache.org/) - Columnar in-memory format
999
+ - [Apache Parquet](https://parquet.apache.org/) - Columnar file format
1000
+ - [DataFusion](https://datafusion.apache.org/) - Query engine
1001
+ - [Delta Lake](https://delta.io/) - Transaction log
1002
+ - [ObjectStore](https://docs.rs/object_store/) - Storage abstraction
1003
+
1004
+ ---
1005
+
1006
+ **Questions?** Open an [issue](https://github.com/npiesco/posixlake/issues)
1007
+
1008
+ **Like this project?** Star the repo and share with your data engineering team!
1009
+
1010
+ **PyPI Package:** https://pypi.org/project/posixlake/