posixlake-0.1.6-cp311-cp311-win_amd64.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- posixlake/__init__.py +17 -0
- posixlake/posixlake.dll +0 -0
- posixlake/posixlake.py +3222 -0
- posixlake-0.1.6.dist-info/METADATA +1010 -0
- posixlake-0.1.6.dist-info/RECORD +7 -0
- posixlake-0.1.6.dist-info/WHEEL +5 -0
- posixlake-0.1.6.dist-info/top_level.txt +1 -0
posixlake-0.1.6.dist-info/METADATA
@@ -0,0 +1,1010 @@
Metadata-Version: 2.1
Name: posixlake
Version: 0.1.6
Summary: High-performance Delta Lake database with POSIX interface and Python bindings
Home-page: https://github.com/npiesco/posixlake
Author: posixlake Contributors
Author-email:
License: MIT
Project-URL: Bug Tracker, https://github.com/npiesco/posixlake/issues
Project-URL: Documentation, https://github.com/npiesco/posixlake#readme
Project-URL: Source Code, https://github.com/npiesco/posixlake
Keywords: database,delta-lake,sql,parquet,rust,datafusion,time-travel,acid,analytics
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Rust
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Libraries
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown

<div align="center">
<h1>posixlake Python Bindings</h1>
<p><strong>High-performance Delta Lake database with Python API and POSIX interface</strong></p>

<p><em>Python API for posixlake (File Store Database): access Delta Lake operations, SQL queries, and time travel, or use Unix commands (`cat`, `grep`, `awk`, `wc`, `head`, `tail`, `sort`, `cut`, `echo >>`, `sed -i`, `vim`, `mkdir`, `mv`, `cp`, `rmdir`, `rm`) to query and trigger Delta Lake transactions. Mount databases as POSIX filesystems where standard Unix tools execute ACID operations. Works with local filesystem directories and object storage/S3. Built on Rust for maximum performance.</em></p>

[](https://www.python.org)
[](https://pypi.org/project/posixlake/)
[](https://delta.io)
[](../../LICENSE.md)
[](https://www.rust-lang.org)

[](https://arrow.apache.org)
[](https://datafusion.apache.org)
[](.)
[](.)
</div>

---

**Key Features:**
- **Delta Lake Native**: Full ACID transactions with native `_delta_log/` format
- **SQL Queries**: DataFusion-powered SQL engine embedded in Python
- **Time Travel**: Query historical versions and timestamps
- **CSV/Parquet Import**: Create databases from CSV (auto schema inference) or Parquet files
- **Buffered Inserts**: 10x performance improvement for small batch writes
- **NFS Server**: Mount Delta Lake as POSIX filesystem - standard Unix tools work directly
- **Storage Backends**: Works with local filesystem and S3/MinIO - same unified API
- **Performance**: Rust-powered engine with buffered inserts (~10x faster for small batches)
- **No Special Drivers**: Uses OS built-in NFS client - zero installation
- **Delta Lake Compatible**: Tables readable by Spark, Databricks, and Athena immediately

---

## Installation

### From PyPI (Recommended)

```bash
pip install posixlake
```

**Requirements:**
- **Python 3.11+** (required for prebuilt wheels with native library)
- For other Python versions, install from source (see below)

**PyPI Package:** https://pypi.org/project/posixlake/

### From Source

```bash
# 1. Clone the repository
git clone https://github.com/npiesco/posixlake.git
cd posixlake

# 2. Build Rust library
cargo build --release

# 3. Generate Python API
cargo run --bin uniffi-bindgen -- generate \
    --library target/release/libposixlake.dylib \
    --language python \
    --out-dir bindings/python

# 4. Copy library
cp target/release/libposixlake.dylib bindings/python/

# 5. Install Python package
cd bindings/python
pip install -e .
```

**Prerequisites:**
- Python 3.8+ (3.11+ recommended for prebuilt wheels)
- Rust 1.70+ (for building from source)
- NFS client (built-in on macOS/Linux/Windows Pro)

---

## Quick Start

### Example 1: Basic Database Operations

```python
from posixlake import DatabaseOps, Schema, Field, PosixLakeError

# Create a schema
schema = Schema(fields=[
    Field(name="id", data_type="Int32", nullable=False),
    Field(name="name", data_type="String", nullable=False),
    Field(name="age", data_type="Int32", nullable=True),
    Field(name="salary", data_type="Float64", nullable=True),
])

# Create database on local filesystem
try:
    db = DatabaseOps.create("/path/to/db", schema)
    print("✓ Database created")
except PosixLakeError as e:
    print(f"✗ Error: {e}")

# Insert data (JSON format)
data = '[{"id": 1, "name": "Alice", "age": 30, "salary": 75000.0}]'
db.insert_json(data)

# Query with SQL
results = db.query_json("SELECT * FROM data WHERE age > 25")
print(results)
# [{"id": 1, "name": "Alice", "age": 30, "salary": 75000.0}]

# Delete rows
db.delete_rows_where("id = 1")
print("✓ Row deleted")
```

### Example 2: Buffered Insert (High Performance)

```python
from posixlake import DatabaseOps, Schema, Field
import json

schema = Schema(fields=[
    Field(name="id", data_type="Int32", nullable=False),
    Field(name="name", data_type="String", nullable=False),
    Field(name="email", data_type="String", nullable=False),
])

db = DatabaseOps.create("/path/to/db", schema)

# Insert many small batches efficiently (buffers up to 1000 rows)
print("Inserting 100 small batches using buffered insert...")
for i in range(100):
    db.insert_buffered_json(json.dumps([{
        "id": i,
        "name": f"User_{i}",
        "email": f"user{i}@example.com"
    }]))
    if (i + 1) % 20 == 0:
        print(f"  Buffered {i + 1}/100 batches...")

# Flush buffer to commit all data
print("\nFlushing write buffer...")
db.flush_write_buffer()
print("✓ All buffered data committed to Delta Lake")

# Result: ~1-2 Delta Lake transactions instead of 100!
# Performance improvement: ~10x faster for small batches
```

### Example 3: S3 / Object Storage Backend

```python
from posixlake import DatabaseOps, Schema, Field, S3Config

schema = Schema(fields=[
    Field(name="id", data_type="Int32", nullable=False),
    Field(name="name", data_type="String", nullable=False),
    Field(name="value", data_type="Float64", nullable=True),
])

# Create database on S3/MinIO
s3_config = S3Config(
    endpoint="http://localhost:9000",  # MinIO or AWS S3 endpoint
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    region="us-east-1"
)

db = DatabaseOps.create_with_s3("s3://bucket-name/db-path", schema, s3_config)

# Same API works with S3!
db.insert_json('[{"id": 1, "name": "Alice", "value": 123.45}]')
results = db.query_json("SELECT * FROM data WHERE value > 100")
print(results)

# All data stored in S3 with Delta Lake ACID transactions
```

### Example 4: POSIX Access via NFS Server

```python
from posixlake import DatabaseOps, Schema, Field, NfsServer
import time

# Create database
schema = Schema(fields=[
    Field(name="id", data_type="Int32", nullable=False),
    Field(name="name", data_type="String", nullable=False),
    Field(name="age", data_type="Int32", nullable=True),
])
db = DatabaseOps.create("/path/to/db", schema)

# Insert data
db.insert_json('[{"id": 1, "name": "Alice", "age": 30}, {"id": 2, "name": "Bob", "age": 25}]')

# Start NFS server on port 12049
nfs_port = 12049
nfs_server = NfsServer(db, nfs_port)
print(f"✓ NFS server started on port {nfs_port}")

# Wait for server to be ready
time.sleep(0.5)
if nfs_server.is_ready():
    print("✓ NFS server is ready!")
else:
    print("⚠ NFS server not ready, POSIX operations may fail")

# Mount filesystem (requires sudo - run this in a terminal)
# sudo mount_nfs -o nolocks,vers=3,tcp,port=12049,mountport=12049 localhost:/ /mnt/posixlake

# Now use standard Unix tools to query and trigger Delta Lake operations:
# $ cat /mnt/posixlake/data/data.csv          # Queries Parquet data, converts to CSV
# id,name,age
# 1,Alice,30
# 2,Bob,25
#
# $ grep "Alice" /mnt/posixlake/data/data.csv | awk -F',' '{print $2}'   # Search and process
# Alice
#
# $ wc -l /mnt/posixlake/data/data.csv        # Count records
# 3 /mnt/posixlake/data/data.csv
#
# $ echo "3,Charlie,28" >> /mnt/posixlake/data/data.csv       # Triggers Delta Lake INSERT transaction!
#
# $ sed -i 's/Alice,30/Alice,31/' /mnt/posixlake/data/data.csv   # Triggers Delta Lake MERGE (UPDATE) transaction!
#
# $ grep -v "Bob" /mnt/posixlake/data/data.csv > /tmp/temp && cat /tmp/temp > /mnt/posixlake/data/data.csv   # Triggers MERGE (DELETE) transaction!

# Shutdown NFS server when done
# nfs_server.shutdown()
```

### Example 5: Time Travel Queries

```python
from posixlake import DatabaseOps, Schema, Field
import time

schema = Schema(fields=[
    Field(name="id", data_type="Int32", nullable=False),
    Field(name="name", data_type="String", nullable=False),
])

db = DatabaseOps.create("/path/to/db", schema)

# Insert initial data
db.insert_json('[{"id": 1, "name": "Alice"}]')
version_1 = db.get_current_version()
print(f"Version 1: {version_1}")

# Insert more data
db.insert_json('[{"id": 2, "name": "Bob"}]')
version_2 = db.get_current_version()
print(f"Version 2: {version_2}")

# Query by version (historical data)
results_v1 = db.query_json_at_version("SELECT * FROM data", version_1)
print(f"Data at version {version_1}: {results_v1}")
# [{"id": 1, "name": "Alice"}]

results_v2 = db.query_json_at_version("SELECT * FROM data", version_2)
print(f"Data at version {version_2}: {results_v2}")
# [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# Query by timestamp
timestamp = int(time.time())
results = db.query_json_at_timestamp("SELECT * FROM data", timestamp)
print(f"Data at timestamp {timestamp}: {results}")
```

### Example 6: Import from CSV (Auto Schema Inference)

```python
from posixlake import DatabaseOps
import json

# Create database by importing CSV - schema is automatically inferred!
# Column types detected: Int64, Float64, Boolean, String
db = DatabaseOps.create_from_csv("/path/to/new_db", "/path/to/data.csv")

# Query the imported data
results = db.query_json("SELECT * FROM data LIMIT 5")
print(json.loads(results))

# Check inferred schema
schema = db.get_schema()
for field in schema.fields:
    print(f"  {field.name}: {field.data_type} (nullable={field.nullable})")
```

### Example 7: Import from Parquet

```python
from posixlake import DatabaseOps
import json

# Create database from existing Parquet file(s)
# Schema is read directly from Parquet metadata
db = DatabaseOps.create_from_parquet("/path/to/new_db", "/path/to/data.parquet")

# Supports glob patterns for multiple files
db = DatabaseOps.create_from_parquet("/path/to/db", "/data/*.parquet")

# Query the imported data
results = db.query_json("SELECT COUNT(*) as total FROM data")
print(json.loads(results))
```

### Example 8: Delta Lake Operations

```python
from posixlake import DatabaseOps

db = DatabaseOps.open("/path/to/db")

# OPTIMIZE: Compact small Parquet files into larger ones
optimize_result = db.optimize()
print(f"✓ OPTIMIZE completed: {optimize_result}")

# VACUUM: Remove old files (retention period in hours)
vacuum_result = db.vacuum(retention_hours=168)  # 7 days
print(f"✓ VACUUM completed: {vacuum_result}")

# Z-ORDER: Multi-dimensional clustering for better query performance
zorder_result = db.zorder(columns=["id", "name"])
print(f"✓ Z-ORDER completed: {zorder_result}")

# Get data skipping statistics
stats = db.get_data_skipping_stats()
print(f"Data skipping stats: {stats}")
```

---

## Core Features

### Database Operations

#### Creating and Opening Databases

```python
from posixlake import DatabaseOps, Schema, Field, S3Config

# Local filesystem with explicit schema
schema = Schema(fields=[
    Field(name="id", data_type="Int32", nullable=False),
    Field(name="name", data_type="String", nullable=False),
])
db = DatabaseOps.create("/path/to/db", schema)
db = DatabaseOps.open("/path/to/db")

# Import from CSV (auto schema inference)
db = DatabaseOps.create_from_csv("/path/to/db", "/path/to/data.csv")

# Import from Parquet (schema from metadata)
db = DatabaseOps.create_from_parquet("/path/to/db", "/path/to/data.parquet")
db = DatabaseOps.create_from_parquet("/path/to/db", "/data/*.parquet")  # glob pattern

# With authentication (see Authentication & Security below for Credentials)
db = DatabaseOps.create_with_auth("/path/to/db", schema, auth_enabled=True)
db = DatabaseOps.open_with_credentials("/path/to/db", credentials)

# S3 backend
s3_config = S3Config(
    endpoint="http://localhost:9000",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    region="us-east-1"
)
db = DatabaseOps.create_with_s3("s3://bucket/db-path", schema, s3_config)
db = DatabaseOps.open_with_s3("s3://bucket/db-path", s3_config)
```

#### Data Insertion

```python
import json

# Regular insert (one transaction per call)
db.insert_json('[{"id": 1, "name": "Alice"}]')

# Buffered insert (batches multiple writes)
db.insert_buffered_json('[{"id": 2, "name": "Bob"}]')
db.insert_buffered_json('[{"id": 3, "name": "Charlie"}]')
db.flush_write_buffer()  # Commit all buffered data

# MERGE (UPSERT) operation
merge_data = [
    {"id": 1, "name": "Alice Updated", "_op": "UPDATE"},
    {"id": 4, "name": "David", "_op": "INSERT"},
    {"id": 2, "_op": "DELETE"}
]
result = db.merge_json(json.dumps(merge_data), "id")
# Returns: {"rows_inserted": 1, "rows_updated": 1, "rows_deleted": 1}
```

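Since `merge_json` is documented to return its metrics as a JSON string, the result can be decoded like any other JSON value; a minimal sketch, assuming the `result` from the block above and the keys shown in its comment:

```python
import json

# Decode the MERGE metrics returned by db.merge_json(...)
metrics = json.loads(result)
print(f"inserted={metrics['rows_inserted']}, "
      f"updated={metrics['rows_updated']}, "
      f"deleted={metrics['rows_deleted']}")
```
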
#### SQL Queries

```python
# Basic query
results = db.query_json("SELECT * FROM data WHERE id > 0")

# Aggregations
results = db.query_json("SELECT COUNT(*) as count, AVG(age) as avg_age FROM data")

# Joins (if multiple tables)
results = db.query_json("""
    SELECT a.id, a.name, b.value
    FROM data a
    JOIN other_table b ON a.id = b.id
""")

# Time travel queries
results = db.query_json_at_version("SELECT * FROM data", version=5)
results = db.query_json_at_timestamp("SELECT * FROM data", timestamp=1234567890)
```

#### Row Deletion

```python
# Delete by condition
db.delete_rows_where("id = 5")
db.delete_rows_where("age < 18")
db.delete_rows_where("name LIKE '%test%'")

# Delete all rows (truncate)
db.delete_rows_where("1=1")
```

### Time Travel

posixlake supports Delta Lake's time travel feature, allowing you to query historical versions of your data:

```python
import time

# Get current version
current_version = db.get_current_version()
print(f"Current version: {current_version}")

# Query by version
results = db.query_json_at_version("SELECT * FROM data", version=10)

# Query by timestamp
timestamp = int(time.time()) - 3600  # 1 hour ago
results = db.query_json_at_timestamp("SELECT * FROM data", timestamp)

# Get version history
history = db.get_version_history()
for entry in history:
    print(f"Version {entry['version']}: {entry['timestamp']} - {entry['operation']}")
```

### Delta Lake Operations

#### OPTIMIZE (File Compaction)

```python
# Compact small Parquet files into larger ones for better query performance
result = db.optimize()
print(f"Files compacted: {result}")
```

#### VACUUM (Cleanup Old Files)

```python
# Remove old files (retention period in hours)
# Default: 168 hours (7 days)
result = db.vacuum(retention_hours=168)
print(f"Files removed: {result}")
```

#### Z-ORDER (Multi-dimensional Clustering)

```python
# Cluster data by multiple columns for better query performance
result = db.zorder(columns=["id", "name", "age"])
print(f"Z-ORDER completed: {result}")
```

#### Data Skipping Statistics

```python
# Get statistics for query optimization
stats = db.get_data_skipping_stats()
print(f"Data skipping stats: {stats}")
```

### NFS Server (POSIX Filesystem Access)

The NFS server allows you to mount your Delta Lake database as a standard POSIX filesystem. **Unix commands don't just read data - they trigger Delta Lake operations**: `cat` queries Parquet data, `grep` searches, `echo >>` triggers INSERT transactions, and `sed -i` triggers MERGE (UPDATE/DELETE) transactions. All operations are ACID-compliant Delta Lake transactions.

#### Starting the NFS Server

```python
from posixlake import DatabaseOps, NfsServer
import time

# Create/open database
db = DatabaseOps.open("/path/to/db")

# Start NFS server on port 12049
nfs = NfsServer(db, 12049)

# Wait for server to be ready
time.sleep(0.5)
if nfs.is_ready():
    print("✓ NFS server ready")
else:
    print("⚠ NFS server not ready")
```

#### Mounting the Filesystem

```bash
# Mount command (requires sudo)
sudo mount_nfs -o nolocks,vers=3,tcp,port=12049,mountport=12049 localhost:/ /mnt/posixlake

# Verify mount
ls -la /mnt/posixlake/
# data/
# schema.sql
# .query
```

#### Using POSIX Commands

Once mounted, your Delta Lake table is accessible like any other directory:

```bash
# 1. List directory contents
ls -la /mnt/posixlake/data/

# 2. Read all data as CSV
cat /mnt/posixlake/data/data.csv
# id,name,age
# 1,Alice,30
# 2,Bob,25

# 3. Search for specific records with grep
grep "Alice" /mnt/posixlake/data/data.csv
# 1,Alice,30

# 4. Process columns with awk
awk -F',' '{print $2, $3}' /mnt/posixlake/data/data.csv
# name age
# Alice 30
# Bob 25

# 5. Count lines/records with wc
wc -l /mnt/posixlake/data/data.csv
# 3 /mnt/posixlake/data/data.csv  (includes header)

# 6. Sort data by a column
sort -t',' -k2 /mnt/posixlake/data/data.csv  # Sort by name

# 7. Append new data (triggers Delta Lake INSERT transaction!)
echo "3,Charlie,28" >> /mnt/posixlake/data/data.csv
# → Executes: Delta Lake INSERT transaction with ACID guarantees
cat /mnt/posixlake/data/data.csv
# id,name,age
# 1,Alice,30
# 2,Bob,25
# 3,Charlie,28

# 8. Edit data (triggers Delta Lake MERGE transaction - atomic INSERT/UPDATE/DELETE!)
# Example: Update Alice's age to 31
sed -i 's/Alice,30/Alice,31/' /mnt/posixlake/data/data.csv
# → Executes: Delta Lake MERGE transaction (UPDATE operation)
cat /mnt/posixlake/data/data.csv
# id,name,age
# 1,Alice,31
# 2,Bob,25
# 3,Charlie,28

# Example: Delete Bob (id=2)
grep -v "2,Bob" /mnt/posixlake/data/data.csv > /tmp/temp_data.csv
cat /tmp/temp_data.csv > /mnt/posixlake/data/data.csv
# → Executes: Delta Lake MERGE transaction (DELETE operation)
cat /mnt/posixlake/data/data.csv
# id,name,age
# 1,Alice,31
# 3,Charlie,28

# 9. Truncate table (triggers Delta Lake DELETE ALL transaction!)
rm /mnt/posixlake/data/data.csv
# → Executes: Delta Lake DELETE ALL transaction
cat /mnt/posixlake/data/data.csv
# id,name,age
```

#### Unmounting and Shutdown

```bash
# Unmount filesystem
sudo umount /mnt/posixlake
```

```python
# Shutdown NFS server
nfs.shutdown()
```

**How It Works:**
- **Read Operations** (`cat`, `grep`, `awk`, `wc`): NFS server queries Parquet files → converts to CSV on-demand → caches result
- **Append Operations** (`echo >>`): NFS server parses CSV → converts to RecordBatch → Delta Lake INSERT transaction
- **Overwrite Operations** (`sed -i`, `cat > file`): Detects INSERT/UPDATE/DELETE by comparing old vs new CSV → executes MERGE transaction (atomic INSERT/UPDATE/DELETE)
- **Delete Operations** (`rm file`): Triggers Delta Lake DELETE ALL transaction
- **No Special Drivers**: Uses OS built-in NFS client - works everywhere

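As a quick way to confirm that a POSIX write really became a transaction, a minimal sketch, assuming the database and mount from Example 4 (the appended row is illustrative; `get_current_version` is the documented call):

```python
# A POSIX append through the NFS mount should show up as a new Delta Lake version.
before = db.get_current_version()

# Same effect as: echo "4,Dana,41" >> /mnt/posixlake/data/data.csv
with open("/mnt/posixlake/data/data.csv", "a") as f:
    f.write("4,Dana,41\n")

after = db.get_current_version()
print(f"version {before} -> {after}")  # expect the INSERT to have created a new version
```
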
### Authentication & Security

```python
from posixlake import DatabaseOps, Schema, Field, Credentials

# Create database with authentication enabled
schema = Schema(fields=[...])
db = DatabaseOps.create_with_auth("/path/to/db", schema, auth_enabled=True)

# Open with credentials
credentials = Credentials(username="admin", password="secret")
db = DatabaseOps.open_with_credentials("/path/to/db", credentials)

# User management
db.create_user("alice", "password123", role="admin")
db.delete_user("alice")

# Role-based access control
# Permissions checked automatically on all operations
```

### Backup & Restore

```python
# Full backup
backup_path = db.backup("/path/to/backup")
print(f"Backup created: {backup_path}")

# Incremental backup
backup_path = db.backup_incremental("/path/to/backup")
print(f"Incremental backup created: {backup_path}")

# Restore
db.restore("/path/to/backup")
print("✓ Database restored")
```

### Monitoring

```python
# Get real-time metrics
metrics = db.get_metrics()
print(f"Metrics: {metrics}")

# Health check
is_healthy = db.health_check()
print(f"Database healthy: {is_healthy}")

# Data skipping statistics
stats = db.get_data_skipping_stats()
print(f"Data skipping stats: {stats}")
```

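Both `get_metrics()` and `get_data_skipping_stats()` are listed in the API reference below as returning JSON strings, so they can be decoded for dashboards or health probes; a minimal sketch (the iteration pattern is illustrative - the exact keys are not documented here):

```python
import json

metrics = json.loads(db.get_metrics())
stats = json.loads(db.get_data_skipping_stats())

# Inspect whatever keys the engine reports
for name, value in metrics.items():
    print(f"{name}: {value}")
```
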
---

## API Reference

### DatabaseOps

Main class for database operations.

#### Methods

| Method | Description | Returns |
|--------|-------------|---------|
| `create(path, schema)` | Create new database | `DatabaseOps` |
| `create_from_csv(db_path, csv_path)` | Create from CSV (auto schema) | `DatabaseOps` |
| `create_from_parquet(db_path, parquet_path)` | Create from Parquet | `DatabaseOps` |
| `open(path)` | Open existing database | `DatabaseOps` |
| `create_with_auth(path, schema, auth_enabled)` | Create with authentication | `DatabaseOps` |
| `open_with_credentials(path, credentials)` | Open with credentials | `DatabaseOps` |
| `create_with_s3(s3_path, schema, s3_config)` | Create on S3 | `DatabaseOps` |
| `open_with_s3(s3_path, s3_config)` | Open from S3 | `DatabaseOps` |
| `insert_json(json_data)` | Insert data from JSON | `u64` (rows inserted) |
| `insert_buffered_json(json_data)` | Buffered insert | `u64` (rows inserted) |
| `flush_write_buffer()` | Flush buffered writes | `None` |
| `merge_json(json_data, key_column)` | MERGE (UPSERT) operation | `str` (JSON metrics) |
| `query_json(sql)` | Execute SQL query | `str` (JSON results) |
| `query_json_at_version(sql, version)` | Time travel query by version | `str` (JSON results) |
| `query_json_at_timestamp(sql, timestamp)` | Time travel query by timestamp | `str` (JSON results) |
| `delete_rows_where(condition)` | Delete rows by condition | `u64` (rows deleted) |
| `optimize()` | Compact Parquet files | `str` (result) |
| `vacuum(retention_hours)` | Remove old files | `str` (result) |
| `zorder(columns)` | Multi-dimensional clustering | `str` (result) |
| `get_current_version()` | Get current version | `i64` |
| `get_version_history()` | Get version history | `list` |
| `get_data_skipping_stats()` | Get skipping statistics | `str` (JSON) |
| `get_metrics()` | Get real-time metrics | `str` (JSON) |
| `health_check()` | Health check | `bool` |
| `backup(path)` | Full backup | `str` (backup path) |
| `backup_incremental(path)` | Incremental backup | `str` (backup path) |
| `restore(path)` | Restore from backup | `None` |

### Schema

Database schema definition.

```python
from posixlake import Schema, Field

schema = Schema(fields=[
    Field(name="id", data_type="Int32", nullable=False),
    Field(name="name", data_type="String", nullable=False),
    Field(name="age", data_type="Int32", nullable=True),
    Field(name="salary", data_type="Float64", nullable=True),
])
```

#### Supported Data Types

**Primitive Types:**
- `Int8`, `Int16`, `Int32`, `Int64`
- `UInt8`, `UInt16`, `UInt32`, `UInt64`
- `Float32`, `Float64`
- `String`, `LargeUtf8`, `Binary`, `LargeBinary`
- `Boolean`
- `Date32`, `Date64`
- `Timestamp`

**Complex Types:**
- `Decimal128(precision,scale)` - e.g., `Decimal128(10,2)` for currency
- `List<ElementType>` - e.g., `List<Int32>`, `List<String>`
- `Map<KeyType,ValueType>` - e.g., `Map<String,Int64>`
- `Struct<field1:Type1,field2:Type2>` - e.g., `Struct<x:Int32,y:Int32>`

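As a worked example of the type names above, a sketch of a schema mixing primitive and complex types (the field names and database path are illustrative only):

```python
from posixlake import DatabaseOps, Schema, Field

# Illustrative schema combining primitive and complex types from the lists above
order_schema = Schema(fields=[
    Field(name="order_id", data_type="Int64", nullable=False),
    Field(name="total", data_type="Decimal128(10,2)", nullable=False),
    Field(name="tags", data_type="List<String>", nullable=True),
    Field(name="attributes", data_type="Map<String,Int64>", nullable=True),
    Field(name="shipping", data_type="Struct<city:String,zip:Int32>", nullable=True),
    Field(name="created", data_type="Timestamp", nullable=True),
])

db = DatabaseOps.create("/path/to/orders_db", order_schema)
```
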
### Field

Schema field definition.

```python
# Simple types
Field(name="id", data_type="Int32", nullable=False)
Field(name="price", data_type="Decimal128(10,2)", nullable=False)

# Complex types
Field(name="tags", data_type="List<String>", nullable=True)
Field(name="metadata", data_type="Map<String,String>", nullable=True)
Field(name="address", data_type="Struct<city:String,zip:Int32>", nullable=True)
```

### NfsServer

NFS server for POSIX filesystem access.

```python
nfs = NfsServer(db, port=12049)
nfs.is_ready()   # Check if server is ready
nfs.shutdown()   # Shutdown server
```

### S3Config

S3 configuration for object storage backend.

```python
s3_config = S3Config(
    endpoint="http://localhost:9000",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    region="us-east-1"
)
```

### PosixLakeError

Exception class for all posixlake errors.

```python
from posixlake import PosixLakeError

try:
    db.insert_json(data)
except PosixLakeError as e:
    print(f"Error: {e}")
```

#### Error Types

- `PosixLakeError.IoError` - I/O operations
- `PosixLakeError.SerializationError` - JSON/Arrow serialization
- `PosixLakeError.DeltaLakeError` - Delta Lake operations
- `PosixLakeError.InvalidOperation` - Invalid operations
- `PosixLakeError.QueryError` - SQL query errors
- `PosixLakeError.AuthenticationError` - Authentication failures
- `PosixLakeError.PermissionDenied` - Permission errors
- `PosixLakeError.SchemaError` - Schema-related errors
- `PosixLakeError.VersionError` - Version conflicts
- `PosixLakeError.StorageError` - Storage backend errors
- `PosixLakeError.NetworkError` - Network operations
- `PosixLakeError.TimeoutError` - Operation timeouts
- `PosixLakeError.NotFound` - Resource not found
- `PosixLakeError.AlreadyExists` - Resource already exists

---

## Performance

### Buffered Inserts

**10x performance improvement** for small batch writes:

```python
# Regular insert: 100 separate Delta Lake transactions
for i in range(100):
    db.insert_json(f'[{{"id": {i}, "name": "User_{i}"}}]')
# Time: ~5-10 seconds (50-100ms per transaction)

# Buffered insert: ~1-2 batched transactions
for i in range(100):
    db.insert_buffered_json(f'[{{"id": {i}, "name": "User_{i}"}}]')
db.flush_write_buffer()
# Time: ~0.5-1 second (10x faster!)
```

**How It Works:**
- Buffers multiple small writes in memory
- Auto-flushes at 1000 rows (configurable in Rust)
- Batches all buffered data into fewer Delta Lake transactions
- Reduces transaction overhead significantly

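To check the ~10x claim on your own data, a minimal timing sketch, assuming the `db` handle from the examples above (the 100-row workload and timings are illustrative; `insert_json`, `insert_buffered_json`, and `flush_write_buffer` are the documented calls):

```python
import json
import time

rows = [{"id": i, "name": f"User_{i}"} for i in range(100)]

def regular():
    # One Delta Lake transaction per call
    for r in rows:
        db.insert_json(json.dumps([r]))

def buffered():
    # Commits when the buffer fills (1000 rows) or on explicit flush
    for r in rows:
        db.insert_buffered_json(json.dumps([r]))
    db.flush_write_buffer()

for label, fn in [("regular insert", regular), ("buffered insert", buffered)]:
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
```
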
### Efficient Operations

- Optimized data transfer between Rust and Python
- Arrow RecordBatches shared efficiently
- Minimal memory copying for large datasets

### Async Operations

- Operations run on async runtime
- Synchronous Python API for ease of use
- Optimal concurrency for I/O-bound workloads

---

## Error Handling

All Rust errors are properly mapped to Python exceptions:

```python
from posixlake import PosixLakeError

try:
    db = DatabaseOps.create("/path/to/db", schema)
    db.insert_json(data)
    results = db.query_json("SELECT * FROM data")
except PosixLakeError.IoError as e:
    print(f"I/O error: {e}")
except PosixLakeError.SerializationError as e:
    print(f"Serialization error: {e}")
except PosixLakeError.DeltaLakeError as e:
    print(f"Delta Lake error: {e}")
except PosixLakeError.InvalidOperation as e:
    print(f"Invalid operation: {e}")
except PosixLakeError as e:
    print(f"posixlake error: {e}")
```

**Error Types:**
- All errors inherit from `PosixLakeError`
- Specific error types for different failure modes
- Comprehensive error messages with context
- Stack traces preserved from Rust

---

## Architecture

### System Overview

```
┌─────────────────────────────────────────┐
│           Python Application            │
│   from posixlake import DatabaseOps     │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│           Python API Layer              │
│   • Type conversion                     │
│   • Error handling                      │
│   • Async runtime bridge                │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│   Rust Library (libposixlake.dylib)     │
│   • DatabaseOps                         │
│   • Delta Lake operations               │
│   • DataFusion SQL engine               │
│   • NFS server                          │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│          Delta Lake Protocol            │
│   • ACID transactions                   │
│   • Time travel                         │
│   • Parquet storage                     │
└─────────────────────────────────────────┘
```

**Key Features:**
- **Type Safety**: Automatic type conversion between Rust and Python
- **Error Handling**: Comprehensive error mapping to Python exceptions
- **Efficient Data Transfer**: Optimized data sharing via Arrow
- **Async Support**: Async runtime for optimal performance
- **Memory Safety**: Rust's memory safety guarantees

### Storage Backends

posixlake Python bindings support multiple storage backends:

- **Local Filesystem**: Standard directory paths
- **S3/MinIO**: Object storage with S3-compatible API
- **Unified API**: Same Python code works with both

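A minimal sketch of what the unified API means in practice, using only the constructors from the API reference (the `open_database` helper is illustrative, not part of posixlake):

```python
from posixlake import DatabaseOps, S3Config

def open_database(path, s3_config=None):
    """Open a posixlake database from a local path or an s3:// URI."""
    if path.startswith("s3://"):
        return DatabaseOps.open_with_s3(path, s3_config)
    return DatabaseOps.open(path)

# Same downstream code regardless of backend
local_db = open_database("/path/to/db")
s3_db = open_database("s3://bucket/db-path",
                      S3Config(endpoint="http://localhost:9000",
                               access_key_id="minioadmin",
                               secret_access_key="minioadmin",
                               region="us-east-1"))
print(local_db.query_json("SELECT COUNT(*) AS n FROM data"))
print(s3_db.query_json("SELECT COUNT(*) AS n FROM data"))
```
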
---

## What Makes This Awesome

1. **Performance**: Rust-powered engine with buffered inserts (~10x faster for small batches)
2. **No Special Drivers**: NFS server uses OS built-in NFS client - zero installation
3. **Unix Commands Trigger Delta Operations**: `cat` queries data, `grep` searches, `echo >>` triggers INSERT, `sed -i` triggers MERGE (UPDATE/DELETE) - all as ACID transactions
4. **Standard Tools**: `grep`, `awk`, `sed`, `wc`, `sort` work on your data lake and trigger Delta Lake operations - no special libraries needed
5. **Smart Batching**: Auto-flushes at 1000 rows, reducing transaction overhead
6. **Delta Lake Compatible**: Tables readable by Spark, Databricks, and Athena immediately
7. **Robust**: Comprehensive error handling, async support, and testing
8. **Type Safety**: Complete type hints and comprehensive error handling
9. **Efficient**: Optimized data transfer with minimal overhead
10. **Unified Storage**: Same API works with local filesystem and S3

**Use Unix commands to query and trigger Delta Lake operations** - `cat` queries Parquet data, `grep` searches, `echo >>` triggers INSERT transactions, `sed -i` triggers MERGE (UPDATE/DELETE) transactions. No special libraries, no drivers, just mount and use standard Unix tools. Plus buffered inserts for 10x performance when loading many small batches.

---

## License

**Apache License 2.0**

Copyright 2025 posixlake Contributors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

See [LICENSE.md](../../LICENSE.md) for the full license text.

---

## Contributing

Contributions welcome! Please follow these guidelines:

1. **Write tests first** - TDD approach for all features
2. **Run full suite** - Ensure all tests pass
3. **Update documentation** - Keep README and docs up to date
4. **Commit messages** - Use conventional commits

---

## Acknowledgments

Built with:

- [Rust](https://www.rust-lang.org/) - Systems programming language
- [Apache Arrow](https://arrow.apache.org/) - Columnar in-memory format
- [Apache Parquet](https://parquet.apache.org/) - Columnar file format
- [DataFusion](https://datafusion.apache.org/) - Query engine
- [Delta Lake](https://delta.io/) - Transaction log
- [ObjectStore](https://docs.rs/object_store/) - Storage abstraction

---

**Questions?** Open an [issue](https://github.com/npiesco/posixlake/issues)

**Like this project?** Star the repo and share with your data engineering team!

**PyPI Package:** https://pypi.org/project/posixlake/