cached-duckdb 0.2.0__tar.gz
- cached_duckdb-0.2.0/CHANGELOG.md +52 -0
- cached_duckdb-0.2.0/ENVIRONMENT_VARIABLES.md +179 -0
- cached_duckdb-0.2.0/LICENSE +21 -0
- cached_duckdb-0.2.0/MANIFEST.in +10 -0
- cached_duckdb-0.2.0/PKG-INFO +375 -0
- cached_duckdb-0.2.0/README.md +349 -0
- cached_duckdb-0.2.0/VERSION +1 -0
- cached_duckdb-0.2.0/__init__.py +48 -0
- cached_duckdb-0.2.0/cache_manager.py +621 -0
- cached_duckdb-0.2.0/cache_persistence.py +391 -0
- cached_duckdb-0.2.0/cache_query.py +173 -0
- cached_duckdb-0.2.0/cache_store.py +146 -0
- cached_duckdb-0.2.0/cache_ttl.py +245 -0
- cached_duckdb-0.2.0/cached_duckdb.egg-info/PKG-INFO +375 -0
- cached_duckdb-0.2.0/cached_duckdb.egg-info/SOURCES.txt +44 -0
- cached_duckdb-0.2.0/cached_duckdb.egg-info/dependency_links.txt +1 -0
- cached_duckdb-0.2.0/cached_duckdb.egg-info/requires.txt +12 -0
- cached_duckdb-0.2.0/cached_duckdb.egg-info/top_level.txt +1 -0
- cached_duckdb-0.2.0/cached_duckdb_USER_MANUAL.md +1155 -0
- cached_duckdb-0.2.0/config.py +242 -0
- cached_duckdb-0.2.0/errors.py +33 -0
- cached_duckdb-0.2.0/example.py +177 -0
- cached_duckdb-0.2.0/pyproject.toml +63 -0
- cached_duckdb-0.2.0/requirements.txt +2 -0
- cached_duckdb-0.2.0/setup.cfg +4 -0
- cached_duckdb-0.2.0/setup.py +63 -0
- cached_duckdb-0.2.0/table_utils.py +248 -0
- cached_duckdb-0.2.0/test_duckdb_libraries.py +1109 -0
- cached_duckdb-0.2.0/test_libs_clickhouse_config.py +511 -0
@@ -0,0 +1,52 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.2.0] - 2026-06-01

### Changed
- Standardized packaging to match deployment rules
- Added `license-files`, `classifiers`, and `cibuildwheel` config to pyproject.toml
- Replaced legacy setup.py with Cython-gated build pattern
- Added version lookup via `importlib.metadata` in `__init__.py`
- Added GitHub Actions publish workflow (`.github/workflows/publish.yml`)
- Updated MANIFEST.in to include all documentation files
- Bumped `requires-python` to `>=3.10`

### Added
- PersistenceCoordinator for external DB persistence
- `DuckDbCachePersistenceError` exception
- `cache_persistence.py` module

## [0.1.0] - 2026-05-13

### Added
- Initial release of cached_duckdb library
- Generic database/table API for in-memory DataFrame caching
- Two storage modes: single_db and per_table_db
- Atomic swap for safe concurrent writes
- TTL-based expiry with background cleanup thread
- Lazy stale flagging for active readers
- Scheduler-managed table support (bypass TTL)
- SQL query interface with WHERE clause filtering
- Cross-table JOIN support (single_db mode)
- Per-database configuration via JSON file
- Environment variable configuration
- Thread-safe operations with per-database/table locking
- Comprehensive error handling with custom exceptions
- Metadata queries (row count, columns, types, last updated)
- Manual cache invalidation (per table or entire database)
- Raw DuckDB connection access for advanced queries
- Graceful shutdown with connection cleanup

### Features
- Zero disk usage - pure in-memory storage
- Columnar format - 20-30% less RAM than pandas
- Single-pass SQL queries - filter + aggregate in one operation
- Safe concurrent reads during writes
- Configurable TTL per table or database
- Priority-based configuration resolution
- Background cleanup thread with configurable interval

@@ -0,0 +1,179 @@
# Environment Variables

This document lists all environment variables supported by cached_duckdb.

All variables use the `CACHED_DUCKDB_` prefix by default.

## Configuration Variables

### CACHED_DUCKDB_DEFAULT_MODE
- **Type:** String
- **Default:** `single_db`
- **Options:** `single_db` | `per_table_db`
- **Description:** Default storage mode for all databases
  - `single_db`: One connection per database (enables JOINs)
  - `per_table_db`: One connection per (database, table) pair (parallel writes)

### CACHED_DUCKDB_DEFAULT_TTL_MINUTES
- **Type:** Integer
- **Default:** `30`
- **Description:** Default time-to-live in minutes for cached data
- **Note:** Can be overridden per database/table in config file

### CACHED_DUCKDB_CLEANUP_INTERVAL_MINUTES
- **Type:** Integer
- **Default:** `5`
- **Description:** How often the background cleanup thread runs (in minutes)

### CACHED_DUCKDB_LOCK_TIMEOUT_SECONDS
- **Type:** Float
- **Default:** `30.0`
- **Description:** Timeout for acquiring write locks (in seconds)
- **Note:** Raises `DuckDbCacheLockError` if timeout exceeded

### CACHED_DUCKDB_CONFIG_FILE_PATH
- **Type:** String (file path)
- **Default:** `None`
- **Description:** Path to connector_config.json for per-database settings
- **Example:** `/path/to/connector_config.json`

### CACHED_DUCKDB_LOG_NAME
- **Type:** String
- **Default:** `cached_duckdb`
- **Description:** Logger name for this library
- **Note:** Use this to configure logging for cached_duckdb specifically

## Persistence Variables (v0.2.0+)

### CACHED_DUCKDB_PERSIST_BASE_PATH
- **Type:** String (directory path)
- **Default:** `None` (in-memory only, no persistence)
- **Description:** Base directory for file-based persistence. When set:
  - **Scenario 1 (DB-level):** Databases with a `persist_path` in connector_config.json (or all databases if no per-DB override) are stored as `{path}/{db_name}.duckdb` files instead of in-memory.
  - **Scenario 2 (Table-level):** Tables marked with `"persist": true` in connector_config.json are saved as `{path}/{db_name}/{table_name}.parquet` files after each `store()`.
- **Example:** `/data/cache` or `C:\data\cache`

### CACHED_DUCKDB_SERVICE_NAME
- **Type:** String
- **Default:** `default`
- **Description:** Service identifier used to namespace rows in the external DB snapshot table (`cached_duckdb_snapshots`). Allows multiple services to share the same external snapshot table.
- **Example:** `order_service`, `analytics_pipeline`

## Example Configuration

### Linux/macOS (.env file)
```bash
CACHED_DUCKDB_DEFAULT_MODE=single_db
CACHED_DUCKDB_DEFAULT_TTL_MINUTES=30
CACHED_DUCKDB_CLEANUP_INTERVAL_MINUTES=5
CACHED_DUCKDB_LOCK_TIMEOUT_SECONDS=30
CACHED_DUCKDB_CONFIG_FILE_PATH=/opt/config/connector_config.json
CACHED_DUCKDB_PERSIST_BASE_PATH=/data/cache
CACHED_DUCKDB_SERVICE_NAME=my_service
CACHED_DUCKDB_LOG_NAME=cached_duckdb
```

### Windows (PowerShell)
```powershell
$env:CACHED_DUCKDB_DEFAULT_MODE="single_db"
$env:CACHED_DUCKDB_DEFAULT_TTL_MINUTES="30"
$env:CACHED_DUCKDB_CLEANUP_INTERVAL_MINUTES="5"
$env:CACHED_DUCKDB_LOCK_TIMEOUT_SECONDS="30"
$env:CACHED_DUCKDB_CONFIG_FILE_PATH="C:\config\connector_config.json"
$env:CACHED_DUCKDB_PERSIST_BASE_PATH="C:\data\cache"
$env:CACHED_DUCKDB_SERVICE_NAME="my_service"
$env:CACHED_DUCKDB_LOG_NAME="cached_duckdb"
```

### Python Code
```python
import os

os.environ['CACHED_DUCKDB_DEFAULT_MODE'] = 'per_table_db'
os.environ['CACHED_DUCKDB_DEFAULT_TTL_MINUTES'] = '60'
os.environ['CACHED_DUCKDB_CLEANUP_INTERVAL_MINUTES'] = '10'
os.environ['CACHED_DUCKDB_PERSIST_BASE_PATH'] = '/data/cache'
os.environ['CACHED_DUCKDB_SERVICE_NAME'] = 'order_service'

from cached_duckdb import DuckDbCacheConfig, DuckDbCacheManager

config = DuckDbCacheConfig.from_env()
cache = DuckDbCacheManager(config)
```
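
As a rough illustration of what loading these variables involves, the sketch below parses the documented names with their documented types and defaults using only the standard library. The `load_settings` helper, its dict keys, and the choice to treat an empty value as unset are assumptions made for this example; `DuckDbCacheConfig.from_env()` may behave differently.

```python
import os

# Documented variable suffixes with their documented defaults and types.
_DEFAULTS = {
    "DEFAULT_MODE": ("single_db", str),
    "DEFAULT_TTL_MINUTES": (30, int),
    "CLEANUP_INTERVAL_MINUTES": (5, int),
    "LOCK_TIMEOUT_SECONDS": (30.0, float),
    "CONFIG_FILE_PATH": (None, str),
    "LOG_NAME": ("cached_duckdb", str),
    "PERSIST_BASE_PATH": (None, str),
    "SERVICE_NAME": ("default", str),
}
_VALID_MODES = {"single_db", "per_table_db"}


def load_settings(environ=None, prefix="CACHED_DUCKDB_"):
    """Hypothetical helper: read CACHED_DUCKDB_* values into a plain dict."""
    environ = os.environ if environ is None else environ
    settings = {}
    for suffix, (default, cast) in _DEFAULTS.items():
        raw = environ.get(prefix + suffix)
        # Assumption of this sketch: unset or empty means "use the default".
        settings[suffix.lower()] = default if raw in (None, "") else cast(raw)
    if settings["default_mode"] not in _VALID_MODES:
        raise ValueError(f"unknown storage mode: {settings['default_mode']!r}")
    return settings


settings = load_settings({"CACHED_DUCKDB_DEFAULT_TTL_MINUTES": "60"})
print(settings["default_ttl_minutes"], settings["lock_timeout_seconds"])  # 60 30.0
```

Passing a mapping instead of reading `os.environ` directly keeps the parsing easy to test without touching the real process environment.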

## Configuration Priority

Configuration is resolved in this order (highest to lowest):

1. **Per-table config** in connector_config.json (database → table → setting)
2. **Per-database config** in connector_config.json (database → setting)
3. **Environment variables** (CACHED_DUCKDB_*)
4. **Hardcoded defaults** in DuckDbCacheConfig
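
The four-level lookup above can be sketched as a small function. This is an illustration only: `resolve_ttl_minutes` and its parameters are hypothetical names, and the real resolver inside the library may be structured differently. The key names (`cache_ttl_minutes`, `default_cache_ttl_minutes`) are the ones used in this document's JSON examples.

```python
def resolve_ttl_minutes(file_config, database, table, env_ttl=None, hardcoded_default=30):
    """Hypothetical resolver: return the first TTL found, highest priority first."""
    db_section = file_config.get(database, {})
    table_section = db_section.get(table, {})
    candidates = (
        table_section.get("cache_ttl_minutes"),       # 1. per-table config
        db_section.get("default_cache_ttl_minutes"),  # 2. per-database config
        env_ttl,                                      # 3. environment variable
    )
    for value in candidates:
        # Compare against None so that an explicit TTL of 0 is still honoured.
        if value is not None:
            return value
    return hardcoded_default                          # 4. hardcoded default


file_config = {
    "client_abc": {
        "default_cache_ttl_minutes": 45,
        "sales_data": {"cache_ttl_minutes": 60},
    }
}
print(resolve_ttl_minutes(file_config, "client_abc", "sales_data", env_ttl=30))  # 60
print(resolve_ttl_minutes(file_config, "client_abc", "orders", env_ttl=30))      # 45
print(resolve_ttl_minutes(file_config, "client_xyz", "orders", env_ttl=30))      # 30
```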

### Example Priority Resolution

For the TTL of `database="client_abc"`, `table="sales_data"`:

The `//` comments in the JSON examples in this section are annotations only; standard JSON has no comment syntax, so leave them out of a real `connector_config.json`.

```json
// connector_config.json
{
  "client_abc": {
    "default_cache_ttl_minutes": 45,  // Database-level
    "sales_data": {
      "cache_ttl_minutes": 60  // Table-level (highest priority)
    }
  }
}
```

```bash
# .env
CACHED_DUCKDB_DEFAULT_TTL_MINUTES=30  # Environment (lower priority)
```

**Result:** `cache_ttl_minutes = 60` (from table-level config, which overrides both the database-level value of 45 and the environment value of 30)

### Persistence Configuration in connector_config.json (v0.2.0+)

```json
{
  "client_abc": {
    "persist_path": "/data/cache",  // Scenario 1: DB saved as /data/cache/client_abc.duckdb
    "sales_data": {
      "persist": true  // Scenario 2: table saved as {persist_base_path}/client_abc/sales_data.parquet
    }
  }
}
```

**Note:** If `persist_path` is set on the database AND `persist: true` is set on a table within that database, the file-based DB takes precedence: the table is already on disk inside the `.duckdb` file, so no separate Parquet file is created.
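
The two persistence scenarios described in this document (one `.duckdb` file per database, one Parquet file per persisted table) come down to a path derivation plus a precedence rule. The sketch below is a guess at that logic, not the library's code: `persistence_target` is a hypothetical helper, and it models only an explicit per-database `persist_path`, not the "all databases" fallback mentioned under `CACHED_DUCKDB_PERSIST_BASE_PATH`.

```python
from pathlib import PurePosixPath


def persistence_target(database, table, db_config, persist_base_path=None):
    """Hypothetical helper: where would this table end up on disk, if anywhere?

    Returns ("duckdb", path), ("parquet", path), or ("memory", None).
    """
    db_path = db_config.get("persist_path")
    table_cfg = db_config.get(table, {})
    if db_path is not None:
        # Scenario 1 wins: the table already lives inside the .duckdb file.
        return "duckdb", PurePosixPath(db_path) / f"{database}.duckdb"
    if table_cfg.get("persist") and persist_base_path is not None:
        # Scenario 2: one Parquet file per persisted table.
        return "parquet", PurePosixPath(persist_base_path) / database / f"{table}.parquet"
    return "memory", None


db_config = {"sales_data": {"persist": True}}
kind, path = persistence_target("client_abc", "sales_data", db_config, "/data/cache")
print(kind, path)  # parquet /data/cache/client_abc/sales_data.parquet
```

`PurePosixPath` is used so the printed paths look the same on every platform; real code that touches the file system would use `pathlib.Path`.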

## Logging Configuration

To configure logging for cached_duckdb:

```python
import logging

# Set log level
logging.getLogger('cached_duckdb').setLevel(logging.DEBUG)

# Add handler
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
))
logging.getLogger('cached_duckdb').addHandler(handler)
```

Or use an environment-based log name:

```bash
export CACHED_DUCKDB_LOG_NAME=my_cache_logger
```

```python
import logging
logging.getLogger('my_cache_logger').setLevel(logging.INFO)
```

@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 sreeyenan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

@@ -0,0 +1,375 @@
Metadata-Version: 2.4
Name: cached-duckdb
Version: 0.2.0
Summary: Fast in-memory DataFrame cache using DuckDB with SQL query interface
Author-email: sreeyenan <sreeyenanek@gmail.com>
Keywords: duckdb,cache,dataframe,sql,in-memory
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: duckdb>=0.10.0
Requires-Dist: pandas>=1.5.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Provides-Extra: protected
Requires-Dist: Cython>=3.0; extra == "protected"
Provides-Extra: all
Requires-Dist: Cython>=3.0; extra == "all"
Dynamic: license-file

# cached_duckdb

Fast in-memory DataFrame cache using DuckDB with SQL query interface.

## Overview

`cached_duckdb` replaces pandas dict-based caching with DuckDB in-memory connections for:
- **Columnar storage** - 20-30% less RAM than pandas
- **SQL queries** - Single-pass filter+aggregate operations
- **Concurrency** - Safe parallel reads during writes
- **Zero disk usage by default** - Pure in-memory like pandas unless persistence is configured (v0.2.0+)

## Key Features

- **Generic database/table API** - Works with any cache-based system
- **Two storage modes:**
  - `single_db`: One connection per database (enables cross-table JOINs)
  - `per_table_db`: One connection per (database, table) pair (fully parallel writes)
- **Atomic swap** - Readers see 100% old or 100% new data, never partial
- **TTL-based expiry** - Background cleanup with lazy stale flagging
- **Scheduler-managed tables** - Bypass TTL for scheduled updates
- **Thread-safe operations** - Per-database or per-table locking
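
To make the atomic-swap idea concrete, here is one common way to get "all old or all new" table replacement: load the new data into a staging table, then drop the old table and rename the staging table inside a single transaction. The sketch uses the standard library's `sqlite3` as a stand-in for DuckDB so that it runs without extra dependencies. Whether cached_duckdb uses a staging table and rename internally is an assumption, and `atomic_replace` is a hypothetical name.

```python
import sqlite3


def atomic_replace(conn, table, rows):
    """Load rows into a staging table, then swap it in within one transaction."""
    staging = f"{table}__staging"
    conn.execute(f'DROP TABLE IF EXISTS "{staging}"')
    conn.execute(f'CREATE TABLE "{staging}" (date TEXT, amount INTEGER)')
    conn.executemany(f'INSERT INTO "{staging}" VALUES (?, ?)', rows)
    # The swap itself: drop + rename are committed together or not at all.
    conn.execute("BEGIN")
    try:
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
        conn.execute(f'ALTER TABLE "{staging}" RENAME TO "{table}"')
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts sqlite3 in autocommit mode, so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
atomic_replace(conn, "sales_data", [("2026-01-01", 1000)])
atomic_replace(conn, "sales_data", [("2026-01-02", 2000), ("2026-01-03", 3000)])
print(conn.execute("SELECT COUNT(*) FROM sales_data").fetchone()[0])  # 2
```

The slow part (building the staging table) happens outside the transaction, so the window in which the live table is being changed stays short.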

## Installation

### From PyPI (Recommended)

```bash
pip install cached-duckdb
```

### Install Specific Version

```bash
pip install cached-duckdb==0.2.0
```

### With Optional Protected Build Extras

```bash
pip install "cached-duckdb[all]"
```

### From Local Source

```bash
pip install -r requirements.txt
```

Or install in development mode:

```bash
pip install -e .
```

Verify installed version:

```python
import cached_duckdb
print(cached_duckdb.__version__)
```

## Quick Start

### Basic Usage

```python
from cached_duckdb import DuckDbCacheManager
import pandas as pd

# Initialize cache (singleton)
cache = DuckDbCacheManager()

# Store DataFrame
df = pd.DataFrame({
    'date': ['2026-01-01', '2026-01-02'],
    'amount': [1000, 2000],
    'country': ['USA', 'UK']
})
cache.store(database="client_abc", table="sales_data", df=df)

# Query with SQL filtering (>= so the sample USA row with amount 1000 matches)
result = cache.query(
    database="client_abc",
    table="sales_data",
    sql_where="amount >= 1000 AND country = 'USA'",
    columns=["date", "amount"]
)
print(result)
```

### Advanced Queries

```python
# Get all data
df = cache.query(database="app_x", table="dataset_1")

# Filter with WHERE
df = cache.query(
    database="app_x",
    table="dataset_1",
    sql_where="age > 25 AND country = 'USA'"
)

# Select specific columns
df = cache.query(
    database="app_x",
    table="dataset_1",
    columns=["name", "age", "salary"]
)

# Limit results
df = cache.query(
    database="app_x",
    table="dataset_1",
    sql_where="date >= '2026-01-01'",
    limit=100
)
```
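
Conceptually, the `sql_where`, `columns`, and `limit` arguments map onto the parts of a single `SELECT` statement. The sketch below shows one plausible translation; `build_select` is a hypothetical helper and is not taken from the library's `cache_query.py`.

```python
def build_select(table, sql_where=None, columns=None, limit=None):
    """Hypothetical sketch of turning query() arguments into one SELECT."""

    def quote(identifier):
        # Double-quote identifiers, doubling any embedded quotes.
        return '"' + identifier.replace('"', '""') + '"'

    select_list = ", ".join(quote(c) for c in columns) if columns else "*"
    sql = f"SELECT {select_list} FROM {quote(table)}"
    if sql_where:
        # sql_where is spliced in verbatim, so it must come from trusted code.
        sql += f" WHERE {sql_where}"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


print(build_select("dataset_1", "date >= '2026-01-01'", ["name", "age"], 100))
# SELECT "name", "age" FROM "dataset_1" WHERE date >= '2026-01-01' LIMIT 100
```

Because a free-form WHERE string is inserted as-is, any interface of this shape should only receive filter text built by the application itself, never raw user input.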

### Cross-Table JOINs (single_db mode)

```python
# Execute raw SQL for complex queries.
# Assumes `sales_data` (with an `order_id` column) and `orders` are both cached in client_abc.
sql = """
    SELECT s.date, s.amount, o.customer_name
    FROM sales_data s
    JOIN orders o ON s.order_id = o.id
    WHERE s.amount > 1000
"""
result = cache.execute_sql(database="client_abc", sql=sql)
```

### Check if Data Exists

```python
if cache.exists(database="client_abc", table="sales_data"):
    # Data is ready and fresh
    df = cache.query(database="client_abc", table="sales_data")
else:
    # Data missing or stale - reload needed
    # (load_from_source() is a placeholder for your own loader)
    df = load_from_source()
    cache.store(database="client_abc", table="sales_data", df=df)
```

### TTL and Scheduler-Managed Tables

```python
# Store with custom TTL
cache.store(
    database="client_abc",
    table="sales_data",
    df=df,
    ttl_minutes=60  # Expires after 60 minutes
)

# Scheduler-managed table (no auto-expiry on reads)
cache.store(
    database="client_abc",
    table="sales_data",
    df=df,
    scheduler_managed=True  # Only scheduler updates it
)
```
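
The interplay of TTL expiry, the scheduler-managed bypass, and lazy stale flagging can be sketched with a small bookkeeping class. This is an illustration under assumptions: `TtlRegistry` here is a hypothetical stand-in, not the library's `TTLRegistry`, and the injectable `clock` exists only to make the example deterministic.

```python
import time


class TtlRegistry:
    """Hypothetical TTL bookkeeping; not the library's TTLRegistry class."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # (database, table) -> entry dict

    def record_store(self, database, table, ttl_minutes, scheduler_managed=False):
        self._entries[(database, table)] = {
            "stored_at": self._clock(),
            "ttl_seconds": ttl_minutes * 60,
            "scheduler_managed": scheduler_managed,
            "stale": False,
        }

    def is_fresh(self, database, table):
        entry = self._entries.get((database, table))
        if entry is None:
            return False
        if entry["scheduler_managed"]:
            return True  # TTL is bypassed; only the scheduler replaces the data
        expired = self._clock() - entry["stored_at"] >= entry["ttl_seconds"]
        return not (entry["stale"] or expired)

    def sweep(self):
        """What a background cleanup pass might do: flag expired entries as stale."""
        flagged = []
        for key, entry in self._entries.items():
            if entry["scheduler_managed"] or entry["stale"]:
                continue
            if self._clock() - entry["stored_at"] >= entry["ttl_seconds"]:
                entry["stale"] = True  # flag only; dropping data can wait for readers
                flagged.append(key)
        return flagged


now = [0.0]  # fake clock, in seconds
registry = TtlRegistry(clock=lambda: now[0])
registry.record_store("client_abc", "sales_data", ttl_minutes=60)
registry.record_store("client_abc", "live_feed", ttl_minutes=0, scheduler_managed=True)
now[0] = 3601.0  # just over an hour later
print(registry.is_fresh("client_abc", "sales_data"))  # False
print(registry.is_fresh("client_abc", "live_feed"))   # True
```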

### Invalidate Cache

```python
# Invalidate one table
cache.invalidate(database="client_abc", table="sales_data")

# Invalidate all tables for a database
cache.invalidate(database="client_abc")
```

### Get Metadata

```python
# Last updated timestamp
last_updated = cache.get_last_updated(database="client_abc", table="sales_data")
print(f"Last updated: {last_updated}")

# Table info
info = cache.get_table_info(database="client_abc", table="sales_data")
print(f"Rows: {info['row_count']}")
print(f"Columns: {info['columns']}")
print(f"Types: {info['column_types']}")
```

## Documentation

- [User Manual](cached_duckdb_USER_MANUAL.md)
- [Environment Variables](ENVIRONMENT_VARIABLES.md)
- [Changelog](CHANGELOG.md)

## Author

**sreeyenan** (sreeyenanek@gmail.com)

## Version

Current version: **0.2.0**

## Configuration

### Environment Variables

```bash
# Storage mode: single_db or per_table_db
CACHED_DUCKDB_DEFAULT_MODE=single_db

# Default TTL in minutes
CACHED_DUCKDB_DEFAULT_TTL_MINUTES=30

# Cleanup thread interval
CACHED_DUCKDB_CLEANUP_INTERVAL_MINUTES=5

# Lock timeout in seconds
CACHED_DUCKDB_LOCK_TIMEOUT_SECONDS=30

# Path to connector config file (optional)
CACHED_DUCKDB_CONFIG_FILE_PATH=/path/to/connector_config.json

# Logger name
CACHED_DUCKDB_LOG_NAME=cached_duckdb
```

### Per-Database Configuration File

Create `connector_config.json` for per-database settings:

```json
{
  "client_abc": {
    "duck_cache_mode": "per_table_db",
    "default_cache_ttl_minutes": 45,
    "sales_data": {
      "cache_ttl_minutes": 60,
      "scheduler_managed": false
    },
    "live_feed": {
      "cache_ttl_minutes": 0,
      "scheduler_managed": true
    }
  },
  "client_xyz": {
    "duck_cache_mode": "single_db"
  }
}
```

**Priority order:**
1. Per-table config in JSON file (highest)
2. Per-database config in JSON file
3. Environment variables
4. Hardcoded defaults (lowest)

### Load Configuration

```python
from cached_duckdb import DuckDbCacheConfig, DuckDbCacheManager

# From environment
config = DuckDbCacheConfig.from_env()
cache = DuckDbCacheManager(config)

# From dict
config = DuckDbCacheConfig.from_dict({
    "default_mode": "single_db",
    "default_cache_ttl_minutes": 30,
    "config_file_path": "/path/to/connector_config.json"
})
cache = DuckDbCacheManager(config)
```

## Storage Modes

### Mode A: single_db (Default)

- One DuckDB connection per `database`
- Multiple tables share the same connection
- Enables cross-table SQL JOINs
- Write contention: One lock per database

**Use when:** The database has few tables (< 20) or you need cross-table queries

### Mode B: per_table_db

- One DuckDB connection per `(database, table)` pair
- Each table is fully isolated
- Zero write contention between tables
- Fully parallel writes

**Use when:** The database has many tables (20+) or write concurrency is high
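
The difference in write contention between the two modes comes down to how the write lock is keyed. The sketch below illustrates that idea together with a lock timeout. `WriteLocks` and `LockTimeoutError` are hypothetical names (the latter standing in for `DuckDbCacheLockError`), and the library's actual locking code may differ.

```python
import threading


class LockTimeoutError(RuntimeError):
    """Stand-in for the library's DuckDbCacheLockError."""


class WriteLocks:
    """Hypothetical lock table keyed by storage mode."""

    def __init__(self, timeout_seconds=30.0):
        self._timeout = timeout_seconds
        self._locks = {}
        self._guard = threading.Lock()  # protects the dict of locks itself

    def _key(self, mode, database, table):
        # single_db: all tables of a database share one lock.
        # per_table_db: each (database, table) pair gets its own lock.
        return (database,) if mode == "single_db" else (database, table)

    def acquire(self, mode, database, table):
        key = self._key(mode, database, table)
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        if not lock.acquire(timeout=self._timeout):
            raise LockTimeoutError(f"could not lock {key} within {self._timeout}s")
        return lock


locks = WriteLocks(timeout_seconds=0.1)
held = locks.acquire("per_table_db", "client_abc", "sales_data")
other = locks.acquire("per_table_db", "client_abc", "orders")  # different lock, no wait
held.release()
other.release()
```

In `single_db` mode the second `acquire` above would have waited for the first lock, because both tables map to the same `("client_abc",)` key.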

## Architecture

```
DuckDbCacheManager (singleton)
├── CacheStore          - Atomic writes, table management
├── CacheQuery          - SQL queries, metadata
├── TTLRegistry         - TTL tracking, cleanup thread
└── CacheConfigResolver - Per-database config resolution
```

## Thread Safety

- **store()**: Write-locked per database or per table
- **query()**: Lock-free parallel reads
- **invalidate()**: Write-locked, waits for active readers
- **Background cleanup**: Minimal locking, uses stale flags

## Use Cases

- **Multi-tenant web APIs** - Cache per tenant with `database=tenant_id`
- **Analytics dashboards** - Fast in-memory OLAP queries
- **ETL pipelines** - Store intermediate DataFrames
- **Session managers** - Replace pandas dict caching
- **Microservices** - Shared cache library across services

## API Reference

### DuckDbCacheManager

- `store(database, table, df, ttl_minutes=None, scheduler_managed=False)` - Store DataFrame
- `query(database, table, sql_where=None, columns=None, limit=None)` - Query with filtering
- `execute_sql(database, sql)` - Execute raw SQL
- `exists(database, table)` - Check if exists and fresh
- `invalidate(database, table=None)` - Remove from cache
- `get_last_updated(database, table)` - Get timestamp
- `get_table_info(database, table)` - Get metadata
- `get_raw_connection(database, table=None)` - Get DuckDB connection
- `shutdown()` - Close all connections

### Exceptions

- `DuckDbCacheError` - Base exception
- `DuckDbCacheConfigError` - Configuration error
- `DuckDbCacheLockError` - Lock acquisition failed
- `DuckDbCacheNotFoundError` - Table not found
- `DuckDbCacheStaleError` - Data is stale
- `DuckDbCachePersistenceError` - Persistence failure (added in 0.2.0)

## License

MIT License - see LICENSE file