kelp-core 0.0.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- kelp_core-0.0.1/PKG-INFO +314 -0
- kelp_core-0.0.1/README.md +298 -0
- kelp_core-0.0.1/pyproject.toml +111 -0
- kelp_core-0.0.1/src/kelp/__init__.py +15 -0
- kelp_core-0.0.1/src/kelp/__main__.py +4 -0
- kelp_core-0.0.1/src/kelp/catalog/__init__.py +21 -0
- kelp_core-0.0.1/src/kelp/catalog/abac_ddl.py +66 -0
- kelp_core-0.0.1/src/kelp/catalog/api.py +137 -0
- kelp_core-0.0.1/src/kelp/catalog/function_ddl.py +115 -0
- kelp_core-0.0.1/src/kelp/catalog/metric_view_ddl.py +310 -0
- kelp_core-0.0.1/src/kelp/catalog/uc_adapter.py +357 -0
- kelp_core-0.0.1/src/kelp/catalog/uc_diff.py +235 -0
- kelp_core-0.0.1/src/kelp/catalog/uc_models.py +132 -0
- kelp_core-0.0.1/src/kelp/catalog/uc_query_builder.py +383 -0
- kelp_core-0.0.1/src/kelp/cli/__init__.py +0 -0
- kelp_core-0.0.1/src/kelp/cli/catalog.py +493 -0
- kelp_core-0.0.1/src/kelp/cli/cli.py +309 -0
- kelp_core-0.0.1/src/kelp/cli/init.py +130 -0
- kelp_core-0.0.1/src/kelp/config/__init__.py +3 -0
- kelp_core-0.0.1/src/kelp/config/catalog.py +65 -0
- kelp_core-0.0.1/src/kelp/config/catalog_spec.py +83 -0
- kelp_core-0.0.1/src/kelp/config/lifecycle.py +219 -0
- kelp_core-0.0.1/src/kelp/config/project.py +197 -0
- kelp_core-0.0.1/src/kelp/config/runtime.py +101 -0
- kelp_core-0.0.1/src/kelp/config/settings.py +243 -0
- kelp_core-0.0.1/src/kelp/config/vars.py +111 -0
- kelp_core-0.0.1/src/kelp/constants.py +3 -0
- kelp_core-0.0.1/src/kelp/models/__init__.py +15 -0
- kelp_core-0.0.1/src/kelp/models/abac.py +58 -0
- kelp_core-0.0.1/src/kelp/models/catalog.py +238 -0
- kelp_core-0.0.1/src/kelp/models/function.py +125 -0
- kelp_core-0.0.1/src/kelp/models/jsonschema.py +56 -0
- kelp_core-0.0.1/src/kelp/models/metric_view.py +64 -0
- kelp_core-0.0.1/src/kelp/models/project_config.py +161 -0
- kelp_core-0.0.1/src/kelp/models/runtime_context.py +38 -0
- kelp_core-0.0.1/src/kelp/models/table.py +363 -0
- kelp_core-0.0.1/src/kelp/pipelines/__init__.py +20 -0
- kelp_core-0.0.1/src/kelp/pipelines/api.py +146 -0
- kelp_core-0.0.1/src/kelp/pipelines/streaming_tables.py +446 -0
- kelp_core-0.0.1/src/kelp/pipelines/utils.py +16 -0
- kelp_core-0.0.1/src/kelp/service/__init__.py +3 -0
- kelp_core-0.0.1/src/kelp/service/pipeline_manager.py +370 -0
- kelp_core-0.0.1/src/kelp/service/table_manager.py +525 -0
- kelp_core-0.0.1/src/kelp/service/yaml_manager.py +775 -0
- kelp_core-0.0.1/src/kelp/tables/__init__.py +29 -0
- kelp_core-0.0.1/src/kelp/tables/api.py +133 -0
- kelp_core-0.0.1/src/kelp/transformations/__init__.py +13 -0
- kelp_core-0.0.1/src/kelp/transformations/functions.py +138 -0
- kelp_core-0.0.1/src/kelp/transformations/schema.py +527 -0
- kelp_core-0.0.1/src/kelp/utils/__init__.py +1 -0
- kelp_core-0.0.1/src/kelp/utils/common.py +80 -0
- kelp_core-0.0.1/src/kelp/utils/databricks.py +182 -0
- kelp_core-0.0.1/src/kelp/utils/dict_parser.py +100 -0
- kelp_core-0.0.1/src/kelp/utils/jinja_parser.py +152 -0
- kelp_core-0.0.1/src/kelp/utils/logging.py +28 -0
- kelp_core-0.0.1/src/kelp/utils/yaml_parser.py +11 -0
kelp_core-0.0.1/PKG-INFO
ADDED
@@ -0,0 +1,314 @@
Metadata-Version: 2.3
Name: kelp-core
Version: 0.0.1
Summary: Metadata Toolkit for Databricks Spark and Declarative Pipelines
Author: BenSchr
Requires-Dist: databricks-sdk>=0.80.0
Requires-Dist: jinja2>=3.1.6
Requires-Dist: pydantic<3
Requires-Dist: pyyaml>=6.0
Requires-Dist: typer>=0.23.0
Requires-Python: >=3.12
Project-URL: Homepage, https://github.com/benschr/kelp-core
Project-URL: Documentation, https://benschr.github.io/kelp-core/
Project-URL: Repository, https://github.com/benschr/kelp-core
Description-Content-Type: text/markdown

```
██╗  ██╗███████╗██╗     ██████╗
██║ ██╔╝██╔════╝██║     ██╔══██╗
█████╔╝ █████╗  ██║     ██████╔╝
██╔═██╗ ██╔══╝  ██║     ██╔═══╝
██║  ██╗███████╗███████╗██║
╚═╝  ╚═╝╚══════╝╚══════╝╚═╝
Metadata Toolkit for Databricks Spark and Declarative Pipelines
```
Kelp is a framework designed to simplify the management of data pipelines, quality checks, and table configurations. Follow the instructions below to set up Kelp in your environment and start building robust data solutions.

Documentation: [https://benschr.github.io/kelp-core/](https://benschr.github.io/kelp-core/)

## Why Kelp?
Kelp provides a metadata and transformation layer for Databricks Spark and Spark Declarative Pipelines (SDP). It lets you define data models, quality checks, and transformations in structured YAML while offering Python utilities for advanced logic. With Kelp you can:

### Metadata management
- Define models, metric views, functions, and ABAC policies in readable, maintainable YAML
- Keep local metadata synchronized with Unity Catalog for improved governance and discoverability
- Use variables and targets for environment-specific configuration
- Inherit directory-level settings and tags across models

### Spark Declarative Pipelines (SDP)
- Inject metadata into SDP decorators with minimal boilerplate
- Optionally use DQX quality checks instead of SDP expectations
- Apply a quarantine pattern for validation failures
- Sync metadata to Unity Catalog after pipeline runs
- Easily inject catalog and schema names for tables and functions
- Sync descriptions and tags from metadata to tables and columns without requiring the Spark schema to match exactly
- Use a low-level API (no decorators) to stay robust against SDP syntax or feature changes

### Extra utilities
- Composable DataFrame transformations for schema enforcement and function application
- CLI tools for project management and metadata synchronization
- Metric views for defining business metrics and dimensions in metadata
- ABAC policies for row- and column-level access control defined in metadata and applied in code and the catalog
- Reusable function definitions in metadata that can be referenced from code and ABAC policies for consistent logic and easier maintenance

## Installation

To install Kelp, use `uv`, `pip`, or the package manager of your choice:

```bash
uv add kelp-core==0.0.1
```

```bash
pip install kelp-core==0.0.1
```

## Initialization

After installing `kelp`, initialize a new Kelp project in your desired directory by running the following command:

```bash
kelp init .
```

This creates a `kelp_project.yml` file in the current directory, which is the main configuration file for your Kelp project. You can customize this file to specify your project's settings, variables, and file paths.

```markdown
kelp_project.yml # (1)!
kelp_metadata/ # (2)!
  models/**/*.yml
  metrics/**/*.yml
  functions/**/*.yml
  abacs/**/*.yml
```

1. This is where your main project configuration file lives. Here you can set global settings, variables, and other configurations for your Kelp project.
2. This directory stores your model, metric, function, and ABAC definitions in YAML format. You can organize them in subdirectories as needed (e.g., by environment, team, or domain).

Example structure:
```markdown
kelp_project.yml
kelp_metadata/
  models/
    bronze/
      bronze_customers.yml
    silver/
      silver_customers.yml
    gold/
      gold_customers.yml
  metrics/
    customer_metrics.yml
  functions/
    functions.yml
    sql/
      mask_ssn.sql
  abacs/
    policies.yml
```

## Set Up Targets and Base Configurations

Targets in Kelp represent different environments or configurations for your pipelines (e.g., development, staging, production). Define targets in your `kelp_project.yml` file under the `targets` section. Each target can have its own settings, such as catalog and schema variables, as well as other environment-specific configurations.

```yaml
kelp_project:

  models_path: "./kelp_metadata/models"
  models:
    +catalog: ${ catalog } # (1)!
    bronze:
      +schema: kelp_bronze
    silver:
      +schema: kelp_silver
    gold:
      +schema: kelp_gold
    +tags:
      kelp_managed: "" # (2)!

  metrics_path: "./kelp_metadata/metrics"
  metric_views:
    +catalog: ${ catalog }
    +schema: kelp_gold
    +tags:
      kelp_managed: ""

  functions_path: "./kelp_metadata/functions"
  functions:
    +catalog: ${ security_catalog } # (4)!
    +schema: ${ security_schema }

  abacs_path: "./kelp_metadata/abacs"
  abacs: {}

  vars:
    default_catalog: my_catalog
    default_schema: my_schema
    default_security_catalog: security_catalog
    default_security_schema: security_schema

  targets:
    dev:
      vars:
        catalog: ${default_catalog}_dev # (3)!
        schema: ${default_schema}_dev
        security_catalog: ${default_security_catalog}_dev
        security_schema: ${default_security_schema}_dev
    prod:
      vars:
        catalog: ${default_catalog}_prod
        schema: ${default_schema}_prod
        security_catalog: ${default_security_catalog}_prod
        security_schema: ${default_security_schema}_prod
```

1. Set up directory-level configurations with `+` that can be inherited by all models and metric views in that directory.
2. This sets a tag on all models in this project.
3. You can override variables for each target.
4. Functions often live in a separate security schema/catalog and can be configured independently.
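The `${ ... }` references in the target variables resolve through substitution, so chained values like `${default_catalog}_dev` expand to `my_catalog_dev`. As a rough illustration of how that kind of resolution can work (a plain-Python sketch, not kelp's actual implementation), repeated template substitution is enough:

```python
import re

def resolve(value: str, variables: dict[str, str]) -> str:
    """Repeatedly substitute ${ var } references until none remain.

    Assumes no variable refers to itself (which would loop forever).
    """
    pattern = re.compile(r"\$\{\s*(\w+)\s*\}")
    while (match := pattern.search(value)):
        value = value[:match.start()] + variables[match.group(1)] + value[match.end():]
    return value

# Resolving a target-level variable from a project-level default:
variables = {"default_catalog": "my_catalog"}
print(resolve("${default_catalog}_dev", variables))  # my_catalog_dev
```

The same idea extends naturally to resolving a whole `vars` mapping in dependency order.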
## Next Steps

Explore Kelp's comprehensive guides to get the most out of the framework:

| Guide | Overview |
|-------|----------|
| [Spark Declarative Pipelines (SDP)](guides/sdp.md) | Integrate Kelp with Databricks SDP using decorators and the low-level API |
| [Normal Spark (Non-SDP)](guides/normal_spark.md) | Use Kelp in standard Spark jobs with `kelp.tables`, DDL, and DQX |
| [Sync Metadata with Your Catalog](guides/catalog.md) | Keep local metadata in sync with Unity Catalog |
| [DataFrame Transformations](guides/transformations.md) | Use composable transformations like `apply_schema()` and `apply_func()` |
| [Project Configuration](guides/project_config.md) | Master `kelp_project.yml` configuration, hierarchies, and targets |
| [CLI Reference](guides/cli.md) | Command-line tools for project management and metadata sync |
| [Functions](guides/functions.md) | Define reusable SQL and Python functions in Unity Catalog |
| [ABAC Policies](guides/abacs.md) | Implement row and column access control |
| [Metric Views](guides/metric_views.md) | Define business metrics and dimensions |

## Build Transformations

Kelp provides utilities to transform data using DataFrame transformations that can be chained together:

- **Schema enforcement** - Apply and enforce schemas from metadata via `apply_schema()`
- **Function application** - Apply Unity Catalog functions via `apply_func()`

Use Kelp's composable transformations in your pipelines:

```python
from kelp.transformations import apply_schema, apply_func
import kelp.pipelines as kp

@kp.table()
def silver_customers():
    df = spark.readStream.table(kp.ref("bronze_customers"))

    return (
        df
        .transform(apply_schema("silver_customers"))
        .transform(apply_func(
            func_name="normalize_email",
            new_column="email_clean",
            parameters="email"
        ))
    )
```

Learn more in the [DataFrame Transformations](guides/transformations.md) guide.
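The pattern behind composable transformations like the ones above is a factory that returns a `DataFrame -> DataFrame` callable, which Spark's `DataFrame.transform()` then applies and hands back for further chaining. Here is a minimal, Spark-free sketch of that pattern in plain Python (names and behavior are illustrative, not kelp's implementation):

```python
from typing import Callable

Row = dict
Transform = Callable[[list[Row]], list[Row]]

def make_apply_func(func: Callable[[str], str], column: str, new_column: str) -> Transform:
    """Factory: return a transformation that derives new_column from column."""
    def _transform(rows: list[Row]) -> list[Row]:
        return [{**row, new_column: func(row[column])} for row in rows]
    return _transform

def transform(rows: list[Row], fn: Transform) -> list[Row]:
    """Mimic DataFrame.transform(): apply fn and return the result for chaining."""
    return fn(rows)

rows = [{"email": "  Alice@Example.COM "}]
result = transform(rows, make_apply_func(lambda s: s.strip().lower(), "email", "email_clean"))
print(result[0]["email_clean"])  # alice@example.com
```

Because each factory returns a self-contained function, transformations can be built once from metadata and reused across pipelines.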
## Define Functions, Metrics, and Policies

Kelp supports multiple metadata objects beyond tables:

- **`kelp_functions`** - SQL/Python Unity Catalog functions (define once, use in code and ABAC)
- **`kelp_metric_views`** - Business metrics for analytics and dashboards
- **`kelp_abacs`** - Row filters and column masking (attribute-based access control)

Example function:

```yaml
kelp_functions:
  - name: normalize_email
    language: SQL
    parameters:
      - name: email
        data_type: STRING
    returns_data_type: STRING
    body: lower(trim(email))
```

Example metric view:

```yaml
kelp_metric_views:
  - name: customer_monthly_revenue
    catalog: ${ catalog }
    schema: ${ metric_schema }
    definition:
      measures:
        - name: total_revenue
          expr: SUM(amount)
        - name: order_count
          expr: COUNT(*)
      dimensions:
        - name: order_month
          expr: DATE_TRUNC('MONTH', order_date)
      source_table: ${ catalog }.gold.orders
```

Learn more in the [Functions](guides/functions.md), [Metric Views](guides/metric_views.md), and [ABAC Policies](guides/abacs.md) guides.
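For symmetry with the function and metric view examples, an ABAC policy entry might look roughly like the following. The field names here are illustrative assumptions, not the confirmed `kelp_abacs` schema; consult the ABAC Policies guide for the actual format:

```yaml
# Hypothetical sketch: field names are assumptions, not the confirmed schema.
kelp_abacs:
  - name: mask_customer_ssn
    type: COLUMN_MASK          # e.g. a column mask, as opposed to a row filter
    function: mask_ssn         # reuses a function defined under kelp_functions
    applies_to:
      - table: silver_customers
        column: ssn
```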
## Use the Kelp CLI

The Kelp CLI provides commands for project management and metadata synchronization:

```bash
# Initialize a new project
uv run kelp init project ./my_project

# Generate JSON schema for IDE support
uv run kelp json-schema --output kelp_json_schema.json

# Sync metadata from Databricks tables to YAML
uv run kelp catalog sync-from-catalog "catalog.schema.table" --output models/table.yml

# Validate project configuration
uv run kelp validate --target prod
```

Learn more in the [CLI Reference](guides/cli.md).

## Sync Metadata to Unity Catalog

After your pipeline creates tables, sync metadata (descriptions, tags, constraints) to the catalog:

```python
import kelp.catalog as kc

kc.init("kelp_project.yml", target="prod")

# Sync functions first (before pipeline runs)
for query in kc.sync_functions():
    spark.sql(query)

# Sync tables, metric views, and ABAC policies (after pipeline runs)
for query in kc.sync_catalog():
    spark.sql(query)
```

Learn more in the [Sync Metadata with Your Catalog](guides/catalog.md) guide.

## Environment Variables

If you frequently reuse a specific target and project path, you can set them as environment variables:

```bash
export KELP_TARGET=prod
export KELP_PROJECT_FILE=/path/to/kelp_project.yml

# Now commands use these defaults
uv run kelp validate
uv run kelp catalog sync-from-catalog "catalog.schema.table"
```
kelp_core-0.0.1/README.md
ADDED
@@ -0,0 +1,298 @@
(Content identical to the README embedded in PKG-INFO above.)