sql-glider 0.1.8__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- sql_glider-0.1.8.dist-info/METADATA +893 -0
- sql_glider-0.1.8.dist-info/RECORD +34 -0
- sql_glider-0.1.8.dist-info/WHEEL +4 -0
- sql_glider-0.1.8.dist-info/entry_points.txt +9 -0
- sql_glider-0.1.8.dist-info/licenses/LICENSE +201 -0
- sqlglider/__init__.py +3 -0
- sqlglider/_version.py +34 -0
- sqlglider/catalog/__init__.py +30 -0
- sqlglider/catalog/base.py +99 -0
- sqlglider/catalog/databricks.py +255 -0
- sqlglider/catalog/registry.py +121 -0
- sqlglider/cli.py +1589 -0
- sqlglider/dissection/__init__.py +17 -0
- sqlglider/dissection/analyzer.py +767 -0
- sqlglider/dissection/formatters.py +222 -0
- sqlglider/dissection/models.py +112 -0
- sqlglider/global_models.py +17 -0
- sqlglider/graph/__init__.py +42 -0
- sqlglider/graph/builder.py +349 -0
- sqlglider/graph/merge.py +136 -0
- sqlglider/graph/models.py +289 -0
- sqlglider/graph/query.py +287 -0
- sqlglider/graph/serialization.py +107 -0
- sqlglider/lineage/__init__.py +10 -0
- sqlglider/lineage/analyzer.py +1631 -0
- sqlglider/lineage/formatters.py +335 -0
- sqlglider/templating/__init__.py +51 -0
- sqlglider/templating/base.py +103 -0
- sqlglider/templating/jinja.py +163 -0
- sqlglider/templating/registry.py +124 -0
- sqlglider/templating/variables.py +295 -0
- sqlglider/utils/__init__.py +11 -0
- sqlglider/utils/config.py +155 -0
- sqlglider/utils/file_utils.py +38 -0
@@ -0,0 +1,893 @@

Metadata-Version: 2.4
Name: sql-glider
Version: 0.1.8
Summary: SQL Utility Toolkit for better understanding, use, and governance of your queries in a native environment.
Project-URL: Homepage, https://github.com/rycowhi/sql-glider/
Project-URL: Repository, https://github.com/rycowhi/sql-glider/
Project-URL: Documentation, https://github.com/rycowhi/sql-glider/
Project-URL: Issues, https://github.com/rycowhi/sql-glider/issues
Author-email: Ryan Whitcomb <ryankwhitcomb@gmail.com>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: data-governance,data-lineage,lineage,sql,sqlglot
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: SQL
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: jinja2>=3.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: rustworkx>=0.15.0
Requires-Dist: sqlglot[rs]>=25.0.0
Requires-Dist: typer>=0.9.0
Provides-Extra: databricks
Requires-Dist: databricks-sdk>=0.20.0; extra == 'databricks'
Description-Content-Type: text/markdown

# SQL Glider

SQL Utility Toolkit for better understanding, use, and governance of your queries in a native environment.

## Overview

SQL Glider provides column-level and table-level lineage analysis for SQL queries using SQLGlot. It operates on standalone SQL files without requiring a full project setup, making it well suited to ad-hoc analysis, data governance, and understanding query dependencies.

## Features

- **Forward Lineage:** Trace output columns back to their source tables and columns
- **Reverse Lineage:** Find which output columns are affected by a source column (impact analysis)
- **Query Dissection:** Decompose SQL into components (CTEs, subqueries, UNION branches) for unit testing
- **Table Extraction:** List all tables in SQL files with usage type (INPUT/OUTPUT) and object type (TABLE/VIEW/CTE)
- **Multi-level Tracing:** Automatically handles CTEs, subqueries, and complex expressions
- **Graph-Based Lineage:** Build and query lineage graphs across thousands of SQL files
- **Multiple Output Formats:** Text (human-readable), JSON (machine-readable), CSV (spreadsheet-ready)
- **Dialect Support:** Works with Spark, PostgreSQL, Snowflake, BigQuery, MySQL, and many more SQL dialects
- **File Export:** Save lineage results to files for documentation or further processing

## Installation

SQL Glider is available on PyPI and can be installed with pip or uv. Python 3.11+ is required.

```bash
# Install with pip
pip install sql-glider

# Or install with uv
uv pip install sql-glider
```

After installation, the `sqlglider` command is available:

```bash
sqlglider lineage query.sql
```

### Development Setup

If you want to contribute or run from source:

```bash
# Clone the repository
git clone https://github.com/rycowhi/sql-glider.git
cd sql-glider

# Install dependencies with uv
uv sync

# Run from source
uv run sqlglider lineage <sql_file>
```

## Quick Start

### Forward Lineage (Source Tracing)

Find out where your output columns come from:

```bash
# Analyze all output columns
uv run sqlglider lineage query.sql

# Analyze a specific output column
uv run sqlglider lineage query.sql --column customer_name
```

**Example Output:**
```
Query 0: SELECT customer_name, o.order_total FROM customers c JOIN orders o ...
+---------------------------------+
| Output Column | Source Column   |
|---------------+-----------------|
| customer_name | c.customer_name |
+---------------------------------+
Total: 1 row(s)
```

This shows that the output column `customer_name` in Query 0 comes from `c.customer_name` (the `customer_name` column of the `customers` table, referenced through its alias `c`).

### Reverse Lineage (Impact Analysis)

Find out which output columns are affected by a source column:

```bash
# Find outputs affected by a source column
uv run sqlglider lineage query.sql --source-column orders.customer_id
```

**Example Output:**
```
Query 0: SELECT customer_id, segment FROM ...
+-----------------------------------------+
| Output Column      | Source Column      |
|--------------------+--------------------|
| orders.customer_id | orders.customer_id |
+-----------------------------------------+
Total: 1 row(s)
```

This shows that if `orders.customer_id` changes, it will impact the output column `customer_id` in Query 0.

## Usage Examples

### Basic Column Lineage

```bash
# Forward lineage for all columns
uv run sqlglider lineage query.sql

# Forward lineage for a specific column
uv run sqlglider lineage query.sql --column order_total

# Reverse lineage (impact analysis)
uv run sqlglider lineage query.sql --source-column orders.customer_id
```

### Different Output Formats

```bash
# JSON output
uv run sqlglider lineage query.sql --output-format json

# CSV output
uv run sqlglider lineage query.sql --output-format csv

# Export to file
uv run sqlglider lineage query.sql --output-format json --output-file lineage.json
```

### Table-Level Lineage

```bash
# Show which tables are used
uv run sqlglider lineage query.sql --level table
```

### Table Extraction

List all tables involved in SQL files with usage and type information:

```bash
# List all tables in a SQL file
uv run sqlglider tables overview query.sql

# JSON output with detailed table info
uv run sqlglider tables overview query.sql --output-format json

# Export to CSV
uv run sqlglider tables overview query.sql --output-format csv --output-file tables.csv
```

### Pull DDL from Remote Catalogs

Fetch DDL definitions from remote data catalogs (e.g., Databricks Unity Catalog):

```bash
# Pull DDL for all tables used in a SQL file (outputs to stdout)
uv run sqlglider tables pull query.sql --catalog-type databricks

# Save DDL files to a folder (one file per table)
uv run sqlglider tables pull query.sql -c databricks -o ./ddl/

# List available catalog providers
uv run sqlglider tables pull --list
```

**Note:** Requires optional dependencies. Install with: `pip install sql-glider[databricks]`

**Example `tables overview` Output (JSON):**
```json
{
  "queries": [{
    "query_index": 0,
    "tables": [
      {"name": "customers", "usage": "INPUT", "object_type": "UNKNOWN"},
      {"name": "orders", "usage": "INPUT", "object_type": "UNKNOWN"}
    ]
  }]
}
```

**Table Usage Types:**
- `INPUT`: Table is read from (SELECT, JOIN, subqueries)
- `OUTPUT`: Table is written to (INSERT, CREATE TABLE/VIEW, UPDATE)
- `BOTH`: Table is both read from and written to

**Object Types:**
- `TABLE`: CREATE TABLE or DROP TABLE statement
- `VIEW`: CREATE VIEW or DROP VIEW statement
- `CTE`: Common Table Expression (WITH clause)
- `UNKNOWN`: Cannot determine type from SQL alone
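
The JSON above is easy to post-process with the standard library alone. A minimal sketch (the payload is inlined here, but it would normally be read from `--output-file` or captured stdout) that collects every table a set of queries reads from:

```python
import json

# Payload in the shape documented for `tables overview --output-format json`.
overview = json.loads("""
{
  "queries": [{
    "query_index": 0,
    "tables": [
      {"name": "customers", "usage": "INPUT", "object_type": "UNKNOWN"},
      {"name": "orders", "usage": "INPUT", "object_type": "UNKNOWN"}
    ]
  }]
}
""")

def input_tables(overview: dict) -> set[str]:
    """Every table that is read from (usage INPUT or BOTH)."""
    return {
        t["name"]
        for q in overview["queries"]
        for t in q["tables"]
        if t["usage"] in ("INPUT", "BOTH")
    }

print(sorted(input_tables(overview)))  # ['customers', 'orders']
```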

### Query Dissection

Decompose SQL queries into constituent parts for unit testing and analysis:

```bash
# Dissect a SQL file (text output)
uv run sqlglider dissect query.sql

# JSON output with full component details
uv run sqlglider dissect query.sql --output-format json

# CSV output for spreadsheet analysis
uv run sqlglider dissect query.sql --output-format csv

# Export to file
uv run sqlglider dissect query.sql -f json -o dissected.json

# With templating support
uv run sqlglider dissect query.sql --templater jinja --var schema=analytics

# From stdin
echo "WITH cte AS (SELECT id FROM users) SELECT * FROM cte" | uv run sqlglider dissect
```

**Example Input:**
```sql
WITH order_totals AS (
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
)
INSERT INTO analytics.summary
SELECT * FROM order_totals WHERE total > 100
```

**Example Output (JSON):**
```json
{
  "queries": [{
    "query_index": 0,
    "statement_type": "INSERT",
    "total_components": 3,
    "components": [
      {
        "component_type": "CTE",
        "component_index": 0,
        "name": "order_totals",
        "sql": "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id",
        "is_executable": true,
        "dependencies": [],
        "location": "WITH clause"
      },
      {
        "component_type": "TARGET_TABLE",
        "component_index": 1,
        "name": "analytics.summary",
        "sql": "analytics.summary",
        "is_executable": false,
        "location": "INSERT INTO target"
      },
      {
        "component_type": "SOURCE_QUERY",
        "component_index": 2,
        "sql": "SELECT * FROM order_totals WHERE total > 100",
        "is_executable": true,
        "dependencies": ["order_totals"],
        "location": "INSERT source SELECT"
      }
    ]
  }]
}
```

**Extracted Component Types:**
- `CTE`: Common Table Expressions from WITH clause
- `MAIN_QUERY`: The primary SELECT statement
- `SUBQUERY`: Nested SELECT in FROM clause
- `SCALAR_SUBQUERY`: Single-value subquery in SELECT list, WHERE, HAVING
- `TARGET_TABLE`: Output table for INSERT/CREATE/MERGE (not executable)
- `SOURCE_QUERY`: SELECT within DML/DDL statements
- `UNION_BRANCH`: Individual SELECT in UNION/UNION ALL

**Use Cases:**
- Unit test CTEs and subqueries individually
- Extract DQL from CTAS, CREATE VIEW, and INSERT statements
- Analyze query structure and component dependencies
- Break apart complex queries for understanding
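
To act on that output in a test harness, the executable components can be ordered so that each CTE is exercised before anything that depends on it. A sketch (the component list is hand-copied from the JSON example above; the ordering logic is this document's illustration, not part of SQL Glider):

```python
# Components in the shape shown in the dissect JSON example above.
components = [
    {"name": "order_totals", "component_type": "CTE",
     "sql": "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id",
     "is_executable": True, "dependencies": []},
    {"name": "analytics.summary", "component_type": "TARGET_TABLE",
     "sql": "analytics.summary", "is_executable": False, "dependencies": []},
    {"name": None, "component_type": "SOURCE_QUERY",
     "sql": "SELECT * FROM order_totals WHERE total > 100",
     "is_executable": True, "dependencies": ["order_totals"]},
]

def runnable_in_order(components: list[dict]) -> list[dict]:
    """Return executable components ordered so dependencies come first."""
    ordered, seen = [], set()
    pending = [c for c in components if c["is_executable"]]
    while pending:
        progressed = False
        for c in list(pending):
            if set(c["dependencies"]) <= seen:
                ordered.append(c)
                seen.add(c["name"])
                pending.remove(c)
                progressed = True
        if not progressed:  # cyclic or unresolved dependency
            raise ValueError("unresolvable dependencies")
    return ordered

for c in runnable_in_order(components):
    print(c["component_type"], "->", c["sql"])
```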

### Different SQL Dialects

```bash
# PostgreSQL
uv run sqlglider lineage query.sql --dialect postgres

# Snowflake
uv run sqlglider lineage query.sql --dialect snowflake

# BigQuery
uv run sqlglider lineage query.sql --dialect bigquery
```

### Multi-Query Files

SQL Glider automatically detects and analyzes multiple SQL statements in a single file:

```bash
# Analyze all queries in a file
uv run sqlglider lineage multi_query.sql

# Filter to only queries that reference a specific table
uv run sqlglider lineage multi_query.sql --table customers

# Analyze a specific column across all queries
uv run sqlglider lineage multi_query.sql --column customer_id

# Reverse lineage across all queries (impact analysis)
uv run sqlglider lineage multi_query.sql --source-column orders.customer_id
```

**Example multi-query file:**
```sql
-- multi_query.sql
SELECT customer_id, customer_name FROM customers;

SELECT order_id, customer_id, order_total FROM orders;

INSERT INTO customer_orders
SELECT c.customer_id, c.customer_name, o.order_id
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
```

**Output includes the query index for each statement:**
```
Query 0: SELECT customer_id, customer_name FROM customers
+---------------------------------------------------+
| Output Column           | Source Column           |
|-------------------------+-------------------------|
| customers.customer_id   | customers.customer_id   |
| customers.customer_name | customers.customer_name |
+---------------------------------------------------+
Total: 2 row(s)

Query 1: SELECT order_id, customer_id, order_total FROM orders
+-----------------------------------------+
| Output Column      | Source Column      |
|--------------------+--------------------|
| orders.customer_id | orders.customer_id |
| orders.order_id    | orders.order_id    |
| orders.order_total | orders.order_total |
+-----------------------------------------+
Total: 3 row(s)

Query 2: INSERT INTO customer_orders ...
+-----------------------------------------+
| Output Column      | Source Column      |
|--------------------+--------------------|
...
```

### Graph-Based Lineage (Cross-File Analysis)

For analyzing lineage across multiple SQL files, SQL Glider provides graph commands:

```bash
# Build a lineage graph from a single file
uv run sqlglider graph build query.sql -o graph.json

# Build from multiple files
uv run sqlglider graph build query1.sql query2.sql query3.sql -o graph.json

# Build from a directory (recursively finds all .sql files)
uv run sqlglider graph build ./queries/ -r -o graph.json

# Build from a manifest CSV file
uv run sqlglider graph build --manifest manifest.csv -o graph.json

# Merge multiple graphs into one
uv run sqlglider graph merge graph1.json graph2.json -o merged.json

# Query upstream dependencies (find all sources for a column)
uv run sqlglider graph query graph.json --upstream orders.customer_id

# Query downstream dependencies (find all columns affected by a source)
uv run sqlglider graph query graph.json --downstream customers.id
```

**Example Upstream Query Output:**
```
Sources for 'order_totals.total'
+---------------------------------------------------------------------------------------------+
| Column | Table  | Hops | Root | Leaf | Paths                               | File           |
|--------+--------+------+------+------+-------------------------------------+----------------|
| amount | orders | 1    | Y    | N    | orders.amount -> order_totals.total | test_graph.sql |
+---------------------------------------------------------------------------------------------+

Total: 1 column(s)
```

**Example Downstream Query Output:**
```
Affected Columns for 'orders.amount'
+---------------------------------------------------------------------------------------------------+
| Column | Table        | Hops | Root | Leaf | Paths                               | File           |
|--------+--------------+------+------+------+-------------------------------------+----------------|
| total  | order_totals | 1    | N    | Y    | orders.amount -> order_totals.total | test_graph.sql |
+---------------------------------------------------------------------------------------------------+

Total: 1 column(s)
```

**Output Fields:**
- **Root**: `Y` if the column has no upstream dependencies (source column)
- **Leaf**: `Y` if the column has no downstream dependencies (final output)
- **Paths**: All paths from the dependency to the queried column

**Manifest File Format:**
```csv
file_path,dialect
queries/orders.sql,spark
queries/customers.sql,postgres
queries/legacy.sql,
```

The graph feature is designed for scale: it can handle thousands of SQL files and provides efficient upstream/downstream queries using rustworkx.
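
A manifest in this format can also be generated programmatically rather than hand-written. A minimal standard-library sketch (the `queries/` paths and per-file dialects are illustrative; an empty dialect falls back to the default, matching the format above):

```python
import csv
import io

# Hypothetical mapping of SQL files to dialects.
files = {
    "queries/orders.sql": "spark",
    "queries/customers.sql": "postgres",
    "queries/legacy.sql": "",
}

def write_manifest(files: dict[str, str]) -> str:
    """Render a manifest CSV with the documented header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["file_path", "dialect"])
    for path, dialect in files.items():
        writer.writerow([path, dialect])
    return buf.getvalue()

print(write_manifest(files))
```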

## Use Cases

### Data Governance

**Impact Assessment:**
```bash
# Before modifying a source column, check its impact
uv run sqlglider lineage analytics_dashboard.sql --source-column orders.revenue
```

This helps you understand which downstream outputs will be affected by schema changes.

### Query Understanding

**Source Tracing:**
```bash
# Understand where a metric comes from
uv run sqlglider lineage metrics.sql --column total_revenue
```

Quickly trace complex calculations back to their source tables.

### Documentation

**Export Lineage:**
```bash
# Generate documentation for your queries
uv run sqlglider lineage query.sql --output-format csv --output-file docs/lineage.csv
```

Create machine-readable lineage documentation for data catalogs.

### Literal Value Handling

When analyzing UNION queries, SQL Glider identifies literal values (constants) as sources and displays them clearly:

```sql
-- query.sql
SELECT customer_id, last_order_date FROM active_customers
UNION ALL
SELECT customer_id, NULL AS last_order_date FROM prospects
UNION ALL
SELECT customer_id, 'unknown' AS status FROM legacy_data
```

```bash
uv run sqlglider lineage query.sql
```

**Example Output:**
```
Query 0: SELECT customer_id, last_order_date FROM active_customers ...
+---------------------------------------------------------------------+
| Output Column                    | Source Column                    |
|----------------------------------+----------------------------------|
| active_customers.customer_id     | active_customers.customer_id     |
|                                  | prospects.customer_id            |
| active_customers.last_order_date | <literal: NULL>                  |
|                                  | active_customers.last_order_date |
+---------------------------------------------------------------------+
Total: 4 row(s)
```

Literal values are displayed as `<literal: VALUE>` to clearly distinguish them from actual column sources:
- `<literal: NULL>` - NULL values
- `<literal: 0>` - Numeric literals
- `<literal: 'string'>` - String literals
- `<literal: CURRENT_TIMESTAMP()>` - Function calls

This helps identify which branches of a UNION contribute actual data lineage versus hardcoded values.

### Multi-Level Analysis

SQL Glider automatically traces through CTEs and subqueries:

```sql
-- query.sql
WITH order_totals AS (
    SELECT customer_id, SUM(order_amount) AS total_amount
    FROM orders
    GROUP BY customer_id
),
customer_segments AS (
    SELECT
        ot.customer_id,
        c.customer_name,
        ot.total_amount,
        CASE
            WHEN ot.total_amount > 10000 THEN 'Premium'
            ELSE 'Standard'
        END AS segment
    FROM order_totals ot
    JOIN customers c ON ot.customer_id = c.customer_id
)
SELECT customer_name, segment, total_amount
FROM customer_segments
```

```bash
# Trace segment back to its ultimate sources
uv run sqlglider lineage query.sql --column segment
# Output: orders.order_amount (through the CASE expression and SUM)

# Find what's affected by order_amount
uv run sqlglider lineage query.sql --source-column orders.order_amount
# Output: segment, total_amount
```

## CLI Reference

### Lineage Command

```
sqlglider lineage <sql_file> [OPTIONS]

Arguments:
  sql_file  Path to SQL file to analyze [required]

Options:
  --level, -l          Analysis level: 'column' or 'table' [default: column]
  --dialect, -d        SQL dialect (spark, postgres, snowflake, etc.) [default: spark]
  --column, -c         Specific output column for forward lineage [optional]
  --source-column, -s  Source column for reverse lineage (impact analysis) [optional]
  --table, -t          Filter to only queries that reference this table (multi-query files) [optional]
  --output-format, -f  Output format: 'text', 'json', or 'csv' [default: text]
  --output-file, -o    Write output to file instead of stdout [optional]
  --help               Show help message and exit
```

**Notes:**
- `--column` and `--source-column` are mutually exclusive. Use one or the other.
- The `--table` filter is useful for multi-query files to analyze only queries that reference a specific table.

### Tables Command

```
sqlglider tables overview <sql_file> [OPTIONS]

Arguments:
  sql_file  Path to SQL file to analyze [required]

Options:
  --dialect, -d        SQL dialect (spark, postgres, snowflake, etc.) [default: spark]
  --table              Filter to only queries that reference this table [optional]
  --output-format, -f  Output format: 'text', 'json', or 'csv' [default: text]
  --output-file, -o    Write output to file instead of stdout [optional]
  --templater, -t      Templater for SQL preprocessing (e.g., 'jinja', 'none') [optional]
  --var, -v            Template variable in key=value format (repeatable) [optional]
  --vars-file          Path to variables file (JSON or YAML) [optional]
  --help               Show help message and exit
```

```
sqlglider tables pull <sql_file> [OPTIONS]

Arguments:
  sql_file  Path to SQL file to analyze [optional, reads from stdin if omitted]

Options:
  --catalog-type, -c   Catalog provider (e.g., 'databricks') [required if not in config]
  --ddl-folder, -o     Output folder for DDL files [optional, outputs to stdout if omitted]
  --dialect, -d        SQL dialect (spark, postgres, snowflake, etc.) [default: spark]
  --templater, -t      Templater for SQL preprocessing (e.g., 'jinja', 'none') [optional]
  --var, -v            Template variable in key=value format (repeatable) [optional]
  --vars-file          Path to variables file (JSON or YAML) [optional]
  --list, -l           List available catalog providers and exit
  --help               Show help message and exit
```

**Databricks Setup:**

Install the optional Databricks dependency:
```bash
pip install sql-glider[databricks]
```

Configure authentication (via environment variables or `sqlglider.toml`):
```bash
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
export DATABRICKS_TOKEN="dapi..."
export DATABRICKS_WAREHOUSE_ID="abc123..."
```

### Dissect Command

```
sqlglider dissect [sql_file] [OPTIONS]

Arguments:
  sql_file  Path to SQL file to analyze [optional, reads from stdin if omitted]

Options:
  --dialect, -d        SQL dialect (spark, postgres, snowflake, etc.) [default: spark]
  --output-format, -f  Output format: 'text', 'json', or 'csv' [default: text]
  --output-file, -o    Write output to file instead of stdout [optional]
  --templater, -t      Templater for SQL preprocessing (e.g., 'jinja', 'none') [optional]
  --var, -v            Template variable in key=value format (repeatable) [optional]
  --vars-file          Path to variables file (JSON or YAML) [optional]
  --help               Show help message and exit
```

**Output Fields:**
- `component_type`: Type of component (CTE, MAIN_QUERY, SUBQUERY, etc.)
- `component_index`: Sequential order within the query (0-based)
- `name`: CTE name, subquery alias, or target table name
- `sql`: The extracted SQL for this component
- `is_executable`: Whether the component can run standalone (TARGET_TABLE is false)
- `dependencies`: List of CTE names this component references
- `location`: Human-readable context (e.g., "WITH clause", "FROM clause")
- `depth`: Nesting level (0 = top-level)
- `parent_index`: Index of the parent component for nested components

### Graph Commands

```
sqlglider graph build <paths> [OPTIONS]

Arguments:
  paths  SQL file(s) or directory to process [optional]

Options:
  --output, -o       Output JSON file path [required]
  --manifest, -m     Path to manifest CSV file [optional]
  --recursive, -r    Recursively search directories [default: True]
  --glob, -g         Glob pattern for SQL files [default: *.sql]
  --dialect, -d      SQL dialect [default: spark]
  --node-format, -n  Node format: 'qualified' or 'structured' [default: qualified]
```

```
sqlglider graph merge <inputs> [OPTIONS]

Arguments:
  inputs  JSON graph files to merge [optional]

Options:
  --output, -o  Output file path [required]
  --glob, -g    Glob pattern for graph files [optional]
```

```
sqlglider graph query <graph_file> [OPTIONS]

Arguments:
  graph_file  Path to graph JSON file [required]

Options:
  --upstream, -u       Find source columns for this column [optional]
  --downstream, -d     Find affected columns for this source [optional]
  --output-format, -f  Output format: 'text', 'json', or 'csv' [default: text]
```

**Notes:**
- `--upstream` and `--downstream` are mutually exclusive. Use one or the other.
- Graph queries are case-insensitive for column matching.
|
|
705
|
+
|
|
706
|
+
## Output Formats

### Text Format (Default)

Human-readable Rich table format showing the query index and a query preview:

```
Query 0: SELECT customer_name FROM customers c ...
+---------------------------------------------------+
| Output Column   | Source Column                   |
|-----------------+---------------------------------|
| customer_name   | c.customer_name                 |
+---------------------------------------------------+
Total: 1 row(s)
```

### JSON Format

Machine-readable structured format with query metadata:

```json
{
  "queries": [
    {
      "query_index": 0,
      "query_preview": "SELECT customer_name FROM customers c ...",
      "level": "column",
      "lineage": [
        {
          "output_name": "customer_name",
          "source_name": "c.customer_name"
        }
      ]
    }
  ]
}
```
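
The JSON output is easy to consume programmatically. A minimal sketch that folds the lineage entries into a per-query `{output_column: [source_columns]}` mapping (assuming only the structure shown above; no other fields are guaranteed here):

```python
import json

# Sample payload matching the documented JSON format.
payload = """
{
  "queries": [
    {
      "query_index": 0,
      "query_preview": "SELECT customer_name FROM customers c ...",
      "level": "column",
      "lineage": [
        {"output_name": "customer_name", "source_name": "c.customer_name"}
      ]
    }
  ]
}
"""

data = json.loads(payload)
for query in data["queries"]:
    mapping: dict[str, list[str]] = {}
    for edge in query["lineage"]:
        mapping.setdefault(edge["output_name"], []).append(edge["source_name"])
    print(query["query_index"], mapping)
```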

### CSV Format

Spreadsheet-ready tabular format with query index:

```csv
query_index,output_column,source_column
0,customer_name,c.customer_name
```

**Note:** Each source column gets its own row. If an output column has multiple sources, there will be multiple rows with the same `query_index` and `output_column`.

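Because multi-source columns span several rows, downstream tooling typically re-groups them. A minimal sketch using the standard library (the `order_total` rows are invented for illustration, to show an output column with two sources):

```python
import csv
import io
from collections import defaultdict

# Sample CSV in the documented layout; the order_total rows are
# hypothetical, demonstrating one output column with two sources.
csv_text = """query_index,output_column,source_column
0,customer_name,c.customer_name
0,order_total,o.quantity
0,order_total,o.unit_price
"""

sources: defaultdict[tuple[str, str], list[str]] = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    sources[(row["query_index"], row["output_column"])].append(row["source_column"])

for (query_index, output_column), cols in sources.items():
    print(query_index, output_column, cols)
```
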
## Development

### Setup

```bash
# Install dependencies
uv sync

# Run linter
uv run ruff check

# Auto-fix issues
uv run ruff check --fix

# Format code
uv run ruff format

# Type checking
uv run basedpyright
```

### Project Structure

See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed technical documentation.

```
src/sqlglider/
├── cli.py              # Typer CLI entry point
├── dissection/
│   ├── analyzer.py     # DissectionAnalyzer for query decomposition
│   ├── formatters.py   # Output formatters (text, JSON, CSV)
│   └── models.py       # ComponentType, SQLComponent, QueryDissectionResult
├── graph/
│   ├── builder.py      # Build graphs from SQL files
│   ├── merge.py        # Merge multiple graphs
│   ├── query.py        # Query upstream/downstream lineage
│   └── models.py       # Graph data models
├── lineage/
│   ├── analyzer.py     # Core lineage analysis using SQLGlot
│   └── formatters.py   # Output formatters (text, JSON, CSV)
└── utils/
    └── file_utils.py   # File I/O utilities
```

## Publishing

SQL Glider is configured for publishing to both TestPyPI and PyPI using `uv`.

### Versioning

SQL Glider uses Git tags for version management via [hatch-vcs](https://github.com/ofek/hatch-vcs). The version is automatically derived from Git:

- **Tagged commits:** Version matches the tag (e.g., `git tag v0.2.0` produces version `0.2.0`)
- **Untagged commits:** Version includes development info (e.g., `0.1.dev18+g7216a59`)

**Creating a new release:**

```bash
# Create and push a version tag
git tag v0.2.0
git push origin v0.2.0

# Build will now produce version 0.2.0
uv build
```

**Tag format:** Use the `v` prefix (e.g., `v1.0.0`, `v0.2.1`). The `v` is stripped from the final version number.
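
The tag-to-version rule can be sketched as follows (an illustration of the stripping behavior described above, not hatch-vcs's actual parser, which also handles dev and local-version suffixes):

```python
import re

def version_from_tag(tag: str) -> str:
    """Derive a release version from a Git tag: an optional
    leading 'v' is stripped, leaving the bare version number."""
    match = re.fullmatch(r"v?(\d+\.\d+(?:\.\d+)?)", tag)
    if match is None:
        raise ValueError(f"not a release tag: {tag!r}")
    return match.group(1)

print(version_from_tag("v0.2.0"))  # 0.2.0
```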

### Building the Package

```bash
# Build the distribution files (wheel and sdist)
uv build
```

This creates distribution files in the `dist/` directory.

### Publishing to TestPyPI

Always test your release on TestPyPI first:

```bash
# Publish to TestPyPI
uv publish --index testpypi --token <YOUR_TESTPYPI_TOKEN>

# Test installation from TestPyPI
uv pip install --index-url https://test.pypi.org/simple/ sql-glider
```

### Publishing to PyPI

Once verified on TestPyPI, publish to production:

```bash
# Publish to PyPI
uv publish --index pypi --token <YOUR_PYPI_TOKEN>
```

### Token Setup

You'll need API tokens from both registries:

1. **TestPyPI Token:** Create at https://test.pypi.org/manage/account/token/
2. **PyPI Token:** Create at https://pypi.org/manage/account/token/

**Option 1: Pass the token directly (shown above)**

**Option 2: Environment variable**

```bash
export UV_PUBLISH_TOKEN=pypi-...
uv publish --index pypi
```

**Option 3: Store in a `.env` file (not committed to git)**

```bash
# .env
UV_PUBLISH_TOKEN=pypi-...
```
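
Whether tooling auto-loads `.env` depends on your setup, so one common pattern is to export the file's variables into the shell yourself before running `uv publish`. A minimal sketch (the token value is a placeholder):

```shell
# Write a .env file (placeholder token, for illustration only).
printf 'UV_PUBLISH_TOKEN=pypi-example-token\n' > .env

set -a        # auto-export every variable assigned while set
. ./.env      # source the file into the current shell
set +a

echo "$UV_PUBLISH_TOKEN"
```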

**Security Note:** Never commit API tokens to version control. The `.gitignore` file should include `.env`.

## Dependencies

- **sqlglot[rs]:** SQL parser and lineage analysis library with Rust extensions
- **typer:** CLI framework with type hints
- **rich:** Terminal formatting and colored output
- **pydantic:** Data validation and serialization
- **rustworkx:** High-performance graph library for cross-file lineage analysis


## References

- [SQLGlot Documentation](https://sqlglot.com/)
- [uv Documentation](https://docs.astral.sh/uv/)
- [Typer Documentation](https://typer.tiangolo.com/)
- [Ruff Documentation](https://docs.astral.sh/ruff/configuration/)

## License

See the LICENSE file for details.