sql-glider 0.1.8__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,893 @@
1
+ Metadata-Version: 2.4
2
+ Name: sql-glider
3
+ Version: 0.1.8
4
+ Summary: SQL Utility Toolkit for better understanding, use, and governance of your queries in a native environment.
5
+ Project-URL: Homepage, https://github.com/rycowhi/sql-glider/
6
+ Project-URL: Repository, https://github.com/rycowhi/sql-glider/
7
+ Project-URL: Documentation, https://github.com/rycowhi/sql-glider/
8
+ Project-URL: Issues, https://github.com/rycowhi/sql-glider/issues
9
+ Author-email: Ryan Whitcomb <ryankwhitcomb@gmail.com>
10
+ License-Expression: Apache-2.0
11
+ License-File: LICENSE
12
+ Keywords: data-governance,data-lineage,lineage,sql,sqlglot
13
+ Classifier: Development Status :: 3 - Alpha
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: License :: OSI Approved :: Apache Software License
16
+ Classifier: Operating System :: OS Independent
17
+ Classifier: Programming Language :: Python :: 3 :: Only
18
+ Classifier: Programming Language :: SQL
19
+ Classifier: Topic :: Database
20
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
21
+ Classifier: Typing :: Typed
22
+ Requires-Python: >=3.11
23
+ Requires-Dist: jinja2>=3.0.0
24
+ Requires-Dist: pydantic>=2.0.0
25
+ Requires-Dist: rich>=13.0.0
26
+ Requires-Dist: rustworkx>=0.15.0
27
+ Requires-Dist: sqlglot[rs]>=25.0.0
28
+ Requires-Dist: typer>=0.9.0
29
+ Provides-Extra: databricks
30
+ Requires-Dist: databricks-sdk>=0.20.0; extra == 'databricks'
31
+ Description-Content-Type: text/markdown
32
+
33
+ # SQL Glider
34
+
35
+ SQL Utility Toolkit for better understanding, use, and governance of your queries in a native environment.
36
+
37
+ ## Overview
38
+
39
+ SQL Glider provides powerful column-level and table-level lineage analysis for SQL queries using SQLGlot. It operates on standalone SQL files without requiring a full project setup, making it perfect for ad-hoc analysis, data governance, and understanding query dependencies.
40
+
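+ Under the hood it drives SQLGlot's lineage module. A minimal sketch of the kind of primitive involved (illustrative only, not SQL Glider's own API):
+
+ ```python
+ # Trace one output column of a query back to its sources with SQLGlot.
+ from sqlglot.lineage import lineage
+
+ node = lineage(
+     "customer_name",
+     "SELECT c.customer_name FROM customers AS c",
+     dialect="spark",
+ )
+ for n in node.walk():  # the column itself plus everything it derives from
+     print(n.name)
+ ```
+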
41
+ ## Features
42
+
43
+ - **Forward Lineage:** Trace output columns back to their source tables and columns
44
+ - **Reverse Lineage:** Impact analysis that finds which output columns are affected by a source column
45
+ - **Query Dissection:** Decompose SQL into components (CTEs, subqueries, UNION branches) for unit testing
46
+ - **Table Extraction:** List all tables in SQL files with usage type (INPUT/OUTPUT) and object type (TABLE/VIEW/CTE)
47
+ - **Multi-level Tracing:** Automatically handles CTEs, subqueries, and complex expressions
48
+ - **Graph-Based Lineage:** Build and query lineage graphs across thousands of SQL files
49
+ - **Multiple Output Formats:** Text (human-readable), JSON (machine-readable), CSV (spreadsheet-ready)
50
+ - **Dialect Support:** Works with Spark, PostgreSQL, Snowflake, BigQuery, MySQL, and many more SQL dialects
51
+ - **File Export:** Save lineage results to files for documentation or further processing
52
+
53
+ ## Installation
54
+
55
+ SQL Glider is available on PyPI and can be installed with pip or uv. Python 3.11+ is required.
56
+
57
+ ```bash
58
+ # Install with pip
59
+ pip install sql-glider
60
+
61
+ # Or install with uv
62
+ uv pip install sql-glider
63
+ ```
64
+
65
+ After installation, the `sqlglider` command is available:
66
+
67
+ ```bash
68
+ sqlglider lineage query.sql
69
+ ```
70
+
71
+ ### Development Setup
72
+
73
+ If you want to contribute or run from source:
74
+
75
+ ```bash
76
+ # Clone the repository
77
+ git clone https://github.com/rycowhi/sql-glider.git
78
+ cd sql-glider
79
+
80
+ # Install dependencies with uv
81
+ uv sync
82
+
83
+ # Run from source
84
+ uv run sqlglider lineage <sql_file>
85
+ ```
86
+
87
+ ## Quick Start
88
+
89
+ ### Forward Lineage (Source Tracing)
90
+
91
+ Find out where your output columns come from:
92
+
93
+ ```bash
94
+ # Analyze all output columns
95
+ uv run sqlglider lineage query.sql
96
+
97
+ # Analyze a specific output column
98
+ uv run sqlglider lineage query.sql --column customer_name
99
+ ```
100
+
101
+ **Example Output:**
102
+ ```
103
+ Query 0: SELECT customer_name, o.order_total FROM customers c JOIN orders o ...
104
+ +------------------------------------------------------------------------------+
105
+ | Output Column   | Source Column                                              |
106
+ |-----------------+------------------------------------------------------------|
107
+ | customer_name   | c.customer_name                                            |
108
+ +------------------------------------------------------------------------------+
109
+ Total: 1 row(s)
110
+ ```
111
+
112
+ This shows that the output column `customer_name` in Query 0 comes from `c.customer_name` (the `customer_name` column of the `customers` table, referenced through its alias `c`).
113
+
114
+ ### Reverse Lineage (Impact Analysis)
115
+
116
+ Find out which output columns are affected by a source column:
117
+
118
+ ```bash
119
+ # Find outputs affected by a source column
120
+ uv run sqlglider lineage query.sql --source-column orders.customer_id
121
+ ```
122
+
123
+ **Example Output:**
124
+ ```
125
+ Query 0: SELECT customer_id, segment FROM ...
126
+ +---------------------------------------------------------+
127
+ | Output Column      | Source Column                      |
128
+ |--------------------+------------------------------------|
129
+ | orders.customer_id | orders.customer_id                 |
130
+ +---------------------------------------------------------+
131
+ Total: 1 row(s)
132
+ ```
133
+
134
+ This shows that if `orders.customer_id` changes, it will impact the output column `customer_id` in Query 0.
135
+
136
+ ## Usage Examples
137
+
138
+ ### Basic Column Lineage
139
+
140
+ ```bash
141
+ # Forward lineage for all columns
142
+ uv run sqlglider lineage query.sql
143
+
144
+ # Forward lineage for specific column
145
+ uv run sqlglider lineage query.sql --column order_total
146
+
147
+ # Reverse lineage (impact analysis)
148
+ uv run sqlglider lineage query.sql --source-column orders.customer_id
149
+ ```
150
+
151
+ ### Different Output Formats
152
+
153
+ ```bash
154
+ # JSON output
155
+ uv run sqlglider lineage query.sql --output-format json
156
+
157
+ # CSV output
158
+ uv run sqlglider lineage query.sql --output-format csv
159
+
160
+ # Export to file
161
+ uv run sqlglider lineage query.sql --output-format json --output-file lineage.json
162
+ ```
163
+
164
+ ### Table-Level Lineage
165
+
166
+ ```bash
167
+ # Show which tables are used
168
+ uv run sqlglider lineage query.sql --level table
169
+ ```
170
+
171
+ ### Table Extraction
172
+
173
+ List all tables involved in SQL files with usage and type information:
174
+
175
+ ```bash
176
+ # List all tables in a SQL file
177
+ uv run sqlglider tables overview query.sql
178
+
179
+ # JSON output with detailed table info
180
+ uv run sqlglider tables overview query.sql --output-format json
181
+
182
+ # Export to CSV
183
+ uv run sqlglider tables overview query.sql --output-format csv --output-file tables.csv
184
+ ```
185
+
186
+ ### Pull DDL from Remote Catalogs
187
+
188
+ Fetch DDL definitions from remote data catalogs (e.g., Databricks Unity Catalog):
189
+
190
+ ```bash
191
+ # Pull DDL for all tables used in a SQL file (outputs to stdout)
192
+ uv run sqlglider tables pull query.sql --catalog-type databricks
193
+
194
+ # Save DDL files to a folder (one file per table)
195
+ uv run sqlglider tables pull query.sql -c databricks -o ./ddl/
196
+
197
+ # List available catalog providers
198
+ uv run sqlglider tables pull --list
199
+ ```
200
+
201
+ **Note:** Requires optional dependencies. Install with: `pip install sql-glider[databricks]`
202
+
203
+ **Example Output (JSON):**
204
+ ```json
205
+ {
206
+ "queries": [{
207
+ "query_index": 0,
208
+ "tables": [
209
+ {"name": "customers", "usage": "INPUT", "object_type": "UNKNOWN"},
210
+ {"name": "orders", "usage": "INPUT", "object_type": "UNKNOWN"}
211
+ ]
212
+ }]
213
+ }
214
+ ```
215
+
216
+ **Table Usage Types:**
217
+ - `INPUT`: Table is read from (SELECT, JOIN, subqueries)
218
+ - `OUTPUT`: Table is written to (INSERT, CREATE TABLE/VIEW, UPDATE)
219
+ - `BOTH`: Table is both read from and written to
220
+
221
+ **Object Types:**
222
+ - `TABLE`: CREATE TABLE or DROP TABLE statement
223
+ - `VIEW`: CREATE VIEW or DROP VIEW statement
224
+ - `CTE`: Common Table Expression (WITH clause)
225
+ - `UNKNOWN`: Cannot determine type from SQL alone
226
+
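+ For intuition, the raw table discovery can be sketched directly with SQLGlot; the usage and object-type classification above is SQL Glider's layer on top (a sketch, not its actual implementation):
+
+ ```python
+ # List every table reference in a statement using SQLGlot's AST.
+ import sqlglot
+ from sqlglot import exp
+
+ tree = sqlglot.parse_one(
+     "INSERT INTO customer_orders SELECT c.id FROM customers c JOIN orders o ON c.id = o.cid",
+     dialect="spark",
+ )
+ for table in tree.find_all(exp.Table):
+     print(table.name)  # e.g. customer_orders, customers, orders
+ ```
+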
227
+ ### Query Dissection
228
+
229
+ Decompose SQL queries into constituent parts for unit testing and analysis:
230
+
231
+ ```bash
232
+ # Dissect a SQL file (text output)
233
+ uv run sqlglider dissect query.sql
234
+
235
+ # JSON output with full component details
236
+ uv run sqlglider dissect query.sql --output-format json
237
+
238
+ # CSV output for spreadsheet analysis
239
+ uv run sqlglider dissect query.sql --output-format csv
240
+
241
+ # Export to file
242
+ uv run sqlglider dissect query.sql -f json -o dissected.json
243
+
244
+ # With templating support
245
+ uv run sqlglider dissect query.sql --templater jinja --var schema=analytics
246
+
247
+ # From stdin
248
+ echo "WITH cte AS (SELECT id FROM users) SELECT * FROM cte" | uv run sqlglider dissect
249
+ ```
250
+
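+ With `--templater jinja`, the file is rendered before parsing, and each `--var key=value` becomes a template variable. A minimal sketch of that preprocessing step (the template text here is illustrative):
+
+ ```python
+ # Render a templated SQL string the way --templater jinja presumably does.
+ from jinja2 import Template
+
+ sql = Template("SELECT id FROM {{ schema }}.users").render(schema="analytics")
+ print(sql)  # SELECT id FROM analytics.users
+ ```
+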
251
+ **Example Input:**
252
+ ```sql
253
+ WITH order_totals AS (
254
+     SELECT customer_id, SUM(amount) AS total
255
+     FROM orders
256
+     GROUP BY customer_id
257
+ )
258
+ INSERT INTO analytics.summary
259
+ SELECT * FROM order_totals WHERE total > 100
260
+ ```
261
+
262
+ **Example Output (JSON):**
263
+ ```json
264
+ {
265
+ "queries": [{
266
+ "query_index": 0,
267
+ "statement_type": "INSERT",
268
+ "total_components": 3,
269
+ "components": [
270
+ {
271
+ "component_type": "CTE",
272
+ "component_index": 0,
273
+ "name": "order_totals",
274
+ "sql": "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id",
275
+ "is_executable": true,
276
+ "dependencies": [],
277
+ "location": "WITH clause"
278
+ },
279
+ {
280
+ "component_type": "TARGET_TABLE",
281
+ "component_index": 1,
282
+ "name": "analytics.summary",
283
+ "sql": "analytics.summary",
284
+ "is_executable": false,
285
+ "location": "INSERT INTO target"
286
+ },
287
+ {
288
+ "component_type": "SOURCE_QUERY",
289
+ "component_index": 2,
290
+ "sql": "SELECT * FROM order_totals WHERE total > 100",
291
+ "is_executable": true,
292
+ "dependencies": ["order_totals"],
293
+ "location": "INSERT source SELECT"
294
+ }
295
+ ]
296
+ }]
297
+ }
298
+ ```
299
+
300
+ **Extracted Component Types:**
301
+ - `CTE`: Common Table Expressions from WITH clause
302
+ - `MAIN_QUERY`: The primary SELECT statement
303
+ - `SUBQUERY`: Nested SELECT in FROM clause
304
+ - `SCALAR_SUBQUERY`: Single-value subquery in SELECT list, WHERE, HAVING
305
+ - `TARGET_TABLE`: Output table for INSERT/CREATE/MERGE (not executable)
306
+ - `SOURCE_QUERY`: SELECT within DML/DDL statements
307
+ - `UNION_BRANCH`: Individual SELECT in UNION/UNION ALL
308
+
309
+ **Use Cases:**
310
+ - Unit test CTEs and subqueries individually
311
+ - Extract DQL from CTAS, CREATE VIEW, INSERT statements
312
+ - Analyze query structure and component dependencies
313
+ - Break apart complex queries for understanding
314
+
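+ For a sense of the mechanics, CTE extraction can be sketched directly with SQLGlot (illustrative; not SQL Glider's implementation):
+
+ ```python
+ # Pull each CTE out of a statement as standalone, runnable SQL.
+ import sqlglot
+ from sqlglot import exp
+
+ tree = sqlglot.parse_one(
+     "WITH cte AS (SELECT id FROM users) SELECT * FROM cte",
+     dialect="spark",
+ )
+ for cte in tree.find_all(exp.CTE):
+     print(cte.alias, "->", cte.this.sql())  # cte -> SELECT id FROM users
+ ```
+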
315
+ ### Different SQL Dialects
316
+
317
+ ```bash
318
+ # PostgreSQL
319
+ uv run sqlglider lineage query.sql --dialect postgres
320
+
321
+ # Snowflake
322
+ uv run sqlglider lineage query.sql --dialect snowflake
323
+
324
+ # BigQuery
325
+ uv run sqlglider lineage query.sql --dialect bigquery
326
+ ```
327
+
328
+ ### Multi-Query Files
329
+
330
+ SQL Glider automatically detects and analyzes multiple SQL statements in a single file:
331
+
332
+ ```bash
333
+ # Analyze all queries in a file
334
+ uv run sqlglider lineage multi_query.sql
335
+
336
+ # Filter to only queries that reference a specific table
337
+ uv run sqlglider lineage multi_query.sql --table customers
338
+
339
+ # Analyze specific column across all queries
340
+ uv run sqlglider lineage multi_query.sql --column customer_id
341
+
342
+ # Reverse lineage across all queries (impact analysis)
343
+ uv run sqlglider lineage multi_query.sql --source-column orders.customer_id
344
+ ```
345
+
346
+ **Example multi-query file:**
347
+ ```sql
348
+ -- multi_query.sql
349
+ SELECT customer_id, customer_name FROM customers;
350
+
351
+ SELECT order_id, customer_id, order_total FROM orders;
352
+
353
+ INSERT INTO customer_orders
354
+ SELECT c.customer_id, c.customer_name, o.order_id
355
+ FROM customers c
356
+ JOIN orders o ON c.customer_id = o.customer_id;
357
+ ```
358
+
359
+ **Output includes query index for each statement:**
360
+ ```
361
+ Query 0: SELECT customer_id, customer_name FROM customers
362
+ +---------------------------------------------------+
363
+ | Output Column           | Source Column           |
364
+ |-------------------------+-------------------------|
365
+ | customers.customer_id   | customers.customer_id   |
366
+ | customers.customer_name | customers.customer_name |
367
+ +---------------------------------------------------+
368
+ Total: 2 row(s)
369
+
370
+ Query 1: SELECT order_id, customer_id, order_total FROM orders
371
+ +---------------------------------------------+
372
+ | Output Column      | Source Column          |
373
+ |--------------------+------------------------|
374
+ | orders.customer_id | orders.customer_id     |
375
+ | orders.order_id    | orders.order_id        |
376
+ | orders.order_total | orders.order_total     |
377
+ +---------------------------------------------+
378
+ Total: 3 row(s)
379
+
380
+ Query 2: INSERT INTO customer_orders ...
381
+ +---------------------------------------------+
382
+ | Output Column      | Source Column          |
383
+ |--------------------+------------------------|
384
+ ...
385
+ ```
386
+
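+ Statement splitting follows SQLGlot's multi-statement parsing; a minimal sketch of how a file becomes one expression per statement:
+
+ ```python
+ # Parse a multi-statement file into a list of expressions with SQLGlot.
+ import sqlglot
+
+ with open("multi_query.sql") as f:
+     statements = sqlglot.parse(f.read(), dialect="spark")
+
+ for i, stmt in enumerate(statements):
+     if stmt is not None:  # blank trailing statements parse to None
+         print(f"Query {i}:", stmt.sql(dialect="spark")[:60])
+ ```
+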
387
+ ### Graph-Based Lineage (Cross-File Analysis)
388
+
389
+ For analyzing lineage across multiple SQL files, SQL Glider provides graph commands:
390
+
391
+ ```bash
392
+ # Build a lineage graph from a single file
393
+ uv run sqlglider graph build query.sql -o graph.json
394
+
395
+ # Build from multiple files
396
+ uv run sqlglider graph build query1.sql query2.sql query3.sql -o graph.json
397
+
398
+ # Build from a directory (recursively finds all .sql files)
399
+ uv run sqlglider graph build ./queries/ -r -o graph.json
400
+
401
+ # Build from a manifest CSV file
402
+ uv run sqlglider graph build --manifest manifest.csv -o graph.json
403
+
404
+ # Merge multiple graphs into one
405
+ uv run sqlglider graph merge graph1.json graph2.json -o merged.json
406
+
407
+ # Query upstream dependencies (find all sources for a column)
408
+ uv run sqlglider graph query graph.json --upstream orders.customer_id
409
+
410
+ # Query downstream dependencies (find all columns affected by a source)
411
+ uv run sqlglider graph query graph.json --downstream customers.id
412
+ ```
413
+
414
+ **Example Upstream Query Output:**
415
+ ```
416
+ Sources for 'order_totals.total'
417
+ +--------------------------------------------------------------------------------------------+
418
+ | Column | Table  | Hops | Root | Leaf | Paths                              | File           |
419
+ |--------+--------+------+------+------+------------------------------------+----------------|
420
+ | amount | orders | 1    | Y    | N    | orders.amount -> order_totals.total| test_graph.sql |
421
+ +--------------------------------------------------------------------------------------------+
422
+
423
+ Total: 1 column(s)
424
+ ```
425
+
426
+ **Example Downstream Query Output:**
427
+ ```
428
+ Affected Columns for 'orders.amount'
429
+ +--------------------------------------------------------------------------------------------------+
430
+ | Column | Table        | Hops | Root | Leaf | Paths                              | File           |
431
+ |--------+--------------+------+------+------+------------------------------------+----------------|
432
+ | total  | order_totals | 1    | N    | Y    | orders.amount -> order_totals.total| test_graph.sql |
433
+ +--------------------------------------------------------------------------------------------------+
434
+
435
+ Total: 1 column(s)
436
+ ```
437
+
438
+ **Output Fields:**
439
+ - **Root**: `Y` if the column has no upstream dependencies (source column)
440
+ - **Leaf**: `Y` if the column has no downstream dependencies (final output)
441
+ - **Paths**: All paths from the dependency to the queried column
442
+
443
+ **Manifest File Format:**
444
+ ```csv
445
+ file_path,dialect
446
+ queries/orders.sql,spark
447
+ queries/customers.sql,postgres
448
+ queries/legacy.sql,
449
+ ```
450
+
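+ The manifest is plain CSV; a sketch of reading it (treating a blank dialect cell as "fall back to the default" is an assumption):
+
+ ```python
+ # Read a manifest and resolve the per-file dialect.
+ import csv
+
+ with open("manifest.csv", newline="") as f:
+     for row in csv.DictReader(f):
+         dialect = row["dialect"] or "spark"  # blank cell -> assumed default
+         print(row["file_path"], dialect)
+ ```
+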
451
+ The graph feature is designed for scale: it can handle thousands of SQL files and provides efficient upstream/downstream queries using rustworkx.
452
+
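+ Conceptually, the graph is a directed graph of columns, and upstream/downstream queries are ancestor/descendant traversals. A sketch with rustworkx (node labels are illustrative, not SQL Glider's serialized format):
+
+ ```python
+ # Model column lineage as a rustworkx digraph and traverse it both ways.
+ import rustworkx as rx
+
+ graph = rx.PyDiGraph()
+ amount = graph.add_node("orders.amount")
+ total = graph.add_node("order_totals.total")
+ graph.add_edge(amount, total, None)  # edge points source -> derived column
+
+ print([graph[i] for i in rx.ancestors(graph, total)])     # upstream sources
+ print([graph[i] for i in rx.descendants(graph, amount)])  # downstream impact
+ ```
+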
453
+ ## Use Cases
454
+
455
+ ### Data Governance
456
+
457
+ **Impact Assessment:**
458
+ ```bash
459
+ # Before modifying a source column, check its impact
460
+ uv run sqlglider lineage analytics_dashboard.sql --source-column orders.revenue
461
+ ```
462
+
463
+ This helps you understand which downstream outputs will be affected by schema changes.
464
+
465
+ ### Query Understanding
466
+
467
+ **Source Tracing:**
468
+ ```bash
469
+ # Understand where a metric comes from
470
+ uv run sqlglider lineage metrics.sql --column total_revenue
471
+ ```
472
+
473
+ Quickly trace complex calculations back to their source tables.
474
+
475
+ ### Documentation
476
+
477
+ **Export Lineage:**
478
+ ```bash
479
+ # Generate documentation for your queries
480
+ uv run sqlglider lineage query.sql --output-format csv --output-file docs/lineage.csv
481
+ ```
482
+
483
+ Create machine-readable lineage documentation for data catalogs.
484
+
485
+ ### Literal Value Handling
486
+
487
+ When analyzing UNION queries, SQL Glider identifies literal values (constants) as sources and displays them clearly:
488
+
489
+ ```sql
490
+ -- query.sql
491
+ SELECT customer_id, last_order_date FROM active_customers
492
+ UNION ALL
493
+ SELECT customer_id, NULL AS last_order_date FROM prospects
494
+ UNION ALL
495
+ SELECT customer_id, 'unknown' AS status FROM legacy_data
496
+ ```
497
+
498
+ ```bash
499
+ uv run sqlglider lineage query.sql
500
+ ```
501
+
502
+ **Example Output:**
503
+ ```
504
+ Query 0: SELECT customer_id, last_order_date FROM active_customers ...
505
+ +---------------------------------------------------------------------+
506
+ | Output Column                    | Source Column                    |
507
+ |----------------------------------+----------------------------------|
508
+ | active_customers.customer_id     | active_customers.customer_id     |
509
+ |                                  | prospects.customer_id            |
510
+ | active_customers.last_order_date | <literal: NULL>                  |
511
+ |                                  | active_customers.last_order_date |
512
+ +---------------------------------------------------------------------+
513
+ Total: 4 row(s)
514
+ ```
515
+
516
+ Literal values are displayed as `<literal: VALUE>` to clearly distinguish them from actual column sources:
517
+ - `<literal: NULL>` - NULL values
518
+ - `<literal: 0>` - Numeric literals
519
+ - `<literal: 'string'>` - String literals
520
+ - `<literal: CURRENT_TIMESTAMP()>` - Function calls
521
+
522
+ This helps identify which branches of a UNION contribute actual data lineage versus hardcoded values.
523
+
524
+ ### Multi-Level Analysis
525
+
526
+ SQL Glider automatically traces through CTEs and subqueries:
527
+
528
+ ```sql
529
+ -- query.sql
530
+ WITH order_totals AS (
531
+     SELECT customer_id, SUM(order_amount) as total_amount
532
+     FROM orders
533
+     GROUP BY customer_id
534
+ ),
535
+ customer_segments AS (
536
+     SELECT
537
+         ot.customer_id,
538
+         c.customer_name,
539
+         CASE
540
+             WHEN ot.total_amount > 10000 THEN 'Premium'
541
+             ELSE 'Standard'
542
+         END as segment
543
+     FROM order_totals ot
544
+     JOIN customers c ON ot.customer_id = c.customer_id
545
+ )
546
+ SELECT customer_name, segment, total_amount
547
+ FROM customer_segments
548
+ ```
549
+
550
+ ```bash
551
+ # Trace segment back to its ultimate sources
552
+ uv run sqlglider lineage query.sql --column segment
553
+ # Output: orders.order_amount (through the CASE statement and SUM)
554
+
555
+ # Find what's affected by order_amount
556
+ uv run sqlglider lineage query.sql --source-column orders.order_amount
557
+ # Output: segment, total_amount
558
+ ```
559
+
560
+ ## CLI Reference
561
+
562
+ ```
563
+ sqlglider lineage <sql_file> [OPTIONS]
564
+
565
+ Arguments:
566
+   sql_file  Path to SQL file to analyze [required]
567
+
568
+ Options:
569
+   --level, -l          Analysis level: 'column' or 'table' [default: column]
570
+   --dialect, -d        SQL dialect (spark, postgres, snowflake, etc.) [default: spark]
571
+   --column, -c         Specific output column for forward lineage [optional]
572
+   --source-column, -s  Source column for reverse lineage (impact analysis) [optional]
573
+   --table, -t          Filter to only queries that reference this table (multi-query files) [optional]
574
+   --output-format, -f  Output format: 'text', 'json', or 'csv' [default: text]
575
+   --output-file, -o    Write output to file instead of stdout [optional]
576
+   --help               Show help message and exit
577
+ ```
578
+
579
+ **Notes:**
580
+ - `--column` and `--source-column` are mutually exclusive. Use one or the other.
581
+ - `--table` filter is useful for multi-query files to analyze only queries that reference a specific table.
582
+
583
+ ### Tables Command
584
+
585
+ ```
586
+ sqlglider tables overview <sql_file> [OPTIONS]
587
+
588
+ Arguments:
589
+   sql_file  Path to SQL file to analyze [required]
590
+
591
+ Options:
592
+   --dialect, -d        SQL dialect (spark, postgres, snowflake, etc.) [default: spark]
593
+   --table              Filter to only queries that reference this table [optional]
594
+   --output-format, -f  Output format: 'text', 'json', or 'csv' [default: text]
595
+   --output-file, -o    Write output to file instead of stdout [optional]
596
+   --templater, -t      Templater for SQL preprocessing (e.g., 'jinja', 'none') [optional]
597
+   --var, -v            Template variable in key=value format (repeatable) [optional]
598
+   --vars-file          Path to variables file (JSON or YAML) [optional]
599
+   --help               Show help message and exit
600
+ ```
601
+
602
+ ```
603
+ sqlglider tables pull [sql_file] [OPTIONS]
604
+
605
+ Arguments:
606
+   sql_file  Path to SQL file to analyze [optional, reads from stdin if omitted]
607
+
608
+ Options:
609
+   --catalog-type, -c  Catalog provider (e.g., 'databricks') [required if not in config]
610
+   --ddl-folder, -o    Output folder for DDL files [optional, outputs to stdout if omitted]
611
+   --dialect, -d       SQL dialect (spark, postgres, snowflake, etc.) [default: spark]
612
+   --templater, -t     Templater for SQL preprocessing (e.g., 'jinja', 'none') [optional]
613
+   --var, -v           Template variable in key=value format (repeatable) [optional]
614
+   --vars-file         Path to variables file (JSON or YAML) [optional]
615
+   --list, -l          List available catalog providers and exit
616
+   --help              Show help message and exit
617
+ ```
618
+
619
+ **Databricks Setup:**
620
+
621
+ Install the optional Databricks dependency:
622
+ ```bash
623
+ pip install sql-glider[databricks]
624
+ ```
625
+
626
+ Configure authentication (via environment variables or `sqlglider.toml`):
627
+ ```bash
628
+ export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
629
+ export DATABRICKS_TOKEN="dapi..."
630
+ export DATABRICKS_WAREHOUSE_ID="abc123..."
631
+ ```
632
+
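+ The `databricks-sdk` installed by the extra picks these variables up automatically. A quick connectivity check (assuming SQL Glider relies on the SDK's default credential chain):
+
+ ```python
+ # Verify Databricks credentials resolve before pulling any DDL.
+ from databricks.sdk import WorkspaceClient
+
+ w = WorkspaceClient()  # reads DATABRICKS_HOST / DATABRICKS_TOKEN from the env
+ print(w.current_user.me().user_name)
+ ```
+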
633
+ ### Dissect Command
634
+
635
+ ```
636
+ sqlglider dissect [sql_file] [OPTIONS]
637
+
638
+ Arguments:
639
+   sql_file  Path to SQL file to analyze [optional, reads from stdin if omitted]
640
+
641
+ Options:
642
+   --dialect, -d        SQL dialect (spark, postgres, snowflake, etc.) [default: spark]
643
+   --output-format, -f  Output format: 'text', 'json', or 'csv' [default: text]
644
+   --output-file, -o    Write output to file instead of stdout [optional]
645
+   --templater, -t      Templater for SQL preprocessing (e.g., 'jinja', 'none') [optional]
646
+   --var, -v            Template variable in key=value format (repeatable) [optional]
647
+   --vars-file          Path to variables file (JSON or YAML) [optional]
648
+   --help               Show help message and exit
649
+ ```
650
+
651
+ **Output Fields:**
652
+ - `component_type`: Type of component (CTE, MAIN_QUERY, SUBQUERY, etc.)
653
+ - `component_index`: Sequential order within the query (0-based)
654
+ - `name`: CTE name, subquery alias, or target table name
655
+ - `sql`: The extracted SQL for this component
656
+ - `is_executable`: Whether the component can run standalone (TARGET_TABLE is false)
657
+ - `dependencies`: List of CTE names this component references
658
+ - `location`: Human-readable context (e.g., "WITH clause", "FROM clause")
659
+ - `depth`: Nesting level (0 = top-level)
660
+ - `parent_index`: Index of parent component for nested components
661
+
662
+ ### Graph Commands
663
+
664
+ ```
665
+ sqlglider graph build [paths] [OPTIONS]
666
+
667
+ Arguments:
668
+   paths  SQL file(s) or directory to process [optional]
669
+
670
+ Options:
671
+   --output, -o       Output JSON file path [required]
672
+   --manifest, -m     Path to manifest CSV file [optional]
673
+   --recursive, -r    Recursively search directories [default: True]
674
+   --glob, -g         Glob pattern for SQL files [default: *.sql]
675
+   --dialect, -d      SQL dialect [default: spark]
676
+   --node-format, -n  Node format: 'qualified' or 'structured' [default: qualified]
677
+ ```
678
+
679
+ ```
680
+ sqlglider graph merge [inputs] [OPTIONS]
681
+
682
+ Arguments:
683
+   inputs  JSON graph files to merge [optional]
684
+
685
+ Options:
686
+   --output, -o  Output file path [required]
687
+   --glob, -g    Glob pattern for graph files [optional]
688
+ ```
689
+
690
+ ```
691
+ sqlglider graph query <graph_file> [OPTIONS]
692
+
693
+ Arguments:
694
+   graph_file  Path to graph JSON file [required]
695
+
696
+ Options:
697
+   --upstream, -u       Find source columns for this column [optional]
698
+   --downstream, -d     Find affected columns for this source [optional]
699
+   --output-format, -f  Output format: 'text', 'json', or 'csv' [default: text]
700
+ ```
701
+
702
+ **Notes:**
703
+ - `--upstream` and `--downstream` are mutually exclusive. Use one or the other.
704
+ - Graph queries are case-insensitive for column matching.
705
+
706
+ ## Output Formats
707
+
708
+ ### Text Format (Default)
709
+
710
+ Human-readable Rich table format showing query index and preview:
711
+
712
+ ```
713
+ Query 0: SELECT customer_name FROM customers c ...
714
+ +---------------------------------------------------+
715
+ | Output Column   | Source Column                   |
716
+ |-----------------+---------------------------------|
717
+ | customer_name   | c.customer_name                 |
718
+ +---------------------------------------------------+
719
+ Total: 1 row(s)
720
+ ```
721
+
722
+ ### JSON Format
723
+
724
+ Machine-readable structured format with query metadata:
725
+
726
+ ```json
727
+ {
728
+ "queries": [
729
+ {
730
+ "query_index": 0,
731
+ "query_preview": "SELECT customer_name FROM customers c ...",
732
+ "level": "column",
733
+ "lineage": [
734
+ {
735
+ "output_name": "customer_name",
736
+ "source_name": "c.customer_name"
737
+ }
738
+ ]
739
+ }
740
+ ]
741
+ }
742
+ ```
743
+
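+ A few lines of Python are enough to consume this format (field names as documented above):
+
+ ```python
+ # Flatten the JSON lineage report into (query, output, source) triples.
+ import json
+
+ with open("lineage.json") as f:
+     report = json.load(f)
+
+ for query in report["queries"]:
+     for edge in query["lineage"]:
+         print(query["query_index"], edge["output_name"], "<-", edge["source_name"])
+ ```
+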
744
+ ### CSV Format
745
+
746
+ Spreadsheet-ready tabular format with query index:
747
+
748
+ ```csv
749
+ query_index,output_column,source_column
750
+ 0,customer_name,c.customer_name
751
+ ```
752
+
753
+ **Note:** Each source column gets its own row. If an output column has multiple sources, there will be multiple rows with the same `query_index` and `output_column`.
754
+
755
+ ## Development
756
+
757
+ ### Setup
758
+
759
+ ```bash
760
+ # Install dependencies
761
+ uv sync
762
+
763
+ # Run linter
764
+ uv run ruff check
765
+
766
+ # Auto-fix issues
767
+ uv run ruff check --fix
768
+
769
+ # Format code
770
+ uv run ruff format
771
+
772
+ # Type checking
773
+ uv run basedpyright
774
+ ```
775
+
776
+ ### Project Structure
777
+
778
+ See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed technical documentation.
779
+
780
+ ```
781
+ src/sqlglider/
782
+ ├── cli.py              # Typer CLI entry point
783
+ ├── dissection/
784
+ │   ├── analyzer.py     # DissectionAnalyzer for query decomposition
785
+ │   ├── formatters.py   # Output formatters (text, JSON, CSV)
786
+ │   └── models.py       # ComponentType, SQLComponent, QueryDissectionResult
787
+ ├── graph/
788
+ │   ├── builder.py      # Build graphs from SQL files
789
+ │   ├── merge.py        # Merge multiple graphs
790
+ │   ├── query.py        # Query upstream/downstream lineage
791
+ │   └── models.py       # Graph data models
792
+ ├── lineage/
793
+ │   ├── analyzer.py     # Core lineage analysis using SQLGlot
794
+ │   └── formatters.py   # Output formatters (text, JSON, CSV)
795
+ └── utils/
796
+     └── file_utils.py   # File I/O utilities
797
+ ```
798
+
799
+ ## Publishing
800
+
801
+ SQL Glider is configured for publishing to both TestPyPI and PyPI using `uv`.
802
+
803
+ ### Versioning
804
+
805
+ SQL Glider uses Git tags for version management via [hatch-vcs](https://github.com/ofek/hatch-vcs). The version is automatically derived from Git:
806
+
807
+ - **Tagged commits:** Version matches the tag (e.g., `git tag v0.2.0` produces version `0.2.0`)
808
+ - **Untagged commits:** Version includes development info (e.g., `0.1.dev18+g7216a59`)
809
+
810
+ **Creating a new release:**
811
+
812
+ ```bash
813
+ # Create and push a version tag
814
+ git tag v0.2.0
815
+ git push origin v0.2.0
816
+
817
+ # Build will now produce version 0.2.0
818
+ uv build
819
+ ```
820
+
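+ After installing a built wheel, the version hatch-vcs resolved can be checked from Python (sketch):
+
+ ```python
+ # Confirm the installed package reports the tag-derived version.
+ from importlib.metadata import version
+
+ print(version("sql-glider"))  # e.g. "0.2.0" for a build from tag v0.2.0
+ ```
+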
821
+ **Tag format:** Use `v` prefix (e.g., `v1.0.0`, `v0.2.1`). The `v` is stripped from the final version number.
822
+
823
+ ### Building the Package
824
+
825
+ ```bash
826
+ # Build the distribution files (wheel and sdist)
827
+ uv build
828
+ ```
829
+
830
+ This creates distribution files in the `dist/` directory.
831
+
832
+ ### Publishing to TestPyPI
833
+
834
+ Always test your release on TestPyPI first:
835
+
836
+ ```bash
837
+ # Publish to TestPyPI
838
+ uv publish --index testpypi --token <YOUR_TESTPYPI_TOKEN>
839
+
840
+ # Test installation from TestPyPI
841
+ uv pip install --index-url https://test.pypi.org/simple/ sql-glider
842
+ ```
843
+
844
+ ### Publishing to PyPI
845
+
846
+ Once verified on TestPyPI, publish to production:
847
+
848
+ ```bash
849
+ # Publish to PyPI
850
+ uv publish --index pypi --token <YOUR_PYPI_TOKEN>
851
+ ```
852
+
853
+ ### Token Setup
854
+
855
+ You'll need API tokens from both registries:
856
+
857
+ 1. **TestPyPI Token:** Create at https://test.pypi.org/manage/account/token/
858
+ 2. **PyPI Token:** Create at https://pypi.org/manage/account/token/
859
+
860
+ **Option 1: Pass token directly (shown above)**
861
+
862
+ **Option 2: Environment variable**
863
+ ```bash
864
+ export UV_PUBLISH_TOKEN=pypi-...
865
+ uv publish --index pypi
866
+ ```
867
+
868
+ **Option 3: Store in `.env` file (not committed to git)**
869
+ ```bash
870
+ # .env
871
+ UV_PUBLISH_TOKEN=pypi-...
872
+ ```
873
+
874
+ **Security Note:** Never commit API tokens to version control. The `.gitignore` file should include `.env`.
875
+
876
+ ## Dependencies
877
+
878
+ - **sqlglot[rs]:** SQL parser and lineage analysis library with Rust extensions
879
+ - **typer:** CLI framework with type hints
880
+ - **rich:** Terminal formatting and colored output
881
+ - **pydantic:** Data validation and serialization
882
+ - **rustworkx:** High-performance graph library for cross-file lineage analysis
+ - **jinja2:** Templating engine used for SQL preprocessing (`--templater jinja`)
883
+
884
+ ## References
885
+
886
+ - [SQLGlot Documentation](https://sqlglot.com/)
887
+ - [UV Documentation](https://docs.astral.sh/uv/)
888
+ - [Typer Documentation](https://typer.tiangolo.com/)
889
+ - [Ruff Documentation](https://docs.astral.sh/ruff/configuration/)
890
+
891
+ ## License
892
+
893
+ See LICENSE file for details.