PyPI - sql-glider - Versions diffs - 0.1.2__py3-none-any.whl - Mend

sql-glider 0.1.2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (26) hide show

sql_glider-0.1.2.dist-info/METADATA +721 -0
sql_glider-0.1.2.dist-info/RECORD +26 -0
sql_glider-0.1.2.dist-info/WHEEL +4 -0
sql_glider-0.1.2.dist-info/entry_points.txt +6 -0
sql_glider-0.1.2.dist-info/licenses/LICENSE +201 -0
sqlglider/__init__.py +3 -0
sqlglider/_version.py +34 -0
sqlglider/cli.py +1137 -0
sqlglider/global_models.py +17 -0
sqlglider/graph/__init__.py +42 -0
sqlglider/graph/builder.py +310 -0
sqlglider/graph/merge.py +136 -0
sqlglider/graph/models.py +289 -0
sqlglider/graph/query.py +287 -0
sqlglider/graph/serialization.py +107 -0
sqlglider/lineage/__init__.py +10 -0
sqlglider/lineage/analyzer.py +1183 -0
sqlglider/lineage/formatters.py +335 -0
sqlglider/templating/__init__.py +51 -0
sqlglider/templating/base.py +103 -0
sqlglider/templating/jinja.py +163 -0
sqlglider/templating/registry.py +124 -0
sqlglider/templating/variables.py +295 -0
sqlglider/utils/__init__.py +11 -0
sqlglider/utils/config.py +130 -0
sqlglider/utils/file_utils.py +38 -0

sql_glider-0.1.2.dist-info/METADATA ADDED Viewed

@@ -0,0 +1,721 @@
+Metadata-Version: 2.4
+Name: sql-glider
+Version: 0.1.2
+Summary: SQL Utility Toolkit for better understanding, use, and governance of your queries in a native environment.
+Project-URL: Homepage, https://github.com/rycowhi/sql-glider/
+Project-URL: Repository, https://github.com/rycowhi/sql-glider/
+Project-URL: Documentation, https://github.com/rycowhi/sql-glider/
+Project-URL: Issues, https://github.com/rycowhi/sql-glider/issues
+Author-email: Ryan Whitcomb <ryankwhitcomb@gmail.com>
+License-Expression: Apache-2.0
+License-File: LICENSE
+Keywords: data-governance,data-lineage,lineage,sql,sqlglot
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Programming Language :: SQL
+Classifier: Topic :: Database
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Classifier: Typing :: Typed
+Requires-Python: >=3.11
+Requires-Dist: jinja2>=3.0.0
+Requires-Dist: pydantic>=2.0.0
+Requires-Dist: rich>=13.0.0
+Requires-Dist: rustworkx>=0.15.0
+Requires-Dist: sqlglot[rs]>=25.0.0
+Requires-Dist: typer>=0.9.0
+Description-Content-Type: text/markdown
+# SQL Glider
+SQL Utility Toolkit for better understanding, use, and governance of your queries in a native environment.
+## Overview
+SQL Glider provides powerful column-level and table-level lineage analysis for SQL queries using SQLGlot. It operates on standalone SQL files without requiring a full project setup, making it perfect for ad-hoc analysis, data governance, and understanding query dependencies.
+## Features
+- **Forward Lineage:** Trace output columns back to their source tables and columns
+- **Reverse Lineage:** Impact analysis - find which output columns are affected by a source column
+- **Table Extraction:** List all tables in SQL files with usage type (INPUT/OUTPUT) and object type (TABLE/VIEW/CTE)
+- **Multi-level Tracing:** Automatically handles CTEs, subqueries, and complex expressions
+- **Graph-Based Lineage:** Build and query lineage graphs across thousands of SQL files
+- **Multiple Output Formats:** Text (human-readable), JSON (machine-readable), CSV (spreadsheet-ready)
+- **Dialect Support:** Works with Spark, PostgreSQL, Snowflake, BigQuery, MySQL, and many more SQL dialects
+- **File Export:** Save lineage results to files for documentation or further processing
+## Installation
+SQL Glider is available on PyPI and can be installed with pip or uv. Python 3.11+ is required.
+```bash
+# Install with pip
+pip install sql-glider
+# Or install with uv
+uv pip install sql-glider
+```
+After installation, the `sqlglider` command is available:
+```bash
+sqlglider lineage query.sql
+```
+### Development Setup
+If you want to contribute or run from source:
+```bash
+# Clone the repository
+git clone https://github.com/ryanholmdahl/sql-glider.git
+cd sql-glider
+# Install dependencies with uv
+uv sync
+# Run from source
+uv run sqlglider lineage <sql_file>
+```
+## Quick Start
+### Forward Lineage (Source Tracing)
+Find out where your output columns come from:
+```bash
+# Analyze all output columns
+uv run sqlglider lineage query.sql
+# Analyze a specific output column
+uv run sqlglider lineage query.sql --column customer_name
+```
+**Example Output:**
+```
+       Query 0: SELECT customer_name, o.order_total FROM customers c JOIN orders o ...
++-----------------------------------------------------------------------------+
+| Output Column   | Source Column                                             |
+|-----------------+------------------------------------------------------------|
+| customer_name   | c.customer_name                                           |
++-----------------------------------------------------------------------------+
+Total: 1 row(s)
+```
+This shows that the output column `customer_name` in Query 0 comes from `c.customer_name` (the `customer_name` column in table `c`).
+### Reverse Lineage (Impact Analysis)
+Find out which output columns are affected by a source column:
+```bash
+# Find outputs affected by a source column
+uv run sqlglider lineage query.sql --source-column orders.customer_id
+```
+**Example Output:**
+```
+          Query 0: SELECT customer_id, segment FROM ...
++---------------------------------------------------------+
+| Output Column      | Source Column                      |
+|--------------------+------------------------------------|
+| orders.customer_id | orders.customer_id                 |
++---------------------------------------------------------+
+Total: 1 row(s)
+```
+This shows that if `orders.customer_id` changes, it will impact the output column `customer_id` in Query 0.
+## Usage Examples
+### Basic Column Lineage
+```bash
+# Forward lineage for all columns
+uv run sqlglider lineage query.sql
+# Forward lineage for specific column
+uv run sqlglider lineage query.sql --column order_total
+# Reverse lineage (impact analysis)
+uv run sqlglider lineage query.sql --source-column orders.customer_id
+```
+### Different Output Formats
+```bash
+# JSON output
+uv run sqlglider lineage query.sql --output-format json
+# CSV output
+uv run sqlglider lineage query.sql --output-format csv
+# Export to file
+uv run sqlglider lineage query.sql --output-format json --output-file lineage.json
+```
+### Table-Level Lineage
+```bash
+# Show which tables are used
+uv run sqlglider lineage query.sql --level table
+```
+### Table Extraction
+List all tables involved in SQL files with usage and type information:
+```bash
+# List all tables in a SQL file
+uv run sqlglider tables query.sql
+# JSON output with detailed table info
+uv run sqlglider tables query.sql --output-format json
+# Export to CSV
+uv run sqlglider tables query.sql --output-format csv --output-file tables.csv
+```
+**Example Output (JSON):**
+```json
+{
+  "queries": [{
+    "query_index": 0,
+    "tables": [
+      {"name": "customers", "usage": "INPUT", "object_type": "UNKNOWN"},
+      {"name": "orders", "usage": "INPUT", "object_type": "UNKNOWN"}
+    ]
+  }]
+}
+```
+**Table Usage Types:**
+- `INPUT`: Table is read from (SELECT, JOIN, subqueries)
+- `OUTPUT`: Table is written to (INSERT, CREATE TABLE/VIEW, UPDATE)
+- `BOTH`: Table is both read from and written to
+**Object Types:**
+- `TABLE`: CREATE TABLE or DROP TABLE statement
+- `VIEW`: CREATE VIEW or DROP VIEW statement
+- `CTE`: Common Table Expression (WITH clause)
+- `UNKNOWN`: Cannot determine type from SQL alone
+### Different SQL Dialects
+```bash
+# PostgreSQL
+uv run sqlglider lineage query.sql --dialect postgres
+# Snowflake
+uv run sqlglider lineage query.sql --dialect snowflake
+# BigQuery
+uv run sqlglider lineage query.sql --dialect bigquery
+```
+### Multi-Query Files
+SQL Glider automatically detects and analyzes multiple SQL statements in a single file:
+```bash
+# Analyze all queries in a file
+uv run sqlglider lineage multi_query.sql
+# Filter to only queries that reference a specific table
+uv run sqlglider lineage multi_query.sql --table customers
+# Analyze specific column across all queries
+uv run sqlglider lineage multi_query.sql --column customer_id
+# Reverse lineage across all queries (impact analysis)
+uv run sqlglider lineage multi_query.sql --source-column orders.customer_id
+```
+**Example multi-query file:**
+```sql
+-- multi_query.sql
+SELECT customer_id, customer_name FROM customers;
+SELECT order_id, customer_id, order_total FROM orders;
+INSERT INTO customer_orders
+SELECT c.customer_id, c.customer_name, o.order_id
+FROM customers c
+JOIN orders o ON c.customer_id = o.customer_id;
+```
+**Output includes query index for each statement:**
+```
+   Query 0: SELECT customer_id, customer_name FROM customers
++---------------------------------------------------+
+| Output Column           | Source Column           |
+|-------------------------+-------------------------|
+| customers.customer_id   | customers.customer_id   |
+| customers.customer_name | customers.customer_name |
++---------------------------------------------------+
+Total: 2 row(s)
+       Query 1: SELECT order_id, customer_id, order_total FROM orders
++---------------------------------------------+
+| Output Column      | Source Column          |
+|--------------------+------------------------|
+| orders.customer_id | orders.customer_id     |
+| orders.order_id    | orders.order_id        |
+| orders.order_total | orders.order_total     |
++---------------------------------------------+
+Total: 3 row(s)
+       Query 2: INSERT INTO customer_orders ...
++---------------------------------------------+
+| Output Column      | Source Column          |
+|--------------------+------------------------|
+...
+```
+### Graph-Based Lineage (Cross-File Analysis)
+For analyzing lineage across multiple SQL files, SQL Glider provides graph commands:
+```bash
+# Build a lineage graph from a single file
+uv run sqlglider graph build query.sql -o graph.json
+# Build from multiple files
+uv run sqlglider graph build query1.sql query2.sql query3.sql -o graph.json
+# Build from a directory (recursively finds all .sql files)
+uv run sqlglider graph build ./queries/ -r -o graph.json
+# Build from a manifest CSV file
+uv run sqlglider graph build --manifest manifest.csv -o graph.json
+# Merge multiple graphs into one
+uv run sqlglider graph merge graph1.json graph2.json -o merged.json
+# Query upstream dependencies (find all sources for a column)
+uv run sqlglider graph query graph.json --upstream orders.customer_id
+# Query downstream dependencies (find all columns affected by a source)
+uv run sqlglider graph query graph.json --downstream customers.id
+```
+**Example Upstream Query Output:**
+```
+                              Sources for 'order_totals.total'
++--------------------------------------------------------------------------------------------+
+| Column | Table  | Hops | Root | Leaf | Paths                              | File           |
+|--------+--------+------+------+------+------------------------------------+----------------|
+| amount | orders |    1 |  Y   |  N   | orders.amount -> order_totals.total| test_graph.sql |
++--------------------------------------------------------------------------------------------+
+Total: 1 column(s)
+```
+**Example Downstream Query Output:**
+```
+                             Affected Columns for 'orders.amount'
++--------------------------------------------------------------------------------------------+
+| Column | Table        | Hops | Root | Leaf | Paths                              | File           |
+|--------+--------------+------+------+------+------------------------------------+----------------|
+| total  | order_totals |    1 |  N   |  Y   | orders.amount -> order_totals.total| test_graph.sql |
++--------------------------------------------------------------------------------------------+
+Total: 1 column(s)
+```
+**Output Fields:**
+- **Root**: `Y` if the column has no upstream dependencies (source column)
+- **Leaf**: `Y` if the column has no downstream dependencies (final output)
+- **Paths**: All paths from the dependency to the queried column
+**Manifest File Format:**
+```csv
+file_path,dialect
+queries/orders.sql,spark
+queries/customers.sql,postgres
+queries/legacy.sql,
+```
+The graph feature is designed for scale - it can handle thousands of SQL files and provides efficient upstream/downstream queries using rustworkx.
+## Use Cases
+### Data Governance
+**Impact Assessment:**
+```bash
+# Before modifying a source column, check its impact
+uv run sqlglider lineage analytics_dashboard.sql --source-column orders.revenue
+```
+This helps you understand which downstream outputs will be affected by schema changes.
+### Query Understanding
+**Source Tracing:**
+```bash
+# Understand where a metric comes from
+uv run sqlglider lineage metrics.sql --column total_revenue
+```
+Quickly trace complex calculations back to their source tables.
+### Documentation
+**Export Lineage:**
+```bash
+# Generate documentation for your queries
+uv run sqlglider lineage query.sql --output-format csv --output-file docs/lineage.csv
+```
+Create machine-readable lineage documentation for data catalogs.
+### Literal Value Handling
+When analyzing UNION queries, SQL Glider identifies literal values (constants) as sources and displays them clearly:
+```sql
+-- query.sql
+SELECT customer_id, last_order_date FROM active_customers
+UNION ALL
+SELECT customer_id, NULL AS last_order_date FROM prospects
+UNION ALL
+SELECT customer_id, 'unknown' AS status FROM legacy_data
+```
+```bash
+uv run sqlglider lineage query.sql
+```
+**Example Output:**
+```
+       Query 0: SELECT customer_id, last_order_date FROM active_customers ...
++---------------------------------------------------------------------+
+| Output Column                    | Source Column                    |
+|----------------------------------+----------------------------------|
+| active_customers.customer_id     | active_customers.customer_id     |
+|                                  | prospects.customer_id            |
+| active_customers.last_order_date | <literal: NULL>                  |
+|                                  | active_customers.last_order_date |
++---------------------------------------------------------------------+
+Total: 4 row(s)
+```
+Literal values are displayed as `<literal: VALUE>` to clearly distinguish them from actual column sources:
+- `<literal: NULL>` - NULL values
+- `<literal: 0>` - Numeric literals
+- `<literal: 'string'>` - String literals
+- `<literal: CURRENT_TIMESTAMP()>` - Function calls
+This helps identify which branches of a UNION contribute actual data lineage versus hardcoded values.
+### Multi-Level Analysis
+SQL Glider automatically traces through CTEs and subqueries:
+```sql
+-- query.sql
+WITH order_totals AS (
+    SELECT customer_id, SUM(order_amount) as total_amount
+    FROM orders
+    GROUP BY customer_id
+),
+customer_segments AS (
+    SELECT
+        ot.customer_id,
+        c.customer_name,
+        CASE
+            WHEN ot.total_amount > 10000 THEN 'Premium'
+            ELSE 'Standard'
+        END as segment
+    FROM order_totals ot
+    JOIN customers c ON ot.customer_id = c.customer_id
+)
+SELECT customer_name, segment, total_amount
+FROM customer_segments
+```
+```bash
+# Trace segment back to its ultimate sources
+uv run sqlglider lineage query.sql --column segment
+# Output: orders.order_amount (through the CASE statement and SUM)
+# Find what's affected by order_amount
+uv run sqlglider lineage query.sql --source-column orders.order_amount
+# Output: segment, total_amount
+```
+## CLI Reference
+```
+sqlglider lineage <sql_file> [OPTIONS]
+Arguments:
+  sql_file                    Path to SQL file to analyze [required]
+Options:
+  --level, -l                 Analysis level: 'column' or 'table' [default: column]
+  --dialect, -d               SQL dialect (spark, postgres, snowflake, etc.) [default: spark]
+  --column, -c                Specific output column for forward lineage [optional]
+  --source-column, -s         Source column for reverse lineage (impact analysis) [optional]
+  --table, -t                 Filter to only queries that reference this table (multi-query files) [optional]
+  --output-format, -f         Output format: 'text', 'json', or 'csv' [default: text]
+  --output-file, -o           Write output to file instead of stdout [optional]
+  --help                      Show help message and exit
+```
+**Notes:**
+- `--column` and `--source-column` are mutually exclusive. Use one or the other.
+- `--table` filter is useful for multi-query files to analyze only queries that reference a specific table.
+### Tables Command
+```
+sqlglider tables <sql_file> [OPTIONS]
+Arguments:
+  sql_file                    Path to SQL file to analyze [required]
+Options:
+  --dialect, -d               SQL dialect (spark, postgres, snowflake, etc.) [default: spark]
+  --table                     Filter to only queries that reference this table [optional]
+  --output-format, -f         Output format: 'text', 'json', or 'csv' [default: text]
+  --output-file, -o           Write output to file instead of stdout [optional]
+  --templater, -t             Templater for SQL preprocessing (e.g., 'jinja', 'none') [optional]
+  --var, -v                   Template variable in key=value format (repeatable) [optional]
+  --vars-file                 Path to variables file (JSON or YAML) [optional]
+  --help                      Show help message and exit
+```
+### Graph Commands
+```
+sqlglider graph build <paths> [OPTIONS]
+Arguments:
+  paths                       SQL file(s) or directory to process [optional]
+Options:
+  --output, -o                Output JSON file path [required]
+  --manifest, -m              Path to manifest CSV file [optional]
+  --recursive, -r             Recursively search directories [default: True]
+  --glob, -g                  Glob pattern for SQL files [default: *.sql]
+  --dialect, -d               SQL dialect [default: spark]
+  --node-format, -n           Node format: 'qualified' or 'structured' [default: qualified]
+```
+```
+sqlglider graph merge <inputs> [OPTIONS]
+Arguments:
+  inputs                      JSON graph files to merge [optional]
+Options:
+  --output, -o                Output file path [required]
+  --glob, -g                  Glob pattern for graph files [optional]
+```
+```
+sqlglider graph query <graph_file> [OPTIONS]
+Arguments:
+  graph_file                  Path to graph JSON file [required]
+Options:
+  --upstream, -u              Find source columns for this column [optional]
+  --downstream, -d            Find affected columns for this source [optional]
+  --output-format, -f         Output format: 'text', 'json', or 'csv' [default: text]
+```
+**Notes:**
+- `--upstream` and `--downstream` are mutually exclusive. Use one or the other.
+- Graph queries are case-insensitive for column matching.
+## Output Formats
+### Text Format (Default)
+Human-readable Rich table format showing query index and preview:
+```
+       Query 0: SELECT customer_name FROM customers c ...
++---------------------------------------------------+
+| Output Column   | Source Column                   |
+|-----------------+---------------------------------|
+| customer_name   | c.customer_name                 |
++---------------------------------------------------+
+Total: 1 row(s)
+```
+### JSON Format
+Machine-readable structured format with query metadata:
+```json
+{
+  "queries": [
+    {
+      "query_index": 0,
+      "query_preview": "SELECT customer_name FROM customers c ...",
+      "level": "column",
+      "lineage": [
+        {
+          "output_name": "customer_name",
+          "source_name": "c.customer_name"
+        }
+      ]
+    }
+  ]
+}
+```
+### CSV Format
+Spreadsheet-ready tabular format with query index:
+```csv
+query_index,output_column,source_column
+0,customer_name,c.customer_name
+```
+**Note:** Each source column gets its own row. If an output column has multiple sources, there will be multiple rows with the same `query_index` and `output_column`.
+## Development
+### Setup
+```bash
+# Install dependencies
+uv sync
+# Run linter
+uv run ruff check
+# Auto-fix issues
+uv run ruff check --fix
+# Format code
+uv run ruff format
+# Type checking
+uv run basedpyright
+```
+### Project Structure
+See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed technical documentation.
+```
+src/sqlglider/
+├── cli.py                    # Typer CLI entry point
+├── graph/
+│   ├── builder.py            # Build graphs from SQL files
+│   ├── merge.py              # Merge multiple graphs
+│   ├── query.py              # Query upstream/downstream lineage
+│   └── models.py             # Graph data models
+├── lineage/
+│   ├── analyzer.py           # Core lineage analysis using SQLGlot
+│   └── formatters.py         # Output formatters (text, JSON, CSV)
+└── utils/
+    └── file_utils.py         # File I/O utilities
+```
+## Publishing
+SQL Glider is configured for publishing to both TestPyPI and PyPI using `uv`.
+### Versioning
+SQL Glider uses Git tags for version management via [hatch-vcs](https://github.com/ofek/hatch-vcs). The version is automatically derived from Git:
+- **Tagged commits:** Version matches the tag (e.g., `git tag v0.2.0` produces version `0.2.0`)
+- **Untagged commits:** Version includes development info (e.g., `0.1.dev18+g7216a59`)
+**Creating a new release:**
+```bash
+# Create and push a version tag
+git tag v0.2.0
+git push origin v0.2.0
+# Build will now produce version 0.2.0
+uv build
+```
+**Tag format:** Use `v` prefix (e.g., `v1.0.0`, `v0.2.1`). The `v` is stripped from the final version number.
+### Building the Package
+```bash
+# Build the distribution files (wheel and sdist)
+uv build
+```
+This creates distribution files in the `dist/` directory.
+### Publishing to TestPyPI
+Always test your release on TestPyPI first:
+```bash
+# Publish to TestPyPI
+uv publish --index testpypi --token <YOUR_TESTPYPI_TOKEN>
+# Test installation from TestPyPI
+uv pip install --index-url https://test.pypi.org/simple/ sql-glider
+```
+### Publishing to PyPI
+Once verified on TestPyPI, publish to production:
+```bash
+# Publish to PyPI
+uv publish --index pypi --token <YOUR_PYPI_TOKEN>
+```
+### Token Setup
+You'll need API tokens from both registries:
+1. **TestPyPI Token:** Create at https://test.pypi.org/manage/account/token/
+2. **PyPI Token:** Create at https://pypi.org/manage/account/token/
+**Option 1: Pass token directly (shown above)**
+**Option 2: Environment variable**
+```bash
+export UV_PUBLISH_TOKEN=pypi-...
+uv publish --index pypi
+```
+**Option 3: Store in `.env` file (not committed to git)**
+```bash
+# .env
+UV_PUBLISH_TOKEN=pypi-...
+```
+**Security Note:** Never commit API tokens to version control. The `.gitignore` file should include `.env`.
+## Dependencies
+- **sqlglot[rs]:** SQL parser and lineage analysis library with Rust extensions
+- **typer:** CLI framework with type hints
+- **rich:** Terminal formatting and colored output
+- **pydantic:** Data validation and serialization
+- **rustworkx:** High-performance graph library for cross-file lineage analysis
+## References
+- [SQLGlot Documentation](https://sqlglot.com/)
+- [UV Documentation](https://docs.astral.sh/uv/)
+- [Typer Documentation](https://typer.tiangolo.com/)
+- [Ruff Documentation](https://docs.astral.sh/ruff/configuration/)
+## License
+See LICENSE file for details.