sqf-py 0.1.0 (PyPI, tar.gz): version diff

Files changed (20)

sqf_py-0.1.0/LICENSE +21 -0
sqf_py-0.1.0/PKG-INFO +191 -0
sqf_py-0.1.0/README.md +171 -0
sqf_py-0.1.0/pyproject.toml +32 -0
sqf_py-0.1.0/setup.cfg +4 -0
sqf_py-0.1.0/sqf/__init__.py +70 -0
sqf_py-0.1.0/sqf/analyzer.py +239 -0
sqf_py-0.1.0/sqf/benchmark.py +488 -0
sqf_py-0.1.0/sqf/fingerprint.py +139 -0
sqf_py-0.1.0/sqf/generator.py +647 -0
sqf_py-0.1.0/sqf/normalizer.py +318 -0
sqf_py-0.1.0/sqf/snowflake.py +460 -0
sqf_py-0.1.0/sqf/sql/001_create_cluster_store.sql +74 -0
sqf_py-0.1.0/sqf/sql/002_hit_rate_views.sql +186 -0
sqf_py-0.1.0/sqf/sql/003_query_history_export.sql +31 -0
sqf_py-0.1.0/sqf_py.egg-info/PKG-INFO +191 -0
sqf_py-0.1.0/sqf_py.egg-info/SOURCES.txt +18 -0
sqf_py-0.1.0/sqf_py.egg-info/dependency_links.txt +1 -0
sqf_py-0.1.0/sqf_py.egg-info/requires.txt +11 -0
sqf_py-0.1.0/sqf_py.egg-info/top_level.txt +1 -0

sqf_py-0.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Pragya Verma
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

sqf_py-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,191 @@
+Metadata-Version: 2.4
+Name: sqf-py
+Version: 0.1.0
+Summary: Semantic Query Fingerprinting for Snowflake — collapse syntactically different but logically identical SQL queries to a canonical fingerprint
+License: MIT
+Project-URL: Homepage, https://github.com/vermapragya/sqf-py
+Project-URL: White Paper, https://github.com/vermapragya/sqf-py/blob/main/WHITEPAPER.md
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: sqlglot>=25.0.0
+Provides-Extra: dev
+Requires-Dist: pytest>=7.0; extra == "dev"
+Requires-Dist: pytest-cov; extra == "dev"
+Provides-Extra: snowflake
+Requires-Dist: snowflake-connector-python>=3.0; extra == "snowflake"
+Provides-Extra: bench
+Requires-Dist: matplotlib>=3.7; extra == "bench"
+Dynamic: license-file
+# sqf-py — Semantic Query Fingerprinting for Snowflake
+[![Tests](https://img.shields.io/badge/tests-68%20passed-brightgreen)]()
+[![Python](https://img.shields.io/badge/python-3.9%2B-blue)]()
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)]()
+**sqf-py** assigns a stable, content-addressed fingerprint to any SQL query by normalizing away syntactic noise. Queries that are *logically identical* but *written differently* collapse to the same fingerprint — enabling deduplication analysis, cost attribution, and query-cache optimization on Snowflake.
+Accompanies the white paper: [**Semantic Query Deduplication in Cloud Data Warehouses**](WHITEPAPER.md).
+---
+## The Problem
+Modern data warehouses are bombarded with semantically identical queries that look different:
+```sql
+-- BI tool A (Looker)
+SELECT o.user_id AS uid, SUM(o.amount) AS revenue
+FROM orders AS o WHERE o.status = 'complete' AND o.created_at > '2024-01-01'
+GROUP BY 1
+-- BI tool B (Tableau)
+SELECT SUM(amount) AS revenue, user_id AS uid
+FROM orders WHERE created_at > '2023-06-01' AND status = 'active'
+GROUP BY uid
+```
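+Both queries instantiate the same logical template: they differ only in aliases, column order, predicate order, `GROUP BY` style, and literal values. Snowflake's result cache is keyed on query text and treats them as distinct, burning compute on every re-execution. sqf-py identifies them as duplicates: both fingerprint to `3c1a8c600789df69…`.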
+On a synthetic 10,000-query BI-style workload, sqf-py identifies **99.7% of executions as semantic duplicates**, at **~440 queries/second** analyzed client-side. See [the white paper](WHITEPAPER.md) for methodology and caveats.
+---
+## Installation
+```bash
+pip install sqf-py                 # core library (sqlglot only)
+pip install "sqf-py[snowflake]"    # + Snowflake connector
+pip install "sqf-py[bench]"        # + matplotlib for benchmark charts
+pip install "sqf-py[dev]"          # + pytest
+```
+---
+## Quick Start
+```python
+from sqf import fingerprint, are_equivalent, canonical_form, SQFAnalyzer
+# Single fingerprint
+fp = fingerprint("SELECT a, b FROM t WHERE id = 1")
+# → "3f4a1b9c..."  (64-char hex, stable)
+# Equivalence check: same template, differing only in aliases, column order, and literal values
+q1 = "SELECT a AS col1, b AS col2 FROM t WHERE id = 99"
+q2 = "SELECT b, a FROM t WHERE id = 1"
+are_equivalent(q1, q2)  # → True
+# See the canonical form
+canonical_form("SELECT a AS x, b AS y FROM t WHERE id = 42")
+# → "SELECT A, B FROM T WHERE ID = ?"
+# Bulk workload analysis
+analyzer = SQFAnalyzer()
+analyzer.ingest_sql(my_query_list, credits_per_query=0.05)
+print(analyzer.report().summary())
+```
+---
+## Normalization Pipeline
+The SQF algorithm applies these passes in order:
+| Pass | What it does | Example |
+|------|-------------|---------|
+| 1. GROUP BY reference resolution | `GROUP BY 1` / `GROUP BY alias` → actual expression | `GROUP BY 1` → `GROUP BY user_id` |
+| 2. Alias stripping | Remove all `AS` aliases and table qualifiers | `SELECT o.a AS x` → `SELECT a` |
+| 3. Column sort | Sort SELECT list alphabetically | `SELECT b, a` → `SELECT a, b` |
+| 4. GROUP BY sort | Sort GROUP BY keys | `GROUP BY b, a` → `GROUP BY a, b` |
+| 5. Predicate canonicalization | Sort AND/OR operands recursively | `WHERE b=2 AND a=1` → `WHERE a=1 AND b=2` |
+| 6. CTE inlining | Inline single-reference CTEs | `WITH x AS (...) SELECT ... FROM x` → subquery |
+| 7. Literal abstraction | Replace all values with `?` | `WHERE id = 42` → `WHERE id = ?` |
+| 8. Whitespace collapse + uppercase | Canonical string form | |
+| **Hash** | SHA-256 of canonical string | 64-char hex fingerprint |
+The precise equivalence class (and its deliberate trade-offs) is defined in [§2 of the white paper](WHITEPAPER.md).
+---
+## Analyzing a Snowflake Workload
+```python
+from sqf import SnowflakeIngestor, ClusterStore, SQFAnalyzer
+import snowflake.connector
+conn = snowflake.connector.connect(...)  # your credentials
+# 1. Pull the last 30 days of QUERY_HISTORY
+records = SnowflakeIngestor(conn, lookback_days=30, row_limit=50_000).fetch_records()
+# 2. Fingerprint + cluster
+report = SQFAnalyzer().ingest(records).report()
+print(report.summary())
+# ═══════════════════════════════════════════════════════════
+#   Semantic Query Fingerprint (SQF) Analysis Report
+# ═══════════════════════════════════════════════════════════
+#   Total query executions   :    12,847
+#   Unique SQF fingerprints  :     4,203
+#   Dedup hit rate           :    67.3%
+#   Credits wasted           :    86.4800
+#   ...
+# 3. Persist results back to Snowflake (idempotent MERGEs)
+store = ClusterStore(conn, database="SQF", schema="ANALYTICS")
+store.bootstrap()        # creates tables + 6 analytical views
+store.persist(report)
+# 4. Query the views
+store.overall_metrics()          # headline KPIs
+store.daily_hit_rate()           # time series for charts
+store.top_waste(10)              # the 10 most expensive duplicate clusters
+store.multi_variant_offenders(10)  # same logic, many SQL spellings
+```
+The bundled SQL (DDL, views, `QUERY_HISTORY` export) lives in [`sqf/sql/`](sqf/sql/) and is also usable standalone.
+---
+## Synthetic Workloads & Benchmarks
+No Snowflake account needed to try the library:
+```python
+from sqf import SyntheticWorkloadGenerator, SQFAnalyzer
+gen = SyntheticWorkloadGenerator(n_queries=1000, duplication_rate=0.7, seed=42)
+report = SQFAnalyzer().ingest(gen.generate()).report()
+print(report.summary())   # → 96.9% dedup hit rate
+```
+The generator models 12 logical query families (BI aggregates, joins, window functions, funnels, MRR rollups, …) with 8 syntactic variant dimensions each, plus realistic per-family credit cost distributions.
+Reproduce the white paper's full benchmark grid (36 configurations, ~5 min):
+```bash
+python -m sqf.benchmark --out benchmarks --full
+```
+Outputs `benchmarks/results.json` plus five charts:
+![Hit rate vs duplication rate](benchmarks/charts/01_hit_rate_vs_dup_rate.png)
+---
+## Development
+```bash
+git clone https://github.com/vermapragya/sqf-py
+cd sqf-py
+python3 -m venv .venv
+.venv/bin/pip install -e ".[dev,bench]"
+.venv/bin/python -m pytest        # 68 tests
+```
+---
+## License
+[MIT](LICENSE)
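
Note on the sample report in the description above: it lists 12,847 executions, 4,203 unique fingerprints, and a 67.3% dedup hit rate. Those figures are consistent with a hit rate of 1 - unique/total, although the package's own formula is not visible in this part of the diff. A minimal sketch under that assumption (`dedup_hit_rate` is a hypothetical name, not part of `sqf`):

```python
def dedup_hit_rate(total_executions: int, unique_fingerprints: int) -> float:
    """Fraction of executions that repeat an already-seen fingerprint.

    Assumed definition: 1 - unique / total. Not taken from the sqf source.
    """
    if total_executions <= 0:
        return 0.0
    return 1.0 - unique_fingerprints / total_executions


# Figures copied from the sample report in the README.
rate = dedup_hit_rate(12_847, 4_203)
print(f"{rate:.1%}")  # 67.3%
```

Under this definition the first execution of each fingerprint counts as a miss and every later execution counts as a hit.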

sqf_py-0.1.0/README.md ADDED

@@ -0,0 +1,171 @@
+# sqf-py — Semantic Query Fingerprinting for Snowflake
+[![Tests](https://img.shields.io/badge/tests-68%20passed-brightgreen)]()
+[![Python](https://img.shields.io/badge/python-3.9%2B-blue)]()
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)]()
+**sqf-py** assigns a stable, content-addressed fingerprint to any SQL query by normalizing away syntactic noise. Queries that are *logically identical* but *written differently* collapse to the same fingerprint — enabling deduplication analysis, cost attribution, and query-cache optimization on Snowflake.
+Accompanies the white paper: [**Semantic Query Deduplication in Cloud Data Warehouses**](WHITEPAPER.md).
+---
+## The Problem
+Modern data warehouses are bombarded with semantically identical queries that look different:
+```sql
+-- BI tool A (Looker)
+SELECT o.user_id AS uid, SUM(o.amount) AS revenue
+FROM orders AS o WHERE o.status = 'complete' AND o.created_at > '2024-01-01'
+GROUP BY 1
+-- BI tool B (Tableau)
+SELECT SUM(amount) AS revenue, user_id AS uid
+FROM orders WHERE created_at > '2023-06-01' AND status = 'active'
+GROUP BY uid
+```
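+Both queries instantiate the same logical template: they differ only in aliases, column order, predicate order, `GROUP BY` style, and literal values. Snowflake's result cache is keyed on query text and treats them as distinct, burning compute on every re-execution. sqf-py identifies them as duplicates: both fingerprint to `3c1a8c600789df69…`.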
+On a synthetic 10,000-query BI-style workload, sqf-py identifies **99.7% of executions as semantic duplicates**, at **~440 queries/second** analyzed client-side. See [the white paper](WHITEPAPER.md) for methodology and caveats.
+---
+## Installation
+```bash
+pip install sqf-py                 # core library (sqlglot only)
+pip install "sqf-py[snowflake]"    # + Snowflake connector
+pip install "sqf-py[bench]"        # + matplotlib for benchmark charts
+pip install "sqf-py[dev]"          # + pytest
+```
+---
+## Quick Start
+```python
+from sqf import fingerprint, are_equivalent, canonical_form, SQFAnalyzer
+# Single fingerprint
+fp = fingerprint("SELECT a, b FROM t WHERE id = 1")
+# → "3f4a1b9c..."  (64-char hex, stable)
+# Equivalence check: same template, differing only in aliases, column order, and literal values
+q1 = "SELECT a AS col1, b AS col2 FROM t WHERE id = 99"
+q2 = "SELECT b, a FROM t WHERE id = 1"
+are_equivalent(q1, q2)  # → True
+# See the canonical form
+canonical_form("SELECT a AS x, b AS y FROM t WHERE id = 42")
+# → "SELECT A, B FROM T WHERE ID = ?"
+# Bulk workload analysis
+analyzer = SQFAnalyzer()
+analyzer.ingest_sql(my_query_list, credits_per_query=0.05)
+print(analyzer.report().summary())
+```
+---
+## Normalization Pipeline
+The SQF algorithm applies these passes in order:
+| Pass | What it does | Example |
+|------|-------------|---------|
+| 1. GROUP BY reference resolution | `GROUP BY 1` / `GROUP BY alias` → actual expression | `GROUP BY 1` → `GROUP BY user_id` |
+| 2. Alias stripping | Remove all `AS` aliases and table qualifiers | `SELECT o.a AS x` → `SELECT a` |
+| 3. Column sort | Sort SELECT list alphabetically | `SELECT b, a` → `SELECT a, b` |
+| 4. GROUP BY sort | Sort GROUP BY keys | `GROUP BY b, a` → `GROUP BY a, b` |
+| 5. Predicate canonicalization | Sort AND/OR operands recursively | `WHERE b=2 AND a=1` → `WHERE a=1 AND b=2` |
+| 6. CTE inlining | Inline single-reference CTEs | `WITH x AS (...) SELECT ... FROM x` → subquery |
+| 7. Literal abstraction | Replace all values with `?` | `WHERE id = 42` → `WHERE id = ?` |
+| 8. Whitespace collapse + uppercase | Canonical string form | |
+| **Hash** | SHA-256 of canonical string | 64-char hex fingerprint |
+The precise equivalence class (and its deliberate trade-offs) is defined in [§2 of the white paper](WHITEPAPER.md).
+---
+## Analyzing a Snowflake Workload
+```python
+from sqf import SnowflakeIngestor, ClusterStore, SQFAnalyzer
+import snowflake.connector
+conn = snowflake.connector.connect(...)  # your credentials
+# 1. Pull the last 30 days of QUERY_HISTORY
+records = SnowflakeIngestor(conn, lookback_days=30, row_limit=50_000).fetch_records()
+# 2. Fingerprint + cluster
+report = SQFAnalyzer().ingest(records).report()
+print(report.summary())
+# ═══════════════════════════════════════════════════════════
+#   Semantic Query Fingerprint (SQF) Analysis Report
+# ═══════════════════════════════════════════════════════════
+#   Total query executions   :    12,847
+#   Unique SQF fingerprints  :     4,203
+#   Dedup hit rate           :    67.3%
+#   Credits wasted           :    86.4800
+#   ...
+# 3. Persist results back to Snowflake (idempotent MERGEs)
+store = ClusterStore(conn, database="SQF", schema="ANALYTICS")
+store.bootstrap()        # creates tables + 6 analytical views
+store.persist(report)
+# 4. Query the views
+store.overall_metrics()          # headline KPIs
+store.daily_hit_rate()           # time series for charts
+store.top_waste(10)              # the 10 most expensive duplicate clusters
+store.multi_variant_offenders(10)  # same logic, many SQL spellings
+```
+The bundled SQL (DDL, views, `QUERY_HISTORY` export) lives in [`sqf/sql/`](sqf/sql/) and is also usable standalone.
+---
+## Synthetic Workloads & Benchmarks
+No Snowflake account needed to try the library:
+```python
+from sqf import SyntheticWorkloadGenerator, SQFAnalyzer
+gen = SyntheticWorkloadGenerator(n_queries=1000, duplication_rate=0.7, seed=42)
+report = SQFAnalyzer().ingest(gen.generate()).report()
+print(report.summary())   # → 96.9% dedup hit rate
+```
+The generator models 12 logical query families (BI aggregates, joins, window functions, funnels, MRR rollups, …) with 8 syntactic variant dimensions each, plus realistic per-family credit cost distributions.
+Reproduce the white paper's full benchmark grid (36 configurations, ~5 min):
+```bash
+python -m sqf.benchmark --out benchmarks --full
+```
+Outputs `benchmarks/results.json` plus five charts:
+![Hit rate vs duplication rate](benchmarks/charts/01_hit_rate_vs_dup_rate.png)
+---
+## Development
+```bash
+git clone https://github.com/vermapragya/sqf-py
+cd sqf-py
+python3 -m venv .venv
+.venv/bin/pip install -e ".[dev,bench]"
+.venv/bin/python -m pytest        # 68 tests
+```
+---
+## License
+[MIT](LICENSE)
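
Note on the Normalization Pipeline table above: the last two passes (literal abstraction, whitespace collapse plus uppercase) and the final SHA-256 hash can be illustrated with the standard library alone. The sketch below is a toy, not the package's implementation. The package depends on sqlglot, so its real passes presumably operate on a parsed tree. This version uses regular expressions, skips passes 1 to 6, and would mishandle positional references such as `GROUP BY 1`. Both function names are hypothetical.

```python
import hashlib
import re


def toy_canonical_form(sql: str) -> str:
    """Toy normalizer: abstract literals, collapse whitespace, uppercase."""
    # Replace single-quoted string literals (with '' escapes) first,
    # so digits inside strings are not touched by the numeric pass.
    sql = re.sub(r"'(?:[^']|'')*'", "?", sql)
    # Replace standalone numeric literals; digits inside identifiers
    # such as t1 or col2 are left alone because of the word boundaries.
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "?", sql)
    # Collapse runs of whitespace and uppercase the result.
    return re.sub(r"\s+", " ", sql).strip().upper()


def toy_fingerprint(sql: str) -> str:
    """SHA-256 hex digest (64 characters) of the toy canonical form."""
    return hashlib.sha256(toy_canonical_form(sql).encode("utf-8")).hexdigest()


a = "select a, b  from t\n where id = 42 and status = 'active'"
b = "SELECT a, b FROM t WHERE id = 7 AND status = 'complete'"
print(toy_canonical_form(a))  # SELECT A, B FROM T WHERE ID = ? AND STATUS = ?
print(toy_fingerprint(a) == toy_fingerprint(b))  # True
```

Hashing the canonical string rather than the raw text is what makes the fingerprint content-addressed: any two inputs that normalize to the same string get the same 64-character digest.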

sqf_py-0.1.0/pyproject.toml ADDED

@@ -0,0 +1,32 @@
+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "sqf-py"
+version = "0.1.0"
+description = "Semantic Query Fingerprinting for Snowflake — collapse syntactically different but logically identical SQL queries to a canonical fingerprint"
+readme = "README.md"
+license = { text = "MIT" }
+requires-python = ">=3.9"
+dependencies = [
+    "sqlglot>=25.0.0",
+]
+[project.optional-dependencies]
+dev = ["pytest>=7.0", "pytest-cov"]
+snowflake = ["snowflake-connector-python>=3.0"]
+bench = ["matplotlib>=3.7"]
+[project.urls]
+Homepage = "https://github.com/vermapragya/sqf-py"
+"White Paper" = "https://github.com/vermapragya/sqf-py/blob/main/WHITEPAPER.md"
+[tool.setuptools.packages.find]
+include = ["sqf*"]
+[tool.setuptools.package-data]
+sqf = ["sql/*.sql"]
+[tool.pytest.ini_options]
+testpaths = ["."]
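
Note on `[tool.setuptools.package-data]` above: the `sql/*.sql` entry ships the SQL scripts inside the installed package. The package exports a `load_sql` helper, but its body is not shown in this part of the diff. One common way to read such files is `importlib.resources`. The sketch below assumes that approach, and `read_packaged_sql` is a hypothetical name.

```python
from importlib.resources import files


def read_packaged_sql(package: str, filename: str) -> str:
    """Return the text of <package>/sql/<filename> from an importable package."""
    # files() resolves the package's install location, so this works
    # regardless of the current working directory.
    resource = files(package) / "sql" / filename
    return resource.read_text(encoding="utf-8")
```

With sqf-py installed, `read_packaged_sql("sqf", "001_create_cluster_store.sql")` would be expected to return the first bundled script listed in the file summary.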

sqf_py-0.1.0/setup.cfg ADDED

@@ -0,0 +1,4 @@
+[egg_info]
+tag_build =
+tag_date = 0

sqf_py-0.1.0/sqf/__init__.py ADDED

@@ -0,0 +1,70 @@
+"""
+sqf — Semantic Query Fingerprinting for Snowflake
+==================================================
+A Python library that assigns a stable, content-addressed fingerprint to any
+SQL query by normalizing away syntactic noise.  Queries that are logically
+identical but written differently collapse to the same fingerprint, enabling
+deduplication analysis, cost attribution, and query-cache optimization.
+Quick start::
+    from sqf import fingerprint, are_equivalent, SQFAnalyzer
+    # Single query fingerprinting
+    h = fingerprint("SELECT a, b FROM t WHERE id = 1")
+    # Equivalence check
+    are_equivalent(
+        "SELECT a AS col1, b AS col2 FROM t WHERE id = 1",
+        "SELECT b, a FROM t WHERE id = ?",
+    )  # → True
+    # Bulk workload analysis
+    analyzer = SQFAnalyzer()
+    analyzer.ingest_sql(my_query_list)
+    print(analyzer.report().summary())
+"""
+from .fingerprint import (
+    fingerprint,
+    canonical_form,
+    are_equivalent,
+    QueryRecord,
+    SQFCluster,
+)
+from .normalizer import normalize
+from .analyzer import SQFAnalyzer, SQFReport
+from .generator import SyntheticWorkloadGenerator, FAMILIES, FAMILY_BY_ID
+from .snowflake import SnowflakeIngestor, ClusterStore, load_sql, SQL_FILES
+from .benchmark import (
+    BenchmarkRun,
+    BenchmarkSuite,
+    run_single,
+    run_benchmark_suite,
+    make_charts,
+)
+__version__ = "0.1.0"
+__all__ = [
+    "fingerprint",
+    "canonical_form",
+    "are_equivalent",
+    "normalize",
+    "QueryRecord",
+    "SQFCluster",
+    "SQFAnalyzer",
+    "SQFReport",
+    "SyntheticWorkloadGenerator",
+    "FAMILIES",
+    "FAMILY_BY_ID",
+    "SnowflakeIngestor",
+    "ClusterStore",
+    "load_sql",
+    "SQL_FILES",
+    "BenchmarkRun",
+    "BenchmarkSuite",
+    "run_single",
+    "run_benchmark_suite",
+    "make_charts",
+]
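
Note on the imports above: `sqf/__init__.py` imports `.snowflake` and `.benchmark` eagerly, while `snowflake-connector-python` and `matplotlib` are declared as optional extras. Whether those two modules defer their third-party imports is not visible in this file. If they import at module top level, `import sqf` would fail on a core-only install. One common guard is sketched below with hypothetical helpers that are not part of the package.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Import a module if it is available; return None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require(module_name: str, extra: str) -> ModuleType:
    """Import a module, or raise an error naming the extra that provides it."""
    module = optional_import(module_name)
    if module is None:
        raise ImportError(
            f"{module_name} is required; install it with: pip install 'sqf-py[{extra}]'"
        )
    return module
```

Calling `require("matplotlib", "bench")` inside the chart-drawing function, rather than at import time, would keep the core install usable while still giving a clear message when the extra is missing.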