dbt-polyglot 0.1.0__tar.gz

@@ -0,0 +1,25 @@
1
+ # Changelog
2
+
3
+ All notable changes to `dbt-polyglot` are documented here. Format loosely follows
4
+ [Keep a Changelog](https://keepachangelog.com/); this project uses [SemVer](https://semver.org/).
5
+
6
+ ## [0.1.0] — Unreleased
7
+
8
+ ### Added
9
+ Initial release (as `dbt-polyglot`).
10
+ - Standard src-layout package
11
+ (`src/dbt_polyglot/`), split into `transpile` (the compile-phase patch) and `fixups` (the
12
+ `SPARK_FIXUPS` registry), with import-time activation in `__init__`.
13
+ - Compile-phase transpile: wraps `dbt.compilation.Compiler._compile_code` to translate each opted-in
14
+ model's SQL from a source dialect to Spark via `sqlglot` (`parse → fix-ups → generate`), before dbt
15
+ wraps it in materialization DDL. Opt in with `+transpile_from: <dialect>` in dbt config; no model edits.
16
+ - **Spark-output fix-up layer** (`SPARK_FIXUPS`): repairs sqlglot output that Spark's real parser rejects.
17
+ First transform rewrites quantified-subquery comparisons (`x <> ALL (subq)` / `x = ANY (subq)`) back to
18
+ `NOT x IN (subq)` / `x IN (subq)`. Extensible registry.
19
+ - Fail-soft: any transpile error / empty / multi-statement output logs a WARNING and passes the original
20
+ SQL through unchanged — never crashes a compile, never silently emits a wrong result.
21
+ - Pretty-printed output; no-op when `transpile_from` is unset or equals the target dialect.
22
+
23
+ ### Notes
24
+ - Patches a dbt-core private method (`_compile_code`); import-guarded to fail open. Pin a supported
25
+ dbt-core range and re-verify on major dbt upgrades.
@@ -0,0 +1,202 @@
1
+
2
+ Apache License
3
+ Version 2.0, January 2004
4
+ http://www.apache.org/licenses/
5
+
6
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
7
+
8
+ 1. Definitions.
9
+
10
+ "License" shall mean the terms and conditions for use, reproduction,
11
+ and distribution as defined by Sections 1 through 9 of this document.
12
+
13
+ "Licensor" shall mean the copyright owner or entity authorized by
14
+ the copyright owner that is granting the License.
15
+
16
+ "Legal Entity" shall mean the union of the acting entity and all
17
+ other entities that control, are controlled by, or are under common
18
+ control with that entity. For the purposes of this definition,
19
+ "control" means (i) the power, direct or indirect, to cause the
20
+ direction or management of such entity, whether by contract or
21
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
22
+ outstanding shares, or (iii) beneficial ownership of such entity.
23
+
24
+ "You" (or "Your") shall mean an individual or Legal Entity
25
+ exercising permissions granted by this License.
26
+
27
+ "Source" form shall mean the preferred form for making modifications,
28
+ including but not limited to software source code, documentation
29
+ source, and configuration files.
30
+
31
+ "Object" form shall mean any form resulting from mechanical
32
+ transformation or translation of a Source form, including but
33
+ not limited to compiled object code, generated documentation,
34
+ and conversions to other media types.
35
+
36
+ "Work" shall mean the work of authorship, whether in Source or
37
+ Object form, made available under the License, as indicated by a
38
+ copyright notice that is included in or attached to the work
39
+ (an example is provided in the Appendix below).
40
+
41
+ "Derivative Works" shall mean any work, whether in Source or Object
42
+ form, that is based on (or derived from) the Work and for which the
43
+ editorial revisions, annotations, elaborations, or other modifications
44
+ represent, as a whole, an original work of authorship. For the purposes
45
+ of this License, Derivative Works shall not include works that remain
46
+ separable from, or merely link (or bind by name) to the interfaces of,
47
+ the Work and Derivative Works thereof.
48
+
49
+ "Contribution" shall mean any work of authorship, including
50
+ the original version of the Work and any modifications or additions
51
+ to that Work or Derivative Works thereof, that is intentionally
52
+ submitted to Licensor for inclusion in the Work by the copyright owner
53
+ or by an individual or Legal Entity authorized to submit on behalf of
54
+ the copyright owner. For the purposes of this definition, "submitted"
55
+ means any form of electronic, verbal, or written communication sent
56
+ to the Licensor or its representatives, including but not limited to
57
+ communication on electronic mailing lists, source code control systems,
58
+ and issue tracking systems that are managed by, or on behalf of, the
59
+ Licensor for the purpose of discussing and improving the Work, but
60
+ excluding communication that is conspicuously marked or otherwise
61
+ designated in writing by the copyright owner as "Not a Contribution."
62
+
63
+ "Contributor" shall mean Licensor and any individual or Legal Entity
64
+ on behalf of whom a Contribution has been received by Licensor and
65
+ subsequently incorporated within the Work.
66
+
67
+ 2. Grant of Copyright License. Subject to the terms and conditions of
68
+ this License, each Contributor hereby grants to You a perpetual,
69
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
70
+ copyright license to reproduce, prepare Derivative Works of,
71
+ publicly display, publicly perform, sublicense, and distribute the
72
+ Work and such Derivative Works in Source or Object form.
73
+
74
+ 3. Grant of Patent License. Subject to the terms and conditions of
75
+ this License, each Contributor hereby grants to You a perpetual,
76
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
77
+ (except as stated in this section) patent license to make, have made,
78
+ use, offer to sell, sell, import, and otherwise transfer the Work,
79
+ where such license applies only to those patent claims licensable
80
+ by such Contributor that are necessarily infringed by their
81
+ Contribution(s) alone or by combination of their Contribution(s)
82
+ with the Work to which such Contribution(s) was submitted. If You
83
+ institute patent litigation against any entity (including a
84
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
85
+ or a Contribution incorporated within the Work constitutes direct
86
+ or contributory patent infringement, then any patent licenses
87
+ granted to You under this License for that Work shall terminate
88
+ as of the date such litigation is filed.
89
+
90
+ 4. Redistribution. You may reproduce and distribute copies of the
91
+ Work or Derivative Works thereof in any medium, with or without
92
+ modifications, and in Source or Object form, provided that You
93
+ meet the following conditions:
94
+
95
+ (a) You must give any other recipients of the Work or
96
+ Derivative Works a copy of this License; and
97
+
98
+ (b) You must cause any modified files to carry prominent notices
99
+ stating that You changed the files; and
100
+
101
+ (c) You must retain, in the Source form of any Derivative Works
102
+ that You distribute, all copyright, patent, trademark, and
103
+ attribution notices from the Source form of the Work,
104
+ excluding those notices that do not pertain to any part of
105
+ the Derivative Works; and
106
+
107
+ (d) If the Work includes a "NOTICE" text file as part of its
108
+ distribution, then any Derivative Works that You distribute must
109
+ include a readable copy of the attribution notices contained
110
+ within such NOTICE file, excluding those notices that do not
111
+ pertain to any part of the Derivative Works, in at least one
112
+ of the following places: within a NOTICE text file distributed
113
+ as part of the Derivative Works; within the Source form or
114
+ documentation, if provided along with the Derivative Works; or,
115
+ within a display generated by the Derivative Works, if and
116
+ wherever such third-party notices normally appear. The contents
117
+ of the NOTICE file are for informational purposes only and
118
+ do not modify the License. You may add Your own attribution
119
+ notices within Derivative Works that You distribute, alongside
120
+ or as an addendum to the NOTICE text from the Work, provided
121
+ that such additional attribution notices cannot be construed
122
+ as modifying the License.
123
+
124
+ You may add Your own copyright statement to Your modifications and
125
+ may provide additional or different license terms and conditions
126
+ for use, reproduction, or distribution of Your modifications, or
127
+ for any such Derivative Works as a whole, provided Your use,
128
+ reproduction, and distribution of the Work otherwise complies with
129
+ the conditions stated in this License.
130
+
131
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
132
+ any Contribution intentionally submitted for inclusion in the Work
133
+ by You to the Licensor shall be under the terms and conditions of
134
+ this License, without any additional terms or conditions.
135
+ Notwithstanding the above, nothing herein shall supersede or modify
136
+ the terms of any separate license agreement you may have executed
137
+ with Licensor regarding such Contributions.
138
+
139
+ 6. Trademarks. This License does not grant permission to use the trade
140
+ names, trademarks, service marks, or product names of the Licensor,
141
+ except as required for reasonable and customary use in describing the
142
+ origin of the Work and reproducing the content of the NOTICE file.
143
+
144
+ 7. Disclaimer of Warranty. Unless required by applicable law or
145
+ agreed to in writing, Licensor provides the Work (and each
146
+ Contributor provides its Contributions) on an "AS IS" BASIS,
147
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148
+ implied, including, without limitation, any warranties or conditions
149
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150
+ PARTICULAR PURPOSE. You are solely responsible for determining the
151
+ appropriateness of using or redistributing the Work and assume any
152
+ risks associated with Your exercise of permissions under this License.
153
+
154
+ 8. Limitation of Liability. In no event and under no legal theory,
155
+ whether in tort (including negligence), contract, or otherwise,
156
+ unless required by applicable law (such as deliberate and grossly
157
+ negligent acts) or agreed to in writing, shall any Contributor be
158
+ liable to You for damages, including any direct, indirect, special,
159
+ incidental, or consequential damages of any character arising as a
160
+ result of this License or out of the use or inability to use the
161
+ Work (including but not limited to damages for loss of goodwill,
162
+ work stoppage, computer failure or malfunction, or any and all
163
+ other commercial damages or losses), even if such Contributor
164
+ has been advised of the possibility of such damages.
165
+
166
+ 9. Accepting Warranty or Additional Liability. While redistributing
167
+ the Work or Derivative Works thereof, You may choose to offer,
168
+ and charge a fee for, acceptance of support, warranty, indemnity,
169
+ or other liability obligations and/or rights consistent with this
170
+ License. However, in accepting such obligations, You may act only
171
+ on Your own behalf and on Your sole responsibility, not on behalf
172
+ of any other Contributor, and only if You agree to indemnify,
173
+ defend, and hold each Contributor harmless for any liability
174
+ incurred by, or claims asserted against, such Contributor by reason
175
+ of your accepting any such warranty or additional liability.
176
+
177
+ END OF TERMS AND CONDITIONS
178
+
179
+ APPENDIX: How to apply the Apache License to your work.
180
+
181
+ To apply the Apache License to your work, attach the following
182
+ boilerplate notice, with the fields enclosed by brackets "[]"
183
+ replaced with your own identifying information. (Don't include
184
+ the brackets!) The text should be enclosed in the appropriate
185
+ comment syntax for the file format. We also recommend that a
186
+ file or class name and description of purpose be included on the
187
+ same "printed page" as the copyright notice for easier
188
+ identification within third-party archives.
189
+
190
+ Copyright [2026] [Saket Kumar]
191
+
192
+ Licensed under the Apache License, Version 2.0 (the "License");
193
+ you may not use this file except in compliance with the License.
194
+ You may obtain a copy of the License at
195
+
196
+ http://www.apache.org/licenses/LICENSE-2.0
197
+
198
+ Unless required by applicable law or agreed to in writing, software
199
+ distributed under the License is distributed on an "AS IS" BASIS,
200
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
201
+ See the License for the specific language governing permissions and
202
+ limitations under the License.
@@ -0,0 +1,5 @@
1
+ include dbt_polyglot.pth
2
+ include LICENSE
3
+ include README.md
4
+ include CHANGELOG.md
5
+ recursive-include tests *.py
@@ -0,0 +1,247 @@
1
+ Metadata-Version: 2.4
2
+ Name: dbt-polyglot
3
+ Version: 0.1.0
4
+ Summary: Run any-dialect dbt models on Spark unchanged — transpiles each model's SQL to Spark at dbt compile time via sqlglot.
5
+ Author-email: Saket Kumar <kumar.saket0021@gmail.com>
6
+ License-Expression: Apache-2.0
7
+ Project-URL: Homepage, https://github.com/Saketkr21/dbt-polyglot
8
+ Project-URL: Repository, https://github.com/Saketkr21/dbt-polyglot
9
+ Project-URL: Issues, https://github.com/Saketkr21/dbt-polyglot/issues
10
+ Keywords: dbt,spark,sqlglot,snowflake,sql,transpile,dialect,polyglot
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Programming Language :: SQL
15
+ Classifier: Topic :: Database
16
+ Classifier: Topic :: Software Development :: Code Generators
17
+ Requires-Python: >=3.9
18
+ Description-Content-Type: text/markdown
19
+ License-File: LICENSE
20
+ Requires-Dist: sqlglot>=20.0
21
+ Requires-Dist: dbt-core>=1.5
22
+ Provides-Extra: test
23
+ Requires-Dist: pytest; extra == "test"
24
+ Dynamic: license-file
25
+
26
+ # dbt-polyglot
27
+
28
+ **Run a dbt project written in another SQL dialect (Snowflake, BigQuery, Redshift, …) on
29
+ Spark — unchanged.** Each model's SQL is transpiled to Spark with
30
+ [`sqlglot`](https://github.com/tobikodata/sqlglot) at dbt's **compile phase**, so the SQL
31
+ dbt actually executes (and what lands in `target/compiled/`) is already Spark.
32
+
33
+ The only changes are configuration — your model `.sql` files are never edited. Drop the
34
+ package into any existing dbt repo, point `profiles.yml` at Spark, declare the source
35
+ dialect in `dbt_project.yml`, and `dbt build`.
36
+
37
+ > Why this exists: Spark has no `QUALIFY` clause (`[PARSE_SYNTAX_ERROR] … near 'QUALIFY'`),
38
+ > plus dozens of smaller dialect gaps (`IFF`, `NVL`, `::` casts, `DATEADD`, null ordering, …).
39
+ > A portable/Snowflake-style model fails on Spark until its SQL is translated. This package
40
+ > does that translation transparently, in-place, at compile time.
41
+
42
+ ---
43
+
44
+ ## Install
45
+
46
+ It is a **normal Python package** — install it into the same virtualenv your `dbt` runs in.
47
+ Installation auto-activates the patch (via a `.pth` file that imports the module on
48
+ interpreter start-up; see [Installation: why pip, not `dbt deps`](#installation-why-pip-not-dbt-deps)).
49
+
50
+ ```bash
51
+ pip install dbt-polyglot
52
+ ```
53
+
54
+ From a git checkout (bleeding edge):
55
+
56
+ ```bash
57
+ pip install "git+https://github.com/Saketkr21/dbt-polyglot.git"
58
+ ```
59
+
60
+ Local / editable (developing the package):
61
+
62
+ ```bash
63
+ pip install -e path/to/dbt-polyglot
64
+ ```
65
+
66
+ You also need a Spark adapter for dbt (this package does not pull one in, so you can choose
67
+ your connection method):
68
+
69
+ ```bash
70
+ pip install "dbt-spark[PyHive]" # Thrift/HiveServer2, used in the examples below
71
+ ```
72
+
73
+ ---
74
+
75
+ ## Configure (the only changes you make)
76
+
77
+ ### 1. `profiles.yml` — point the output at Spark
78
+
79
+ ```yaml
80
+ your_profile:
81
+ target: dev
82
+ outputs:
83
+ dev:
84
+ type: spark
85
+ method: thrift
86
+ host: "{{ env_var('DBT_SPARK_HOST', 'localhost') }}"
87
+ port: "{{ env_var('DBT_SPARK_PORT', 10000) | int }}"
88
+ schema: analytics
89
+ ```
90
+
91
+ ### 2. `dbt_project.yml` — declare your models' source dialect
92
+
93
+ ```yaml
94
+ models:
95
+ your_project:
96
+ +transpile_from: snowflake # the dialect your models are written in
97
+ # +transpile_to: spark # optional, default 'spark'
98
+ ```
99
+
100
+ `transpile_from` accepts **any** dialect `sqlglot` understands — `snowflake`, `bigquery`,
101
+ `redshift`, `tsql`, `postgres`, `duckdb`, `presto`, `trino`, … `transpile_to` defaults to
102
+ `spark` and rarely needs changing.
103
+
104
+ You can scope it to a subtree (`models.your_project.staging.+transpile_from: …`) or override
105
+ it per model — a per-model `config` beats the project default:
106
+
107
+ ```sql
108
+ -- models/marts/latest_order.sql (written in Snowflake SQL, runs on Spark)
109
+ {{ config(materialized='table', transpile_from='snowflake') }}
110
+
111
+ select *
112
+ from {{ ref('orders') }}
113
+ qualify row_number() over (partition by customer_id order by ordered_at desc) = 1
114
+ ```
115
+
116
+ That's it. `dbt build` now runs your existing models on Spark, no model edits.
117
+
118
+ ---
119
+
120
+ ## How it works
121
+
122
+ At dbt **compile**, the package wraps `dbt.compilation.Compiler._compile_code` and runs an
123
+ extra step on each opted-in model's compiled SQL body:
124
+
125
+ ```
126
+ parse(read=transpile_from) → apply SPARK_FIXUPS → generate(transpile_to, pretty=True)
127
+ ```
128
+
129
+ Because the rewrite happens on the model body **before** dbt wraps it in the materialization
130
+ DDL (`create table … as …`), both `target/compiled/` and the SQL sent to Spark are pure
131
+ Spark — there is no mixed-dialect string and no separate output directory.
132
+
133
+ ### The fix-up layer (what makes it trustable)
134
+
135
+ `sqlglot`'s output is occasionally valid in *its* model of Spark but rejected by Spark's
136
+ **real** parser. The classic case: `x NOT IN (subquery)`, which `sqlglot`'s Snowflake reader
137
+ canonicalizes to the **unsupported** `x <> ALL (subquery)`. The `SPARK_FIXUPS` registry is a
138
+ list of small AST transforms applied to the parsed tree before Spark SQL is generated; the
139
+ first one rewrites quantified-subquery comparisons (`x <> ALL (subq)` / `x = ANY (subq)`) back to
140
+ `NOT x IN (subq)` / `x IN (subq)`. The registry is extensible: one EXPLAIN-verified transform per
141
+ gap discovered.
142
+
143
+ ### Trust model — verified, or fails loud (never silently wrong)
144
+
145
+ A model is either converted to **valid Spark SQL** or it **fails loudly** with a clear
146
+ dbt/Spark error naming the model. It never silently emits a wrong result from an
147
+ un-converted construct:
148
+
149
+ - **Fail-soft + loud.** If `sqlglot` can't parse the SQL as the source dialect, or produces
150
+ empty/multi-statement output, the patch logs a `WARNING` (visible in the dbt run) and
151
+ passes the **original SQL through unchanged**. Spark then either runs it (it was already
152
+ valid) or rejects it loudly — so the failure surfaces, it is never hidden.
153
+
154
+ To certify a whole repo **upfront** — before a heavy run — use dbt's own native validation.
155
+ No extra tooling: dbt already runs SQL through your `profiles.yml` adapter, against whatever
156
+ warehouse you target.
157
+
158
+ ```bash
159
+ dbt build --empty # build every model with 0 input rows (DAG-ordered)
160
+ dbt build --empty --select marts.* # any dbt selector works
161
+ dbt show --limit 0 -s my_model # read-only: validate the SELECT without materializing
162
+ ```
163
+
164
+ `--empty` limits every `ref`/`source` to zero rows, so dbt executes each model's real SQL
165
+ against the warehouse — moving no data — and **fails loudly, naming the model**, if the
166
+ transpiled SQL is invalid. Because it builds in dependency order, there is no "upstream not
167
+ built" ambiguity. That makes `dbt build --empty` a drop-in CI gate (it exits non-zero if
168
+ any model is invalid). `dbt show --limit 0` is the non-destructive variant when the target
169
+ role can't create objects.
170
+
171
+ ### Scope
172
+
173
+ Every opted-in model is transpiled — the full `sqlglot` breadth (`IFF`→`IF`, `NVL`→`COALESCE`,
174
+ `::`→`CAST`, `DATEADD`→`DATE_ADD`, `QUALIFY`→windowed subquery, …). To transpile only part of a
175
+ project, scope `+transpile_from` to a folder/model subtree (or set it per model) — the dbt-native
176
+ way — rather than a global on/off.
177
+
178
+ ### No-op guarantee
179
+
180
+ If `transpile_from` is unset, or equals `transpile_to` (you're already writing Spark SQL),
181
+ the model is **never touched** — `sqlglot` is not even called and nothing is reformatted.
182
+
183
+ ### A note on `NULLS LAST` in the output (intentional)
184
+
185
+ Snowflake and Spark have **opposite** default null ordering (Snowflake sorts NULLs largest →
186
+ last; Spark sorts them smallest → first). When translating a Snowflake `ORDER BY x`,
187
+ `sqlglot` appends an explicit `… NULLS LAST` to **preserve Snowflake semantics** — without
188
+ it, a `QUALIFY ROW_NUMBER() … = 1` top-N pick could choose a different row. It is added only
189
+ on a true cross-dialect translation, and is semantically required — do not strip it.
190
+
191
+ ---
192
+
193
+ ## Installation: why `pip`, not `dbt deps`
194
+
195
+ **`dbt deps` cannot install this — you must `pip install` it.** They do different things:
196
+
197
+ - **`dbt deps`** installs **dbt packages**: bundles of dbt *macros, models, seeds, and
198
+ tests* (the things listed in `packages.yml` / `dependencies.yml`). It pulls SQL/Jinja
199
+ assets into `dbt_packages/` and **never installs or runs Python code**.
200
+ - **`dbt-polyglot`** is a **Python package**. It works by monkeypatching a dbt-core
201
+ function at runtime, and it activates through a `.pth` file that Python executes on
202
+ interpreter start-up. Both of those are Python-installer concerns — only `pip` (or `uv`,
203
+ `poetry`, etc.) places a `.pth` into `site-packages` and registers the dependency.
204
+
205
+ So it is installed exactly like `dbt-core` or an adapter, into the same environment as your
206
+ dbt. It does not appear in `packages.yml`.
207
+
208
+ ---
209
+
210
+ ## Package contents
211
+
212
+ A standard src-layout package — `src/dbt_polyglot/` holds the import package, plus a `.pth`
213
+ that activates it on start-up:
214
+
215
+ | File | Role |
216
+ |------|------|
217
+ | `src/dbt_polyglot/__init__.py` | Import-time activation: patches the dbt Compiler. |
218
+ | `src/dbt_polyglot/transpile.py` | The compile-phase patch (`patch_compiler`) + core `spark_safe_transpile`. |
219
+ | `src/dbt_polyglot/fixups.py` | The `SPARK_FIXUPS` registry of AST transforms. |
220
+ | `dbt_polyglot.pth` | One line (`import dbt_polyglot`); auto-activates on start-up. Installed into `site-packages` by the `build_py` shim in `setup.py`. |
221
+ | `pyproject.toml` / `setup.py` | PEP 517 metadata; `setup.py` exists only to place the `.pth` into purelib. |
222
+ | `LICENSE` | Apache-2.0. |
223
+
224
+ This package is intentionally limited to **transpilation**. Validating the result is left to
225
+ dbt's native `dbt build --empty` (see [Trust model](#trust-model--verified-or-fails-loud-never-silently-wrong)
226
+ above); catalog routing (mapping `file_format` → a Spark catalog) and seed re-runnability are
227
+ **separate concerns** and are not bundled here.
228
+
229
+ ---
230
+
231
+ ## Compatibility & caveats
232
+
233
+ - **dbt-core private method.** The patch wraps `dbt.compilation.Compiler._compile_code`, a
234
+ **private** dbt-core method. It forwards `*args/**kwargs` to tolerate signature drift and
235
+ is fully import-guarded (if dbt-core or `sqlglot` isn't importable, or the seam moves, the
236
+ patch does nothing rather than breaking the interpreter). Still, **pin a supported dbt-core
237
+ range** when depending on this in production, and re-verify after major dbt upgrades.
238
+ - **`sqlglot` coverage.** `sqlglot` maps a large surface but not everything. Exotic dialect
239
+ features — Snowflake `LATERAL FLATTEN`, `VARIANT`/`OBJECT`/`ARRAY` semantics, `:` path
240
+ access, `LISTAGG`, and similar — may not translate cleanly. Those surface via the fail-soft
241
+ WARNING and `dbt build --empty`, by design, rather than silently.
242
+ - **Self-contained.** The module imports nothing from any host project, so it can be lifted
243
+ into its own repo unchanged.
244
+
245
+ ## License
246
+
247
+ Apache-2.0 — see [LICENSE](LICENSE).
@@ -0,0 +1,222 @@
1
+ # dbt-polyglot
2
+
3
+ **Run a dbt project written in another SQL dialect (Snowflake, BigQuery, Redshift, …) on
4
+ Spark — unchanged.** Each model's SQL is transpiled to Spark with
5
+ [`sqlglot`](https://github.com/tobikodata/sqlglot) at dbt's **compile phase**, so the SQL
6
+ dbt actually executes (and what lands in `target/compiled/`) is already Spark.
7
+
8
+ The only changes are configuration — your model `.sql` files are never edited. Drop the
9
+ package into any existing dbt repo, point `profiles.yml` at Spark, declare the source
10
+ dialect in `dbt_project.yml`, and `dbt build`.
11
+
12
+ > Why this exists: Spark has no `QUALIFY` clause (`[PARSE_SYNTAX_ERROR] … near 'QUALIFY'`),
13
+ > plus dozens of smaller dialect gaps (`IFF`, `NVL`, `::` casts, `DATEADD`, null ordering, …).
14
+ > A portable/Snowflake-style model fails on Spark until its SQL is translated. This package
15
+ > does that translation transparently, in-place, at compile time.
16
+
17
+ ---
18
+
19
+ ## Install
20
+
21
+ It is a **normal Python package** — install it into the same virtualenv your `dbt` runs in.
22
+ Installation auto-activates the patch (via a `.pth` file that imports the module on
23
+ interpreter start-up; see [Installation: why pip, not `dbt deps`](#installation-why-pip-not-dbt-deps)).
24
+
25
+ ```bash
26
+ pip install dbt-polyglot
27
+ ```
28
+
29
+ From a git checkout (bleeding edge):
30
+
31
+ ```bash
32
+ pip install "git+https://github.com/Saketkr21/dbt-polyglot.git"
33
+ ```
34
+
35
+ Local / editable (developing the package):
36
+
37
+ ```bash
38
+ pip install -e path/to/dbt-polyglot
39
+ ```
40
+
41
+ You also need a Spark adapter for dbt (this package does not pull one in, so you can choose
42
+ your connection method):
43
+
44
+ ```bash
45
+ pip install "dbt-spark[PyHive]" # Thrift/HiveServer2, used in the examples below
46
+ ```
47
+
48
+ ---
49
+
50
+ ## Configure (the only changes you make)
51
+
52
+ ### 1. `profiles.yml` — point the output at Spark
53
+
54
+ ```yaml
55
+ your_profile:
56
+ target: dev
57
+ outputs:
58
+ dev:
59
+ type: spark
60
+ method: thrift
61
+ host: "{{ env_var('DBT_SPARK_HOST', 'localhost') }}"
62
+ port: "{{ env_var('DBT_SPARK_PORT', 10000) | int }}"
63
+ schema: analytics
64
+ ```
65
+
66
+ ### 2. `dbt_project.yml` — declare your models' source dialect
67
+
68
+ ```yaml
69
+ models:
70
+ your_project:
71
+ +transpile_from: snowflake # the dialect your models are written in
72
+ # +transpile_to: spark # optional, default 'spark'
73
+ ```
74
+
75
+ `transpile_from` accepts **any** dialect `sqlglot` understands — `snowflake`, `bigquery`,
76
+ `redshift`, `tsql`, `postgres`, `duckdb`, `presto`, `trino`, … `transpile_to` defaults to
77
+ `spark` and rarely needs changing.
78
+
79
+ You can scope it to a subtree (`models.your_project.staging.+transpile_from: …`) or override
80
+ it per model — a per-model `config` beats the project default:
81
+
82
+ ```sql
83
+ -- models/marts/latest_order.sql (written in Snowflake SQL, runs on Spark)
84
+ {{ config(materialized='table', transpile_from='snowflake') }}
85
+
86
+ select *
87
+ from {{ ref('orders') }}
88
+ qualify row_number() over (partition by customer_id order by ordered_at desc) = 1
89
+ ```
90
+
91
+ That's it. `dbt build` now runs your existing models on Spark, no model edits.
92
+
93
+ ---
94
+
95
+ ## How it works
96
+
97
+ At dbt **compile**, the package wraps `dbt.compilation.Compiler._compile_code` and runs an
98
+ extra step on each opted-in model's compiled SQL body:
99
+
100
+ ```
101
+ parse(read=transpile_from) → apply SPARK_FIXUPS → generate(transpile_to, pretty=True)
102
+ ```
103
+
104
+ Because the rewrite happens on the model body **before** dbt wraps it in the materialization
105
+ DDL (`create table … as …`), both `target/compiled/` and the SQL sent to Spark are pure
106
+ Spark — there is no mixed-dialect string and no separate output directory.
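The wrapping step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's actual code: `StandInCompiler`, `toy_translate`, and the dict-shaped node are hypothetical stand-ins for dbt's `Compiler`, the real `sqlglot` pipeline, and dbt's node objects. Only the names `_compile_code`, `patch_compiler`, `transpile_from`, and `transpile_to` come from this README.

```python
import functools

TARGET_DIALECT = "spark"  # assumed default, mirroring `transpile_to`


class StandInCompiler:
    """Hypothetical stand-in for dbt's Compiler; not the real class."""

    def _compile_code(self, node, *args, **kwargs):
        # Pretend Jinja rendering already happened and produced this SQL.
        node["compiled_code"] = node["raw_code"]
        return node


def toy_translate(sql, source, target):
    """Placeholder for the real parse -> fix-ups -> generate step."""
    return sql.replace("IFF(", "IF(")


def patch_compiler(cls, translate=toy_translate):
    """Wrap cls._compile_code so opted-in nodes get their SQL body translated."""
    original = cls._compile_code

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        # Forward every argument untouched so signature drift is tolerated.
        result = original(self, *args, **kwargs)
        config = result.get("config", {})
        source = config.get("transpile_from")
        target = config.get("transpile_to", TARGET_DIALECT)
        # No-op: the model did not opt in, or is already in the target dialect.
        if not source or source == target:
            return result
        result["compiled_code"] = translate(result["compiled_code"], source, target)
        return result

    cls._compile_code = wrapper
    return cls


patch_compiler(StandInCompiler)
```

Because the wrapper only post-processes the return value, a node without `transpile_from` flows through exactly as it did before the patch.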
107
+
108
+ ### The fix-up layer (what makes it trustable)
109
+
110
+ `sqlglot`'s output is occasionally valid in *its* model of Spark but rejected by Spark's
111
+ **real** parser. The classic case: `x NOT IN (subquery)`, which `sqlglot`'s Snowflake reader
112
+ canonicalizes to the **unsupported** `x <> ALL (subquery)`. The `SPARK_FIXUPS` registry is a
113
+ list of small AST transforms applied to the parsed tree before Spark SQL is generated; the
114
+ first one rewrites quantified-subquery comparisons (`x <> ALL (subq)` / `x = ANY (subq)`) back to
115
+ `NOT x IN (subq)` / `x IN (subq)`. The registry is extensible: one EXPLAIN-verified transform per
116
+ gap discovered.
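To make the registry idea concrete, here is a toy sketch of an ordered list of tree rewrites. It deliberately does not use `sqlglot`'s AST: the nested-tuple nodes, the `FIXUPS` list, and the `fixup` decorator are all hypothetical, chosen only to show the shape of "one small transform per gap, applied to every node".

```python
FIXUPS = []  # ordered registry, analogous in spirit to SPARK_FIXUPS


def fixup(fn):
    """Register a node-level rewrite; returns fn so it stays callable."""
    FIXUPS.append(fn)
    return fn


@fixup
def quantified_to_in(node):
    # ("cmp_all", "<>", x, subq) -> ("not", ("in", x, subq))
    # ("cmp_any", "=", x, subq)  -> ("in", x, subq)
    if node[0] == "cmp_all" and node[1] == "<>":
        return ("not", ("in", node[2], node[3]))
    if node[0] == "cmp_any" and node[1] == "=":
        return ("in", node[2], node[3])
    return node


def apply_fixups(node):
    """Rewrite children first, then run every registered fix-up on the node."""
    if not isinstance(node, tuple) or not node:
        return node  # leaf: identifier, literal, or operator string
    node = tuple(apply_fixups(child) for child in node)
    for fn in FIXUPS:
        node = fn(node)
    return node
```

Adding a new repair then means writing one more decorated function; the traversal stays unchanged.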
117
+
118
+ ### Trust model — verified, or fails loud (never silently wrong)
119
+
120
+ A model is either converted to **valid Spark SQL** or it **fails loudly** with a clear
121
+ dbt/Spark error naming the model. It never silently emits a wrong result from an
122
+ un-converted construct:
123
+
124
+ - **Fail-soft + loud.** If `sqlglot` can't parse the SQL as the source dialect, or produces
125
+ empty/multi-statement output, the patch logs a `WARNING` (visible in the dbt run) and
126
+ passes the **original SQL through unchanged**. Spark then either runs it (it was already
127
+ valid) or rejects it loudly — so the failure surfaces, it is never hidden.
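The fail-soft rule in the bullet above can be sketched as a small guard function. This is an illustration, not the package's implementation: `fail_soft`, the logger name, and the `translate` callable are hypothetical. The callable is assumed to return a list of statement strings, the same shape `sqlglot.transpile` returns.

```python
import logging

logger = logging.getLogger("polyglot_sketch")  # hypothetical logger name


def fail_soft(sql, translate, model_name="<unknown>"):
    """Return the translated SQL, or the original SQL if translation is unusable."""
    try:
        statements = list(translate(sql))
    except Exception as exc:
        # Any translation error: warn and pass the original SQL through.
        logger.warning("transpile failed for %s: %s", model_name, exc)
        return sql
    if len(statements) != 1 or not statements[0].strip():
        # Empty or multi-statement output is treated as unusable.
        logger.warning(
            "transpile produced %d statement(s) for %s; passing SQL through",
            len(statements),
            model_name,
        )
        return sql
    return statements[0]
```

Returning the untouched input keeps the compile alive and leaves the final verdict to Spark, which either runs the SQL or rejects it with its own error.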
128
+
129
+ To certify a whole repo **upfront** — before a heavy run — use dbt's own native validation.
130
+ No extra tooling: dbt already runs SQL through your `profiles.yml` adapter, against whatever
131
+ warehouse you target.
132
+
133
+ ```bash
134
+ dbt build --empty # build every model with 0 input rows (DAG-ordered)
135
+ dbt build --empty --select marts.* # any dbt selector works
136
+ dbt show --limit 0 -s my_model # read-only: validate the SELECT without materializing
137
+ ```
138
+
139
+ `--empty` limits every `ref`/`source` to zero rows, so dbt executes each model's real SQL
140
+ against the warehouse — moving no data — and **fails loudly, naming the model**, if the
141
+ transpiled SQL is invalid. Because it builds in dependency order, there is no "upstream not
142
+ built" ambiguity. That makes `dbt build --empty` a drop-in CI gate (it exits non-zero if
143
+ any model is invalid). `dbt show --limit 0` is the non-destructive variant when the target
144
+ role can't create objects.
145
+
146
+ ### Scope
147
+
148
+ Every opted-in model is transpiled — the full `sqlglot` breadth (`IFF`→`IF`, `NVL`→`COALESCE`,
149
+ `::`→`CAST`, `DATEADD`→`DATE_ADD`, `QUALIFY`→windowed subquery, …). To transpile only part of a
150
+ project, scope `+transpile_from` to a folder/model subtree (or set it per model) — the dbt-native
151
+ way — rather than a global on/off.
152
+
153
+ ### No-op guarantee
154
+
155
+ If `transpile_from` is unset, or equals `transpile_to` (you're already writing Spark SQL),
156
+ the model is **never touched** — `sqlglot` is not even called and nothing is reformatted.
157
+
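The no-op rule reduces to one predicate. The sketch below mirrors the `if not src or src == dst` check in `transpile.py`; the function name `should_transpile` is an illustrative assumption, not part of the package.

```python
DEFAULT_TARGET = "spark"  # same default the package uses for transpile_to


def should_transpile(transpile_from, transpile_to=None):
    """True only when a source dialect is set and differs from the target."""
    dst = transpile_to or DEFAULT_TARGET
    return bool(transpile_from) and transpile_from != dst


unset = should_transpile(None)                      # no opt-in: model untouched
same = should_transpile("spark")                    # already Spark: model untouched
cross = should_transpile("snowflake")               # real cross-dialect translation
explicit_same = should_transpile("duckdb", "duckdb")
```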
158
+ ### A note on `NULLS LAST` in the output (intentional)
159
+
160
+ Snowflake and Spark have **opposite** default null ordering (Snowflake sorts NULLs largest →
161
+ last; Spark sorts them smallest → first). When translating a Snowflake `ORDER BY x`,
162
+ `sqlglot` appends an explicit `… NULLS LAST` to **preserve Snowflake semantics** — without
163
+ it, a `QUALIFY ROW_NUMBER() … = 1` top-N pick could choose a different row. It is added only
164
+ on a true cross-dialect translation, and is semantically required — do not strip it.
165
+
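To see why the explicit null ordering matters, here is a toy model of an ascending `ORDER BY` in plain Python. It does not call Spark or Snowflake; `order_by` and the sample rows are made up for illustration.

```python
rows = [("a", 3), ("b", None), ("c", 1)]  # (id, sort_value); None plays the role of NULL


def order_by(rows, nulls_last):
    """Sort ascending by the second field, placing NULLs first or last."""
    def key(row):
        value = row[1]
        is_null = value is None
        # The first tuple element decides which group (NULL / non-NULL) sorts first.
        group = is_null if nulls_last else not is_null
        return (group, 0 if is_null else value)
    return sorted(rows, key=key)


# "Top 1" row under each convention, as a ROW_NUMBER() ... = 1 filter would pick it.
nulls_last_pick = order_by(rows, nulls_last=True)[0]
nulls_first_pick = order_by(rows, nulls_last=False)[0]
```

With NULLs last the pick is `("c", 1)`; with NULLs first it is `("b", None)`. The same query returns a different row unless the ordering is pinned.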
166
+ ---
167
+
168
+ ## Installation: why `pip`, not `dbt deps`
169
+
170
+ **`dbt deps` cannot install this — you must `pip install` it.** They do different things:
171
+
172
+ - **`dbt deps`** installs **dbt packages**: bundles of dbt *macros, models, seeds, and
173
+ tests* (the things listed in `packages.yml` / `dependencies.yml`). It pulls SQL/Jinja
174
+ assets into `dbt_packages/` and **never installs or runs Python code**.
175
+ - **`dbt-polyglot`** is a **Python package**. It works by monkeypatching a dbt-core
176
+ function at runtime, and it activates through a `.pth` file that Python executes on
177
+ interpreter start-up. Both of those are Python-installer concerns — only `pip` (or `uv`,
178
+ `poetry`, etc.) places a `.pth` into `site-packages` and registers the dependency.
179
+
180
+ So it is installed exactly like `dbt-core` or an adapter, into the same environment as your
181
+ dbt. It does not appear in `packages.yml`.
182
+
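The `.pth` mechanism can be observed with the standard library alone. The sketch below builds a throwaway directory and asks `site.addsitedir` to process it the way Python processes `site-packages` at start-up; the module and file names (`pth_demo_mod`, `pth_demo.pth`) are invented for the demo.

```python
import site
import sys
import tempfile
from pathlib import Path

demo_dir = Path(tempfile.mkdtemp())

# A tiny module, plus a .pth file whose single line imports it.
(demo_dir / "pth_demo_mod.py").write_text("ACTIVATED = True\n")
(demo_dir / "pth_demo.pth").write_text("import pth_demo_mod\n")

# addsitedir appends the directory to sys.path and executes "import ..." lines
# found in its .pth files.
site.addsitedir(str(demo_dir))

activated = "pth_demo_mod" in sys.modules
```

`dbt deps` never goes through this code path, which is why only a Python installer can activate the package.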
183
+ ---
184
+
185
+ ## Package contents
186
+
187
+ A standard src-layout package — `src/dbt_polyglot/` holds the import package, plus a `.pth`
188
+ that activates it on start-up:
189
+
190
+ | File | Role |
191
+ |------|------|
192
+ | `src/dbt_polyglot/__init__.py` | Import-time activation: patches the dbt Compiler. |
193
+ | `src/dbt_polyglot/transpile.py` | The compile-phase patch (`patch_compiler`) + core `spark_safe_transpile`. |
194
+ | `src/dbt_polyglot/fixups.py` | The `SPARK_FIXUPS` registry of AST transforms. |
195
+ | `dbt_polyglot.pth` | One line (`import dbt_polyglot`); auto-activates on start-up. Installed into `site-packages` by the `build_py` shim in `setup.py`. |
196
+ | `pyproject.toml` / `setup.py` | PEP 517 metadata; `setup.py` exists only to place the `.pth` into purelib. |
197
+ | `LICENSE` | Apache-2.0. |
198
+
199
+ This package is intentionally limited to **transpilation**. Validating the result is left to
200
+ dbt's native `dbt build --empty` (see [Trust model](#trust-model--verified-or-fails-loud-never-silently-wrong)
201
+ above); catalog routing (mapping `file_format` → a Spark catalog) and seed re-runnability are
202
+ **separate concerns** and are not bundled here.
203
+
204
+ ---
205
+
206
+ ## Compatibility & caveats
207
+
208
+ - **dbt-core private method.** The patch wraps `dbt.compilation.Compiler._compile_code`, a
209
+ **private** dbt-core method. It forwards `*args/**kwargs` to tolerate signature drift and
210
+ is fully import-guarded (if dbt-core or `sqlglot` aren't importable, or the seam moves, the
211
+ patch does nothing rather than breaking the interpreter). Still, **pin a supported dbt-core
212
+ range** when depending on this in production, and re-verify after major dbt upgrades.
213
+ - **`sqlglot` coverage.** `sqlglot` maps a large surface but not everything. Exotic dialect
214
+ features — Snowflake `LATERAL FLATTEN`, `VARIANT`/`OBJECT`/`ARRAY` semantics, `:` path
215
+ access, `LISTAGG`, and similar — may not translate cleanly. Those surface via the fail-soft
216
+ WARNING and `dbt build --empty`, by design, rather than silently.
217
+ - **Self-contained.** The module imports nothing from any host project, so it can be lifted
218
+ into its own repo unchanged.
219
+
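The signature-drift point can be illustrated with a stand-in class. This is a simplified sketch, not the package's code: the `Compiler` below is a dummy, nodes are plain dicts, and `patch` forwards every argument untouched, whereas the real wrapper names its first three parameters.

```python
class Compiler:
    """Dummy stand-in for a class with a private method we want to wrap."""

    def _compile_code(self, node, manifest, extra_context=None):
        node["compiled_code"] = node["raw_code"]
        return node


def patch(cls, post_process):
    """Wrap cls._compile_code, forwarding whatever arguments the caller passes."""
    orig = cls._compile_code

    def _patched(self, *args, **kwargs):
        node = orig(self, *args, **kwargs)  # extra or renamed parameters pass straight through
        node["compiled_code"] = post_process(node["compiled_code"])
        return node

    cls._compile_code = _patched


patch(Compiler, str.upper)
result = Compiler()._compile_code({"raw_code": "select 1"}, manifest=None)
```

Because the wrapper never inspects the arguments it forwards, a new optional parameter upstream does not break it; a renamed or removed method still would, hence the advice to pin a version range.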
220
+ ## License
221
+
222
+ Apache-2.0 — see [LICENSE](LICENSE).
@@ -0,0 +1 @@
1
+ import dbt_polyglot
@@ -0,0 +1,40 @@
1
+ [build-system]
2
+ requires = ["setuptools>=77"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "dbt-polyglot"
7
+ version = "0.1.0"
8
+ description = "Run any-dialect dbt models on Spark unchanged — transpiles each model's SQL to Spark at dbt compile time via sqlglot."
9
+ readme = "README.md"
10
+ requires-python = ">=3.9"
11
+ license = "Apache-2.0"
12
+ license-files = ["LICENSE"]
13
+ authors = [{ name = "Saket Kumar", email = "kumar.saket0021@gmail.com" }]
14
+ keywords = ["dbt", "spark", "sqlglot", "snowflake", "sql", "transpile", "dialect", "polyglot"]
15
+ classifiers = [
16
+ "Development Status :: 4 - Beta",
17
+ "Intended Audience :: Developers",
18
+ "Programming Language :: Python :: 3",
19
+ "Programming Language :: SQL",
20
+ "Topic :: Database",
21
+ "Topic :: Software Development :: Code Generators",
22
+ ]
23
+ dependencies = [
24
+ "sqlglot>=20.0",
25
+ "dbt-core>=1.5",
26
+ ]
27
+
28
+ [project.optional-dependencies]
29
+ test = ["pytest"]
30
+
31
+ [project.urls]
32
+ Homepage = "https://github.com/Saketkr21/dbt-polyglot"
33
+ Repository = "https://github.com/Saketkr21/dbt-polyglot"
34
+ Issues = "https://github.com/Saketkr21/dbt-polyglot/issues"
35
+
36
+ [tool.setuptools.packages.find]
37
+ where = ["src"]
38
+
39
+ [tool.pytest.ini_options]
40
+ testpaths = ["tests"]
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,21 @@
1
+ """Shim setup.py — metadata lives in pyproject.toml.
2
+
3
+ Places ``dbt_polyglot.pth`` into the wheel's purelib (site-packages) so the patch
4
+ auto-activates on interpreter start-up.
5
+ """
6
+ import os
7
+ import shutil
8
+
9
+ from setuptools import setup
10
+ from setuptools.command.build_py import build_py
11
+
12
+ PTH = "dbt_polyglot.pth"
13
+
14
+
15
+ class build_py_with_pth(build_py):
16
+     def run(self):
17
+         super().run()
18
+         shutil.copyfile(PTH, os.path.join(self.build_lib, PTH))
19
+
20
+
21
+ setup(cmdclass={"build_py": build_py_with_pth})
@@ -0,0 +1,27 @@
1
+ """dbt-polyglot — run any-dialect dbt models on Spark unchanged.
2
+
3
+ Transpiles each opted-in model's SQL to Spark via sqlglot at dbt's compile phase.
4
+ Install:
5
+
6
+     pip install dbt-polyglot
7
+
8
+ Config (dbt_project.yml):
9
+
10
+     models:
11
+       your_project:
12
+         +transpile_from: snowflake
13
+
14
+ To validate the transpiled SQL against your warehouse before a heavy run, use dbt's
15
+ own native flag — no extra tooling needed:
16
+
17
+     dbt build --empty             # build every model with zero input rows
18
+     dbt show --limit 0 -s model   # read-only: validate without materializing
19
+ """
20
+ __version__ = "0.1.0"
21
+
22
+ # Activate the compile-time transpile patch. Import-guarded so non-dbt Python is unaffected.
23
+ try:
24
+     from dbt_polyglot.transpile import patch_compiler
25
+     patch_compiler()
26
+ except Exception:
27
+     pass
@@ -0,0 +1,31 @@
1
+ """Spark-output fix-up registry.
2
+
3
+ Each entry is an (exp.Expression -> exp.Expression) transform applied (via .transform,
4
+ bottom-up) to the parsed tree BEFORE generating Spark SQL. They repair cases where
5
+ sqlglot's output is rejected by Spark's real parser.
6
+
7
+ Extensible: append a transform function per gap found, EXPLAIN-verify on Spark.
8
+ """
9
+ from sqlglot import exp
10
+
11
+
12
+ def _as_subquery(node):
13
+     return node if isinstance(node, exp.Subquery) else exp.Subquery(this=node)
14
+
15
+
16
+ def fixup_quantified_subquery(node):
17
+     """Spark has no quantified subquery comparison.
18
+
19
+     sqlglot's Snowflake parser canonicalizes:
20
+         x NOT IN (subq) -> x <> ALL (subq)
21
+         x IN (subq) -> x = ANY (subq)
22
+     Spark rejects both. Rewrite back to NOT x IN (subq) / x IN (subq).
23
+     """
24
+     if isinstance(node, exp.NEQ) and isinstance(node.expression, exp.All):
25
+         return exp.Not(this=exp.In(this=node.this, query=_as_subquery(node.expression.this)))
26
+     if isinstance(node, exp.EQ) and isinstance(node.expression, exp.Any):
27
+         return exp.In(this=node.this, query=_as_subquery(node.expression.this))
28
+     return node
29
+
30
+
31
+ SPARK_FIXUPS = [fixup_quantified_subquery]
@@ -0,0 +1,56 @@
1
+ """Core transpile logic — parse source dialect, apply fix-ups, generate Spark SQL.
2
+
3
+ Called at dbt compile time via the Compiler._compile_code monkeypatch.
4
+ """
5
+ import sqlglot
6
+ from dbt_polyglot.fixups import SPARK_FIXUPS
7
+
8
+ _DEFAULT_TARGET = "spark"
9
+
10
+
11
+ def spark_safe_transpile(code, src, dst=None):
12
+     """Parse as `src`, apply fix-up registry (when targeting spark), generate `dst` SQL.
13
+
14
+     Raises on multi-statement / empty so the caller's fail-soft kicks in.
15
+     """
16
+     dst = dst or _DEFAULT_TARGET
17
+     statements = sqlglot.parse(code, read=src)
18
+     if len(statements) != 1 or statements[0] is None:
19
+         raise ValueError(f"expected exactly one statement, got {len(statements)}")
20
+     tree = statements[0]
21
+     if dst == _DEFAULT_TARGET:
22
+         for fixup in SPARK_FIXUPS:
23
+             tree = tree.transform(fixup)
24
+     out = tree.sql(dialect=dst, pretty=True)
25
+     if not (out or "").strip():
26
+         raise ValueError("transpile produced empty SQL")
27
+     return out
28
+
29
+
30
+ def patch_compiler():
31
+     """Monkeypatch dbt's Compiler._compile_code to transpile opted-in models."""
32
+     from dbt.compilation import Compiler
33
+     from dbt.adapters.events.logging import AdapterLogger
34
+
35
+     logger = AdapterLogger("dbt-polyglot")
36
+     orig = Compiler._compile_code
37
+
38
+     def _patched(self, node, manifest, extra_context=None, *args, **kwargs):
39
+         node = orig(self, node, manifest, extra_context, *args, **kwargs)
40
+         src = dst = None
41
+         try:
42
+             src = node.config.get("transpile_from")
43
+             dst = node.config.get("transpile_to") or _DEFAULT_TARGET
44
+             if not src or src == dst:
45
+                 return node
46
+             node.compiled_code = spark_safe_transpile(node.compiled_code or "", src, dst)
47
+         except Exception as e:
48
+             uid = getattr(node, "unique_id", "<unknown>")
49
+             logger.warning(
50
+                 f"[dbt-polyglot] could not transpile {uid} from '{src}' -> "
51
+                 f"'{dst or _DEFAULT_TARGET}' ({type(e).__name__}: {e}); "
52
+                 f"passing model SQL through UNCHANGED."
53
+             )
54
+         return node
55
+
56
+     Compiler._compile_code = _patched
@@ -0,0 +1,247 @@
1
+ Metadata-Version: 2.4
2
+ Name: dbt-polyglot
3
+ Version: 0.1.0
4
+ Summary: Run any-dialect dbt models on Spark unchanged — transpiles each model's SQL to Spark at dbt compile time via sqlglot.
5
+ Author-email: Saket Kumar <kumar.saket0021@gmail.com>
6
+ License-Expression: Apache-2.0
7
+ Project-URL: Homepage, https://github.com/Saketkr21/dbt-polyglot
8
+ Project-URL: Repository, https://github.com/Saketkr21/dbt-polyglot
9
+ Project-URL: Issues, https://github.com/Saketkr21/dbt-polyglot/issues
10
+ Keywords: dbt,spark,sqlglot,snowflake,sql,transpile,dialect,polyglot
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Programming Language :: SQL
15
+ Classifier: Topic :: Database
16
+ Classifier: Topic :: Software Development :: Code Generators
17
+ Requires-Python: >=3.9
18
+ Description-Content-Type: text/markdown
19
+ License-File: LICENSE
20
+ Requires-Dist: sqlglot>=20.0
21
+ Requires-Dist: dbt-core>=1.5
22
+ Provides-Extra: test
23
+ Requires-Dist: pytest; extra == "test"
24
+ Dynamic: license-file
25
+
26
+ # dbt-polyglot
27
+
28
+ **Run a dbt project written in another SQL dialect (Snowflake, BigQuery, Redshift, …) on
29
+ Spark — unchanged.** Each model's SQL is transpiled to Spark with
30
+ [`sqlglot`](https://github.com/tobymao/sqlglot) at dbt's **compile phase**, so the SQL
31
+ dbt actually executes (and what lands in `target/compiled/`) is already Spark.
32
+
33
+ The only changes are configuration — your model `.sql` files are never edited. Drop the
34
+ package into any existing dbt repo, point `profiles.yml` at Spark, declare the source
35
+ dialect in `dbt_project.yml`, and `dbt build`.
36
+
37
+ > Why this exists: Spark has no `QUALIFY` clause (`[PARSE_SYNTAX_ERROR] … near 'QUALIFY'`),
38
+ > plus dozens of smaller dialect gaps (`IFF`, `NVL`, `::` casts, `DATEADD`, null ordering, …).
39
+ > A portable/Snowflake-style model fails on Spark until its SQL is translated. This package
40
+ > does that translation transparently, in-place, at compile time.
41
+
42
+ ---
43
+
44
+ ## Install
45
+
46
+ It is a **normal Python package** — install it into the same virtualenv your `dbt` runs in.
47
+ Installation auto-activates the patch (via a `.pth` file that imports the module on
48
+ interpreter start-up; see [Installation: why pip, not `dbt deps`](#installation-why-pip-not-dbt-deps)).
49
+
50
+ ```bash
51
+ pip install dbt-polyglot
52
+ ```
53
+
54
+ From a git checkout (bleeding edge):
55
+
56
+ ```bash
57
+ pip install "git+https://github.com/Saketkr21/dbt-polyglot.git"
58
+ ```
59
+
60
+ Local / editable (developing the package):
61
+
62
+ ```bash
63
+ pip install -e path/to/dbt-polyglot
64
+ ```
65
+
66
+ You also need a Spark adapter for dbt (this package does not pull one in, so you can choose
67
+ your connection method):
68
+
69
+ ```bash
70
+ pip install "dbt-spark[PyHive]" # Thrift/HiveServer2, used in the examples below
71
+ ```
72
+
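Since everything must live in the same environment as dbt, a quick sanity check is to ask the interpreter which top-level modules it can find. This is a hedged sketch using only `importlib`; `missing_modules` is a hypothetical helper, and the module names are the import names of the packages mentioned above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# In a real check, run this with the same Python that runs dbt:
#   missing_modules(["dbt", "sqlglot", "dbt_polyglot"])
demo = missing_modules(["json", "surely_not_a_real_module_xyz"])
```

An empty list means all three are importable from that interpreter; any name returned points at a package installed into a different environment (or not at all).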
73
+ ---
74
+
75
+ ## Configure (the only changes you make)
76
+
77
+ ### 1. `profiles.yml` — point the output at Spark
78
+
79
+ ```yaml
80
+ your_profile:
81
+   target: dev
82
+   outputs:
83
+     dev:
84
+       type: spark
85
+       method: thrift
86
+       host: "{{ env_var('DBT_SPARK_HOST', 'localhost') }}"
87
+       port: "{{ env_var('DBT_SPARK_PORT', 10000) | int }}"
88
+       schema: analytics
89
+ ```
90
+
91
+ ### 2. `dbt_project.yml` — declare your models' source dialect
92
+
93
+ ```yaml
94
+ models:
95
+   your_project:
96
+     +transpile_from: snowflake   # the dialect your models are written in
97
+     # +transpile_to: spark       # optional, default 'spark'
98
+ ```
99
+
100
+ `transpile_from` accepts **any** dialect `sqlglot` understands — `snowflake`, `bigquery`,
101
+ `redshift`, `tsql`, `postgres`, `duckdb`, `presto`, `trino`, … `transpile_to` defaults to
102
+ `spark` and rarely needs changing.
103
+
104
+ You can scope it to a subtree (`models.your_project.staging.+transpile_from: …`) or override
105
+ it per model — a per-model `config` beats the project default:
106
+
107
+ ```sql
108
+ -- models/marts/latest_order.sql (written in Snowflake SQL, runs on Spark)
109
+ {{ config(materialized='table', transpile_from='snowflake') }}
110
+
111
+ select *
112
+ from {{ ref('orders') }}
113
+ qualify row_number() over (partition by customer_id order by ordered_at desc) = 1
114
+ ```
115
+
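The precedence rule (model config over subtree over project default) can be pictured with a toy resolver. This is a deliberately simplified model, not dbt's actual config resolution; the function, its arguments, and the folder-path matching are all illustrative assumptions.

```python
def resolve_transpile_from(project_default, subtree_overrides, model_path, model_config):
    """Most specific wins: model config > deepest matching subtree > project default."""
    if "transpile_from" in model_config:
        return model_config["transpile_from"]
    model_parts = tuple(model_path.split("/"))
    best, best_depth = project_default, -1
    for subtree, dialect in subtree_overrides.items():
        parts = tuple(subtree.split("/"))
        # A subtree applies when it is a prefix of the model's folder path.
        if model_parts[:len(parts)] == parts and len(parts) > best_depth:
            best, best_depth = dialect, len(parts)
    return best


overrides = {"staging": "bigquery"}
from_subtree = resolve_transpile_from("snowflake", overrides, "staging/stg_orders", {})
from_project = resolve_transpile_from("snowflake", overrides, "marts/latest_order", {})
from_model = resolve_transpile_from(
    "snowflake", overrides, "staging/stg_orders", {"transpile_from": "redshift"}
)
```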
116
+ That's it. `dbt build` now runs your existing models on Spark, no model edits.
117
+
118
+ ---
119
+
120
+ ## How it works
121
+
122
+ At dbt **compile**, the package wraps `dbt.compilation.Compiler._compile_code` and runs an
123
+ extra step on each opted-in model's compiled SQL body:
124
+
125
+ ```
126
+ parse(read=transpile_from) → apply SPARK_FIXUPS → generate(transpile_to, pretty=True)
127
+ ```
128
+
129
+ Because the rewrite happens on the model body **before** dbt wraps it in the materialization
130
+ DDL (`create table … as …`), both `target/compiled/` and the SQL sent to Spark are pure
131
+ Spark — there is no mixed-dialect string and no separate output directory.
132
+
133
+ ### The fix-up layer (what makes it trustable)
134
+
135
+ `sqlglot`'s output is occasionally valid in *its* model of Spark but rejected by Spark's
136
+ **real** parser. The classic case: `x NOT IN (subquery)`, which `sqlglot`'s Snowflake reader
137
+ canonicalizes to the **unsupported** `x <> ALL (subquery)`. The `SPARK_FIXUPS` registry is a
138
+ list of small AST transforms applied to the parsed tree before Spark SQL is generated; the
139
+ first one rewrites quantified-subquery comparisons (`<> ALL` / `= ANY (subq)`) back to
140
+ `NOT x IN` / `x IN (subq)`. The registry is extensible — one EXPLAIN-verified transform per
141
+ gap discovered.
142
+
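The registry idea can be sketched without `sqlglot` by using nested tuples as a toy syntax tree. Everything here is illustrative: the tuple encoding, `transform`, and `apply_fixups` are assumptions made for the example, and the traversal order is simply the one chosen for this toy.

```python
def rewrite_quantified(node):
    """('neq', x, ('all', q)) -> ('not', ('in', x, q)); ('eq', x, ('any', q)) -> ('in', x, q)."""
    if len(node) == 3 and isinstance(node[2], tuple):
        op, left, right = node
        if op == "neq" and right[0] == "all":
            return ("not", ("in", left, right[1]))
        if op == "eq" and right[0] == "any":
            return ("in", left, right[1])
    return node


FIXUPS = [rewrite_quantified]  # extend by appending one function per gap found


def transform(node, fn):
    """Rebuild the tree, applying fn to every tuple node (children first in this toy)."""
    if not isinstance(node, tuple):
        return node
    rebuilt = (node[0],) + tuple(transform(child, fn) for child in node[1:])
    return fn(rebuilt)


def apply_fixups(tree):
    for fixup in FIXUPS:
        tree = transform(tree, fixup)
    return tree


fixed = apply_fixups(("where", ("neq", "x", ("all", "subq"))))
```

Each fix-up is a pure node-to-node function that returns its input when it does not apply, which is what lets the list grow without the transforms interfering with each other.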
143
+ ### Trust model — verified, or fails loud (never silently wrong)
144
+
145
+ A model is either converted to **valid Spark SQL** or it **fails loudly** with a clear
146
+ dbt/Spark error naming the model. It never silently emits a wrong result from an
147
+ un-converted construct:
148
+
149
+ - **Fail-soft + loud.** If `sqlglot` can't parse the SQL as the source dialect, or produces
150
+ empty/multi-statement output, the patch logs a `WARNING` (visible in the dbt run) and
151
+ passes the **original SQL through unchanged**. Spark then either runs it (it was already
152
+ valid) or rejects it loudly — so the failure surfaces, it is never hidden.
153
+
154
+ To certify a whole repo **upfront** — before a heavy run — use dbt's own native validation.
155
+ No extra tooling: dbt already runs SQL through your `profiles.yml` adapter, against whatever
156
+ warehouse you target.
157
+
158
+ ```bash
159
+ dbt build --empty # build every model with 0 input rows (DAG-ordered)
160
+ dbt build --empty --select "marts.*"  # any dbt selector works (quoted so the shell does not expand the glob)
161
+ dbt show --limit 0 -s my_model # read-only: validate the SELECT without materializing
162
+ ```
163
+
164
+ `--empty` limits every `ref`/`source` to zero rows, so dbt executes each model's real SQL
165
+ against the warehouse — moving no data — and **fails loudly, naming the model**, if the
166
+ transpiled SQL is invalid. Because it builds in dependency order, there is no "upstream not
167
+ built" ambiguity. That makes `dbt build --empty` a drop-in CI gate: it exits non-zero if any
168
+ model is invalid (add `--fail-fast` to stop at the first failure). `dbt show --limit 0` is the
169
+ non-destructive variant when the target role can't create objects.
170
+
171
+ ### Scope
172
+
173
+ Every opted-in model is transpiled — the full `sqlglot` breadth (`IFF`→`IF`, `NVL`→`COALESCE`,
174
+ `::`→`CAST`, `DATEADD`→`DATE_ADD`, `QUALIFY`→windowed subquery, …). To transpile only part of a
175
+ project, scope `+transpile_from` to a folder/model subtree (or set it per model) — the dbt-native
176
+ way — rather than a global on/off.
177
+
178
+ ### No-op guarantee
179
+
180
+ If `transpile_from` is unset, or equals `transpile_to` (you're already writing Spark SQL),
181
+ the model is **never touched** — `sqlglot` is not even called and nothing is reformatted.
182
+
183
+ ### A note on `NULLS LAST` in the output (intentional)
184
+
185
+ Snowflake and Spark have **opposite** default null ordering (Snowflake sorts NULLs largest →
186
+ last; Spark sorts them smallest → first). When translating a Snowflake `ORDER BY x`,
187
+ `sqlglot` appends an explicit `… NULLS LAST` to **preserve Snowflake semantics** — without
188
+ it, a `QUALIFY ROW_NUMBER() … = 1` top-N pick could choose a different row. It is added only
189
+ on a true cross-dialect translation, and is semantically required — do not strip it.
190
+
191
+ ---
192
+
193
+ ## Installation: why `pip`, not `dbt deps`
194
+
195
+ **`dbt deps` cannot install this — you must `pip install` it.** They do different things:
196
+
197
+ - **`dbt deps`** installs **dbt packages**: bundles of dbt *macros, models, seeds, and
198
+ tests* (the things listed in `packages.yml` / `dependencies.yml`). It pulls SQL/Jinja
199
+ assets into `dbt_packages/` and **never installs or runs Python code**.
200
+ - **`dbt-polyglot`** is a **Python package**. It works by monkeypatching a dbt-core
201
+ function at runtime, and it activates through a `.pth` file that Python executes on
202
+ interpreter start-up. Both of those are Python-installer concerns — only `pip` (or `uv`,
203
+ `poetry`, etc.) places a `.pth` into `site-packages` and registers the dependency.
204
+
205
+ So it is installed exactly like `dbt-core` or an adapter, into the same environment as your
206
+ dbt. It does not appear in `packages.yml`.
207
+
208
+ ---
209
+
210
+ ## Package contents
211
+
212
+ A standard src-layout package — `src/dbt_polyglot/` holds the import package, plus a `.pth`
213
+ that activates it on start-up:
214
+
215
+ | File | Role |
216
+ |------|------|
217
+ | `src/dbt_polyglot/__init__.py` | Import-time activation: patches the dbt Compiler. |
218
+ | `src/dbt_polyglot/transpile.py` | The compile-phase patch (`patch_compiler`) + core `spark_safe_transpile`. |
219
+ | `src/dbt_polyglot/fixups.py` | The `SPARK_FIXUPS` registry of AST transforms. |
220
+ | `dbt_polyglot.pth` | One line (`import dbt_polyglot`); auto-activates on start-up. Installed into `site-packages` by the `build_py` shim in `setup.py`. |
221
+ | `pyproject.toml` / `setup.py` | PEP 517 metadata; `setup.py` exists only to place the `.pth` into purelib. |
222
+ | `LICENSE` | Apache-2.0. |
223
+
224
+ This package is intentionally limited to **transpilation**. Validating the result is left to
225
+ dbt's native `dbt build --empty` (see [Trust model](#trust-model--verified-or-fails-loud-never-silently-wrong)
226
+ above); catalog routing (mapping `file_format` → a Spark catalog) and seed re-runnability are
227
+ **separate concerns** and are not bundled here.
228
+
229
+ ---
230
+
231
+ ## Compatibility & caveats
232
+
233
+ - **dbt-core private method.** The patch wraps `dbt.compilation.Compiler._compile_code`, a
234
+ **private** dbt-core method. It forwards `*args/**kwargs` to tolerate signature drift and
235
+ is fully import-guarded (if dbt-core or `sqlglot` aren't importable, or the seam moves, the
236
+ patch does nothing rather than breaking the interpreter). Still, **pin a supported dbt-core
237
+ range** when depending on this in production, and re-verify after major dbt upgrades.
238
+ - **`sqlglot` coverage.** `sqlglot` maps a large surface but not everything. Exotic dialect
239
+ features — Snowflake `LATERAL FLATTEN`, `VARIANT`/`OBJECT`/`ARRAY` semantics, `:` path
240
+ access, `LISTAGG`, and similar — may not translate cleanly. Those surface via the fail-soft
241
+ WARNING and `dbt build --empty`, by design, rather than silently.
242
+ - **Self-contained.** The module imports nothing from any host project, so it can be lifted
243
+ into its own repo unchanged.
244
+
245
+ ## License
246
+
247
+ Apache-2.0 — see [LICENSE](LICENSE).
@@ -0,0 +1,17 @@
1
+ CHANGELOG.md
2
+ LICENSE
3
+ MANIFEST.in
4
+ README.md
5
+ dbt_polyglot.pth
6
+ pyproject.toml
7
+ setup.py
8
+ src/dbt_polyglot/__init__.py
9
+ src/dbt_polyglot/fixups.py
10
+ src/dbt_polyglot/transpile.py
11
+ src/dbt_polyglot.egg-info/PKG-INFO
12
+ src/dbt_polyglot.egg-info/SOURCES.txt
13
+ src/dbt_polyglot.egg-info/dependency_links.txt
14
+ src/dbt_polyglot.egg-info/requires.txt
15
+ src/dbt_polyglot.egg-info/top_level.txt
16
+ tests/__init__.py
17
+ tests/test_transpile.py
@@ -0,0 +1,5 @@
1
+ sqlglot>=20.0
2
+ dbt-core>=1.5
3
+
4
+ [test]
5
+ pytest
@@ -0,0 +1 @@
1
+ dbt_polyglot
File without changes
@@ -0,0 +1,42 @@
1
+ """Unit tests for the transpile + fix-up layer. No Spark required (pure sqlglot string checks).
2
+
3
+ Run: pip install -e ".[test]" && pytest
4
+ """
5
+ import pytest
6
+ from dbt_polyglot.transpile import spark_safe_transpile as transpile
7
+
8
+
9
+ def test_not_in_subquery_is_not_emitted_as_unsupported_all():
10
+     out = transpile("select 1 from x where a not in (select a from y)", "snowflake", "spark")
11
+     assert "ALL" not in out.upper()
12
+     assert "NOT" in out.upper() and "IN (" in out.replace("\n", " ").upper().replace("IN(", "IN (")
13
+
14
+
15
+ def test_eq_any_subquery_becomes_in():
16
+     out = transpile("select 1 from x where a = any (select a from y)", "snowflake", "spark")
17
+     assert "ANY" not in out.upper()
18
+     assert "IN" in out.upper()
19
+
20
+
21
+ def test_qualify_is_rewritten_to_subquery():
22
+     out = transpile("select a from x qualify row_number() over (order by a) = 1", "snowflake", "spark")
23
+     assert "QUALIFY" not in out.upper()
24
+
25
+
26
+ def test_common_snowflake_functions_translate():
27
+     out = transpile("select iff(a > 0, 1, 0) c, nvl(b, 'x') d, a::string e from x", "snowflake", "spark")
28
+     up = out.upper()
29
+     assert "IFF(" not in up
30
+     assert "::" not in out
31
+     assert "CAST(" in up
32
+
33
+
34
+ def test_plain_spark_passthrough_is_valid():
35
+     out = transpile("select a, b from x where a = 1", "snowflake", "spark")
36
+     assert "SELECT" in out.upper() and "FROM X" in out.upper()
37
+
38
+
39
+ @pytest.mark.parametrize("bad", ["", "/* only a comment */", "select 1; select 2"])
40
+ def test_empty_or_multistatement_raises_so_failsoft_engages(bad):
41
+     with pytest.raises(Exception):
42
+         transpile(bad, "snowflake", "spark")