aetherdialect 0.1.2__tar.gz → 0.1.4__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (109)
  1. aetherdialect-0.1.4/PKG-INFO +271 -0
  2. aetherdialect-0.1.4/README.md +234 -0
  3. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/pyproject.toml +92 -91
  4. aetherdialect-0.1.4/src/aetherdialect.egg-info/PKG-INFO +271 -0
  5. aetherdialect-0.1.4/src/aetherdialect.egg-info/SOURCES.txt +70 -0
  6. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/src/aetherdialect.egg-info/requires.txt +4 -3
  7. aetherdialect-0.1.4/src/text2sql/__init__.py +29 -0
  8. aetherdialect-0.1.4/src/text2sql/_config.py +2180 -0
  9. aetherdialect-0.1.4/src/text2sql/_contracts_base.py +2262 -0
  10. aetherdialect-0.1.2/src/text2sql/contracts_core.py → aetherdialect-0.1.4/src/text2sql/_contracts_core.py +1680 -425
  11. aetherdialect-0.1.4/src/text2sql/_core_utils.py +1927 -0
  12. aetherdialect-0.1.4/src/text2sql/_dialect.py +4297 -0
  13. aetherdialect-0.1.2/src/text2sql/expansion_ops.py → aetherdialect-0.1.4/src/text2sql/_expansion_ops.py +776 -366
  14. aetherdialect-0.1.4/src/text2sql/_intent_expr.py +2903 -0
  15. aetherdialect-0.1.4/src/text2sql/_intent_process.py +2958 -0
  16. aetherdialect-0.1.4/src/text2sql/_intent_repair.py +4028 -0
  17. aetherdialect-0.1.4/src/text2sql/_intent_resolve.py +2680 -0
  18. aetherdialect-0.1.2/src/text2sql/live_testing.py → aetherdialect-0.1.4/src/text2sql/_live_testing.py +379 -204
  19. aetherdialect-0.1.4/src/text2sql/_main_execution.py +2767 -0
  20. aetherdialect-0.1.4/src/text2sql/_pipeline.py +3127 -0
  21. aetherdialect-0.1.2/src/text2sql/qsim_sample.py → aetherdialect-0.1.4/src/text2sql/_qsim.py +679 -16
  22. aetherdialect-0.1.2/src/text2sql/qsim_ops.py → aetherdialect-0.1.4/src/text2sql/_qsim_ops.py +1309 -1299
  23. aetherdialect-0.1.4/src/text2sql/_schema.py +6287 -0
  24. aetherdialect-0.1.2/src/text2sql/schema_profiling.py → aetherdialect-0.1.4/src/text2sql/_schema_profiling.py +454 -203
  25. aetherdialect-0.1.4/src/text2sql/_seed_warmup.py +2309 -0
  26. aetherdialect-0.1.4/src/text2sql/_sql_gen.py +3618 -0
  27. aetherdialect-0.1.4/src/text2sql/_templates.py +1777 -0
  28. aetherdialect-0.1.2/src/text2sql/utils.py → aetherdialect-0.1.4/src/text2sql/_utils.py +309 -160
  29. aetherdialect-0.1.2/src/text2sql/validation_agg.py → aetherdialect-0.1.4/src/text2sql/_validation_agg.py +1092 -1083
  30. aetherdialect-0.1.2/src/text2sql/validation_execute.py → aetherdialect-0.1.4/src/text2sql/_validation_execute.py +788 -284
  31. aetherdialect-0.1.2/src/text2sql/validation_schema.py → aetherdialect-0.1.4/src/text2sql/_validation_schema.py +818 -234
  32. aetherdialect-0.1.2/src/text2sql/validation_semantic.py → aetherdialect-0.1.4/src/text2sql/_validation_semantic.py +790 -402
  33. aetherdialect-0.1.4/src/text2sql/text2sql.py +494 -0
  34. aetherdialect-0.1.4/tests/test_artifact_lock.py +72 -0
  35. aetherdialect-0.1.4/tests/test_bool_op_combinations.py +634 -0
  36. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_config.py +159 -45
  37. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_contracts.py +677 -271
  38. aetherdialect-0.1.4/tests/test_core_utils.py +1999 -0
  39. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_dialect.py +853 -638
  40. aetherdialect-0.1.4/tests/test_expansion_ops.py +1778 -0
  41. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_intent_expr.py +1850 -328
  42. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_intent_process.py +931 -90
  43. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_intent_repair.py +1728 -537
  44. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_intent_resolve.py +984 -110
  45. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_live_testing.py +257 -16
  46. aetherdialect-0.1.4/tests/test_main_execution.py +623 -0
  47. aetherdialect-0.1.4/tests/test_migration_diff_driven.py +387 -0
  48. aetherdialect-0.1.4/tests/test_notebook_export_signature.py +27 -0
  49. aetherdialect-0.1.4/tests/test_phase_c_repairs.py +152 -0
  50. aetherdialect-0.1.4/tests/test_pipeline.py +3295 -0
  51. aetherdialect-0.1.4/tests/test_qsim.py +1631 -0
  52. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_qsim_ops.py +12 -12
  53. aetherdialect-0.1.4/tests/test_schema.py +2880 -0
  54. aetherdialect-0.1.4/tests/test_schema_cache_probe.py +314 -0
  55. aetherdialect-0.1.4/tests/test_schema_diff_apply.py +367 -0
  56. aetherdialect-0.1.4/tests/test_schema_diff_renames.py +517 -0
  57. aetherdialect-0.1.4/tests/test_schema_inference_paths.py +168 -0
  58. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_schema_profiling.py +476 -44
  59. aetherdialect-0.1.4/tests/test_schema_scope_change.py +283 -0
  60. aetherdialect-0.1.4/tests/test_seed_warmup.py +1593 -0
  61. aetherdialect-0.1.4/tests/test_sql_gen.py +3194 -0
  62. aetherdialect-0.1.4/tests/test_templates.py +321 -0
  63. aetherdialect-0.1.4/tests/test_text2sql.py +189 -0
  64. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_utils.py +2215 -1649
  65. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_validation_agg.py +1165 -825
  66. aetherdialect-0.1.4/tests/test_validation_execute.py +1049 -0
  67. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_validation_schema.py +887 -42
  68. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/tests/test_validation_semantic.py +1087 -248
  69. aetherdialect-0.1.2/PKG-INFO +0 -199
  70. aetherdialect-0.1.2/README.md +0 -163
  71. aetherdialect-0.1.2/src/aetherdialect.egg-info/PKG-INFO +0 -199
  72. aetherdialect-0.1.2/src/aetherdialect.egg-info/SOURCES.txt +0 -66
  73. aetherdialect-0.1.2/src/text2sql/__init__.py +0 -10
  74. aetherdialect-0.1.2/src/text2sql/config.py +0 -1311
  75. aetherdialect-0.1.2/src/text2sql/contracts_base.py +0 -1165
  76. aetherdialect-0.1.2/src/text2sql/core_utils.py +0 -1247
  77. aetherdialect-0.1.2/src/text2sql/dialect.py +0 -2455
  78. aetherdialect-0.1.2/src/text2sql/intent_expr.py +0 -2026
  79. aetherdialect-0.1.2/src/text2sql/intent_process.py +0 -2320
  80. aetherdialect-0.1.2/src/text2sql/intent_repair.py +0 -2301
  81. aetherdialect-0.1.2/src/text2sql/intent_resolve.py +0 -1381
  82. aetherdialect-0.1.2/src/text2sql/main_execution.py +0 -1727
  83. aetherdialect-0.1.2/src/text2sql/pipeline.py +0 -2607
  84. aetherdialect-0.1.2/src/text2sql/qsim_struct.py +0 -558
  85. aetherdialect-0.1.2/src/text2sql/schema.py +0 -1273
  86. aetherdialect-0.1.2/src/text2sql/simulator.py +0 -1215
  87. aetherdialect-0.1.2/src/text2sql/sql_gen.py +0 -2533
  88. aetherdialect-0.1.2/src/text2sql/templates.py +0 -1264
  89. aetherdialect-0.1.2/src/text2sql/text2sql.py +0 -1006
  90. aetherdialect-0.1.2/tests/test_core_utils.py +0 -1076
  91. aetherdialect-0.1.2/tests/test_expansion_ops.py +0 -1035
  92. aetherdialect-0.1.2/tests/test_join_bool_cte_matrix.py +0 -383
  93. aetherdialect-0.1.2/tests/test_main_execution.py +0 -284
  94. aetherdialect-0.1.2/tests/test_pipeline_session.py +0 -160
  95. aetherdialect-0.1.2/tests/test_pipeline_targeted.py +0 -136
  96. aetherdialect-0.1.2/tests/test_pipeline_units.py +0 -1724
  97. aetherdialect-0.1.2/tests/test_qsim_sample.py +0 -799
  98. aetherdialect-0.1.2/tests/test_qsim_struct.py +0 -808
  99. aetherdialect-0.1.2/tests/test_schema.py +0 -569
  100. aetherdialect-0.1.2/tests/test_simulator.py +0 -554
  101. aetherdialect-0.1.2/tests/test_simulator_pipeline.py +0 -286
  102. aetherdialect-0.1.2/tests/test_sql_gen.py +0 -1801
  103. aetherdialect-0.1.2/tests/test_templates.py +0 -1058
  104. aetherdialect-0.1.2/tests/test_text2sql.py +0 -312
  105. aetherdialect-0.1.2/tests/test_validation_execute.py +0 -622
  106. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/LICENSE +0 -0
  107. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/setup.cfg +0 -0
  108. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/src/aetherdialect.egg-info/dependency_links.txt +0 -0
  109. {aetherdialect-0.1.2 → aetherdialect-0.1.4}/src/aetherdialect.egg-info/top_level.txt +0 -0
@@ -0,0 +1,271 @@
+ Metadata-Version: 2.4
+ Name: aetherdialect
+ Version: 0.1.4
+ Summary: Deterministic, validation-first Text-to-SQL system for business databases
+ Author-email: Akul Ameya <akul.ameya@gmail.com>
+ License: MIT
+ Project-URL: Homepage, https://github.com/akul-ameya/aetherdialect
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: pandas<3,>=2.0
+ Requires-Dist: packaging<25,>=23.0
+ Requires-Dist: jsonschema<5,>=4.0
+ Requires-Dist: openai<3,>=2.0.0
+ Requires-Dist: platformdirs<5,>=2.0.0
+ Requires-Dist: python-dotenv<2,>=1.0.0
+ Requires-Dist: sqlglot<30,>=29.0
+ Requires-Dist: SQLAlchemy<3,>=2.0
+ Provides-Extra: databricks
+ Requires-Dist: databricks-sql-connector<4,>=3.0; extra == "databricks"
+ Requires-Dist: databricks-sqlalchemy<3,>=2.0; extra == "databricks"
+ Provides-Extra: postgresql
+ Requires-Dist: pglast<8,>=5.0; extra == "postgresql"
+ Requires-Dist: psycopg2-binary<3,>=2.9; extra == "postgresql"
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0; extra == "dev"
+ Requires-Dist: pytest-cov>=5.0; extra == "dev"
+ Requires-Dist: vulture<3,>=2.11; extra == "dev"
+ Requires-Dist: ruff>=0.4; extra == "dev"
+ Requires-Dist: mypy>=1.10; extra == "dev"
+ Requires-Dist: twine>=5.0; extra == "dev"
+ Requires-Dist: build>=1.0; extra == "dev"
+ Requires-Dist: pre-commit>=3.0; extra == "dev"
+ Requires-Dist: black<25,>=24; extra == "dev"
+ Requires-Dist: docformatter<2,>=1.7; extra == "dev"
+ Dynamic: license-file
+
+ # Deterministic, validation-first Text-to-SQL for business databases
+
+ This library turns **analytical questions** into **read-only `SELECT`** pipelines on **PostgreSQL** or **Databricks**: structured intent, heavy validation (including dialect AST and `EXPLAIN`), optional **template reuse** from accepted answers, and **negative memory** from rejections. When you construct **`Text2SQL`**, it checks **database connectivity**, **LLM reachability**, and whether **on-disk artifacts** still match the live schema.
+
+ **Practical tips:** Questions resolve more reliably when you state intent explicitly—entities, grain, filters, time scope, and ordering—instead of leaving those details implied. The same goes for optional domain notes (`SchemaContext.notes_file`, see **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**): richer notes and clearer questions generally improve routing speed and SQL quality.
+
+ **Internals:** for how the schema graph is built, how columns become visible to the LLM, how joins are picked, and how stored artifacts migrate when the schema drifts, see **[OVERVIEW.md](https://github.com/akul-ameya/aetherdialect/blob/main/OVERVIEW.md)**.
+
+ ## Installation
+
+ ```bash
+ pip install aetherdialect
+ pip install "aetherdialect[postgresql]"
+ pip install "aetherdialect[databricks]"
+ pip install "aetherdialect[postgresql,databricks]"
+ ```
+
+ Requires Python ≥ 3.10 and either an [OpenAI API key](https://platform.openai.com/api-keys) or Azure OpenAI credentials. Construction verifies LLM connectivity for **each distinct** model or deployment the run uses; see **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)** for required variables, optional deployment-name overrides on Azure, and Databricks SQL warehouse vs PySpark.
+
+ | Extra | Brings in | Use when |
+ | ------------ | ------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------- |
+ | (base) | **SQLAlchemy** (shared introspection / execution interface) | Always installed |
+ | `postgresql` | PostgreSQL driver (`psycopg2-binary`), **`pglast`** | PostgreSQL via `PG*` / `POSTGRES_*` env (see **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**) |
+ | `databricks` | Databricks SQL connector (preferred), PySpark (fallback), **`databricks-sqlalchemy`**, `sqlglot` | Databricks via `DATABRICKS_*` / related aliases (see **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**) |
+
+ **SQL parsing for validation:** PostgreSQL uses **`pglast`** for structural AST checks (join pairs, CTE bodies, `ast_validate`). Databricks / Spark SQL uses **`sqlglot`** with the **Spark** dialect.
+
+ ---
+
+ ## Quickstart
+
+ ```python
+ from text2sql import SchemaContext, Text2SQL
+
+ t2s = Text2SQL(
+ SchemaContext(include="tables"),
+ artifacts_dir="./my_run",
+ env_file=".env",
+ )
+
+ t2s.run_interactive()
+ ```
+
+ Set database and LLM variables in the process environment or in **`env_file`**. The full matrix is in **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**. Pass **`artifacts_dir=`** so artifacts are written under `<root>/text2sql`; when omitted, a platform user-data directory is used.
+
+ **Two ways to run interactively:** **`run_interactive()`** is a stdin loop. For your own UI or protocol, use **`Text2SQL.pipeline_session()`** with **`PipelineSession.ask`** and **`PipelineSession.step`**, each of which returns a **`SessionStep`**; keep calling **`step`** until **`done`** is true. Details are in **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**.
+
+ ---
+
+ ## When to use it
+
+ **Good fit:** star/snowflake-style models, clear foreign keys, repeated BI-style questions, PostgreSQL or Databricks.
+
+ **Poor fit:** schemas without relationships, heavy procedural logic, or expectations of arbitrary SQL outside the supported analytical subset (see **What it supports** and **What it is not** below).
+
+ **Typical scale:** The toolkit targets **business-facing** warehouses on the order of **tens to hundreds of tables**. If your catalog is much larger, narrow **`SchemaContext`** (`allow_objects`, `include`, deny lists) so the reflected graph matches how analysts work.
+
+ ---
+
+ ## What this is
+
+ A **validation-first** layer for **stable business schemas** and **repeated analytical questions**, not open-ended “any SQL” generation.
+
+ **Core idea:** the system prefers **reuse and rules** over open-ended generation. Your question is matched to **stored, validated patterns** when possible; otherwise the model proposes **structured analytical intent**, which is **repaired and checked** against the schema, then turned into **read-only `SELECT`** SQL that must pass **static and `EXPLAIN`** gates before it runs.
+
+ - Natural language is turned into a **structured intent** (tables, select expressions, filters, grouping, ordering, optional CTEs) that is **shared across dialects**; dialect-specific SQL is produced later.
+ - **Templates** store previously accepted query patterns; **negative memory** records rejections so bad shapes are less likely to repeat.
+ - LLM calls run at **temperature 0**; for the same inputs and schema state, behavior is **repeatable**.
+ - **Bounded LLM use**: strong paths reuse templates or deterministic structure before asking the model for SQL.
+
+ ---
+
+ ## What you can use it for
+
+ - **Repeated BI-style questions** on a schema you control: faster turns when templates already fit, consistent SQL shape across users.
+ - **Governed analytical subset** of SQL—not arbitrary strings—so reviewers have a predictable surface area.
+ - **Learning from usage**: accepting good answers strengthens reuse; rejecting bad ones feeds **negative memory**. When the session asks **what was wrong**, your short free-text reason is **normalized and categorized** so similar intents are steered away more reliably than with a reject flag alone.
+ - **Bootstrapping coverage**: **seed warmup** (gold intents from a seed file → expansion → validation/execute → optional NL for new templates) and **QSim** (reproducible synthetic questions from schema and profiles) for datasets, regression lists, or onboarding.
+ - **Operational feedback**: construction checks **database connectivity**, **LLM reachability** for every configured model, and reconciles **on-disk artifacts** with the **current** schema (see **Artifacts and migration** below).
+
+ ---
+
+ ## LLM: three fixed models
+
+ The library uses **three** named models internally. On **Azure OpenAI** you expose **three deployments** whose default names match those internal names, or you map each name to your deployment with optional env vars (**[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)** lists the exact strings and variables).
+
+ When `Text2SQL` is constructed, a **short completion** is sent once per **distinct** configured model name (OpenAI) or deployment name (Azure), so bad keys, endpoints, or deployment maps fail immediately with **`ConfigError`** instead of halfway through a session.
+
+ When the process environment exposes credentials for **both** providers, **OpenAI is tried first** and **Azure** is used as the fallback when the OpenAI preflight fails; if neither preflight succeeds, construction raises **`ConfigError`**.
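Illustrative sketch (not part of the package): the provider order described above, OpenAI first and Azure as fallback, could be expressed roughly as follows. The `choose_provider` function, the two preflight callables, and the local `ConfigError` class are hypothetical stand-ins; the library's real preflight code and exception may differ.

```python
from typing import Callable, Optional


class ConfigError(Exception):
    """Local stand-in for the library's ConfigError (assumption)."""


def choose_provider(
    preflight_openai: Optional[Callable[[], None]],
    preflight_azure: Optional[Callable[[], None]],
) -> str:
    """Return "openai" or "azure", trying OpenAI first.

    Each argument is a hypothetical zero-argument callable that raises on
    failure, or None when no credentials for that provider are present.
    """
    errors = []
    for name, preflight in (("openai", preflight_openai), ("azure", preflight_azure)):
        if preflight is None:
            continue  # no credentials configured for this provider
        try:
            preflight()
            return name
        except Exception as exc:  # a failed preflight falls through to the next provider
            errors.append(f"{name}: {exc}")
    # Neither provider passed its preflight (or none was configured).
    raise ConfigError("; ".join(errors) or "no LLM credentials found")
```

Collecting both failure messages before raising keeps the error actionable when both sets of credentials are misconfigured.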
+
+ ---
+
+ ## Database credentials and permissions
+
+ Treat credentials as you would for any read-only analyst account.
+
+ - The engine needs to **reflect** the tables or views in your **`SchemaContext`**, run **`SELECT`** (and **`EXPLAIN`**) on generated queries, and execute the paths you enable (interactive display, warmup, etc.).
+ - **Least privilege** is recommended: a role limited to **`SELECT`** (and whatever your database requires for **`EXPLAIN`**) on the objects you include. The library enforces an analytical **`SELECT`**-only policy in generated SQL, but that is **not** a substitute for database- and network-level security.
+ - **Scope** matters: allow/deny lists and `include` settings restrict what is visible; they also feed **fingerprinting** so template stores stay aligned when you change scope (**[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**).
+
+ ---
+
+ ## Artifacts and migration
+
+ Everything learned or cached for a connection “shape” lives under **`artifacts_dir`** (see Quickstart): resolved to **`<root>/text2sql`**, or a platform user-data directory if you omit **`artifacts_dir`**. That folder holds the **schema snapshot**, **template store**, **QSim skeletons**, seed-warmup cache, and a small **manifest** of fingerprints—not your raw database.
+
+ Each time **`Text2SQL(...)`** runs, the **live** schema graph is compared to the **stored** manifest. The outcome is one of four **migration tiers** (construction step 6 and the table below; for a conceptual walkthrough see **[OVERVIEW.md § Migration tiers](https://github.com/akul-ameya/aetherdialect/blob/main/OVERVIEW.md#5-migration-tiers-what-happens-when-your-schema-changes)**):
+
+ | Tier | What it means for you |
+ | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+ | **No change** | Stored fingerprints already match the live graph; existing templates and caches apply as-is. |
+ | **Soft refresh** | The same structural scope changed only in ways that do not invalidate stored SQL shapes (for example, refreshed notes or profiling metadata). Fingerprints on disk are updated; **templates are kept**. |
+ | **Remap** | The engine detected a **rename-only** style drift (tables/columns) within the same scope. Stored templates may be **rewritten** to the new names; a **Migration applied** summary is printed to stdout when the tier is not `no_change`. |
+ | **Destructive** | The drift is too large to remap safely. Applying this tier **clears persisted templates** and **related caches** on disk (including the **compressed schema snapshot**, QSim skeletons, and seed-warmup zip) under that artifact root, then stamps a fresh manifest. **Your database is not modified**—only local learning files. The live graph in memory for the current process is already built; the next process run rewrites artifacts as needed. |
+
+ **When destructive migration is required**, construction does **not** wipe disk immediately. The in-memory template store starts **empty** until you **confirm** on the first interactive or programmatic turn:
+
+ - **`run_interactive()`**: you enter a question line as usual; **`ask`** then returns a **`SessionStep`** that asks for migration confirmation **before** that question runs. Answer on the next prompt; **`n`** returns a terminal **`SessionStep`** with **`kind=error`** and guidance to use a **new `artifacts_dir`** (or remove the old artifact tree) before continuing.
+ - **`PipelineSession`**: the first **`ask`** may return a **`SessionStep`** whose **`prompt`** asks for confirmation (your question string is held until you confirm). Supply **`y`** / **`n`** via **`step`**. Declining migration returns the same terminal **`SessionStep`** error shape instead of raising.
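Illustrative sketch (not part of the package): the ask/step loop and the migration confirmation described above might be driven like this. `FakeStep` and `FakeSession` are hypothetical stand-ins for `SessionStep` and `PipelineSession`; only the `done`, `kind`, and `prompt` names come from the README, and the argument shapes of `ask` and `step` are assumptions to check against API_REFERENCE.md.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FakeStep:
    """Hypothetical stand-in for SessionStep; only these three fields are assumed."""
    done: bool
    kind: str = "info"
    prompt: Optional[str] = None


class FakeSession:
    """Hypothetical stand-in for PipelineSession that first asks to confirm migration."""

    def __init__(self) -> None:
        self._held_question: Optional[str] = None

    def ask(self, question: str) -> FakeStep:
        self._held_question = question  # held until the caller confirms
        return FakeStep(done=False, kind="confirm", prompt="Apply destructive migration? [y/n]")

    def step(self, reply: str) -> FakeStep:
        if reply.strip().lower() != "y":
            # Declining yields a terminal error step rather than an exception.
            return FakeStep(done=True, kind="error", prompt="Use a new artifacts_dir.")
        return FakeStep(done=True, kind="result", prompt=f"ran: {self._held_question}")


def drive(session, question: str, reply_for: Callable[[str], str]):
    """Call ask once, then feed replies through step until done is true."""
    current = session.ask(question)
    while not current.done:
        current = session.step(reply_for(current.prompt or ""))
    return current
```

With the real library, `session` would come from `Text2SQL.pipeline_session()` and `reply_for` would be whatever your UI uses to collect the user's answer.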
+
+ After construction, the migration outcome is printed when non-trivial; use **`Text2SQL.show_config()`** for a redacted configuration snapshot. For backups, copy the whole **`<root>/text2sql`** directory; to reset learning, remove it or switch **`artifacts_dir`**.
+
+ ---
+
+ ## How to improve results (without touching code)
+
+ - **Schema quality:** **declared foreign keys** in the database, sensible types, and a stable star/snowflake-style layout are ideal. If production metadata is thin, the graph still gains **inferred FK-style links** from naming conventions where those rules apply, plus **semantic join neighbors** from profiled column value overlap—so imperfect warehouses get extra join signal, not only whatever the catalog declared.
+ - **Domain language:** optional **notes** file (`SchemaContext.notes_file`) and concrete questions (entities, time range, grain) improve routing and SQL quality; see **Practical tips** near the top of this README.
+ - **Scope:** use **allow/deny** and **`include`** deliberately so the graph matches how analysts think about the warehouse; changing scope changes fingerprints and can trigger migration.
+ - **Operational learning:** accept good SQL; when you reject, answer the **“what was wrong?”** prompt when it appears—the reason improves negative learning (see above). Use **seed warmup** or **QSim** to broaden template coverage in a controlled way.
+
+ ---
+
+ ## Philosophy
+
+ - **Determinism over creativity** — prefer a correct, boring plan to a novel one.
+ - **Correct joins over clever SQL** — join paths come from **declared and inferred foreign-key structure**, precomputed paths, and profiled semantic links where applicable, not free-form guessing.
+ - **Validate before execute** — schema checks, intent consistency, join shape, and dialect-safe **read-only** `SELECT` rules.
+ - **Minimal LLM surface** — parse intent, resolve ambiguous joins when needed; everything else is rules and stores.
+ - **Safe defaults for non-analysts** — narrow allowed SQL; you still choose **database credentials** (read-only vs write-capable is outside this library).
+
+ ---
+
+ ## What it supports (at a glance)
+
+ **Backends**
+
+ - PostgreSQL
+ - Databricks
+
+ **Schema**
+
+ - Load from **live introspection** (primary). Set **`SchemaContext.include`** to **`"tables"`** (base tables and materialized views), **`"views"`** (ordinary views only), or **`"both"`** (union of the two passes; ambiguous relation names are rejected when building the graph).
+ - Optional **`CREATE TABLE` file** as extra or fallback (especially when you cannot reach all metadata from the driver).
+ - Cached schema snapshot per connection fingerprint so restarts avoid re-reflecting unchanged databases.
+ - **Table roles** (e.g. fact vs dimension), **column roles** (measure, categorical, temporal, identifier, etc.), **filter / aggregation / HAVING** allowances per column, **value domains** from profiling — all assigned when the graph is built (reflection, DDL, profiling, and optional notes).
+ - Profiling captures the **mode frequency** for each column. When one value occupies ≥99% of the non-null distribution (a sentinel like `0`, `-1`, or `'Unknown'`) the column is hidden from the LLM by the same gate as columns that are ≥99% null — sentinel-dominated columns carry no useful filter or grouping signal.
+ - Optional **human notes** (plain text), via **`SchemaContext.notes_file`** (see **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**): merged when the graph is built or when notes change; if the cache already contains notes and you omit `notes_file` on a later run, cached roles and hints are kept.
+ - Optional **`deny_columns`** and **`allow_objects`** in **`SchemaContext`**; they participate in **scope hashing** so template stores reconcile when scope changes. Each `deny_columns` entry is either a qualified **`"table.column"`** (denies that exact column) or a bare **`"column"`** name (denies that column name on every table where it appears — qualify if you want one-table scope). Denied columns are hidden from the LLM context and rejected anywhere they would appear in the IR (bare select, filter, `GROUP BY`, `HAVING`, `ORDER BY`, aggregate).
+ - Optional **`allow_columns`** in **`SchemaContext`** complements `deny_columns`: when non-empty, only the listed columns survive reflection. Same grammar as `deny_columns` (qualified or bare). **Pragmatic auto-include**: primary key columns and any column appearing in a foreign key edge (source or destination) are always retained so the join graph survives a narrow allow list. Participates in scope hashing.
+ - Per-column **`sensitivity`** tag on `ColumnMetadata` accepts `"pii"` or `"restricted"`. Both hide the column from the LLM context. `"pii"` additionally rejects bare select-list projection and `GROUP BY` references; aggregates and equality filters remain available. `"restricted"` hides from the LLM only — IR references that survive other validators are permitted.
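Illustrative sketch (not part of the package): the `deny_columns` / `allow_columns` grammar above, with qualified `"table.column"` entries, bare `"column"` entries, and key columns retained under a narrow allow list, could be modelled as below. The function names are hypothetical. Two behaviours are assumptions the README does not state: matching is exact and case-sensitive, and a deny entry wins over the key auto-include.

```python
from typing import Iterable, Set, Tuple

Column = Tuple[str, str]  # (table, column)


def matches(entry: str, table: str, column: str) -> bool:
    """Qualified "table.column" matches one column; bare "column" matches on any table."""
    if "." in entry:
        return entry == f"{table}.{column}"
    return entry == column


def visible_columns(
    columns: Iterable[Column],
    deny: Iterable[str] = (),
    allow: Iterable[str] = (),
    key_columns: Iterable[Column] = (),
) -> Set[Column]:
    """Apply deny entries first, then the allow list with key columns always kept."""
    deny, allow, keys = list(deny), list(allow), set(key_columns)
    kept = set()
    for table, column in columns:
        if any(matches(e, table, column) for e in deny):
            continue  # assumption: deny takes precedence over everything else
        allowed = not allow or any(matches(e, table, column) for e in allow)
        if allowed or (table, column) in keys:
            kept.add((table, column))  # PK/FK columns survive a narrow allow list
    return kept
```

For example, `deny=["email"]` hides `email` on every table, while `deny=["orders.email"]` hides it on `orders` only.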
+
+ **Intent / SQL shape (analytical subset)**
+
+ - **Queries:** `SELECT` only (enforced with pattern checks and dialect parsing). **CTEs** reuse the same intent model as the outer query.
+ - **Joins:** **foreign keys** and precomputed paths, plus **profiled value-overlap** links between columns when samples agree; when several valid paths could link the same tables, one coherent path is chosen for the whole query. **CTEs** act as virtual tables in join discovery; self-joins use CTEs instead of repeating the same base table.
+ - **Select list:** bare columns, **aggregates** (`COUNT`, `SUM`, `AVG`, `MIN`, `MAX`, etc.), **arithmetic and string expressions** where the schema allows them, **`DISTINCT`**, and **scalar functions** subject to column metadata.
+ - **Filters / boolean logic:** comparisons, **`AND` / `OR`**, **`IN`**, **`LIKE`**; **`ILIKE` / `NOT ILIKE` on PostgreSQL only** (intent stays dialect-agnostic; SQL rendering differs). Null / boolean value normalization in the deterministic intent and SQL preparation chain.
+ - **`BETWEEN`** in intent is **decomposed** into a pair of comparable predicates.
+ - **Grouping / ordering:** **`GROUP BY`**, **`HAVING`** (aggregate-aware), **`ORDER BY`**, **`LIMIT`**; rules tie **grain** (row-level vs grouped) to aggregates and grouped columns.
+ - **Dates:** structured **`date_window`** (anchor unit + offset) and **date-difference** filters between columns where supported. Calendar grains supplied by the LLM are accepted in colloquial forms — `monthly`, `quarterly`, `yearly`, `annual`, `daily`, `weekly`, `hourly`, plurals, and short abbreviations like `yr`, `mo`, `qtr`, `wk` — and are folded internally to a single canonical token before SQL is rendered, so dialect emit functions see only `day` / `week` / `month` / `quarter` / `year` / `hour` / `minute` / `second`.
+ - **Windows:** `ROW_NUMBER`, `RANK`, `DENSE_RANK`, and windowed `SUM`/`AVG`, `LAG`, `LEAD`, `FIRST_VALUE`, and `LAST_VALUE` on select columns (main query and CTEs).
+ - **`CASE` / `WHEN`:** only in the **select list** in the intent model (not in `WHERE` / `HAVING`).
+ - **Arrays / lists:** membership-style filters; SQL uses dialect-appropriate forms; optional **UNNEST / EXPLODE-style** expansion in CTE select lists for typed array columns.
+ - **Metadata:** **UNIQUE** (and related) when reflection or DDL exposes it, for ranking “human readable” identifiers.
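Illustrative sketch (not part of the package): the calendar-grain folding mentioned in the **Dates** bullet, where colloquial forms collapse to one canonical token, might look like this. The alias table contains only the forms the README names; the `canonical_grain` function and the plural rule of stripping one trailing "s" are assumptions.

```python
CANONICAL = ("second", "minute", "hour", "day", "week", "month", "quarter", "year")

# Colloquial forms named in the README; any alias beyond these would be a guess.
ALIASES = {
    "daily": "day", "weekly": "week", "monthly": "month", "quarterly": "quarter",
    "yearly": "year", "annual": "year", "hourly": "hour",
    "yr": "year", "mo": "month", "qtr": "quarter", "wk": "week",
}


def canonical_grain(raw: str) -> str:
    """Fold a colloquial grain such as "Monthly" or "qtrs" to a canonical token."""
    token = raw.strip().lower()
    if token in ALIASES:
        return ALIASES[token]
    if token in CANONICAL:
        return token
    # Plurals such as "months" or "yrs": retry once without the trailing "s".
    if token.endswith("s"):
        singular = token[:-1]
        if singular in ALIASES:
            return ALIASES[singular]
        if singular in CANONICAL:
            return singular
    raise ValueError(f"unrecognized calendar grain: {raw!r}")
```

Raising on unknown input, rather than guessing, matches the validation-first stance described elsewhere in the README.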
+
+ **Operational modes**
+
+ - **Interactive** — ask questions, accept/reject, results export; via **`run_interactive()`** or a programmatic **`PipelineSession`** (see Quickstart above and **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**).
+ - **Seed warmup** — seed questions → gold intents → **deterministic expansion** (many operators, deduplicated) → validate/execute → NL question generation for new templates.
+ - **QSim** — reproducible synthetic questions from schema and profiles (seeded randomness).
+
+ ---
+
+ ## What it is not
+
+ - Not a **full SQL** or **stored-procedure** generator: no `UNION`/`INTERSECT`/`EXCEPT`, no correlated subqueries, no `EXISTS`, no `LATERAL`, no **DML/DDL**, no arbitrary **RIGHT/FULL OUTER** join policy in the constrained path.
+ - Not a substitute for **database security** (see **Database credentials and permissions** above).
+ - Not **schema-agnostic**: quality depends on **relationship signal** (declared FKs, inferred links, semantic profile edges), sensible types, and optional notes for domain language.
+
+ ---
+
+ ## How a question becomes SQL
+
+ 1. **Template match** — if a trusted pattern fits, reuse parameterized SQL (often **no** SQL LLM call).
+ 2. **Intent parse** — structured intent from the question + schema summary, then a long **deterministic repair chain**.
+ 3. **Join resolution** — choose among valid **FK paths** for the intent’s tables; disambiguation may use the LLM when multiple paths tie.
+ 4. **SQL generation & validation** — deterministic skeleton, injected joins, LLM fill **under constraints** where applicable, then **semantic validation**, **`SELECT`-only / forbidden-pattern checks**, **dialect AST** validation (**`pglast`** or **`sqlglot`**), and **`EXPLAIN`**.
+ 5. **Execute** (where the mode allows) and **learn** — accept → promote template trust; reject → record a **categorized** negative pattern (optionally after a short **rejection reason** when the UI or interactive loop asks for one).
+
+ ---
+
+ ## Validation (layers)
+
+ - **Safety / shape** — `SELECT`-only enforcement, configurable **forbidden SQL** substrings, then **dialect `ast_validate`** (**`pglast`** or **`sqlglot`**). **`EXPLAIN`** is used as an extra executability check. Intent handling is structured (contracts and parsers); a few **narrow string passes** still exist for SQL post-processing where AST coverage is not yet unified. Natural-language questions are not executed as raw SQL; they pass through structured intent and these gates, with values bound as parameters or validated literals rather than unconstrained interpolation into executable SQL (together with least-privilege roles, see _What it is not_).
243
+ - **Schema vs intent** — tables/columns/CTEs, **selectability**, access and sensitivity policy, window / CASE / array shapes, filter and HAVING ops per column, aggregate roles in select/HAVING/ORDER BY, scalar function typing, filter value types vs column types, null/date-window/date-diff rules.
244
+ - **Semantic consistency** — grouped queries require proper aggregation; contradictions and impossible HAVING; **grain** alignment.
245
+ - **Joins** — paths use the FK graph, virtual CTE bridges, and stored semantic-profile edges where applicable; guarded path avoids ad-hoc join guessing.
246
+
247
+ ---
248
+
249
+ ## Learning and reuse
250
+
251
+ - **Accepted templates** — intent fingerprint, parameterized SQL, optional example question, **trust** that rises with validation and falls with rejection.
252
+ - **Rejected templates** (“negative memory”) — failures are stored with **categories** (and optional user **rejection reasons** when collected) so similar bad intents are discouraged on later turns.
253
+ - **Loader reconciliation** — when you open an existing template file, rows that no longer match the current graph (missing tables, columns, or join segments) are pruned, negative memory for removed rejects is cleared, and stale failure-log rows from older hashes are filtered before the store is saved for the current scope. Large fingerprint jumps are handled by the **migration** path above, not only this incremental prune.
254
+ - Persistence lives next to the manifest under your **`artifacts_dir`** tree (**[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**); back it up or reset it as described under **Artifacts and migration**.
255
+
256
+ ---
257
+
258
+ ## Seed warmup (brief)
259
+
260
+ 1. Parse each **seed** line into a gold intent.
261
+ 2. **Expand** with a fixed set of **deterministic operators** (filters, aggregates, joins, etc.), **deduplicated** by body similarity across depths.
262
+ 3. Classify gold rows against the template store; resolve joins per table set; validate and **execute** the full pool (subject to caps).
263
+ 4. **Stratified sampling** on execute successes, then an LLM step that generates up to **three** question paraphrases from the SQL, filtered for **realism**.
264
+
265
+ Warmup loads at most **500** seed lines per run (a fixed internal cap; larger files are truncated). **`run_warmup_preflight`** runs the same execute path without the question-generation LLM step or template persistence and writes **`warmup_preflight_report_v{m}.json`**.
266
+
267
+ ---
268
+
269
+ ## QSim (brief)
270
+
271
+ Generates **reproducible** question lists from the schema and profiled values. Same seed → same output. Use for regression-style testing or dataset building.
@@ -0,0 +1,234 @@
1
+ # Deterministic, validation-first Text-to-SQL for business databases
2
+
3
+ This library turns **analytical questions** into **read-only `SELECT`** pipelines on **PostgreSQL** or **Databricks**: structured intent, heavy validation (including dialect AST and `EXPLAIN`), optional **template reuse** from accepted answers, and **negative memory** from rejections. When you construct **`Text2SQL`**, it checks **database connectivity**, **LLM reachability**, and whether **on-disk artifacts** still match the live schema.
4
+
5
+ **Practical tips:** Questions resolve more reliably when you state intent explicitly—entities, grain, filters, time scope, and ordering—instead of leaving those details implied. The same goes for optional domain notes (`SchemaContext.notes_file`, see **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**): richer notes and clearer questions generally improve routing speed and SQL quality.
6
+
7
+ **Internals:** for how the schema graph is built, how columns become visible to the LLM, how joins are picked, and how stored artifacts migrate when the schema drifts, see **[OVERVIEW.md](https://github.com/akul-ameya/aetherdialect/blob/main/OVERVIEW.md)**.
8
+
9
+ ## Installation
10
+
11
+ ```bash
12
+ pip install aetherdialect
13
+ pip install "aetherdialect[postgresql]"
14
+ pip install "aetherdialect[databricks]"
15
+ pip install "aetherdialect[postgresql,databricks]"
16
+ ```
17
+
18
+ Requires Python ≥ 3.10 and either an [OpenAI API key](https://platform.openai.com/api-keys) or Azure OpenAI credentials. Construction verifies LLM connectivity for **each distinct** model or deployment the run uses; see **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)** for required variables, optional deployment-name overrides on Azure, and Databricks SQL warehouse vs PySpark.
19
+
20
+ | Extra | Brings in | Use when |
21
+ | ------------ | ------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------- |
22
+ | (base) | **SQLAlchemy** (shared introspection / execution interface) | Always installed |
23
+ | `postgresql` | PostgreSQL driver (`psycopg2-binary`), **`pglast`** | PostgreSQL via `PG*` / `POSTGRES_*` env (see **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**) |
24
+ | `databricks` | Databricks SQL connector (preferred), PySpark (fallback), **`databricks-sqlalchemy`**, `sqlglot` | Databricks via `DATABRICKS_*` / related aliases (see **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**) |
25
+
26
+ **SQL parsing for validation:** PostgreSQL uses **`pglast`** for structural AST checks (join pairs, CTE bodies, `ast_validate`). Databricks / Spark SQL uses **`sqlglot`** with the **Spark** dialect.
27
+
28
+ ---
29
+
30
+ ## Quickstart
31
+
32
+ ```python
33
+ from text2sql import SchemaContext, Text2SQL
34
+
35
+ t2s = Text2SQL(
36
+ SchemaContext(include="tables"),
37
+ artifacts_dir="./my_run",
38
+ env_file=".env",
39
+ )
40
+
41
+ t2s.run_interactive()
42
+ ```
43
+
44
+ Set database and LLM variables in the process environment or in **`env_file`**. The full matrix is in **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**. Pass **`artifacts_dir=`** so artifacts are written under `<root>/text2sql`; when omitted, a platform user-data directory is used.
45
+
46
+ **Two ways to run interactively:** **`run_interactive()`** is a stdin loop. For your own UI or protocol, use **`Text2SQL.pipeline_session()`** with **`PipelineSession.ask`** and **`PipelineSession.step`**, each of which returns a **`SessionStep`**; keep calling **`step`** until **`done`** is true. Details are in **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**.
47
+
48
+ ---
49
+
50
+ ## When to use it
51
+
52
+ **Good fit:** star/snowflake-style models, clear foreign keys, repeated BI-style questions, PostgreSQL or Databricks.
53
+
54
+ **Poor fit:** schemas without relationships, heavy procedural logic, or expectations of arbitrary SQL outside the supported analytical subset (see **What it supports** and **What it is not** below).
55
+
56
+ **Typical scale:** The toolkit targets **business-facing** warehouses on the order of **tens to hundreds of tables**. If your catalog is much larger, narrow **`SchemaContext`** (`allow_objects`, `include`, deny lists) so the reflected graph matches how analysts work.
57
+
58
+ ---
59
+
60
+ ## What this is
61
+
62
+ A **validation-first** layer for **stable business schemas** and **repeated analytical questions**, not open-ended “any SQL” generation.
63
+
64
+ **Core idea:** the system prefers **reuse and rules** over open-ended generation. Your question is matched to **stored, validated patterns** when possible; otherwise the model proposes **structured analytical intent**, which is **repaired and checked** against the schema, then turned into **read-only `SELECT`** SQL that must pass **static and `EXPLAIN`** gates before it runs.
65
+
66
+ - Natural language is turned into a **structured intent** (tables, select expressions, filters, grouping, ordering, optional CTEs) that is **shared across dialects**; dialect-specific SQL is produced later.
67
+ - **Templates** store previously accepted query patterns; **negative memory** records rejections so bad shapes are less likely to repeat.
68
+ - LLM calls run at **temperature 0**; for the same inputs and schema state, behavior is **repeatable**.
69
+ - **Bounded LLM use**: strong paths reuse templates or deterministic structure before asking the model for SQL.
70
+
71
+ ---
72
+
73
+ ## What you can use it for
74
+
75
+ - **Repeated BI-style questions** on a schema you control: faster turns when templates already fit, consistent SQL shape across users.
76
+ - **Governed analytical subset** of SQL—not arbitrary strings—so reviewers have a predictable surface area.
77
+ - **Learning from usage**: accepting good answers strengthens reuse; rejecting bad ones feeds **negative memory**. When the session asks **what was wrong**, your short free-text reason is **normalized and categorized** so similar intents are steered away more reliably than with a reject flag alone.
78
+ - **Bootstrapping coverage**: **seed warmup** (gold intents from a seed file → expansion → validation/execute → optional NL for new templates) and **QSim** (reproducible synthetic questions from schema and profiles) for datasets, regression lists, or onboarding.
79
+ - **Operational feedback**: construction checks **database connectivity**, **LLM reachability** for every configured model, and reconciles **on-disk artifacts** with the **current** schema (see **Artifacts and migration** below).
80
+
81
+ ---
82
+
83
+ ## LLM: three fixed models
84
+
85
+ The library uses **three** named models internally. On **Azure OpenAI** you expose **three deployments** whose default names match those internal names, or you map each name to your deployment with optional env vars (**[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)** lists the exact strings and variables).
86
+
87
+ When `Text2SQL` is constructed, a **short completion** is sent once per **distinct** configured model name (OpenAI) or deployment name (Azure), so bad keys, endpoints, or deployment maps fail immediately with **`ConfigError`** instead of halfway through a session.
88
+
89
+ When the process environment exposes credentials for **both** providers, **OpenAI is tried first** and **Azure** is used as the fallback when the OpenAI preflight fails; if neither preflight succeeds, construction raises **`ConfigError`**.
90
+
91
+ ---
92
+
93
+ ## Database credentials and permissions
94
+
95
+ Treat credentials as you would for any read-only analyst account.
96
+
97
+ - The engine needs to **reflect** the tables or views in your **`SchemaContext`**, run **`SELECT`** (and **`EXPLAIN`**) on generated queries, and execute the paths you enable (interactive display, warmup, etc.).
98
+ - **Least privilege** is recommended: a role limited to **`SELECT`** (and whatever your database requires for **`EXPLAIN`**) on the objects you include. The library enforces an analytical **`SELECT`**-only policy in generated SQL, but that is **not** a substitute for database- and network-level security.
99
+ - **Scope** matters: allow/deny lists and `include` settings restrict what is visible; they also feed **fingerprinting** so template stores stay aligned when you change scope (**[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**).
100
+
101
+ ---
102
+
103
+ ## Artifacts and migration
104
+
105
+ Everything learned or cached for a connection “shape” lives under **`artifacts_dir`** (see Quickstart): resolved to **`<root>/text2sql`**, or a platform user-data directory if you omit **`artifacts_dir`**. That folder holds the **schema snapshot**, **template store**, **QSim skeletons**, seed-warmup cache, and a small **manifest** of fingerprints—not your raw database.
106
+
107
+ Each time **`Text2SQL(...)`** runs, the **live** schema graph is compared to the **stored** manifest. The outcome is one of four **migration tiers** (construction step 6 and the table below; for a conceptual walkthrough see **[OVERVIEW.md § Migration tiers](https://github.com/akul-ameya/aetherdialect/blob/main/OVERVIEW.md#5-migration-tiers-what-happens-when-your-schema-changes)**):
108
+
109
+ | Tier | What it means for you |
110
+ | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
111
+ | **No change** | Stored fingerprints already match the live graph; existing templates and caches apply as-is. |
112
+ | **Soft refresh** | The same structural scope changed only in ways that do not invalidate stored SQL shapes (for example, refreshed notes or profiling metadata). Fingerprints on disk are updated; **templates are kept**. |
113
+ | **Remap** | The engine detected a **rename-only** style drift (tables/columns) within the same scope. Stored templates may be **rewritten** to the new names; a **Migration applied** summary is printed to stdout when the tier is not `no_change`. |
114
+ | **Destructive** | The drift is too large to remap safely. Applying this tier **clears persisted templates** and **related caches** on disk (including the **compressed schema snapshot**, QSim skeletons, and seed-warmup zip) under that artifact root, then stamps a fresh manifest. **Your database is not modified**—only local learning files. The live graph in memory for the current process is already built; the next process run rewrites artifacts as needed. |
115
+
116
+ **When destructive migration is required**, construction does **not** wipe disk immediately. The in-memory template store starts **empty** until you **confirm** on the first interactive or programmatic turn:
117
+
118
+ - **`run_interactive()`**: you enter a question line as usual; **`ask`** then returns a **`SessionStep`** that asks for migration confirmation **before** that question runs. Answer on the next prompt; **`n`** returns a terminal **`SessionStep`** with **`kind=error`** and guidance to use a **new `artifacts_dir`** (or remove the old artifact tree) before continuing.
119
+ - **`PipelineSession`**: the first **`ask`** may return a **`SessionStep`** whose **`prompt`** asks for confirmation (your question string is held until you confirm). Supply **`y`** / **`n`** via **`step`**. Declining migration returns the same terminal **`SessionStep`** error shape instead of raising.
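A caller-side loop for the confirmation flow above might look like the following sketch. The README names `ask`, `step`, `SessionStep`, `done`, `prompt`, and `kind=error`; everything else here is an assumption made for illustration: that `ask` and `step` each take a single string, the `drive` helper, the `FakeSession` and `FakeStep` stand-ins, and the `"confirm"` and `"result"` kind values. Check **API_REFERENCE.md** for the real signatures.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FakeStep:
    """Toy stand-in for SessionStep with only the fields used below."""
    done: bool
    prompt: str = ""
    kind: str = "info"


class FakeSession:
    """Toy stand-in for PipelineSession that asks for confirmation first."""

    def __init__(self) -> None:
        self._question: Optional[str] = None

    def ask(self, question: str) -> FakeStep:
        self._question = question  # held until the caller confirms
        return FakeStep(done=False, prompt="Apply destructive migration? [y/n]", kind="confirm")

    def step(self, answer: str) -> FakeStep:
        if answer.strip().lower() != "y":
            # Declining yields a terminal error step instead of raising.
            return FakeStep(done=True, prompt="Use a new artifacts_dir.", kind="error")
        return FakeStep(done=True, prompt=f"ran: {self._question}", kind="result")


def drive(session, question: str, answer_for: Callable[[str], str]):
    """Ask once, then keep answering prompts until the step reports done."""
    step = session.ask(question)
    while not step.done:
        step = session.step(answer_for(step.prompt))
    return step


final = drive(FakeSession(), "total sales by region", lambda prompt: "n")
print(final.kind, final.prompt)
```

Because declining is reported as a terminal step rather than an exception, the caller only needs to inspect `kind` on the final step.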
120
+
121
+ After construction, the migration outcome is printed when non-trivial; use **`Text2SQL.show_config()`** for a redacted configuration snapshot. For backups, copy the whole **`<root>/text2sql`** directory; to reset learning, remove it or switch **`artifacts_dir`**.
122
+
123
+ ---
124
+
125
+ ## How to improve results (without touching code)
126
+
127
+ - **Schema quality:** **declared foreign keys** in the database, sensible types, and a stable star/snowflake-style layout are ideal. If production metadata is thin, the graph still gains **inferred FK-style links** from naming conventions where those rules apply, plus **semantic join neighbors** from profiled column value overlap—so imperfect warehouses get extra join signal, not only whatever the catalog declared.
128
+ - **Domain language:** optional **notes** file (`SchemaContext.notes_file`) and concrete questions (entities, time range, grain) improve routing and SQL quality; see **Practical tips** near the top of this README.
129
+ - **Scope:** use **allow/deny** and **`include`** deliberately so the graph matches how analysts think about the warehouse; changing scope changes fingerprints and can trigger migration.
130
+ - **Operational learning:** accept good SQL; when you reject, answer the **“what was wrong?”** prompt when it appears—the reason improves negative learning (see above). Use **seed warmup** or **QSim** to broaden template coverage in a controlled way.
131
+
132
+ ---
133
+
134
+ ## Philosophy
135
+
136
+ - **Determinism over creativity** — prefer a correct, boring plan to a novel one.
137
+ - **Correct joins over clever SQL** — join paths come from **declared and inferred foreign-key structure**, precomputed paths, and profiled semantic links where applicable, not free-form guessing.
138
+ - **Validate before execute** — schema checks, intent consistency, join shape, and dialect-safe **read-only** `SELECT` rules.
139
+ - **Minimal LLM surface** — parse intent, resolve ambiguous joins when needed; everything else is rules and stores.
140
+ - **Safe defaults for non-analysts** — narrow allowed SQL; you still choose **database credentials** (read-only vs write-capable is outside this library).
141
+
142
+ ---
143
+
144
+ ## What it supports (at a glance)
145
+
146
+ **Backends**
147
+
148
+ - PostgreSQL
149
+ - Databricks
150
+
151
+ **Schema**
152
+
153
+ - Load from **live introspection** (primary). Set **`SchemaContext.include`** to **`"tables"`** (base tables and materialized views), **`"views"`** (ordinary views only), or **`"both"`** (union of the two passes; ambiguous relation names are rejected when building the graph).
154
+ - Optional **`CREATE TABLE` file** as extra or fallback (especially when you cannot reach all metadata from the driver).
155
+ - Cached schema snapshot per connection fingerprint so restarts avoid re-reflecting unchanged databases.
156
+ - **Table roles** (e.g. fact vs dimension), **column roles** (measure, categorical, temporal, identifier, etc.), **filter / aggregation / HAVING** allowances per column, **value domains** from profiling — all assigned when the graph is built (reflection, DDL, profiling, and optional notes).
157
+ - Profiling captures the **mode frequency** for each column. When one value occupies ≥99% of the non-null distribution (a sentinel like `0`, `-1`, or `'Unknown'`) the column is hidden from the LLM by the same gate as columns that are ≥99% null — sentinel-dominated columns carry no useful filter or grouping signal.
158
+ - Optional **human notes** (plain text), via **`SchemaContext.notes_file`** (see **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**): merged when the graph is built or when notes change; if the cache already contains notes and you omit `notes_file` on a later run, cached roles and hints are kept.
159
+ - Optional **`deny_columns`** and **`allow_objects`** in **`SchemaContext`**; they participate in **scope hashing** so template stores reconcile when scope changes. Each `deny_columns` entry is either a qualified **`"table.column"`** (denies that exact column) or a bare **`"column"`** name (denies that column name on every table where it appears — qualify if you want one-table scope). Denied columns are hidden from the LLM context and rejected anywhere they would appear in the IR (bare select, filter, `GROUP BY`, `HAVING`, `ORDER BY`, aggregate).
160
+ - Optional **`allow_columns`** in **`SchemaContext`** complements `deny_columns`: when non-empty, only the listed columns survive reflection. Same grammar as `deny_columns` (qualified or bare). **Pragmatic auto-include**: primary key columns and any column appearing in a foreign key edge (source or destination) are always retained so the join graph survives a narrow allow list. Participates in scope hashing.
161
+ - Per-column **`sensitivity`** tag on `ColumnMetadata` accepts `"pii"` or `"restricted"`. Both hide the column from the LLM context. `"pii"` additionally rejects bare select-list projection and `GROUP BY` references; aggregates and equality filters remain available. `"restricted"` hides from the LLM only — IR references that survive other validators are permitted.
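Two of the visibility rules in the list above, the `deny_columns` grammar and the 99% sentinel/null gate, are simple enough to sketch. This is an illustration under assumptions, not the library's implementation: `is_denied` and `is_low_signal` are hypothetical names, matching is exact and case-sensitive here, and an empty sample is treated as low signal.

```python
from collections import Counter


def is_denied(table: str, column: str, deny_columns: list[str]) -> bool:
    """Qualified entries deny one exact column; bare entries deny the name everywhere."""
    for entry in deny_columns:
        if "." in entry:
            if entry == f"{table}.{column}":
                return True
        elif entry == column:
            return True
    return False


def is_low_signal(values: list, threshold: float = 0.99) -> bool:
    """True when nulls, or a single non-null value, reach the threshold share."""
    if not values:
        return True  # assumption of this sketch: nothing profiled means nothing to show
    non_null = [v for v in values if v is not None]
    null_share = (len(values) - len(non_null)) / len(values)
    if null_share >= threshold:
        return True
    # Mode frequency over the non-null distribution only.
    mode_count = Counter(non_null).most_common(1)[0][1]
    return mode_count / len(non_null) >= threshold


print(is_denied("customers", "ssn", ["ssn"]))
print(is_low_signal([0] * 995 + [1, 2, 3, 4, 5]))
```

A bare `"ssn"` entry hides the column on every table, while `"customers.ssn"` leaves a same-named column on other tables visible, which is why qualifying is the safer choice for one-table scope.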
162
+
163
+ **Intent / SQL shape (analytical subset)**
164
+
165
+ - **Queries:** `SELECT` only (enforced with pattern checks and dialect parsing). **CTEs** reuse the same intent model as the outer query.
166
+ - **Joins:** **foreign keys** and precomputed paths, plus **profiled value-overlap** links between columns when samples agree; when several valid paths could link the same tables, one coherent path is chosen for the whole query. **CTEs** act as virtual tables in join discovery; self-joins use CTEs instead of repeating the same base table.
167
+ - **Select list:** bare columns, **aggregates** (`COUNT`, `SUM`, `AVG`, `MIN`, `MAX`, etc.), **arithmetic and string expressions** where the schema allows them, **`DISTINCT`**, and **scalar functions** subject to column metadata.
168
+ - **Filters / boolean logic:** comparisons, **`AND` / `OR`**, **`IN`**, **`LIKE`**; **`ILIKE` / `NOT ILIKE` on PostgreSQL only** (intent stays dialect-agnostic; SQL rendering differs). Null and boolean values are normalized in the deterministic intent and SQL preparation chain.
169
+ - **`BETWEEN`** in intent is **decomposed** into a pair of comparable predicates.
170
+ - **Grouping / ordering:** **`GROUP BY`**, **`HAVING`** (aggregate-aware), **`ORDER BY`**, **`LIMIT`**; rules tie **grain** (row-level vs grouped) to aggregates and grouped columns.
171
+ - **Dates:** structured **`date_window`** (anchor unit + offset) and **date-difference** filters between columns where supported. Calendar grains supplied by the LLM are accepted in colloquial forms — `monthly`, `quarterly`, `yearly`, `annual`, `daily`, `weekly`, `hourly`, plurals, and short abbreviations like `yr`, `mo`, `qtr`, `wk` — and are folded internally to a single canonical token before SQL is rendered, so dialect emit functions see only `day` / `week` / `month` / `quarter` / `year` / `hour` / `minute` / `second`.
172
+ - **Windows:** `ROW_NUMBER`, `RANK`, `DENSE_RANK`, windowed `SUM`/`AVG`, `LAG`, `LEAD`, `FIRST_VALUE`, and `LAST_VALUE` on select columns (main query and CTEs).
173
+ - **`CASE` / `WHEN`:** only in the **select list** in the intent model (not in `WHERE` / `HAVING`).
174
+ - **Arrays / lists:** membership-style filters; SQL uses dialect-appropriate forms; optional **UNNEST / EXPLODE-style** expansion in CTE select lists for typed array columns.
175
+ - **Metadata:** **UNIQUE** (and related) when reflection or DDL exposes it, for ranking “human readable” identifiers.
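Two of the normalizations in the list above, `BETWEEN` decomposition and calendar-grain folding, can be sketched as follows. The function names, the tuple shape for predicates, and the alias table are assumptions built from the examples in the text; the library's internal representation and its full alias list may differ.

```python
# Colloquial forms taken from the examples above, mapped to canonical tokens.
_GRAIN_ALIASES = {
    "daily": "day", "weekly": "week", "wk": "week",
    "monthly": "month", "mo": "month",
    "quarterly": "quarter", "qtr": "quarter",
    "yearly": "year", "annual": "year", "yr": "year",
    "hourly": "hour",
}
_CANONICAL = {"day", "week", "month", "quarter", "year", "hour", "minute", "second"}


def decompose_between(column: str, low, high) -> list[tuple]:
    """Rewrite `column BETWEEN low AND high` as two inclusive comparisons."""
    return [(column, ">=", low), (column, "<=", high)]


def canonical_grain(token: str) -> str:
    """Fold a colloquial grain such as 'Monthly' or 'yrs' to one canonical token."""
    key = token.strip().lower()
    key = _GRAIN_ALIASES.get(key, key)
    if key not in _CANONICAL and key.endswith("s"):
        singular = key[:-1]  # handle simple plurals such as 'days' or 'yrs'
        key = _GRAIN_ALIASES.get(singular, singular)
    if key not in _CANONICAL:
        raise ValueError(f"unknown calendar grain: {token!r}")
    return key


print(decompose_between("order_date", "2024-01-01", "2024-03-31"))
print(canonical_grain("Quarterly"), canonical_grain("yrs"))
```

Folding to a single token before rendering means each dialect emitter only has to handle the eight canonical grains.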
176
+
177
+ **Operational modes**
178
+
179
+ - **Interactive** — ask questions, accept/reject, results export; via **`run_interactive()`** or a programmatic **`PipelineSession`** (see Quickstart above and **[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**).
180
+ - **Seed warmup** — seed questions → gold intents → **deterministic expansion** (many operators, deduplicated) → validate/execute → NL question generation for new templates.
181
+ - **QSim** — reproducible synthetic questions from schema and profiles (seeded randomness).
182
+
183
+ ---
184
+
185
+ ## What it is not
186
+
187
+ - Not a **full SQL** or **stored-procedure** generator: no `UNION`/`INTERSECT`/`EXCEPT`, no correlated subqueries, no `EXISTS`, no `LATERAL`, no **DML/DDL**, no arbitrary **RIGHT/FULL OUTER** join policy in the constrained path.
188
+ - Not a substitute for **database security** (see **Database credentials and permissions** above).
189
+ - Not **schema-agnostic**: quality depends on **relationship signal** (declared FKs, inferred links, semantic profile edges), sensible types, and optional notes for domain language.
190
+
191
+ ---
192
+
193
+ ## How a question becomes SQL
194
+
195
+ 1. **Template match** — if a trusted pattern fits, reuse parameterized SQL (often **no** SQL LLM call).
196
+ 2. **Intent parse** — structured intent from the question + schema summary, then a long **deterministic repair chain**.
197
+ 3. **Join resolution** — choose among valid **FK paths** for the intent’s tables; disambiguation may use the LLM when multiple paths tie.
198
+ 4. **SQL generation & validation** — deterministic skeleton, injected joins, LLM fill **under constraints** where applicable, then **semantic validation**, **`SELECT`-only / forbidden-pattern checks**, **dialect AST** validation (**`pglast`** or **`sqlglot`**), and **`EXPLAIN`**.
199
+ 5. **Execute** (where the mode allows) and **learn** — accept → promote template trust; reject → record a **categorized** negative pattern (optionally after a short **rejection reason** when the UI or interactive loop asks for one).
200
+
201
+ ---
202
+
203
+ ## Validation (layers)
204
+
205
+ - **Safety / shape** — `SELECT`-only enforcement, configurable **forbidden SQL** substrings, then **dialect `ast_validate`** (**`pglast`** or **`sqlglot`**). **`EXPLAIN`** is used as an extra executability check. Intent handling is structured (contracts and parsers); a few **narrow string passes** still exist for SQL post-processing where AST coverage is not yet unified. Natural-language questions are not executed as raw SQL; they pass through structured intent and these gates, with values bound as parameters or validated literals rather than unconstrained interpolation into executable SQL (together with least-privilege roles, see _What it is not_).
206
+ - **Schema vs intent** — tables/columns/CTEs, **selectability**, access and sensitivity policy, window / CASE / array shapes, filter and HAVING ops per column, aggregate roles in select/HAVING/ORDER BY, scalar function typing, filter value types vs column types, null/date-window/date-diff rules.
207
+ - **Semantic consistency** — grouped queries require proper aggregation; contradictions and impossible HAVING; **grain** alignment.
208
+ - **Joins** — paths use the FK graph, virtual CTE bridges, and stored semantic-profile edges where applicable; guarded path avoids ad-hoc join guessing.
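The first gate in the list above (`SELECT`-only plus forbidden substrings) can be approximated with a string-level check like the sketch below. It is a simplified illustration, not the library's validator: `check_select_only` and the `FORBIDDEN` tuple are hypothetical, and a check like this can misfire on semicolons or keywords inside string literals, which is one reason an AST pass and `EXPLAIN` follow it.

```python
import re

# Hypothetical deny list; the library's forbidden substrings are configurable.
FORBIDDEN = ("pg_sleep", "into outfile")


def check_select_only(sql: str, forbidden: tuple[str, ...] = FORBIDDEN) -> list[str]:
    """Return a list of problems; an empty list means the cheap checks passed."""
    problems = []
    body = sql.strip().rstrip(";").strip()
    if ";" in body:
        problems.append("multiple statements")
    first = re.match(r"[A-Za-z]+", body)
    if not first or first.group(0).upper() not in {"SELECT", "WITH"}:
        problems.append("statement must start with SELECT or WITH")
    lowered = body.lower()
    for needle in forbidden:
        if needle.lower() in lowered:
            problems.append(f"forbidden substring: {needle}")
    return problems


print(check_select_only("WITH t AS (SELECT 1) SELECT * FROM t"))
print(check_select_only("DELETE FROM orders"))
```

Returning every problem rather than stopping at the first gives a reviewer the full picture in one pass.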
209
+
210
+ ---
211
+
212
+ ## Learning and reuse
213
+
214
+ - **Accepted templates** — intent fingerprint, parameterized SQL, optional example question, **trust** that rises with validation and falls with rejection.
215
+ - **Rejected templates** (“negative memory”) — failures are stored with **categories** (and optional user **rejection reasons** when collected) so similar bad intents are discouraged on later turns.
216
+ - **Loader reconciliation** — when you open an existing template file, rows that no longer match the current graph (missing tables, columns, or join segments) are pruned, negative memory for removed rejects is cleared, and stale failure-log rows from older hashes are filtered before the store is saved for the current scope. Large fingerprint jumps are handled by the **migration** path above, not only this incremental prune.
217
+ - Persistence lives next to the manifest under your **`artifacts_dir`** tree (**[API_REFERENCE.md](https://github.com/akul-ameya/aetherdialect/blob/main/API_REFERENCE.md)**); back it up or reset it as described under **Artifacts and migration**.
218
+
219
+ ---
220
+
221
+ ## Seed warmup (brief)
222
+
223
+ 1. Parse each **seed** line into a gold intent.
224
+ 2. **Expand** with a fixed set of **deterministic operators** (filters, aggregates, joins, etc.), **deduplicated** by body similarity across depths.
225
+ 3. Classify gold rows against the template store; resolve joins per table set; validate and **execute** the full pool (subject to caps).
226
+ 4. **Stratified sampling** on execute successes, then an LLM step that generates up to **three** question paraphrases from the SQL, filtered for **realism**.
227
+
228
+ Warmup loads at most **500** seed lines per run (a fixed internal cap; larger files are truncated). **`run_warmup_preflight`** runs the same execute path without the question-generation LLM step or template persistence and writes **`warmup_preflight_report_v{m}.json`**.
229
+
230
+ ---
231
+
232
+ ## QSim (brief)
233
+
234
+ Generates **reproducible** question lists from the schema and profiled values. Same seed → same output. Use for regression-style testing or dataset building.
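The "same seed, same output" property usually comes from routing every random choice through one seeded generator and iterating inputs in a stable order. The sketch below shows that idea only; `generate_questions`, the toy schema, and the question wording are hypothetical and unrelated to QSim's actual generator.

```python
import random


def generate_questions(schema: dict[str, list[str]], seed: int, n: int) -> list[str]:
    """Build n toy questions using a private, seeded random generator."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    tables = sorted(schema)    # sorted so dict ordering cannot change the result
    questions = []
    for _ in range(n):
        table = rng.choice(tables)
        column = rng.choice(sorted(schema[table]))
        agg = rng.choice(["count", "average", "maximum"])
        questions.append(f"What is the {agg} of {column} in {table}?")
    return questions


toy_schema = {"orders": ["amount", "quantity"], "customers": ["age", "tenure_days"]}
first = generate_questions(toy_schema, seed=7, n=5)
second = generate_questions(toy_schema, seed=7, n=5)
print(first == second)
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the output stable even when other code in the process draws random numbers.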