sql-code-graph 0.2.1__tar.gz → 0.3.0__tar.gz

Files changed (173)
  1. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/ARCHITECTURE_REVIEW.md +667 -16
  2. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/PKG-INFO +50 -4
  3. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/README.md +49 -3
  4. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/docs/AIRBNB_PARSE_REPORT.md +3 -3
  5. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/docs/cli.md +36 -1
  6. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/plan/progress.txt +163 -5
  7. sql_code_graph-0.3.0/plan/sprint_next.md +1203 -0
  8. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/pyproject.toml +1 -1
  9. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/__init__.py +1 -1
  10. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/cli/commands/db.py +48 -0
  11. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/cli/commands/gain.py +86 -14
  12. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/cli/commands/index.py +5 -0
  13. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/cli/commands/install.py +21 -7
  14. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/cli/commands/mcp.py +1 -0
  15. sql_code_graph-0.3.0/src/sqlcg/cli/commands/uninstall.py +213 -0
  16. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/cli/main.py +26 -3
  17. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/core/kuzu_backend.py +22 -20
  18. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/indexer/indexer.py +21 -3
  19. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/parsers/ansi_parser.py +18 -1
  20. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/parsers/base.py +17 -1
  21. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/parsers/bigquery_parser.py +2 -2
  22. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/parsers/snowflake_parser.py +3 -2
  23. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/server/models.py +44 -0
  24. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/server/tools.py +149 -16
  25. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/e2e/test_mcp_tools.py +6 -6
  26. sql_code_graph-0.3.0/tests/fixtures/bigquery/.gitkeep +0 -0
  27. sql_code_graph-0.3.0/tests/fixtures/synthetic/base_tables.sql +26 -0
  28. sql_code_graph-0.3.0/tests/fixtures/synthetic/reports.sql +20 -0
  29. sql_code_graph-0.3.0/tests/fixtures/synthetic/views.sql +23 -0
  30. sql_code_graph-0.3.0/tests/integration/snowflake/__init__.py +0 -0
  31. sql_code_graph-0.3.0/tests/integration/snowflake/test_insert_select.py +70 -0
  32. sql_code_graph-0.3.0/tests/unit/snowflake/__init__.py +0 -0
  33. sql_code_graph-0.3.0/tests/unit/snowflake/test_scripting_noise.py +38 -0
  34. sql_code_graph-0.3.0/tests/unit/test_aggregator.py +44 -0
  35. sql_code_graph-0.3.0/tests/unit/test_base_parser.py +102 -0
  36. sql_code_graph-0.3.0/tests/unit/test_cli_help.py +59 -0
  37. sql_code_graph-0.3.0/tests/unit/test_db_info.py +130 -0
  38. sql_code_graph-0.3.0/tests/unit/test_gain_ratio.py +117 -0
  39. sql_code_graph-0.3.0/tests/unit/test_index_cmd.py +132 -0
  40. sql_code_graph-0.3.0/tests/unit/test_indexer_progress.py +58 -0
  41. sql_code_graph-0.3.0/tests/unit/test_indexer_quality.py +83 -0
  42. sql_code_graph-0.3.0/tests/unit/test_install_message.py +70 -0
  43. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_kuzu_backend.py +29 -0
  44. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_metrics.py +1 -1
  45. sql_code_graph-0.3.0/tests/unit/test_parse_quality.py +69 -0
  46. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_parser.py +7 -1
  47. sql_code_graph-0.3.0/tests/unit/test_submit_feedback.py +64 -0
  48. sql_code_graph-0.3.0/tests/unit/test_tools_hints.py +175 -0
  49. sql_code_graph-0.3.0/tests/unit/test_tools_warnings.py +39 -0
  50. sql_code_graph-0.3.0/tests/unit/test_uninstall.py +224 -0
  51. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.claude/agents/api-documenter.md +0 -0
  52. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.claude/agents/architect-planner.md +0 -0
  53. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.claude/agents/architect-reviewer.md +0 -0
  54. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.claude/agents/code-reviewer.md +0 -0
  55. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.claude/agents/developer.md +0 -0
  56. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.claude/agents/plan-reviewer.md +0 -0
  57. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.github/ISSUE_TEMPLATE/bug_report.yml +0 -0
  58. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.github/ISSUE_TEMPLATE/config.yml +0 -0
  59. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.github/ISSUE_TEMPLATE/feature_request.yml +0 -0
  60. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.github/workflows/benchmark.yml +0 -0
  61. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.github/workflows/e2e-tests.yml +0 -0
  62. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.github/workflows/release.yml +0 -0
  63. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.github/workflows/test.yml +0 -0
  64. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.gitignore +0 -0
  65. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.pre-commit-config.yaml +0 -0
  66. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/.sqlcgignore +0 -0
  67. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/CHANGELOG.md +0 -0
  68. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/main.py +0 -0
  69. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/plan/WORKFLOW.md +0 -0
  70. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/plan/blueprint.md +0 -0
  71. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/plan/phase10_deployment.md +0 -0
  72. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/plan/sqlcg.md +0 -0
  73. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/pyrightconfig.json +0 -0
  74. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/scripts/generate_cli_docs.sh +0 -0
  75. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/__main__.py +0 -0
  76. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/cli/__init__.py +0 -0
  77. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/cli/commands/__init__.py +0 -0
  78. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/cli/commands/analyze.py +0 -0
  79. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/cli/commands/find.py +0 -0
  80. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/cli/commands/git.py +0 -0
  81. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/cli/commands/report.py +0 -0
  82. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/cli/commands/watch.py +0 -0
  83. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/core/__init__.py +0 -0
  84. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/core/config.py +0 -0
  85. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/core/graph_db.py +0 -0
  86. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/core/jobs.py +0 -0
  87. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/core/neo4j_backend.py +0 -0
  88. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/core/queries.py +0 -0
  89. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/core/schema.cypher +0 -0
  90. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/core/schema.py +0 -0
  91. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/indexer/__init__.py +0 -0
  92. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/indexer/dbt_adapter.py +0 -0
  93. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/indexer/walker.py +0 -0
  94. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/indexer/watcher.py +0 -0
  95. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/lineage/__init__.py +0 -0
  96. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/lineage/aggregator.py +0 -0
  97. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/lineage/schema_resolver.py +0 -0
  98. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/metrics/__init__.py +0 -0
  99. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/metrics/store.py +0 -0
  100. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/parsers/__init__.py +0 -0
  101. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/parsers/postgres_parser.py +0 -0
  102. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/parsers/registry.py +0 -0
  103. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/parsers/tsql_parser.py +0 -0
  104. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/server/__init__.py +0 -0
  105. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/server/exceptions.py +0 -0
  106. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/server/server.py +0 -0
  107. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/utils/__init__.py +0 -0
  108. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/utils/hashing.py +0 -0
  109. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/utils/ignore.py +0 -0
  110. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/src/sqlcg/utils/logging.py +0 -0
  111. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/__init__.py +0 -0
  112. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/__init__.py +0 -0
  113. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/adversarial/200_join.sql +0 -0
  114. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/adversarial/500_union.sql +0 -0
  115. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/bench_indexer.py +0 -0
  116. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/conftest.py +0 -0
  117. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/golden_corpus/snowflake/case_normalization.sql +0 -0
  118. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/golden_corpus/snowflake/colon_cast.sql +0 -0
  119. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/golden_corpus/snowflake/colon_reserved_word.sql +0 -0
  120. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/golden_corpus/snowflake/copy_into.sql +0 -0
  121. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/golden_corpus/snowflake/create_procedure.sql +0 -0
  122. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/golden_corpus/snowflake/identifier_dynamic.sql +0 -0
  123. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/golden_corpus/snowflake/lateral_flatten.sql +0 -0
  124. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/golden_corpus/snowflake/qualify.sql +0 -0
  125. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/golden_corpus/snowflake/scripting_block.sql +0 -0
  126. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/golden_corpus/snowflake/three_part.sql +0 -0
  127. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/tpch/q01.sql +0 -0
  128. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/tpch/q02.sql +0 -0
  129. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/tpch/q03.sql +0 -0
  130. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/tpch/q04.sql +0 -0
  131. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/benchmarks/tpch/q05.sql +0 -0
  132. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/e2e/__init__.py +0 -0
  133. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/e2e/conftest.py +0 -0
  134. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/e2e/test_airbnb_e2e.py +0 -0
  135. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/e2e/test_cli_index.py +0 -0
  136. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/e2e/test_dwh_e2e.py +0 -0
  137. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/e2e/test_git_hook_install.py +0 -0
  138. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/e2e/test_watch.py +0 -0
  139. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/fixtures/airbnb/dim_hosts_cleansed.sql +0 -0
  140. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/fixtures/airbnb/dim_listings_cleansed.sql +0 -0
  141. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/fixtures/airbnb/fct_reviews.sql +0 -0
  142. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/fixtures/airbnb/mart_fullmoon_reviews.sql +0 -0
  143. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/fixtures/airbnb/raw_hosts.sql +0 -0
  144. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/fixtures/airbnb/raw_listings.sql +0 -0
  145. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/fixtures/airbnb/raw_reviews.sql +0 -0
  146. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/fixtures/airbnb/src_hosts.sql +0 -0
  147. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/fixtures/airbnb/src_listings.sql +0 -0
  148. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/fixtures/airbnb/src_reviews.sql +0 -0
  149. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/fixtures/jaffle_shop/customers.sql +0 -0
  150. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/fixtures/jaffle_shop/orders.sql +0 -0
  151. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/fixtures/jaffle_shop/raw_orders.sql +0 -0
  152. {sql_code_graph-0.2.1/tests/fixtures/synthetic → sql_code_graph-0.3.0/tests/fixtures/snowflake}/base_tables.sql +0 -0
  153. {sql_code_graph-0.2.1/tests/fixtures/synthetic → sql_code_graph-0.3.0/tests/fixtures/snowflake}/reports.sql +0 -0
  154. {sql_code_graph-0.2.1/tests/fixtures/synthetic → sql_code_graph-0.3.0/tests/fixtures/snowflake}/views.sql +0 -0
  155. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/integration/__init__.py +0 -0
  156. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/integration/test_cross_file_lineage.py +0 -0
  157. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/integration/test_dialect_matrix.py +0 -0
  158. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/integration/test_indexer_to_graph.py +0 -0
  159. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/__init__.py +0 -0
  160. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_branch_monitor.py +0 -0
  161. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_cli.py +0 -0
  162. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_config.py +0 -0
  163. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_data_models.py +0 -0
  164. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_git_hooks.py +0 -0
  165. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_graph_backend.py +0 -0
  166. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_index_flags.py +0 -0
  167. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_install.py +0 -0
  168. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_jobs.py +0 -0
  169. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_schema_resolver.py +0 -0
  170. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_server.py +0 -0
  171. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_walker.py +0 -0
  172. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/tests/unit/test_watcher.py +0 -0
  173. {sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/uv.lock +0 -0
@@ -1,7 +1,7 @@
1
1
  # Architecture Review — sql-code-graph (sqlcg)
2
2
 
3
3
  Blueprint version: v1.2 (May 2026)
4
- Review date: 2026-05-02
4
+ Review date: 2026-05-02 (updated 2026-05-05)
5
5
  Reviewer agent: architect-reviewer
6
6
 
7
7
  ---
@@ -276,6 +276,23 @@ Recommended resolution: Refactor `_classify` to use `match`/`case` (pure style,
276
276
  Leave the string `kind` values but extract them into a `QueryKind` `StrEnum` in
277
277
  `parsers/base.py` or `core/schema.py` to eliminate magic string duplication.
278
278
 
279
+ **Additional improvement from 10.2.7 research**: sqlglot exposes `exp.DML` as a base
280
+ class covering `Insert`, `Update`, `Delete`, and `Merge`. The current `_classify`
281
+ enumerates these individually, which caused `MERGE` to be missing from the original
282
+ DML filter proposal (caught and fixed in 10.2.7). The `match`/`case` refactor should
283
+ use `exp.DML` as a base branch to future-proof against new DML types sqlglot may add:
284
+
285
+ ```python
286
+ case stmt if isinstance(stmt, exp.DML) and not isinstance(stmt, exp.Copy):
287
+ # handles Insert, Update, Delete, Merge — and any future exp.DML subclass
288
+ ...
289
+ case exp.Select():
290
+ ...
291
+ ```
292
+
293
+ This ensures `_classify` inherits coverage from sqlglot's own type hierarchy rather
294
+ than requiring manual updates each time a new DML expression type is added.
295
+
279
296
  **Comment 8 — `parsers/base.py` line 263 and line 294: `_real_tables` and `_convert_table_expr_to_ref` use `elif`**
280
297
 
281
298
  Reviewer text: "same here match statements are much cleaner" / "match"
@@ -486,18 +503,28 @@ produces zero edges, which is indistinguishable from a column with no upstream s
486
503
  This directly undermines the "explicit failure mode inventory" strength described in
487
504
  section 2.4.
488
505
 
489
- ### 3.5 [MEDIUM] `SnowflakeParser._parse_scripting_file` applies a file-level heuristic that can false-positive
506
+ ### 3.5 ~~[MEDIUM]~~ **RESOLVED** `SnowflakeParser._parse_scripting_file` applies a file-level heuristic that can false-positive
507
+
508
+ > **Status: already fixed in code.** The token-aware `TokenType.BEGIN` check described
509
+ > below is implemented in `_has_scripting_block()` (lines 79–85 of
510
+ > `src/sqlcg/parsers/snowflake_parser.py`). This finding is closed.
490
511
 
491
- The heuristic `if _SCRIPTING_BLOCK.search(sql): return self._parse_scripting_file()`
492
- matches any file containing the word `BEGIN`. This includes:
512
+ ~~The heuristic `if _SCRIPTING_BLOCK.search(sql): return self._parse_scripting_file()`
513
+ matches any file containing the word `BEGIN`. This includes:~~
493
514
 
494
- - Comments: `-- BEGIN transaction isolation block`
495
- - String literals: `SELECT 'BEGIN' AS status_label FROM t`
496
- - Legitimate `BEGIN TRANSACTION` in non-scripting contexts
515
+ ~~- Comments: `-- BEGIN transaction isolation block`~~
516
+ ~~- String literals: `SELECT 'BEGIN' AS status_label FROM t`~~
517
+ ~~- Legitimate `BEGIN TRANSACTION` in non-scripting contexts~~
497
518
 
498
- A false positive routes an entire file through the regex DML extractor, which sets
519
+ ~~A false positive routes an entire file through the regex DML extractor, which sets
499
520
  `parse_failed=True` and `confidence=0.3` on all statements and drops all column
500
- lineage. This degrades quality on files that are fully parseable.
521
+ lineage. This degrades quality on files that are fully parseable.~~
522
+
523
+ The implemented fix uses `sqlglot.tokens.Tokenizer.from_dialect("snowflake").tokenize(sql)`
524
+ and checks for a `TokenType.BEGIN` token directly — string literals and comments are
525
+ never tokenised as `BEGIN`, so false positives on `SELECT 'BEGIN' AS x` or
526
+ `-- BEGIN block` are impossible. A regex fallback is retained only for the edge case
527
+ where tokenisation itself raises an exception.
501
528
 
502
529
  ### 3.6 [MEDIUM] `QueryNode` is a mutable dataclass but `LineageEdge` is frozen — inconsistency
503
530
 
@@ -586,12 +613,12 @@ Ranking uses: **Impact** (correctness/stability consequence if unaddressed) x
586
613
  | 2 | 3.4 — `_extract_column_lineage` bare `except Exception: continue` | HIGH | Low | Log the exception at WARNING with column name and file path; append to `ParsedFile.errors`; set `confidence=0.0` on the affected edge slot |
587
614
  | 3 | 3.2 — `resolve_pass2` re-opens file without error handling | HIGH | Low | Wrap `open()` in `try/except (FileNotFoundError, OSError)`; log a warning and return the pass-1 `ParsedFile` unchanged |
588
615
  | 4 | 3.3 — `SchemaResolver.as_dict()` cache is not thread-safe | HIGH | Medium | Document the single-threaded assumption explicitly in the class docstring; if `jobs.py` uses threads, use `threading.Lock` around cache mutation and read, or switch to a per-parse-job `SchemaResolver` instance |
589
- | 5 | 3.5 — `BEGIN` heuristic false-positives on comments and string literals | MEDIUM | Low | Use a token-aware check: call `sqlglot.tokenize(sql, dialect="snowflake")` and look for a `TokenType.BEGIN` token that is not inside a string or comment, rather than a raw regex |
616
+ | 5 | ~~3.5 — `BEGIN` heuristic false-positives~~ | ~~MEDIUM~~ | ~~Low~~ | **RESOLVED**: the token-aware `TokenType.BEGIN` check is already implemented in `_has_scripting_block()`. No action needed. |
590
617
  | 6 | 3.6 — `QueryNode` mutability inconsistency with frozen `LineageEdge` | MEDIUM | Low | Either add `frozen=True` to `QueryNode` and use `dataclasses.replace()` for pass-2 patching, or document the mutability contract explicitly in the class docstring |
591
618
  | 7 | 3.7 — No per-file timeout or SIGINT handling during indexing | MEDIUM | Medium | Add a configurable `--timeout-per-file` (default 30s) enforced via `concurrent.futures.ThreadPoolExecutor` with `future.result(timeout=N)`; handle `KeyboardInterrupt` in the walker loop to flush and exit cleanly |
592
619
  | 8 | 3.8 — `add_information_schema` is a silent no-op stub | MEDIUM | Low | Raise `NotImplementedError("--schema-from-info-schema is not yet implemented")` until the method is built; guard the CLI flag to surface this error immediately |
593
620
  | 9 | 3.10 — `acryl-datahub` upper bound missing on unstable module | LOW | Low | Change to `"acryl-datahub[sql-parsing]>=0.14.0,<0.15.0"` in `pyproject.toml` |
594
- | 10 | 3.12 — `_EMBEDDED_DML` regex greedy match across statement boundaries | LOW | Low | Add `re.MULTILINE`; test against a two-statement scripting block with no trailing semicolon on the last statement |
621
+ | 10 | ~~3.12 — `_EMBEDDED_DML` regex greedy match across statement boundaries~~ | ~~LOW~~ | ~~Low~~ | **SUPERSEDED by 10.2.7**: the regex is being replaced entirely with a sqlglot tokenizer + `exp.DML` base class filter. Adding `re.MULTILINE` is no longer the fix. |
595
622
  | 11 | 3.9 — sqlglot upper bound `<31.0` too narrow | LOW | Low | Widen to `<32.0`; add CHANGELOG review step to CI upgrade process |
596
623
  | 12 | 3.11 — MCP tools lack explicit error documentation | LOW | Low | Add a `Raises` section to each tool docstring listing `NotIndexedError`, `InvalidColumnRef`, etc.; define a small custom exception hierarchy in `server/exceptions.py` |
597
624
 
@@ -744,11 +771,22 @@ Four capabilities are missing and each requires non-trivial work:
744
771
  pipeline execution order model, which is not representable in a static file walker.
745
772
 
746
773
  2. **Stored procedure body full parsing.** Snowflake `$$...$$` blocks and T-SQL
747
- `BEGIN/END` bodies currently fall to regex extraction. A proper solution would
748
- use sqlglot's `dialect="snowflake"` parser on the extracted body after stripping
749
- the procedure wrapper. The blocker is that sqlglot classifies these as `exp.Command`
750
- and does not expose a stored-procedure-body parser. This is blueprint Gap 5,
751
- classified as a "design limitation" with "workaround only" resolution path.
774
+ `BEGIN/END` bodies currently fall to regex extraction. The blocker is
775
+ **dialect-specific**; this is not a uniform limitation:
776
+
777
+ - **Snowflake / Databricks**: sqlglot exposes the `$$...$$` body as an `exp.RawString`
778
+ node inside the `exp.Create` AST. The body is extractable, strippable of its
779
+ `BEGIN`/`END` wrapper, and re-parseable with the tokenizer + `exp.DML` filter
780
+ described in finding 10.2.7. This path is now viable and is part of the 10.2.7
781
+ implementation plan.
782
+ - **BigQuery**: sqlglot cannot fully parse `CREATE PROCEDURE ... BEGIN...END` for
783
+ BigQuery; the body lands as `exp.Command` inside `exp.Create`. For BigQuery, the
784
+ regex fallback (`_EMBEDDED_DML`) remains the only option. This is the true
785
+ "design limitation / workaround only" case.
786
+ - **T-SQL**: not yet verified — likely similar to BigQuery (body as `exp.Command`).
787
+
788
+ This is blueprint Gap 5. The Snowflake portion is now addressable via 10.2.7;
789
+ BigQuery procedure bodies remain a fundamental static-analysis limit.
752
790
 
753
791
  3. **COPY INTO column lineage.** Inferring column mapping from a stage file to a
754
792
  target table requires either an explicit column list in the COPY statement or an
@@ -902,3 +940,616 @@ have the GitHub repository configured as a trusted publisher:
902
940
  This is enforced via `needs: test` in the workflow YAML.
903
941
 
904
942
  Full implementation spec: `plan/phase10_deployment.md`
943
+
944
+ ---
945
+
946
+ ## 10. GitHub Issues Review — v0.2.1 Feedback (2026-05-05)
947
+
948
+ Review date: 2026-05-05
949
+ Source: GitHub issues #5 and #6, Warhorze/sql-code-graph
950
+ Tester environment: sqlcg 0.2.1, Snowflake dialect, 1457-file DWH corpus, WSL2/Linux, Claude Code MCP session
951
+
952
+ Both issues are marked `[feedback]` and were filed by the repo owner after a real
953
+ production session. They are high-signal because they reflect actual LLM-agent and
954
+ end-user experience, not hypothetical concerns.
955
+
956
+ ---
957
+
958
+ ### 10.1 Issue #6 — No clean uninstall / opt-out path
959
+
960
+ **Problem statement**: The tool installs side effects in three separate locations
961
+ (`~/.claude/settings.json`, `~/.sqlcg/`, `.git/hooks/post-checkout`) with no
962
+ corresponding removal command. A user who wants to remove the tool must discover and
963
+ manually undo all three locations.
964
+
965
+ **Architectural assessment**: The `sqlcg install` command was designed with symmetry in
966
+ mind (Phase 10), but only the install direction was implemented. The three side-effect
967
+ locations map directly to the three install operations:
968
+
969
+ 1. `sqlcg install` writes `mcpServers["sql-code-graph"]` to `~/.claude/settings.json`
970
+ 2. The user's KùzuDB lives at `~/.sqlcg/` (controlled by `SQLCG_DB_PATH` or the default
971
+ set in `config.py`)
972
+ 3. `sqlcg git install-hooks` writes `.git/hooks/post-checkout`
973
+
974
+ A `sqlcg uninstall` command is the natural symmetric counterpart and follows the exact
975
+ same pattern as `install.py`:
976
+
977
+ - Remove `mcpServers["sql-code-graph"]` from `~/.claude/settings.json` using atomic
978
+ `.tmp` + `os.replace` write (same guard as install)
979
+ - Delete `~/.sqlcg/` (or the path from `SQLCG_DB_PATH`) via `shutil.rmtree` with
980
+ a `--keep-db` flag to allow MCP deregistration without data loss
981
+ - Remove the `# sqlcg post-checkout hook` sentinel block from `.git/hooks/post-checkout`
982
+ in the specified repo (or `Path.cwd()` by default); if the hook file becomes empty
983
+ after stripping, delete it
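
The MCP deregistration step described above can be sketched with the standard library alone. This is a minimal illustration, not the package's actual `uninstall.py`: the function name `remove_mcp_entry` and its return convention are hypothetical, and only the `mcpServers["sql-code-graph"]` key and the `.tmp` + `os.replace` pattern come from the review text.

```python
import json
import os
from pathlib import Path


def remove_mcp_entry(settings_path: Path, server_name: str = "sql-code-graph") -> bool:
    """Remove one entry from "mcpServers"; return True if the file was changed.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    if not settings_path.exists():
        return False
    settings = json.loads(settings_path.read_text(encoding="utf-8"))
    servers = settings.get("mcpServers")
    if not isinstance(servers, dict) or server_name not in servers:
        return False  # nothing registered, leave the file untouched
    del servers[server_name]
    # Write to a sibling .tmp file first, then swap it in with os.replace,
    # so an interrupted write never leaves a half-written settings file.
    tmp_path = settings_path.with_name(settings_path.name + ".tmp")
    tmp_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    os.replace(tmp_path, settings_path)
    return True
```

Returning a boolean lets a caller report "nothing to remove" separately from "removed", which matters for a command that may be run more than once.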
984
+
985
+ **Risk**: LOW severity (no data is corrupted). HIGH usability impact — an unclean
986
+ opt-out path erodes trust and makes the tool feel invasive.
987
+
988
+ **New finding (10.A)**: The three side-effect locations are not documented anywhere in
989
+ `--help` or the README. Even without `sqlcg uninstall`, documenting them explicitly
990
+ would reduce user friction when opting out.
991
+
992
+ ---
993
+
994
+ ### 10.2 Issue #5 — LLM Agent Experience: Silent Failures and False Positives
995
+
996
+ This is a compound issue with nine distinct sub-problems. Each is assessed separately
997
+ below in order of severity.
998
+
999
+ ---
1000
+
1001
+ #### 10.2.1 [CRITICAL] LLM cannot self-recover from package/binary name mismatch
1002
+
1003
+ The PyPI package is `sql-code-graph`; the binary is `sqlcg`. When an LLM agent
1004
+ invokes the MCP server by its package name (`sql-code-graph`) it gets "executable not
1005
+ found" and cannot recover without human intervention.
1006
+
1007
+ The `mcp setup` / `sqlcg install` JSON already writes the correct invocation
1008
+ (`uvx sql-code-graph` or `sqlcg`), but the LLM does not read its own MCP settings
1009
+ file — it relies on the MCP server's tool descriptions to understand how to invoke it.
1010
+
1011
+ **Root cause**: No tool or `--help` output names the binary explicitly with context.
1012
+ The `list_dialects_and_repos` tool description could carry this hint, but currently
1013
+ does not.
1014
+
1015
+ **Resolution**: Add a one-line note to the `sqlcg mcp setup` JSON output and to the
1016
+ `list_dialects_and_repos` / `index_repo` tool docstrings: "Binary is `sqlcg`; PyPI
1017
+ package is `sql-code-graph`." This is a docstring-only change with no code impact.
1018
+
1019
+ ---
1020
+
1021
+ #### 10.2.2 [CRITICAL] No workflow order surfaced in `--help` or tools
1022
+
1023
+ The required sequence (`db init` -> `index <path>` -> `git install-hooks`) is not
1024
+ implied by `--help`, `mcp setup`, or any tool description. During the test session
1025
+ the LLM jumped directly to `find`/`analyze` and received empty results for the entire
1026
+ session because `db init` and `index` were never run.
1027
+
1028
+ The `watch` command currently has no initialization guard: it calls
1029
+ `backend.init_schema()` internally (correct), but `sqlcg watch` on an empty database
1030
+ with no index will start the watcher, produce no results, and give no indication that
1031
+ the database has zero content.
1032
+
1033
+ **Resolution**:
1034
+
1035
+ 1. Add a `QUICK START` block to the top-level `--help` text in `cli/main.py`
1036
+ (within the `Typer` `help=` parameter string), listing the three required steps
1037
+ in order. `typer` renders this in the `--help` output verbatim.
1038
+
1039
+ 2. In every tool that calls `_assert_indexed()`, surface the error message as:
1040
+ "No repos indexed. Run `sqlcg db init` then `sqlcg index <path>` first."
1041
+ (The current `NotIndexedError` message already says this but uses the CLI form;
1042
+ confirm it matches the actual binary name.)
1043
+
1044
+ 3. The `watch` command is fine architecturally — it calls `init_schema()` before
1045
+ starting the observer — but it should print a warning if zero repos are indexed
1046
+ after the initial full index: "Warning: 0 tables indexed. Check that the path
1047
+ contains .sql files and the dialect is correct."
1048
+
1049
+ ---
1050
+
1051
+ #### 10.2.3 [HIGH] `db info` gives no warning when `SqlColumn: 0`
1052
+
1053
+ `db info` shows node counts for all labels including `SqlColumn`. When `SqlColumn: 0`,
1054
+ the `trace_column_lineage` and `get_downstream_dependencies` / `get_upstream_dependencies`
1055
+ tools are completely non-functional — they will always return empty results. The user
1056
+ cannot distinguish "column not found" from "column graph not built."
1057
+
1058
+ The current `db info` implementation (in `db.py`) iterates all `NodeLabel` values and
1059
+ prints counts without any interpretation. A zero count on `SqlColumn` is printed
1060
+ identically to a non-zero count.
1061
+
1062
+ **Resolution**: After printing counts, add a structured health check section to
1063
+ `db info` output:
1064
+
1065
+ - If `SqlColumn == 0` and `SqlQuery > 0`: print a yellow warning: "Column lineage
1066
+ not available. Tools `trace_column_lineage`, `get_downstream_dependencies`, and
1067
+ `get_upstream_dependencies` will return empty results."
1068
+ - If `SqlQuery == 0` and `Repo > 0`: print a yellow warning: "No queries indexed.
1069
+ Run `sqlcg index <path>` to populate the graph."
1070
+ - If `Repo == 0`: print a red error: "Database is empty. Run `sqlcg db init` and
1071
+ `sqlcg index <path>` first."
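
The three health-check rules above can be expressed as a small pure function over the label counts. This is a sketch under assumptions: `health_warnings` and the `(severity, message)` tuple shape are hypothetical, and the real `db info` command may structure its output differently. The label names and messages are taken from the rules as written.

```python
def health_warnings(counts: dict[str, int]) -> list[tuple[str, str]]:
    """Map node-label counts to (severity, message) pairs.

    Hypothetical helper; missing labels are treated as a count of zero.
    """
    repos = counts.get("Repo", 0)
    queries = counts.get("SqlQuery", 0)
    columns = counts.get("SqlColumn", 0)

    findings: list[tuple[str, str]] = []
    if columns == 0 and queries > 0:
        findings.append((
            "warning",
            "Column lineage not available. Tools `trace_column_lineage`, "
            "`get_downstream_dependencies`, and `get_upstream_dependencies` "
            "will return empty results.",
        ))
    if queries == 0 and repos > 0:
        findings.append((
            "warning",
            "No queries indexed. Run `sqlcg index <path>` to populate the graph.",
        ))
    if repos == 0:
        findings.append((
            "error",
            "Database is empty. Run `sqlcg db init` and `sqlcg index <path>` first.",
        ))
    return findings
```

Keeping the rules in a function that takes plain counts means the same logic could feed both the CLI output and a `warnings` field on the MCP result, without either caller touching the database twice.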
1072
+
1073
+ This is also relevant to the MCP server: `list_dialects_and_repos` could append
1074
+ a `warnings` field to its `DialectRepoResult` listing health status, so Claude
1075
+ can surface it proactively.
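The three health-check rules above can be sketched as a pure function over the label counts. This is a minimal illustration, not the `db.py` implementation: the function name `health_warnings`, the `(severity, message)` tuple shape, and the choice to report only the most fundamental problem are assumptions; the message texts are taken from the bullets above.

```python
def health_warnings(counts: dict[str, int]) -> list[tuple[str, str]]:
    """Interpret node counts; return (severity, message) pairs (hypothetical shape)."""
    repo = counts.get("Repo", 0)
    query = counts.get("SqlQuery", 0)
    column = counts.get("SqlColumn", 0)

    # Check the most fundamental prerequisite first so only one message is shown.
    if repo == 0:
        return [("error", "Database is empty. Run `sqlcg db init` and "
                          "`sqlcg index <path>` first.")]
    if query == 0:
        return [("warning", "No queries indexed. Run `sqlcg index <path>` "
                            "to populate the graph.")]
    if column == 0:
        return [("warning", "Column lineage not available. Tools "
                            "`trace_column_lineage`, `get_downstream_dependencies`, and "
                            "`get_upstream_dependencies` will return empty results.")]
    return []
```

The same list could feed both the CLI (mapping `error` to red and `warning` to yellow) and a `warnings` field on the MCP result, so the two surfaces stay in agreement.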
1076
+
1077
+ ---
1078
+
1079
+ #### 10.2.4 [HIGH] `analyze unused` returns 100% false positives for external consumers
1080
+
1081
+ `analyze unused` returned ~100 `IA_ANALYTICS` / `IA_TABLEAU` / `IA_BUSINESSOBJECTS`
1082
+ views as "unused" because these are consumer-facing views queried by Tableau and BI
1083
+ tools — external consumers with no SQL references within the indexed corpus.
1084
+
1085
+ The current query in `analyze.py`:
1086
+ ```
1087
+ MATCH (t:SqlTable) WHERE NOT (t)<-[:SELECTS_FROM]-() RETURN t.qualified
1088
+ ```
1089
+ is logically correct for the closed-world assumption (only SQL files in the indexed
1090
+ corpus), but this assumption is never stated to the user.
1091
+
1092
+ **Resolution**:
1093
+
1094
+ 1. **Medallion-architecture-aware classification.** In a Bronze/Silver/Gold (medallion)
1095
+ warehouse, tables at each layer have different "unused" semantics:
1096
+
1097
+ | Layer | Typical schema prefixes | Expected consumption pattern | "Unused" meaning |
1098
+ |---|---|---|---|
1099
+ | Bronze / Raw | `RAW_`, `STG_`, `STAGE_` | Loaded by COPY INTO / ELT jobs | Bug if unused — no ETL reads it |
1100
+ | Silver / Intermediate | `INT_`, `PREP_`, `DWH_`, `BA_` | Read by ETL INSERT/MERGE | Bug if unused |
1101
+ | Gold / Consumption | `DIM_`, `FACT_`, `MART_`, `IA_`, `RPT_` | Queried by BI tools externally | **Expected** — external consumers have no SQL in the corpus |
1102
+
1103
+ `analyze unused` should tag each result with its inferred layer so the LLM can
1104
+ distinguish "this Gold view has no SQL consumers (expected, since BI tools query it)" from
1104
+ "this Silver intermediate table has no SQL consumers, so it is likely a dead table."
1106
+
1107
+ Implementation: after the Cypher query, classify each result by matching its schema
1108
+ prefix against a configurable tier map (with sensible defaults). Surface the tier
1109
+ in the CLI output column and in the MCP tool's result model. The MCP tool docstring
1110
+ must explain the tier model so Claude can give informed recommendations without
1111
+ the user needing to explain their naming conventions.
1112
+
1113
+ 2. Add a `--exclude-schema` flag to `analyze unused` for manual exclusion of known
1114
+ consumer schemas. The flag description must explain **when to use it**: "Use to
1115
+ exclude schemas whose tables are consumed externally (e.g., by BI tools or APIs)
1116
+ and will always appear unused within the indexed SQL corpus."
1117
+
1118
+ The MCP tool version of this flag (if exposed) must carry the same explanation in
1119
+ its parameter docstring so the LLM knows when to apply it autonomously.
1120
+
1121
+ 3. Always append a closed-world caveat to results: "Note: 'unused' means no SQL file
1122
+ in the indexed corpus selects from this table. External consumers (Tableau, BI tools,
1123
+ APIs) are not visible to this tool." This caveat must appear in: CLI output, MCP
1124
+ tool docstring, and `analyze unused --help`.
1125
+
1126
+ 4. The same tier classification and caveat applies to the MCP server's future
1127
+ `find_unused_tables` tool (if planned).
1128
+
1129
+ ---
1130
+
1131
+ #### 10.2.5 [HIGH] `analyze impact` vs `find pattern` return inconsistent results for the same table
1132
+
1133
+ For table `BA.WTFV_VOORRAAD_DAGSTAND_IGDC`, `analyze impact` returned only DDL files
1134
+ while `find pattern` additionally found an ETL `INSERT INTO` in `etl/sql/fact/`. The
1135
+ two commands should return overlapping or identical results for this query.
1136
+
1137
+ **Root cause analysis**: `analyze impact` uses the `SELECTS_FROM` relationship to find
1138
+ queries that reference the table. If the ETL `INSERT INTO` statement was indexed as a
1139
+ query with `sources` pointing to `BA.WTFV_VOORRAAD_DAGSTAND_IGDC` via a `SELECTS_FROM`
1140
+ edge, it would appear in both commands. The inconsistency suggests one of:
1141
+
1142
+ (a) The ETL INSERT statement was not indexed with the correct `SELECTS_FROM` edges (the
1143
+ parser did not extract table sources for it), or
1144
+
1145
+ (b) The INSERT was indexed under a different table name/qualified form than the DDL
1146
+ table node, causing the relationship to point to a different node.
1147
+
1148
+ **Resolution**: This is a parser correctness issue, not a CLI design issue. The fix
1149
+ is to verify that `INSERT INTO ... SELECT FROM <table>` correctly emits a `SELECTS_FROM`
1150
+ edge from the `SqlQuery` node to the `SqlTable` node. Add a test fixture with a known
1151
+ INSERT-SELECT pattern and assert that `analyze impact` finds it. The diagnostic step
1152
+ is: run `execute_cypher("MATCH (q:SqlQuery)-[:SELECTS_FROM]->(t:SqlTable {qualified:
1153
+ 'BA.WTFV_VOORRAAD_DAGSTAND_IGDC'}) RETURN q.id LIMIT 10")` and compare to
1154
+ `find pattern "WTFV_VOORRAAD_DAGSTAND_IGDC"`.
1155
+
1156
+ New finding: **the impact/pattern inconsistency is a measurable regression test gap**.
1157
+ The `_upsert_parsed_file` path must create `SELECTS_FROM` edges for INSERT statements,
1158
+ not only SELECT statements. Verify this is handled.
1159
+
1160
+ ---
1161
+
1162
+ #### 10.2.6 [MEDIUM] Feedback loop has no false-negative path
1163
+
1164
+ `submit_feedback` only fires when there are results to rate. The most valuable signal —
1165
+ empty result when the user expected one — is never collected. In the test session,
1166
+ `trace_column_lineage` was called 7 times (all returning empty) and 0 feedback samples
1167
+ were collected.
1168
+
1169
+ Additionally, `execute_cypher` was the #1 MCP tool with 15 calls vs
1170
+ `trace_column_lineage` at 7. A high `execute_cypher`/`high-level-tool` ratio is a
1171
+ proxy signal that the high-level tools are failing and the LLM is falling back to raw
1172
+ Cypher. This ratio is not surfaced in `sqlcg gain`.
1173
+
1174
+ **Resolution**:
1175
+
1176
+ 1. In `trace_column_lineage`, `get_downstream_dependencies`, and
1177
+ `get_upstream_dependencies`: when the result `lineage` / `nodes` list is empty,
1178
+ include a `hint` field in the returned model: "If you expected results, check that
1179
+ `db info` shows SqlColumn > 0. Submit feedback with `submit_feedback` tool if this
1180
+ was a false negative."
1181
+
1182
+ 2. In `submit_feedback`, add a `FN` (false negative) label to the allowed set alongside
1183
+ `TP` and `FP`. Currently only `TP` and `FP` are valid labels.
1184
+
1185
+ 3. In `sqlcg gain`, add an `execute_cypher / total_calls` ratio section. A ratio above
1186
+ 0.3 (more than 30% of calls are raw Cypher) is a signal that the high-level
1187
+ abstractions are not working.
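To make items 2 and 3 concrete, here is a small sketch of the label check and the ratio computation. It assumes per-tool call counts are available as a `Counter`; the names `ALLOWED_LABELS`, `validate_label`, and `cypher_ratio` are illustrative and not taken from `metrics/store.py`.

```python
from collections import Counter

ALLOWED_LABELS = {"TP", "FP", "FN"}  # FN is the proposed addition
CYPHER_RATIO_THRESHOLD = 0.3         # threshold stated in item 3


def validate_label(label: str) -> str:
    """Reject feedback labels outside the allowed set."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label must be one of {sorted(ALLOWED_LABELS)}, got {label!r}")
    return label


def cypher_ratio(call_counts: Counter) -> float:
    """Share of all MCP calls that were raw `execute_cypher` calls."""
    total = sum(call_counts.values())
    return call_counts["execute_cypher"] / total if total else 0.0


# Counts from the test session described above: 15 raw Cypher calls vs 7 lineage calls.
session = Counter({"execute_cypher": 15, "trace_column_lineage": 7})
ratio = cypher_ratio(session)
over_threshold = ratio > CYPHER_RATIO_THRESHOLD
```

If those two tools were the only ones called, the session ratio would be 15/22, well above the 0.3 threshold.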
1188
+
1189
+ ---
1190
+
1191
+ #### 10.2.7 [HIGH] Parser: noise statements in scripting blocks
1192
+
1193
+ `ALTER WAREHOUSE IDENTIFIER($var) SET WAREHOUSE_SIZE = 'X-Large'` is a standard
1194
+ Snowflake DDL rebuild pattern that fails to parse and falls back to `exp.Command`.
1195
+ `CALL MA.MSSPR_*()` stored procedure calls similarly produce noise. These were the
1196
+ two observed instances, but scripting blocks can contain many other non-DML statement
1197
+ types: `SET variable = value`, `LET x := y`, `EXECUTE IMMEDIATE`, `RETURN`, `RAISE`,
1198
+ `OPEN/FETCH/CLOSE` (cursor ops), `TRUNCATE TABLE`, and any `CREATE/DROP/ALTER` inside
1199
+ a block. Enumerating them per-dialect is a maintenance trap.
1200
+
1201
+ **Root cause**: `_EMBEDDED_DML` is a regex applied to the raw SQL text. It does not
1202
+ understand statement boundaries (semicolons inside string literals break it — finding
1203
+ 3.12) and has no way to classify what it captures. Any statement fragment that leaks
1204
+ through gets handed to `sqlglot.parse()` and produces a noisy `exp.Command` error.
1205
+
1206
+ **Resolution**: Replace the `_EMBEDDED_DML` regex extraction with a two-step
1207
+ sqlglot-native approach in `SnowflakeParser._parse_scripting_file`:
1208
+
1209
+ 1. **Split on statement boundaries** using `sqlglot.tokenize()` + semicolon tokens.
1210
+ The tokenizer handles semicolons inside string literals correctly — no regex needed.
1211
+ 2. **Classify each chunk** using sqlglot's built-in `exp.DML` base class:
1212
+
1213
+ ```python
1214
+ isinstance(stmt, (exp.DML, exp.Select)) and not isinstance(stmt, exp.Copy)
1215
+ ```
1216
+
1217
+ `exp.DML` is a sqlglot base class that covers `Delete`, `Insert`, `Update`, and
1218
+ `Merge` in a single isinstance check — no per-statement enumeration needed.
1219
+ `exp.Select` is added separately (it inherits from `Query`, not `DML`).
1220
+ `exp.Copy` (Snowflake `COPY INTO`) is excluded because its source is a file stage,
1221
+ not a table — no table lineage to extract.
1222
+ Everything else — `exp.Command`, `exp.DDL`, utility statements — is dropped and
1223
+ logged at DEBUG level, not WARNING.
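The property that step 1 relies on (a semicolon inside a string literal is not a statement boundary) can be illustrated without sqlglot. The splitter below is a deliberately simplified, stdlib-only stand-in for the tokenizer, not the proposed implementation: it tracks only single- and double-quoted literals and ignores comments and `$$` bodies, which the real tokenizer would have to handle.

```python
def split_statements(sql: str) -> list[str]:
    """Split on semicolons that are outside quoted literals (simplified stand-in)."""
    statements: list[str] = []
    current: list[str] = []
    quote: str | None = None  # the quote character we are currently inside, if any

    for ch in sql:
        if quote:
            if ch == quote:
                quote = None  # a doubled quote ('') simply reopens on the next char
        elif ch in ("'", '"'):
            quote = ch
        elif ch == ";":
            stmt = "".join(current).strip()
            if stmt:
                statements.append(stmt)
            current = []
            continue  # do not keep the separator itself
        current.append(ch)

    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements
```

A naive `sql.split(";")` would cut `INSERT INTO t VALUES ('a;b')` in two; the quote-aware version keeps it whole, which is the behaviour the tokenizer-based split is meant to provide.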
1224
+
1225
+ **DML coverage after this fix:**
1226
+
1227
+ | Statement | sqlglot type | Included |
1228
+ |---|---|---|
1229
+ | SELECT | `exp.Select` | Yes |
1230
+ | INSERT INTO … SELECT / VALUES | `exp.Insert` (via `exp.DML`) | Yes |
1231
+ | UPDATE | `exp.Update` (via `exp.DML`) | Yes |
1232
+ | DELETE | `exp.Delete` (via `exp.DML`) | Yes |
1233
+ | MERGE INTO | `exp.Merge` (via `exp.DML`) | Yes — was missing from original proposal |
1234
+ | COPY INTO (Snowflake) | `exp.Copy` (via `exp.DML`) | No — stage source, not a table |
1235
+ | CREATE TABLE AS SELECT | `exp.Create` (via `exp.DDL`) | Handled separately (see below) |
1236
+ | ALTER WAREHOUSE, CALL, SET, LET, … | `exp.Command` or specific DDL | Dropped at DEBUG |
1237
+
1238
+ **CTAS**: `CREATE TABLE AS SELECT` produces `exp.Create`, not `exp.DML`. The existing
1239
+ `AnsiParser._parse_statement` already handles CTAS by detecting a SELECT inside a
1240
+ Create node. In the scripting-block path, pass `exp.Create` instances through to
1241
+ `_parse_statement` rather than dropping them — it will handle them or ignore them
1242
+ cleanly.
1243
+
1244
+ **Procedure bodies — dialect differences:**
1245
+
1246
+ The tokenizer approach above handles scripting-block files (non-procedure files with
1247
+ `BEGIN` blocks) correctly across all dialects. Procedure bodies are different:
1248
+
1249
+ | Dialect | How sqlglot exposes the body | Extractable? |
1250
+ |---|---|---|
1251
+ | Snowflake | `exp.RawString` node inside `exp.Create` | Yes — extract, strip `BEGIN`/`END`, re-apply step 1–2 |
1252
+ | Databricks | `exp.RawString` (same `$$` delimiter) | Yes — same as Snowflake |
1253
+ | BigQuery | `exp.Command` inside `exp.Create` | No — body is opaque; regex fallback same as today |
1254
+
1255
+ For Snowflake/Databricks: detect `exp.Create` nodes that contain an `exp.RawString`
1256
+ body, extract the string, strip the outer `BEGIN ... END` wrapper, then run the
1257
+ tokenizer-split + DML filter on the inner content recursively. For BigQuery: leave the
1258
+ current `_EMBEDDED_DML` regex path as a fallback for `exp.Command` bodies only — do
1259
+ not regress existing behaviour.
1260
+
1261
+ **Files**: `src/sqlcg/parsers/snowflake_parser.py` — remove `_EMBEDDED_DML` regex,
1262
+ rewrite `_parse_scripting_file`. No schema or MCP impact. Test fixtures required:
1263
+ - Scripting block with `ALTER WAREHOUSE`, `CALL`, `MERGE INTO`, and `INSERT INTO … SELECT` — assert only MERGE and INSERT are indexed
1264
+ - Snowflake procedure with embedded DML — assert body DML is indexed
1265
+ - BigQuery procedure — assert existing regex fallback still extracts DML
1266
+
1267
+ ---
1268
+
1269
+ #### 10.2.8 [MEDIUM] Missing persistent index config (`.sqlcg.toml`)
1270
+
1271
+ After a `db reset` or fresh clone there is no way to replay the index without
1272
+ remembering the original `sqlcg index <path> --dialect <dialect>` invocation. The
1273
+ git hook calls `sqlcg index --dialect auto --quiet` which reads `.sqlcg.toml` if it
1274
+ exists, but this file is never created by any `sqlcg` command.
1275
+
1276
+ **Current state**: The `get_dialect()` helper in `config.py` reads `.sqlcg.toml`
1277
+ (via `[sqlcg] dialect = "..."`) if present, but `sqlcg index` never writes this file.
1278
+ A user must create `.sqlcg.toml` manually.
1279
+
1280
+ **Resolution**: `sqlcg index <path> --dialect <dialect>` should write (or update) a
1281
+ `.sqlcg.toml` file at the root of the indexed path on first successful index. The
1282
+ write should be idempotent (do not overwrite if the file already exists with a matching
1283
+ dialect). This makes the git hook self-healing after a `db reset` — the hook will read
1284
+ the dialect from `.sqlcg.toml` without the user needing to remember the original
1285
+ invocation.
1286
+
1287
+ Spec for `.sqlcg.toml`:
1288
+ ```toml
1289
+ [sqlcg]
1290
+ path = "/absolute/path/to/repo"
1291
+ dialect = "snowflake"
1292
+ ```
1293
+
1294
+ The `path` field allows `sqlcg index` (with no arguments) to infer the target
1295
+ directory from the config file when run from the repo root.
1296
+
1297
+ **Risk**: Low — no breaking changes. The file should be added to `.gitignore` templates
1298
+ in the README since it contains an absolute path that is machine-specific.
1299
+
1300
+ ---
1301
+
1302
+ #### 10.2.9 [LOW] No progress feedback during `sqlcg index`
1303
+
1304
+ `sqlcg index` on 1457 files took ~17 seconds with zero stdout output until the final
1305
+ summary line appeared. The `edges: 0` signal (indicating the graph has no relationships
1306
+ and lineage tracing will return nothing) appeared only in the raw log, never surfaced
1307
+ in `db info`.
1308
+
1309
+ **Resolution**:
1310
+
1311
+ 1. Add a simple periodic progress line to `Indexer.index_repo()`: print
1312
+ `"INFO: indexed N/total files..."` every 100 files (or every 5 seconds on a timer).
1313
+ Use `rich.progress` if available, or a plain `console.print` with carriage return.
1314
+
1315
+ 2. After indexing, if `lineage_edges_created == 0`, print a yellow warning to stdout
1316
+ (not just the log): "Warning: 0 lineage edges created. Column-level lineage tracing
1317
+ will not be available. Check parse errors above."
1318
+
1319
+ 3. Surface `lineage_edges_created` (renamed to `edges`) in `db info` as an explicit
1320
+ field. A graph with zero edges cannot answer any lineage query, and this must be
1321
+ immediately visible without reading the raw index log.
1322
+
1323
+ ---
1324
+
1325
+ #### 10.2.10 [LOW] `uvx` cold-start is invisible (6.5 minutes with no output)
1326
+
1327
+ The MCP server config (`"command": "uvx"`) means Claude Code pays a `uvx` download
1328
+ cost on first run (and potentially on each session if the `uvx` cache is cold). In
1329
+ the test session this took 6.5 minutes with zero output, indistinguishable from a hang.
1330
+
1331
+ **Resolution**: Documentation-only change. The README `QUICK START` section should:
1332
+ - Recommend `uv tool install sql-code-graph` for persistent local installs
1333
+ - Reserve `uvx sql-code-graph` for one-shot tries
1334
+ - Note that first run will be slow due to package download
1335
+ - The `sqlcg install` confirmation message should note: "Using uvx — first MCP startup
1336
+ may take several minutes on a cold cache. Run `uv tool install sql-code-graph` for
1337
+ faster startup."
1338
+
1339
+ ---
1340
+
1341
+ ### 10.3 Cross-Cutting Findings from Issue Analysis
1342
+
1343
+ **Finding 10.B — LLM agent is a first-class user of this tool but the tool was not designed for it**
1344
+
1345
+ The issue session revealed a consistent pattern: the tool produces correct outputs but
1346
+ does not communicate machine-readable semantics about its own state. A human user who
1347
+ gets "No results" knows to check if the database was indexed. An LLM agent does not.
1348
+ Every tool that can return empty due to a missing prerequisite (unindexed graph, zero
1349
+ column nodes, schema mismatch) must include a `hint` field in the returned model so
1350
+ Claude can self-diagnose without falling back to `execute_cypher`.
1351
+
1352
+ This is the root cause of the `execute_cypher`-dominance pattern (15 calls vs 7 for
1353
+ `trace_column_lineage`): Claude fell back to raw Cypher because the high-level tools
1354
+ returned empty with no explanation.
1355
+
1356
+ **Action**: Add a `hint: str | None` field to `LineageResult`, `DependencyResult`,
1357
+ `TableUsageResult`, and `SqlPatternResult` models. Populate it when the result list
1358
+ is empty with a diagnostic string. This is a backward-compatible model extension
1359
+ (new optional field).
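A sketch of the proposed model extension, using a plain dataclass as a stand-in for the real result model (whose base class is not shown here). The helper `build_lineage_result` is hypothetical, and the hint text is the one proposed in 10.2.6.

```python
from __future__ import annotations

from dataclasses import dataclass, field

EMPTY_LINEAGE_HINT = (
    "If you expected results, check that `db info` shows SqlColumn > 0. "
    "Submit feedback with `submit_feedback` tool if this was a false negative."
)


@dataclass
class LineageResult:
    """Stand-in for the real model: only the fields relevant to the hint are shown."""
    lineage: list[str] = field(default_factory=list)
    hint: str | None = None  # new optional field; None when results are present


def build_lineage_result(lineage: list[str]) -> LineageResult:
    """Attach the diagnostic hint only when the result list is empty."""
    return LineageResult(lineage=lineage, hint=None if lineage else EMPTY_LINEAGE_HINT)
```

Because `hint` defaults to `None`, existing callers that construct the model without it keep working, which is what makes the extension backward compatible.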
1360
+
1361
+ **Finding 10.C — Index success rate is misleading**
1362
+
1363
+ The index reported 100% success rate with 900KB of parse warnings and `SqlColumn: 0`.
1364
+ "Success" in the current codebase means "the file was processed without raising an
1365
+ unhandled exception." It does not mean "lineage was extracted." A file that falls back
1366
+ to `exp.Command` (scripting block) is counted as "success" even though it produced
1367
+ zero edges.
1368
+
1369
+ This is related to finding 3.4 (bare `except Exception: continue`) — it means errors
1370
+ are swallowed and the success rate is a vanity metric.
1371
+
1372
+ **Action**: Introduce a `parse_quality` metric in `ParsedFile`:
1373
+ - `full`: sqlglot parsed fully, column lineage extracted
1374
+ - `table_only`: tables extracted, column lineage not available (schema required or
1375
+ `SELECT *`)
1376
+ - `scripting_fallback`: fell back to regex extraction, confidence 0.3
1377
+ - `failed`: unhandled exception, no lineage
1378
+
1379
+ Report these four categories in the `index` summary line and in `db info`. This
1380
+ replaces "parse_errors" (which is binary) with a quality breakdown that accurately
1381
+ reflects what Claude can and cannot answer.
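One possible shape for the metric, assuming it is an enum carried on each parsed file and aggregated once per index run. `ParseQuality` and `summarize_quality` are illustrative names; the actual `ParsedFile` field layout is not shown in this review.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class ParseQuality(str, Enum):
    FULL = "full"                              # column lineage extracted
    TABLE_ONLY = "table_only"                  # tables only, no column lineage
    SCRIPTING_FALLBACK = "scripting_fallback"  # regex extraction, low confidence
    FAILED = "failed"                          # unhandled exception, no lineage


def summarize_quality(qualities: Iterable[ParseQuality]) -> str:
    """Render a fixed-order breakdown suitable for the index summary line."""
    counts = Counter(qualities)
    return ", ".join(f"{q.value}: {counts[q]}" for q in ParseQuality)
```

Always printing all four categories, including zeros, makes a run with `full: 0` stand out instead of hiding behind a 100% "success" figure.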
1382
+
1383
+ ---
1384
+
1385
+ ### 10.4 Prioritized Implementation Plan — Issues #5 and #6
1386
+
1387
+ Ranked by impact (breadth of user pain) weighted by the inverse of implementation
1387
+ effort, so that among items of similar impact the lower-effort ones rank higher.
1389
+
1390
+ | Rank | Issue | Finding | File(s) | Action | Effort | Impact |
1391
+ |------|-------|---------|---------|--------|--------|--------|
1392
+ | 1 | #5.1 | 10.2.2 | `cli/main.py` | Add `QUICK START` block to top-level `--help` with ordered steps; update `NotIndexedError` message to include step hint | XS | CRITICAL |
1393
+ | 2 | #5.2 | 10.2.3 | `cli/commands/db.py` | Add health check interpretation to `db info` output (red/yellow warnings for empty Repo, zero SqlQuery, zero SqlColumn) | XS | HIGH |
1394
+ | 3 | #6 | 10.A, 10.1 | `cli/commands/uninstall.py` (new) | Implement `sqlcg uninstall`: remove MCP entry, delete `~/.sqlcg/`, strip git hook; `--keep-db` flag; register in `cli/main.py` | S | HIGH |
1395
+ | 4 | #5.2 | 10.2.9 | `indexer/indexer.py`, `cli/commands/index.py` | Add periodic progress output; surface `edges: 0` warning on stdout; add `edges` to `db info` | S | HIGH |
1396
+ | 5 | #5.2 | 10.B | `server/models.py`, `server/tools.py` | Add `hint: str \| None` to `LineageResult`, `DependencyResult`, `TableUsageResult`; populate on empty returns with diagnostic message | S | HIGH |
1397
+ | 6 | #5.2 | 10.2.8 | `cli/commands/index.py`, `core/config.py` | Write `.sqlcg.toml` on successful index (idempotent); update `get_dialect()` to also read `path` from toml; allow `sqlcg index` with no args | S | MEDIUM |
1398
+ | 7 | #5.2 | 10.2.4 | `cli/commands/analyze.py` | Add `--exclude-schema` to `analyze unused`; always append closed-world caveat to output | S | MEDIUM |
1399
+ | 8 | #5.2 | 10.2.6 | `server/tools.py`, `metrics/store.py` | Add `FN` label to `submit_feedback`; add `execute_cypher` ratio to `sqlcg gain` output | S | MEDIUM |
1400
+ | 9 | #5.2 | 10.2.7 | `parsers/snowflake_parser.py` | Replace `_EMBEDDED_DML` regex with sqlglot tokenizer-split + `isinstance(stmt, (exp.DML, exp.Select)) and not isinstance(stmt, exp.Copy)` filter; handle Snowflake/Databricks procedure bodies via `exp.RawString` extraction; keep regex fallback for BigQuery `exp.Command` bodies; 3 test fixtures required | M | HIGH |
1401
+ | 10 | #5.2 | 10.2.5 | `indexer/indexer.py`, `parsers/base.py` | Verify `SELECTS_FROM` edges are created for INSERT-SELECT queries; add test fixture asserting `analyze impact` finds ETL INSERT — **elevated to HIGH (confirmed parser bug, see 10.6 Q4)** | M | HIGH |
1402
+ | 11 | #5.2 | 10.C | `indexer/indexer.py`, `parsers/base.py`, `cli/commands/index.py` | Introduce `parse_quality` breakdown (full / table_only / scripting_fallback / failed); surface in index summary and `db info` | M | MEDIUM |
1403
+ | 12 | #5.2 | 10.2.1 | `server/tools.py`, `cli/commands/mcp.py` | Add binary/package name note to `mcp setup` output and `index_repo`/`list_dialects_and_repos` tool docstrings | XS | LOW |
1404
+ | 13 | #5.2 | 10.2.10 | `README.md` / install docs | Add `uv tool install` recommendation; note cold-start latency in `sqlcg install` confirmation message | XS | LOW |
1405
+
1406
+ Note (2026-05-05): Ranks 3, 6, 8, and 10 were blocked on open questions. All four
1407
+ are now resolved — see section 10.6. Rank 10 was elevated from MEDIUM to HIGH after
1408
+ the user confirmed that `analyze impact` must include ETL INSERT statements (confirmed
1409
+ parser bug, not a design constraint).
1410
+
1411
+ ---
1412
+
1413
+ ### 10.5 Impact on Existing Architecture Review Sections
1414
+
1415
+ **Finding 3.4 amplified (10.C)**: The bare `except Exception: continue` in
1416
+ `_extract_column_lineage` was already a HIGH finding. Issue #5 confirms its real-world
1417
+ consequence: a 1457-file corpus reports 100% success with `SqlColumn: 0`. This
1418
+ elevates finding 3.4 from a theoretical correctness concern to a confirmed user-visible
1419
+ failure mode. The `parse_quality` breakdown (rank 11 above) is the correct systemic fix.
1420
+
1421
+ **Finding 5.2 (wizard.py unspec'd) is now confirmed scope creep risk**: The wizard
1422
+ was never mentioned in the test session. The LLM relied on `--help`. This confirms that
1423
+ the QUICK START in `--help` (rank 1 above) is the highest-value onboarding investment,
1424
+ not a GUI wizard.
1425
+
1426
+ **`analyze unused` (finding 5.3 / JOINS) extended**: The `analyze unused` false positive
1427
+ issue (#5.3) confirms finding 5.3 from the original review — the graph uses a
1428
+ closed-world assumption that must be made explicit to users. The `--exclude-schema` flag
1429
+ and caveat text address this without changing the graph schema.
1430
+
1431
+ **`analyze impact` vs `find pattern` inconsistency (10.2.5)** is a new finding not
1432
+ previously identified in this review. It points to a potential gap in `_upsert_parsed_file`:
1433
+ `SELECTS_FROM` edges may not be created for all query kinds that reference tables as
1434
+ sources. This requires a developer investigation before a fix can be scoped.
1435
+
1436
+ ---
1437
+
1438
+ ### 10.6 Open Questions from Issue Analysis — RESOLVED (2026-05-05)
1439
+
1440
+ All four open questions were resolved by user comments on the GitHub issues.
1441
+
1442
+ ---
1443
+
1444
+ **Q1 — `sqlcg uninstall` interaction model: RESOLVED**
1445
+
1446
+ Decision: prompt the user before deleting the graph database; support `--force` to
1447
+ skip the prompt. DB deletion is only offered when the database is local (KùzuDB
1448
+ embedded at `~/.sqlcg/` or `SQLCG_DB_PATH`). Neo4j remote backends are never
1449
+ deleted by `sqlcg uninstall` — only the MCP registration and git hook are removed.
1450
+
1451
+ Concrete interaction contract for `cli/commands/uninstall.py`:
1452
+
1453
+ - Step 1: always remove `mcpServers["sql-code-graph"]` from `~/.claude/settings.json`
1454
+ (atomic `.tmp` + `os.replace`, same guard as install).
1455
+ - Step 2: if the active backend is local KùzuDB, prompt:
1456
+ "This will delete the graph database at `<path>`. Continue? [y/N]"
1457
+ If `--force` is passed, skip the prompt and proceed. If `--keep-db` is passed,
1458
+ skip both prompt and deletion.
1459
+ - Step 3: remove the `# sqlcg post-checkout hook` sentinel block from
1460
+ `.git/hooks/post-checkout` in `Path.cwd()` (or `--repo <path>`). Delete the hook
1461
+ file if it becomes empty after stripping.
1462
+ - Print a confirmation line per step taken.
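Step 1 of the contract can be sketched as follows. This is an illustration under assumptions rather than the contents of `uninstall.py`: the function name `remove_mcp_entry` is hypothetical, and the settings file is assumed to be a JSON object with an optional top-level `mcpServers` mapping.

```python
import json
import os
from pathlib import Path


def remove_mcp_entry(settings_path: Path, name: str = "sql-code-graph") -> bool:
    """Remove mcpServers[name] from a JSON settings file; return True if removed."""
    if not settings_path.exists():
        return False
    settings = json.loads(settings_path.read_text(encoding="utf-8"))
    servers = settings.get("mcpServers", {})
    if name not in servers:
        return False  # nothing to do; leave the file untouched
    del servers[name]

    # Write to a sibling .tmp file, then swap it in with os.replace so a crash
    # mid-write cannot leave a truncated settings file behind.
    tmp_path = settings_path.with_name(settings_path.name + ".tmp")
    tmp_path.write_text(json.dumps(settings, indent=2), encoding="utf-8")
    os.replace(tmp_path, settings_path)
    return True
```

Returning a boolean lets the caller print the "confirmation line per step taken" only when something was actually removed.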
1463
+
1464
+ The `--force` flag also deletes the SQLite metrics store (same `~/.sqlcg/` path)
1465
+ per the user's comment "maybe also delete the database (graph + sqlite) when run
1466
+ with --force."
1467
+
1468
+ Impact on implementation plan: rank 3 (`uninstall.py`) spec is now complete. No
1469
+ further design questions.
1470
+
1471
+ ---
1472
+
1473
+ **Q2 — `.sqlcg.toml` committing policy: RESOLVED**
1474
+
1475
+ Decision: make `path` optional in `.sqlcg.toml`; resolve to `cwd` when absent.
1476
+ The file is therefore committable (no machine-specific absolute path).
1477
+
1478
+ Revised spec for `.sqlcg.toml`:
1479
+
1480
+ ```toml
1481
+ [sqlcg]
1482
+ dialect = "snowflake"
1483
+ # path is optional; defaults to the directory containing this file (cwd at index time)
1484
+ ```
1485
+
1486
+ `sqlcg index <path> --dialect <dialect>` writes this file at the root of the indexed
1487
+ path. If `path` equals `cwd`, the `path` key is omitted entirely. `sqlcg index`
1488
+ with no arguments reads `.sqlcg.toml` from cwd and uses `dialect` from it; `path`
1489
+ defaults to cwd.
1490
+
1491
+ The file should NOT be added to `.gitignore`. Teams can commit it to share the
1492
+ dialect config. The README should note that `.sqlcg.toml` without a `path` key is
1493
+ safe to commit; a `path` key (if the user adds one manually for a non-cwd root)
1494
+ is machine-specific and should be gitignored.
1495
+
1496
+ Impact on implementation plan: rank 6 (`.sqlcg.toml` write) spec is now complete.
1497
+
1498
+ ---
1499
+
1500
+ **Q3 — `FN` label for `submit_feedback`: RESOLVED**
1501
+
1502
+ Decision: yes, add `FN` as a valid label. The MetricsStore schema change is
1503
+ acceptable — the project has a no-backward-compatibility policy ("WE DON'T NEED TO
1504
+ KEEP A VERSION, WE WILL BREAK THINGS CAUSE THE PACKAGE IS LIVE"). Schema migrations
1505
+ are handled by re-initialising the SQLite metrics store on startup if the schema
1506
+ version changes.
1507
+
1508
+ Implementation note: add `FN` to the `FeedbackLabel` enum (or string literal set)
1509
+ in `metrics/store.py` and update the `submit_feedback` tool docstring and any
1510
+ validation that guards the label field. No migration script is needed — the store
1511
+ is a local append-only log; old records with `TP`/`FP` labels remain valid.
1512
+
1513
+ Impact on implementation plan: rank 8 (`submit_feedback` `FN` label) is unblocked.
1514
+
1515
+ ---
1516
+
1517
+ **Q4 — `analyze impact` expected scope: RESOLVED**
1518
+
1519
+ Decision: `analyze impact` must include ETL INSERT statements that use the table as
1520
+ a source. The inconsistency reported in issue #5 ("`analyze impact` returned only
1521
+ DDL files while `find pattern` additionally found the ETL INSERT") is confirmed as a
1522
+ parser bug, not an intentional design constraint. The user's issue description states
1523
+ "the two commands should return consistent results."
1524
+
1525
+ Implication: `analyze impact` is not DDL-only. Its scope is all `SqlQuery` nodes
1526
+ that reference the target table via `SELECTS_FROM` edges — SELECT, INSERT-SELECT,
1527
+ CTAS, MERGE, and UPDATE-FROM all qualify. The bug is that `_upsert_parsed_file` may
1528
+ not create `SELECTS_FROM` edges for INSERT statements. This must be verified and
1529
+ fixed.
1530
+
1531
+ Priority escalation: the inconsistency (originally rank 10 in section 10.4) is now
1532
+ confirmed as a parser bug and should be treated as HIGH priority. Elevate to rank 4
1533
+ (tied with the `hint` field addition, rank 5). The regression test fixture
1534
+ (`tests/fixtures/synthetic/etl_chain.sql` or a new `insert_select_impact.sql`)
1535
+ is now a mandatory deliverable for the developer.
1536
+
1537
+ ---
1538
+
1539
+ ### 10.7 Policy Note — No Backward Compatibility Constraint
1540
+
1541
+ The user confirmed: "WE DON'T NEED TO KEEP A VERSION, WE WILL BREAK THINGS CAUSE
1542
+ THE PACKAGE IS LIVE."
1543
+
1544
+ This policy applies to: MetricsStore SQLite schema, KùzuDB graph schema, MCP tool
1545
+ response model shapes, `.sqlcg.toml` format, and the `pyproject.toml` dependency
1546
+ pins. Breaking changes between releases are acceptable without a deprecation cycle.
1547
+ The architect-reviewer notes this means:
1548
+
1549
+ - Finding 3.6 (`QueryNode` mutability) can be fixed with `frozen=True` without a
1550
+ migration path for serialised graph data (re-index is the migration).
1551
+ - The `FN` label schema change (Q3) needs no migration script.
1552
+ - The `parse_quality` breakdown (finding 10.C) can replace the existing `parse_errors`
1553
+ field in `ParsedFile` without an adapter layer.
1554
+ - The `hint` field addition to result models (finding 10.B) can be non-optional if
1555
+ desired, though keeping it `str | None` is still the cleaner design.