
sql-code-graph 0.2.1 (tar.gz) → 0.3.0 (tar.gz)


Files changed (173)

{sql_code_graph-0.2.1 → sql_code_graph-0.3.0}/ARCHITECTURE_REVIEW.md RENAMED

@@ -1,7 +1,7 @@
 # Architecture Review — sql-code-graph (sqlcg)
 Blueprint version: v1.2 (May 2026)
-Review date: 2026-05-02
+Review date: 2026-05-02 (updated 2026-05-05)
 Reviewer agent: architect-reviewer
 ---
@@ -276,6 +276,23 @@ Recommended resolution: Refactor `_classify` to use `match`/`case` (pure style,
 Leave the string `kind` values but extract them into a `QueryKind` `StrEnum` in
 `parsers/base.py` or `core/schema.py` to eliminate magic string duplication.
+**Additional improvement from 10.2.7 research**: sqlglot exposes `exp.DML` as a base
+class covering `Insert`, `Update`, `Delete`, and `Merge`. The current `_classify`
+enumerates these individually, which caused `MERGE` to be missing from the original
+DML filter proposal (caught and fixed in 10.2.7). The `match`/`case` refactor should
+use `exp.DML` as a base branch to future-proof against new DML types sqlglot may add:
+```python
+case stmt if isinstance(stmt, exp.DML) and not isinstance(stmt, exp.Copy):
+    # handles Insert, Update, Delete, Merge — and any future exp.DML subclass
+    ...
+case exp.Select():
+    ...
+```
+This ensures `_classify` inherits coverage from sqlglot's own type hierarchy rather
+than requiring manual updates each time a new DML expression type is added.
 **Comment 8 — `parsers/base.py` line 263 and line 294: `_real_tables` and `_convert_table_expr_to_ref` use `elif`**
 Reviewer text: "same here match statements are much cleaner" / "match"
@@ -486,18 +503,28 @@ produces zero edges, which is indistinguishable from a column with no upstream s
 This directly undermines the "explicit failure mode inventory" strength described in
 section 2.4.
-### 3.5 [MEDIUM] `SnowflakeParser._parse_scripting_file` applies a file-level heuristic that can false-positive
+### 3.5 ~~[MEDIUM]~~ **RESOLVED** `SnowflakeParser._parse_scripting_file` applies a file-level heuristic that can false-positive
+> **Status: already fixed in code.** The token-aware `TokenType.BEGIN` check described
+> below is implemented in `_has_scripting_block()` (lines 79–85 of
+> `src/sqlcg/parsers/snowflake_parser.py`). This finding is closed.
-The heuristic `if _SCRIPTING_BLOCK.search(sql): return self._parse_scripting_file()`
-matches any file containing the word `BEGIN`. This includes:
+~~The heuristic `if _SCRIPTING_BLOCK.search(sql): return self._parse_scripting_file()`
+matches any file containing the word `BEGIN`. This includes:~~
-- Comments: `-- BEGIN transaction isolation block`
-- String literals: `SELECT 'BEGIN' AS status_label FROM t`
-- Legitimate `BEGIN TRANSACTION` in non-scripting contexts
+~~- Comments: `-- BEGIN transaction isolation block`~~
+~~- String literals: `SELECT 'BEGIN' AS status_label FROM t`~~
+~~- Legitimate `BEGIN TRANSACTION` in non-scripting contexts~~
-A false positive routes an entire file through the regex DML extractor, which sets
+~~A false positive routes an entire file through the regex DML extractor, which sets
 `parse_failed=True` and `confidence=0.3` on all statements and drops all column
-lineage. This degrades quality on files that are fully parseable.
+lineage. This degrades quality on files that are fully parseable.~~
+The implemented fix uses `sqlglot.tokens.Tokenizer.from_dialect("snowflake").tokenize(sql)`
+and checks for a `TokenType.BEGIN` token directly — string literals and comments are
+never tokenised as `BEGIN`, so false positives on `SELECT 'BEGIN' AS x` or
+`-- BEGIN block` are impossible. A regex fallback is retained only for the edge case
+where tokenisation itself raises an exception.
 ### 3.6 [MEDIUM] `QueryNode` is a mutable dataclass but `LineageEdge` is frozen — inconsistency
@@ -586,12 +613,12 @@ Ranking uses: **Impact** (correctness/stability consequence if unaddressed) x
 | 2 | 3.4 — `_extract_column_lineage` bare `except Exception: continue` | HIGH | Low | Log the exception at WARNING with column name and file path; append to `ParsedFile.errors`; set `confidence=0.0` on the affected edge slot |
 | 3 | 3.2 — `resolve_pass2` re-opens file without error handling | HIGH | Low | Wrap `open()` in `try/except (FileNotFoundError, OSError)`; log a warning and return the pass-1 `ParsedFile` unchanged |
 | 4 | 3.3 — `SchemaResolver.as_dict()` cache is not thread-safe | HIGH | Medium | Document the single-threaded assumption explicitly in the class docstring; if `jobs.py` uses threads, use `threading.Lock` around cache mutation and read, or switch to a per-parse-job `SchemaResolver` instance |
-| 5 | 3.5 — `BEGIN` heuristic false-positives on comments and string literals | MEDIUM | Low | Use a token-aware check: call `sqlglot.tokenize(sql, dialect="snowflake")` and look for a `TokenType.BEGIN` token that is not inside a string or comment, rather than a raw regex |
+| 5 | ~~3.5 — `BEGIN` heuristic false-positives~~ | ~~MEDIUM~~ | ~~Low~~ | **RESOLVED** — token-aware `TokenType.BEGIN` check already implemented in `_has_scripting_block()`. No action needed. |
 | 6 | 3.6 — `QueryNode` mutability inconsistency with frozen `LineageEdge` | MEDIUM | Low | Either add `frozen=True` to `QueryNode` and use `dataclasses.replace()` for pass-2 patching, or document the mutability contract explicitly in the class docstring |
 | 7 | 3.7 — No per-file timeout or SIGINT handling during indexing | MEDIUM | Medium | Add a configurable `--timeout-per-file` (default 30s) enforced via `concurrent.futures.ThreadPoolExecutor` with `future.result(timeout=N)`; handle `KeyboardInterrupt` in the walker loop to flush and exit cleanly |
 | 8 | 3.8 — `add_information_schema` is a silent no-op stub | MEDIUM | Low | Raise `NotImplementedError("--schema-from-info-schema is not yet implemented")` until the method is built; guard the CLI flag to surface this error immediately |
 | 9 | 3.10 — `acryl-datahub` upper bound missing on unstable module | LOW | Low | Change to `"acryl-datahub[sql-parsing]>=0.14.0,<0.15.0"` in `pyproject.toml` |
-| 10 | 3.12 — `_EMBEDDED_DML` regex greedy match across statement boundaries | LOW | Low | Add `re.MULTILINE`; test against a two-statement scripting block with no trailing semicolon on the last statement |
+| 10 | ~~3.12 — `_EMBEDDED_DML` regex greedy match across statement boundaries~~ | ~~LOW~~ | ~~Low~~ | **SUPERSEDED by 10.2.7** — the regex is being replaced entirely with a sqlglot tokenizer + `exp.DML` base class filter. Adding `re.MULTILINE` is no longer the fix. |
 | 11 | 3.9 — sqlglot upper bound `<31.0` too narrow | LOW | Low | Widen to `<32.0`; add CHANGELOG review step to CI upgrade process |
 | 12 | 3.11 — MCP tools lack explicit error documentation | LOW | Low | Add a `Raises` section to each tool docstring listing `NotIndexedError`, `InvalidColumnRef`, etc.; define a small custom exception hierarchy in `server/exceptions.py` |
@@ -744,11 +771,22 @@ Four capabilities are missing and each requires non-trivial work:
    pipeline execution order model, which is not representable in a static file walker.
 2. **Stored procedure body full parsing.** Snowflake `$$...$$` blocks and T-SQL
-   `BEGIN/END` bodies currently fall to regex extraction. A proper solution would
-   use sqlglot's `dialect="snowflake"` parser on the extracted body after stripping
-   the procedure wrapper. The blocker is that sqlglot classifies these as `exp.Command`
-   and does not expose a stored-procedure-body parser. This is blueprint Gap 5,
-   classified as a "design limitation" with "workaround only" resolution path.
+   `BEGIN/END` bodies currently fall to regex extraction. The blocker is
+   **dialect-specific** — this is not a uniform limitation:
+   - **Snowflake / Databricks**: sqlglot exposes the `$$...$$` body as an `exp.RawString`
+     node inside the `exp.Create` AST. The body is extractable, strippable of its
+     `BEGIN`/`END` wrapper, and re-parseable with the tokenizer + `exp.DML` filter
+     described in finding 10.2.7. This path is now viable and is part of the 10.2.7
+     implementation plan.
+   - **BigQuery**: sqlglot cannot fully parse `CREATE PROCEDURE ... BEGIN...END` for
+     BigQuery; the body lands as `exp.Command` inside `exp.Create`. For BigQuery, the
+     regex fallback (`_EMBEDDED_DML`) remains the only option. This is the true
+     "design limitation / workaround only" case.
+   - **T-SQL**: not yet verified — likely similar to BigQuery (body as `exp.Command`).
+   This is blueprint Gap 5. The Snowflake portion is now addressable via 10.2.7;
+   BigQuery procedure bodies remain a fundamental static-analysis limit.
 3. **COPY INTO column lineage.** Inferring column mapping from a stage file to a
    target table requires either an explicit column list in the COPY statement or an
@@ -902,3 +940,616 @@ have the GitHub repository configured as a trusted publisher:
   This is enforced via `needs: test` in the workflow YAML.
 Full implementation spec: `plan/phase10_deployment.md`
+---
+## 10. GitHub Issues Review — v0.2.1 Feedback (2026-05-05)
+Review date: 2026-05-05
+Source: GitHub issues #5 and #6, Warhorze/sql-code-graph
+Tester environment: sqlcg 0.2.1, Snowflake dialect, 1457-file DWH corpus, WSL2/Linux, Claude Code MCP session
+Both issues are marked `[feedback]` and were filed by the repo owner after a real
+production session. They are high-signal because they reflect actual LLM-agent and
+end-user experience, not hypothetical concerns.
+---
+### 10.1 Issue #6 — No clean uninstall / opt-out path
+**Problem statement**: The tool installs side effects in three separate locations
+(`~/.claude/settings.json`, `~/.sqlcg/`, `.git/hooks/post-checkout`) with no
+corresponding removal command. A user who wants to remove the tool must discover and
+manually undo all three locations.
+**Architectural assessment**: The `sqlcg install` command was designed with symmetry in
+mind (Phase 10), but only the install direction was implemented. The three side-effect
+locations map directly to the three install operations:
+1. `sqlcg install` writes `mcpServers["sql-code-graph"]` to `~/.claude/settings.json`
+2. The user's KùzuDB lives at `~/.sqlcg/` (controlled by `SQLCG_DB_PATH` or the default
+   set in `config.py`)
+3. `sqlcg git install-hooks` writes `.git/hooks/post-checkout`
+A `sqlcg uninstall` command is the natural symmetric counterpart and follows the exact
+same pattern as `install.py`:
+- Remove `mcpServers["sql-code-graph"]` from `~/.claude/settings.json` using atomic
+  `.tmp` + `os.replace` write (same guard as install)
+- Delete `~/.sqlcg/` (or the path from `SQLCG_DB_PATH`) via `shutil.rmtree` with
+  a `--keep-db` flag to allow MCP deregistration without data loss
+- Remove the `# sqlcg post-checkout hook` sentinel block from `.git/hooks/post-checkout`
+  in the specified repo (or `Path.cwd()` by default); if the hook file becomes empty
+  after stripping, delete it
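
The first and third removal steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's `install.py` logic: the function names are hypothetical, and the end-of-block marker `# end sqlcg post-checkout hook` is an assumption, since the review only names the opening sentinel.

```python
import json
import os
from pathlib import Path

BEGIN_SENTINEL = "# sqlcg post-checkout hook"    # sentinel named in the review
END_SENTINEL = "# end sqlcg post-checkout hook"  # assumed closing marker (hypothetical)


def remove_mcp_entry(settings_path: Path, server_name: str = "sql-code-graph") -> bool:
    """Drop mcpServers[server_name] and rewrite the file atomically.

    Returns True if an entry was removed, False if there was nothing to do.
    """
    if not settings_path.exists():
        return False
    settings = json.loads(settings_path.read_text(encoding="utf-8"))
    servers = settings.get("mcpServers", {})
    if server_name not in servers:
        return False
    del servers[server_name]
    tmp_path = settings_path.with_name(settings_path.name + ".tmp")
    tmp_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    os.replace(tmp_path, settings_path)  # atomic swap when both paths share a filesystem
    return True


def strip_hook_block(hook_path: Path) -> None:
    """Remove the sentinel-delimited block; delete the hook file if nothing else remains."""
    if not hook_path.exists():
        return
    kept: list[str] = []
    inside = False
    for line in hook_path.read_text(encoding="utf-8").splitlines():
        marker = line.strip()
        if marker == BEGIN_SENTINEL:
            inside = True
        elif marker == END_SENTINEL:
            inside = False
        elif not inside:
            kept.append(line)
    if any(line.strip() for line in kept):
        hook_path.write_text("\n".join(kept) + "\n", encoding="utf-8")
    else:
        hook_path.unlink()
```

Both helpers are no-ops when their target is missing, so running them twice is safe. Deleting the database directory behind a `--keep-db` guard would be a plain `shutil.rmtree` call and is omitted here.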
+**Risk**: LOW severity (no data is corrupted). HIGH usability impact — an unclean
+opt-out path erodes trust and makes the tool feel invasive.
+**New finding (10.A)**: The three side-effect locations are not documented anywhere in
+`--help` or the README. Even without `sqlcg uninstall`, documenting them explicitly
+would reduce user friction when opting out.
+---
+### 10.2 Issue #5 — LLM Agent Experience: Silent Failures and False Positives
+This is a compound issue with ten distinct sub-problems. Each is assessed separately
+below, roughly in order of severity.
+---
+#### 10.2.1 [CRITICAL] LLM cannot self-recover from package/binary name mismatch
+The PyPI package is `sql-code-graph`; the binary is `sqlcg`. When an LLM agent
+invokes the MCP server by its package name (`sql-code-graph`) it gets "executable not
+found" and cannot recover without human intervention.
+The `mcp setup` / `sqlcg install` JSON already writes the correct invocation
+(`uvx sql-code-graph` or `sqlcg`), but the LLM does not read its own MCP settings
+file — it relies on the MCP server's tool descriptions to understand how to invoke it.
+**Root cause**: No tool or `--help` output names the binary explicitly with context.
+The `list_dialects_and_repos` tool description could carry this hint, but currently
+does not.
+**Resolution**: Add a one-line note to the `sqlcg mcp setup` JSON output and to the
+`list_dialects_and_repos` / `index_repo` tool docstrings: "Binary is `sqlcg`; PyPI
+package is `sql-code-graph`." This is a docstring-only change with no code impact.
+---
+#### 10.2.2 [CRITICAL] No workflow order surfaced in `--help` or tools
+The required sequence (`db init` -> `index <path>` -> `git install-hooks`) is not
+implied by `--help`, `mcp setup`, or any tool description. During the test session
+the LLM jumped directly to `find`/`analyze` and received empty results for the entire
+session because `db init` and `index` were never run.
+The `watch` command currently has no initialization guard: it calls
+`backend.init_schema()` internally (correct), but `sqlcg watch` on an empty database
+with no index will start the watcher, produce no results, and give no indication that
+the database has zero content.
+**Resolution**:
+1. Add a `QUICK START` block to the top-level `--help` text in `cli/main.py`
+   (within the `Typer` `help=` parameter string), listing the three required steps
+   in order. `typer` renders this in the `--help` output verbatim.
+2. In every tool that calls `_assert_indexed()`, surface the error message as:
+   "No repos indexed. Run `sqlcg db init` then `sqlcg index <path>` first."
+   (The current `NotIndexedError` message already says this but uses the CLI form;
+   confirm it matches the actual binary name.)
+3. The `watch` command is fine architecturally — it calls `init_schema()` before
+   starting the observer — but it should print a warning if zero repos are indexed
+   after the initial full index: "Warning: 0 tables indexed. Check that the path
+   contains .sql files and the dialect is correct."
+---
+#### 10.2.3 [HIGH] `db info` gives no warning when `SqlColumn: 0`
+`db info` shows node counts for all labels including `SqlColumn`. When `SqlColumn: 0`,
+the `trace_column_lineage` and `get_downstream_dependencies` / `get_upstream_dependencies`
+tools are completely non-functional — they will always return empty results. The user
+cannot distinguish "column not found" from "column graph not built."
+The current `db info` implementation (in `db.py`) iterates all `NodeLabel` values and
+prints counts without any interpretation. A zero count on `SqlColumn` is printed
+identically to a non-zero count.
+**Resolution**: After printing counts, add a structured health check section to
+`db info` output:
+- If `SqlColumn == 0` and `SqlQuery > 0`: print a yellow warning: "Column lineage
+  not available. Tools `trace_column_lineage`, `get_downstream_dependencies`, and
+  `get_upstream_dependencies` will return empty results."
+- If `SqlQuery == 0` and `Repo > 0`: print a yellow warning: "No queries indexed.
+  Run `sqlcg index <path>` to populate the graph."
+- If `Repo == 0`: print a red error: "Database is empty. Run `sqlcg db init` and
+  `sqlcg index <path>` first."
+This is also relevant to the MCP server: `list_dialects_and_repos` could append
+a `warnings` field to its `DialectRepoResult` listing health status, so Claude
+can surface it proactively.
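
The three health-check rules above reduce to a small pure function over the node counts. The sketch below assumes the counts arrive as a plain `dict` keyed by label name; `health_warnings` is a hypothetical helper, not an existing function in `db.py`.

```python
def health_warnings(counts: dict[str, int]) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for the node counts that `db info` prints."""
    repos = counts.get("Repo", 0)
    queries = counts.get("SqlQuery", 0)
    columns = counts.get("SqlColumn", 0)

    if repos == 0:
        return [("error", "Database is empty. Run `sqlcg db init` and "
                          "`sqlcg index <path>` first.")]
    if queries == 0:
        return [("warning", "No queries indexed. Run `sqlcg index <path>` "
                            "to populate the graph.")]
    if columns == 0:
        return [("warning", "Column lineage not available. Tools "
                            "`trace_column_lineage`, `get_downstream_dependencies`, and "
                            "`get_upstream_dependencies` will return empty results.")]
    return []
```

Returning data rather than printing keeps the same check reusable for a `warnings` field on the MCP side, with the CLI mapping `error` to red and `warning` to yellow.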
+---
+#### 10.2.4 [HIGH] `analyze unused` returns 100% false positives for external consumers
+`analyze unused` returned ~100 `IA_ANALYTICS` / `IA_TABLEAU` / `IA_BUSINESSOBJECTS`
+views as "unused" because these are consumer-facing views queried by Tableau and BI
+tools — external consumers with no SQL references within the indexed corpus.
+The current query in `analyze.py`:
+```cypher
+MATCH (t:SqlTable) WHERE NOT (t)<-[:SELECTS_FROM]-() RETURN t.qualified
+```
+is logically correct for the closed-world assumption (only SQL files in the indexed
+corpus), but this assumption is never stated to the user.
+**Resolution**:
+1. **Medallion-architecture-aware classification.** In a Bronze/Silver/Gold (medallion)
+   warehouse, tables at each layer have different "unused" semantics:
+   | Layer | Typical schema prefixes | Expected consumption pattern | "Unused" meaning |
+   |---|---|---|---|
+   | Bronze / Raw | `RAW_`, `STG_`, `STAGE_` | Loaded by COPY INTO / ELT jobs | Bug if unused — no ETL reads it |
+   | Silver / Intermediate | `INT_`, `PREP_`, `DWH_`, `BA_` | Read by ETL INSERT/MERGE | Bug if unused |
+   | Gold / Consumption | `DIM_`, `FACT_`, `MART_`, `IA_`, `RPT_` | Queried by BI tools externally | **Expected** — external consumers have no SQL in the corpus |
+   `analyze unused` should tag each result with its inferred layer so the LLM can
+   distinguish "this Gold view has no SQL consumers — expected, BI queries it" from
+   "this Silver staging table has no SQL consumers — likely a dead table."
+   Implementation: after the Cypher query, classify each result by matching its schema
+   prefix against a configurable tier map (with sensible defaults). Surface the tier
+   in the CLI output column and in the MCP tool's result model. The MCP tool docstring
+   must explain the tier model so Claude can give informed recommendations without
+   the user needing to explain their naming conventions.
+2. Add a `--exclude-schema` flag to `analyze unused` for manual exclusion of known
+   consumer schemas. The flag description must explain **when to use it**: "Use to
+   exclude schemas whose tables are consumed externally (e.g., by BI tools or APIs)
+   and will always appear unused within the indexed SQL corpus."
+   The MCP tool version of this flag (if exposed) must carry the same explanation in
+   its parameter docstring so the LLM knows when to apply it autonomously.
+3. Always append a closed-world caveat to results: "Note: 'unused' means no SQL file
+   in the indexed corpus selects from this table. External consumers (Tableau, BI tools,
+   APIs) are not visible to this tool." This caveat must appear in: CLI output, MCP
+   tool docstring, and `analyze unused --help`.
+4. The same tier classification and caveat applies to the MCP server's future
+   `find_unused_tables` tool (if planned).
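
The post-query classification in step 1, together with the exclusion flag and the caveat, might look like the sketch below. The default prefixes are copied from the tier table; the function names, the result shape, and the rule that a bare schema such as `BA` matches the `BA_` prefix are assumptions made for illustration.

```python
# Default prefixes copied from the tier table; a real deployment would make these configurable.
DEFAULT_TIER_PREFIXES = {
    "bronze": ("RAW_", "STG_", "STAGE_"),
    "silver": ("INT_", "PREP_", "DWH_", "BA_"),
    "gold": ("DIM_", "FACT_", "MART_", "IA_", "RPT_"),
}

CLOSED_WORLD_CAVEAT = (
    "Note: 'unused' means no SQL file in the indexed corpus selects from this table. "
    "External consumers (Tableau, BI tools, APIs) are not visible to this tool."
)


def infer_tier(schema: str, tier_prefixes=DEFAULT_TIER_PREFIXES) -> str:
    """Map a schema name to a tier; a bare prefix such as BA also matches (assumption)."""
    name = schema.upper()
    for tier, prefixes in tier_prefixes.items():
        if any(name.startswith(p) or name == p.rstrip("_") for p in prefixes):
            return tier
    return "unknown"


def classify_unused(qualified_names, exclude_schemas=()):
    """Tag each unused table with its inferred tier, skipping excluded schemas."""
    excluded = {s.upper() for s in exclude_schemas}
    rows = []
    for qualified in qualified_names:
        parts = qualified.split(".")
        schema = parts[-2] if len(parts) >= 2 else ""  # second-to-last part is the schema
        if schema.upper() in excluded:
            continue
        rows.append({"qualified": qualified, "tier": infer_tier(schema)})
    return rows
```

A caller would print `CLOSED_WORLD_CAVEAT` once after the rows, so the caveat appears regardless of which tiers are present.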
+---
+#### 10.2.5 [HIGH] `analyze impact` vs `find pattern` return inconsistent results for the same table
+For table `BA.WTFV_VOORRAAD_DAGSTAND_IGDC`, `analyze impact` returned only DDL files
+while `find pattern` additionally found an ETL `INSERT INTO` in `etl/sql/fact/`. The
+two commands should return overlapping or identical results for this query.
+**Root cause analysis**: `analyze impact` uses the `SELECTS_FROM` relationship to find
+queries that reference the table. If the ETL `INSERT INTO` statement was indexed as a
+query with `sources` pointing to `BA.WTFV_VOORRAAD_DAGSTAND_IGDC` via a `SELECTS_FROM`
+edge, it would appear in both commands. The inconsistency suggests one of:
+(a) The ETL INSERT statement was not indexed with the correct `SELECTS_FROM` edges (the
+    parser did not extract table sources for it), or
+(b) The INSERT was indexed under a different table name/qualified form than the DDL
+    table node, causing the relationship to point to a different node.
+**Resolution**: This is a parser correctness issue, not a CLI design issue. The fix
+is to verify that `INSERT INTO ... SELECT FROM <table>` correctly emits a `SELECTS_FROM`
+edge from the `SqlQuery` node to the `SqlTable` node. Add a test fixture with a known
+INSERT-SELECT pattern and assert that `analyze impact` finds it. The diagnostic step
+is: run `execute_cypher("MATCH (q:SqlQuery)-[:SELECTS_FROM]->(t:SqlTable {qualified:
+'BA.WTFV_VOORRAAD_DAGSTAND_IGDC'}) RETURN q.id LIMIT 10")` and compare to
+`find pattern "WTFV_VOORRAAD_DAGSTAND_IGDC"`.
+New finding: **the impact/pattern inconsistency is a measurable regression test gap**.
+The `_upsert_parsed_file` path must create `SELECTS_FROM` edges for INSERT statements,
+not only SELECT statements. Verify this is handled.
+---
+#### 10.2.6 [MEDIUM] Feedback loop has no false-negative path
+`submit_feedback` only fires when there are results to rate. The most valuable signal —
+empty result when the user expected one — is never collected. In the test session,
+`trace_column_lineage` was called 7 times (all returning empty) and 0 feedback samples
+were collected.
+Additionally, `execute_cypher` was the #1 MCP tool with 15 calls vs
+`trace_column_lineage` at 7. A high `execute_cypher`/`high-level-tool` ratio is a
+proxy signal that the high-level tools are failing and the LLM is falling back to raw
+Cypher. This ratio is not surfaced in `sqlcg gain`.
+**Resolution**:
+1. In `trace_column_lineage`, `get_downstream_dependencies`, and
+   `get_upstream_dependencies`: when the result `lineage` / `nodes` list is empty,
+   include a `hint` field in the returned model: "If you expected results, check that
+   `db info` shows SqlColumn > 0. Submit feedback with `submit_feedback` tool if this
+   was a false negative."
+2. In `submit_feedback`, add a `FN` (false negative) label to the allowed set alongside
+   `TP` and `FP`. Currently only `TP` and `FP` are valid labels.
+3. In `sqlcg gain`, add an `execute_cypher / total_calls` ratio section. A ratio above
+   0.3 (more than 30% of calls are raw Cypher) is a signal that the high-level
+   abstractions are not working.
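
The ratio in step 3 is simple arithmetic over per-tool call counts. The sketch below is illustrative only: `cypher_ratio` and the `Counter` input are assumptions about how `sqlcg gain` might hold its metrics, and the sample uses just the two counts quoted from the session (other tool calls, if any, are not listed in the issue).

```python
from collections import Counter

RAW_CYPHER_THRESHOLD = 0.3  # threshold proposed in the review


def cypher_ratio(call_counts: Counter) -> float:
    """Share of all MCP tool calls that were raw `execute_cypher` calls."""
    total = sum(call_counts.values())
    return call_counts["execute_cypher"] / total if total else 0.0


# The two counts reported for the test session.
session = Counter({"execute_cypher": 15, "trace_column_lineage": 7})
ratio = cypher_ratio(session)
if ratio > RAW_CYPHER_THRESHOLD:
    print(f"execute_cypher ratio {ratio:.0%} exceeds {RAW_CYPHER_THRESHOLD:.0%}: "
          "high-level tools may be failing")
```

With only these two tools counted, the ratio is 15/22, well above the 0.3 threshold.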
+---
+#### 10.2.7 [HIGH] Parser: noise statements in scripting blocks
+`ALTER WAREHOUSE IDENTIFIER($var) SET WAREHOUSE_SIZE = 'X-Large'` is a standard
+Snowflake DDL rebuild pattern that fails to parse and falls back to `exp.Command`.
+`CALL MA.MSSPR_*()` stored procedure calls similarly produce noise. These were the
+two observed instances, but scripting blocks can contain many other non-DML statement
+types: `SET variable = value`, `LET x := y`, `EXECUTE IMMEDIATE`, `RETURN`, `RAISE`,
+`OPEN/FETCH/CLOSE` (cursor ops), `TRUNCATE TABLE`, and any `CREATE/DROP/ALTER` inside
+a block. Enumerating them per-dialect is a maintenance trap.
+**Root cause**: `_EMBEDDED_DML` is a regex applied to the raw SQL text. It does not
+understand statement boundaries (semicolons inside string literals break it — finding
+3.12) and has no way to classify what it captures. Any statement fragment that leaks
+through gets handed to `sqlglot.parse()` and produces a noisy `exp.Command` error.
+**Resolution**: Replace the `_EMBEDDED_DML` regex extraction with a two-step
+sqlglot-native approach in `SnowflakeParser._parse_scripting_file`:
+1. **Split on statement boundaries** using `sqlglot.tokenize()` + semicolon tokens.
+   The tokenizer handles semicolons inside string literals correctly — no regex needed.
+2. **Classify each chunk** using sqlglot's built-in `exp.DML` base class:
+   ```python
+   isinstance(stmt, (exp.DML, exp.Select)) and not isinstance(stmt, exp.Copy)
+   ```
+   `exp.DML` is a sqlglot base class that covers `Delete`, `Insert`, `Update`, and
+   `Merge` in a single isinstance check — no per-statement enumeration needed.
+   `exp.Select` is added separately (it inherits from `Query`, not `DML`).
+   `exp.Copy` (Snowflake `COPY INTO`) is excluded because its source is a file stage,
+   not a table — no table lineage to extract.
+   Everything else — `exp.Command`, `exp.DDL`, utility statements — is dropped and
+   logged at DEBUG level, not WARNING.
+**DML coverage after this fix:**
+| Statement | sqlglot type | Included |
+|---|---|---|
+| SELECT | `exp.Select` | Yes |
+| INSERT INTO … SELECT / VALUES | `exp.Insert` (via `exp.DML`) | Yes |
+| UPDATE | `exp.Update` (via `exp.DML`) | Yes |
+| DELETE | `exp.Delete` (via `exp.DML`) | Yes |
+| MERGE INTO | `exp.Merge` (via `exp.DML`) | Yes — was missing from original proposal |
+| COPY INTO (Snowflake) | `exp.Copy` (via `exp.DML`) | No — stage source, not a table |
+| CREATE TABLE AS SELECT | `exp.Create` (via `exp.DDL`) | Handled separately (see below) |
+| ALTER WAREHOUSE, CALL, SET, LET, … | `exp.Command` or specific DDL | Dropped at DEBUG |
+**CTAS**: `CREATE TABLE AS SELECT` produces `exp.Create`, not `exp.DML`. The existing
+`AnsiParser._parse_statement` already handles CTAS by detecting a SELECT inside a
+Create node. In the scripting-block path, pass `exp.Create` instances through to
+`_parse_statement` rather than dropping them — it will handle them or ignore them
+cleanly.
+**Procedure bodies — dialect differences:**
+The tokenizer approach above handles scripting-block files (non-procedure files with
+`BEGIN` blocks) correctly across all dialects. Procedure bodies are different:
+| Dialect | How sqlglot exposes the body | Extractable? |
+|---|---|---|
+| Snowflake | `exp.RawString` node inside `exp.Create` | Yes — extract, strip `BEGIN`/`END`, re-apply step 1–2 |
+| Databricks | `exp.RawString` (same `$$` delimiter) | Yes — same as Snowflake |
+| BigQuery | `exp.Command` inside `exp.Create` | No — body is opaque; regex fallback same as today |
+For Snowflake/Databricks: detect `exp.Create` nodes that contain an `exp.RawString`
+body, extract the string, strip the outer `BEGIN ... END` wrapper, then run the
+tokenizer-split + DML filter on the inner content recursively. For BigQuery: leave the
+current `_EMBEDDED_DML` regex path as a fallback for `exp.Command` bodies only — do
+not regress existing behaviour.
+**Files**: `src/sqlcg/parsers/snowflake_parser.py` — remove `_EMBEDDED_DML` regex,
+rewrite `_parse_scripting_file`. No schema or MCP impact. Test fixtures required:
+- Scripting block with `ALTER WAREHOUSE`, `CALL`, `MERGE INTO`, and `INSERT INTO … SELECT` — assert only MERGE and INSERT are indexed
+- Snowflake procedure with embedded DML — assert body DML is indexed
+- BigQuery procedure — assert existing regex fallback still extracts DML
+---
+#### 10.2.8 [MEDIUM] Missing persistent index config (`.sqlcg.toml`)
+After a `db reset` or fresh clone there is no way to replay the index without
+remembering the original `sqlcg index <path> --dialect <dialect>` invocation. The
+git hook calls `sqlcg index --dialect auto --quiet` which reads `.sqlcg.toml` if it
+exists, but this file is never created by any `sqlcg` command.
+**Current state**: The `get_dialect()` helper in `config.py` reads `.sqlcg.toml`
+(via `[sqlcg] dialect = "..."`) if present, but `sqlcg index` never writes this file.
+A user must create `.sqlcg.toml` manually.
+**Resolution**: `sqlcg index <path> --dialect <dialect>` should write (or update) a
+`.sqlcg.toml` file at the root of the indexed path on first successful index. The
+write should be idempotent (do not overwrite if the file already exists with a matching
+dialect). This makes the git hook self-healing after a `db reset` — the hook will read
+the dialect from `.sqlcg.toml` without the user needing to remember the original
+invocation.
+Spec for `.sqlcg.toml`:
+```toml
+[sqlcg]
+path = "/absolute/path/to/repo"
+dialect = "snowflake"
+```
+The `path` field allows `sqlcg index` (with no arguments) to infer the target
+directory from the config file when run from the repo root.
+**Risk**: Low — no breaking changes. The file should be added to `.gitignore` templates
+in the README since it contains an absolute path that is machine-specific.
+---
+#### 10.2.9 [LOW] No progress feedback during `sqlcg index`
+`sqlcg index` on 1457 files took ~17 seconds with zero stdout output until the final
+summary line appeared. The `edges: 0` signal (indicating the graph has no relationships
+and lineage tracing will return nothing) appeared only in the raw log, never surfaced
+in `db info`.
+**Resolution**:
+1. Add a simple periodic progress line to `Indexer.index_repo()`: print
+   `"INFO: indexed N/total files..."` every 100 files (or every 5 seconds on a timer).
+   Use `rich.progress` if available, or a plain `console.print` with carriage return.
+2. After indexing, if `lineage_edges_created == 0`, print a yellow warning to stdout
+   (not just the log): "Warning: 0 lineage edges created. Column-level lineage tracing
+   will not be available. Check parse errors above."
+3. Surface `lineage_edges_created` (renamed to `edges`) in `db info` as an explicit
+   field. A graph with zero edges cannot answer any lineage query, and this must be
+   immediately visible without reading the raw index log.
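
Steps 1 and 2 can be sketched as a thin loop around whatever per-file indexing call exists. Everything here is hypothetical scaffolding: `index_one` stands in for the real per-file indexer and is assumed to return the number of lineage edges it created, and `emit` stands in for the console.

```python
def index_with_progress(files, index_one, every=100, emit=print):
    """Index files in order, emitting a progress line every `every` files.

    `index_one(path)` is a stand-in for the real per-file indexer and must
    return the number of lineage edges it created for that file.
    """
    total = len(files)
    edges = 0
    for n, path in enumerate(files, start=1):
        edges += index_one(path)
        if n % every == 0 or n == total:
            emit(f"INFO: indexed {n}/{total} files...")
    if edges == 0:
        emit("Warning: 0 lineage edges created. Column-level lineage tracing "
             "will not be available. Check parse errors above.")
    return edges
```

A time-based trigger (every 5 seconds) or a `rich.progress` bar could replace the modulo check without changing the zero-edge warning.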
+---
+#### 10.2.10 [LOW] `uvx` cold-start is invisible (6.5 minutes with no output)
+The MCP server config (`"command": "uvx"`) means Claude Code pays a `uvx` download
+cost on first run (and potentially on each session if the `uvx` cache is cold). In
+the test session this took 6.5 minutes with zero output, indistinguishable from a hang.
+**Resolution**: Documentation-only change. The README `QUICK START` section should:
+- Recommend `uv tool install sql-code-graph` for persistent local installs
+- Reserve `uvx sql-code-graph` for one-shot tries
+- Note that first run will be slow due to package download
+- The `sqlcg install` confirmation message should note: "Using uvx — first MCP startup
+  may take several minutes on a cold cache. Run `uv tool install sql-code-graph` for
+  faster startup."
+---
+### 10.3 Cross-Cutting Findings from Issue Analysis
+**Finding 10.B — LLM agent is a first-class user of this tool but the tool was not designed for it**
+The issue session revealed a consistent pattern: the tool produces correct outputs but
+does not communicate machine-readable semantics about its own state. A human user who
+gets "No results" knows to check if the database was indexed. An LLM agent does not.
+Every tool that can return empty due to a missing prerequisite (unindexed graph, zero
+column nodes, schema mismatch) must include a `hint` field in the returned model so
+Claude can self-diagnose without falling back to `execute_cypher`.
+This is the root cause of the `execute_cypher`-dominance pattern (15 calls vs 7 for
+`trace_column_lineage`): Claude fell back to raw Cypher because the high-level tools
+returned empty with no explanation.
+**Action**: Add a `hint: str | None` field to `LineageResult`, `DependencyResult`,
+`TableUsageResult`, and `SqlPatternResult` models. Populate it when the result list
+is empty with a diagnostic string. This is a backward-compatible model extension
+(new optional field).
+**Finding 10.C — Index success rate is misleading**
+The index reported 100% success rate with 900KB of parse warnings and `SqlColumn: 0`.
+"Success" in the current codebase means "the file was processed without raising an
+unhandled exception." It does not mean "lineage was extracted." A file that falls back
+to `exp.Command` (scripting block) is counted as "success" even though it produced
+zero edges.
+This is related to finding 3.4 (bare `except Exception: continue`) — it means errors
+are swallowed and the success rate is a vanity metric.
+**Action**: Introduce a `parse_quality` metric in `ParsedFile`:
+- `full`: sqlglot parsed fully, column lineage extracted
+- `table_only`: tables extracted, column lineage not available (schema required or
+  `SELECT *`)
+- `scripting_fallback`: fell back to regex extraction, confidence 0.3
+- `failed`: unhandled exception, no lineage
+Report these four categories in the `index` summary line and in `db info`. This
+replaces "parse_errors" (which is binary) with a quality breakdown that accurately
+reflects what Claude can and cannot answer.
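
One way to model the four categories and the summary line is sketched below. The enum values come from the list above; the class name, the `summarize` helper, and the summary format are assumptions. A `(str, Enum)` mixin is used so the sketch also runs on interpreters older than 3.11; `StrEnum` would be the closer match to the `QueryKind` suggestion elsewhere in this review.

```python
from collections import Counter
from enum import Enum


class ParseQuality(str, Enum):
    FULL = "full"                              # parsed fully, column lineage extracted
    TABLE_ONLY = "table_only"                  # tables extracted, no column lineage
    SCRIPTING_FALLBACK = "scripting_fallback"  # regex extraction, confidence 0.3
    FAILED = "failed"                          # unhandled exception, no lineage


def summarize(qualities) -> str:
    """Render a per-category breakdown for the `index` summary line."""
    counts = Counter(qualities)
    return ", ".join(f"{q.value}: {counts[q]}" for q in ParseQuality)
```

For example, `summarize([ParseQuality.FULL, ParseQuality.FAILED])` yields a line that lists all four categories, including the zero counts, so a run dominated by `scripting_fallback` is visible at a glance.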
+---
+### 10.4 Prioritized Implementation Plan — Issues #5 and #6
+Ranked by impact (breadth of user pain) relative to implementation effort, so that
+low-effort fixes rank higher.
+| Rank | Issue | Finding | File(s) | Action | Effort | Impact |
+|------|-------|---------|---------|--------|--------|--------|
+| 1 | #5.1 | 10.2.2 | `cli/main.py` | Add `QUICK START` block to top-level `--help` with ordered steps; update `NotIndexedError` message to include step hint | XS | CRITICAL |
+| 2 | #5.2 | 10.2.3 | `cli/commands/db.py` | Add health check interpretation to `db info` output (red/yellow warnings for empty Repo, zero SqlQuery, zero SqlColumn) | XS | HIGH |
+| 3 | #6 | 10.A, 10.1 | `cli/commands/uninstall.py` (new) | Implement `sqlcg uninstall`: remove MCP entry, delete `~/.sqlcg/`, strip git hook; `--keep-db` flag; register in `cli/main.py` | S | HIGH |
+| 4 | #5.2 | 10.2.9 | `indexer/indexer.py`, `cli/commands/index.py` | Add periodic progress output; surface `edges: 0` warning on stdout; add `edges` to `db info` | S | HIGH |
+| 5 | #5.2 | 10.B | `server/models.py`, `server/tools.py` | Add `hint: str \| None` to `LineageResult`, `DependencyResult`, `TableUsageResult`; populate on empty returns with diagnostic message | S | HIGH |
+| 6 | #5.2 | 10.2.8 | `cli/commands/index.py`, `core/config.py` | Write `.sqlcg.toml` on successful index (idempotent); update `get_dialect()` to also read `path` from toml; allow `sqlcg index` with no args | S | MEDIUM |
+| 7 | #5.2 | 10.2.4 | `cli/commands/analyze.py` | Add `--exclude-schema` to `analyze unused`; always append closed-world caveat to output | S | MEDIUM |
+| 8 | #5.2 | 10.2.6 | `server/tools.py`, `metrics/store.py` | Add `FN` label to `submit_feedback`; add `execute_cypher` ratio to `sqlcg gain` output | S | MEDIUM |
+| 9 | #5.2 | 10.2.7 | `parsers/snowflake_parser.py` | Replace `_EMBEDDED_DML` regex with sqlglot tokenizer-split + `isinstance(stmt, (exp.DML, exp.Select)) and not isinstance(stmt, exp.Copy)` filter; handle Snowflake/Databricks procedure bodies via `exp.RawString` extraction; keep regex fallback for BigQuery `exp.Command` bodies; 3 test fixtures required | M | HIGH |
+| 10 | #5.2 | 10.2.5 | `indexer/indexer.py`, `parsers/base.py` | Verify `SELECTS_FROM` edges are created for INSERT-SELECT queries; add test fixture asserting `analyze impact` finds ETL INSERT — **elevated to HIGH (confirmed parser bug, see 10.6 Q4)** | M | HIGH |
+| 11 | #5.2 | 10.C | `indexer/indexer.py`, `parsers/base.py`, `cli/commands/index.py` | Introduce `parse_quality` breakdown (full / table_only / scripting_fallback / failed); surface in index summary and `db info` | M | MEDIUM |
+| 12 | #5.2 | 10.2.1 | `server/tools.py`, `cli/commands/mcp.py` | Add binary/package name note to `mcp setup` output and `index_repo`/`list_dialects_and_repos` tool docstrings | XS | LOW |
+| 13 | #5.2 | 10.2.10 | `README.md` / install docs | Add `uv tool install` recommendation; note cold-start latency in `sqlcg install` confirmation message | XS | LOW |
+Note (2026-05-05): Ranks 3, 6, 8, and 10 were blocked on open questions. All four
+are now resolved — see section 10.6. Rank 10 was elevated from MEDIUM to HIGH after
+the user confirmed that `analyze impact` must include ETL INSERT statements (confirmed
+parser bug, not a design constraint).
+---
+### 10.5 Impact on Existing Architecture Review Sections
+**Finding 3.4 amplified (10.C)**: The bare `except Exception: continue` in
+`_extract_column_lineage` was already a HIGH finding. Issue #5 confirms its real-world
+consequence: a 1457-file corpus reports 100% success with `SqlColumn: 0`. This
+elevates finding 3.4 from a theoretical correctness concern to a confirmed user-visible
+failure mode. The `parse_quality` breakdown (rank 11 above) is the correct systemic fix.
+**Finding 5.2 (wizard.py unspec'd) is now a confirmed scope-creep risk**: The wizard
+was never mentioned in the test session; the LLM relied on `--help` instead. This
+confirms that the QUICK START in `--help` (rank 1 above) is the highest-value
+onboarding investment, not a GUI wizard.
+**`analyze unused` (finding 5.3 / JOINS) extended**: The `analyze unused` false positive
+issue (#5.3) confirms finding 5.3 from the original review — the graph uses a
+closed-world assumption that must be made explicit to users. The `--exclude-schema` flag
+and caveat text address this without changing the graph schema.
+**`analyze impact` vs `find pattern` inconsistency (10.2.5)** is a new finding not
+previously identified in this review. It points to a potential gap in `_upsert_parsed_file`:
+`SELECTS_FROM` edges may not be created for all query kinds that reference tables as
+sources. This requires a developer investigation before a fix can be scoped.
+---
+### 10.6 Open Questions from Issue Analysis — RESOLVED (2026-05-05)
+All four open questions were resolved by user comments on the GitHub issues.
+---
+**Q1 — `sqlcg uninstall` interaction model: RESOLVED**
+Decision: prompt the user before deleting the graph database; support `--force` to
+skip the prompt. DB deletion is only offered when the database is local (KùzuDB
+embedded at `~/.sqlcg/` or `SQLCG_DB_PATH`). Neo4j remote backends are never
+deleted by `sqlcg uninstall` — only the MCP registration and git hook are removed.
+Concrete interaction contract for `cli/commands/uninstall.py`:
+- Step 1: always remove `mcpServers["sql-code-graph"]` from `~/.claude/settings.json`
+  (atomic `.tmp` + `os.replace`, same guard as install).
+- Step 2: if the active backend is local KùzuDB, prompt:
+  "This will delete the graph database at `<path>`. Continue? [y/N]"
+  If `--force` is passed, skip the prompt and proceed. If `--keep-db` is passed,
+  skip both prompt and deletion.
+- Step 3: remove the `# sqlcg post-checkout hook` sentinel block from
+  `.git/hooks/post-checkout` in `Path.cwd()` (or `--repo <path>`). Delete the hook
+  file if it becomes empty after stripping.
+- Print a confirmation line per step taken.
+The `--force` flag also deletes the SQLite metrics store (same `~/.sqlcg/` path)
+per the user's comment "maybe also delete the database (graph + sqlite) when run
+with --force."
+Impact on implementation plan: rank 3 (`uninstall.py`) spec is now complete. No
+further design questions.
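Steps 1 and 2 of the contract above can be sketched as follows. The function names, the `confirm` callback, and the rule that `--keep-db` wins when combined with `--force` are assumptions made for illustration; only the `.tmp` + `os.replace` pattern and the `mcpServers["sql-code-graph"]` key come from the spec.

```python
import json
import os
from pathlib import Path
from typing import Callable


def remove_mcp_entry(settings_path: Path, server_name: str = "sql-code-graph") -> bool:
    """Step 1: drop the MCP registration, writing via .tmp + os.replace."""
    if not settings_path.exists():
        return False
    settings = json.loads(settings_path.read_text(encoding="utf-8"))
    servers = settings.get("mcpServers", {})
    if server_name not in servers:
        return False
    del servers[server_name]
    # Write the new content next to the target, then swap it in one step.
    tmp_path = settings_path.with_name(settings_path.name + ".tmp")
    tmp_path.write_text(json.dumps(settings, indent=2), encoding="utf-8")
    os.replace(tmp_path, settings_path)
    return True


def should_delete_db(
    is_local: bool, force: bool, keep_db: bool, confirm: Callable[[], bool]
) -> bool:
    """Step 2: decide whether the graph database gets deleted."""
    if not is_local or keep_db:
        # Remote backends are never deleted; --keep-db skips prompt and deletion.
        return False
    if force:
        # --force skips the prompt and proceeds.
        return True
    # Otherwise ask the user; the prompt defaults to "no" ([y/N]).
    return confirm()
```

Passing the prompt in as a callable keeps the decision logic testable without a terminal; a CLI wrapper would supply a function that prints the `[y/N]` question and reads the answer.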
+---
+**Q2 — `.sqlcg.toml` committing policy: RESOLVED**
+Decision: make `path` optional in `.sqlcg.toml`; resolve to `cwd` when absent.
+The file is therefore committable (no machine-specific absolute path).
+Revised spec for `.sqlcg.toml`:
+```toml
+[sqlcg]
+dialect = "snowflake"
+# path is optional; defaults to the directory containing this file (cwd at index time)
+```
+`sqlcg index <path> --dialect <dialect>` writes this file at the root of the indexed
+path. If `path` equals `cwd`, the `path` key is omitted entirely. `sqlcg index`
+with no arguments reads `.sqlcg.toml` from cwd and uses `dialect` from it; `path`
+defaults to cwd.
+The file should NOT be added to `.gitignore`. Teams can commit it to share the
+dialect config. The README should note that `.sqlcg.toml` without a `path` key is
+safe to commit; a `path` key (if the user adds one manually for a non-cwd root)
+is machine-specific and should be gitignored.
+Impact on implementation plan: rank 6 (`.sqlcg.toml` write) spec is now complete.
+---
+**Q3 — `FN` label for `submit_feedback`: RESOLVED**
+Decision: yes, add `FN` as a valid label. The MetricsStore schema change is
+acceptable — the project has a no-backward-compatibility policy ("WE DON'T NEED TO
+KEEP A VERSION, WE WILL BREAK THINGS CAUSE THE PACKAGE IS LIVE"). Schema migrations
+are handled by re-initialising the SQLite metrics store on startup if the schema
+version changes.
+Implementation note: add `FN` to the `FeedbackLabel` enum (or string literal set)
+in `metrics/store.py` and update the `submit_feedback` tool docstring and any
+validation that guards the label field. No migration script is needed — the store
+is a local append-only log; old records with `TP`/`FP` labels remain valid.
+Impact on implementation plan: rank 8 (`submit_feedback` `FN` label) is unblocked.
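A small sketch of the label validation, assuming the labels are modelled as an enum and stored as plain text in SQLite. The table layout and the `init_store` and `submit_feedback` signatures are hypothetical; the real `metrics/store.py` may be organised differently.

```python
import sqlite3
from enum import Enum


class FeedbackLabel(str, Enum):
    """Valid feedback labels; FN is the new addition."""

    TP = "TP"
    FP = "FP"
    FN = "FN"


def init_store(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) feedback table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feedback (tool TEXT NOT NULL, label TEXT NOT NULL)"
    )
    conn.commit()


def submit_feedback(conn: sqlite3.Connection, tool: str, label: str) -> None:
    """Validate the label against the enum, then append one record."""
    validated = FeedbackLabel(label)  # raises ValueError for unknown labels
    conn.execute(
        "INSERT INTO feedback (tool, label) VALUES (?, ?)", (tool, validated.value)
    )
    conn.commit()
```

Since the label column is free text in this sketch, existing `TP`/`FP` rows stay readable and only the validation set grows.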
+---
+**Q4 — `analyze impact` expected scope: RESOLVED**
+Decision: `analyze impact` must include ETL INSERT statements that use the table as
+a source. The inconsistency reported in issue #5 ("`analyze impact` returned only
+DDL files while `find pattern` additionally found the ETL INSERT") is confirmed as a
+parser bug, not an intentional design constraint. The user's issue description states
+"the two commands should return consistent results."
+Implication: `analyze impact` is not DDL-only. Its scope is all `SqlQuery` nodes
+that reference the target table via `SELECTS_FROM` edges — SELECT, INSERT-SELECT,
+CTAS, MERGE, and UPDATE-FROM all qualify. The bug is that `_upsert_parsed_file` may
+not create `SELECTS_FROM` edges for INSERT statements. This must be verified and
+fixed.
+Priority escalation: the inconsistency (rank 10 in section 10.4) is now confirmed
+as a parser bug and should be treated as HIGH priority. Schedule it alongside rank 4
+and the `hint` field addition (rank 5); the table in 10.4 keeps it at position 10
+with its impact marked HIGH. The regression test fixture
+(`tests/fixtures/synthetic/etl_chain.sql` or a new `insert_select_impact.sql`)
+is now a mandatory deliverable for the developer.
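The expected behaviour can be pinned down with a toy in-memory model before touching the graph. The `Query` dataclass and `impacted_files` function below are hypothetical stand-ins for `SqlQuery` nodes and the `analyze impact` traversal; they only illustrate that the result must depend on `SELECTS_FROM` edges and not on the query kind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Toy stand-in for a SqlQuery node and its SELECTS_FROM edges."""

    file: str
    kind: str                # e.g. "SELECT", "INSERT", "CTAS", "MERGE"
    selects_from: frozenset  # source tables (SELECTS_FROM edge targets)


def impacted_files(queries: list[Query], table: str) -> list[str]:
    """Return every file whose query reads from `table`, whatever its kind."""
    return sorted({q.file for q in queries if table in q.selects_from})


queries = [
    Query("ddl/orders_v.sql", "CTAS", frozenset({"raw.orders"})),
    Query("etl/load_orders.sql", "INSERT", frozenset({"raw.orders"})),
    Query("etl/load_users.sql", "INSERT", frozenset({"raw.users"})),
]
```

A regression test against the real parser would assert the same thing on the fixture file: the INSERT-SELECT that reads the target table appears in the impact result next to the DDL files.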
+---
+### 10.7 Policy Note — No Backward Compatibility Constraint
+The user confirmed: "WE DON'T NEED TO KEEP A VERSION, WE WILL BREAK THINGS CAUSE
+THE PACKAGE IS LIVE."
+This policy applies to: MetricsStore SQLite schema, KùzuDB graph schema, MCP tool
+response model shapes, `.sqlcg.toml` format, and the `pyproject.toml` dependency
+pins. Breaking changes between releases are acceptable without a deprecation cycle.
+The architect-reviewer notes this means:
+- Finding 3.6 (`QueryNode` mutability) can be fixed with `frozen=True` without a
+  migration path for serialised graph data (re-index is the migration).
+- The `FN` label schema change (Q3) needs no migration script.
+- The `parse_quality` breakdown (finding 10.C) can replace the existing `parse_errors`
+  field in `ParsedFile` without an adapter layer.
+- The `hint` field addition to result models (finding 10.B) can be non-optional if
+  desired, though keeping it `str | None` is still the cleaner design.
