InfoTracker 0.2.3.tar.gz → 0.2.5.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {infotracker-0.2.3 → infotracker-0.2.5}/PKG-INFO +1 -1
- {infotracker-0.2.3 → infotracker-0.2.5}/pyproject.toml +10 -2
- infotracker-0.2.3/ProjectDescription.md +0 -82
- infotracker-0.2.3/docs/adapters.md +0 -117
- infotracker-0.2.3/docs/advanced_use_cases.md +0 -294
- infotracker-0.2.3/docs/agentic_workflow.md +0 -134
- infotracker-0.2.3/docs/algorithm.md +0 -109
- infotracker-0.2.3/docs/architecture.md +0 -35
- infotracker-0.2.3/docs/breaking_changes.md +0 -291
- infotracker-0.2.3/docs/cli_usage.md +0 -133
- infotracker-0.2.3/docs/configuration.md +0 -38
- infotracker-0.2.3/docs/dbt_integration.md +0 -144
- infotracker-0.2.3/docs/edge_cases.md +0 -102
- infotracker-0.2.3/docs/example_dataset.md +0 -134
- infotracker-0.2.3/docs/faq.md +0 -23
- infotracker-0.2.3/docs/lineage_concepts.md +0 -410
- infotracker-0.2.3/docs/openlineage_mapping.md +0 -45
- infotracker-0.2.3/docs/overview.md +0 -123
- infotracker-0.2.3/examples/warehouse/lineage/01_customers.json +0 -25
- infotracker-0.2.3/examples/warehouse/lineage/02_orders.json +0 -25
- infotracker-0.2.3/examples/warehouse/lineage/03_products.json +0 -25
- infotracker-0.2.3/examples/warehouse/lineage/04_order_items.json +0 -27
- infotracker-0.2.3/examples/warehouse/lineage/10_stg_orders.json +0 -43
- infotracker-0.2.3/examples/warehouse/lineage/11_stg_order_items.json +0 -55
- infotracker-0.2.3/examples/warehouse/lineage/12_stg_customers.json +0 -48
- infotracker-0.2.3/examples/warehouse/lineage/20_vw_recent_orders.json +0 -38
- infotracker-0.2.3/examples/warehouse/lineage/30_dim_customer.json +0 -43
- infotracker-0.2.3/examples/warehouse/lineage/31_dim_product.json +0 -43
- infotracker-0.2.3/examples/warehouse/lineage/40_fct_sales.json +0 -59
- infotracker-0.2.3/examples/warehouse/lineage/41_agg_sales_by_day.json +0 -39
- infotracker-0.2.3/examples/warehouse/lineage/50_vw_orders_all.json +0 -25
- infotracker-0.2.3/examples/warehouse/lineage/51_vw_orders_all_enriched.json +0 -26
- infotracker-0.2.3/examples/warehouse/lineage/52_vw_order_details_star.json +0 -31
- infotracker-0.2.3/examples/warehouse/lineage/53_vw_products_all.json +0 -25
- infotracker-0.2.3/examples/warehouse/lineage/54_vw_recent_orders_star_cte.json +0 -25
- infotracker-0.2.3/examples/warehouse/lineage/55_vw_orders_shipped_or_delivered.json +0 -25
- infotracker-0.2.3/examples/warehouse/lineage/56_vw_orders_union_star.json +0 -40
- infotracker-0.2.3/examples/warehouse/lineage/60_vw_customer_order_analysis.json +0 -149
- infotracker-0.2.3/examples/warehouse/lineage/90_usp_refresh_sales_with_temp.json +0 -57
- infotracker-0.2.3/examples/warehouse/lineage/91_usp_snapshot_recent_orders_star.json +0 -26
- infotracker-0.2.3/examples/warehouse/lineage/92_usp_rebuild_recent_sales_with_vars.json +0 -52
- infotracker-0.2.3/examples/warehouse/lineage/93_usp_top_products_since_var.json +0 -42
- infotracker-0.2.3/examples/warehouse/lineage/94_fn_customer_orders_tvf.json +0 -137
- infotracker-0.2.3/examples/warehouse/lineage/95_usp_customer_metrics_dataset.json +0 -96
- infotracker-0.2.3/examples/warehouse/lineage/96_demo_usage_tvf_and_proc.json +0 -98
- infotracker-0.2.3/examples/warehouse/sql/01_customers.sql +0 -6
- infotracker-0.2.3/examples/warehouse/sql/02_orders.sql +0 -6
- infotracker-0.2.3/examples/warehouse/sql/03_products.sql +0 -6
- infotracker-0.2.3/examples/warehouse/sql/04_order_items.sql +0 -8
- infotracker-0.2.3/examples/warehouse/sql/10_stg_orders.sql +0 -7
- infotracker-0.2.3/examples/warehouse/sql/11_stg_order_items.sql +0 -9
- infotracker-0.2.3/examples/warehouse/sql/12_stg_customers.sql +0 -10
- infotracker-0.2.3/examples/warehouse/sql/20_vw_recent_orders.sql +0 -14
- infotracker-0.2.3/examples/warehouse/sql/30_dim_customer.sql +0 -7
- infotracker-0.2.3/examples/warehouse/sql/31_dim_product.sql +0 -7
- infotracker-0.2.3/examples/warehouse/sql/40_fct_sales.sql +0 -12
- infotracker-0.2.3/examples/warehouse/sql/41_agg_sales_by_day.sql +0 -9
- infotracker-0.2.3/examples/warehouse/sql/50_vw_orders_all.sql +0 -3
- infotracker-0.2.3/examples/warehouse/sql/51_vw_orders_all_enriched.sql +0 -5
- infotracker-0.2.3/examples/warehouse/sql/52_vw_order_details_star.sql +0 -9
- infotracker-0.2.3/examples/warehouse/sql/53_vw_products_all.sql +0 -3
- infotracker-0.2.3/examples/warehouse/sql/54_vw_recent_orders_star_cte.sql +0 -7
- infotracker-0.2.3/examples/warehouse/sql/55_vw_orders_shipped_or_delivered.sql +0 -4
- infotracker-0.2.3/examples/warehouse/sql/56_vw_orders_union_star.sql +0 -4
- infotracker-0.2.3/examples/warehouse/sql/60_vw_customer_order_analysis.sql +0 -12
- infotracker-0.2.3/examples/warehouse/sql/60_vw_customer_order_ranking.sql +0 -11
- infotracker-0.2.3/examples/warehouse/sql/61_vw_sales_analytics.sql +0 -11
- infotracker-0.2.3/examples/warehouse/sql/90_usp_refresh_sales_with_temp.sql +0 -53
- infotracker-0.2.3/examples/warehouse/sql/91_usp_snapshot_recent_orders_star.sql +0 -23
- infotracker-0.2.3/examples/warehouse/sql/92_usp_rebuild_recent_sales_with_vars.sql +0 -48
- infotracker-0.2.3/examples/warehouse/sql/93_usp_top_products_since_var.sql +0 -41
- infotracker-0.2.3/examples/warehouse/sql/94_fn_customer_orders_tvf.sql +0 -78
- infotracker-0.2.3/examples/warehouse/sql/95_usp_customer_metrics_dataset.sql +0 -71
- infotracker-0.2.3/examples/warehouse/sql/96_demo_usage_tvf_and_proc.sql +0 -91
- infotracker-0.2.3/tests/__init__.py +0 -21
- infotracker-0.2.3/tests/conftest.py +0 -60
- infotracker-0.2.3/tests/test_adapter.py +0 -150
- infotracker-0.2.3/tests/test_expected_outputs.py +0 -228
- infotracker-0.2.3/tests/test_impact_output.py +0 -89
- infotracker-0.2.3/tests/test_insert_exec.py +0 -75
- infotracker-0.2.3/tests/test_integration.py +0 -148
- infotracker-0.2.3/tests/test_parser.py +0 -145
- infotracker-0.2.3/tests/test_preprocessing.py +0 -157
- infotracker-0.2.3/tests/test_wildcard.py +0 -137
- {infotracker-0.2.3 → infotracker-0.2.5}/.gitignore +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/MANIFEST.in +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/README.md +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/infotracker.yml +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/requirements.txt +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/src/infotracker/__init__.py +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/src/infotracker/__main__.py +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/src/infotracker/adapters.py +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/src/infotracker/cli.py +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/src/infotracker/config.py +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/src/infotracker/diff.py +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/src/infotracker/engine.py +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/src/infotracker/lineage.py +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/src/infotracker/models.py +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/src/infotracker/openlineage_utils.py +0 -0
- {infotracker-0.2.3 → infotracker-0.2.5}/src/infotracker/parser.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: InfoTracker
-Version: 0.2.3
+Version: 0.2.5
 Summary: Column-level SQL lineage, impact analysis, and breaking-change detection (MS SQL first)
 Project-URL: homepage, https://example.com/infotracker
 Project-URL: documentation, https://example.com/infotracker/docs
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "InfoTracker"
-version = "0.2.3"
+version = "0.2.5"
 description = "Column-level SQL lineage, impact analysis, and breaking-change detection (MS SQL first)"
 readme = "README.md"
 requires-python = ">=3.10"
@@ -44,8 +44,16 @@ exclude = [
   "src/**/tests",
   "tests/",
   "tests/**",
+  "docs", "examples",
+  ".github", ".git", ".venv", ".gitignore","ProjectDescription.md",
+]
+
+[tool.hatch.build.targets.sdist]
+exclude = [
+  "tests", "tests/**",
+  "docs", "examples",
+  ".github", ".git", ".venv", ".gitignore","ProjectDescription.md",
 ]
-only-include = ["src/infotracker"]
 
 [tool.ruff]
 line-length = 100
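The hunk above splits the exclude list into wheel and sdist targets so that `docs/`, `examples/`, and `tests/` no longer ship in the source distribution. A quick way to check the effect is to list the members of a built archive; the sketch below is an illustrative stdlib-only helper, not part of InfoTracker, and the excluded prefixes are assumptions taken from the diff.

```python
# Hypothetical check: report sdist members that fall under a prefix the
# new [tool.hatch.build.targets.sdist] exclude block is meant to drop.
import tarfile

# Prefixes assumed from the pyproject.toml diff above.
EXCLUDED_PREFIXES = ("docs/", "examples/", "tests/", ".github/")


def leaked_members(sdist_path: str) -> list[str]:
    """Return archive members that fall under an excluded prefix."""
    leaked = []
    with tarfile.open(sdist_path, "r:gz") as tar:
        for member in tar.getnames():
            # sdist members are rooted at "<name>-<version>/", so strip
            # the first path component before matching.
            _, _, rest = member.partition("/")
            if rest.startswith(EXCLUDED_PREFIXES):
                leaked.append(member)
    return leaked
```

Running it against a freshly built archive should return an empty list once the new excludes take effect.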
@@ -1,82 +0,0 @@
-### InfoTracker — Student Brief
-
-#### Welcome to the Data Dungeon (friendly, no actual monsters)
-You are the hero. Your quest: teach a tool to read SQL scrolls and tell true stories about columns. You’ll map where data comes from, spot traps (breaking changes), and keep the kingdom’s dashboards happy.
-
-- Your gear: a CLI, example SQLs, and clear docs
-- Your allies: adapters, lineage graphs, and CI checks
-- Your enemies: sneaky `SELECT *`, UNION goblins, and 3 a.m. alerts
-- Goal: green checks, clear diffs, and no broken charts
-
-If you get stuck, it’s normal. Take a sip of tea, re-read the step, try a smaller example.
-
-### For Beginners
-If you're new to SQL, try this free tutorial: [Khan Academy SQL](https://www.khanacademy.org/computing/computer-programming/sql). It's in simple English and has exercises.
-
-### Action plan (read and build in this order)
-1) Understand the goal and scope (1 hour)
-- Overview: [docs/overview.md](docs/overview.md)
-- What you’re building, supported features, and what’s out of scope. Keep this open as your north star.
-
-2) Learn column-level lineage basics (1-2 hours)
-- Concepts: [docs/lineage_concepts.md](docs/lineage_concepts.md)
-- Visual examples showing how each output column maps back to inputs (joins, transforms, aggregations). This informs how your extractor must reason.
-
-3) Explore the example dataset (your training corpus)
-- Dataset map: [docs/example_dataset.md](docs/example_dataset.md)
-- Where the SQL files live, what each file represents (tables, views, CTEs, procs), and the matching OpenLineage JSON expectations you must reproduce.
-
-4) Implement the algorithm incrementally
-- Algorithm: [docs/algorithm.md](docs/algorithm.md)
-- Steps: parse → object graph → schema resolution (expand `*` late) → column lineage extraction → impact graph → outputs.
-- Aim for correctness on simple files first, then progress to joins and aggregations.
-
-5) Handle edge cases early enough to avoid rewrites
-- SELECT-star and star expansion: [docs/edge_cases.md](docs/edge_cases.md)
-- Requires object-level lineage first, then star expansion. Also watch UNION ordinals and SELECT INTO schema inference.
-
-6) Decide on architecture and adapter boundaries
-- Adapters & extensibility: [docs/adapters.md](docs/adapters.md)
-- Define a clear adapter interface and implement MS SQL first (temp tables, variables, SELECT INTO, T-SQL functions). Keep the core engine adapter-agnostic.
-
-7) Wire up the agentic workflow and regression tests
-- Agentic workflow: [docs/agentic_workflow.md](docs/agentic_workflow.md)
-- Loop the agent on the example corpus until the generated lineage matches the gold JSON. Add CI to auto-run on any SQL/lineage change.
-
-8) Expose the CLI and iterate to parity
-- CLI usage: [docs/cli_usage.md](docs/cli_usage.md)
-- Implement `extract`, `impact`, and `diff`. The CLI is your acceptance surface; keep behavior stable and well-documented.
-
-9) Implement breaking-change detection and reporting
-- Breaking changes: [docs/breaking_changes.md](docs/breaking_changes.md)
-- Compare base vs head branches: diff schemas/expressions, classify severity, compute downstream impacts, and emit machine + human-readable reports.
-
-10) Optional: Integrate with dbt
-- dbt integration: [docs/dbt_integration.md](docs/dbt_integration.md)
-
-### Milestones (suggested timebox)
-- Day 1–2: read docs, install CLI, run extract on examples
-- Day 3–5: implement simple lineage (no joins), pass gold files
-- Day 6–8: add joins and aggregations, handle star expansion
-- Day 9–10: wire warn-only diff in CI, polish docs
-
-### Acceptance checklist
-- Lineage matches gold JSONs in `examples/warehouse/lineage`
-- Impact queries return correct columns for sample selectors
-- Diff runs in CI (warn-only) and shows helpful messages
-- Docs updated where needed; examples run without errors
-
-### Quick-start CLI (target behavior)
-Simple Example: To test one file, run `infotracker extract --sql-dir examples/warehouse/sql/01_customers.sql --out-dir build/lineage`
-
-### Tips (pro-level, easy to follow)
-- Start small: one view, then a join, then an aggregate
-- Be explicit: avoid `SELECT *` while testing
-- Commit often: small steps are easy to undo
-- Use the example JSONs as your “gold” truth
-
-### If stuck (quick help)
-- Re-read the related doc step (linked above)
-- Run the CLI with `--log-level debug` to see more info
-- Create a tiny SQL with just the failing pattern and test that first
-- Write down expected lineage for one column, then match it in code
@@ -1,117 +0,0 @@
-### Adapters and extensibility
-
-#### Forge your adapter (smithing for data heroes)
-In the forge of Integration Keep, you’ll craft adapters that turn raw SQL into neatly qualified lineage. Sparks may fly; that’s normal.
-
-- Materials: `parse`, `qualify`, `resolve`, `to_openlineage`
-- Armor enchantments: case-normalization, bracket taming, and dialect charms
-- Future artifacts: Snowflake blade, BigQuery bow, Postgres shield
-
-If an imp named “Case Insensitivity” throws a tantrum, feed it brackets: `[like_this]`.
-
-#### Audience & prerequisites
-- Audience: engineers implementing or extending dialect adapters (Level 2: After basics)
-- Prerequisites: Python; SQL basics; familiarity with SQLGlot or similar parser
-
-Define an adapter interface:
-- parse(sql) → AST
-- qualify(ast) → fully qualified refs (db.schema.object)
-- resolve(ast, catalog) → output schema + expressions
-- to_openlineage(object) → columnLineage facet
-
-MS SQL adapter (first):
-- Use `SQLGlot`/`sqllineage` for parsing/lineage hints
-- Handle T-SQL specifics: temp tables, SELECT INTO, variables, functions
-- Normalize identifiers (brackets vs quotes), case-insensitivity
-
-Future adapters: Snowflake, BigQuery, Postgres, etc.
-
-### Adapter interface (pseudocode)
-```python
-class Adapter(Protocol):
-    name: str
-    dialect: str
-
-    def parse(self, sql: str) -> AST: ...
-    def qualify(self, ast: AST, default_db: str | None) -> AST: ...
-    def resolve(self, ast: AST, catalog: Catalog) -> tuple[Schema, ColumnLineage]: ...
-    def to_openlineage(self, obj_name: str, schema: Schema, lineage: ColumnLineage) -> dict: ...
-```
-
-### MS SQL specifics
-- Case-insensitive identifiers; bracket quoting `[name]`
-- Temp tables (`#t`) live in tempdb; scope to procedure; support SELECT INTO schema inference
-- Variables (`@v`) and their use in filters/windows; capture expressions for context
-- GETDATE/DATEADD and common built-ins; treat as CONSTANT/ARITHMETIC transformations
-- JOINs default to INNER; OUTER joins affect nullability
-- Parser: prefer SQLGlot for AST; use sqllineage as an optional hint only
-
-### Mini example (very small, illustrative)
-```python
-class MssqlAdapter(Adapter):
-    name = "mssql"
-    dialect = "tsql"
-
-    def parse(self, sql: str) -> AST:
-        return sqlglot.parse_one(sql, read=self.dialect)
-
-    def qualify(self, ast: AST, default_db: str | None) -> AST:
-        # apply name normalization and database/schema defaults
-        return qualify_identifiers(ast, default_db)
-
-    def resolve(self, ast: AST, catalog: Catalog) -> tuple[Schema, ColumnLineage]:
-        schema = infer_schema(ast, catalog)
-        lineage = extract_column_lineage(ast, catalog)
-        return schema, lineage
-
-    def to_openlineage(self, obj_name: str, schema: Schema, lineage: ColumnLineage) -> dict:
-        return build_openlineage_payload(obj_name, schema, lineage)
-```
-
-### How to Test Your Adapter
-Create a test.py:
-```python
-adapter = MssqlAdapter()
-ast = adapter.parse("SELECT * FROM table")
-print(ast)
-```
-Run: python test.py
-
-### Adding a new adapter
-1. Implement the interface; configure SQLGlot dialect
-2. Provide normalization rules (case, quoting, name resolution)
-3. Add adapter-specific tests using a small example corpus
-4. Document limitations and differences
-
-### Adapter testing template
-- Create 3 SQL files: simple select, join with alias, aggregation with group by
-- Write expected schema (columns, types, nullability)
-- Write expected lineage (inputs per output column)
-- Run extraction and compare to expected JSON in CI
-
-### Adapter selection and registry
-```python
-ADAPTERS: dict[str, Adapter] = {
-    "mssql": MssqlAdapter(),
-}
-
-def get_adapter(name: str) -> Adapter:
-    return ADAPTERS[name]
-```
-
-### Catalog handling
-- Accept a `catalog.yml` with known schemas for external refs
-- Use catalog to resolve `*`, disambiguate references, and provide types when DDL is missing
-- Warn on unknown objects; continue best-effort
-
-### Common pitfalls
-- Case-insensitive matching; normalize but preserve display casing
-- Bracket/quoted identifiers: `[Name]` vs `"Name"`
-- Temp table scoping and lifetime
-- SELECT INTO column ordinals and inferred types
-- Variables used in expressions and filters
-
-### See also
-- `docs/algorithm.md`
-- `docs/cli_usage.md`
-- `docs/overview.md`
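The deleted adapters doc sketches a `Protocol`-based interface plus a name-keyed registry. A self-contained, runnable version of that registry pattern can look like the sketch below; the `ToyAdapter` and its trivial `parse` are illustrative stand-ins (the real `MssqlAdapter` in the docs wraps SQLGlot), and the error handling in `get_adapter` is an assumption, not InfoTracker's actual code.

```python
# Minimal sketch of the adapter registry pattern from the removed docs.
from typing import Protocol


class Adapter(Protocol):
    name: str
    dialect: str

    def parse(self, sql: str) -> object: ...


class ToyAdapter:
    """Illustrative stand-in; a real adapter would wrap sqlglot.parse_one."""

    name = "toy"
    dialect = "ansi"

    def parse(self, sql: str) -> object:
        # Stand-in for real parsing: token-split the statement.
        return sql.strip().rstrip(";").split()


ADAPTERS: dict[str, Adapter] = {"toy": ToyAdapter()}


def get_adapter(name: str) -> Adapter:
    try:
        return ADAPTERS[name]
    except KeyError:
        raise ValueError(f"unknown adapter: {name!r}") from None
```

Keeping the registry as a plain dict makes the dispatch point obvious and lets tests swap in fakes without patching module internals.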
@@ -1,294 +0,0 @@
-# Advanced Use Cases: Tabular Functions and Procedures
-
-## Overview
-This document covers advanced SQL patterns that extend beyond basic views and procedures. These patterns are important for lineage analysis because they represent reusable transformation layers that can be used as inputs to other operations.
-
-## Use Case 1: Parametrized Tabular Functions
-
-### What are Tabular Functions?
-Tabular functions are SQL Server objects that accept parameters and return table result sets. They behave like parameterized views and can be used in JOINs, subqueries, and other SQL operations.
-
-### Two Syntax Variants
-
-#### 1. Inline Table-Valued Function (RETURN AS)
-```sql
-CREATE FUNCTION dbo.fn_customer_orders_inline
-(
-    @CustomerID INT,
-    @StartDate DATE,
-    @EndDate DATE
-)
-RETURNS TABLE
-AS
-RETURN
-(
-    SELECT
-        o.OrderID,
-        o.CustomerID,
-        o.OrderDate,
-        oi.ProductID,
-        CAST(oi.Quantity * oi.UnitPrice AS DECIMAL(18,2)) AS ExtendedPrice
-    FROM dbo.Orders AS o
-    INNER JOIN dbo.OrderItems AS oi ON o.OrderID = oi.OrderID
-    WHERE o.CustomerID = @CustomerID
-      AND o.OrderDate BETWEEN @StartDate AND @EndDate
-);
-```
-
-**Characteristics:**
-- Single SELECT statement
-- Optimized by query optimizer
-- Inlined into calling query for better performance
-- Parameters control data scope
-
-#### 2. Multi-Statement Table-Valued Function (RETURN TABLE)
-```sql
-CREATE FUNCTION dbo.fn_customer_orders_mstvf
-(
-    @CustomerID INT,
-    @StartDate DATE,
-    @EndDate DATE
-)
-RETURNS @Result TABLE
-(
-    OrderID INT,
-    CustomerID INT,
-    OrderDate DATE,
-    ProductID INT,
-    ExtendedPrice DECIMAL(18,2),
-    DaysSinceOrder INT
-)
-AS
-BEGIN
-    INSERT INTO @Result
-    SELECT
-        o.OrderID,
-        o.CustomerID,
-        o.OrderDate,
-        oi.ProductID,
-        CAST(oi.Quantity * oi.UnitPrice AS DECIMAL(18,2)) AS ExtendedPrice,
-        DATEDIFF(DAY, o.OrderDate, GETDATE()) AS DaysSinceOrder
-    FROM dbo.Orders AS o
-    INNER JOIN dbo.OrderItems AS oi ON o.OrderID = oi.OrderID
-    WHERE o.CustomerID = @CustomerID
-      AND o.OrderDate BETWEEN @StartDate AND @EndDate;
-
-    RETURN;
-END;
-```
-
-**Characteristics:**
-- Multiple statements possible
-- Table variable for intermediate results
-- More complex logic and calculations
-- Better for complex business rules
-
-### Lineage Analysis for Tabular Functions
-Tabular functions are treated as view-like objects in lineage analysis:
-
-```mermaid
-graph TD
-    O[Orders.OrderID] --> F[fn_customer_orders_inline.OrderID]
-    C[Orders.CustomerID] --> F2[fn_customer_orders_inline.CustomerID]
-    OI[OrderItems.ProductID] --> F3[fn_customer_orders_inline.ProductID]
-    Q[OrderItems.Quantity] --> F4[fn_customer_orders_inline.ExtendedPrice]
-    P[OrderItems.UnitPrice] --> F4
-```
-
-**Key Points:**
-- Function parameters affect filtering but don't create columns
-- Output columns show lineage to source table columns
-- Computed columns (like ExtendedPrice) show lineage to multiple inputs
-- Functions can be used as inputs to other operations
-
-## Use Case 2: Procedures Returning Datasets
-
-### What are Dataset-Returning Procedures?
-These are stored procedures that perform calculations and end with a SELECT statement that returns data. The results can be captured into temp tables for further processing.
-
-### Example Procedure
-```sql
-CREATE PROCEDURE dbo.usp_customer_metrics_dataset
-    @CustomerID INT = NULL,
-    @StartDate DATE = NULL,
-    @EndDate DATE = NULL,
-    @IncludeInactive BIT = 0
-AS
-BEGIN
-    SET NOCOUNT ON;
-
-    -- Set default parameters if not provided
-    IF @StartDate IS NULL SET @StartDate = DATEADD(MONTH, -3, GETDATE());
-    IF @EndDate IS NULL SET @EndDate = GETDATE();
-
-    -- Main calculation query that returns the dataset
-    SELECT
-        c.CustomerID,
-        c.CustomerName,
-        COUNT(DISTINCT o.OrderID) AS TotalOrders,
-        SUM(oi.Quantity * oi.UnitPrice) AS TotalRevenue,
-        MAX(o.OrderDate) AS LastOrderDate,
-        CASE
-            WHEN MAX(o.OrderDate) >= DATEADD(MONTH, -1, GETDATE()) THEN 'Active'
-            WHEN MAX(o.OrderDate) >= DATEADD(MONTH, -3, GETDATE()) THEN 'Recent'
-            ELSE 'Inactive'
-        END AS CustomerActivityStatus
-    FROM dbo.Customers AS c
-    LEFT JOIN dbo.Orders AS o ON c.CustomerID = o.CustomerID
-    LEFT JOIN dbo.OrderItems AS oi ON o.OrderID = oi.OrderID
-    WHERE (@CustomerID IS NULL OR c.CustomerID = @CustomerID)
-      AND (@IncludeInactive = 1 OR c.IsActive = 1)
-      AND (o.OrderDate IS NULL OR o.OrderDate BETWEEN @StartDate AND @EndDate)
-    GROUP BY c.CustomerID, c.CustomerName
-    HAVING COUNT(DISTINCT o.OrderID) > 0
-    ORDER BY TotalRevenue DESC;
-END;
-```
-
-**Characteristics:**
-- Ends with a single SELECT statement
-- Parameters control data scope and filtering
-- Complex calculations and aggregations
-- Results can be captured for further processing
-
-### Usage Patterns
-
-#### 1. EXEC into Temp Table
-```sql
--- Create temp table to capture procedure output
-IF OBJECT_ID('tempdb..#customer_metrics') IS NOT NULL
-    DROP TABLE #customer_metrics;
-
--- Execute procedure and capture results into temp table
-INSERT INTO #customer_metrics
-EXEC dbo.usp_customer_metrics_dataset
-    @CustomerID = NULL,
-    @StartDate = '2024-01-01',
-    @EndDate = '2024-12-31',
-    @IncludeInactive = 0;
-```
-
-#### 2. Further Processing
-```sql
--- Use the temp table for downstream operations
-SELECT
-    cm.*,
-    CASE
-        WHEN cm.TotalRevenue >= 10000 THEN 'High Value'
-        WHEN cm.TotalRevenue >= 5000 THEN 'Medium Value'
-        ELSE 'Standard'
-    END AS CustomerTier
-FROM #customer_metrics AS cm
-WHERE cm.CustomerActivityStatus IN ('Active', 'Recent');
-```
-
-#### 3. Insert into Permanent Table
-```sql
--- Archive procedure results
-INSERT INTO dbo.customer_metrics_archive (
-    ArchiveDate, CustomerID, CustomerName, TotalOrders, TotalRevenue
-)
-SELECT
-    GETDATE(),
-    cm.*
-FROM #customer_metrics AS cm;
-```
-
-### Lineage Analysis for Dataset Procedures
-Procedures that return datasets are treated as view-like objects:
-
-```mermaid
-graph TD
-    C[Customers.CustomerID] --> P[usp_customer_metrics_dataset.CustomerID]
-    CN[Customers.CustomerName] --> P2[usp_customer_metrics_dataset.CustomerName]
-    O[Orders.OrderID] --> P3[usp_customer_metrics_dataset.TotalOrders]
-    Q[OrderItems.Quantity] --> P4[usp_customer_metrics_dataset.TotalRevenue]
-    P5[OrderItems.UnitPrice] --> P4
-    OD[Orders.OrderDate] --> P6[usp_customer_metrics_dataset.LastOrderDate]
-    OD --> P7[usp_customer_metrics_dataset.CustomerActivityStatus]
-```
-
-**Key Points:**
-- Procedure output columns show lineage to source tables
-- Aggregated columns show lineage to source columns used in aggregation
-- Computed columns show lineage to multiple inputs
-- Procedure can become an input to other operations via temp tables
-
-## Combined Workflow Example
-
-### End-to-End Lineage
-```sql
--- Step 1: Use tabular function
-SELECT f.*, p.Category
-FROM dbo.fn_customer_orders_inline(1, '2024-01-01', '2024-12-31') AS f
-INNER JOIN dbo.Products AS p ON f.ProductID = p.ProductID;
-
--- Step 2: Use procedure output
-INSERT INTO #customer_metrics
-EXEC dbo.usp_customer_metrics_dataset @CustomerID = 1;
-
--- Step 3: Combine both for analysis
-SELECT
-    f.CustomerID,
-    f.OrderID,
-    f.ExtendedPrice,
-    cm.TotalRevenue,
-    CAST(f.ExtendedPrice / NULLIF(cm.TotalRevenue, 0) * 100 AS DECIMAL(5,2)) AS OrderContributionPercent
-FROM dbo.fn_customer_orders_inline(1, '2024-01-01', '2024-12-31') AS f
-INNER JOIN #customer_metrics AS cm ON f.CustomerID = cm.CustomerID;
-```
-
-### Lineage Chain
-```mermaid
-graph TD
-    O[Orders.OrderID] --> F[fn_customer_orders_inline.OrderID]
-    OI[OrderItems.Quantity] --> F2[fn_customer_orders_inline.ExtendedPrice]
-    UP[OrderItems.UnitPrice] --> F2
-
-    C[Customers.CustomerID] --> P[usp_customer_metrics_dataset.CustomerID]
-    O2[Orders.OrderID] --> P2[usp_customer_metrics_dataset.TotalOrders]
-
-    F --> A[Analysis.OrderID]
-    F2 --> A2[Analysis.ExtendedPrice]
-    P --> A3[Analysis.TotalRevenue]
-    P2 --> A4[Analysis.OrderContributionPercent]
-```
-
-## Implementation Considerations
-
-### For Lineage Extraction
-1. **Function Detection**: Identify `CREATE FUNCTION` statements with `RETURNS TABLE`
-2. **Procedure Detection**: Identify `CREATE PROCEDURE` statements ending with `SELECT`
-3. **Parameter Handling**: Track parameters that affect filtering but don't create columns
-4. **Output Schema**: Extract column definitions from function returns or procedure SELECT
-5. **Dependency Mapping**: Map output columns to source table columns
-
-### For Impact Analysis
-1. **Function Usage**: Track where tabular functions are used as inputs
-2. **Procedure Results**: Track where procedure outputs are captured and used
-3. **Temp Table Lifecycle**: Understand temp table creation, population, and usage
-4. **Downstream Effects**: Follow lineage from functions/procedures to final targets
-
-### Testing Scenarios
-1. **Parameter Variations**: Test functions with different parameter values
-2. **Complex Calculations**: Verify lineage for multi-step computations
-3. **Temp Table Workflows**: Test EXEC into temp table patterns
-4. **Combined Usage**: Test functions and procedures used together
-5. **Edge Cases**: Test with NULL parameters, empty results, etc.
-
-## Best Practices
-
-### For Developers
-1. **Clear Naming**: Use descriptive names for functions and procedures
-2. **Parameter Validation**: Validate input parameters early
-3. **Consistent Output**: Maintain consistent column structure
-4. **Documentation**: Document parameter effects and return values
-
-### For Lineage Analysis
-1. **Schema Tracking**: Track function and procedure schemas
-2. **Parameter Impact**: Understand how parameters affect lineage
-3. **Temp Table Handling**: Properly track temp table creation and usage
-4. **Performance Considerations**: Be aware of function inlining and optimization
-
-## Conclusion
-Tabular functions and dataset-returning procedures represent powerful patterns for creating reusable transformation layers. They extend the lineage analysis beyond simple views and tables, requiring careful handling of parameters, computed columns, and complex workflows. Proper implementation of these patterns enables comprehensive data lineage tracking across modern SQL Server environments.
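The deleted advanced-use-cases doc draws its lineage diagrams as edges from source columns to function outputs, with computed columns like `ExtendedPrice` fanning in from several inputs. That mapping can be sketched as a plain dict plus a reverse lookup for impact analysis; the dict structure below is an illustrative assumption using column names from the deleted examples, not InfoTracker's internal model.

```python
# Sketch of the column-lineage map the removed doc draws in mermaid:
# each TVF output column lists the source columns it is computed from.
LINEAGE = {
    "fn_customer_orders_inline.OrderID": ["Orders.OrderID"],
    "fn_customer_orders_inline.CustomerID": ["Orders.CustomerID"],
    "fn_customer_orders_inline.ProductID": ["OrderItems.ProductID"],
    # Computed column: lineage to multiple inputs.
    "fn_customer_orders_inline.ExtendedPrice": [
        "OrderItems.Quantity",
        "OrderItems.UnitPrice",
    ],
}


def impacted_outputs(source_column: str) -> list[str]:
    """Reverse lookup: which outputs depend on the given source column?"""
    return sorted(out for out, ins in LINEAGE.items() if source_column in ins)
```

With this shape, "what breaks if `OrderItems.UnitPrice` changes?" is a single reverse lookup, which is the core of the impact-analysis considerations listed above.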
@@ -1,134 +0,0 @@
|
|
1
|
-
### Agentic workflow and regression tests

#### Train your lineage familiar (it learns by fetching JSON)
Summon your agent, toss it SQL scrolls, and reward it when it returns with matching OpenLineage scrolls. Repeat until it purrs (tests pass).

- The loop: cast → compare → tweak → repeat
- The arena: `examples/warehouse/{sql,lineage}`
- Victory condition: exact matches, zero diffs, tests pass (green)

Remember: agents love clear acceptance criteria more than tuna.

#### Audience & prerequisites
- Audience: engineers using agents/CI to iterate on lineage extractors
- Prerequisites: basic SQL; Python; familiarity with CI and diff workflows

### Gold files (recap)
- Gold files = expected JSON lineage in `examples/warehouse/lineage`
- Your extractor must match them exactly (order and content)

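"Match exactly" is easiest to enforce with a recursive comparison that reports exactly where two lineage documents diverge. A self-contained sketch (not InfoTracker's built-in diffing):

```python
def diff_lineage(actual: dict, expected: dict, prefix: str = "") -> list[str]:
    """Return a sorted list of paths where two lineage documents differ.

    Nested dicts are compared recursively; any other values (lists,
    strings, numbers) are compared for exact equality.
    """
    diffs = []
    for key in sorted(set(actual) | set(expected)):
        path = f"{prefix}.{key}" if prefix else key
        if key not in actual:
            diffs.append(f"missing: {path}")
        elif key not in expected:
            diffs.append(f"extra: {path}")
        elif isinstance(actual[key], dict) and isinstance(expected[key], dict):
            diffs.extend(diff_lineage(actual[key], expected[key], path))
        elif actual[key] != expected[key]:
            diffs.append(f"changed: {path}")
    return diffs
```

An empty result means the gold file matches; a non-empty result gives the agent concrete paths to chase in the next iteration.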
### Fixing diffs (common)
- If a column mismatches: compare expressions; check alias qualification and star expansion timing
- If there are extra or missing columns: check join resolution and GROUP BY; ensure inputs are resolved before expansion
- If ordering differs: make outputs and diagnostics deterministic

### Training loop (overview)
- Prepare the training set: SQL files plus expected OpenLineage JSONs
- Loop (Cursor AI/CLI/web agents):
  1) Generate lineage → 2) Compare with expected → 3) Adjust prompts/code → 4) Repeat until pass
- CI: on any change under `examples/warehouse/{sql,lineage}`, run extraction and compare; fail on diffs
- Track coverage and edge cases (SELECT *, temp tables, UNION, variables)

### Setup
- Install the Cursor CLI and authenticate
- Organize the repo with `examples/warehouse/{sql,lineage}` and a `build/` output folder

### Agent loop
1. Prompt template includes: adapter target (MS SQL), acceptance criteria (must match gold JSON), and allowed libraries (SQLGlot)
2. Agent writes code to `src/` and runs `infotracker extract` on the SQL corpus
3. Compare `build/lineage/*.json` to `examples/warehouse/lineage/*.json`
4. If a diff exists, the agent refines parsing/resolution rules and retries
5. Stop condition: all files match; record a commit checkpoint

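The five steps above can be sketched as a single driver function. The `generate` callable stands in for whatever runs `infotracker extract`, and the directory layout is the one assumed throughout this doc:

```python
import json
from pathlib import Path

def loop_once(generate, gold_dir: Path, build_dir: Path) -> list[str]:
    """One generate→compare pass: run the extractor, then return the names
    of gold files whose generated counterpart is missing or different."""
    generate()  # e.g. shell out to `infotracker extract` in a real setup
    failures = []
    for gold in sorted(gold_dir.glob("*.json")):
        built = build_dir / gold.name
        if not built.exists() or json.loads(built.read_text()) != json.loads(gold.read_text()):
            failures.append(gold.name)
    return failures  # empty list == stop condition reached (all files match)
```

An agent (or CI job) keeps calling `loop_once` after each refinement and records the commit checkpoint when the returned list is empty.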
### Loop diagram
```mermaid
flowchart LR
  G[Generate lineage] --> C[Compare to gold JSONs]
  C -->|diffs| R[Refine rules/prompts]
  R --> G
  C -->|no diffs| S["Stop (green)"]
```

### Artifacts
- Inputs: SQL corpus under `examples/warehouse/sql`, optional catalog
- Outputs: `build/lineage/*.json`, diff logs, warnings
- CI artifacts: upload generated lineage for review

### Stop criteria
- All gold JSONs match exactly (order and content)
- No warnings if using `--fail-on-warn`

### CI integration
- GitHub Actions (example): on push/PR, run extraction and `git diff --no-index` against the gold lineage; fail on differences
- Cache Python deps and AST caches for speed
- Upload generated `build/lineage/*.json` as CI artifacts for review

### Evaluation metrics
- Exact-match rate across files
- Column coverage (percentage of outputs with lineage)
- Warning/error counts should trend down across iterations

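The first two metrics reduce to simple ratios; a tiny helper (names are illustrative) makes the computation explicit:

```python
def evaluation_metrics(results: dict[str, bool],
                       columns_with_lineage: int,
                       total_columns: int) -> dict[str, float]:
    """Compute the two headline metrics for a run: exact-match rate
    across files and column-level lineage coverage."""
    exact = sum(results.values()) / len(results) if results else 0.0
    coverage = columns_with_lineage / total_columns if total_columns else 0.0
    return {"exact_match_rate": exact, "column_coverage": coverage}
```

Logging these per iteration (alongside warning/error counts) gives the downward trend the loop is aiming for.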
### Updating gold files
- Intentional changes: regenerate lineage and review diffs; update the gold JSON with a PR describing the change

### See also
- `docs/example_dataset.md`
- `docs/algorithm.md`
- `docs/cli_usage.md`
- `docs/dbt_integration.md`

### Modus Operandi: Continuous Improvement with Cursor Agents
To extend the agentic workflow toward 24/7/365 improvement of InfoTracker, integrate Cursor's web-based agents (such as Background Agents) with GitHub. This builds on the regression-testing and CI loops above, enabling automated code suggestions, bug fixes, and PR reviews. See the [Cursor Changelog](https://cursor.com/changelog) for details.

#### Step 1: Set Up Cursor Web Agents for Continuous Improvement
1. **Enable Background Agents:** In Cursor, use Background Agents, which run remotely and can be triggered via the web (e.g., from GitHub or Slack).
2. **Integrate with GitHub for 24/7 Operation:** Use GitHub Actions to schedule agent runs, combined with Cursor's API for AI tasks.
   - Create a workflow, `.github/workflows/cursor-improve.yml`, that runs daily.
   - Use Cursor features like tagging @Cursor in issues for automated suggestions.
3. **Example Workflow for Scheduled Improvements:**
   ```yaml
   name: Cursor AI Improvement
   on:
     schedule:
       - cron: '0 0 * * *'  # Daily at midnight
   jobs:
     improve:
       runs-on: ubuntu-latest
       steps:
         - uses: actions/checkout@v4
         - name: Run Cursor Agent
           env:
             CURSOR_API_KEY: ${{ secrets.CURSOR_API_KEY }}  # Add in GitHub Secrets
           run: |
             # Script to call the Cursor API or simulate an agent for code analysis
             python cursor_improve.py  # Prompt the agent to suggest repo improvements
   ```
4. **Script (cursor_improve.py):** Use Cursor's API to analyze code and open issues/PRs with suggestions.

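The Cursor API surface is not documented here, so the sketch below stubs out the agent call and shows only the repo-side plumbing: select a small batch of files (keeping each pass modular, per the safeguards in this doc) and draft the issue text an automated run would open. All names (`collect_targets`, `draft_issue_body`) are hypothetical:

```python
from pathlib import Path

def collect_targets(root: Path, suffix: str = ".py", limit: int = 5) -> list[Path]:
    """Select a small batch of files for one improvement pass (modular changes)."""
    return sorted(root.rglob(f"*{suffix}"))[:limit]

def draft_issue_body(targets: list[Path]) -> str:
    """Draft the issue/PR text an agent run would open. The Cursor call
    itself is environment-specific and deliberately left out of this sketch."""
    files = "\n".join(f"- {t.name}" for t in targets)
    return (
        "Automated improvement pass over:\n"
        f"{files}\n\n"
        "Constraint: suggest improvements without changing existing correct behavior."
    )
```

The final issue/PR could then be opened with the GitHub CLI (e.g. `gh issue create`) or the REST API from within the workflow.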
#### Step 2: Set Up Cursor Agents for PR Reviews
1. **GitHub PR Integration:** In GitHub, tag @Cursor in PR comments to trigger a Background Agent for reviews and fixes.
   - Example: In a PR, comment "@Cursor review this for bugs" and it will analyze the change and suggest fixes.
2. **Automate with a Workflow:** Trigger on PR events to auto-invoke Cursor.
   ```yaml
   name: Cursor PR Review
   on: pull_request
   jobs:
     review:
       runs-on: ubuntu-latest
       steps:
         - uses: actions/checkout@v4
         - name: Invoke Cursor Agent
           env:
             GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}  # lets the gh CLI authenticate
           run: |
             # Use the GitHub CLI to comment "@Cursor review" on the PR
             gh pr comment ${{ github.event.pull_request.number }} --body "@Cursor review this PR for improvements"
   ```
3. **How It Works:** Cursor's Background Agent reads the PR, applies fixes if needed, and pushes commits (per the changelog features).

#### Safeguards to Prevent Breaking Working Code
Constant improvement is great, but we must avoid breaking things that already work well. Here's how to build safety into the process:

- **Regression Testing:** Always run tests against the gold standards (e.g., the example JSONs in `examples/warehouse/lineage`). If an agent's suggestion changes outputs, reject it unless the gold files are being intentionally updated.
- **CI/CD Pipelines:** Set up automated tests in GitHub Actions that run on every PR or scheduled run. Fail the build if tests break, catching issues early.
- **Human Oversight:** Agents suggest changes; review and approve them manually before merging. Use PR reviews to double-check.
- **Modular Changes:** Limit agent tasks to small, isolated improvements (e.g., one file at a time) to minimize risk.
- **Monitoring and Rollback:** Track metrics like test pass rates. Use Git for easy rollbacks if something breaks.

Integrate these into the workflows: add test steps to the example YAML files, and prompt agents with "Suggest improvements without changing existing correct behavior."