InfoTracker 0.2.3.tar.gz → 0.2.4.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (100)
  1. {infotracker-0.2.3 → infotracker-0.2.4}/PKG-INFO +1 -1
  2. {infotracker-0.2.3 → infotracker-0.2.4}/pyproject.toml +9 -2
  3. infotracker-0.2.3/ProjectDescription.md +0 -82
  4. infotracker-0.2.3/docs/adapters.md +0 -117
  5. infotracker-0.2.3/docs/advanced_use_cases.md +0 -294
  6. infotracker-0.2.3/docs/agentic_workflow.md +0 -134
  7. infotracker-0.2.3/docs/algorithm.md +0 -109
  8. infotracker-0.2.3/docs/architecture.md +0 -35
  9. infotracker-0.2.3/docs/breaking_changes.md +0 -291
  10. infotracker-0.2.3/docs/cli_usage.md +0 -133
  11. infotracker-0.2.3/docs/configuration.md +0 -38
  12. infotracker-0.2.3/docs/dbt_integration.md +0 -144
  13. infotracker-0.2.3/docs/edge_cases.md +0 -102
  14. infotracker-0.2.3/docs/example_dataset.md +0 -134
  15. infotracker-0.2.3/docs/faq.md +0 -23
  16. infotracker-0.2.3/docs/lineage_concepts.md +0 -410
  17. infotracker-0.2.3/docs/openlineage_mapping.md +0 -45
  18. infotracker-0.2.3/docs/overview.md +0 -123
  19. infotracker-0.2.3/examples/warehouse/lineage/01_customers.json +0 -25
  20. infotracker-0.2.3/examples/warehouse/lineage/02_orders.json +0 -25
  21. infotracker-0.2.3/examples/warehouse/lineage/03_products.json +0 -25
  22. infotracker-0.2.3/examples/warehouse/lineage/04_order_items.json +0 -27
  23. infotracker-0.2.3/examples/warehouse/lineage/10_stg_orders.json +0 -43
  24. infotracker-0.2.3/examples/warehouse/lineage/11_stg_order_items.json +0 -55
  25. infotracker-0.2.3/examples/warehouse/lineage/12_stg_customers.json +0 -48
  26. infotracker-0.2.3/examples/warehouse/lineage/20_vw_recent_orders.json +0 -38
  27. infotracker-0.2.3/examples/warehouse/lineage/30_dim_customer.json +0 -43
  28. infotracker-0.2.3/examples/warehouse/lineage/31_dim_product.json +0 -43
  29. infotracker-0.2.3/examples/warehouse/lineage/40_fct_sales.json +0 -59
  30. infotracker-0.2.3/examples/warehouse/lineage/41_agg_sales_by_day.json +0 -39
  31. infotracker-0.2.3/examples/warehouse/lineage/50_vw_orders_all.json +0 -25
  32. infotracker-0.2.3/examples/warehouse/lineage/51_vw_orders_all_enriched.json +0 -26
  33. infotracker-0.2.3/examples/warehouse/lineage/52_vw_order_details_star.json +0 -31
  34. infotracker-0.2.3/examples/warehouse/lineage/53_vw_products_all.json +0 -25
  35. infotracker-0.2.3/examples/warehouse/lineage/54_vw_recent_orders_star_cte.json +0 -25
  36. infotracker-0.2.3/examples/warehouse/lineage/55_vw_orders_shipped_or_delivered.json +0 -25
  37. infotracker-0.2.3/examples/warehouse/lineage/56_vw_orders_union_star.json +0 -40
  38. infotracker-0.2.3/examples/warehouse/lineage/60_vw_customer_order_analysis.json +0 -149
  39. infotracker-0.2.3/examples/warehouse/lineage/90_usp_refresh_sales_with_temp.json +0 -57
  40. infotracker-0.2.3/examples/warehouse/lineage/91_usp_snapshot_recent_orders_star.json +0 -26
  41. infotracker-0.2.3/examples/warehouse/lineage/92_usp_rebuild_recent_sales_with_vars.json +0 -52
  42. infotracker-0.2.3/examples/warehouse/lineage/93_usp_top_products_since_var.json +0 -42
  43. infotracker-0.2.3/examples/warehouse/lineage/94_fn_customer_orders_tvf.json +0 -137
  44. infotracker-0.2.3/examples/warehouse/lineage/95_usp_customer_metrics_dataset.json +0 -96
  45. infotracker-0.2.3/examples/warehouse/lineage/96_demo_usage_tvf_and_proc.json +0 -98
  46. infotracker-0.2.3/examples/warehouse/sql/01_customers.sql +0 -6
  47. infotracker-0.2.3/examples/warehouse/sql/02_orders.sql +0 -6
  48. infotracker-0.2.3/examples/warehouse/sql/03_products.sql +0 -6
  49. infotracker-0.2.3/examples/warehouse/sql/04_order_items.sql +0 -8
  50. infotracker-0.2.3/examples/warehouse/sql/10_stg_orders.sql +0 -7
  51. infotracker-0.2.3/examples/warehouse/sql/11_stg_order_items.sql +0 -9
  52. infotracker-0.2.3/examples/warehouse/sql/12_stg_customers.sql +0 -10
  53. infotracker-0.2.3/examples/warehouse/sql/20_vw_recent_orders.sql +0 -14
  54. infotracker-0.2.3/examples/warehouse/sql/30_dim_customer.sql +0 -7
  55. infotracker-0.2.3/examples/warehouse/sql/31_dim_product.sql +0 -7
  56. infotracker-0.2.3/examples/warehouse/sql/40_fct_sales.sql +0 -12
  57. infotracker-0.2.3/examples/warehouse/sql/41_agg_sales_by_day.sql +0 -9
  58. infotracker-0.2.3/examples/warehouse/sql/50_vw_orders_all.sql +0 -3
  59. infotracker-0.2.3/examples/warehouse/sql/51_vw_orders_all_enriched.sql +0 -5
  60. infotracker-0.2.3/examples/warehouse/sql/52_vw_order_details_star.sql +0 -9
  61. infotracker-0.2.3/examples/warehouse/sql/53_vw_products_all.sql +0 -3
  62. infotracker-0.2.3/examples/warehouse/sql/54_vw_recent_orders_star_cte.sql +0 -7
  63. infotracker-0.2.3/examples/warehouse/sql/55_vw_orders_shipped_or_delivered.sql +0 -4
  64. infotracker-0.2.3/examples/warehouse/sql/56_vw_orders_union_star.sql +0 -4
  65. infotracker-0.2.3/examples/warehouse/sql/60_vw_customer_order_analysis.sql +0 -12
  66. infotracker-0.2.3/examples/warehouse/sql/60_vw_customer_order_ranking.sql +0 -11
  67. infotracker-0.2.3/examples/warehouse/sql/61_vw_sales_analytics.sql +0 -11
  68. infotracker-0.2.3/examples/warehouse/sql/90_usp_refresh_sales_with_temp.sql +0 -53
  69. infotracker-0.2.3/examples/warehouse/sql/91_usp_snapshot_recent_orders_star.sql +0 -23
  70. infotracker-0.2.3/examples/warehouse/sql/92_usp_rebuild_recent_sales_with_vars.sql +0 -48
  71. infotracker-0.2.3/examples/warehouse/sql/93_usp_top_products_since_var.sql +0 -41
  72. infotracker-0.2.3/examples/warehouse/sql/94_fn_customer_orders_tvf.sql +0 -78
  73. infotracker-0.2.3/examples/warehouse/sql/95_usp_customer_metrics_dataset.sql +0 -71
  74. infotracker-0.2.3/examples/warehouse/sql/96_demo_usage_tvf_and_proc.sql +0 -91
  75. infotracker-0.2.3/tests/__init__.py +0 -21
  76. infotracker-0.2.3/tests/conftest.py +0 -60
  77. infotracker-0.2.3/tests/test_adapter.py +0 -150
  78. infotracker-0.2.3/tests/test_expected_outputs.py +0 -228
  79. infotracker-0.2.3/tests/test_impact_output.py +0 -89
  80. infotracker-0.2.3/tests/test_insert_exec.py +0 -75
  81. infotracker-0.2.3/tests/test_integration.py +0 -148
  82. infotracker-0.2.3/tests/test_parser.py +0 -145
  83. infotracker-0.2.3/tests/test_preprocessing.py +0 -157
  84. infotracker-0.2.3/tests/test_wildcard.py +0 -137
  85. {infotracker-0.2.3 → infotracker-0.2.4}/.gitignore +0 -0
  86. {infotracker-0.2.3 → infotracker-0.2.4}/MANIFEST.in +0 -0
  87. {infotracker-0.2.3 → infotracker-0.2.4}/README.md +0 -0
  88. {infotracker-0.2.3 → infotracker-0.2.4}/infotracker.yml +0 -0
  89. {infotracker-0.2.3 → infotracker-0.2.4}/requirements.txt +0 -0
  90. {infotracker-0.2.3 → infotracker-0.2.4}/src/infotracker/__init__.py +0 -0
  91. {infotracker-0.2.3 → infotracker-0.2.4}/src/infotracker/__main__.py +0 -0
  92. {infotracker-0.2.3 → infotracker-0.2.4}/src/infotracker/adapters.py +0 -0
  93. {infotracker-0.2.3 → infotracker-0.2.4}/src/infotracker/cli.py +0 -0
  94. {infotracker-0.2.3 → infotracker-0.2.4}/src/infotracker/config.py +0 -0
  95. {infotracker-0.2.3 → infotracker-0.2.4}/src/infotracker/diff.py +0 -0
  96. {infotracker-0.2.3 → infotracker-0.2.4}/src/infotracker/engine.py +0 -0
  97. {infotracker-0.2.3 → infotracker-0.2.4}/src/infotracker/lineage.py +0 -0
  98. {infotracker-0.2.3 → infotracker-0.2.4}/src/infotracker/models.py +0 -0
  99. {infotracker-0.2.3 → infotracker-0.2.4}/src/infotracker/openlineage_utils.py +0 -0
  100. {infotracker-0.2.3 → infotracker-0.2.4}/src/infotracker/parser.py +0 -0
{infotracker-0.2.3 → infotracker-0.2.4}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: InfoTracker
- Version: 0.2.3
+ Version: 0.2.4
  Summary: Column-level SQL lineage, impact analysis, and breaking-change detection (MS SQL first)
  Project-URL: homepage, https://example.com/infotracker
  Project-URL: documentation, https://example.com/infotracker/docs
{infotracker-0.2.3 → infotracker-0.2.4}/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

  [project]
  name = "InfoTracker"
- version = "0.2.3"
+ version = "0.2.4"
  description = "Column-level SQL lineage, impact analysis, and breaking-change detection (MS SQL first)"
  readme = "README.md"
  requires-python = ">=3.10"
@@ -38,6 +38,7 @@ infotracker = "infotracker.cli:entrypoint"

  [tool.hatch.build.targets.wheel]
  packages = ["src/infotracker"]
+ only-include = ["src/infotracker"]
  exclude = [
    "src/infotracker/tests",
    "src/infotracker/**/tests",
@@ -45,7 +46,13 @@ exclude = [
    "tests/",
    "tests/**",
  ]
- only-include = ["src/infotracker"]
+
+ [tool.hatch.build.targets.sdist]
+ exclude = [
+   "tests", "tests/**",
+   "docs", "examples",
+   ".github", ".git", ".venv", ".gitignore", "README.md", "ProjectDescription.md",
+ ]

  [tool.ruff]
  line-length = 100
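
Note: the new [tool.hatch.build.targets.sdist] exclude list above is what drops the docs, examples, and tests from the 0.2.4 tarball (the 83 deletions in the file list). A minimal sketch to verify a locally built artifact, assuming `python -m build` has already produced a tarball under `dist/` (the glob pattern is an assumption about your local layout):

```python
import glob
import tarfile

# Pick the most recent sdist in dist/ (assumes at least one has been built).
sdist = sorted(glob.glob("dist/*.tar.gz"))[-1]
with tarfile.open(sdist, mode="r:gz") as tf:
    names = tf.getnames()

# Members matching the [tool.hatch.build.targets.sdist] excludes should be absent.
for banned in ("tests", "docs", "examples"):
    leaked = [n for n in names if f"/{banned}/" in n]
    print(f"{banned}: {'excluded' if not leaked else leaked}")
```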
infotracker-0.2.3/ProjectDescription.md (deleted)
@@ -1,82 +0,0 @@
- ### InfoTracker — Student Brief
-
- #### Welcome to the Data Dungeon (friendly, no actual monsters)
- You are the hero. Your quest: teach a tool to read SQL scrolls and tell true stories about columns. You’ll map where data comes from, spot traps (breaking changes), and keep the kingdom’s dashboards happy.
-
- - Your gear: a CLI, example SQLs, and clear docs
- - Your allies: adapters, lineage graphs, and CI checks
- - Your enemies: sneaky `SELECT *`, UNION goblins, and 3 a.m. alerts
- - Goal: green checks, clear diffs, and no broken charts
-
- If you get stuck, it’s normal. Take a sip of tea, re-read the step, try a smaller example.
-
- ### For Beginners
- If you're new to SQL, try this free tutorial: [Khan Academy SQL](https://www.khanacademy.org/computing/computer-programming/sql). It's in simple English and has exercises.
-
- ### Action plan (read and build in this order)
- 1) Understand the goal and scope (1 hour)
-    - Overview: [docs/overview.md](docs/overview.md)
-    - What you’re building, supported features, and what’s out of scope. Keep this open as your north star.
-
- 2) Learn column-level lineage basics (1-2 hours)
-    - Concepts: [docs/lineage_concepts.md](docs/lineage_concepts.md)
-    - Visual examples showing how each output column maps back to inputs (joins, transforms, aggregations). This informs how your extractor must reason.
-
- 3) Explore the example dataset (your training corpus)
-    - Dataset map: [docs/example_dataset.md](docs/example_dataset.md)
-    - Where the SQL files live, what each file represents (tables, views, CTEs, procs), and the matching OpenLineage JSON expectations you must reproduce.
-
- 4) Implement the algorithm incrementally
-    - Algorithm: [docs/algorithm.md](docs/algorithm.md)
-    - Steps: parse → object graph → schema resolution (expand `*` late) → column lineage extraction → impact graph → outputs.
-    - Aim for correctness on simple files first, then progress to joins and aggregations.
-
- 5) Handle edge cases early enough to avoid rewrites
-    - SELECT-star and star expansion: [docs/edge_cases.md](docs/edge_cases.md)
-    - Requires object-level lineage first, then star expansion. Also watch UNION ordinals and SELECT INTO schema inference.
-
- 6) Decide on architecture and adapter boundaries
-    - Adapters & extensibility: [docs/adapters.md](docs/adapters.md)
-    - Define a clear adapter interface and implement MS SQL first (temp tables, variables, SELECT INTO, T-SQL functions). Keep the core engine adapter-agnostic.
-
- 7) Wire up the agentic workflow and regression tests
-    - Agentic workflow: [docs/agentic_workflow.md](docs/agentic_workflow.md)
-    - Loop the agent on the example corpus until the generated lineage matches the gold JSON. Add CI to auto-run on any SQL/lineage change.
-
- 8) Expose the CLI and iterate to parity
-    - CLI usage: [docs/cli_usage.md](docs/cli_usage.md)
-    - Implement `extract`, `impact`, and `diff`. The CLI is your acceptance surface; keep behavior stable and well-documented.
-
- 9) Implement breaking-change detection and reporting
-    - Breaking changes: [docs/breaking_changes.md](docs/breaking_changes.md)
-    - Compare base vs head branches: diff schemas/expressions, classify severity, compute downstream impacts, and emit machine + human-readable reports.
-
- 10) Optional: Integrate with dbt
-    - dbt integration: [docs/dbt_integration.md](docs/dbt_integration.md)
-
- ### Milestones (suggested timebox)
- - Day 1–2: read docs, install CLI, run extract on examples
- - Day 3–5: implement simple lineage (no joins), pass gold files
- - Day 6–8: add joins and aggregations, handle star expansion
- - Day 9–10: wire warn-only diff in CI, polish docs
-
- ### Acceptance checklist
- - Lineage matches gold JSONs in `examples/warehouse/lineage`
- - Impact queries return correct columns for sample selectors
- - Diff runs in CI (warn-only) and shows helpful messages
- - Docs updated where needed; examples run without errors
-
- ### Quick-start CLI (target behavior)
- Simple Example: To test one file, run `infotracker extract --sql-dir examples/warehouse/sql/01_customers.sql --out-dir build/lineage`
-
- ### Tips (pro-level, easy to follow)
- - Start small: one view, then a join, then an aggregate
- - Be explicit: avoid `SELECT *` while testing
- - Commit often: small steps are easy to undo
- - Use the example JSONs as your “gold” truth
-
- ### If stuck (quick help)
- - Re-read the related doc step (linked above)
- - Run the CLI with `--log-level debug` to see more info
- - Create a tiny SQL with just the failing pattern and test that first
- - Write down expected lineage for one column, then match it in code
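
Note on the deleted brief above: step 4's pipeline (parse → object graph → schema resolution → column lineage) starts with a dialect-aware parse. A minimal sketch of that first stage with SQLGlot, which the brief's companion docs recommend; the view definition here is illustrative, not taken from the corpus:

```python
import sqlglot
from sqlglot import exp

sql = """
CREATE VIEW dbo.vw_recent_orders AS
SELECT o.OrderID, o.OrderDate
FROM dbo.Orders AS o
WHERE o.OrderDate >= DATEADD(DAY, -30, GETDATE())
"""

# Parse with the T-SQL dialect, then list every column reference the
# statement makes; this is the raw material for column-level lineage.
ast = sqlglot.parse_one(sql, read="tsql")
for col in ast.find_all(exp.Column):
    print(f"{col.table}.{col.name}")
```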
infotracker-0.2.3/docs/adapters.md (deleted)
@@ -1,117 +0,0 @@
- ### Adapters and extensibility
-
- #### Forge your adapter (smithing for data heroes)
- In the forge of Integration Keep, you’ll craft adapters that turn raw SQL into neatly qualified lineage. Sparks may fly; that’s normal.
-
- - Materials: `parse`, `qualify`, `resolve`, `to_openlineage`
- - Armor enchantments: case-normalization, bracket taming, and dialect charms
- - Future artifacts: Snowflake blade, BigQuery bow, Postgres shield
-
- If an imp named “Case Insensitivity” throws a tantrum, feed it brackets: `[like_this]`.
-
- #### Audience & prerequisites
- - Audience: engineers implementing or extending dialect adapters (Level 2: After basics)
- - Prerequisites: Python; SQL basics; familiarity with SQLGlot or similar parser
-
- Define an adapter interface:
- - parse(sql) → AST
- - qualify(ast) → fully qualified refs (db.schema.object)
- - resolve(ast, catalog) → output schema + expressions
- - to_openlineage(object) → columnLineage facet
-
- MS SQL adapter (first):
- - Use `SQLGlot`/`sqllineage` for parsing/lineage hints
- - Handle T-SQL specifics: temp tables, SELECT INTO, variables, functions
- - Normalize identifiers (brackets vs quotes), case-insensitivity
-
- Future adapters: Snowflake, BigQuery, Postgres, etc.
-
- ### Adapter interface (pseudocode)
- ```python
- class Adapter(Protocol):
-     name: str
-     dialect: str
-
-     def parse(self, sql: str) -> AST: ...
-     def qualify(self, ast: AST, default_db: str | None) -> AST: ...
-     def resolve(self, ast: AST, catalog: Catalog) -> tuple[Schema, ColumnLineage]: ...
-     def to_openlineage(self, obj_name: str, schema: Schema, lineage: ColumnLineage) -> dict: ...
- ```
-
- ### MS SQL specifics
- - Case-insensitive identifiers; bracket quoting `[name]`
- - Temp tables (`#t`) live in tempdb; scope to procedure; support SELECT INTO schema inference
- - Variables (`@v`) and their use in filters/windows; capture expressions for context
- - GETDATE/DATEADD and common built-ins; treat as CONSTANT/ARITHMETIC transformations
- - JOINs default to INNER; OUTER joins affect nullability
- - Parser: prefer SQLGlot for AST; use sqllineage as an optional hint only
-
- ### Mini example (very small, illustrative)
- ```python
- class MssqlAdapter(Adapter):
-     name = "mssql"
-     dialect = "tsql"
-
-     def parse(self, sql: str) -> AST:
-         return sqlglot.parse_one(sql, read=self.dialect)
-
-     def qualify(self, ast: AST, default_db: str | None) -> AST:
-         # apply name normalization and database/schema defaults
-         return qualify_identifiers(ast, default_db)
-
-     def resolve(self, ast: AST, catalog: Catalog) -> tuple[Schema, ColumnLineage]:
-         schema = infer_schema(ast, catalog)
-         lineage = extract_column_lineage(ast, catalog)
-         return schema, lineage
-
-     def to_openlineage(self, obj_name: str, schema: Schema, lineage: ColumnLineage) -> dict:
-         return build_openlineage_payload(obj_name, schema, lineage)
- ```
-
- ### How to Test Your Adapter
- Create a test.py:
- ```python
- adapter = MssqlAdapter()
- ast = adapter.parse("SELECT * FROM table")
- print(ast)
- ```
- Run it with `python test.py`.
-
- ### Adding a new adapter
- 1. Implement the interface; configure SQLGlot dialect
- 2. Provide normalization rules (case, quoting, name resolution)
- 3. Add adapter-specific tests using a small example corpus
- 4. Document limitations and differences
-
- ### Adapter testing template
- - Create 3 SQL files: simple select, join with alias, aggregation with group by
- - Write expected schema (columns, types, nullability)
- - Write expected lineage (inputs per output column)
- - Run extraction and compare to expected JSON in CI
-
- ### Adapter selection and registry
- ```python
- ADAPTERS: dict[str, Adapter] = {
-     "mssql": MssqlAdapter(),
- }
-
- def get_adapter(name: str) -> Adapter:
-     return ADAPTERS[name]
- ```
-
- ### Catalog handling
- - Accept a `catalog.yml` with known schemas for external refs
- - Use catalog to resolve `*`, disambiguate references, and provide types when DDL is missing
- - Warn on unknown objects; continue best-effort
-
- ### Common pitfalls
- - Case-insensitive matching; normalize but preserve display casing
- - Bracket/quoted identifiers: `[Name]` vs `"Name"`
- - Temp table scoping and lifetime
- - SELECT INTO column ordinals and inferred types
- - Variables used in expressions and filters
-
- ### See also
- - `docs/algorithm.md`
- - `docs/cli_usage.md`
- - `docs/overview.md`
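
Note on the deleted adapters doc above: its "adapter testing template" ends with comparing extraction output to expected JSON in CI. A small sketch of that comparison step, assuming the repo layout shown in the examples (`build/lineage` as the extractor's output directory is an assumption):

```python
import json
from pathlib import Path

def assert_matches_gold(name: str) -> None:
    """Compare one generated lineage file against its gold counterpart."""
    generated = json.loads(Path("build/lineage", name).read_text())
    expected = json.loads(Path("examples/warehouse/lineage", name).read_text())
    assert generated == expected, f"lineage mismatch for {name}"

# Example: check the simplest corpus file first, as the docs advise.
assert_matches_gold("01_customers.json")
```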
infotracker-0.2.3/docs/advanced_use_cases.md (deleted)
@@ -1,294 +0,0 @@
- # Advanced Use Cases: Tabular Functions and Procedures
-
- ## Overview
- This document covers advanced SQL patterns that extend beyond basic views and procedures. These patterns are important for lineage analysis because they represent reusable transformation layers that can be used as inputs to other operations.
-
- ## Use Case 1: Parametrized Tabular Functions
-
- ### What are Tabular Functions?
- Tabular functions are SQL Server objects that accept parameters and return table result sets. They behave like parameterized views and can be used in JOINs, subqueries, and other SQL operations.
-
- ### Two Syntax Variants
-
- #### 1. Inline Table-Valued Function (RETURNS TABLE ... AS RETURN)
- ```sql
- CREATE FUNCTION dbo.fn_customer_orders_inline
- (
-     @CustomerID INT,
-     @StartDate DATE,
-     @EndDate DATE
- )
- RETURNS TABLE
- AS
- RETURN
- (
-     SELECT
-         o.OrderID,
-         o.CustomerID,
-         o.OrderDate,
-         oi.ProductID,
-         CAST(oi.Quantity * oi.UnitPrice AS DECIMAL(18,2)) AS ExtendedPrice
-     FROM dbo.Orders AS o
-     INNER JOIN dbo.OrderItems AS oi ON o.OrderID = oi.OrderID
-     WHERE o.CustomerID = @CustomerID
-       AND o.OrderDate BETWEEN @StartDate AND @EndDate
- );
- ```
-
- **Characteristics:**
- - Single SELECT statement
- - Optimized by query optimizer
- - Inlined into calling query for better performance
- - Parameters control data scope
-
- #### 2. Multi-Statement Table-Valued Function (RETURNS @table TABLE)
- ```sql
- CREATE FUNCTION dbo.fn_customer_orders_mstvf
- (
-     @CustomerID INT,
-     @StartDate DATE,
-     @EndDate DATE
- )
- RETURNS @Result TABLE
- (
-     OrderID INT,
-     CustomerID INT,
-     OrderDate DATE,
-     ProductID INT,
-     ExtendedPrice DECIMAL(18,2),
-     DaysSinceOrder INT
- )
- AS
- BEGIN
-     INSERT INTO @Result
-     SELECT
-         o.OrderID,
-         o.CustomerID,
-         o.OrderDate,
-         oi.ProductID,
-         CAST(oi.Quantity * oi.UnitPrice AS DECIMAL(18,2)) AS ExtendedPrice,
-         DATEDIFF(DAY, o.OrderDate, GETDATE()) AS DaysSinceOrder
-     FROM dbo.Orders AS o
-     INNER JOIN dbo.OrderItems AS oi ON o.OrderID = oi.OrderID
-     WHERE o.CustomerID = @CustomerID
-       AND o.OrderDate BETWEEN @StartDate AND @EndDate;
-
-     RETURN;
- END;
- ```
-
- **Characteristics:**
- - Multiple statements possible
- - Table variable for intermediate results
- - More complex logic and calculations
- - Better for complex business rules
-
- ### Lineage Analysis for Tabular Functions
- Tabular functions are treated as view-like objects in lineage analysis:
-
- ```mermaid
- graph TD
-     O[Orders.OrderID] --> F[fn_customer_orders_inline.OrderID]
-     C[Orders.CustomerID] --> F2[fn_customer_orders_inline.CustomerID]
-     OI[OrderItems.ProductID] --> F3[fn_customer_orders_inline.ProductID]
-     Q[OrderItems.Quantity] --> F4[fn_customer_orders_inline.ExtendedPrice]
-     P[OrderItems.UnitPrice] --> F4
- ```
-
- **Key Points:**
- - Function parameters affect filtering but don't create columns
- - Output columns show lineage to source table columns
- - Computed columns (like ExtendedPrice) show lineage to multiple inputs
- - Functions can be used as inputs to other operations
-
- ## Use Case 2: Procedures Returning Datasets
-
- ### What are Dataset-Returning Procedures?
- These are stored procedures that perform calculations and end with a SELECT statement that returns data. The results can be captured into temp tables for further processing.
-
- ### Example Procedure
- ```sql
- CREATE PROCEDURE dbo.usp_customer_metrics_dataset
-     @CustomerID INT = NULL,
-     @StartDate DATE = NULL,
-     @EndDate DATE = NULL,
-     @IncludeInactive BIT = 0
- AS
- BEGIN
-     SET NOCOUNT ON;
-
-     -- Set default parameters if not provided
-     IF @StartDate IS NULL SET @StartDate = DATEADD(MONTH, -3, GETDATE());
-     IF @EndDate IS NULL SET @EndDate = GETDATE();
-
-     -- Main calculation query that returns the dataset
-     SELECT
-         c.CustomerID,
-         c.CustomerName,
-         COUNT(DISTINCT o.OrderID) AS TotalOrders,
-         SUM(oi.Quantity * oi.UnitPrice) AS TotalRevenue,
-         MAX(o.OrderDate) AS LastOrderDate,
-         CASE
-             WHEN MAX(o.OrderDate) >= DATEADD(MONTH, -1, GETDATE()) THEN 'Active'
-             WHEN MAX(o.OrderDate) >= DATEADD(MONTH, -3, GETDATE()) THEN 'Recent'
-             ELSE 'Inactive'
-         END AS CustomerActivityStatus
-     FROM dbo.Customers AS c
-     LEFT JOIN dbo.Orders AS o ON c.CustomerID = o.CustomerID
-     LEFT JOIN dbo.OrderItems AS oi ON o.OrderID = oi.OrderID
-     WHERE (@CustomerID IS NULL OR c.CustomerID = @CustomerID)
-       AND (@IncludeInactive = 1 OR c.IsActive = 1)
-       AND (o.OrderDate IS NULL OR o.OrderDate BETWEEN @StartDate AND @EndDate)
-     GROUP BY c.CustomerID, c.CustomerName
-     HAVING COUNT(DISTINCT o.OrderID) > 0
-     ORDER BY TotalRevenue DESC;
- END;
- ```
-
- **Characteristics:**
- - Ends with a single SELECT statement
- - Parameters control data scope and filtering
- - Complex calculations and aggregations
- - Results can be captured for further processing
-
- ### Usage Patterns
-
- #### 1. EXEC into Temp Table
- ```sql
- -- Create temp table to capture procedure output
- IF OBJECT_ID('tempdb..#customer_metrics') IS NOT NULL
-     DROP TABLE #customer_metrics;
-
- -- Execute procedure and capture results into temp table
- INSERT INTO #customer_metrics
- EXEC dbo.usp_customer_metrics_dataset
-     @CustomerID = NULL,
-     @StartDate = '2024-01-01',
-     @EndDate = '2024-12-31',
-     @IncludeInactive = 0;
- ```
-
- #### 2. Further Processing
- ```sql
- -- Use the temp table for downstream operations
- SELECT
-     cm.*,
-     CASE
-         WHEN cm.TotalRevenue >= 10000 THEN 'High Value'
-         WHEN cm.TotalRevenue >= 5000 THEN 'Medium Value'
-         ELSE 'Standard'
-     END AS CustomerTier
- FROM #customer_metrics AS cm
- WHERE cm.CustomerActivityStatus IN ('Active', 'Recent');
- ```
-
- #### 3. Insert into Permanent Table
- ```sql
- -- Archive procedure results
- INSERT INTO dbo.customer_metrics_archive (
-     ArchiveDate, CustomerID, CustomerName, TotalOrders, TotalRevenue
- )
- SELECT
-     GETDATE(),
-     cm.*
- FROM #customer_metrics AS cm;
- ```
-
- ### Lineage Analysis for Dataset Procedures
- Procedures that return datasets are treated as view-like objects:
-
- ```mermaid
- graph TD
-     C[Customers.CustomerID] --> P[usp_customer_metrics_dataset.CustomerID]
-     CN[Customers.CustomerName] --> P2[usp_customer_metrics_dataset.CustomerName]
-     O[Orders.OrderID] --> P3[usp_customer_metrics_dataset.TotalOrders]
-     Q[OrderItems.Quantity] --> P4[usp_customer_metrics_dataset.TotalRevenue]
-     P5[OrderItems.UnitPrice] --> P4
-     OD[Orders.OrderDate] --> P6[usp_customer_metrics_dataset.LastOrderDate]
-     OD --> P7[usp_customer_metrics_dataset.CustomerActivityStatus]
- ```
-
- **Key Points:**
- - Procedure output columns show lineage to source tables
- - Aggregated columns show lineage to source columns used in aggregation
- - Computed columns show lineage to multiple inputs
- - Procedure can become an input to other operations via temp tables
-
- ## Combined Workflow Example
-
- ### End-to-End Lineage
- ```sql
- -- Step 1: Use tabular function
- SELECT f.*, p.Category
- FROM dbo.fn_customer_orders_inline(1, '2024-01-01', '2024-12-31') AS f
- INNER JOIN dbo.Products AS p ON f.ProductID = p.ProductID;
-
- -- Step 2: Use procedure output
- INSERT INTO #customer_metrics
- EXEC dbo.usp_customer_metrics_dataset @CustomerID = 1;
-
- -- Step 3: Combine both for analysis
- SELECT
-     f.CustomerID,
-     f.OrderID,
-     f.ExtendedPrice,
-     cm.TotalRevenue,
-     CAST(f.ExtendedPrice / NULLIF(cm.TotalRevenue, 0) * 100 AS DECIMAL(5,2)) AS OrderContributionPercent
- FROM dbo.fn_customer_orders_inline(1, '2024-01-01', '2024-12-31') AS f
- INNER JOIN #customer_metrics AS cm ON f.CustomerID = cm.CustomerID;
- ```
-
- ### Lineage Chain
- ```mermaid
- graph TD
-     O[Orders.OrderID] --> F[fn_customer_orders_inline.OrderID]
-     OI[OrderItems.Quantity] --> F2[fn_customer_orders_inline.ExtendedPrice]
-     UP[OrderItems.UnitPrice] --> F2
-
-     C[Customers.CustomerID] --> P[usp_customer_metrics_dataset.CustomerID]
-     O2[Orders.OrderID] --> P2[usp_customer_metrics_dataset.TotalOrders]
-
-     F --> A[Analysis.OrderID]
-     F2 --> A2[Analysis.ExtendedPrice]
-     P --> A3[Analysis.TotalRevenue]
-     P2 --> A4[Analysis.OrderContributionPercent]
- ```
-
- ## Implementation Considerations
-
- ### For Lineage Extraction
- 1. **Function Detection**: Identify `CREATE FUNCTION` statements with `RETURNS TABLE`
- 2. **Procedure Detection**: Identify `CREATE PROCEDURE` statements ending with `SELECT`
- 3. **Parameter Handling**: Track parameters that affect filtering but don't create columns
- 4. **Output Schema**: Extract column definitions from function returns or procedure SELECT
- 5. **Dependency Mapping**: Map output columns to source table columns
-
- ### For Impact Analysis
- 1. **Function Usage**: Track where tabular functions are used as inputs
- 2. **Procedure Results**: Track where procedure outputs are captured and used
- 3. **Temp Table Lifecycle**: Understand temp table creation, population, and usage
- 4. **Downstream Effects**: Follow lineage from functions/procedures to final targets
-
- ### Testing Scenarios
- 1. **Parameter Variations**: Test functions with different parameter values
- 2. **Complex Calculations**: Verify lineage for multi-step computations
- 3. **Temp Table Workflows**: Test EXEC into temp table patterns
- 4. **Combined Usage**: Test functions and procedures used together
- 5. **Edge Cases**: Test with NULL parameters, empty results, etc.
-
- ## Best Practices
-
- ### For Developers
- 1. **Clear Naming**: Use descriptive names for functions and procedures
- 2. **Parameter Validation**: Validate input parameters early
- 3. **Consistent Output**: Maintain consistent column structure
- 4. **Documentation**: Document parameter effects and return values
-
- ### For Lineage Analysis
- 1. **Schema Tracking**: Track function and procedure schemas
- 2. **Parameter Impact**: Understand how parameters affect lineage
- 3. **Temp Table Handling**: Properly track temp table creation and usage
- 4. **Performance Considerations**: Be aware of function inlining and optimization
-
- ## Conclusion
- Tabular functions and dataset-returning procedures represent powerful patterns for creating reusable transformation layers. They extend the lineage analysis beyond simple views and tables, requiring careful handling of parameters, computed columns, and complex workflows. Proper implementation of these patterns enables comprehensive data lineage tracking across modern SQL Server environments.
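
Note on the deleted doc above: its "Function Detection" and "Procedure Detection" rules lend themselves to a first-pass classifier before full parsing. A deliberately simple sketch (regex-based for illustration only; a real extractor would classify from the parsed AST):

```python
import re

def classify_tsql_object(sql: str) -> str:
    """Rough first-pass classification of a T-SQL DDL statement."""
    s = sql.upper()
    if re.search(r"\bCREATE\s+FUNCTION\b", s):
        # Multi-statement TVFs declare a named table variable to return.
        if re.search(r"\bRETURNS\s+@\w+\s+TABLE\b", s):
            return "multi-statement TVF"
        if re.search(r"\bRETURNS\s+TABLE\b", s):
            return "inline TVF"
        return "scalar function"
    if re.search(r"\bCREATE\s+PROCEDURE\b", s):
        return "procedure"
    return "other"

print(classify_tsql_object(
    "CREATE FUNCTION dbo.fn_demo() RETURNS TABLE AS RETURN (SELECT 1 AS x);"
))  # -> inline TVF
```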
infotracker-0.2.3/docs/agentic_workflow.md (deleted)
@@ -1,134 +0,0 @@
- ### Agentic workflow and regression tests
-
- #### Train your lineage familiar (it learns by fetching JSON)
- Summon your agent, toss it SQL scrolls, and reward it when it returns with matching OpenLineage scrolls. Repeat until it purrs (tests pass).
-
- - The loop: cast → compare → tweak → repeat
- - The arena: `examples/warehouse/{sql,lineage}`
- - Victory condition: exact matches, zero diffs, tests pass (green)
-
- Remember: agents love clear acceptance criteria more than tuna.
-
- #### Audience & prerequisites
- - Audience: engineers using agents/CIs to iterate on lineage extractors
- - Prerequisites: basic SQL; Python; familiarity with CI and diff workflows
-
- ### Gold files (recap)
- - Gold files = expected JSON lineage in `examples/warehouse/lineage`
- - Your extractor must match them exactly (order and content)
-
- ### Fixing diffs (common)
- - If a column mismatches: compare expressions; check alias qualification and star expansion timing
- - If extra/missing columns: check join resolution and GROUP BY; ensure inputs are resolved before expansion
- - If ordering differs: make outputs and diagnostics deterministic
-
- - Prepare training set: SQL files + expected OpenLineage JSONs
- - Loop (Cursor AI/CLI/web agents):
-   1) Generate lineage → 2) Compare with expected → 3) Adjust prompts/code → 4) Repeat until pass
- - CI: on any change under `examples/warehouse/{sql,lineage}`, run extraction and compare; fail on diffs
- - Track coverage and edge cases (SELECT *, temp tables, UNION, variables)
-
- ### Setup
- - Install Cursor CLI and authenticate
- - Organize repo with `examples/warehouse/{sql,lineage}` and a `build/` output folder
-
- ### Agent loop
- 1. Prompt template includes: adapter target (MS SQL), acceptance criteria (must match gold JSON), and allowed libraries (SQLGlot)
- 2. Agent writes code to `src/` and runs `infotracker extract` on the SQL corpus
- 3. Compare `build/lineage/*.json` to `examples/warehouse/lineage/*.json`
- 4. If diff exists, agent refines parsing/resolution rules and retries
- 5. Stop condition: all files match; record commit checkpoint
-
- ### Loop diagram
- ```mermaid
- flowchart LR
-     G[Generate lineage] --> C[Compare to gold JSONs]
-     C -->|diffs| R[Refine rules/prompts]
-     R --> G
-     C -->|no diffs| S["Stop (green)"]
- ```
-
- ### Artifacts
- - Inputs: SQL corpus under `examples/warehouse/sql`, optional catalog
- - Outputs: `build/lineage/*.json`, diff logs, warnings
- - CI artifacts: upload generated lineage for review
-
- ### Stop criteria
- - All gold JSONs match exactly (order and content)
- - No warnings if using `--fail-on-warn`
-
- ### CI integration
- - GitHub Actions (example): on push/PR, run extraction and `git diff --no-index` against gold lineage; fail on differences
- - Cache Python deps and AST caches for speed
- - Upload generated `build/lineage/*.json` as CI artifacts for review
-
- ### Evaluation metrics
- - Exact-match rate across files
- - Column coverage (percentage of outputs with lineage)
- - Warning/error counts should trend down across iterations
-
- ### Updating gold files
- - Intentional changes: regenerate lineage and review diffs; update gold JSON with PR describing the change
-
- ### See also
- - `docs/example_dataset.md`
- - `docs/algorithm.md`
- - `docs/cli_usage.md`
- - `docs/dbt_integration.md`
-
- ### Modus Operandi: Continuous Improvement with Cursor Agents
- To extend the agentic workflow for 24/7/365 improvement of InfoTracker, integrate Cursor's web-based agents (like Background Agents) with GitHub. This builds on the regression testing and CI loops above, enabling automated code suggestions, bug fixes, and PR reviews. See [Cursor Changelog](https://cursor.com/changelog) for details.
-
- #### Step 1: Set Up Cursor Web Agents for Continuous Improvement
- 1. **Enable Background Agents:** In Cursor, use Background Agents which run remotely and can be triggered via web (e.g., GitHub or Slack).
- 2. **Integrate with GitHub for 24/7 Operation:** Use GitHub Actions to schedule agent runs, combined with Cursor's API for AI tasks.
-    - Create a workflow: `.github/workflows/cursor-improve.yml` to run daily.
-    - Use Cursor's features like tagging @Cursor in issues for automated suggestions.
- 3. **Example Workflow for Scheduled Improvements:**
- ```yaml
- name: Cursor AI Improvement
- on:
-   schedule:
-     - cron: '0 0 * * *'  # Daily at midnight
- jobs:
-   improve:
-     runs-on: ubuntu-latest
-     steps:
-       - uses: actions/checkout@v4
-       - name: Run Cursor Agent
-         env:
-           CURSOR_API_KEY: ${{ secrets.CURSOR_API_KEY }}  # Add in GitHub Secrets
-         run: |
-           # Script to call Cursor API or simulate agent for code analysis
-           python cursor_improve.py  # Prompt agent to suggest repo improvements
- ```
- 4. **Script (cursor_improve.py):** Use Cursor's API to analyze code and open issues/PRs with suggestions.
-
- #### Step 2: Set Up Cursor Agents for PR Reviews
- 1. **GitHub PR Integration:** In GitHub, tag @Cursor in PR comments to trigger Background Agent for reviews and fixes.
-    - Example: In a PR, comment "@Cursor review this for bugs" – it will analyze and suggest changes.
- 2. **Automate with Workflow:** Trigger on PR events to auto-invoke Cursor.
- ```yaml
- name: Cursor PR Review
- on: pull_request
- jobs:
-   review:
-     runs-on: ubuntu-latest
-     steps:
-       - uses: actions/checkout@v4
-       - name: Invoke Cursor Agent
-         run: |
-           # Use GitHub API to comment "@Cursor review" on the PR
-           gh pr comment $PR_NUMBER --body "@Cursor review this PR for improvements"
- ```
- 3. **How It Works:** Cursor's Background Agent will read the PR, apply fixes if needed, and push commits (per changelog features).
-
- #### Safeguards to Prevent Breaking Working Code
- Constant improvements are great, but we must avoid breaking things that already work well. Here's how to build safety into the process:
-
- - **Regression Testing:** Always run tests against gold standards (e.g., example JSONs in `examples/warehouse/lineage`). If an agent's suggestion changes outputs, reject it unless intentionally updating the gold.
- - **CI/CD Pipelines:** Set up automated tests in GitHub Actions to run on every PR or scheduled run. Fail the build if tests break, catching issues early.
- - **Human Oversight:** Agents suggest changes; review and approve them manually before merging. Use PR reviews to double-check.
- - **Modular Changes:** Limit agent tasks to small, isolated improvements (e.g., one file at a time) to minimize risk.
- - **Monitoring and Rollback:** Track metrics like test pass rates. Use Git for easy rollbacks if something breaks.
-
- Integrate these into workflows: Add test steps to the example YAML files, and prompt agents with "Suggest improvements without changing existing correct behavior."
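
Note on the deleted workflow doc above: its "evaluation metrics" section tracks an exact-match rate across the corpus. A small sketch of that metric, assuming the layout the doc uses throughout (`build/lineage` as the generated-output directory):

```python
import json
from pathlib import Path

gold_dir = Path("examples/warehouse/lineage")
build_dir = Path("build/lineage")  # assumed extractor output location

total = matched = 0
for gold in sorted(gold_dir.glob("*.json")):
    total += 1
    candidate = build_dir / gold.name
    # A file counts as matched only if it exists and its parsed JSON is identical.
    if candidate.exists() and json.loads(candidate.read_text()) == json.loads(gold.read_text()):
        matched += 1

print(f"exact-match rate: {matched}/{total}")
```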