InfoTracker 0.3.0__tar.gz → 0.4.0__tar.gz

Files changed (28)
  1. infotracker-0.4.0/.cursorrules +105 -0
  2. {infotracker-0.3.0 → infotracker-0.4.0}/.gitignore +3 -4
  3. infotracker-0.4.0/PKG-INFO +301 -0
  4. infotracker-0.4.0/README.md +270 -0
  5. {infotracker-0.3.0 → infotracker-0.4.0}/pyproject.toml +1 -1
  6. {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/__init__.py +1 -1
  7. {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/cli.py +11 -0
  8. {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/engine.py +27 -11
  9. {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/infotracker.yml +1 -1
  10. infotracker-0.4.0/src/infotracker/io_utils.py +312 -0
  11. infotracker-0.4.0/src/infotracker/lineage.py +255 -0
  12. {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/models.py +4 -0
  13. {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/openlineage_utils.py +16 -0
  14. infotracker-0.4.0/src/infotracker/parser.py +3125 -0
  15. infotracker-0.3.0/PKG-INFO +0 -285
  16. infotracker-0.3.0/README.md +0 -254
  17. infotracker-0.3.0/requirements.txt +0 -11
  18. infotracker-0.3.0/sql_diff_demo/base/01_view.sql +0 -6
  19. infotracker-0.3.0/sql_diff_demo/head_breaking/01_view.sql +0 -6
  20. infotracker-0.3.0/sql_diff_demo/head_non/01_view.sql +0 -7
  21. infotracker-0.3.0/sql_diff_demo/head_potential/01_view.sql +0 -6
  22. infotracker-0.3.0/src/infotracker/lineage.py +0 -125
  23. infotracker-0.3.0/src/infotracker/parser.py +0 -1606
  24. {infotracker-0.3.0 → infotracker-0.4.0}/MANIFEST.in +0 -0
  25. {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/__main__.py +0 -0
  26. {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/adapters.py +0 -0
  27. {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/config.py +0 -0
  28. {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/diff.py +0 -0
@@ -0,0 +1,105 @@
1
+ # InfoTracker – Cursor Rules
2
+ # Short rules for automated changes in the project (parser/engine/CLI/diff/OL).
3
+
4
+ # ✱ GENERAL RULES
5
+ - Do not change the public CLI interfaces (command and flag names): extract, impact, diff.
6
+ - Keep output files compatible (OpenLineage JSON, text table format).
7
+ - Do not remove documentation or TODO comments; add "// NOTE(cursor): …" where you justify a decision.
8
+ - Logging: use WARNING for user-facing problems (parse/cycle), DEBUG for details; do not spam INFO.
9
+ - Determinism: keep sorting stable (by object_name, ordinal) and keep run IDs stable.
10
+ - Every change that touches semantics must have test(s) in `tests/` and pass `pytest -q`.
11
+ - Do not add hard external dependencies; stick to the stdlib + the existing packages.
12
+
13
+ # ✱ PARSER (T-SQL)
14
+ - Always run pre-processing before sqlglot.parse:
15
+ - Remove lines starting with DECLARE/SET/PRINT.
16
+ - Remove GO (empty batches).
17
+ - Remove single-line IF OBJECT_ID('tempdb..#…') … DROP TABLE #… and bare DROP TABLE #… (temp tables housekeeping).
18
+ - Join a two-line "INSERT INTO #tmp" + following "EXEC …" into a single line: "INSERT INTO #tmp EXEC …".
19
+ - Fallback when parsing fails:
20
+ - Recognize `INSERT INTO <table|#temp> EXEC <proc>` (handle both temp and regular tables).
21
+ - Clean the procedure name (_without_ a trailing `;` and without `()`).
22
+ - Return ObjectInfo:
23
+ - name = table name (temp → `#…`), object_type = temp_table|table,
24
+ - schema: 2 placeholder columns (nullable=True),
25
+ - dependencies = {proc_name},
26
+ - lineage: ColumnReference(table_name=proc_name, column_name="*", transformation_type=EXEC).
27
+ - SELECT/CAST:
28
+ - Fetch projections defensively: expressions | projections | args["expressions"].
29
+ - For `CAST(... AS TYPE)` set ColumnSchema.data_type to the target type (e.g. DECIMAL(10,2), NVARCHAR(50)).
30
+ - Qualification:
31
+ - Temp tables always go under `namespace="tempdb"`.
32
+ - MSSQL dataset namespace: `mssql://localhost/InfoTrackerDW` (unless the user sets it in the config).
33
+
34
+ # ✱ ENGINE
35
+ - Build the dependency graph from `ObjectInfo.dependencies`. If empty → fall back to the tables from `lineage.input_fields`.
36
+ - Topological sorting must not rely on file order.
37
+ - On "circular/missing deps", log a WARNING and process anyway (best-effort).
38
+
39
+ # ✱ OPENLINEAGE
40
+ - For outputs, add the `schema` facet for all objects with columns (tables, views, procs/functions with result sets).
41
+ - Add the `columnLineage` facet when lineage exists.
42
+ - Namespaces and dataset names must be consistent across the whole repo.
43
+
44
+ # ✱ DIFF
45
+ - Change classification:
46
+ - COLUMN_TYPE_CHANGED → BREAKING.
47
+ - COLUMN_RENAMED → POTENTIALLY_BREAKING (detection: type/nullability/lineage match; a heuristic pairs the columns).
48
+ - COLUMN_ADDED → NON_BREAKING if nullable, otherwise POTENTIALLY_BREAKING.
49
+ - OBJECT_REMOVED/ADDED → BREAKING.
50
+ - Honor `severity_threshold` from the config in the display filter and the exit code:
51
+ - BREAKING → only breaking changes.
52
+ - POTENTIALLY_BREAKING → breaking + potentially.
53
+ - NON_BREAKING → all changes.
54
+
55
+ # ✱ WILDCARDS (ColumnGraph.find_columns_wildcard)
56
+ - Empty/half-empty pattern (`"."`, "`..`", "`ns..`" without a column pattern) → return an empty list.
57
+ - `schema.table.*`:
58
+ - Match case-insensitively on the table-name suffix (suffix of 3/2/1 segments), regardless of whether a database is present.
59
+ - `..pattern` and `prefix..pattern`:
60
+ - `pattern` without metacharacters works as "contains" (case-insensitive); with metacharacters → `fnmatch`.
61
+ - `prefix` with `://` filters by namespace; without `://` it filters by `table_name` (startswith).
62
+ - Always deduplicate nodes before returning.
63
+
64
+ # ✱ CLI
65
+ - Do not add new flags without justification. Allowed: `--graph-dir`, `--out`, `--format`, `--config`.
66
+ - (Optional) you may add `--threshold` to `diff`, but it must honor the config values (the flag only overrides them).
67
+
68
+ # ✱ TESTS
69
+ - Add a test for every change to a parser/diff/wildcard heuristic.
70
+ - In `INSERT … EXEC` tests, compare the procedure name without `;`.
71
+ - In wildcard tests, check both exact mode (`schema.table.*`) and contains mode (`..customer*`), with and without a namespace.
72
+
73
+ # ✱ COMMITS / PRs
74
+ - Commits: convention
75
+ - feat(parser|engine|cli|diff|models): …
76
+ - fix(parser|engine|…): …
77
+ - chore(logging|tests|ci): …
78
+ - A PR should be small and atomic; it contains a description of the change, its effect on diff/impact, and a link to the tests.
79
+ - Do not combine a refactor with a functional change in one PR (unless it is a mechanical, risk-free lint fix).
80
+
81
+ # ✱ TYPICAL AUTO-FIXES ALLOWED
82
+ - Removing dead code after `return`.
83
+ - Inserting missing `GO` between multiple `CREATE` statements in a single example file.
84
+ - Adding a missing `CREATE` before `VIEW` when the pattern `;VIEW … AS` is detected (demo).
85
+ - Normalizing procedure names (removing `;`/`()`).
86
+
87
+ # ✱ WHAT IS FORBIDDEN
88
+ - Changing the semantics of the OpenLineage output (e.g. a different JSON shape).
89
+ - Silencing warnings without addressing the cause (e.g. "pass" in except).
90
+ - Replacing the SQL parser or adding heavy dependencies.
91
+
92
+ # ✱ QUICK COMMANDS (smoke)
93
+ - extract: `infotracker extract --sql-dir examples/warehouse/sql --out-dir build/lineage`
94
+ - impact: `infotracker impact -s "+dbo.orders.total_amount+" --graph-dir build/lineage --max-depth 3`
95
+ - diff: `infotracker --config tmp.yml diff --base build/ol_base --head build/ol_head --format text`
96
+ - tests: `pytest -q` or selectively `pytest -k wildcard -q`
97
+
98
+ # ✱ PROJECT CONTEXT (for the agent)
99
+ - Project: InfoTracker – lineage/diff for MSSQL with OL artifacts; modules: parser.py, engine.py, diff.py, models.py, lineage.py, cli.py.
100
+ - Most important contracts: ObjectInfo, ColumnSchema, ColumnLineage, ColumnGraph.
101
+ - Key invariants:
102
+ - parser: preprocess + fallback for INSERT INTO … EXEC,
103
+ - engine: deps from ObjectInfo.dependencies,
104
+ - diff: filter by severity_threshold,
105
+ - models: case-insensitive wildcards + matching by table suffix.
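Reviewer sketch of the DIFF rules in the `.cursorrules` hunk above: how change types map to severities and how `severity_threshold` filters them. This is a minimal illustration under stated assumptions, not InfoTracker's implementation. The names `Severity`, `classify`, and `filter_by_threshold` are hypothetical, and raising on an unlisted change type is an assumption because the rules do not say what happens there.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Ordered so that a higher value means a more severe change."""
    NON_BREAKING = 0
    POTENTIALLY_BREAKING = 1
    BREAKING = 2


def classify(change_type: str, nullable: bool = True) -> Severity:
    """Map a change type to a severity, following the listed DIFF rules."""
    if change_type in {"COLUMN_TYPE_CHANGED", "OBJECT_REMOVED", "OBJECT_ADDED"}:
        return Severity.BREAKING
    if change_type == "COLUMN_RENAMED":
        return Severity.POTENTIALLY_BREAKING
    if change_type == "COLUMN_ADDED":
        # A nullable new column is harmless; a NOT NULL one may break writers.
        return Severity.NON_BREAKING if nullable else Severity.POTENTIALLY_BREAKING
    # Assumption: the rules do not cover other types, so fail loudly here.
    raise ValueError(f"unknown change type: {change_type}")


def filter_by_threshold(changes, threshold: str):
    """Keep (change_type, severity) pairs at or above the named threshold."""
    floor = Severity[threshold]
    return [change for change in changes if change[1] >= floor]


if __name__ == "__main__":
    changes = [
        ("COLUMN_TYPE_CHANGED", classify("COLUMN_TYPE_CHANGED")),
        ("COLUMN_RENAMED", classify("COLUMN_RENAMED")),
        ("COLUMN_ADDED", classify("COLUMN_ADDED", nullable=True)),
    ]
    for name in ("BREAKING", "POTENTIALLY_BREAKING", "NON_BREAKING"):
        kept = [change_type for change_type, _ in filter_by_threshold(changes, name)]
        print(name, kept)
```

Using an ordered enum keeps the three threshold cases as a single comparison, so `NON_BREAKING` naturally means "show everything".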
@@ -36,7 +36,6 @@ docs/_build/
36
36
  (?i)*phase*.py
37
37
  (?i)*test*.md
38
38
  (?i)*phase*.md
39
- InfoTracker/PHASE2_COMPLETE.md
40
- InfoTracker/PHASE3_PLAN.md
41
- InfoTracker/test_phase2.py
42
- InfoTracker/test_comprehensive.py
39
+ examples/warehouse/sql2
40
+
41
+ *.txt
@@ -0,0 +1,301 @@
1
+ Metadata-Version: 2.4
2
+ Name: InfoTracker
3
+ Version: 0.4.0
4
+ Summary: Column-level SQL lineage, impact analysis, and breaking-change detection (MS SQL first)
5
+ Project-URL: homepage, https://example.com/infotracker
6
+ Project-URL: documentation, https://example.com/infotracker/docs
7
+ Author: InfoTracker Authors
8
+ License: MIT
9
+ Keywords: data-lineage,impact-analysis,lineage,mssql,openlineage,sql
10
+ Classifier: Environment :: Console
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Operating System :: OS Independent
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Programming Language :: Python :: 3.10
15
+ Classifier: Topic :: Database
16
+ Classifier: Topic :: Software Development :: Libraries
17
+ Requires-Python: >=3.10
18
+ Requires-Dist: click
19
+ Requires-Dist: networkx>=3.3
20
+ Requires-Dist: packaging>=24.0
21
+ Requires-Dist: pydantic>=2.8.2
22
+ Requires-Dist: pyyaml>=6.0.1
23
+ Requires-Dist: rich
24
+ Requires-Dist: shellingham
25
+ Requires-Dist: sqlglot>=23.0.0
26
+ Requires-Dist: typer
27
+ Provides-Extra: dev
28
+ Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
29
+ Requires-Dist: pytest>=7.4.0; extra == 'dev'
30
+ Description-Content-Type: text/markdown
31
+
32
+ # InfoTracker
33
+
34
+ **Column-level SQL lineage extraction and impact analysis for MS SQL Server**
35
+
36
+ InfoTracker is a powerful command-line tool that parses T-SQL files and generates detailed column-level lineage in OpenLineage format. It supports advanced SQL Server features including table-valued functions, stored procedures, temp tables, and EXEC patterns.
37
+
38
+ [![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://python.org)
39
+ [![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
40
+ [![PyPI](https://img.shields.io/badge/PyPI-InfoTracker-blue.svg)](https://pypi.org/project/InfoTracker/)
41
+
42
+ ## 🚀 Features
43
+
44
+ - **Column-level lineage** - Track data flow at the column level with precise transformations
45
+ - **Advanced SQL support** - T-SQL dialect with temp tables, variables, CTEs, and window functions
46
+ - **Impact analysis** - Find upstream and downstream dependencies with flexible selectors
47
+ - **Wildcard matching** - Support for table wildcards (`schema.table.*`) and column wildcards (`..pattern`)
48
+ - **Breaking change detection** - Detect schema changes that could break downstream processes
49
+ - **Multiple output formats** - Text tables or JSON for integration with other tools
50
+ - **OpenLineage compatible** - Standard format for data lineage interoperability
51
+ - **Advanced SQL objects** - Table-valued functions (TVF) and dataset-returning procedures
52
+ - **Temp table tracking** - Full lineage through EXEC into temp tables
53
+
54
+ ## 📦 Installation
55
+
56
+ ### From PyPI (Recommended)
57
+ ```bash
58
+ pip install InfoTracker
59
+ ```
60
+
61
+ ### From GitHub
62
+ ```bash
63
+ # Latest commit on the default branch
64
+ pip install git+https://github.com/InfoMatePL/InfoTracker.git
65
+
66
+ # Development version
67
+ git clone https://github.com/InfoMatePL/InfoTracker.git
68
+ cd InfoTracker
69
+ pip install -e .
70
+ ```
71
+
72
+ ### Verify Installation
73
+ ```bash
74
+ infotracker --help
75
+ ```
76
+
77
+ ## ⚡ Quick Start
78
+
79
+ ### 1. Extract Lineage
80
+ ```bash
81
+ # Extract lineage from SQL files
82
+ infotracker extract --sql-dir examples/warehouse/sql --out-dir build/lineage
83
+ ```
84
+
85
+ ### 2. Run Impact Analysis
86
+ ```bash
87
+ # Find what feeds into a column (upstream)
88
+ infotracker impact -s "+STG.dbo.Orders.OrderID"
89
+
90
+ # Find what uses a column (downstream)
91
+ infotracker impact -s "STG.dbo.Orders.OrderID+"
92
+
93
+ # Both directions
94
+ infotracker impact -s "+dbo.fct_sales.Revenue+"
95
+ ```
96
+
97
+ ### 3. Detect Breaking Changes
98
+ ```bash
99
+ # Compare two versions of your schema
100
+ infotracker diff --base build/lineage --head build/lineage_new
101
+ ```
102
+ ## 📖 Selector Syntax
103
+
104
+ InfoTracker supports flexible column selectors for precise impact analysis:
105
+
106
+ | Selector Format | Description | Example |
107
+ |-----------------|-------------|---------|
108
+ | `table.column` | Simple format (adds default `dbo` schema) | `Orders.OrderID` |
109
+ | `schema.table.column` | Schema-qualified format | `dbo.Orders.OrderID` |
110
+ | `database.schema.table.column` | Database-qualified format | `STG.dbo.Orders.OrderID` |
111
+ | `schema.table.*` | Table wildcard (all columns) | `dbo.fct_sales.*` |
112
+ | `..pattern` | Column wildcard (name contains pattern) | `..revenue` |
113
+ | `..pattern*` | Column wildcard with fnmatch | `..customer*` |
114
+
115
+ ### Direction Control
116
+ - `selector` - downstream dependencies (default)
117
+ - `+selector` - upstream sources
118
+ - `selector+` - downstream dependencies (explicit)
119
+ - `+selector+` - both upstream and downstream
120
+
121
+ ## 💡 Examples
122
+
123
+ ### Basic Usage
124
+ ```bash
125
+ # Extract lineage first (always run this before impact analysis)
126
+ infotracker extract --sql-dir examples/warehouse/sql --out-dir build/lineage
127
+
128
+ # Basic column lineage
129
+ infotracker impact -s "+dbo.fct_sales.Revenue" # What feeds this column?
130
+ infotracker impact -s "STG.dbo.Orders.OrderID+" # What uses this column?
131
+ ```
132
+
133
+ ### Wildcard Selectors
134
+ ```bash
135
+ # All columns from a specific table
136
+ infotracker impact -s "dbo.fct_sales.*"
137
+ infotracker impact -s "STG.dbo.Orders.*"
138
+
139
+ # Find all columns containing "revenue" (case-insensitive)
140
+ infotracker impact -s "..revenue"
141
+
142
+ # Find all columns starting with "customer"
143
+ infotracker impact -s "..customer*"
144
+ ```
145
+
146
+ ### Advanced SQL Objects
147
+ ```bash
148
+ # Table-valued function columns (upstream)
149
+ infotracker impact -s "+dbo.fn_customer_orders_tvf.*"
150
+
151
+ # Procedure dataset columns (upstream)
152
+ infotracker impact -s "+dbo.usp_customer_metrics_dataset.*"
153
+
154
+ # Temp table lineage from EXEC
155
+ infotracker impact -s "+#temp_table.*"
156
+ ```
157
+
158
+ ### Output Formats
159
+ ```bash
160
+ # Text output (default, human-readable)
161
+ infotracker impact -s "+..revenue"
162
+
163
+ # JSON output (machine-readable)
164
+ infotracker --format json impact -s "..customer*" > customer_lineage.json
165
+
166
+ # Control traversal depth
167
+ infotracker impact -s "+dbo.Orders.OrderID" --max-depth 2
168
+ ```
169
+
170
+ ### Breaking Change Detection
171
+ ```bash
172
+ # Extract baseline
173
+ infotracker extract --sql-dir sql_v1 --out-dir build/baseline
174
+
175
+ # Extract new version
176
+ infotracker extract --sql-dir sql_v2 --out-dir build/current
177
+
178
+ # Detect breaking changes
179
+ infotracker diff --base build/baseline --head build/current
180
+
181
+ # Filter by severity
182
+ infotracker diff --base build/baseline --head build/current --threshold BREAKING
183
+ ```
184
+
185
+
186
+ ## Output Format
187
+
188
+ Impact analysis returns these columns:
189
+ - **from** - Source column (fully qualified)
190
+ - **to** - Target column (fully qualified)
191
+ - **direction** - `upstream` or `downstream`
192
+ - **transformation** - Type of transformation (`IDENTITY`, `ARITHMETIC`, `AGGREGATION`, `CASE_AGGREGATION`, `DATE_FUNCTION`, `WINDOW`, etc.)
193
+ - **description** - Human-readable transformation description
194
+
195
+ Results are automatically deduplicated. Use `--format json` for machine-readable output.
196
+
197
+ ### New Transformation Types
198
+
199
+ The enhanced transformation taxonomy includes:
200
+ - `ARITHMETIC_AGGREGATION` - Arithmetic operations combined with aggregation functions
201
+ - `COMPLEX_AGGREGATION` - Multi-step calculations involving multiple aggregations
202
+ - `DATE_FUNCTION` - Date/time calculations like DATEDIFF, DATEADD
203
+ - `DATE_FUNCTION_AGGREGATION` - Date functions applied to aggregated results
204
+ - `CASE_AGGREGATION` - CASE statements applied to aggregated results
205
+
206
+ ### Advanced Object Support
207
+
208
+ InfoTracker now supports advanced SQL Server objects:
209
+
210
+ **Table-Valued Functions (TVF):**
211
+ - Inline TVF (`RETURNS TABLE AS RETURN (SELECT ...)`) - Parsed directly from the SELECT statement
212
+ - Multi-statement TVF (`RETURNS @table TABLE (...)`) - Schema extracted from the table variable definition
213
+ - Function parameters are tracked as filter metadata (they do not create columns)
214
+
215
+ **Dataset-Returning Procedures:**
216
+ - Procedures ending with a SELECT statement are treated as dataset sources
217
+ - Output schema extracted from the final SELECT statement
218
+ - Parameters tracked as filter metadata affecting lineage scope
219
+
220
+ **EXEC into Temp Tables:**
221
+ - `INSERT INTO #temp EXEC procedure` patterns create edges from procedure columns to temp table columns
222
+ - Temp table lineage propagates downstream to final targets
223
+ - Supports complex workflow patterns combining functions, procedures, and temp tables
224
+
225
+ ## Configuration Precedence
226
+
227
+ InfoTracker follows this configuration precedence:
228
+ 1. **CLI flags** (highest priority) - override everything
229
+ 2. **infotracker.yml** config file - project defaults
230
+ 3. **Built-in defaults** (lowest priority) - fallback values
231
+
232
+ ## 🔧 Configuration
233
+
234
+ Create an `infotracker.yml` file in your project root:
235
+
236
+ ```yaml
237
+ sql_dirs:
238
+ - "sql/"
239
+ - "models/"
240
+ out_dir: "build/lineage"
241
+ exclude_dirs:
242
+ - "__pycache__"
243
+ - ".git"
244
+ severity_threshold: "POTENTIALLY_BREAKING"
245
+ ```
246
+
247
+ ### Configuration Options
248
+
249
+ | Setting | Description | Default | Examples |
250
+ |---------|-------------|---------|----------|
251
+ | `sql_dirs` | Directories to scan for SQL files | `["."]` | `["sql/", "models/"]` |
252
+ | `out_dir` | Output directory for lineage files | `"lineage"` | `"build/artifacts"` |
253
+ | `exclude_dirs` | Directories to skip | `[]` | `["__pycache__", "node_modules"]` |
254
+ | `severity_threshold` | Breaking change detection level | `"NON_BREAKING"` | `"BREAKING"` |
255
+
256
+ ## 📚 Documentation
257
+
258
+ - **[Architecture](docs/architecture.md)** - Core concepts and design
259
+ - **[Lineage Concepts](docs/lineage_concepts.md)** - Data lineage fundamentals
260
+ - **[CLI Usage](docs/cli_usage.md)** - Complete command reference
261
+ - **[Configuration](docs/configuration.md)** - Advanced configuration options
262
+ - **[DBT Integration](docs/dbt_integration.md)** - Using with DBT projects
263
+ - **[OpenLineage Mapping](docs/openlineage_mapping.md)** - Output format specification
264
+ - **[Breaking Changes](docs/breaking_changes.md)** - Change detection and severity levels
265
+ - **[Advanced Use Cases](docs/advanced_use_cases.md)** - TVFs, stored procedures, and complex scenarios
266
+ - **[Edge Cases](docs/edge_cases.md)** - SELECT *, UNION, temp tables handling
267
+ - **[FAQ](docs/faq.md)** - Common questions and troubleshooting
268
+
269
+ ## 🧪 Testing
270
+
271
+ ```bash
272
+ # Run all tests
273
+ pytest
274
+
275
+ # Run specific test categories
276
+ pytest tests/test_parser.py # Parser functionality
277
+ pytest tests/test_wildcard.py # Wildcard selectors
278
+ pytest tests/test_adapter.py # SQL dialect adapters
279
+
280
+ # Run with coverage
281
+ pytest --cov=infotracker --cov-report=html
282
+ ```
283
+
284
+
285
+
286
+
287
+
288
+ ## 📄 License
289
+
290
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
291
+
292
+ ## 🙏 Acknowledgments
293
+
294
+ - [SQLGlot](https://github.com/tobymao/sqlglot) - SQL parsing library
295
+ - [OpenLineage](https://openlineage.io/) - Data lineage standard
296
+ - [Typer](https://typer.tiangolo.com/) - CLI framework
297
+ - [Rich](https://rich.readthedocs.io/) - Terminal formatting
298
+
299
+ ---
300
+
301
+ **InfoTracker** - Making database schema evolution safer, one column at a time. 🎯
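Reviewer sketch of the "Direction Control" markers described in the README text above: a leading `+` requests upstream traversal, a trailing `+` requests downstream, and a bare selector defaults to downstream. This is an illustration only; `parse_direction` is a hypothetical helper, not a function confirmed to exist in the package.

```python
def parse_direction(raw: str) -> tuple[str, bool, bool]:
    """Split a selector into (selector, upstream, downstream).

    Rules taken from the README:
      selector    -> downstream (default)
      +selector   -> upstream
      selector+   -> downstream (explicit)
      +selector+  -> both
    """
    text = raw.strip()
    upstream = text.startswith("+")
    if upstream:
        text = text[1:]
    trailing_plus = text.endswith("+")
    if trailing_plus:
        text = text[:-1]
    # With no markers at all, the README says the default is downstream.
    downstream = trailing_plus or not upstream
    return text, upstream, downstream


if __name__ == "__main__":
    for raw in ("dbo.Orders.OrderID", "+dbo.Orders.OrderID",
                "dbo.Orders.OrderID+", "+dbo.Orders.OrderID+"):
        print(raw, "->", parse_direction(raw))
```

Stripping the markers before any further matching keeps wildcard handling (`schema.table.*`, `..pattern`) independent of direction.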
@@ -0,0 +1,270 @@
1
+ # InfoTracker
2
+
3
+ **Column-level SQL lineage extraction and impact analysis for MS SQL Server**
4
+
5
+ InfoTracker is a powerful command-line tool that parses T-SQL files and generates detailed column-level lineage in OpenLineage format. It supports advanced SQL Server features including table-valued functions, stored procedures, temp tables, and EXEC patterns.
6
+
7
+ [![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://python.org)
8
+ [![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
9
+ [![PyPI](https://img.shields.io/badge/PyPI-InfoTracker-blue.svg)](https://pypi.org/project/InfoTracker/)
10
+
11
+ ## 🚀 Features
12
+
13
+ - **Column-level lineage** - Track data flow at the column level with precise transformations
14
+ - **Advanced SQL support** - T-SQL dialect with temp tables, variables, CTEs, and window functions
15
+ - **Impact analysis** - Find upstream and downstream dependencies with flexible selectors
16
+ - **Wildcard matching** - Support for table wildcards (`schema.table.*`) and column wildcards (`..pattern`)
17
+ - **Breaking change detection** - Detect schema changes that could break downstream processes
18
+ - **Multiple output formats** - Text tables or JSON for integration with other tools
19
+ - **OpenLineage compatible** - Standard format for data lineage interoperability
20
+ - **Advanced SQL objects** - Table-valued functions (TVF) and dataset-returning procedures
21
+ - **Temp table tracking** - Full lineage through EXEC into temp tables
22
+
23
+ ## 📦 Installation
24
+
25
+ ### From PyPI (Recommended)
26
+ ```bash
27
+ pip install InfoTracker
28
+ ```
29
+
30
+ ### From GitHub
31
+ ```bash
32
+ # Latest commit on the default branch
33
+ pip install git+https://github.com/InfoMatePL/InfoTracker.git
34
+
35
+ # Development version
36
+ git clone https://github.com/InfoMatePL/InfoTracker.git
37
+ cd InfoTracker
38
+ pip install -e .
39
+ ```
40
+
41
+ ### Verify Installation
42
+ ```bash
43
+ infotracker --help
44
+ ```
45
+
46
+ ## ⚡ Quick Start
47
+
48
+ ### 1. Extract Lineage
49
+ ```bash
50
+ # Extract lineage from SQL files
51
+ infotracker extract --sql-dir examples/warehouse/sql --out-dir build/lineage
52
+ ```
53
+
54
+ ### 2. Run Impact Analysis
55
+ ```bash
56
+ # Find what feeds into a column (upstream)
57
+ infotracker impact -s "+STG.dbo.Orders.OrderID"
58
+
59
+ # Find what uses a column (downstream)
60
+ infotracker impact -s "STG.dbo.Orders.OrderID+"
61
+
62
+ # Both directions
63
+ infotracker impact -s "+dbo.fct_sales.Revenue+"
64
+ ```
65
+
66
+ ### 3. Detect Breaking Changes
67
+ ```bash
68
+ # Compare two versions of your schema
69
+ infotracker diff --base build/lineage --head build/lineage_new
70
+ ```
71
+ ## 📖 Selector Syntax
72
+
73
+ InfoTracker supports flexible column selectors for precise impact analysis:
74
+
75
+ | Selector Format | Description | Example |
76
+ |-----------------|-------------|---------|
77
+ | `table.column` | Simple format (adds default `dbo` schema) | `Orders.OrderID` |
78
+ | `schema.table.column` | Schema-qualified format | `dbo.Orders.OrderID` |
79
+ | `database.schema.table.column` | Database-qualified format | `STG.dbo.Orders.OrderID` |
80
+ | `schema.table.*` | Table wildcard (all columns) | `dbo.fct_sales.*` |
81
+ | `..pattern` | Column wildcard (name contains pattern) | `..revenue` |
82
+ | `..pattern*` | Column wildcard with fnmatch | `..customer*` |
83
+
84
+ ### Direction Control
85
+ - `selector` - downstream dependencies (default)
86
+ - `+selector` - upstream sources
87
+ - `selector+` - downstream dependencies (explicit)
88
+ - `+selector+` - both upstream and downstream
89
+
90
+ ## 💡 Examples
91
+
92
+ ### Basic Usage
93
+ ```bash
94
+ # Extract lineage first (always run this before impact analysis)
95
+ infotracker extract --sql-dir examples/warehouse/sql --out-dir build/lineage
96
+
97
+ # Basic column lineage
98
+ infotracker impact -s "+dbo.fct_sales.Revenue" # What feeds this column?
99
+ infotracker impact -s "STG.dbo.Orders.OrderID+" # What uses this column?
100
+ ```
101
+
102
+ ### Wildcard Selectors
103
+ ```bash
104
+ # All columns from a specific table
105
+ infotracker impact -s "dbo.fct_sales.*"
106
+ infotracker impact -s "STG.dbo.Orders.*"
107
+
108
+ # Find all columns containing "revenue" (case-insensitive)
109
+ infotracker impact -s "..revenue"
110
+
111
+ # Find all columns starting with "customer"
112
+ infotracker impact -s "..customer*"
113
+ ```
114
+
115
+ ### Advanced SQL Objects
116
+ ```bash
117
+ # Table-valued function columns (upstream)
118
+ infotracker impact -s "+dbo.fn_customer_orders_tvf.*"
119
+
120
+ # Procedure dataset columns (upstream)
121
+ infotracker impact -s "+dbo.usp_customer_metrics_dataset.*"
122
+
123
+ # Temp table lineage from EXEC
124
+ infotracker impact -s "+#temp_table.*"
125
+ ```
126
+
127
+ ### Output Formats
128
+ ```bash
129
+ # Text output (default, human-readable)
130
+ infotracker impact -s "+..revenue"
131
+
132
+ # JSON output (machine-readable)
133
+ infotracker --format json impact -s "..customer*" > customer_lineage.json
134
+
135
+ # Control traversal depth
136
+ infotracker impact -s "+dbo.Orders.OrderID" --max-depth 2
137
+ ```
138
+
139
+ ### Breaking Change Detection
140
+ ```bash
141
+ # Extract baseline
142
+ infotracker extract --sql-dir sql_v1 --out-dir build/baseline
143
+
144
+ # Extract new version
145
+ infotracker extract --sql-dir sql_v2 --out-dir build/current
146
+
147
+ # Detect breaking changes
148
+ infotracker diff --base build/baseline --head build/current
149
+
150
+ # Filter by severity
151
+ infotracker diff --base build/baseline --head build/current --threshold BREAKING
152
+ ```
153
+
154
+
155
+ ## Output Format
156
+
157
+ Impact analysis returns these columns:
158
+ - **from** - Source column (fully qualified)
159
+ - **to** - Target column (fully qualified)
160
+ - **direction** - `upstream` or `downstream`
161
+ - **transformation** - Type of transformation (`IDENTITY`, `ARITHMETIC`, `AGGREGATION`, `CASE_AGGREGATION`, `DATE_FUNCTION`, `WINDOW`, etc.)
162
+ - **description** - Human-readable transformation description
163
+
164
+ Results are automatically deduplicated. Use `--format json` for machine-readable output.
165
+
166
+ ### New Transformation Types
167
+
168
+ The enhanced transformation taxonomy includes:
169
+ - `ARITHMETIC_AGGREGATION` - Arithmetic operations combined with aggregation functions
170
+ - `COMPLEX_AGGREGATION` - Multi-step calculations involving multiple aggregations
171
+ - `DATE_FUNCTION` - Date/time calculations like DATEDIFF, DATEADD
172
+ - `DATE_FUNCTION_AGGREGATION` - Date functions applied to aggregated results
173
+ - `CASE_AGGREGATION` - CASE statements applied to aggregated results
174
+
175
+ ### Advanced Object Support
176
+
177
+ InfoTracker now supports advanced SQL Server objects:
178
+
179
+ **Table-Valued Functions (TVF):**
180
+ - Inline TVF (`RETURNS TABLE AS RETURN (SELECT ...)`) - Parsed directly from the SELECT statement
181
+ - Multi-statement TVF (`RETURNS @table TABLE (...)`) - Schema extracted from the table variable definition
182
+ - Function parameters are tracked as filter metadata (they do not create columns)
183
+
184
+ **Dataset-Returning Procedures:**
185
+ - Procedures ending with a SELECT statement are treated as dataset sources
186
+ - Output schema extracted from the final SELECT statement
187
+ - Parameters tracked as filter metadata affecting lineage scope
188
+
189
+ **EXEC into Temp Tables:**
190
+ - `INSERT INTO #temp EXEC procedure` patterns create edges from procedure columns to temp table columns
191
+ - Temp table lineage propagates downstream to final targets
192
+ - Supports complex workflow patterns combining functions, procedures, and temp tables
193
+
194
+ ## Configuration Precedence
195
+
196
+ InfoTracker follows this configuration precedence:
197
+ 1. **CLI flags** (highest priority) - override everything
198
+ 2. **infotracker.yml** config file - project defaults
199
+ 3. **Built-in defaults** (lowest priority) - fallback values
200
+
201
+ ## 🔧 Configuration
202
+
203
+ Create an `infotracker.yml` file in your project root:
204
+
205
+ ```yaml
206
+ sql_dirs:
207
+ - "sql/"
208
+ - "models/"
209
+ out_dir: "build/lineage"
210
+ exclude_dirs:
211
+ - "__pycache__"
212
+ - ".git"
213
+ severity_threshold: "POTENTIALLY_BREAKING"
214
+ ```
215
+
216
+ ### Configuration Options
217
+
218
+ | Setting | Description | Default | Examples |
219
+ |---------|-------------|---------|----------|
220
+ | `sql_dirs` | Directories to scan for SQL files | `["."]` | `["sql/", "models/"]` |
221
+ | `out_dir` | Output directory for lineage files | `"lineage"` | `"build/artifacts"` |
222
+ | `exclude_dirs` | Directories to skip | `[]` | `["__pycache__", "node_modules"]` |
223
+ | `severity_threshold` | Breaking change detection level | `"NON_BREAKING"` | `"BREAKING"` |
224
+
225
+ ## 📚 Documentation
226
+
227
+ - **[Architecture](docs/architecture.md)** - Core concepts and design
228
+ - **[Lineage Concepts](docs/lineage_concepts.md)** - Data lineage fundamentals
229
+ - **[CLI Usage](docs/cli_usage.md)** - Complete command reference
230
+ - **[Configuration](docs/configuration.md)** - Advanced configuration options
231
+ - **[DBT Integration](docs/dbt_integration.md)** - Using with DBT projects
232
+ - **[OpenLineage Mapping](docs/openlineage_mapping.md)** - Output format specification
233
+ - **[Breaking Changes](docs/breaking_changes.md)** - Change detection and severity levels
234
+ - **[Advanced Use Cases](docs/advanced_use_cases.md)** - TVFs, stored procedures, and complex scenarios
235
+ - **[Edge Cases](docs/edge_cases.md)** - SELECT *, UNION, temp tables handling
236
+ - **[FAQ](docs/faq.md)** - Common questions and troubleshooting
237
+
238
+ ## 🧪 Testing
239
+
240
+ ```bash
241
+ # Run all tests
242
+ pytest
243
+
244
+ # Run specific test categories
245
+ pytest tests/test_parser.py # Parser functionality
246
+ pytest tests/test_wildcard.py # Wildcard selectors
247
+ pytest tests/test_adapter.py # SQL dialect adapters
248
+
249
+ # Run with coverage
250
+ pytest --cov=infotracker --cov-report=html
251
+ ```
252
+
253
+
254
+
255
+
256
+
257
+ ## 📄 License
258
+
259
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
260
+
261
+ ## 🙏 Acknowledgments
262
+
263
+ - [SQLGlot](https://github.com/tobymao/sqlglot) - SQL parsing library
264
+ - [OpenLineage](https://openlineage.io/) - Data lineage standard
265
+ - [Typer](https://typer.tiangolo.com/) - CLI framework
266
+ - [Rich](https://rich.readthedocs.io/) - Terminal formatting
267
+
268
+ ---
269
+
270
+ **InfoTracker** - Making database schema evolution safer, one column at a time. 🎯
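Reviewer sketch of the configuration precedence stated in the README above (CLI flags over `infotracker.yml` over built-in defaults). The default values are copied from the README's "Configuration Options" table. `resolve_config` and the convention that an unset CLI flag arrives as `None` are assumptions made for illustration, not the package's actual loader.

```python
# Built-in defaults, as listed in the README's Configuration Options table.
DEFAULTS = {
    "sql_dirs": ["."],
    "out_dir": "lineage",
    "exclude_dirs": [],
    "severity_threshold": "NON_BREAKING",
}


def resolve_config(file_cfg: dict, cli_flags: dict) -> dict:
    """Merge settings so that CLI flags win over the file, and the file over defaults.

    Assumption: a flag the user did not pass is represented as None and is ignored.
    """
    merged = dict(DEFAULTS)
    for layer in (file_cfg, cli_flags):  # lowest to highest priority
        merged.update({key: value for key, value in layer.items() if value is not None})
    return merged


if __name__ == "__main__":
    # Values that would come from infotracker.yml (already parsed into a dict).
    file_cfg = {"out_dir": "build/lineage", "severity_threshold": "POTENTIALLY_BREAKING"}
    # Values from the command line; out_dir was not passed.
    cli_flags = {"out_dir": None, "severity_threshold": "BREAKING"}
    print(resolve_config(file_cfg, cli_flags))
```

Applying the layers from lowest to highest priority means each later `update` overwrites only the keys that layer actually sets.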
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
4
4
 
5
5
  [project]
6
6
  name = "InfoTracker"
7
- version = "0.3.0"
7
+ version = "0.4.0"
8
8
  description = "Column-level SQL lineage, impact analysis, and breaking-change detection (MS SQL first)"
9
9
  readme = "README.md"
10
10
  requires-python = ">=3.10"