InfoTracker 0.3.0__tar.gz → 0.4.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- infotracker-0.4.0/.cursorrules +105 -0
- {infotracker-0.3.0 → infotracker-0.4.0}/.gitignore +3 -4
- infotracker-0.4.0/PKG-INFO +301 -0
- infotracker-0.4.0/README.md +270 -0
- {infotracker-0.3.0 → infotracker-0.4.0}/pyproject.toml +1 -1
- {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/__init__.py +1 -1
- {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/cli.py +11 -0
- {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/engine.py +27 -11
- {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/infotracker.yml +1 -1
- infotracker-0.4.0/src/infotracker/io_utils.py +312 -0
- infotracker-0.4.0/src/infotracker/lineage.py +255 -0
- {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/models.py +4 -0
- {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/openlineage_utils.py +16 -0
- infotracker-0.4.0/src/infotracker/parser.py +3125 -0
- infotracker-0.3.0/PKG-INFO +0 -285
- infotracker-0.3.0/README.md +0 -254
- infotracker-0.3.0/requirements.txt +0 -11
- infotracker-0.3.0/sql_diff_demo/base/01_view.sql +0 -6
- infotracker-0.3.0/sql_diff_demo/head_breaking/01_view.sql +0 -6
- infotracker-0.3.0/sql_diff_demo/head_non/01_view.sql +0 -7
- infotracker-0.3.0/sql_diff_demo/head_potential/01_view.sql +0 -6
- infotracker-0.3.0/src/infotracker/lineage.py +0 -125
- infotracker-0.3.0/src/infotracker/parser.py +0 -1606
- {infotracker-0.3.0 → infotracker-0.4.0}/MANIFEST.in +0 -0
- {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/__main__.py +0 -0
- {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/adapters.py +0 -0
- {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/config.py +0 -0
- {infotracker-0.3.0 → infotracker-0.4.0}/src/infotracker/diff.py +0 -0
@@ -0,0 +1,105 @@
+# InfoTracker – Cursor Rules
+# Short rules for automated changes in the project (parser/engine/CLI/diff/OL).
+
+# ✱ GENERAL RULES
+- Do not change the public CLI interfaces (command and flag names): extract, impact, diff.
+- Keep output files compatible (OpenLineage JSON, text table format).
+- Do not remove documentation or TODO comments; add "// NOTE(cursor): …" where you justify a decision.
+- Logging: use WARNING for user-facing problems (parse/cycle), DEBUG for details; do not spam INFO.
+- Determinism: keep sorting stable (by object_name, ordinal) and keep run IDs stable.
+- Every change that touches semantics must have test(s) in `tests/` and pass `pytest -q`.
+- Do not add hard external dependencies; stick to the stdlib + the current packages.
+
+# ✱ PARSER (T-SQL)
+- Always run pre-processing before sqlglot.parse:
+  - Remove lines starting with DECLARE/SET/PRINT.
+  - Remove GO (empty batches).
+  - Remove single-line IF OBJECT_ID('tempdb..#…') … DROP TABLE #… and bare DROP TABLE #… (temp table housekeeping).
+  - Join a two-line "INSERT INTO #tmp" + following "EXEC …" into a single line: "INSERT INTO #tmp EXEC …".
+- Fallback when parsing fails:
+  - Recognize `INSERT INTO <table|#temp> EXEC <proc>` (handle both temp and regular tables).
+  - Clean the procedure name (_without_ the trailing `;` and without `()`).
+  - Return ObjectInfo:
+    - name = table name (temp → `#…`), object_type = temp_table|table,
+    - schema: 2 placeholder columns (nullable=True),
+    - dependencies = {proc_name},
+    - lineage: ColumnReference(table_name=proc_name, column_name="*", transformation_type=EXEC).
+- SELECT/CAST:
+  - Fetch projections defensively: expressions | projections | args["expressions"].
+  - For `CAST(... AS TYPE)` set ColumnSchema.data_type to the target type (e.g. DECIMAL(10,2), NVARCHAR(50)).
+- Qualification:
+  - Temp tables always under `namespace="tempdb"`.
+  - MSSQL dataset namespace: `mssql://localhost/InfoTrackerDW` (unless the user sets it in the config).
+
+# ✱ ENGINE
+- Build the dependency graph from `ObjectInfo.dependencies`. If empty → fall back to the tables from `lineage.input_fields`.
+- Topological sorting must not rely on file order.
+- On "circular/missing deps" log a WARNING and process anyway (best-effort).
+
+# ✱ OPENLINEAGE
+- For outputs, add the `schema` facet for all objects with columns (tables, views, procs/functions with result sets).
+- Add the `columnLineage` facet when lineage exists.
+- Namespaces and dataset names must be consistent across the whole repo.
+
+# ✱ DIFF
+- Change classification:
+  - COLUMN_TYPE_CHANGED → BREAKING.
+  - COLUMN_RENAMED → POTENTIALLY_BREAKING (detection: type/nullability/lineage match; a heuristic that pairs columns).
+  - COLUMN_ADDED → NON_BREAKING if nullable, otherwise POTENTIALLY_BREAKING.
+  - OBJECT_REMOVED/ADDED → BREAKING.
+- Honor `severity_threshold` from the config in the display filter and the exit code:
+  - BREAKING → breaking changes only.
+  - POTENTIALLY_BREAKING → breaking + potentially.
+  - NON_BREAKING → all changes.
+
+# ✱ WILDCARDS (ColumnGraph.find_columns_wildcard)
+- Empty/half-empty pattern (`"."`, "`..`", "`ns..`" with no column pattern) → return an empty list.
+- `schema.table.*`:
+  - Match case-insensitively on the tail of the table name (suffix of 3/2/1 segments), regardless of whether a database is present.
+- `..pattern` and `prefix..pattern`:
+  - `pattern` without metacharacters works as "contains" (case-insensitive); with metacharacters → `fnmatch`.
+  - `prefix` with `://` filters by namespace, without `://` by `table_name` (startswith).
+- Always deduplicate nodes before returning.
+
+# ✱ CLI
+- Do not add new flags without justification. Allowed: `--graph-dir`, `--out`, `--format`, `--config`.
+- (Optional) you may add `--threshold` to `diff`, but it must honor values from the config (the flag only overrides).
+
+# ✱ TESTS
+- Add a test with every change to a parser/diff/wildcard heuristic.
+- In `INSERT … EXEC` tests, compare the procedure name without `;`.
+- In wildcard tests, check both exact mode (`schema.table.*`) and contains mode (`..customer*`), with and without a namespace.
+
+# ✱ COMMITS / PRs
+- Commits: convention
+  - feat(parser|engine|cli|diff|models): …
+  - fix(parser|engine|…): …
+  - chore(logging|tests|ci): …
+- A PR should be small and atomic; it contains a description of the change, the impact on diff/impact, and a link to the tests.
+- Do not combine a refactor with a functional change in one PR (unless it is a mechanical, risk-free lint).
+
+# ✱ TYPICAL AUTO-FIXES ALLOWED
+- Removing dead code after `return`.
+- Inserting missing `GO` between multiple `CREATE` statements in a single example file.
+- Adding a missing `CREATE` before `VIEW` when the pattern `;VIEW … AS` is detected (demo).
+- Normalizing procedure names (removing `;`/`()`).
+
+# ✱ WHAT IS FORBIDDEN
+- Changing the semantics of the OpenLineage output (e.g. a different JSON shape).
+- Silencing warnings without addressing the cause (e.g. "pass" in except).
+- Replacing the SQL parser or adding heavy dependencies.
+
+# ✱ QUICK COMMANDS (smoke)
+- extract: `infotracker extract --sql-dir examples/warehouse/sql --out-dir build/lineage`
+- impact: `infotracker impact -s "+dbo.orders.total_amount+" --graph-dir build/lineage --max-depth 3`
+- diff: `infotracker --config tmp.yml diff --base build/ol_base --head build/ol_head --format text`
+- tests: `pytest -q` or selectively `pytest -k wildcard -q`
+
+# ✱ PROJECT CONTEXT (for the agent)
+- Project: InfoTracker, lineage/diff for MSSQL with OL artifacts; modules: parser.py, engine.py, diff.py, models.py, lineage.py, cli.py.
+- Key contracts: ObjectInfo, ColumnSchema, ColumnLineage, ColumnGraph.
+- Key invariants:
+  - parser: preprocess + fallback for INSERT INTO … EXEC,
+  - engine: deps from ObjectInfo.dependencies,
+  - diff: filter by severity_threshold,
+  - models: case-insensitive wildcards + matching on the table suffix.
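The parser pre-processing rules in the `.cursorrules` hunk above (drop DECLARE/SET/PRINT lines, drop GO, drop temp-table housekeeping, join a split `INSERT INTO #tmp` / `EXEC`) can be sketched roughly as follows. This is a minimal illustration, not the code from `parser.py`: the function name `preprocess_tsql` and all regular expressions are assumptions made for this example.

```python
import re

# Hypothetical patterns, written only to illustrate the rules; not taken from parser.py.
_SKIP_LINE = re.compile(r"^\s*(DECLARE|SET|PRINT)\b", re.IGNORECASE)
_GO_LINE = re.compile(r"^\s*GO\s*;?\s*$", re.IGNORECASE)
_DROP_TEMP = re.compile(
    r"^\s*(IF\s+OBJECT_ID\s*\(\s*'tempdb\.\.#[^']*'\s*\).*)?DROP\s+TABLE\s+#\S+\s*;?\s*$",
    re.IGNORECASE,
)
_INSERT_TEMP = re.compile(r"^\s*INSERT\s+INTO\s+#\S+\s*$", re.IGNORECASE)
_EXEC_LINE = re.compile(r"^\s*EXEC(UTE)?\b", re.IGNORECASE)


def preprocess_tsql(sql: str) -> str:
    """Apply the line-based clean-up rules before handing SQL to a parser."""
    lines = sql.splitlines()
    out: list[str] = []
    i = 0
    while i < len(lines):
        line = lines[i]
        # Drop DECLARE/SET/PRINT lines, GO separators, and temp-table housekeeping.
        if _SKIP_LINE.match(line) or _GO_LINE.match(line) or _DROP_TEMP.match(line):
            i += 1
            continue
        # Join "INSERT INTO #tmp" with an "EXEC ..." found on the next line.
        if (
            _INSERT_TEMP.match(line)
            and i + 1 < len(lines)
            and _EXEC_LINE.match(lines[i + 1])
        ):
            out.append(f"{line.strip()} {lines[i + 1].strip()}")
            i += 2
            continue
        out.append(line)
        i += 1
    return "\n".join(out)
```

A purely line-based filter like this would also drop the `SET` line of a multi-line `UPDATE`; the rules do not say how the package handles that case, so the sketch leaves it unaddressed.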
@@ -0,0 +1,301 @@
+Metadata-Version: 2.4
+Name: InfoTracker
+Version: 0.4.0
+Summary: Column-level SQL lineage, impact analysis, and breaking-change detection (MS SQL first)
+Project-URL: homepage, https://example.com/infotracker
+Project-URL: documentation, https://example.com/infotracker/docs
+Author: InfoTracker Authors
+License: MIT
+Keywords: data-lineage,impact-analysis,lineage,mssql,openlineage,sql
+Classifier: Environment :: Console
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Topic :: Database
+Classifier: Topic :: Software Development :: Libraries
+Requires-Python: >=3.10
+Requires-Dist: click
+Requires-Dist: networkx>=3.3
+Requires-Dist: packaging>=24.0
+Requires-Dist: pydantic>=2.8.2
+Requires-Dist: pyyaml>=6.0.1
+Requires-Dist: rich
+Requires-Dist: shellingham
+Requires-Dist: sqlglot>=23.0.0
+Requires-Dist: typer
+Provides-Extra: dev
+Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
+Requires-Dist: pytest>=7.4.0; extra == 'dev'
+Description-Content-Type: text/markdown
+
+# InfoTracker
+
+**Column-level SQL lineage extraction and impact analysis for MS SQL Server**
+
+InfoTracker is a powerful command-line tool that parses T-SQL files and generates detailed column-level lineage in OpenLineage format. It supports advanced SQL Server features including table-valued functions, stored procedures, temp tables, and EXEC patterns.
+
+[](https://python.org)
+[](LICENSE)
+[](https://pypi.org/project/InfoTracker/)
+
+## 🚀 Features
+
+- **Column-level lineage** - Track data flow at the column level with precise transformations
+- **Advanced SQL support** - T-SQL dialect with temp tables, variables, CTEs, and window functions
+- **Impact analysis** - Find upstream and downstream dependencies with flexible selectors
+- **Wildcard matching** - Support for table wildcards (`schema.table.*`) and column wildcards (`..pattern`)
+- **Breaking change detection** - Detect schema changes that could break downstream processes
+- **Multiple output formats** - Text tables or JSON for integration with other tools
+- **OpenLineage compatible** - Standard format for data lineage interoperability
+- **Advanced SQL objects** - Table-valued functions (TVF) and dataset-returning procedures
+- **Temp table tracking** - Full lineage through EXEC into temp tables
+
+## 📦 Installation
+
+### From PyPI (Recommended)
+```bash
+pip install InfoTracker
+```
+
+### From GitHub
+```bash
+# Latest stable release
+pip install git+https://github.com/InfoMatePL/InfoTracker.git
+
+# Development version
+git clone https://github.com/InfoMatePL/InfoTracker.git
+cd InfoTracker
+pip install -e .
+```
+
+### Verify Installation
+```bash
+infotracker --help
+```
+
+## ⚡ Quick Start
+
+### 1. Extract Lineage
+```bash
+# Extract lineage from SQL files
+infotracker extract --sql-dir examples/warehouse/sql --out-dir build/lineage
+```
+
+### 2. Run Impact Analysis
+```bash
+# Find what feeds into a column (upstream)
+infotracker impact -s "+STG.dbo.Orders.OrderID"
+
+# Find what uses a column (downstream)
+infotracker impact -s "STG.dbo.Orders.OrderID+"
+
+# Both directions
+infotracker impact -s "+dbo.fct_sales.Revenue+"
+```
+
+### 3. Detect Breaking Changes
+```bash
+# Compare two versions of your schema
+infotracker diff --base build/lineage --head build/lineage_new
+```
+## 📖 Selector Syntax
+
+InfoTracker supports flexible column selectors for precise impact analysis:
+
+| Selector Format | Description | Example |
+|-----------------|-------------|---------|
+| `table.column` | Simple format (adds default `dbo` schema) | `Orders.OrderID` |
+| `schema.table.column` | Schema-qualified format | `dbo.Orders.OrderID` |
+| `database.schema.table.column` | Database-qualified format | `STG.dbo.Orders.OrderID` |
+| `schema.table.*` | Table wildcard (all columns) | `dbo.fct_sales.*` |
+| `..pattern` | Column wildcard (name contains pattern) | `..revenue` |
+| `..pattern*` | Column wildcard with fnmatch | `..customer*` |
+
+### Direction Control
+- `selector` - downstream dependencies (default)
+- `+selector` - upstream sources
+- `selector+` - downstream dependencies (explicit)
+- `+selector+` - both upstream and downstream
+
+## 💡 Examples
+
+### Basic Usage
+```bash
+# Extract lineage first (always run this before impact analysis)
+infotracker extract --sql-dir examples/warehouse/sql --out-dir build/lineage
+
+# Basic column lineage
+infotracker impact -s "+dbo.fct_sales.Revenue" # What feeds this column?
+infotracker impact -s "STG.dbo.Orders.OrderID+" # What uses this column?
+```
+
+### Wildcard Selectors
+```bash
+# All columns from a specific table
+infotracker impact -s "dbo.fct_sales.*"
+infotracker impact -s "STG.dbo.Orders.*"
+
+# Find all columns containing "revenue" (case-insensitive)
+infotracker impact -s "..revenue"
+
+# Find all columns starting with "customer"
+infotracker impact -s "..customer*"
+```
+
+### Advanced SQL Objects
+```bash
+# Table-valued function columns (upstream)
+infotracker impact -s "+dbo.fn_customer_orders_tvf.*"
+
+# Procedure dataset columns (upstream)
+infotracker impact -s "+dbo.usp_customer_metrics_dataset.*"
+
+# Temp table lineage from EXEC
+infotracker impact -s "+#temp_table.*"
+```
+
+### Output Formats
+```bash
+# Text output (default, human-readable)
+infotracker impact -s "+..revenue"
+
+# JSON output (machine-readable)
+infotracker --format json impact -s "..customer*" > customer_lineage.json
+
+# Control traversal depth
+infotracker impact -s "+dbo.Orders.OrderID" --max-depth 2
+```
+
+### Breaking Change Detection
+```bash
+# Extract baseline
+infotracker extract --sql-dir sql_v1 --out-dir build/baseline
+
+# Extract new version
+infotracker extract --sql-dir sql_v2 --out-dir build/current
+
+# Detect breaking changes
+infotracker diff --base build/baseline --head build/current
+
+# Filter by severity
+infotracker diff --base build/baseline --head build/current --threshold BREAKING
+```
+
+
+## Output Format
+
+Impact analysis returns these columns:
+- **from** - Source column (fully qualified)
+- **to** - Target column (fully qualified)
+- **direction** - `upstream` or `downstream`
+- **transformation** - Type of transformation (`IDENTITY`, `ARITHMETIC`, `AGGREGATION`, `CASE_AGGREGATION`, `DATE_FUNCTION`, `WINDOW`, etc.)
+- **description** - Human-readable transformation description
+
+Results are automatically deduplicated. Use `--format json` for machine-readable output.
+
+### New Transformation Types
+
+The enhanced transformation taxonomy includes:
+- `ARITHMETIC_AGGREGATION` - Arithmetic operations combined with aggregation functions
+- `COMPLEX_AGGREGATION` - Multi-step calculations involving multiple aggregations
+- `DATE_FUNCTION` - Date/time calculations like DATEDIFF, DATEADD
+- `DATE_FUNCTION_AGGREGATION` - Date functions applied to aggregated results
+- `CASE_AGGREGATION` - CASE statements applied to aggregated results
+
+### Advanced Object Support
+
+InfoTracker now supports advanced SQL Server objects:
+
+**Table-Valued Functions (TVF):**
+- Inline TVF (`RETURN AS SELECT`) - Parsed directly from SELECT statement
+- Multi-statement TVF (`RETURN @table TABLE`) - Extracts schema from table variable definition
+- Function parameters are tracked as filter metadata (don't create columns)
+
+**Dataset-Returning Procedures:**
+- Procedures ending with SELECT statement are treated as dataset sources
+- Output schema extracted from the final SELECT statement
+- Parameters tracked as filter metadata affecting lineage scope
+
+**EXEC into Temp Tables:**
+- `INSERT INTO #temp EXEC procedure` patterns create edges from procedure columns to temp table columns
+- Temp table lineage propagates downstream to final targets
+- Supports complex workflow patterns combining functions, procedures, and temp tables
+
+## Configuration
+
+InfoTracker follows this configuration precedence:
+1. **CLI flags** (highest priority) - override everything
+2. **infotracker.yml** config file - project defaults
+3. **Built-in defaults** (lowest priority) - fallback values
+
+## 🔧 Configuration
+
+Create an `infotracker.yml` file in your project root:
+
+```yaml
+sql_dirs:
+- "sql/"
+- "models/"
+out_dir: "build/lineage"
+exclude_dirs:
+- "__pycache__"
+- ".git"
+severity_threshold: "POTENTIALLY_BREAKING"
+```
+
+### Configuration Options
+
+| Setting | Description | Default | Examples |
+|---------|-------------|---------|----------|
+| `sql_dirs` | Directories to scan for SQL files | `["."]` | `["sql/", "models/"]` |
+| `out_dir` | Output directory for lineage files | `"lineage"` | `"build/artifacts"` |
+| `exclude_dirs` | Directories to skip | `[]` | `["__pycache__", "node_modules"]` |
+| `severity_threshold` | Breaking change detection level | `"NON_BREAKING"` | `"BREAKING"` |
+
+## 📚 Documentation
+
+- **[Architecture](docs/architecture.md)** - Core concepts and design
+- **[Lineage Concepts](docs/lineage_concepts.md)** - Data lineage fundamentals
+- **[CLI Usage](docs/cli_usage.md)** - Complete command reference
+- **[Configuration](docs/configuration.md)** - Advanced configuration options
+- **[DBT Integration](docs/dbt_integration.md)** - Using with DBT projects
+- **[OpenLineage Mapping](docs/openlineage_mapping.md)** - Output format specification
+- **[Breaking Changes](docs/breaking_changes.md)** - Change detection and severity levels
+- **[Advanced Use Cases](docs/advanced_use_cases.md)** - TVFs, stored procedures, and complex scenarios
+- **[Edge Cases](docs/edge_cases.md)** - SELECT *, UNION, temp tables handling
+- **[FAQ](docs/faq.md)** - Common questions and troubleshooting
+
+## 🧪 Testing
+
+```bash
+# Run all tests
+pytest
+
+# Run specific test categories
+pytest tests/test_parser.py # Parser functionality
+pytest tests/test_wildcard.py # Wildcard selectors
+pytest tests/test_adapter.py # SQL dialect adapters
+
+# Run with coverage
+pytest --cov=infotracker --cov-report=html
+```
+
+
+
+
+
+## 📄 License
+
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+## 🙏 Acknowledgments
+
+- [SQLGlot](https://github.com/tobymao/sqlglot) - SQL parsing library
+- [OpenLineage](https://openlineage.io/) - Data lineage standard
+- [Typer](https://typer.tiangolo.com/) - CLI framework
+- [Rich](https://rich.readthedocs.io/) - Terminal formatting
+
+---
+
+**InfoTracker** - Making database schema evolution safer, one column at a time. 🎯
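The "Direction Control" list in the README text embedded above maps a leading or trailing `+` to a traversal direction. A minimal sketch of that mapping is shown here. The function name `parse_direction` and the `"both"` label are assumptions made for this example; the README only names `upstream` and `downstream` as output values, and this is not the package's own implementation.

```python
def parse_direction(selector: str) -> tuple[str, str]:
    """Split a selector into (bare_selector, direction).

    Hypothetical helper: strips at most one leading and one trailing '+'.
    """
    s = selector.strip()
    upstream = s.startswith("+")
    if upstream:
        s = s[1:]
    downstream = s.endswith("+")
    if downstream:
        s = s[:-1]
    if upstream and downstream:
        direction = "both"  # assumed label for "+selector+"
    elif upstream:
        direction = "upstream"
    else:
        # A bare selector and "selector+" both mean downstream.
        direction = "downstream"
    return s, direction
```

Slicing off one character at each end, rather than calling `str.strip("+")`, keeps any additional `+` characters inside the selector untouched.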
@@ -0,0 +1,270 @@
+# InfoTracker
+
+**Column-level SQL lineage extraction and impact analysis for MS SQL Server**
+
+InfoTracker is a powerful command-line tool that parses T-SQL files and generates detailed column-level lineage in OpenLineage format. It supports advanced SQL Server features including table-valued functions, stored procedures, temp tables, and EXEC patterns.
+
+[](https://python.org)
+[](LICENSE)
+[](https://pypi.org/project/InfoTracker/)
+
+## 🚀 Features
+
+- **Column-level lineage** - Track data flow at the column level with precise transformations
+- **Advanced SQL support** - T-SQL dialect with temp tables, variables, CTEs, and window functions
+- **Impact analysis** - Find upstream and downstream dependencies with flexible selectors
+- **Wildcard matching** - Support for table wildcards (`schema.table.*`) and column wildcards (`..pattern`)
+- **Breaking change detection** - Detect schema changes that could break downstream processes
+- **Multiple output formats** - Text tables or JSON for integration with other tools
+- **OpenLineage compatible** - Standard format for data lineage interoperability
+- **Advanced SQL objects** - Table-valued functions (TVF) and dataset-returning procedures
+- **Temp table tracking** - Full lineage through EXEC into temp tables
+
+## 📦 Installation
+
+### From PyPI (Recommended)
+```bash
+pip install InfoTracker
+```
+
+### From GitHub
+```bash
+# Latest stable release
+pip install git+https://github.com/InfoMatePL/InfoTracker.git
+
+# Development version
+git clone https://github.com/InfoMatePL/InfoTracker.git
+cd InfoTracker
+pip install -e .
+```
+
+### Verify Installation
+```bash
+infotracker --help
+```
+
+## ⚡ Quick Start
+
+### 1. Extract Lineage
+```bash
+# Extract lineage from SQL files
+infotracker extract --sql-dir examples/warehouse/sql --out-dir build/lineage
+```
+
+### 2. Run Impact Analysis
+```bash
+# Find what feeds into a column (upstream)
+infotracker impact -s "+STG.dbo.Orders.OrderID"
+
+# Find what uses a column (downstream)
+infotracker impact -s "STG.dbo.Orders.OrderID+"
+
+# Both directions
+infotracker impact -s "+dbo.fct_sales.Revenue+"
+```
+
+### 3. Detect Breaking Changes
+```bash
+# Compare two versions of your schema
+infotracker diff --base build/lineage --head build/lineage_new
+```
+## 📖 Selector Syntax
+
+InfoTracker supports flexible column selectors for precise impact analysis:
+
+| Selector Format | Description | Example |
+|-----------------|-------------|---------|
+| `table.column` | Simple format (adds default `dbo` schema) | `Orders.OrderID` |
+| `schema.table.column` | Schema-qualified format | `dbo.Orders.OrderID` |
+| `database.schema.table.column` | Database-qualified format | `STG.dbo.Orders.OrderID` |
+| `schema.table.*` | Table wildcard (all columns) | `dbo.fct_sales.*` |
+| `..pattern` | Column wildcard (name contains pattern) | `..revenue` |
+| `..pattern*` | Column wildcard with fnmatch | `..customer*` |
+
+### Direction Control
+- `selector` - downstream dependencies (default)
+- `+selector` - upstream sources
+- `selector+` - downstream dependencies (explicit)
+- `+selector+` - both upstream and downstream
+
+## 💡 Examples
+
+### Basic Usage
+```bash
+# Extract lineage first (always run this before impact analysis)
+infotracker extract --sql-dir examples/warehouse/sql --out-dir build/lineage
+
+# Basic column lineage
+infotracker impact -s "+dbo.fct_sales.Revenue" # What feeds this column?
+infotracker impact -s "STG.dbo.Orders.OrderID+" # What uses this column?
+```
+
+### Wildcard Selectors
+```bash
+# All columns from a specific table
+infotracker impact -s "dbo.fct_sales.*"
+infotracker impact -s "STG.dbo.Orders.*"
+
+# Find all columns containing "revenue" (case-insensitive)
+infotracker impact -s "..revenue"
+
+# Find all columns starting with "customer"
+infotracker impact -s "..customer*"
+```
+
+### Advanced SQL Objects
+```bash
+# Table-valued function columns (upstream)
+infotracker impact -s "+dbo.fn_customer_orders_tvf.*"
+
+# Procedure dataset columns (upstream)
+infotracker impact -s "+dbo.usp_customer_metrics_dataset.*"
+
+# Temp table lineage from EXEC
+infotracker impact -s "+#temp_table.*"
+```
+
+### Output Formats
+```bash
+# Text output (default, human-readable)
+infotracker impact -s "+..revenue"
+
+# JSON output (machine-readable)
+infotracker --format json impact -s "..customer*" > customer_lineage.json
+
+# Control traversal depth
+infotracker impact -s "+dbo.Orders.OrderID" --max-depth 2
+```
+
+### Breaking Change Detection
+```bash
+# Extract baseline
+infotracker extract --sql-dir sql_v1 --out-dir build/baseline
+
+# Extract new version
+infotracker extract --sql-dir sql_v2 --out-dir build/current
+
+# Detect breaking changes
+infotracker diff --base build/baseline --head build/current
+
+# Filter by severity
+infotracker diff --base build/baseline --head build/current --threshold BREAKING
+```
+
+
+## Output Format
+
+Impact analysis returns these columns:
+- **from** - Source column (fully qualified)
+- **to** - Target column (fully qualified)
+- **direction** - `upstream` or `downstream`
+- **transformation** - Type of transformation (`IDENTITY`, `ARITHMETIC`, `AGGREGATION`, `CASE_AGGREGATION`, `DATE_FUNCTION`, `WINDOW`, etc.)
+- **description** - Human-readable transformation description
+
+Results are automatically deduplicated. Use `--format json` for machine-readable output.
+
+### New Transformation Types
+
+The enhanced transformation taxonomy includes:
+- `ARITHMETIC_AGGREGATION` - Arithmetic operations combined with aggregation functions
+- `COMPLEX_AGGREGATION` - Multi-step calculations involving multiple aggregations
+- `DATE_FUNCTION` - Date/time calculations like DATEDIFF, DATEADD
+- `DATE_FUNCTION_AGGREGATION` - Date functions applied to aggregated results
+- `CASE_AGGREGATION` - CASE statements applied to aggregated results
+
+### Advanced Object Support
+
+InfoTracker now supports advanced SQL Server objects:
+
+**Table-Valued Functions (TVF):**
+- Inline TVF (`RETURN AS SELECT`) - Parsed directly from SELECT statement
+- Multi-statement TVF (`RETURN @table TABLE`) - Extracts schema from table variable definition
+- Function parameters are tracked as filter metadata (don't create columns)
+
+**Dataset-Returning Procedures:**
+- Procedures ending with SELECT statement are treated as dataset sources
+- Output schema extracted from the final SELECT statement
+- Parameters tracked as filter metadata affecting lineage scope
+
+**EXEC into Temp Tables:**
+- `INSERT INTO #temp EXEC procedure` patterns create edges from procedure columns to temp table columns
+- Temp table lineage propagates downstream to final targets
+- Supports complex workflow patterns combining functions, procedures, and temp tables
+
+## Configuration
+
+InfoTracker follows this configuration precedence:
+1. **CLI flags** (highest priority) - override everything
+2. **infotracker.yml** config file - project defaults
+3. **Built-in defaults** (lowest priority) - fallback values
+
+## 🔧 Configuration
+
+Create an `infotracker.yml` file in your project root:
+
+```yaml
+sql_dirs:
+- "sql/"
+- "models/"
+out_dir: "build/lineage"
+exclude_dirs:
+- "__pycache__"
+- ".git"
+severity_threshold: "POTENTIALLY_BREAKING"
+```
+
+### Configuration Options
+
+| Setting | Description | Default | Examples |
+|---------|-------------|---------|----------|
+| `sql_dirs` | Directories to scan for SQL files | `["."]` | `["sql/", "models/"]` |
+| `out_dir` | Output directory for lineage files | `"lineage"` | `"build/artifacts"` |
+| `exclude_dirs` | Directories to skip | `[]` | `["__pycache__", "node_modules"]` |
+| `severity_threshold` | Breaking change detection level | `"NON_BREAKING"` | `"BREAKING"` |
+
+## 📚 Documentation
+
+- **[Architecture](docs/architecture.md)** - Core concepts and design
+- **[Lineage Concepts](docs/lineage_concepts.md)** - Data lineage fundamentals
+- **[CLI Usage](docs/cli_usage.md)** - Complete command reference
+- **[Configuration](docs/configuration.md)** - Advanced configuration options
+- **[DBT Integration](docs/dbt_integration.md)** - Using with DBT projects
+- **[OpenLineage Mapping](docs/openlineage_mapping.md)** - Output format specification
+- **[Breaking Changes](docs/breaking_changes.md)** - Change detection and severity levels
+- **[Advanced Use Cases](docs/advanced_use_cases.md)** - TVFs, stored procedures, and complex scenarios
+- **[Edge Cases](docs/edge_cases.md)** - SELECT *, UNION, temp tables handling
+- **[FAQ](docs/faq.md)** - Common questions and troubleshooting
+
+## 🧪 Testing
+
+```bash
+# Run all tests
+pytest
+
+# Run specific test categories
+pytest tests/test_parser.py # Parser functionality
+pytest tests/test_wildcard.py # Wildcard selectors
+pytest tests/test_adapter.py # SQL dialect adapters
+
+# Run with coverage
+pytest --cov=infotracker --cov-report=html
+```
+
+
+
+
+
+## 📄 License
+
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+## 🙏 Acknowledgments
+
+- [SQLGlot](https://github.com/tobymao/sqlglot) - SQL parsing library
+- [OpenLineage](https://openlineage.io/) - Data lineage standard
+- [Typer](https://typer.tiangolo.com/) - CLI framework
+- [Rich](https://rich.readthedocs.io/) - Terminal formatting
+
+---
+
+**InfoTracker** - Making database schema evolution safer, one column at a time. 🎯
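The selector table in the README above describes two column-wildcard modes: `..pattern` as a case-insensitive "contains" match, and `..pattern*` as an fnmatch-style match. A rough sketch of that distinction is given below, using only the standard library. The function name `match_columns` and the use of plain column-name strings are assumptions for illustration; how the package actually stores and matches graph nodes is not shown in this diff.

```python
from fnmatch import fnmatchcase


def match_columns(selector: str, columns: list[str]) -> list[str]:
    """Return the column names matched by a '..pattern' selector.

    Hypothetical helper operating on bare column names.
    """
    if not selector.startswith(".."):
        raise ValueError("expected a selector of the form '..pattern'")
    pattern = selector[2:].lower()
    if not pattern:
        return []  # an empty column pattern matches nothing
    has_meta = any(ch in pattern for ch in "*?[")
    matched: list[str] = []
    seen: set[str] = set()
    for name in columns:
        lowered = name.lower()
        if has_meta:
            hit = fnmatchcase(lowered, pattern)  # glob-style match
        else:
            hit = pattern in lowered  # plain "contains" match
        if hit and name not in seen:  # deduplicate, keep input order
            seen.add(name)
            matched.append(name)
    return matched
```

Lower-casing both sides and calling `fnmatchcase` keeps the comparison case-insensitive on every platform, whereas `fnmatch.fnmatch` normalizes case only where the operating system does.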
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "InfoTracker"
-version = "0.
+version = "0.4.0"
 description = "Column-level SQL lineage, impact analysis, and breaking-change detection (MS SQL first)"
 readme = "README.md"
 requires-python = ">=3.10"