informatica-python 1.8.0__tar.gz → 1.8.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (31)
  1. {informatica_python-1.8.0 → informatica_python-1.8.2}/PKG-INFO +158 -18
  2. informatica_python-1.8.2/README.md +341 -0
  3. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/__init__.py +1 -1
  4. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/generators/mapping_gen.py +16 -4
  5. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python.egg-info/PKG-INFO +158 -18
  6. {informatica_python-1.8.0 → informatica_python-1.8.2}/pyproject.toml +1 -1
  7. {informatica_python-1.8.0 → informatica_python-1.8.2}/tests/test_integration.py +1 -1
  8. informatica_python-1.8.0/README.md +0 -201
  9. {informatica_python-1.8.0 → informatica_python-1.8.2}/LICENSE +0 -0
  10. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/cli.py +0 -0
  11. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/converter.py +0 -0
  12. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/generators/__init__.py +0 -0
  13. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/generators/config_gen.py +0 -0
  14. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/generators/error_log_gen.py +0 -0
  15. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/generators/helper_gen.py +0 -0
  16. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/generators/sql_gen.py +0 -0
  17. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/generators/workflow_gen.py +0 -0
  18. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/models.py +0 -0
  19. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/parser.py +0 -0
  20. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/utils/__init__.py +0 -0
  21. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/utils/datatype_map.py +0 -0
  22. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/utils/expression_converter.py +0 -0
  23. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/utils/lib_adapters.py +0 -0
  24. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python/utils/sql_dialect.py +0 -0
  25. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python.egg-info/SOURCES.txt +0 -0
  26. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python.egg-info/dependency_links.txt +0 -0
  27. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python.egg-info/entry_points.txt +0 -0
  28. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python.egg-info/requires.txt +0 -0
  29. {informatica_python-1.8.0 → informatica_python-1.8.2}/informatica_python.egg-info/top_level.txt +0 -0
  30. {informatica_python-1.8.0 → informatica_python-1.8.2}/setup.cfg +0 -0
  31. {informatica_python-1.8.0 → informatica_python-1.8.2}/tests/test_converter.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: informatica-python
-Version: 1.8.0
+Version: 1.8.2
 Summary: Convert Informatica PowerCenter workflow XML to Python/PySpark code
 Author: Nick
 License: MIT
@@ -59,6 +59,12 @@ informatica-python workflow_export.xml -z output.zip
 # Use a different data library
 informatica-python workflow_export.xml -o output_dir --data-lib polars

+# Include a parameter file
+informatica-python workflow_export.xml -o output_dir --param-file workflow.param
+
+# Enable data quality validation on type casts
+informatica-python workflow_export.xml -o output_dir --validate-casts
+
 # Parse to JSON only (no code generation)
 informatica-python workflow_export.xml --json

@@ -90,12 +96,12 @@ converter.convert_to_files("workflow_export.xml", "output_dir", data_lib="polars

 | File | Description |
 |------|-------------|
-| `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions) |
-| `mapping_N.py` | One per mapping — transformation logic, source reads, target writes |
-| `workflow.py` | Task orchestration with topological ordering and error handling |
+| `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions), window/analytic functions, stored procedure execution, state persistence |
+| `mapping_N.py` | One per mapping — transformation logic with row-count logging, source reads, target writes, inline documentation |
+| `workflow.py` | Task orchestration with topological ordering, decision branching, worklet calls, and error handling |
 | `config.yml` | Connection configs, source/target metadata, runtime parameters |
-| `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms |
-| `error_log.txt` | Conversion summary, warnings, and unsupported feature notes |
+| `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms (with ANSI-translated variants) |
+| `error_log.txt` | Conversion summary with unsupported transform analysis, unmapped port detection, and unknown expression function tracing |

 ## Supported Data Libraries

@@ -113,21 +119,21 @@ Select via `--data-lib` CLI flag or `data_lib` parameter:

 The code generator produces real, runnable Python for these transformation types:

-- **Source Qualifier** — SQL override, pre/post SQL, column selection
-- **Expression** — Field-level expressions converted to pandas operations
-- **Filter** — Row filtering with converted conditions
-- **Joiner** — `pd.merge()` with join type and condition parsing
-- **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads
-- **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST
+- **Source Qualifier** — SQL override, pre/post SQL, column selection, session connection overrides
+- **Expression** — Field-level expressions converted to vectorized pandas operations (`df["COL"]` style)
+- **Filter** — Row filtering with vectorized converted conditions
+- **Joiner** — `pd.merge()` with join type and condition parsing (inner/left/right/outer)
+- **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads, multiple match policies, default values
+- **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST, computed aggregates
 - **Sorter** — `sort_values()` with multi-key ascending/descending
-- **Router** — Multi-group conditional routing with if/elif/else
+- **Router** — Multi-group conditional routing with named groups
 - **Union** — `pd.concat()` across multiple input groups
-- **Update Strategy** — Insert/Update/Delete/Reject flag generation
+- **Update Strategy** — DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT routing with actual target INSERT/UPDATE/DELETE operations, dialect-aware SQL placeholders, auto-detected primary keys
 - **Sequence Generator** — Auto-incrementing ID columns
 - **Normalizer** — `pd.melt()` with auto-detected id/value vars
 - **Rank** — `groupby().rank()` with Top-N filtering
-- **Stored Procedure** — Stub generation with SP name and parameters
-- **Transaction Control** — Commit/rollback logic stubs
+- **Stored Procedure** — Full code generation with Oracle/MSSQL/generic support, input/output parameter mapping
+- **Transaction Control** — Commit/rollback logic
 - **Custom / Java** — Placeholder stubs with TODO markers
 - **SQL Transform** — Direct SQL execution pass-through

@@ -147,13 +153,118 @@ The code generator produces real, runnable Python for these transformation types

 ## Key Features

+### Row-Count Logging (v1.8+)
+
+Generated code automatically logs row counts at every step of the data pipeline:
+
+```
+Source SQ_CUSTOMERS: 10000 rows read
+EXP_CALC (Expression): 10000 input rows -> 10000 output rows
+FIL_ACTIVE (Filter): 10000 input rows -> 8542 output rows
+AGG_TOTALS (Aggregator): 8542 input rows -> 150 output rows
+Target TGT_SUMMARY: 150 rows written
+```
+
+All row-count operations are backend-safe (wrapped in try/except), so Dask and other lazy-evaluation backends won't fail.
+
+### Generated Code Documentation (v1.8+)
+
+Every generated mapping function includes a rich docstring describing:
+- Mapping name and original Informatica description
+- Source and target tables/files
+- Transformation pipeline with field counts per step
+
+Each transformation block is annotated with:
+- Separator headers for visual scanning
+- Transform type and description (from Informatica XML)
+- Input and output field lists (truncated at 10 for readability)
+
+### Window / Analytic Functions (v1.7+)
+
+DataFrame-level analytic functions for aggregation transforms:
+- `moving_avg_df(df, col, window)` — rolling mean via `.rolling().mean()`
+- `moving_sum_df(df, col, window)` — rolling sum via `.rolling().sum()`
+- `cume_df(df, col)` — cumulative sum via `.expanding().sum()`
+- `percentile_df(df, col, pct)` — quantile via `.quantile()`
+
+### Update Strategy with Target Operations (v1.7+)
+
+Update Strategy transforms now generate real INSERT/UPDATE/DELETE operations:
+- Static strategies (0/1/2/3) map to INSERT/UPDATE/DELETE/REJECT
+- DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT expressions parsed from conditions
+- Target writer splits rows and routes to appropriate SQL operations
+- Dialect-aware SQL placeholders (`?` for MSSQL, `%s` for PostgreSQL/Oracle)
+- Primary key columns auto-detected from target field definitions
+
+### Stored Procedure Execution (v1.7+)
+
+Full stored procedure code generation (not just stubs):
+- Oracle: `cursor.callproc()` with output parameter registration
+- MSSQL: `EXEC` with output parameter capture
+- Generic: `CALL` syntax for other databases
+- Input/output parameter mapping from transformation fields
+- Empty-input guard prevents errors on empty upstream DataFrames
+
+### State Persistence (v1.7+)
+
+JSON-based variable persistence between workflow runs:
+- `load_persistent_state()` / `save_persistent_state()` bracketing workflow execution
+- `get_persistent_variable()` / `set_persistent_variable()` scoped by workflow/mapping name
+- Mapping variables marked `is_persistent="YES"` automatically load from and save to state file
+- Non-persistent variables remain unaffected
+
+### SQL Dialect Translation (v1.6+)
+
+Automatically translates vendor-specific SQL to ANSI equivalents:
+- **Oracle:** NVL→COALESCE, SYSDATE→CURRENT_TIMESTAMP, DECODE→CASE, NVL2→CASE, (+)→ANSI JOIN, ROWNUM→LIMIT
+- **MSSQL:** GETDATE→CURRENT_TIMESTAMP, ISNULL→COALESCE, TOP N→LIMIT, LEN→LENGTH, CHARINDEX→POSITION
+- Auto-detects source dialect; outputs both original and translated SQL
+
+### Enhanced Error Reporting (v1.6+)
+
+Structured error log with three analysis sections:
+- **Unsupported Transforms:** Lists each skipped transform with type, field count, and attributes
+- **Unmapped Ports:** OUTPUT fields not connected to any downstream transform
+- **Unsupported Expression Functions:** Unknown functions with location traces
+
+### Nested Mapplet Support (v1.6+)
+
+Recursively expands mapplet-within-mapplet instances:
+- Double-underscore namespacing for nested transforms
+- Depth limit of 10 with circular reference protection
+- Connector rewiring through the full expansion tree
+
+### Data Quality Validation (v1.6+)
+
+Optional `--validate-casts` flag generates null-count checks before/after type casting:
+- Counts null values pre- and post-coercion per column
+- Logs warnings when coercion introduces new nulls
+- Helps identify data quality issues during test runs
+
+### Vectorized Expression Generation (v1.5+)
+
+Column-level pandas operations instead of row-level iteration:
+- IIF → `np.where()`, NVL → `.fillna()`, UPPER/LOWER → `.str.upper()/.str.lower()`
+- SUBSTR → `.str[start:end]`, TO_INTEGER → `pd.to_numeric()`, TO_DATE → `pd.to_datetime()`
+- IS NULL/IS NOT NULL → `.isna()`/`.notna()`
+
+### Parameter File Support (v1.5+)
+
+Standard Informatica `.param` file parsing:
+- `[Global]` and `[folder.WF:workflow.ST:session]` section support
+- `get_param(config, var_name)` resolution chain: config → env vars → defaults
+- CLI `--param-file` flag for specifying parameter files
+
 ### Session Connection Overrides (v1.4+)
+
 When sessions define per-transform connection overrides (different database, file directory, or filename), the generated code uses those overrides instead of source/target defaults.

 ### Worklet Support (v1.4+)
+
 Worklet workflows are detected and generate separate `run_worklet_NAME(config)` functions. The main workflow calls these automatically for Worklet task types.

 ### Type Casting at Target Writes (v1.4+)
+
 Target field datatypes are mapped to pandas types and generate proper casting code:
 - Integers: nullable `Int64`/`Int32` or `fillna(0).astype(int)` for NOT NULL
 - Dates: `pd.to_datetime(errors='coerce')`
@@ -161,12 +272,15 @@ Target field datatypes are mapped to pandas types and generate proper casting co
 - Booleans: `.astype('boolean')`

 ### Flat File Handling (v1.3+)
+
 Parses FLATFILE metadata for delimiter, fixed-width, header lines, skip rows, quote/escape chars. Generates `pd.read_fwf()` for fixed-width or enriched `read_file()` for delimited.

 ### Mapplet Inlining (v1.3+)
+
 Expands Mapplet instances into prefixed transforms, rewires connectors, and eliminates duplication.

 ### Decision Tasks (v1.3+)
+
 Converts Informatica decision conditions to Python if/else branches with proper variable substitution.

 ### Expression Converter (80+ Functions)
@@ -190,6 +304,32 @@ Converts Informatica expressions to Python equivalents:

 ## Changelog

+### v1.8.x (Phase 7)
+- Row-count logging at every pipeline step (source reads, transforms, target writes)
+- Backend-safe logging (try/except wrapped for Dask/lazy backends)
+- Rich mapping function docstrings with sources, targets, and transform pipeline summary
+- Per-transform documentation headers with description, input/output field lists
+
+### v1.7.x (Phase 6)
+- Window/analytic functions (rolling avg/sum, cumulative sum, percentile)
+- Update Strategy routing with actual INSERT/UPDATE/DELETE target operations
+- Dialect-aware SQL placeholders for MSSQL/PostgreSQL/Oracle
+- Full stored procedure code generation (Oracle/MSSQL/generic)
+- JSON-based state persistence for mapping and workflow variables
+- Primary key auto-detection for update strategy targets
+
+### v1.6.x (Phase 5)
+- SQL dialect translation (Oracle/MSSQL → ANSI)
+- Enhanced error reporting (unsupported transforms, unmapped ports, unknown functions)
+- Nested mapplet expansion with circular reference protection
+- Data quality validation warnings on type casting (`--validate-casts`)
+
+### v1.5.x (Phase 4)
+- Parameter file support (`.param` files with section parsing)
+- Vectorized expression generation (column-level pandas operations)
+- Library-specific code adapters (polars/dask/modin/vaex syntax generation)
+- 72+ integration tests
+
 ### v1.4.x (Phase 3)
 - Session connection overrides for sources and targets
 - Worklet function generation with safe invocation
@@ -217,8 +357,8 @@ Converts Informatica expressions to Python equivalents:
 cd informatica_python
 pip install -e ".[dev]"

-# Run tests (25 tests)
-pytest tests/test_converter.py -v
+# Run tests (136 tests)
+pytest tests/ -v
 ```

 ## License
@@ -0,0 +1,341 @@
+# informatica-python
+
+Convert Informatica PowerCenter workflow XML exports into clean, runnable Python/PySpark code.
+
+**Author:** Nick
+**License:** MIT
+**PyPI:** [informatica-python](https://pypi.org/project/informatica-python/)
+
+---
+
+## Overview
+
+`informatica-python` parses Informatica PowerCenter XML export files and generates equivalent Python code using your choice of data library. It handles all 72 DTD tags from the PowerCenter XML schema and produces a complete, ready-to-run Python project.
+
+## Installation
+
+```bash
+pip install informatica-python
+```
+
+## Quick Start
+
+### Command Line
+
+```bash
+# Generate Python files to a directory
+informatica-python workflow_export.xml -o output_dir
+
+# Generate as a zip archive
+informatica-python workflow_export.xml -z output.zip
+
+# Use a different data library
+informatica-python workflow_export.xml -o output_dir --data-lib polars
+
+# Include a parameter file
+informatica-python workflow_export.xml -o output_dir --param-file workflow.param
+
+# Enable data quality validation on type casts
+informatica-python workflow_export.xml -o output_dir --validate-casts
+
+# Parse to JSON only (no code generation)
+informatica-python workflow_export.xml --json
+
+# Save parsed JSON to file
+informatica-python workflow_export.xml --json-file parsed.json
+```
+
+### Python API
+
+```python
+from informatica_python import InformaticaConverter
+
+converter = InformaticaConverter()
+
+# Parse and generate files
+converter.convert_to_files("workflow_export.xml", "output_dir")
+
+# Parse and generate zip
+converter.convert_to_zip("workflow_export.xml", "output.zip")
+
+# Parse to structured dict
+result = converter.parse_file("workflow_export.xml")
+
+# Use a different data library
+converter.convert_to_files("workflow_export.xml", "output_dir", data_lib="polars")
+```
+
+## Generated Output Files
+
+| File | Description |
+|------|-------------|
+| `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions), window/analytic functions, stored procedure execution, state persistence |
+| `mapping_N.py` | One per mapping — transformation logic with row-count logging, source reads, target writes, inline documentation |
+| `workflow.py` | Task orchestration with topological ordering, decision branching, worklet calls, and error handling |
+| `config.yml` | Connection configs, source/target metadata, runtime parameters |
+| `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms (with ANSI-translated variants) |
+| `error_log.txt` | Conversion summary with unsupported transform analysis, unmapped port detection, and unknown expression function tracing |
+
+## Supported Data Libraries
+
+Select via `--data-lib` CLI flag or `data_lib` parameter:
+
+| Library | Flag | Best For |
+|---------|------|----------|
+| **pandas** | `pandas` (default) | General-purpose, most compatible |
+| **dask** | `dask` | Large datasets, parallel processing |
+| **polars** | `polars` | High performance, Rust-backed |
+| **vaex** | `vaex` | Out-of-core, billion-row datasets |
+| **modin** | `modin` | Drop-in pandas replacement, multi-core |
+
+## Supported Transformations
+
+The code generator produces real, runnable Python for these transformation types:
+
+- **Source Qualifier** — SQL override, pre/post SQL, column selection, session connection overrides
+- **Expression** — Field-level expressions converted to vectorized pandas operations (`df["COL"]` style)
+- **Filter** — Row filtering with vectorized converted conditions
+- **Joiner** — `pd.merge()` with join type and condition parsing (inner/left/right/outer)
+- **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads, multiple match policies, default values
+- **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST, computed aggregates
+- **Sorter** — `sort_values()` with multi-key ascending/descending
+- **Router** — Multi-group conditional routing with named groups
+- **Union** — `pd.concat()` across multiple input groups
+- **Update Strategy** — DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT routing with actual target INSERT/UPDATE/DELETE operations, dialect-aware SQL placeholders, auto-detected primary keys
+- **Sequence Generator** — Auto-incrementing ID columns
+- **Normalizer** — `pd.melt()` with auto-detected id/value vars
+- **Rank** — `groupby().rank()` with Top-N filtering
+- **Stored Procedure** — Full code generation with Oracle/MSSQL/generic support, input/output parameter mapping
+- **Transaction Control** — Commit/rollback logic
+- **Custom / Java** — Placeholder stubs with TODO markers
+- **SQL Transform** — Direct SQL execution pass-through
+
+## Supported XML Tags (72 Tags)
+
+**Top-level:** POWERMART, REPOSITORY, FOLDER, FOLDERVERSION
+
+**Source/Target:** SOURCE, SOURCEFIELD, TARGET, TARGETFIELD, TARGETINDEX, TARGETINDEXFIELD, FLATFILE, XMLINFO, XMLTEXT, GROUP, TABLEATTRIBUTE, FIELDATTRIBUTE, METADATAEXTENSION, KEYWORD, ERPSRCINFO
+
+**Mapping/Mapplet:** MAPPING, MAPPLET, TRANSFORMATION, TRANSFORMFIELD, TRANSFORMFIELDATTR, TRANSFORMFIELDATTRDEF, INSTANCE, ASSOCIATED_SOURCE_INSTANCE, CONNECTOR, MAPDEPENDENCY, TARGETLOADORDER, MAPPINGVARIABLE, FIELDDEPENDENCY, INITPROP, ERPINFO
+
+**Task/Session/Workflow:** TASK, TIMER, VALUEPAIR, SCHEDULER, SCHEDULEINFO, STARTOPTIONS, ENDOPTIONS, SCHEDULEOPTIONS, RECURRING, CUSTOM, DAILYFREQUENCY, REPEAT, FILTER, SESSION, CONFIGREFERENCE, SESSTRANSFORMATIONINST, SESSTRANSFORMATIONGROUP, PARTITION, HASHKEY, KEYRANGE, CONFIG, SESSIONCOMPONENT, CONNECTIONREFERENCE, TASKINSTANCE, WORKFLOWLINK, WORKFLOWVARIABLE, WORKFLOWEVENT, WORKLET, WORKFLOW, ATTRIBUTE
+
+**Shortcut:** SHORTCUT
+
+**SAP:** SAPFUNCTION, SAPSTRUCTURE, SAPPROGRAM, SAPOUTPUTPORT, SAPVARIABLE, SAPPROGRAMFLOWOBJECT, SAPTABLEPARAM
+
+## Key Features
+
+### Row-Count Logging (v1.8+)
+
+Generated code automatically logs row counts at every step of the data pipeline:
+
+```
+Source SQ_CUSTOMERS: 10000 rows read
+EXP_CALC (Expression): 10000 input rows -> 10000 output rows
+FIL_ACTIVE (Filter): 10000 input rows -> 8542 output rows
+AGG_TOTALS (Aggregator): 8542 input rows -> 150 output rows
+Target TGT_SUMMARY: 150 rows written
+```
+
+All row-count operations are backend-safe (wrapped in try/except), so Dask and other lazy-evaluation backends won't fail.
+
+### Generated Code Documentation (v1.8+)
+
+Every generated mapping function includes a rich docstring describing:
+- Mapping name and original Informatica description
+- Source and target tables/files
+- Transformation pipeline with field counts per step
+
+Each transformation block is annotated with:
+- Separator headers for visual scanning
+- Transform type and description (from Informatica XML)
+- Input and output field lists (truncated at 10 for readability)
+
+### Window / Analytic Functions (v1.7+)
+
+DataFrame-level analytic functions for aggregation transforms:
+- `moving_avg_df(df, col, window)` — rolling mean via `.rolling().mean()`
+- `moving_sum_df(df, col, window)` — rolling sum via `.rolling().sum()`
+- `cume_df(df, col)` — cumulative sum via `.expanding().sum()`
+- `percentile_df(df, col, pct)` — quantile via `.quantile()`
+
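To illustrate, helpers with the signatures listed above can be sketched in a few lines of pandas. This is a sketch of the documented behavior, not the package's actual source:

```python
import pandas as pd

def moving_avg_df(df, col, window):
    # Rolling mean over the trailing `window` rows; min_periods=1 avoids leading NaNs
    return df[col].rolling(window, min_periods=1).mean()

def cume_df(df, col):
    # Running (cumulative) sum, equivalent to .expanding().sum()
    return df[col].expanding().sum()

df = pd.DataFrame({"AMT": [10.0, 20.0, 30.0, 40.0]})
print(moving_avg_df(df, "AMT", 2).tolist())  # [10.0, 15.0, 25.0, 35.0]
print(cume_df(df, "AMT").tolist())           # [10.0, 30.0, 60.0, 100.0]
```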
+### Update Strategy with Target Operations (v1.7+)
+
+Update Strategy transforms now generate real INSERT/UPDATE/DELETE operations:
+- Static strategies (0/1/2/3) map to INSERT/UPDATE/DELETE/REJECT
+- DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT expressions parsed from conditions
+- Target writer splits rows and routes to appropriate SQL operations
+- Dialect-aware SQL placeholders (`?` for MSSQL, `%s` for PostgreSQL/Oracle)
+- Primary key columns auto-detected from target field definitions
+
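The routing step can be pictured as a simple split on a strategy flag column. The `_STRATEGY` column name below is hypothetical, chosen purely for illustration:

```python
import pandas as pd

# Hypothetical flag column produced by the Update Strategy expression;
# 0/1/2/3 correspond to INSERT/UPDATE/DELETE/REJECT.
df = pd.DataFrame({"ID": [1, 2, 3, 4], "_STRATEGY": [0, 1, 2, 3]})

inserts = df[df["_STRATEGY"] == 0]  # rows bound for INSERT statements
updates = df[df["_STRATEGY"] == 1]  # rows bound for UPDATE statements
deletes = df[df["_STRATEGY"] == 2]  # rows bound for DELETE statements
# Rejected rows (flag 3) are dropped before the target write.

print(len(inserts), len(updates), len(deletes))  # 1 1 1
```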
+### Stored Procedure Execution (v1.7+)
+
+Full stored procedure code generation (not just stubs):
+- Oracle: `cursor.callproc()` with output parameter registration
+- MSSQL: `EXEC` with output parameter capture
+- Generic: `CALL` syntax for other databases
+- Input/output parameter mapping from transformation fields
+- Empty-input guard prevents errors on empty upstream DataFrames
+
+### State Persistence (v1.7+)
+
+JSON-based variable persistence between workflow runs:
+- `load_persistent_state()` / `save_persistent_state()` bracketing workflow execution
+- `get_persistent_variable()` / `set_persistent_variable()` scoped by workflow/mapping name
+- Mapping variables marked `is_persistent="YES"` automatically load from and save to state file
+- Non-persistent variables remain unaffected
+
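A minimal sketch of JSON state persistence along these lines. The `workflow_state.json` filename and the function bodies are assumptions for illustration, not the package's actual implementation:

```python
import json
import os

STATE_FILE = "workflow_state.json"  # assumed filename

def load_persistent_state(path=STATE_FILE):
    # Return the saved variable dict, or an empty dict on first run
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_persistent_state(state, path=STATE_FILE):
    # Write the variable dict back out after the workflow finishes
    with open(path, "w") as f:
        json.dump(state, f, indent=2)

# Keys scoped by workflow/mapping name, as described above
state = load_persistent_state()
key = "wf_orders/m_load_orders/$$LAST_RUN_ID"
state[key] = state.get(key, 0) + 1
save_persistent_state(state)
```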
+### SQL Dialect Translation (v1.6+)
+
+Automatically translates vendor-specific SQL to ANSI equivalents:
+- **Oracle:** NVL→COALESCE, SYSDATE→CURRENT_TIMESTAMP, DECODE→CASE, NVL2→CASE, (+)→ANSI JOIN, ROWNUM→LIMIT
+- **MSSQL:** GETDATE→CURRENT_TIMESTAMP, ISNULL→COALESCE, TOP N→LIMIT, LEN→LENGTH, CHARINDEX→POSITION
+- Auto-detects source dialect; outputs both original and translated SQL
+
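A toy version of this kind of rewrite can be built from regex rules. The two rules below are illustrative only and far less complete than the package's translator:

```python
import re

# Illustrative Oracle → ANSI rewrite rules (not the package's rule set)
RULES = [
    (r"\bNVL\s*\(", "COALESCE("),
    (r"\bSYSDATE\b", "CURRENT_TIMESTAMP"),
]

def translate_oracle(sql):
    # Apply each rewrite case-insensitively, in order
    for pattern, repl in RULES:
        sql = re.sub(pattern, repl, sql, flags=re.IGNORECASE)
    return sql

print(translate_oracle("SELECT NVL(name, 'n/a'), SYSDATE FROM t"))
# SELECT COALESCE(name, 'n/a'), CURRENT_TIMESTAMP FROM t
```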
+### Enhanced Error Reporting (v1.6+)
+
+Structured error log with three analysis sections:
+- **Unsupported Transforms:** Lists each skipped transform with type, field count, and attributes
+- **Unmapped Ports:** OUTPUT fields not connected to any downstream transform
+- **Unsupported Expression Functions:** Unknown functions with location traces
+
+### Nested Mapplet Support (v1.6+)
+
+Recursively expands mapplet-within-mapplet instances:
+- Double-underscore namespacing for nested transforms
+- Depth limit of 10 with circular reference protection
+- Connector rewiring through the full expansion tree
+
+### Data Quality Validation (v1.6+)
+
+Optional `--validate-casts` flag generates null-count checks before/after type casting:
+- Counts null values pre- and post-coercion per column
+- Logs warnings when coercion introduces new nulls
+- Helps identify data quality issues during test runs
+
+### Vectorized Expression Generation (v1.5+)
+
+Column-level pandas operations instead of row-level iteration:
+- IIF → `np.where()`, NVL → `.fillna()`, UPPER/LOWER → `.str.upper()/.str.lower()`
+- SUBSTR → `.str[start:end]`, TO_INTEGER → `pd.to_numeric()`, TO_DATE → `pd.to_datetime()`
+- IS NULL/IS NOT NULL → `.isna()`/`.notna()`
+
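For example, an Informatica expression like `IIF(QTY > 10, 'BULK', 'SINGLE')` maps to whole-column operations rather than a per-row loop (the column names here are made up for the demo):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"QTY": [5, 12, None]})

# IIF(QTY > 10, 'BULK', 'SINGLE') becomes np.where over the whole column
df["ORDER_TYPE"] = np.where(df["QTY"] > 10, "BULK", "SINGLE")

# NVL(QTY, 0) becomes .fillna(0)
df["QTY"] = df["QTY"].fillna(0)

print(df["ORDER_TYPE"].tolist())  # ['SINGLE', 'BULK', 'SINGLE']
print(df["QTY"].tolist())         # [5.0, 12.0, 0.0]
```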
+### Parameter File Support (v1.5+)
+
+Standard Informatica `.param` file parsing:
+- `[Global]` and `[folder.WF:workflow.ST:session]` section support
+- `get_param(config, var_name)` resolution chain: config → env vars → defaults
+- CLI `--param-file` flag for specifying parameter files
+
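The resolution chain can be sketched as follows. This is a simplified stand-in for the package's `get_param`, and the environment-variable naming (stripping the `$$` prefix) is an assumption:

```python
import os

def get_param(config, var_name, default=None):
    # Documented resolution chain: parameter-file config -> env vars -> default
    if var_name in config:
        return config[var_name]
    env_key = var_name.lstrip("$")  # assumed: $$SRC_DIR checks env var SRC_DIR
    if env_key in os.environ:
        return os.environ[env_key]
    return default

config = {"$$SRC_DIR": "/data/in"}
print(get_param(config, "$$SRC_DIR"))                   # /data/in
print(get_param(config, "$$UNSET_VAR", "fallback"))     # fallback
```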
+### Session Connection Overrides (v1.4+)
+
+When sessions define per-transform connection overrides (different database, file directory, or filename), the generated code uses those overrides instead of source/target defaults.
+
+### Worklet Support (v1.4+)
+
+Worklet workflows are detected and generate separate `run_worklet_NAME(config)` functions. The main workflow calls these automatically for Worklet task types.
+
+### Type Casting at Target Writes (v1.4+)
+
+Target field datatypes are mapped to pandas types and generate proper casting code:
+- Integers: nullable `Int64`/`Int32` or `fillna(0).astype(int)` for NOT NULL
+- Dates: `pd.to_datetime(errors='coerce')`
+- Decimals/Floats: `pd.to_numeric(errors='coerce')`
+- Booleans: `.astype('boolean')`
+
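The coercion behavior described above looks roughly like this in generated pandas code (a hand-written sketch with invented column names, not actual generator output):

```python
import pandas as pd

df = pd.DataFrame({"CUST_ID": ["1", "2", "x"],
                   "CREATED": ["2024-01-05", "bad", None]})

# Integer column -> nullable Int64; unparseable values coerce to <NA>
df["CUST_ID"] = pd.to_numeric(df["CUST_ID"], errors="coerce").astype("Int64")

# Date column -> datetime64; bad values coerce to NaT instead of raising
df["CREATED"] = pd.to_datetime(df["CREATED"], errors="coerce")

print(df["CUST_ID"].isna().sum(), df["CREATED"].isna().sum())  # 1 2
```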
+### Flat File Handling (v1.3+)
+
+Parses FLATFILE metadata for delimiter, fixed-width, header lines, skip rows, quote/escape chars. Generates `pd.read_fwf()` for fixed-width or enriched `read_file()` for delimited.
+
+### Mapplet Inlining (v1.3+)
+
+Expands Mapplet instances into prefixed transforms, rewires connectors, and eliminates duplication.
+
+### Decision Tasks (v1.3+)
+
+Converts Informatica decision conditions to Python if/else branches with proper variable substitution.
+
+### Expression Converter (80+ Functions)
+
+Converts Informatica expressions to Python equivalents:
+
+- **String:** SUBSTR, LTRIM, RTRIM, UPPER, LOWER, LPAD, RPAD, INSTR, LENGTH, CONCAT, REPLACE, REG_EXTRACT, REG_REPLACE, REVERSE, INITCAP, CHR, ASCII
+- **Date:** ADD_TO_DATE, DATE_DIFF, GET_DATE_PART, SYSDATE, SYSTIMESTAMP, TO_DATE, TO_CHAR, TRUNC (date)
+- **Numeric:** ROUND, TRUNC, MOD, ABS, CEIL, FLOOR, POWER, SQRT, LOG, EXP, SIGN
+- **Conversion:** TO_INTEGER, TO_BIGINT, TO_FLOAT, TO_DECIMAL, TO_CHAR, TO_DATE
+- **Null handling:** IIF, DECODE, NVL, NVL2, ISNULL, IS_SPACES, IS_NUMBER
+- **Aggregate:** SUM, AVG, COUNT, MIN, MAX, FIRST, LAST, MEDIAN, STDDEV, VARIANCE
+- **Lookup:** :LKP expressions with dynamic lookup references
+- **Variable:** SETVARIABLE / mapping variable assignment
+
+## Requirements
+
+- Python >= 3.8
+- lxml >= 4.9.0
+- PyYAML >= 6.0
+
+## Changelog
+
+### v1.8.x (Phase 7)
+- Row-count logging at every pipeline step (source reads, transforms, target writes)
+- Backend-safe logging (try/except wrapped for Dask/lazy backends)
+- Rich mapping function docstrings with sources, targets, and transform pipeline summary
+- Per-transform documentation headers with description, input/output field lists
+
+### v1.7.x (Phase 6)
+- Window/analytic functions (rolling avg/sum, cumulative sum, percentile)
+- Update Strategy routing with actual INSERT/UPDATE/DELETE target operations
+- Dialect-aware SQL placeholders for MSSQL/PostgreSQL/Oracle
+- Full stored procedure code generation (Oracle/MSSQL/generic)
+- JSON-based state persistence for mapping and workflow variables
+- Primary key auto-detection for update strategy targets
+
+### v1.6.x (Phase 5)
+- SQL dialect translation (Oracle/MSSQL → ANSI)
+- Enhanced error reporting (unsupported transforms, unmapped ports, unknown functions)
+- Nested mapplet expansion with circular reference protection
+- Data quality validation warnings on type casting (`--validate-casts`)
+
+### v1.5.x (Phase 4)
+- Parameter file support (`.param` files with section parsing)
+- Vectorized expression generation (column-level pandas operations)
+- Library-specific code adapters (polars/dask/modin/vaex syntax generation)
+- 72+ integration tests
+
+### v1.4.x (Phase 3)
+- Session connection overrides for sources and targets
+- Worklet function generation with safe invocation
+- Type casting at target writes based on TARGETFIELD datatypes
+- Flat-file session path overrides properly wired
+
+### v1.3.x (Phase 2)
+- FLATFILE metadata in source reads and target writes
+- Normalizer with `pd.melt()`
+- Rank with group-by and Top-N filtering
+- Decision tasks with real if/else branches
+- Mapplet instance inlining
+
+### v1.2.x (Phase 1)
+- Core parser for all 72 XML tags
+- Expression converter with 80+ functions
+- Aggregator, Joiner, Lookup code generation
+- Workflow orchestration with topological task ordering
+- Multi-library support (pandas, dask, polars, vaex, modin)
+
+## Development
+
+```bash
+# Clone and install in development mode
+cd informatica_python
+pip install -e ".[dev]"
+
+# Run tests (136 tests)
+pytest tests/ -v
+```
+
+## License
+
+MIT License - Copyright (c) 2025 Nick
+
+See [LICENSE](LICENSE) for details.
@@ -7,7 +7,7 @@ Licensed under the MIT License.
7
7
 
8
8
  from informatica_python.converter import InformaticaConverter
9
9
 
10
- __version__ = "1.8.0"
10
+ __version__ = "1.8.2"
11
11
  __author__ = "Nick"
12
12
  __license__ = "MIT"
13
13
  __all__ = ["InformaticaConverter"]
@@ -644,7 +644,10 @@ def _generate_source_qualifier(lines, sq, source_map, source_dfs, connector_grap
644
644
  lines.append(f" df_{sq_safe} = df_{_safe_name(next(iter(connected_sources)))}")
645
645
 
646
646
  source_dfs[sq.name] = f"df_{sq_safe}"
647
- lines.append(f" logger.info(f'Source {sq.name}: {{len(df_{sq_safe})}} rows read')")
647
+ lines.append(f" try:")
648
+ lines.append(f" logger.info(f'Source {sq.name}: {{len(df_{sq_safe})}} rows read')")
649
+ lines.append(f" except Exception:")
650
+ lines.append(f" logger.info('Source {sq.name}: rows read (count unavailable)')")
648
651
 
649
652
  if post_sql:
650
653
  lines.append(f" # Post-SQL")
@@ -684,7 +687,10 @@ def _generate_transformation(lines, tx, connector_graph, source_dfs, transform_m
684
687
  lines.append(f" # Input fields: {', '.join(in_fields[:10])}{' ...' if len(in_fields) > 10 else ''}")
685
688
  lines.append(f" # Output fields: {', '.join(out_fields[:10])}{' ...' if len(out_fields) > 10 else ''}")
686
689
  lines.append(f" # -------------------------------------------------------------------")
687
- lines.append(f" _input_rows_{tx_safe} = len({input_df}) if hasattr({input_df}, '__len__') else 0")
690
+ lines.append(f" try:")
691
+ lines.append(f" _input_rows_{tx_safe} = len({input_df})")
692
+ lines.append(f" except Exception:")
693
+ lines.append(f" _input_rows_{tx_safe} = -1")
688
694
 
689
695
  if tx_type == "expression":
690
696
  _gen_expression_transform(lines, tx, tx_safe, input_df, source_dfs, data_lib)
@@ -724,7 +730,10 @@ def _generate_transformation(lines, tx, connector_graph, source_dfs, transform_m
724
730
  lines.append(f" df_{tx_safe} = {copy_expr}")
725
731
  source_dfs[tx.name] = f"df_{tx_safe}"
726
732
 
727
- lines.append(f" _output_rows_{tx_safe} = len(df_{tx_safe}) if hasattr(df_{tx_safe}, '__len__') else 0")
733
+ lines.append(f" try:")
734
+ lines.append(f" _output_rows_{tx_safe} = len(df_{tx_safe})")
735
+ lines.append(f" except Exception:")
736
+ lines.append(f" _output_rows_{tx_safe} = -1")
728
737
  lines.append(f" logger.info(f'{tx.name} ({tx.type}): {{_input_rows_{tx_safe}}} input rows -> {{_output_rows_{tx_safe}}} output rows')")
729
738
  lines.append("")
730
739
 
@@ -1392,7 +1401,10 @@ def _generate_target_write(lines, tgt_name, tgt_def, connector_graph, source_dfs
1392
1401
  else:
1393
1402
  lines.append(f" write_file(df_target_{tgt_safe}, config.get('targets', {{}}).get('{tgt_def.name}', {{}}).get('file_path', '{tgt_def.name}'),")
1394
1403
  lines.append(f" config.get('targets', {{}}).get('{tgt_def.name}', {{}}))")
1395
- lines.append(f" logger.info(f'Target {tgt_def.name}: {{len(df_target_{tgt_safe})}} rows written')")
1404
+ lines.append(f" try:")
1405
+ lines.append(f" logger.info(f'Target {tgt_def.name}: {{len(df_target_{tgt_safe})}} rows written')")
1406
+ lines.append(f" except Exception:")
1407
+ lines.append(f" logger.info('Target {tgt_def.name}: rows written (count unavailable)')")
1396
1408
 
1397
1409
 
1398
1410
  CAST_MAP = {
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: informatica-python
3
- Version: 1.8.0
3
+ Version: 1.8.2
4
4
  Summary: Convert Informatica PowerCenter workflow XML to Python/PySpark code
5
5
  Author: Nick
6
6
  License: MIT
@@ -59,6 +59,12 @@ informatica-python workflow_export.xml -z output.zip
59
59
  # Use a different data library
60
60
  informatica-python workflow_export.xml -o output_dir --data-lib polars
61
61
 
62
+ # Include a parameter file
63
+ informatica-python workflow_export.xml -o output_dir --param-file workflow.param
64
+
65
+ # Enable data quality validation on type casts
66
+ informatica-python workflow_export.xml -o output_dir --validate-casts
67
+
62
68
  # Parse to JSON only (no code generation)
63
69
  informatica-python workflow_export.xml --json
64
70
 
@@ -90,12 +96,12 @@ converter.convert_to_files("workflow_export.xml", "output_dir", data_lib="polars
90
96
 
91
97
  | File | Description |
92
98
  |------|-------------|
93
- | `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions) |
94
- | `mapping_N.py` | One per mapping — transformation logic, source reads, target writes |
95
- | `workflow.py` | Task orchestration with topological ordering and error handling |
99
+ | `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions), window/analytic functions, stored procedure execution, state persistence |
100
+ | `mapping_N.py` | One per mapping — transformation logic with row-count logging, source reads, target writes, inline documentation |
101
+ | `workflow.py` | Task orchestration with topological ordering, decision branching, worklet calls, and error handling |
96
102
  | `config.yml` | Connection configs, source/target metadata, runtime parameters |
97
- | `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms |
98
- | `error_log.txt` | Conversion summary, warnings, and unsupported feature notes |
103
+ | `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms (with ANSI-translated variants) |
104
+ | `error_log.txt` | Conversion summary with unsupported transform analysis, unmapped port detection, and unknown expression function tracing |
99
105
 
100
106
  ## Supported Data Libraries
101
107
 
@@ -113,21 +119,21 @@ Select via `--data-lib` CLI flag or `data_lib` parameter:
113
119
 
114
120
  The code generator produces real, runnable Python for these transformation types:
115
121
 
116
- - **Source Qualifier** — SQL override, pre/post SQL, column selection
117
- - **Expression** — Field-level expressions converted to pandas operations
118
- - **Filter** — Row filtering with converted conditions
119
- - **Joiner** — `pd.merge()` with join type and condition parsing
120
- - **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads
121
- - **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST
122
+ - **Source Qualifier** — SQL override, pre/post SQL, column selection, session connection overrides
123
+ - **Expression** — Field-level expressions converted to vectorized pandas operations (`df["COL"]` style)
124
+ - **Filter** — Row filtering with vectorized converted conditions
125
+ - **Joiner** — `pd.merge()` with join type and condition parsing (inner/left/right/outer)
126
+ - **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads, multiple match policies, default values
127
+ - **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST, computed aggregates
122
128
  - **Sorter** — `sort_values()` with multi-key ascending/descending
123
- - **Router** — Multi-group conditional routing with if/elif/else
129
+ - **Router** — Multi-group conditional routing with named groups
124
130
  - **Union** — `pd.concat()` across multiple input groups
125
- - **Update Strategy** — Insert/Update/Delete/Reject flag generation
131
+ - **Update Strategy** — DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT routing with actual target INSERT/UPDATE/DELETE operations, dialect-aware SQL placeholders, auto-detected primary keys
126
132
  - **Sequence Generator** — Auto-incrementing ID columns
127
133
  - **Normalizer** — `pd.melt()` with auto-detected id/value vars
128
134
  - **Rank** — `groupby().rank()` with Top-N filtering
129
- - **Stored Procedure** — Stub generation with SP name and parameters
130
- - **Transaction Control** — Commit/rollback logic stubs
135
+ - **Stored Procedure** — Full code generation with Oracle/MSSQL/generic support, input/output parameter mapping
136
+ - **Transaction Control** — Commit/rollback logic
131
137
  - **Custom / Java** — Placeholder stubs with TODO markers
132
138
  - **SQL Transform** — Direct SQL execution pass-through
133
139
 
@@ -147,13 +153,118 @@ The code generator produces real, runnable Python for these transformation types
147
153
 
148
154
  ## Key Features
149
155
 
156
+ ### Row-Count Logging (v1.8+)
157
+
158
+ Generated code automatically logs row counts at every step of the data pipeline:
159
+
160
+ ```text
161
+ Source SQ_CUSTOMERS: 10000 rows read
162
+ EXP_CALC (Expression): 10000 input rows -> 10000 output rows
163
+ FIL_ACTIVE (Filter): 10000 input rows -> 8542 output rows
164
+ AGG_TOTALS (Aggregator): 8542 input rows -> 150 output rows
165
+ Target TGT_SUMMARY: 150 rows written
166
+ ```
167
+
168
+ All row-count operations are backend-safe (wrapped in try/except), so Dask and other lazy-evaluation backends won't fail.
169
+
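The generated pattern boils down to a guarded `len()` call. A minimal sketch (`log_row_count` is a hypothetical name, not part of the generated API):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("mapping")

def log_row_count(label, df):
    """Log len(df) where the backend supports it; degrade gracefully otherwise."""
    try:
        rows = len(df)
        logger.info(f"{label}: {rows} rows")
        return rows
    except Exception:
        # Lazy backends may not support (or may refuse) a cheap __len__
        logger.info(f"{label}: rows processed (count unavailable)")
        return -1

class LazyFrame:
    """Stand-in for a lazy backend object that has no __len__."""
```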
170
+ ### Generated Code Documentation (v1.8+)
171
+
172
+ Every generated mapping function includes a rich docstring describing:
173
+ - Mapping name and original Informatica description
174
+ - Source and target tables/files
175
+ - Transformation pipeline with field counts per step
176
+
177
+ Each transformation block is annotated with:
178
+ - Separator headers for visual scanning
179
+ - Transform type and description (from Informatica XML)
180
+ - Input and output field lists (truncated at 10 for readability)
181
+
182
+ ### Window / Analytic Functions (v1.7+)
183
+
184
+ DataFrame-level analytic functions for aggregation transforms:
185
+ - `moving_avg_df(df, col, window)` — rolling mean via `.rolling().mean()`
186
+ - `moving_sum_df(df, col, window)` — rolling sum via `.rolling().sum()`
187
+ - `cume_df(df, col)` — cumulative sum via `.expanding().sum()`
188
+ - `percentile_df(df, col, pct)` — quantile via `.quantile()`
189
+
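Plausible pandas implementations matching the signatures above (a sketch; the shipped helpers may handle partitioning and ordering differently, and `min_periods=1` is an assumption here):

```python
import pandas as pd

def moving_avg_df(df, col, window):
    # Rolling mean over the last `window` rows
    return df[col].rolling(window=window, min_periods=1).mean()

def moving_sum_df(df, col, window):
    # Rolling sum over the last `window` rows
    return df[col].rolling(window=window, min_periods=1).sum()

def cume_df(df, col):
    # Running (cumulative) sum down the column
    return df[col].expanding().sum()

def percentile_df(df, col, pct):
    # pct given as 0-100, mapped onto pandas' 0-1 quantile scale
    return df[col].quantile(pct / 100.0)

df = pd.DataFrame({"amt": [10, 20, 30, 40]})
```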
190
+ ### Update Strategy with Target Operations (v1.7+)
191
+
192
+ Update Strategy transforms now generate real INSERT/UPDATE/DELETE operations:
193
+ - Static strategies (0/1/2/3) map to INSERT/UPDATE/DELETE/REJECT
194
+ - DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT expressions parsed from conditions
195
+ - Target writer splits rows and routes to appropriate SQL operations
196
+ - Dialect-aware SQL placeholders (`?` for MSSQL, `%s` for PostgreSQL/Oracle)
197
+ - Primary key columns auto-detected from target field definitions
198
+
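Conceptually, the routing step partitions rows on a strategy flag before the target write. A sketch (`split_by_strategy` and the `_UPDATE_STRATEGY` column name are hypothetical):

```python
import pandas as pd

# Informatica's numeric update strategy codes
DD_INSERT, DD_UPDATE, DD_DELETE, DD_REJECT = 0, 1, 2, 3

def split_by_strategy(df, flag_col="_UPDATE_STRATEGY"):
    """Partition rows into the buckets the target writer routes to SQL ops."""
    return {
        "insert": df[df[flag_col] == DD_INSERT],
        "update": df[df[flag_col] == DD_UPDATE],
        "delete": df[df[flag_col] == DD_DELETE],
        # DD_REJECT rows are dropped (optionally logged) rather than written
    }

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "_UPDATE_STRATEGY": [DD_INSERT, DD_UPDATE, DD_DELETE, DD_REJECT],
})
parts = split_by_strategy(df)
```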
199
+ ### Stored Procedure Execution (v1.7+)
200
+
201
+ Full stored procedure code generation (not just stubs):
202
+ - Oracle: `cursor.callproc()` with output parameter registration
203
+ - MSSQL: `EXEC` with output parameter capture
204
+ - Generic: `CALL` syntax for other databases
205
+ - Input/output parameter mapping from transformation fields
206
+ - Empty-input guard prevents errors on empty upstream DataFrames
207
+
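The shape of the generated invocation text can be sketched as follows (`build_proc_call` is a hypothetical helper; per the list above, the Oracle path calls `cursor.callproc()` directly rather than building SQL text):

```python
def build_proc_call(proc_name, params, dialect="generic"):
    """Render the invocation text for a stored procedure (sketch)."""
    placeholders = ", ".join("?" for _ in params)
    if dialect == "mssql":
        # MSSQL: EXEC proc @p1, @p2 (positional placeholders shown here)
        return f"EXEC {proc_name} {placeholders}"
    # Other databases: ANSI/JDBC-style CALL syntax
    return f"CALL {proc_name}({placeholders})"
```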
208
+ ### State Persistence (v1.7+)
209
+
210
+ JSON-based variable persistence between workflow runs:
211
+ - `load_persistent_state()` / `save_persistent_state()` bracketing workflow execution
212
+ - `get_persistent_variable()` / `set_persistent_variable()` scoped by workflow/mapping name
213
+ - Mapping variables marked `is_persistent="YES"` automatically load from and save to state file
214
+ - Non-persistent variables remain unaffected
215
+
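The persistence helpers can be pictured as a thin JSON layer keyed by workflow/mapping scope. A sketch using the documented function names, with assumed signatures:

```python
import json
import os
import tempfile

STATE_FILE = os.path.join(tempfile.gettempdir(), "wf_state_demo.json")

def load_persistent_state(path=STATE_FILE):
    """Read the state file if present; start empty otherwise."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_persistent_state(state, path=STATE_FILE):
    with open(path, "w") as f:
        json.dump(state, f)

def set_persistent_variable(state, scope, name, value):
    # Variables are namespaced by workflow/mapping name
    state.setdefault(scope, {})[name] = value

def get_persistent_variable(state, scope, name, default=None):
    return state.get(scope, {}).get(name, default)

state = {}
set_persistent_variable(state, "wf_demo", "$$LAST_RUN_ID", 42)
save_persistent_state(state)
```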
216
+ ### SQL Dialect Translation (v1.6+)
217
+
218
+ Automatically translates vendor-specific SQL to ANSI equivalents:
219
+ - **Oracle:** NVL→COALESCE, SYSDATE→CURRENT_TIMESTAMP, DECODE→CASE, NVL2→CASE, (+)→ANSI JOIN, ROWNUM→LIMIT
220
+ - **MSSQL:** GETDATE→CURRENT_TIMESTAMP, ISNULL→COALESCE, TOP N→LIMIT, LEN→LENGTH, CHARINDEX→POSITION
221
+ - Auto-detects source dialect; outputs both original and translated SQL
222
+
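Two of the simpler Oracle rewrites can be done with plain regular expressions, as sketched below; rewrites like DECODE→CASE or `(+)` outer joins need real parsing and are omitted here:

```python
import re

def translate_oracle_to_ansi(sql):
    """Minimal sketch of Oracle -> ANSI rewrites (NVL and SYSDATE only)."""
    # NVL(a, b) -> COALESCE(a, b): the argument list is reused as-is
    sql = re.sub(r"\bNVL\s*\(", "COALESCE(", sql, flags=re.IGNORECASE)
    # SYSDATE -> CURRENT_TIMESTAMP
    sql = re.sub(r"\bSYSDATE\b", "CURRENT_TIMESTAMP", sql, flags=re.IGNORECASE)
    return sql
```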
223
+ ### Enhanced Error Reporting (v1.6+)
224
+
225
+ Structured error log with three analysis sections:
226
+ - **Unsupported Transforms:** Lists each skipped transform with type, field count, and attributes
227
+ - **Unmapped Ports:** OUTPUT fields not connected to any downstream transform
228
+ - **Unsupported Expression Functions:** Unknown functions with location traces
229
+
230
+ ### Nested Mapplet Support (v1.6+)
231
+
232
+ Recursively expands mapplet-within-mapplet instances:
233
+ - Double-underscore namespacing for nested transforms
234
+ - Depth limit of 10 with circular reference protection
235
+ - Connector rewiring through the full expansion tree
236
+
237
+ ### Data Quality Validation (v1.6+)
238
+
239
+ Optional `--validate-casts` flag generates null-count checks before/after type casting:
240
+ - Counts null values pre- and post-coercion per column
241
+ - Logs warnings when coercion introduces new nulls
242
+ - Helps identify data quality issues during test runs
243
+
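The check amounts to counting nulls before and after coercion (sketch; `cast_with_null_check` is a hypothetical name):

```python
import pandas as pd

def cast_with_null_check(df, col, caster, warnings):
    """Count nulls before/after coercion; record a warning if new nulls appear."""
    before = df[col].isna().sum()
    df[col] = caster(df[col])
    after = df[col].isna().sum()
    if after > before:
        warnings.append(f"{col}: coercion introduced {after - before} new null(s)")
    return df

df = pd.DataFrame({"QTY": ["1", "2", "oops"]})
warnings = []
df = cast_with_null_check(df, "QTY", lambda s: pd.to_numeric(s, errors="coerce"), warnings)
```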
244
+ ### Vectorized Expression Generation (v1.5+)
245
+
246
+ Column-level pandas operations instead of row-level iteration:
247
+ - IIF → `np.where()`, NVL → `.fillna()`, UPPER/LOWER → `.str.upper()/.str.lower()`
248
+ - SUBSTR → `.str[start:end]`, TO_INTEGER → `pd.to_numeric()`, TO_DATE → `pd.to_datetime()`
249
+ - IS NULL/IS NOT NULL → `.isna()`/`.notna()`
250
+
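A few of these rewrites, applied by hand to a small frame (illustrative; note Informatica's SUBSTR is 1-based while pandas slicing is 0-based):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"STATUS": ["A", None, "I"], "CODE": ["001", "007", "042"]})

# IIF(STATUS = 'A', 'ACTIVE', 'OTHER') -> np.where over a boolean mask
df["LABEL"] = np.where(df["STATUS"] == "A", "ACTIVE", "OTHER")

# NVL(STATUS, 'U') -> fillna
df["STATUS2"] = df["STATUS"].fillna("U")

# SUBSTR(CODE, 1, 2) -> .str slicing
df["PREFIX"] = df["CODE"].str[0:2]

# TO_INTEGER(CODE) -> pd.to_numeric with coercion
df["CODE_N"] = pd.to_numeric(df["CODE"], errors="coerce")
```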
251
+ ### Parameter File Support (v1.5+)
252
+
253
+ Standard Informatica `.param` file parsing:
254
+ - `[Global]` and `[folder.WF:workflow.ST:session]` section support
255
+ - `get_param(config, var_name)` resolution chain: config → env vars → defaults
256
+ - CLI `--param-file` flag for specifying parameter files
257
+
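A minimal parser for this file format might look like the following sketch (`parse_param_file` is a hypothetical name; the package's parser may handle more syntax):

```python
def parse_param_file(text):
    """Parse Informatica-style .param text into {section: {var: value}}."""
    sections, current = {}, "Global"
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, {})
        elif "=" in line:
            name, _, value = line.partition("=")
            sections.setdefault(current, {})[name.strip()] = value.strip()
    return sections

params = parse_param_file("""
[Global]
$$REGION=EMEA
[MyFolder.WF:wf_load.ST:s_m_load]
$$BATCH_SIZE=500
""")
```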
150
258
  ### Session Connection Overrides (v1.4+)
259
+
151
260
  When sessions define per-transform connection overrides (different database, file directory, or filename), the generated code uses those overrides instead of source/target defaults.
152
261
 
153
262
  ### Worklet Support (v1.4+)
263
+
154
264
  Worklet workflows are detected and generate separate `run_worklet_NAME(config)` functions. The main workflow calls these automatically for Worklet task types.
155
265
 
156
266
  ### Type Casting at Target Writes (v1.4+)
267
+
157
268
  Target field datatypes are mapped to pandas types and generate proper casting code:
158
269
  - Integers: nullable `Int64`/`Int32` or `fillna(0).astype(int)` for NOT NULL
159
270
  - Dates: `pd.to_datetime(errors='coerce')`
@@ -161,12 +272,15 @@ Target field datatypes are mapped to pandas types and generate proper casting co
161
272
  - Booleans: `.astype('boolean')`
162
273
 
163
274
  ### Flat File Handling (v1.3+)
275
+
164
276
  Parses FLATFILE metadata for delimiter, fixed-width, header lines, skip rows, quote/escape chars. Generates `pd.read_fwf()` for fixed-width or enriched `read_file()` for delimited.
165
277
 
166
278
  ### Mapplet Inlining (v1.3+)
279
+
167
280
  Expands Mapplet instances into prefixed transforms, rewires connectors, and eliminates duplication.
168
281
 
169
282
  ### Decision Tasks (v1.3+)
283
+
170
284
  Converts Informatica decision conditions to Python if/else branches with proper variable substitution.
171
285
 
172
286
  ### Expression Converter (80+ Functions)
@@ -190,6 +304,32 @@ Converts Informatica expressions to Python equivalents:
190
304
 
191
305
  ## Changelog
192
306
 
307
+ ### v1.8.x (Phase 7)
308
+ - Row-count logging at every pipeline step (source reads, transforms, target writes)
309
+ - Backend-safe logging (try/except wrapped for Dask/lazy backends)
310
+ - Rich mapping function docstrings with sources, targets, and transform pipeline summary
311
+ - Per-transform documentation headers with description, input/output field lists
312
+
313
+ ### v1.7.x (Phase 6)
314
+ - Window/analytic functions (rolling avg/sum, cumulative sum, percentile)
315
+ - Update Strategy routing with actual INSERT/UPDATE/DELETE target operations
316
+ - Dialect-aware SQL placeholders for MSSQL/PostgreSQL/Oracle
317
+ - Full stored procedure code generation (Oracle/MSSQL/generic)
318
+ - JSON-based state persistence for mapping and workflow variables
319
+ - Primary key auto-detection for update strategy targets
320
+
321
+ ### v1.6.x (Phase 5)
322
+ - SQL dialect translation (Oracle/MSSQL → ANSI)
323
+ - Enhanced error reporting (unsupported transforms, unmapped ports, unknown functions)
324
+ - Nested mapplet expansion with circular reference protection
325
+ - Data quality validation warnings on type casting (`--validate-casts`)
326
+
327
+ ### v1.5.x (Phase 4)
328
+ - Parameter file support (`.param` files with section parsing)
329
+ - Vectorized expression generation (column-level pandas operations)
330
+ - Library-specific code adapters (polars/dask/modin/vaex syntax generation)
331
+ - 72+ integration tests
332
+
193
333
  ### v1.4.x (Phase 3)
194
334
  - Session connection overrides for sources and targets
195
335
  - Worklet function generation with safe invocation
@@ -217,8 +357,8 @@ Converts Informatica expressions to Python equivalents:
217
357
  cd informatica_python
218
358
  pip install -e ".[dev]"
219
359
 
220
- # Run tests (25 tests)
221
- pytest tests/test_converter.py -v
360
+ # Run tests (136 tests)
361
+ pytest tests/ -v
222
362
  ```
223
363
 
224
364
  ## License
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
4
4
 
5
5
  [project]
6
6
  name = "informatica-python"
7
- version = "1.8.0"
7
+ version = "1.8.2"
8
8
  description = "Convert Informatica PowerCenter workflow XML to Python/PySpark code"
9
9
  readme = "README.md"
10
10
  license = {text = "MIT"}
@@ -1236,7 +1236,7 @@ class TestLoggingEnrichment:
1236
1236
  folder = FolderDef(name="TestFolder", mappings=[mapping])
1237
1237
  code = generate_mapping_code(mapping, folder, "pandas", 1)
1238
1238
  assert "_input_rows_fil_active = len(" in code
1239
- assert "_output_rows_fil_active = len(df_fil_active)" in code
1239
+ assert "_output_rows_fil_active = len(df_fil_active" in code
1240
1240
  assert "FIL_ACTIVE (Filter):" in code
1241
1241
  assert "input rows ->" in code
1242
1242
  assert "output rows" in code
@@ -1,201 +0,0 @@
1
- # informatica-python
2
-
3
- Convert Informatica PowerCenter workflow XML exports into clean, runnable Python/PySpark code.
4
-
5
- **Author:** Nick
6
- **License:** MIT
7
- **PyPI:** [informatica-python](https://pypi.org/project/informatica-python/)
8
-
9
- ---
10
-
11
- ## Overview
12
-
13
- `informatica-python` parses Informatica PowerCenter XML export files and generates equivalent Python code using your choice of data library. It handles all 72 DTD tags from the PowerCenter XML schema and produces a complete, ready-to-run Python project.
14
-
15
- ## Installation
16
-
17
- ```bash
18
- pip install informatica-python
19
- ```
20
-
21
- ## Quick Start
22
-
23
- ### Command Line
24
-
25
- ```bash
26
- # Generate Python files to a directory
27
- informatica-python workflow_export.xml -o output_dir
28
-
29
- # Generate as a zip archive
30
- informatica-python workflow_export.xml -z output.zip
31
-
32
- # Use a different data library
33
- informatica-python workflow_export.xml -o output_dir --data-lib polars
34
-
35
- # Parse to JSON only (no code generation)
36
- informatica-python workflow_export.xml --json
37
-
38
- # Save parsed JSON to file
39
- informatica-python workflow_export.xml --json-file parsed.json
40
- ```
41
-
42
- ### Python API
43
-
44
- ```python
45
- from informatica_python import InformaticaConverter
46
-
47
- converter = InformaticaConverter()
48
-
49
- # Parse and generate files
50
- converter.convert_to_files("workflow_export.xml", "output_dir")
51
-
52
- # Parse and generate zip
53
- converter.convert_to_zip("workflow_export.xml", "output.zip")
54
-
55
- # Parse to structured dict
56
- result = converter.parse_file("workflow_export.xml")
57
-
58
- # Use a different data library
59
- converter.convert_to_files("workflow_export.xml", "output_dir", data_lib="polars")
60
- ```
61
-
62
- ## Generated Output Files
63
-
64
- | File | Description |
65
- |------|-------------|
66
- | `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions) |
67
- | `mapping_N.py` | One per mapping — transformation logic, source reads, target writes |
68
- | `workflow.py` | Task orchestration with topological ordering and error handling |
69
- | `config.yml` | Connection configs, source/target metadata, runtime parameters |
70
- | `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms |
71
- | `error_log.txt` | Conversion summary, warnings, and unsupported feature notes |
72
-
73
- ## Supported Data Libraries
74
-
75
- Select via `--data-lib` CLI flag or `data_lib` parameter:
76
-
77
- | Library | Flag | Best For |
78
- |---------|------|----------|
79
- | **pandas** | `pandas` (default) | General-purpose, most compatible |
80
- | **dask** | `dask` | Large datasets, parallel processing |
81
- | **polars** | `polars` | High performance, Rust-backed |
82
- | **vaex** | `vaex` | Out-of-core, billion-row datasets |
83
- | **modin** | `modin` | Drop-in pandas replacement, multi-core |
84
-
85
- ## Supported Transformations
86
-
87
- The code generator produces real, runnable Python for these transformation types:
88
-
89
- - **Source Qualifier** — SQL override, pre/post SQL, column selection
90
- - **Expression** — Field-level expressions converted to pandas operations
91
- - **Filter** — Row filtering with converted conditions
92
- - **Joiner** — `pd.merge()` with join type and condition parsing
93
- - **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads
94
- - **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST
95
- - **Sorter** — `sort_values()` with multi-key ascending/descending
96
- - **Router** — Multi-group conditional routing with if/elif/else
97
- - **Union** — `pd.concat()` across multiple input groups
98
- - **Update Strategy** — Insert/Update/Delete/Reject flag generation
99
- - **Sequence Generator** — Auto-incrementing ID columns
100
- - **Normalizer** — `pd.melt()` with auto-detected id/value vars
101
- - **Rank** — `groupby().rank()` with Top-N filtering
102
- - **Stored Procedure** — Stub generation with SP name and parameters
103
- - **Transaction Control** — Commit/rollback logic stubs
104
- - **Custom / Java** — Placeholder stubs with TODO markers
105
- - **SQL Transform** — Direct SQL execution pass-through
106
-
107
- ## Supported XML Tags (72 Tags)
108
-
109
- **Top-level:** POWERMART, REPOSITORY, FOLDER, FOLDERVERSION
110
-
111
- **Source/Target:** SOURCE, SOURCEFIELD, TARGET, TARGETFIELD, TARGETINDEX, TARGETINDEXFIELD, FLATFILE, XMLINFO, XMLTEXT, GROUP, TABLEATTRIBUTE, FIELDATTRIBUTE, METADATAEXTENSION, KEYWORD, ERPSRCINFO
112
-
113
- **Mapping/Mapplet:** MAPPING, MAPPLET, TRANSFORMATION, TRANSFORMFIELD, TRANSFORMFIELDATTR, TRANSFORMFIELDATTRDEF, INSTANCE, ASSOCIATED_SOURCE_INSTANCE, CONNECTOR, MAPDEPENDENCY, TARGETLOADORDER, MAPPINGVARIABLE, FIELDDEPENDENCY, INITPROP, ERPINFO
114
-
115
- **Task/Session/Workflow:** TASK, TIMER, VALUEPAIR, SCHEDULER, SCHEDULEINFO, STARTOPTIONS, ENDOPTIONS, SCHEDULEOPTIONS, RECURRING, CUSTOM, DAILYFREQUENCY, REPEAT, FILTER, SESSION, CONFIGREFERENCE, SESSTRANSFORMATIONINST, SESSTRANSFORMATIONGROUP, PARTITION, HASHKEY, KEYRANGE, CONFIG, SESSIONCOMPONENT, CONNECTIONREFERENCE, TASKINSTANCE, WORKFLOWLINK, WORKFLOWVARIABLE, WORKFLOWEVENT, WORKLET, WORKFLOW, ATTRIBUTE
116
-
117
- **Shortcut:** SHORTCUT
118
-
119
- **SAP:** SAPFUNCTION, SAPSTRUCTURE, SAPPROGRAM, SAPOUTPUTPORT, SAPVARIABLE, SAPPROGRAMFLOWOBJECT, SAPTABLEPARAM
120
-
121
- ## Key Features
122
-
123
- ### Session Connection Overrides (v1.4+)
124
- When sessions define per-transform connection overrides (different database, file directory, or filename), the generated code uses those overrides instead of source/target defaults.
125
-
126
- ### Worklet Support (v1.4+)
127
- Worklet workflows are detected and generate separate `run_worklet_NAME(config)` functions. The main workflow calls these automatically for Worklet task types.
128
-
129
- ### Type Casting at Target Writes (v1.4+)
130
- Target field datatypes are mapped to pandas types and generate proper casting code:
131
- - Integers: nullable `Int64`/`Int32` or `fillna(0).astype(int)` for NOT NULL
132
- - Dates: `pd.to_datetime(errors='coerce')`
133
- - Decimals/Floats: `pd.to_numeric(errors='coerce')`
134
- - Booleans: `.astype('boolean')`
135
-
136
- ### Flat File Handling (v1.3+)
137
- Parses FLATFILE metadata for delimiter, fixed-width, header lines, skip rows, quote/escape chars. Generates `pd.read_fwf()` for fixed-width or enriched `read_file()` for delimited.
138
-
139
- ### Mapplet Inlining (v1.3+)
140
- Expands Mapplet instances into prefixed transforms, rewires connectors, and eliminates duplication.
141
-
142
- ### Decision Tasks (v1.3+)
143
- Converts Informatica decision conditions to Python if/else branches with proper variable substitution.
144
-
145
- ### Expression Converter (80+ Functions)
146
-
147
- Converts Informatica expressions to Python equivalents:
148
-
149
- - **String:** SUBSTR, LTRIM, RTRIM, UPPER, LOWER, LPAD, RPAD, INSTR, LENGTH, CONCAT, REPLACE, REG_EXTRACT, REG_REPLACE, REVERSE, INITCAP, CHR, ASCII
150
- - **Date:** ADD_TO_DATE, DATE_DIFF, GET_DATE_PART, SYSDATE, SYSTIMESTAMP, TO_DATE, TO_CHAR, TRUNC (date)
151
- - **Numeric:** ROUND, TRUNC, MOD, ABS, CEIL, FLOOR, POWER, SQRT, LOG, EXP, SIGN
152
- - **Conversion:** TO_INTEGER, TO_BIGINT, TO_FLOAT, TO_DECIMAL, TO_CHAR, TO_DATE
153
- - **Null handling:** IIF, DECODE, NVL, NVL2, ISNULL, IS_SPACES, IS_NUMBER
154
- - **Aggregate:** SUM, AVG, COUNT, MIN, MAX, FIRST, LAST, MEDIAN, STDDEV, VARIANCE
155
- - **Lookup:** :LKP expressions with dynamic lookup references
156
- - **Variable:** SETVARIABLE / mapping variable assignment
157
-
158
- ## Requirements
159
-
160
- - Python >= 3.8
161
- - lxml >= 4.9.0
162
- - PyYAML >= 6.0
163
-
164
- ## Changelog
165
-
166
- ### v1.4.x (Phase 3)
167
- - Session connection overrides for sources and targets
168
- - Worklet function generation with safe invocation
169
- - Type casting at target writes based on TARGETFIELD datatypes
170
- - Flat-file session path overrides properly wired
171
-
172
- ### v1.3.x (Phase 2)
173
- - FLATFILE metadata in source reads and target writes
174
- - Normalizer with `pd.melt()`
175
- - Rank with group-by and Top-N filtering
176
- - Decision tasks with real if/else branches
177
- - Mapplet instance inlining
178
-
179
- ### v1.2.x (Phase 1)
180
- - Core parser for all 72 XML tags
181
- - Expression converter with 80+ functions
182
- - Aggregator, Joiner, Lookup code generation
183
- - Workflow orchestration with topological task ordering
184
- - Multi-library support (pandas, dask, polars, vaex, modin)
185
-
186
- ## Development
187
-
188
- ```bash
189
- # Clone and install in development mode
190
- cd informatica_python
191
- pip install -e ".[dev]"
192
-
193
- # Run tests (25 tests)
194
- pytest tests/test_converter.py -v
195
- ```
196
-
197
- ## License
198
-
199
- MIT License - Copyright (c) 2025 Nick
200
-
201
- See [LICENSE](LICENSE) for details.