informatica-python 1.8.1__tar.gz → 1.9.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (31)
  1. {informatica_python-1.8.1 → informatica_python-1.9.0}/PKG-INFO +162 -18
  2. informatica_python-1.9.0/README.md +345 -0
  3. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/__init__.py +1 -1
  4. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/converter.py +10 -1
  5. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/generators/workflow_gen.py +2 -2
  6. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python.egg-info/PKG-INFO +162 -18
  7. {informatica_python-1.8.1 → informatica_python-1.9.0}/pyproject.toml +1 -1
  8. {informatica_python-1.8.1 → informatica_python-1.9.0}/tests/test_converter.py +6 -1
  9. {informatica_python-1.8.1 → informatica_python-1.9.0}/tests/test_integration.py +4 -4
  10. informatica_python-1.8.1/README.md +0 -201
  11. {informatica_python-1.8.1 → informatica_python-1.9.0}/LICENSE +0 -0
  12. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/cli.py +0 -0
  13. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/generators/__init__.py +0 -0
  14. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/generators/config_gen.py +0 -0
  15. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/generators/error_log_gen.py +0 -0
  16. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/generators/helper_gen.py +0 -0
  17. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/generators/mapping_gen.py +0 -0
  18. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/generators/sql_gen.py +0 -0
  19. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/models.py +0 -0
  20. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/parser.py +0 -0
  21. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/utils/__init__.py +0 -0
  22. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/utils/datatype_map.py +0 -0
  23. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/utils/expression_converter.py +0 -0
  24. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/utils/lib_adapters.py +0 -0
  25. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python/utils/sql_dialect.py +0 -0
  26. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python.egg-info/SOURCES.txt +0 -0
  27. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python.egg-info/dependency_links.txt +0 -0
  28. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python.egg-info/entry_points.txt +0 -0
  29. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python.egg-info/requires.txt +0 -0
  30. {informatica_python-1.8.1 → informatica_python-1.9.0}/informatica_python.egg-info/top_level.txt +0 -0
  31. {informatica_python-1.8.1 → informatica_python-1.9.0}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: informatica-python
- Version: 1.8.1
+ Version: 1.9.0
  Summary: Convert Informatica PowerCenter workflow XML to Python/PySpark code
  Author: Nick
  License: MIT
@@ -59,6 +59,12 @@ informatica-python workflow_export.xml -z output.zip
  # Use a different data library
  informatica-python workflow_export.xml -o output_dir --data-lib polars

+ # Include a parameter file
+ informatica-python workflow_export.xml -o output_dir --param-file workflow.param
+
+ # Enable data quality validation on type casts
+ informatica-python workflow_export.xml -o output_dir --validate-casts
+
  # Parse to JSON only (no code generation)
  informatica-python workflow_export.xml --json

@@ -90,12 +96,12 @@ converter.convert_to_files("workflow_export.xml", "output_dir", data_lib="polars

  | File | Description |
  |------|-------------|
- | `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions) |
- | `mapping_N.py` | One per mapping — transformation logic, source reads, target writes |
- | `workflow.py` | Task orchestration with topological ordering and error handling |
+ | `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions), window/analytic functions, stored procedure execution, state persistence |
+ | `mapping_{name}.py` | One per mapping, named after the real Informatica mapping name — transformation logic with row-count logging, source reads, target writes, inline documentation |
+ | `workflow.py` | Task orchestration with topological ordering, decision branching, worklet calls, and error handling |
  | `config.yml` | Connection configs, source/target metadata, runtime parameters |
- | `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms |
- | `error_log.txt` | Conversion summary, warnings, and unsupported feature notes |
+ | `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms (with ANSI-translated variants) |
+ | `error_log.txt` | Conversion summary with unsupported transform analysis, unmapped port detection, and unknown expression function tracing |

  ## Supported Data Libraries

@@ -113,21 +119,21 @@ Select via `--data-lib` CLI flag or `data_lib` parameter:

  The code generator produces real, runnable Python for these transformation types:

- - **Source Qualifier** — SQL override, pre/post SQL, column selection
- - **Expression** — Field-level expressions converted to pandas operations
- - **Filter** — Row filtering with converted conditions
- - **Joiner** — `pd.merge()` with join type and condition parsing
- - **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads
- - **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST
+ - **Source Qualifier** — SQL override, pre/post SQL, column selection, session connection overrides
+ - **Expression** — Field-level expressions converted to vectorized pandas operations (`df["COL"]` style)
+ - **Filter** — Row filtering with vectorized converted conditions
+ - **Joiner** — `pd.merge()` with join type and condition parsing (inner/left/right/outer)
+ - **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads, multiple match policies, default values
+ - **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST, computed aggregates
  - **Sorter** — `sort_values()` with multi-key ascending/descending
- - **Router** — Multi-group conditional routing with if/elif/else
+ - **Router** — Multi-group conditional routing with named groups
  - **Union** — `pd.concat()` across multiple input groups
- - **Update Strategy** — Insert/Update/Delete/Reject flag generation
+ - **Update Strategy** — DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT routing with actual target INSERT/UPDATE/DELETE operations, dialect-aware SQL placeholders, auto-detected primary keys
  - **Sequence Generator** — Auto-incrementing ID columns
  - **Normalizer** — `pd.melt()` with auto-detected id/value vars
  - **Rank** — `groupby().rank()` with Top-N filtering
- - **Stored Procedure** — Stub generation with SP name and parameters
- - **Transaction Control** — Commit/rollback logic stubs
+ - **Stored Procedure** — Full code generation with Oracle/MSSQL/generic support, input/output parameter mapping
+ - **Transaction Control** — Commit/rollback logic
  - **Custom / Java** — Placeholder stubs with TODO markers
  - **SQL Transform** — Direct SQL execution pass-through

@@ -147,13 +153,118 @@ The code generator produces real, runnable Python for these transformation types

  ## Key Features

+ ### Row-Count Logging (v1.8+)
+
+ Generated code automatically logs row counts at every step of the data pipeline:
+
+ ```
+ Source SQ_CUSTOMERS: 10000 rows read
+ EXP_CALC (Expression): 10000 input rows -> 10000 output rows
+ FIL_ACTIVE (Filter): 10000 input rows -> 8542 output rows
+ AGG_TOTALS (Aggregator): 8542 input rows -> 150 output rows
+ Target TGT_SUMMARY: 150 rows written
+ ```
+
+ All row-count operations are backend-safe (wrapped in try/except), so Dask and other lazy-evaluation backends won't fail.
+
+ ### Generated Code Documentation (v1.8+)
+
+ Every generated mapping function includes a rich docstring describing:
+ - Mapping name and original Informatica description
+ - Source and target tables/files
+ - Transformation pipeline with field counts per step
+
+ Each transformation block is annotated with:
+ - Separator headers for visual scanning
+ - Transform type and description (from Informatica XML)
+ - Input and output field lists (truncated at 10 for readability)
+
+ ### Window / Analytic Functions (v1.7+)
+
+ DataFrame-level analytic functions for aggregation transforms:
+ - `moving_avg_df(df, col, window)` — rolling mean via `.rolling().mean()`
+ - `moving_sum_df(df, col, window)` — rolling sum via `.rolling().sum()`
+ - `cume_df(df, col)` — cumulative sum via `.expanding().sum()`
+ - `percentile_df(df, col, pct)` — quantile via `.quantile()`
+
+ ### Update Strategy with Target Operations (v1.7+)
+
+ Update Strategy transforms now generate real INSERT/UPDATE/DELETE operations:
+ - Static strategies (0/1/2/3) map to INSERT/UPDATE/DELETE/REJECT
+ - DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT expressions parsed from conditions
+ - Target writer splits rows and routes to appropriate SQL operations
+ - Dialect-aware SQL placeholders (`?` for MSSQL, `%s` for PostgreSQL/Oracle)
+ - Primary key columns auto-detected from target field definitions
+
+ ### Stored Procedure Execution (v1.7+)
+
+ Full stored procedure code generation (not just stubs):
+ - Oracle: `cursor.callproc()` with output parameter registration
+ - MSSQL: `EXEC` with output parameter capture
+ - Generic: `CALL` syntax for other databases
+ - Input/output parameter mapping from transformation fields
+ - Empty-input guard prevents errors on empty upstream DataFrames
+
+ ### State Persistence (v1.7+)
+
+ JSON-based variable persistence between workflow runs:
+ - `load_persistent_state()` / `save_persistent_state()` bracketing workflow execution
+ - `get_persistent_variable()` / `set_persistent_variable()` scoped by workflow/mapping name
+ - Mapping variables marked `is_persistent="YES"` automatically load from and save to state file
+ - Non-persistent variables remain unaffected
+
+ ### SQL Dialect Translation (v1.6+)
+
+ Automatically translates vendor-specific SQL to ANSI equivalents:
+ - **Oracle:** NVL→COALESCE, SYSDATE→CURRENT_TIMESTAMP, DECODE→CASE, NVL2→CASE, (+)→ANSI JOIN, ROWNUM→LIMIT
+ - **MSSQL:** GETDATE→CURRENT_TIMESTAMP, ISNULL→COALESCE, TOP N→LIMIT, LEN→LENGTH, CHARINDEX→POSITION
+ - Auto-detects source dialect; outputs both original and translated SQL
+
+ ### Enhanced Error Reporting (v1.6+)
+
+ Structured error log with three analysis sections:
+ - **Unsupported Transforms:** Lists each skipped transform with type, field count, and attributes
+ - **Unmapped Ports:** OUTPUT fields not connected to any downstream transform
+ - **Unsupported Expression Functions:** Unknown functions with location traces
+
+ ### Nested Mapplet Support (v1.6+)
+
+ Recursively expands mapplet-within-mapplet instances:
+ - Double-underscore namespacing for nested transforms
+ - Depth limit of 10 with circular reference protection
+ - Connector rewiring through the full expansion tree
+
+ ### Data Quality Validation (v1.6+)
+
+ Optional `--validate-casts` flag generates null-count checks before/after type casting:
+ - Counts null values pre- and post-coercion per column
+ - Logs warnings when coercion introduces new nulls
+ - Helps identify data quality issues during test runs
+
+ ### Vectorized Expression Generation (v1.5+)
+
+ Column-level pandas operations instead of row-level iteration:
+ - IIF → `np.where()`, NVL → `.fillna()`, UPPER/LOWER → `.str.upper()/.str.lower()`
+ - SUBSTR → `.str[start:end]`, TO_INTEGER → `pd.to_numeric()`, TO_DATE → `pd.to_datetime()`
+ - IS NULL/IS NOT NULL → `.isna()`/`.notna()`
+
+ ### Parameter File Support (v1.5+)
+
+ Standard Informatica `.param` file parsing:
+ - `[Global]` and `[folder.WF:workflow.ST:session]` section support
+ - `get_param(config, var_name)` resolution chain: config → env vars → defaults
+ - CLI `--param-file` flag for specifying parameter files
+
  ### Session Connection Overrides (v1.4+)
+
  When sessions define per-transform connection overrides (different database, file directory, or filename), the generated code uses those overrides instead of source/target defaults.

  ### Worklet Support (v1.4+)
+
  Worklet workflows are detected and generate separate `run_worklet_NAME(config)` functions. The main workflow calls these automatically for Worklet task types.

  ### Type Casting at Target Writes (v1.4+)
+
  Target field datatypes are mapped to pandas types and generate proper casting code:
  - Integers: nullable `Int64`/`Int32` or `fillna(0).astype(int)` for NOT NULL
  - Dates: `pd.to_datetime(errors='coerce')`
@@ -161,12 +272,15 @@ Target field datatypes are mapped to pandas types and generate proper casting co
  - Booleans: `.astype('boolean')`

  ### Flat File Handling (v1.3+)
+
  Parses FLATFILE metadata for delimiter, fixed-width, header lines, skip rows, quote/escape chars. Generates `pd.read_fwf()` for fixed-width or enriched `read_file()` for delimited.

  ### Mapplet Inlining (v1.3+)
+
  Expands Mapplet instances into prefixed transforms, rewires connectors, and eliminates duplication.

  ### Decision Tasks (v1.3+)
+
  Converts Informatica decision conditions to Python if/else branches with proper variable substitution.

  ### Expression Converter (80+ Functions)
@@ -190,6 +304,36 @@ Converts Informatica expressions to Python equivalents:

  ## Changelog

+ ### v1.9.x (Phase 8)
+ - Mapping output files now use real mapping names (e.g., `mapping_m_customer_load.py`) instead of generic numeric indices (`mapping_1.py`)
+ - Workflow imports automatically match the named mapping files
+
+ ### v1.8.x (Phase 7)
+ - Row-count logging at every pipeline step (source reads, transforms, target writes)
+ - Backend-safe logging (try/except wrapped for Dask/lazy backends)
+ - Rich mapping function docstrings with sources, targets, and transform pipeline summary
+ - Per-transform documentation headers with description, input/output field lists
+
+ ### v1.7.x (Phase 6)
+ - Window/analytic functions (rolling avg/sum, cumulative sum, percentile)
+ - Update Strategy routing with actual INSERT/UPDATE/DELETE target operations
+ - Dialect-aware SQL placeholders for MSSQL/PostgreSQL/Oracle
+ - Full stored procedure code generation (Oracle/MSSQL/generic)
+ - JSON-based state persistence for mapping and workflow variables
+ - Primary key auto-detection for update strategy targets
+
+ ### v1.6.x (Phase 5)
+ - SQL dialect translation (Oracle/MSSQL → ANSI)
+ - Enhanced error reporting (unsupported transforms, unmapped ports, unknown functions)
+ - Nested mapplet expansion with circular reference protection
+ - Data quality validation warnings on type casting (`--validate-casts`)
+
+ ### v1.5.x (Phase 4)
+ - Parameter file support (`.param` files with section parsing)
+ - Vectorized expression generation (column-level pandas operations)
+ - Library-specific code adapters (polars/dask/modin/vaex syntax generation)
+ - 72+ integration tests
+
  ### v1.4.x (Phase 3)
  - Session connection overrides for sources and targets
  - Worklet function generation with safe invocation
@@ -217,8 +361,8 @@ Converts Informatica expressions to Python equivalents:
  cd informatica_python
  pip install -e ".[dev]"

- # Run tests (25 tests)
- pytest tests/test_converter.py -v
+ # Run tests (136 tests)
+ pytest tests/ -v
  ```

  ## License
@@ -0,0 +1,345 @@
+ # informatica-python
+
+ Convert Informatica PowerCenter workflow XML exports into clean, runnable Python/PySpark code.
+
+ **Author:** Nick
+ **License:** MIT
+ **PyPI:** [informatica-python](https://pypi.org/project/informatica-python/)
+
+ ---
+
+ ## Overview
+
+ `informatica-python` parses Informatica PowerCenter XML export files and generates equivalent Python code using your choice of data library. It handles all 72 DTD tags from the PowerCenter XML schema and produces a complete, ready-to-run Python project.
+
+ ## Installation
+
+ ```bash
+ pip install informatica-python
+ ```
+
+ ## Quick Start
+
+ ### Command Line
+
+ ```bash
+ # Generate Python files to a directory
+ informatica-python workflow_export.xml -o output_dir
+
+ # Generate as a zip archive
+ informatica-python workflow_export.xml -z output.zip
+
+ # Use a different data library
+ informatica-python workflow_export.xml -o output_dir --data-lib polars
+
+ # Include a parameter file
+ informatica-python workflow_export.xml -o output_dir --param-file workflow.param
+
+ # Enable data quality validation on type casts
+ informatica-python workflow_export.xml -o output_dir --validate-casts
+
+ # Parse to JSON only (no code generation)
+ informatica-python workflow_export.xml --json
+
+ # Save parsed JSON to file
+ informatica-python workflow_export.xml --json-file parsed.json
+ ```
+
+ ### Python API
+
+ ```python
+ from informatica_python import InformaticaConverter
+
+ converter = InformaticaConverter()
+
+ # Parse and generate files
+ converter.convert_to_files("workflow_export.xml", "output_dir")
+
+ # Parse and generate zip
+ converter.convert_to_zip("workflow_export.xml", "output.zip")
+
+ # Parse to structured dict
+ result = converter.parse_file("workflow_export.xml")
+
+ # Use a different data library
+ converter.convert_to_files("workflow_export.xml", "output_dir", data_lib="polars")
+ ```
+
+ ## Generated Output Files
+
+ | File | Description |
+ |------|-------------|
+ | `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions), window/analytic functions, stored procedure execution, state persistence |
+ | `mapping_{name}.py` | One per mapping, named after the real Informatica mapping name — transformation logic with row-count logging, source reads, target writes, inline documentation |
+ | `workflow.py` | Task orchestration with topological ordering, decision branching, worklet calls, and error handling |
+ | `config.yml` | Connection configs, source/target metadata, runtime parameters |
+ | `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms (with ANSI-translated variants) |
+ | `error_log.txt` | Conversion summary with unsupported transform analysis, unmapped port detection, and unknown expression function tracing |
+
+ ## Supported Data Libraries
+
+ Select via `--data-lib` CLI flag or `data_lib` parameter:
+
+ | Library | Flag | Best For |
+ |---------|------|----------|
+ | **pandas** | `pandas` (default) | General-purpose, most compatible |
+ | **dask** | `dask` | Large datasets, parallel processing |
+ | **polars** | `polars` | High performance, Rust-backed |
+ | **vaex** | `vaex` | Out-of-core, billion-row datasets |
+ | **modin** | `modin` | Drop-in pandas replacement, multi-core |
+
+ ## Supported Transformations
+
+ The code generator produces real, runnable Python for these transformation types:
+
+ - **Source Qualifier** — SQL override, pre/post SQL, column selection, session connection overrides
+ - **Expression** — Field-level expressions converted to vectorized pandas operations (`df["COL"]` style)
+ - **Filter** — Row filtering with vectorized converted conditions
+ - **Joiner** — `pd.merge()` with join type and condition parsing (inner/left/right/outer)
+ - **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads, multiple match policies, default values
+ - **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST, computed aggregates
+ - **Sorter** — `sort_values()` with multi-key ascending/descending
+ - **Router** — Multi-group conditional routing with named groups
+ - **Union** — `pd.concat()` across multiple input groups
+ - **Update Strategy** — DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT routing with actual target INSERT/UPDATE/DELETE operations, dialect-aware SQL placeholders, auto-detected primary keys
+ - **Sequence Generator** — Auto-incrementing ID columns
+ - **Normalizer** — `pd.melt()` with auto-detected id/value vars
+ - **Rank** — `groupby().rank()` with Top-N filtering
+ - **Stored Procedure** — Full code generation with Oracle/MSSQL/generic support, input/output parameter mapping
+ - **Transaction Control** — Commit/rollback logic
+ - **Custom / Java** — Placeholder stubs with TODO markers
+ - **SQL Transform** — Direct SQL execution pass-through
+
+ ## Supported XML Tags (72 Tags)
+
+ **Top-level:** POWERMART, REPOSITORY, FOLDER, FOLDERVERSION
+
+ **Source/Target:** SOURCE, SOURCEFIELD, TARGET, TARGETFIELD, TARGETINDEX, TARGETINDEXFIELD, FLATFILE, XMLINFO, XMLTEXT, GROUP, TABLEATTRIBUTE, FIELDATTRIBUTE, METADATAEXTENSION, KEYWORD, ERPSRCINFO
+
+ **Mapping/Mapplet:** MAPPING, MAPPLET, TRANSFORMATION, TRANSFORMFIELD, TRANSFORMFIELDATTR, TRANSFORMFIELDATTRDEF, INSTANCE, ASSOCIATED_SOURCE_INSTANCE, CONNECTOR, MAPDEPENDENCY, TARGETLOADORDER, MAPPINGVARIABLE, FIELDDEPENDENCY, INITPROP, ERPINFO
+
+ **Task/Session/Workflow:** TASK, TIMER, VALUEPAIR, SCHEDULER, SCHEDULEINFO, STARTOPTIONS, ENDOPTIONS, SCHEDULEOPTIONS, RECURRING, CUSTOM, DAILYFREQUENCY, REPEAT, FILTER, SESSION, CONFIGREFERENCE, SESSTRANSFORMATIONINST, SESSTRANSFORMATIONGROUP, PARTITION, HASHKEY, KEYRANGE, CONFIG, SESSIONCOMPONENT, CONNECTIONREFERENCE, TASKINSTANCE, WORKFLOWLINK, WORKFLOWVARIABLE, WORKFLOWEVENT, WORKLET, WORKFLOW, ATTRIBUTE
+
+ **Shortcut:** SHORTCUT
+
+ **SAP:** SAPFUNCTION, SAPSTRUCTURE, SAPPROGRAM, SAPOUTPUTPORT, SAPVARIABLE, SAPPROGRAMFLOWOBJECT, SAPTABLEPARAM
+
+ ## Key Features
+
+ ### Row-Count Logging (v1.8+)
+
+ Generated code automatically logs row counts at every step of the data pipeline:
+
+ ```
+ Source SQ_CUSTOMERS: 10000 rows read
+ EXP_CALC (Expression): 10000 input rows -> 10000 output rows
+ FIL_ACTIVE (Filter): 10000 input rows -> 8542 output rows
+ AGG_TOTALS (Aggregator): 8542 input rows -> 150 output rows
+ Target TGT_SUMMARY: 150 rows written
+ ```
+
+ All row-count operations are backend-safe (wrapped in try/except), so Dask and other lazy-evaluation backends won't fail.
+
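The backend-safe counting pattern described above can be sketched as follows. This is an illustration, not the package's actual generated code; `safe_row_count` and `log_step` are hypothetical names.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("workflow")


def safe_row_count(df):
    """Return len(df), or None when the backend cannot count cheaply
    (a lazy backend may raise here rather than materialize the frame)."""
    try:
        return len(df)
    except Exception:
        return None


def log_step(name, kind, df_in, df_out):
    """Emit a 'NAME (Kind): N input rows -> M output rows' line,
    silently skipping the log when either count is unavailable."""
    n_in, n_out = safe_row_count(df_in), safe_row_count(df_out)
    if n_in is not None and n_out is not None:
        log.info("%s (%s): %d input rows -> %d output rows",
                 name, kind, n_in, n_out)
```

Because the count is wrapped rather than assumed, the same generated code runs unchanged on pandas and on backends where counting would force a compute.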
+ ### Generated Code Documentation (v1.8+)
+
+ Every generated mapping function includes a rich docstring describing:
+ - Mapping name and original Informatica description
+ - Source and target tables/files
+ - Transformation pipeline with field counts per step
+
+ Each transformation block is annotated with:
+ - Separator headers for visual scanning
+ - Transform type and description (from Informatica XML)
+ - Input and output field lists (truncated at 10 for readability)
+
+ ### Window / Analytic Functions (v1.7+)
+
+ DataFrame-level analytic functions for aggregation transforms:
+ - `moving_avg_df(df, col, window)` — rolling mean via `.rolling().mean()`
+ - `moving_sum_df(df, col, window)` — rolling sum via `.rolling().sum()`
+ - `cume_df(df, col)` — cumulative sum via `.expanding().sum()`
+ - `percentile_df(df, col, pct)` — quantile via `.quantile()`
+
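A minimal pandas sketch of the four helpers listed above, assuming `pct` is given on a 0 to 100 scale as in Informatica's PERCENTILE; the package's actual implementations may differ in edge-case handling:

```python
import pandas as pd


def moving_avg_df(df, col, window):
    # Rolling mean; min_periods=1 keeps the first rows non-null
    # (an assumption here, not necessarily the package's choice)
    return df[col].rolling(window, min_periods=1).mean()


def moving_sum_df(df, col, window):
    # Rolling sum over the previous `window` rows
    return df[col].rolling(window, min_periods=1).sum()


def cume_df(df, col):
    # Running (cumulative) sum, like Informatica's CUME
    return df[col].expanding().sum()


def percentile_df(df, col, pct):
    # Quantile of the whole column; pct assumed as 0-100
    return df[col].quantile(pct / 100.0)
```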
+ ### Update Strategy with Target Operations (v1.7+)
+
+ Update Strategy transforms now generate real INSERT/UPDATE/DELETE operations:
+ - Static strategies (0/1/2/3) map to INSERT/UPDATE/DELETE/REJECT
+ - DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT expressions parsed from conditions
+ - Target writer splits rows and routes to appropriate SQL operations
+ - Dialect-aware SQL placeholders (`?` for MSSQL, `%s` for PostgreSQL/Oracle)
+ - Primary key columns auto-detected from target field definitions
+
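The splitting step can be sketched like this. The flag column name `_UPDATE_STRATEGY` and the `split_by_strategy`/`placeholder` helpers are hypothetical illustrations of the routing idea, not the package's API:

```python
import pandas as pd

# Informatica's static strategy codes 0-3
DD_INSERT, DD_UPDATE, DD_DELETE, DD_REJECT = 0, 1, 2, 3


def split_by_strategy(df, flag_col="_UPDATE_STRATEGY"):
    """Split rows by DD flag into per-operation frames;
    DD_REJECT rows are simply not returned."""
    return {
        "insert": df[df[flag_col] == DD_INSERT].drop(columns=flag_col),
        "update": df[df[flag_col] == DD_UPDATE].drop(columns=flag_col),
        "delete": df[df[flag_col] == DD_DELETE].drop(columns=flag_col),
    }


def placeholder(dialect):
    """Dialect-aware SQL parameter placeholder, as described above."""
    return "?" if dialect == "mssql" else "%s"
```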
+ ### Stored Procedure Execution (v1.7+)
+
+ Full stored procedure code generation (not just stubs):
+ - Oracle: `cursor.callproc()` with output parameter registration
+ - MSSQL: `EXEC` with output parameter capture
+ - Generic: `CALL` syntax for other databases
+ - Input/output parameter mapping from transformation fields
+ - Empty-input guard prevents errors on empty upstream DataFrames
+
+ ### State Persistence (v1.7+)
+
+ JSON-based variable persistence between workflow runs:
+ - `load_persistent_state()` / `save_persistent_state()` bracketing workflow execution
+ - `get_persistent_variable()` / `set_persistent_variable()` scoped by workflow/mapping name
+ - Mapping variables marked `is_persistent="YES"` automatically load from and save to state file
+ - Non-persistent variables remain unaffected
+
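A minimal sketch of the four persistence helpers named above, assuming a simple `{scope: {variable: value}}` JSON layout; the package's actual file format and default path are not documented here, so both are assumptions:

```python
import json
import os

STATE_FILE = "workflow_state.json"  # hypothetical default path


def load_persistent_state(path=STATE_FILE):
    """Load the JSON state file, or start empty on the first run."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {}


def save_persistent_state(state, path=STATE_FILE):
    """Write the state back out at the end of the workflow run."""
    with open(path, "w") as fh:
        json.dump(state, fh, indent=2)


def get_persistent_variable(state, scope, name, default=None):
    # Variables are scoped by workflow/mapping name
    return state.get(scope, {}).get(name, default)


def set_persistent_variable(state, scope, name, value):
    state.setdefault(scope, {})[name] = value
```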
+ ### SQL Dialect Translation (v1.6+)
+
+ Automatically translates vendor-specific SQL to ANSI equivalents:
+ - **Oracle:** NVL→COALESCE, SYSDATE→CURRENT_TIMESTAMP, DECODE→CASE, NVL2→CASE, (+)→ANSI JOIN, ROWNUM→LIMIT
+ - **MSSQL:** GETDATE→CURRENT_TIMESTAMP, ISNULL→COALESCE, TOP N→LIMIT, LEN→LENGTH, CHARINDEX→POSITION
+ - Auto-detects source dialect; outputs both original and translated SQL
+
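The simple token-level rewrites can be sketched with plain regex substitution, shown here for a few of the listed mappings; the structural ones (DECODE→CASE, `(+)` joins, ROWNUM) need real parsing, so this is an illustration of the idea rather than the package's translator:

```python
import re

# A small subset of the Oracle/MSSQL -> ANSI rewrites listed above
RULES = [
    (re.compile(r"\bNVL\s*\(", re.I), "COALESCE("),
    (re.compile(r"\bSYSDATE\b", re.I), "CURRENT_TIMESTAMP"),
    (re.compile(r"\bGETDATE\s*\(\s*\)", re.I), "CURRENT_TIMESTAMP"),
    (re.compile(r"\bISNULL\s*\(", re.I), "COALESCE("),
    (re.compile(r"\bLEN\s*\(", re.I), "LENGTH("),
]


def to_ansi(sql):
    """Apply each rewrite in order and return the translated SQL."""
    for pattern, repl in RULES:
        sql = pattern.sub(repl, sql)
    return sql
```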
+ ### Enhanced Error Reporting (v1.6+)
+
+ Structured error log with three analysis sections:
+ - **Unsupported Transforms:** Lists each skipped transform with type, field count, and attributes
+ - **Unmapped Ports:** OUTPUT fields not connected to any downstream transform
+ - **Unsupported Expression Functions:** Unknown functions with location traces
+
+ ### Nested Mapplet Support (v1.6+)
+
+ Recursively expands mapplet-within-mapplet instances:
+ - Double-underscore namespacing for nested transforms
+ - Depth limit of 10 with circular reference protection
+ - Connector rewiring through the full expansion tree
+
+ ### Data Quality Validation (v1.6+)
+
+ Optional `--validate-casts` flag generates null-count checks before/after type casting:
+ - Counts null values pre- and post-coercion per column
+ - Logs warnings when coercion introduces new nulls
+ - Helps identify data quality issues during test runs
+
+ ### Vectorized Expression Generation (v1.5+)
+
+ Column-level pandas operations instead of row-level iteration:
+ - IIF → `np.where()`, NVL → `.fillna()`, UPPER/LOWER → `.str.upper()/.str.lower()`
+ - SUBSTR → `.str[start:end]`, TO_INTEGER → `pd.to_numeric()`, TO_DATE → `pd.to_datetime()`
+ - IS NULL/IS NOT NULL → `.isna()`/`.notna()`
+
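For example, an Expression transform computing `NVL(STATUS, 'U')` and `IIF(STATUS = 'A', 1, 0)` compiles to column-level operations like these (the column names are illustrative, not from a real mapping):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"STATUS": ["A", None, "I"], "AMT": ["10", "x", "30"]})

# NVL(STATUS, 'U') -> .fillna()
df["STATUS_CLEAN"] = df["STATUS"].fillna("U")

# IIF(STATUS_CLEAN = 'A', 1, 0) -> np.where() over a boolean mask
df["IS_ACTIVE"] = np.where(df["STATUS_CLEAN"] == "A", 1, 0)

# TO_INTEGER(AMT) -> pd.to_numeric(); non-numeric values become NaN
df["AMT_NUM"] = pd.to_numeric(df["AMT"], errors="coerce")
```

Each statement operates on the whole column at once, which is what makes the generated code fast compared with per-row iteration.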
+ ### Parameter File Support (v1.5+)
+
+ Standard Informatica `.param` file parsing:
+ - `[Global]` and `[folder.WF:workflow.ST:session]` section support
+ - `get_param(config, var_name)` resolution chain: config → env vars → defaults
+ - CLI `--param-file` flag for specifying parameter files
+
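A simplified sketch of the parsing and resolution chain, assuming the common `[Section]` / `$$VAR=value` layout; `parse_param_file` here is illustrative, and `get_param` mirrors the documented chain of config, then environment variables, then a default:

```python
import os


def parse_param_file(text):
    """Parse .param text into {section: {var: value}} (simplified)."""
    params, section = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            params.setdefault(section, {})
        elif "=" in line and section is not None:
            name, _, value = line.partition("=")
            params[section][name.strip()] = value.strip()
    return params


def get_param(config, name, default=None):
    """Resolution chain: config dict -> environment -> default."""
    if name in config:
        return config[name]
    return os.environ.get(name, default)
```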
+ ### Session Connection Overrides (v1.4+)
+
+ When sessions define per-transform connection overrides (different database, file directory, or filename), the generated code uses those overrides instead of source/target defaults.
+
+ ### Worklet Support (v1.4+)
+
+ Worklet workflows are detected and generate separate `run_worklet_NAME(config)` functions. The main workflow calls these automatically for Worklet task types.
+
+ ### Type Casting at Target Writes (v1.4+)
+
+ Target field datatypes are mapped to pandas types and generate proper casting code:
+ - Integers: nullable `Int64`/`Int32` or `fillna(0).astype(int)` for NOT NULL
+ - Dates: `pd.to_datetime(errors='coerce')`
+ - Decimals/Floats: `pd.to_numeric(errors='coerce')`
+ - Booleans: `.astype('boolean')`
+
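The casting rules above can be sketched for one hypothetical target (the column names and nullability choices are illustrative, not from a real mapping):

```python
import pandas as pd


def cast_target_columns(df):
    """Apply the documented casts: nullable Int64 for a NULL-able ID,
    fillna(0).astype(int) for a NOT NULL count, coerced datetimes."""
    out = df.copy()
    # Nullable integer keeps missing IDs as <NA> instead of failing
    out["CUST_ID"] = pd.to_numeric(out["CUST_ID"], errors="coerce").astype("Int64")
    # NOT NULL integer: fill missing/bad values, then plain int
    out["QTY"] = pd.to_numeric(out["QTY"], errors="coerce").fillna(0).astype(int)
    # Dates: unparseable values become NaT instead of raising
    out["ORDER_DT"] = pd.to_datetime(out["ORDER_DT"], errors="coerce")
    return out
```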
247
+ ### Flat File Handling (v1.3+)
248
+
249
+ Parses FLATFILE metadata for delimiter, fixed-width, header lines, skip rows, quote/escape chars. Generates `pd.read_fwf()` for fixed-width or enriched `read_file()` for delimited.
250
+
251
+ ### Mapplet Inlining (v1.3+)
252
+
253
+ Expands Mapplet instances into prefixed transforms, rewires connectors, and eliminates duplication.
254
+
255
+ ### Decision Tasks (v1.3+)
256
+
257
+ Converts Informatica decision conditions to Python if/else branches with proper variable substitution.
258
+
259
+ ### Expression Converter (80+ Functions)
260
+
261
+ Converts Informatica expressions to Python equivalents:
262
+
263
+ - **String:** SUBSTR, LTRIM, RTRIM, UPPER, LOWER, LPAD, RPAD, INSTR, LENGTH, CONCAT, REPLACE, REG_EXTRACT, REG_REPLACE, REVERSE, INITCAP, CHR, ASCII
264
+ - **Date:** ADD_TO_DATE, DATE_DIFF, GET_DATE_PART, SYSDATE, SYSTIMESTAMP, TO_DATE, TO_CHAR, TRUNC (date)
265
+ - **Numeric:** ROUND, TRUNC, MOD, ABS, CEIL, FLOOR, POWER, SQRT, LOG, EXP, SIGN
266
+ - **Conversion:** TO_INTEGER, TO_BIGINT, TO_FLOAT, TO_DECIMAL, TO_CHAR, TO_DATE
267
+ - **Null handling:** IIF, DECODE, NVL, NVL2, ISNULL, IS_SPACES, IS_NUMBER
268
+ - **Aggregate:** SUM, AVG, COUNT, MIN, MAX, FIRST, LAST, MEDIAN, STDDEV, VARIANCE
269
+ - **Lookup:** :LKP expressions with dynamic lookup references
270
+ - **Variable:** SETVARIABLE / mapping variable assignment
271
+
272
+ ## Requirements
273
+
274
+ - Python >= 3.8
275
+ - lxml >= 4.9.0
276
+ - PyYAML >= 6.0
277
+
278
+ ## Changelog
279
+
280
+ ### v1.9.x (Phase 8)
281
+ - Mapping output files now use real mapping names (e.g., `mapping_m_customer_load.py`) instead of generic numeric indices (`mapping_1.py`)
282
+ - Workflow imports automatically match the named mapping files
283
+
284
+ ### v1.8.x (Phase 7)
285
+ - Row-count logging at every pipeline step (source reads, transforms, target writes)
286
+ - Backend-safe logging (try/except wrapped for Dask/lazy backends)
287
+ - Rich mapping function docstrings with sources, targets, and transform pipeline summary
288
+ - Per-transform documentation headers with description, input/output field lists
289
+
290
+ ### v1.7.x (Phase 6)
291
+ - Window/analytic functions (rolling avg/sum, cumulative sum, percentile)
292
+ - Update Strategy routing with actual INSERT/UPDATE/DELETE target operations
293
+ - Dialect-aware SQL placeholders for MSSQL/PostgreSQL/Oracle
294
+ - Full stored procedure code generation (Oracle/MSSQL/generic)
295
+ - JSON-based state persistence for mapping and workflow variables
296
+ - Primary key auto-detection for update strategy targets
297
+
298
+ ### v1.6.x (Phase 5)
299
+ - SQL dialect translation (Oracle/MSSQL → ANSI)
300
+ - Enhanced error reporting (unsupported transforms, unmapped ports, unknown functions)
301
+ - Nested mapplet expansion with circular reference protection
302
+ - Data quality validation warnings on type casting (`--validate-casts`)
303
+
304
+ ### v1.5.x (Phase 4)
305
+ - Parameter file support (`.param` files with section parsing)
306
+ - Vectorized expression generation (column-level pandas operations)
307
+ - Library-specific code adapters (polars/dask/modin/vaex syntax generation)
308
+ - 72+ integration tests
309
+
310
+ ### v1.4.x (Phase 3)
311
+ - Session connection overrides for sources and targets
312
+ - Worklet function generation with safe invocation
313
+ - Type casting at target writes based on TARGETFIELD datatypes
314
+ - Flat-file session path overrides properly wired
315
+
316
+ ### v1.3.x (Phase 2)
317
+ - FLATFILE metadata in source reads and target writes
318
+ - Normalizer with `pd.melt()`
319
+ - Rank with group-by and Top-N filtering
320
+ - Decision tasks with real if/else branches
321
+ - Mapplet instance inlining
322
+
323
+ ### v1.2.x (Phase 1)
324
+ - Core parser for all 72 XML tags
325
+ - Expression converter with 80+ functions
326
+ - Aggregator, Joiner, Lookup code generation
327
+ - Workflow orchestration with topological task ordering
328
+ - Multi-library support (pandas, dask, polars, vaex, modin)
329
+
330
+ ## Development
331
+
332
+ ```bash
333
+ # Clone and install in development mode
334
+ cd informatica_python
335
+ pip install -e ".[dev]"
336
+
337
+ # Run the full test suite (136 tests)
338
+ pytest tests/ -v
339
+ ```
340
+
341
+ ## License
342
+
343
+ MIT License - Copyright (c) 2025 Nick
344
+
345
+ See [LICENSE](LICENSE) for details.
@@ -7,7 +7,7 @@ Licensed under the MIT License.
7
7
 
8
8
  from informatica_python.converter import InformaticaConverter
9
9
 
10
- __version__ = "1.8.1"
10
+ __version__ = "1.9.0"
11
11
  __author__ = "Nick"
12
12
  __license__ = "MIT"
13
13
  __all__ = ["InformaticaConverter"]
@@ -19,6 +19,14 @@ class InformaticaConverter:
19
19
  self.parser = InformaticaParser()
20
20
  self.powermart = None
21
21
 
22
+ @staticmethod
23
+ def _safe_name(name):
24
+ import re
25
+ safe = re.sub(r'[^a-zA-Z0-9_]', '_', name)
26
+ if safe and safe[0].isdigit():
27
+ safe = '_' + safe
28
+ return safe.lower()
29
+
22
30
  def parse_file(self, file_path: str) -> dict:
23
31
  self.powermart = self.parser.parse_file(file_path)
24
32
  return self.to_json()
@@ -102,7 +110,8 @@ class InformaticaConverter:
102
110
 
103
111
  for i, mapping in enumerate(folder.mappings, 1):
104
112
  code = generate_mapping_code(mapping, folder, self.data_lib, i, validate_casts=validate_casts)
105
- files[f"mapping_{i}.py"] = code
113
+ safe_name = self._safe_name(mapping.name)
114
+ files[f"mapping_{safe_name}.py"] = code
106
115
 
107
116
  files["workflow.py"] = generate_workflow_code(folder)
108
117
 
@@ -25,9 +25,9 @@ def generate_workflow_code(folder: FolderDef) -> str:
25
25
  lines.append("from helper_functions import load_config, logger, load_persistent_state, save_persistent_state, get_persistent_variable, set_persistent_variable")
26
26
  lines.append("")
27
27
 
28
- for i, mapping in enumerate(folder.mappings, 1):
28
+ for mapping in folder.mappings:
29
29
  safe_name = _safe_name(mapping.name)
30
- lines.append(f"from mapping_{i} import run_{safe_name}")
30
+ lines.append(f"from mapping_{safe_name} import run_{safe_name}")
31
31
  lines.append("")
32
32
  lines.append("")
33
33
 
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: informatica-python
3
- Version: 1.8.1
3
+ Version: 1.9.0
4
4
  Summary: Convert Informatica PowerCenter workflow XML to Python/PySpark code
5
5
  Author: Nick
6
6
  License: MIT
@@ -59,6 +59,12 @@ informatica-python workflow_export.xml -z output.zip
59
59
  # Use a different data library
60
60
  informatica-python workflow_export.xml -o output_dir --data-lib polars
61
61
 
62
+ # Include a parameter file
63
+ informatica-python workflow_export.xml -o output_dir --param-file workflow.param
64
+
65
+ # Enable data quality validation on type casts
66
+ informatica-python workflow_export.xml -o output_dir --validate-casts
67
+
62
68
  # Parse to JSON only (no code generation)
63
69
  informatica-python workflow_export.xml --json
64
70
 
@@ -90,12 +96,12 @@ converter.convert_to_files("workflow_export.xml", "output_dir", data_lib="polars
90
96
 
91
97
  | File | Description |
92
98
  |------|-------------|
93
- | `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions) |
94
- | `mapping_N.py` | One per mapping — transformation logic, source reads, target writes |
95
- | `workflow.py` | Task orchestration with topological ordering and error handling |
99
+ | `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions), window/analytic functions, stored procedure execution, state persistence |
100
+ | `mapping_{name}.py` | One per mapping, named after the real Informatica mapping name — transformation logic with row-count logging, source reads, target writes, inline documentation |
101
+ | `workflow.py` | Task orchestration with topological ordering, decision branching, worklet calls, and error handling |
96
102
  | `config.yml` | Connection configs, source/target metadata, runtime parameters |
97
- | `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms |
98
- | `error_log.txt` | Conversion summary, warnings, and unsupported feature notes |
103
+ | `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms (with ANSI-translated variants) |
104
+ | `error_log.txt` | Conversion summary with unsupported transform analysis, unmapped port detection, and unknown expression function tracing |
99
105
 
100
106
  ## Supported Data Libraries
101
107
 
@@ -113,21 +119,21 @@ Select via `--data-lib` CLI flag or `data_lib` parameter:
113
119
 
114
120
  The code generator produces real, runnable Python for these transformation types:
115
121
 
116
- - **Source Qualifier** — SQL override, pre/post SQL, column selection
117
- - **Expression** — Field-level expressions converted to pandas operations
118
- - **Filter** — Row filtering with converted conditions
119
- - **Joiner** — `pd.merge()` with join type and condition parsing
120
- - **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads
121
- - **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST
122
+ - **Source Qualifier** — SQL override, pre/post SQL, column selection, session connection overrides
123
+ - **Expression** — Field-level expressions converted to vectorized pandas operations (`df["COL"]` style)
124
+ - **Filter** — Row filtering with vectorized converted conditions
125
+ - **Joiner** — `pd.merge()` with join type and condition parsing (inner/left/right/outer)
126
+ - **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads, multiple match policies, default values
127
+ - **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST, computed aggregates
122
128
  - **Sorter** — `sort_values()` with multi-key ascending/descending
123
- - **Router** — Multi-group conditional routing with if/elif/else
129
+ - **Router** — Multi-group conditional routing with named groups
124
130
  - **Union** — `pd.concat()` across multiple input groups
125
- - **Update Strategy** — Insert/Update/Delete/Reject flag generation
131
+ - **Update Strategy** — DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT routing with actual target INSERT/UPDATE/DELETE operations, dialect-aware SQL placeholders, auto-detected primary keys
126
132
  - **Sequence Generator** — Auto-incrementing ID columns
127
133
  - **Normalizer** — `pd.melt()` with auto-detected id/value vars
128
134
  - **Rank** — `groupby().rank()` with Top-N filtering
129
- - **Stored Procedure** — Stub generation with SP name and parameters
130
- - **Transaction Control** — Commit/rollback logic stubs
135
+ - **Stored Procedure** — Full code generation with Oracle/MSSQL/generic support, input/output parameter mapping
136
+ - **Transaction Control** — Commit/rollback logic
131
137
  - **Custom / Java** — Placeholder stubs with TODO markers
132
138
  - **SQL Transform** — Direct SQL execution pass-through
133
139
 
@@ -147,13 +153,118 @@ The code generator produces real, runnable Python for these transformation types
147
153
 
148
154
  ## Key Features
149
155
 
156
+ ### Row-Count Logging (v1.8+)
157
+
158
+ Generated code automatically logs row counts at every step of the data pipeline:
159
+
160
+ ```
161
+ Source SQ_CUSTOMERS: 10000 rows read
162
+ EXP_CALC (Expression): 10000 input rows -> 10000 output rows
163
+ FIL_ACTIVE (Filter): 10000 input rows -> 8542 output rows
164
+ AGG_TOTALS (Aggregator): 8542 input rows -> 150 output rows
165
+ Target TGT_SUMMARY: 150 rows written
166
+ ```
167
+
168
+ All row-count operations are backend-safe (wrapped in try/except), so lazy-evaluation backends such as Dask, where a row count may require triggering computation, won't fail at the logging step.
169
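The try/except wrapping described above can be sketched as a small helper. This is an illustrative version, not the exact code emitted into `helper_functions.py`:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("workflow")

def log_row_count(df, label):
    """Log a row count without ever failing the pipeline."""
    try:
        logger.info("%s: %d rows", label, len(df))
    except Exception:
        # Lazy backends (e.g. Dask) can raise on len(); log and move on.
        logger.info("%s: row count unavailable for this backend", label)

# Works for any object with len(); a plain list stands in for a DataFrame.
log_row_count([{"ID": 1}, {"ID": 2}], "Source SQ_CUSTOMERS")
```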
+
170
+ ### Generated Code Documentation (v1.8+)
171
+
172
+ Every generated mapping function includes a rich docstring describing:
173
+ - Mapping name and original Informatica description
174
+ - Source and target tables/files
175
+ - Transformation pipeline with field counts per step
176
+
177
+ Each transformation block is annotated with:
178
+ - Separator headers for visual scanning
179
+ - Transform type and description (from Informatica XML)
180
+ - Input and output field lists (truncated at 10 for readability)
181
+
182
+ ### Window / Analytic Functions (v1.7+)
183
+
184
+ DataFrame-level analytic functions for aggregation transforms:
185
+ - `moving_avg_df(df, col, window)` — rolling mean via `.rolling().mean()`
186
+ - `moving_sum_df(df, col, window)` — rolling sum via `.rolling().sum()`
187
+ - `cume_df(df, col)` — cumulative sum via `.expanding().sum()`
188
+ - `percentile_df(df, col, pct)` — quantile via `.quantile()`
189
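The arithmetic behind these helpers can be sketched in plain Python; the generated versions call `.rolling()`/`.expanding()` on a DataFrame column, but the semantics are the same (including pandas' behavior of yielding no value until a full window is available):

```python
def moving_avg(values, window):
    # Rolling mean: None until a full window is available, like pandas.
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out

def cume(values):
    # Expanding (cumulative) sum.
    out, total = [], 0
    for v in values:
        total += v
        out.append(total)
    return out

print(moving_avg([10, 20, 30, 40], 2))  # [None, 15.0, 25.0, 35.0]
print(cume([1, 2, 3]))                  # [1, 3, 6]
```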
+
190
+ ### Update Strategy with Target Operations (v1.7+)
191
+
192
+ Update Strategy transforms now generate real INSERT/UPDATE/DELETE operations:
193
+ - Static strategies (0/1/2/3) map to INSERT/UPDATE/DELETE/REJECT
194
+ - DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT expressions parsed from conditions
195
+ - Target writer splits rows and routes to appropriate SQL operations
196
+ - Dialect-aware SQL placeholders (`?` for MSSQL, `%s` for PostgreSQL/Oracle)
197
+ - Primary key columns auto-detected from target field definitions
198
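The routing step can be sketched over plain dict rows; the generated code does the same split on a DataFrame before issuing per-bucket SQL. The `_strategy` column name is illustrative:

```python
# Informatica's static update-strategy codes.
DD_INSERT, DD_UPDATE, DD_DELETE, DD_REJECT = 0, 1, 2, 3

def route_rows(rows):
    """Split rows into insert/update/delete buckets; rejects are dropped."""
    buckets = {"insert": [], "update": [], "delete": []}
    names = {DD_INSERT: "insert", DD_UPDATE: "update", DD_DELETE: "delete"}
    for row in rows:
        op = names.get(row.get("_strategy", DD_INSERT))
        if op is not None:  # DD_REJECT maps to nothing and is dropped
            buckets[op].append(row)
    return buckets

def placeholder(dialect):
    # '?' for MSSQL, '%s' otherwise, per the dialect notes above.
    return "?" if dialect == "mssql" else "%s"

demo = [{"ID": 1, "_strategy": DD_INSERT},
        {"ID": 2, "_strategy": DD_UPDATE},
        {"ID": 3, "_strategy": DD_REJECT}]
print({k: len(v) for k, v in route_rows(demo).items()})
```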
+
199
+ ### Stored Procedure Execution (v1.7+)
200
+
201
+ Full stored procedure code generation (not just stubs):
202
+ - Oracle: `cursor.callproc()` with output parameter registration
203
+ - MSSQL: `EXEC` with output parameter capture
204
+ - Generic: `CALL` syntax for other databases
205
+ - Input/output parameter mapping from transformation fields
206
+ - Empty-input guard prevents errors on empty upstream DataFrames
207
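The dialect split can be sketched as call-statement generation, without a live database connection. Procedure and parameter names here are illustrative; note that Oracle drivers typically use `cursor.callproc(name, params)` rather than a SQL string:

```python
def build_call(dialect, proc_name, params):
    """Build a dialect-appropriate stored-procedure call statement."""
    ph = ", ".join("?" if dialect == "mssql" else "%s" for _ in params)
    if dialect == "mssql":
        return f"EXEC {proc_name} {ph}"
    if dialect == "oracle":
        # Shown as an anonymous block for illustration; real code would
        # usually prefer cursor.callproc(proc_name, params).
        return f"BEGIN {proc_name}({ph}); END;"
    return f"CALL {proc_name}({ph})"

print(build_call("mssql", "usp_load_customers", ["p1", "p2"]))
print(build_call("generic", "usp_load_customers", ["p1"]))
```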
+
208
+ ### State Persistence (v1.7+)
209
+
210
+ JSON-based variable persistence between workflow runs:
211
+ - `load_persistent_state()` / `save_persistent_state()` bracketing workflow execution
212
+ - `get_persistent_variable()` / `set_persistent_variable()` scoped by workflow/mapping name
213
+ - Mapping variables marked `is_persistent="YES"` automatically load from and save to state file
214
+ - Non-persistent variables remain unaffected
215
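A minimal sketch of JSON-backed state, assuming variables are scoped by a `"<workflow>.<variable>"` key — the generated helpers may differ in file layout and key format:

```python
import json
import os
import tempfile

def load_persistent_state(path):
    """Return the saved state dict, or {} on the first run."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {}

def save_persistent_state(path, state):
    with open(path, "w") as fh:
        json.dump(state, fh, indent=2)

state_file = os.path.join(tempfile.mkdtemp(), "workflow_state.json")
state = load_persistent_state(state_file)      # {} on first run
state["wf_daily.$$LAST_RUN_ID"] = 42           # illustrative variable name
save_persistent_state(state_file, state)
```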
+
216
+ ### SQL Dialect Translation (v1.6+)
217
+
218
+ Automatically translates vendor-specific SQL to ANSI equivalents:
219
+ - **Oracle:** NVL→COALESCE, SYSDATE→CURRENT_TIMESTAMP, DECODE→CASE, NVL2→CASE, (+)→ANSI JOIN, ROWNUM→LIMIT
220
+ - **MSSQL:** GETDATE→CURRENT_TIMESTAMP, ISNULL→COALESCE, TOP N→LIMIT, LEN→LENGTH, CHARINDEX→POSITION
221
+ - Auto-detects source dialect; outputs both original and translated SQL
222
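A regex-based sketch of a few of the rewrites named above; a real translator needs proper SQL parsing for anything non-trivial (nested functions, string literals, comments), so treat this only as a shape:

```python
import re

# (pattern, replacement) pairs for a handful of Oracle/MSSQL idioms.
RULES = [
    (re.compile(r"\bNVL\s*\(", re.I), "COALESCE("),
    (re.compile(r"\bSYSDATE\b", re.I), "CURRENT_TIMESTAMP"),
    (re.compile(r"\bGETDATE\s*\(\s*\)", re.I), "CURRENT_TIMESTAMP"),
]

def to_ansi(sql):
    for pattern, repl in RULES:
        sql = pattern.sub(repl, sql)
    return sql

print(to_ansi("SELECT NVL(name, 'n/a'), SYSDATE FROM emp"))
```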
+
223
+ ### Enhanced Error Reporting (v1.6+)
224
+
225
+ Structured error log with three analysis sections:
226
+ - **Unsupported Transforms:** Each skipped transform, with its type, field count, and attributes
227
+ - **Unmapped Ports:** OUTPUT fields not connected to any downstream transform
228
+ - **Unsupported Expression Functions:** Unknown functions with location traces
229
+
230
+ ### Nested Mapplet Support (v1.6+)
231
+
232
+ Recursively expands mapplet-within-mapplet instances:
233
+ - Double-underscore namespacing for nested transforms
234
+ - Depth limit of 10 with circular reference protection
235
+ - Connector rewiring through the full expansion tree
236
+
237
+ ### Data Quality Validation (v1.6+)
238
+
239
+ Optional `--validate-casts` flag generates null-count checks before/after type casting:
240
+ - Counts null values pre- and post-coercion per column
241
+ - Logs warnings when coercion introduces new nulls
242
+ - Helps identify data quality issues during test runs
243
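The pre/post null-count check can be sketched like this; the generated code does the same per column around `pd.to_numeric(errors='coerce')`, with plain lists standing in for a column here:

```python
def coerce_numeric(values):
    """Mimic pd.to_numeric(errors='coerce') on a plain list."""
    out = []
    for v in values:
        try:
            out.append(float(v))
        except (TypeError, ValueError):
            out.append(None)  # unparseable values become null
    return out

def validate_cast(values, column="COL"):
    before = sum(v is None for v in values)
    coerced = coerce_numeric(values)
    after = sum(v is None for v in coerced)
    if after > before:
        print(f"WARNING: casting {column} introduced {after - before} new null(s)")
    return coerced

validate_cast(["1", "2", "oops", None])  # warns: 1 new null
```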
+
244
+ ### Vectorized Expression Generation (v1.5+)
245
+
246
+ Column-level pandas operations instead of row-level iteration:
247
+ - IIF → `np.where()`, NVL → `.fillna()`, UPPER/LOWER → `.str.upper()/.str.lower()`
248
+ - SUBSTR → `.str[start:end]`, TO_INTEGER → `pd.to_numeric()`, TO_DATE → `pd.to_datetime()`
249
+ - IS NULL/IS NOT NULL → `.isna()`/`.notna()`
250
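The row-wise semantics of the first two rewrites can be shown without pandas; in generated code these become `np.where(...)` and `Series.fillna(...)` over whole columns:

```python
def iif_vectorized(conditions, if_true, if_false):
    # Elementwise ternary, the semantics of np.where(cond, a, b).
    return [t if c else f for c, t, f in zip(conditions, if_true, if_false)]

def nvl_fill(values, default):
    # Null replacement, the semantics of Series.fillna(default).
    return [default if v is None else v for v in values]

print(iif_vectorized([True, False], ["Y", "Y"], ["N", "N"]))  # ['Y', 'N']
print(nvl_fill([1, None, 3], 0))                              # [1, 0, 3]
```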
+
251
+ ### Parameter File Support (v1.5+)
252
+
253
+ Standard Informatica `.param` file parsing:
254
+ - `[Global]` and `[folder.WF:workflow.ST:session]` section support
255
+ - `get_param(config, var_name)` resolution chain: config → env vars → defaults
256
+ - CLI `--param-file` flag for specifying parameter files
257
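Section parsing for a `.param` file can be sketched with a small hand-rolled parser; the folder, workflow, and variable names below are illustrative:

```python
def parse_param_file(text):
    """Parse [section] headers and key=value lines into nested dicts."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = {}
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            sections[current][key.strip()] = value.strip()
    return sections

sample = """\
[Global]
$$RUN_DATE=2025-01-01
[FOLDER.WF:wf_daily.ST:s_m_customer_load]
$$SRC_DIR=/data/in
"""
params = parse_param_file(sample)
print(params["Global"]["$$RUN_DATE"])  # 2025-01-01
```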
+
150
258
  ### Session Connection Overrides (v1.4+)
259
+
151
260
  When sessions define per-transform connection overrides (different database, file directory, or filename), the generated code uses those overrides instead of source/target defaults.
152
261
 
153
262
  ### Worklet Support (v1.4+)
263
+
154
264
  Worklet workflows are detected and generate separate `run_worklet_NAME(config)` functions. The main workflow calls these automatically for Worklet task types.
155
265
 
156
266
  ### Type Casting at Target Writes (v1.4+)
267
+
157
268
  Target field datatypes are mapped to pandas types and generate proper casting code:
158
269
  - Integers: nullable `Int64`/`Int32` or `fillna(0).astype(int)` for NOT NULL
159
270
  - Dates: `pd.to_datetime(errors='coerce')`
@@ -161,12 +272,15 @@ Target field datatypes are mapped to pandas types and generate proper casting co
161
272
  - Booleans: `.astype('boolean')`
162
273
 
163
274
  ### Flat File Handling (v1.3+)
275
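The datatype mapping can be sketched as a lookup table; the Informatica type names and the exact map below are illustrative, with the pandas calls the generated code would emit noted in comments:

```python
CAST_MAP = {
    "integer": "Int64",        # pd.to_numeric(df[c]).astype('Int64')
    "small integer": "Int32",
    "decimal": "float64",      # pd.to_numeric(df[c], errors='coerce')
    "date/time": "datetime",   # pd.to_datetime(df[c], errors='coerce')
    "string": "object",
}

def pandas_dtype_for(informatica_type, not_null=False):
    dtype = CAST_MAP.get(informatica_type.lower(), "object")
    if dtype == "Int64" and not_null:
        # NOT NULL integers: fillna(0).astype(int), per the notes above.
        return "int64"
    return dtype

print(pandas_dtype_for("integer"))                  # Int64
print(pandas_dtype_for("integer", not_null=True))   # int64
```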
+
164
276
  Parses FLATFILE metadata for delimiter, fixed-width, header lines, skip rows, quote/escape chars. Generates `pd.read_fwf()` for fixed-width or enriched `read_file()` for delimited.
165
277
 
166
278
  ### Mapplet Inlining (v1.3+)
279
+
167
280
  Expands Mapplet instances into prefixed transforms, rewires connectors, and eliminates duplication.
168
281
 
169
282
  ### Decision Tasks (v1.3+)
283
+
170
284
  Converts Informatica decision conditions to Python if/else branches with proper variable substitution.
171
285
 
172
286
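The substitution step can be sketched as below. The `eval` call is only a compact stand-in for this illustration — as the section says, the generated workflow contains real if/else branches, not runtime evaluation — and the variable names are illustrative:

```python
def evaluate_decision(condition, variables):
    """Substitute $$VAR references, then evaluate the boolean expression."""
    expr = condition
    for name, value in variables.items():
        expr = expr.replace(name, repr(value))
    return bool(eval(expr))  # sketch only; generated code emits if/else

if evaluate_decision("$$ROW_COUNT > 0", {"$$ROW_COUNT": 150}):
    branch = "run_downstream"
else:
    branch = "skip"
print(branch)  # run_downstream
```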
  ### Expression Converter (80+ Functions)
@@ -190,6 +304,36 @@ Converts Informatica expressions to Python equivalents:
190
304
 
191
305
  ## Changelog
192
306
 
307
+ ### v1.9.x (Phase 8)
308
+ - Mapping output files now use real mapping names (e.g., `mapping_m_customer_load.py`) instead of generic numeric indices (`mapping_1.py`)
309
+ - Workflow imports automatically match the named mapping files
310
+
311
+ ### v1.8.x (Phase 7)
312
+ - Row-count logging at every pipeline step (source reads, transforms, target writes)
313
+ - Backend-safe logging (try/except wrapped for Dask/lazy backends)
314
+ - Rich mapping function docstrings with sources, targets, and transform pipeline summary
315
+ - Per-transform documentation headers with description, input/output field lists
316
+
317
+ ### v1.7.x (Phase 6)
318
+ - Window/analytic functions (rolling avg/sum, cumulative sum, percentile)
319
+ - Update Strategy routing with actual INSERT/UPDATE/DELETE target operations
320
+ - Dialect-aware SQL placeholders for MSSQL/PostgreSQL/Oracle
321
+ - Full stored procedure code generation (Oracle/MSSQL/generic)
322
+ - JSON-based state persistence for mapping and workflow variables
323
+ - Primary key auto-detection for update strategy targets
324
+
325
+ ### v1.6.x (Phase 5)
326
+ - SQL dialect translation (Oracle/MSSQL → ANSI)
327
+ - Enhanced error reporting (unsupported transforms, unmapped ports, unknown functions)
328
+ - Nested mapplet expansion with circular reference protection
329
+ - Data quality validation warnings on type casting (`--validate-casts`)
330
+
331
+ ### v1.5.x (Phase 4)
332
+ - Parameter file support (`.param` files with section parsing)
333
+ - Vectorized expression generation (column-level pandas operations)
334
+ - Library-specific code adapters (polars/dask/modin/vaex syntax generation)
335
+ - 72+ integration tests
336
+
193
337
  ### v1.4.x (Phase 3)
194
338
  - Session connection overrides for sources and targets
195
339
  - Worklet function generation with safe invocation
@@ -217,8 +361,8 @@ Converts Informatica expressions to Python equivalents:
217
361
  cd informatica_python
218
362
  pip install -e ".[dev]"
219
363
 
220
- # Run tests (25 tests)
221
- pytest tests/test_converter.py -v
364
+ # Run the full test suite (136 tests)
365
+ pytest tests/ -v
222
366
  ```
223
367
 
224
368
  ## License
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
4
4
 
5
5
  [project]
6
6
  name = "informatica-python"
7
- version = "1.8.1"
7
+ version = "1.9.0"
8
8
  description = "Convert Informatica PowerCenter workflow XML to Python/PySpark code"
9
9
  readme = "README.md"
10
10
  license = {text = "MIT"}
@@ -104,7 +104,6 @@ def test_convert_to_files():
104
104
 
105
105
  expected_files = [
106
106
  "helper_functions.py",
107
- "mapping_1.py",
108
107
  "workflow.py",
109
108
  "config.yml",
110
109
  "all_sql_queries.sql",
@@ -118,6 +117,12 @@ def test_convert_to_files():
118
117
  assert size > 0, f"{f} should not be empty"
119
118
  print(f" {f}: {size} bytes")
120
119
 
120
+ mapping_files = [f for f in os.listdir(output_dir) if f.startswith("mapping_") and f.endswith(".py")]
121
+ assert len(mapping_files) > 0, "At least one mapping file should exist"
122
+ for mf in mapping_files:
123
+ assert mf != "mapping_1.py", "Mapping files should use real mapping names, not numeric indices"
124
+ print(f" {mf}: {os.path.getsize(os.path.join(output_dir, mf))} bytes")
125
+
121
126
  print(f"PASS: test_convert_to_files")
122
127
 
123
128
 
@@ -337,7 +337,7 @@ class TestCodeGeneration:
337
337
  converter = InformaticaConverter(data_lib="pandas")
338
338
  output = converter.convert_string(MINIMAL_XML, output_dir=self.tmpdir)
339
339
 
340
- mapping_path = os.path.join(output, "mapping_1.py")
340
+ mapping_path = os.path.join(output, "mapping_m_test_expr.py")
341
341
  assert os.path.exists(mapping_path)
342
342
 
343
343
  with open(mapping_path) as f:
@@ -350,7 +350,7 @@ class TestCodeGeneration:
350
350
  converter = InformaticaConverter(data_lib="pandas")
351
351
  output = converter.convert_string(FILTER_XML, output_dir=self.tmpdir)
352
352
 
353
- mapping_path = os.path.join(output, "mapping_1.py")
353
+ mapping_path = os.path.join(output, "mapping_m_test_filter.py")
354
354
  with open(mapping_path) as f:
355
355
  code = f.read()
356
356
  assert 'Filter' in code
@@ -378,7 +378,7 @@ class TestCodeGeneration:
378
378
  output = converter.convert_string(MINIMAL_XML, output_dir=self.tmpdir)
379
379
 
380
380
  expected_files = [
381
- "helper_functions.py", "mapping_1.py", "workflow.py",
381
+ "helper_functions.py", "mapping_m_test_expr.py", "workflow.py",
382
382
  "config.yml", "all_sql_queries.sql", "error_log.txt",
383
383
  ]
384
384
  for fname in expected_files:
@@ -398,7 +398,7 @@ class TestCodeGeneration:
398
398
  converter = InformaticaConverter(data_lib="pandas")
399
399
  output = converter.convert_string(MINIMAL_XML, output_dir=self.tmpdir)
400
400
 
401
- mapping_path = os.path.join(output, "mapping_1.py")
401
+ mapping_path = os.path.join(output, "mapping_m_test_expr.py")
402
402
  with open(mapping_path) as f:
403
403
  code = f.read()
404
404
  compile(code, mapping_path, "exec")
@@ -1,201 +0,0 @@
1
- # informatica-python
2
-
3
- Convert Informatica PowerCenter workflow XML exports into clean, runnable Python/PySpark code.
4
-
5
- **Author:** Nick
6
- **License:** MIT
7
- **PyPI:** [informatica-python](https://pypi.org/project/informatica-python/)
8
-
9
- ---
10
-
11
- ## Overview
12
-
13
- `informatica-python` parses Informatica PowerCenter XML export files and generates equivalent Python code using your choice of data library. It handles all 72 DTD tags from the PowerCenter XML schema and produces a complete, ready-to-run Python project.
14
-
15
- ## Installation
16
-
17
- ```bash
18
- pip install informatica-python
19
- ```
20
-
21
- ## Quick Start
22
-
23
- ### Command Line
24
-
25
- ```bash
26
- # Generate Python files to a directory
27
- informatica-python workflow_export.xml -o output_dir
28
-
29
- # Generate as a zip archive
30
- informatica-python workflow_export.xml -z output.zip
31
-
32
- # Use a different data library
33
- informatica-python workflow_export.xml -o output_dir --data-lib polars
34
-
35
- # Parse to JSON only (no code generation)
36
- informatica-python workflow_export.xml --json
37
-
38
- # Save parsed JSON to file
39
- informatica-python workflow_export.xml --json-file parsed.json
40
- ```
41
-
42
- ### Python API
43
-
44
- ```python
45
- from informatica_python import InformaticaConverter
46
-
47
- converter = InformaticaConverter()
48
-
49
- # Parse and generate files
50
- converter.convert_to_files("workflow_export.xml", "output_dir")
51
-
52
- # Parse and generate zip
53
- converter.convert_to_zip("workflow_export.xml", "output.zip")
54
-
55
- # Parse to structured dict
56
- result = converter.parse_file("workflow_export.xml")
57
-
58
- # Use a different data library
59
- converter.convert_to_files("workflow_export.xml", "output_dir", data_lib="polars")
60
- ```
61
-
62
- ## Generated Output Files
63
-
64
- | File | Description |
65
- |------|-------------|
66
- | `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions) |
67
- | `mapping_N.py` | One per mapping — transformation logic, source reads, target writes |
68
- | `workflow.py` | Task orchestration with topological ordering and error handling |
69
- | `config.yml` | Connection configs, source/target metadata, runtime parameters |
70
- | `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms |
71
- | `error_log.txt` | Conversion summary, warnings, and unsupported feature notes |
72
-
73
- ## Supported Data Libraries
74
-
75
- Select via `--data-lib` CLI flag or `data_lib` parameter:
76
-
77
- | Library | Flag | Best For |
78
- |---------|------|----------|
79
- | **pandas** | `pandas` (default) | General-purpose, most compatible |
80
- | **dask** | `dask` | Large datasets, parallel processing |
81
- | **polars** | `polars` | High performance, Rust-backed |
82
- | **vaex** | `vaex` | Out-of-core, billion-row datasets |
83
- | **modin** | `modin` | Drop-in pandas replacement, multi-core |
84
-
85
- ## Supported Transformations
86
-
87
- The code generator produces real, runnable Python for these transformation types:
88
-
89
- - **Source Qualifier** — SQL override, pre/post SQL, column selection
90
- - **Expression** — Field-level expressions converted to pandas operations
91
- - **Filter** — Row filtering with converted conditions
92
- - **Joiner** — `pd.merge()` with join type and condition parsing
93
- - **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads
94
- - **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST
95
- - **Sorter** — `sort_values()` with multi-key ascending/descending
96
- - **Router** — Multi-group conditional routing with if/elif/else
97
- - **Union** — `pd.concat()` across multiple input groups
98
- - **Update Strategy** — Insert/Update/Delete/Reject flag generation
99
- - **Sequence Generator** — Auto-incrementing ID columns
100
- - **Normalizer** — `pd.melt()` with auto-detected id/value vars
101
- - **Rank** — `groupby().rank()` with Top-N filtering
102
- - **Stored Procedure** — Stub generation with SP name and parameters
103
- - **Transaction Control** — Commit/rollback logic stubs
104
- - **Custom / Java** — Placeholder stubs with TODO markers
105
- - **SQL Transform** — Direct SQL execution pass-through
106
-
107
- ## Supported XML Tags (72 Tags)
108
-
109
- **Top-level:** POWERMART, REPOSITORY, FOLDER, FOLDERVERSION
110
-
111
- **Source/Target:** SOURCE, SOURCEFIELD, TARGET, TARGETFIELD, TARGETINDEX, TARGETINDEXFIELD, FLATFILE, XMLINFO, XMLTEXT, GROUP, TABLEATTRIBUTE, FIELDATTRIBUTE, METADATAEXTENSION, KEYWORD, ERPSRCINFO
112
-
113
- **Mapping/Mapplet:** MAPPING, MAPPLET, TRANSFORMATION, TRANSFORMFIELD, TRANSFORMFIELDATTR, TRANSFORMFIELDATTRDEF, INSTANCE, ASSOCIATED_SOURCE_INSTANCE, CONNECTOR, MAPDEPENDENCY, TARGETLOADORDER, MAPPINGVARIABLE, FIELDDEPENDENCY, INITPROP, ERPINFO
114
-
115
- **Task/Session/Workflow:** TASK, TIMER, VALUEPAIR, SCHEDULER, SCHEDULEINFO, STARTOPTIONS, ENDOPTIONS, SCHEDULEOPTIONS, RECURRING, CUSTOM, DAILYFREQUENCY, REPEAT, FILTER, SESSION, CONFIGREFERENCE, SESSTRANSFORMATIONINST, SESSTRANSFORMATIONGROUP, PARTITION, HASHKEY, KEYRANGE, CONFIG, SESSIONCOMPONENT, CONNECTIONREFERENCE, TASKINSTANCE, WORKFLOWLINK, WORKFLOWVARIABLE, WORKFLOWEVENT, WORKLET, WORKFLOW, ATTRIBUTE
116
-
117
- **Shortcut:** SHORTCUT
118
-
119
- **SAP:** SAPFUNCTION, SAPSTRUCTURE, SAPPROGRAM, SAPOUTPUTPORT, SAPVARIABLE, SAPPROGRAMFLOWOBJECT, SAPTABLEPARAM
120
-
121
- ## Key Features
122
-
123
- ### Session Connection Overrides (v1.4+)
124
- When sessions define per-transform connection overrides (different database, file directory, or filename), the generated code uses those overrides instead of source/target defaults.
125
-
126
- ### Worklet Support (v1.4+)
127
- Worklet workflows are detected and generate separate `run_worklet_NAME(config)` functions. The main workflow calls these automatically for Worklet task types.
128
-
129
- ### Type Casting at Target Writes (v1.4+)
130
- Target field datatypes are mapped to pandas types and generate proper casting code:
131
- - Integers: nullable `Int64`/`Int32` or `fillna(0).astype(int)` for NOT NULL
132
- - Dates: `pd.to_datetime(errors='coerce')`
133
- - Decimals/Floats: `pd.to_numeric(errors='coerce')`
134
- - Booleans: `.astype('boolean')`
135
-
136
- ### Flat File Handling (v1.3+)
137
- Parses FLATFILE metadata for delimiter, fixed-width, header lines, skip rows, quote/escape chars. Generates `pd.read_fwf()` for fixed-width or enriched `read_file()` for delimited.
138
-
139
- ### Mapplet Inlining (v1.3+)
140
- Expands Mapplet instances into prefixed transforms, rewires connectors, and eliminates duplication.
141
-
142
- ### Decision Tasks (v1.3+)
143
- Converts Informatica decision conditions to Python if/else branches with proper variable substitution.
144
-
145
- ### Expression Converter (80+ Functions)
146
-
147
- Converts Informatica expressions to Python equivalents:
148
-
149
- - **String:** SUBSTR, LTRIM, RTRIM, UPPER, LOWER, LPAD, RPAD, INSTR, LENGTH, CONCAT, REPLACE, REG_EXTRACT, REG_REPLACE, REVERSE, INITCAP, CHR, ASCII
150
- - **Date:** ADD_TO_DATE, DATE_DIFF, GET_DATE_PART, SYSDATE, SYSTIMESTAMP, TO_DATE, TO_CHAR, TRUNC (date)
151
- - **Numeric:** ROUND, TRUNC, MOD, ABS, CEIL, FLOOR, POWER, SQRT, LOG, EXP, SIGN
152
- - **Conversion:** TO_INTEGER, TO_BIGINT, TO_FLOAT, TO_DECIMAL, TO_CHAR, TO_DATE
153
- - **Null handling:** IIF, DECODE, NVL, NVL2, ISNULL, IS_SPACES, IS_NUMBER
154
- - **Aggregate:** SUM, AVG, COUNT, MIN, MAX, FIRST, LAST, MEDIAN, STDDEV, VARIANCE
155
- - **Lookup:** :LKP expressions with dynamic lookup references
156
- - **Variable:** SETVARIABLE / mapping variable assignment
157
-
158
- ## Requirements
159
-
160
- - Python >= 3.8
161
- - lxml >= 4.9.0
162
- - PyYAML >= 6.0
163
-
164
- ## Changelog
165
-
166
- ### v1.4.x (Phase 3)
167
- - Session connection overrides for sources and targets
168
- - Worklet function generation with safe invocation
169
- - Type casting at target writes based on TARGETFIELD datatypes
170
- - Flat-file session path overrides properly wired
171
-
172
- ### v1.3.x (Phase 2)
173
- - FLATFILE metadata in source reads and target writes
174
- - Normalizer with `pd.melt()`
175
- - Rank with group-by and Top-N filtering
176
- - Decision tasks with real if/else branches
177
- - Mapplet instance inlining
178
-
179
- ### v1.2.x (Phase 1)
180
- - Core parser for all 72 XML tags
181
- - Expression converter with 80+ functions
182
- - Aggregator, Joiner, Lookup code generation
183
- - Workflow orchestration with topological task ordering
184
- - Multi-library support (pandas, dask, polars, vaex, modin)
185
-
186
- ## Development
187
-
188
- ```bash
189
- # Clone and install in development mode
190
- cd informatica_python
191
- pip install -e ".[dev]"
192
-
193
- # Run tests (25 tests)
194
- pytest tests/test_converter.py -v
195
- ```
196
-
197
- ## License
198
-
199
- MIT License - Copyright (c) 2025 Nick
200
-
201
- See [LICENSE](LICENSE) for details.