PyPI - informatica-python - Versions diffs - 1.9.1__tar.gz → 1.9.3__tar.gz - Mend

informatica-python 1.9.1__tar.gz → 1.9.3__tar.gz

This diff compares publicly available package versions that have been released to one of the supported registries. It is provided for informational purposes only and reflects the changes between those versions as they appear in their respective public registries.

Files changed (32)

{informatica_python-1.9.1 → informatica_python-1.9.3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: informatica-python
-Version: 1.9.1
+Version: 1.9.3
 Summary: Convert Informatica PowerCenter workflow XML to Python/PySpark code
 Author: Nick
 License: MIT
@@ -79,25 +79,26 @@ from informatica_python import InformaticaConverter
 converter = InformaticaConverter()
-# Parse and generate files
-converter.convert_to_files("workflow_export.xml", "output_dir")
+# Parse and generate files to a directory
+converter.convert("workflow_export.xml", output_dir="output_dir")
-# Parse and generate zip
-converter.convert_to_zip("workflow_export.xml", "output.zip")
+# Parse and generate zip archive
+converter.convert("workflow_export.xml", output_zip="output.zip")
-# Parse to structured dict
+# Parse to structured dict (no code generation)
 result = converter.parse_file("workflow_export.xml")
 # Use a different data library
-converter.convert_to_files("workflow_export.xml", "output_dir", data_lib="polars")
+converter = InformaticaConverter(data_lib="polars")
+converter.convert("workflow_export.xml", output_dir="output_dir")
 ```
 ## Generated Output Files
 | File | Description |
 |------|-------------|
-| `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions), window/analytic functions, stored procedure execution, state persistence |
-| `mapping_{name}.py` | One per mapping, named after the real Informatica mapping name — transformation logic with row-count logging, source reads, target writes, inline documentation |
+| `helper_functions.py` | Database/file I/O helpers, 90+ Informatica expression equivalents, window/analytic functions, stored procedure execution, state persistence |
+| `mapping_{name}.py` | One per mapping, named after the real Informatica mapping name — transformation logic with vectorized expressions, row-count logging, type casting, inline documentation |
 | `workflow.py` | Task orchestration with topological ordering, decision branching, worklet calls, and error handling |
 | `config.yml` | Connection configs, source/target metadata, runtime parameters |
 | `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms (with ANSI-translated variants) |
@@ -119,23 +120,22 @@ Select via `--data-lib` CLI flag or `data_lib` parameter:
 The code generator produces real, runnable Python for these transformation types:
-- **Source Qualifier** — SQL override, pre/post SQL, column selection, session connection overrides
-- **Expression** — Field-level expressions converted to vectorized pandas operations (`df["COL"]` style)
+- **Source Qualifier** — SQL override, pre/post SQL, column selection, session connection overrides, `$$PARAM` substitution in SQL
+- **Expression** — Field-level expressions converted to vectorized pandas operations (`df["COL"]` style) with 40+ vectorized function handlers
 - **Filter** — Row filtering with vectorized converted conditions
 - **Joiner** — `pd.merge()` with join type and condition parsing (inner/left/right/outer)
-- **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads, multiple match policies, default values
+- **Lookup** — `pd.merge()` lookups with connection-aware DB reads, multiple match policies, default values, `$$PARAM` substitution
 - **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST, computed aggregates
-- **Sorter** — `sort_values()` with multi-key ascending/descending
+- **Sorter** — `sort_values()` with multi-key ascending/descending per-field direction from SORTDIRECTION attribute
 - **Router** — Multi-group conditional routing with named groups
 - **Union** — `pd.concat()` across multiple input groups
-- **Update Strategy** — DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT routing with actual target INSERT/UPDATE/DELETE operations, dialect-aware SQL placeholders, auto-detected primary keys
+- **Update Strategy** — DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT routing with actual target INSERT/UPDATE/DELETE operations, dialect-aware SQL placeholders, auto-detected primary keys; vectorized expression parsing with row-level fallback
 - **Sequence Generator** — Auto-incrementing ID columns
 - **Normalizer** — `pd.melt()` with auto-detected id/value vars
 - **Rank** — `groupby().rank()` with Top-N filtering
 - **Stored Procedure** — Full code generation with Oracle/MSSQL/generic support, input/output parameter mapping
-- **Transaction Control** — Commit/rollback logic
 - **Custom / Java** — Placeholder stubs with TODO markers
-- **SQL Transform** — Direct SQL execution pass-through
+- **SQL Transform** — Direct SQL execution pass-through with `$$PARAM` substitution
 ## Supported XML Tags (72 Tags)
@@ -153,6 +153,86 @@ The code generator produces real, runnable Python for these transformation types
 ## Key Features
+### Generated Code Quality (v1.9.3+)
+Generated code follows clean formatting and commenting standards:
+- Consistent section headers (`# ---`) for Source Qualifiers, Transformations, and Target Writes
+- Each section includes metadata: database type, field lists, descriptions
+- Column mapping comments (`# Column mapping: source -> target`) and write operation type comments (`# Write to database table` / `# Write to file`)
+- Expression inline comments showing original Informatica expression (e.g., `# FULL_NAME = UPPER(FIRST_NAME) || ' ' || UPPER(LAST_NAME)`)
+- Clean indentation: no blank line after `try:`, no consecutive blank lines inside function body
+- Mapping-level `try:/except` wrapper with `logger.error()` for runtime visibility
+### Smart Target Write Detection (v1.9.3+)
+Targets are automatically classified as database or file writes:
+- Targets with `database_type` set (Oracle, SQL Server, etc.) generate `write_to_db()` calls
+- Targets with flatfile metadata or file extensions (`.csv`, `.dat`, `.txt`, `.xml`, `.json`, `.parquet`, `.xlsx`, `.xls`, `.tsv`, `.avro`) generate `write_file()` calls
+- Bare targets (no metadata) default to `write_to_db()` since Informatica targets are typically database tables
+- Schema-qualified names (e.g., `dbo.MY_TABLE`) correctly route to database writes
+- Session file path overrides take priority when present
+### Vectorized Expression Engine (v1.9.2+)
+Column-level pandas operations instead of row-level iteration. The expression converter uses a recursive parenthesis-aware parser that handles:
+**Conditional / Null:**
+- `IIF(cond, val, else_val)` → `np.where()` — supports 2-arg form (missing else defaults to `None`)
+- `DECODE(TRUE, cond1, val1, ..., default)` → nested `np.where()` chains
+- `DECODE(field, val1, res1, ..., default)` → value-matching `np.where()`
+- `NVL(val, default)` → `.fillna()`
+- `IS_SPACES(field)` → `field.str.strip().eq("")`
+- `IS_NUMBER(field)` → `pd.to_numeric(field, errors="coerce").notna()`
+- `IN(field, val1, val2, ...)` → `field.isin([...])`
+**String:**
+- `UPPER/LOWER` → `.str.upper()/.str.lower()`
+- `LTRIM/RTRIM/TRIM` → `.str.lstrip()/.str.rstrip()/.str.strip()` with custom char support
+- `SUBSTR(val, start, len)` → `.str[start:end]`
+- `INSTR(val, search)` → `.str.find()`
+- `LPAD/RPAD` → `.str.pad()`
+- `REVERSE(val)` → `.str[::-1]`
+- `INITCAP(val)` → `.str.title()`
+- `REPLACECHR/REPLACESTR` → `.str.replace()`
+- `REG_EXTRACT/REG_REPLACE` → `.str.extract()/.str.replace(regex=True)`
+- `CHR(code)` → `chr(int(code))`
+- `||` concatenation → `+` with `.astype(str)` on non-literals
+**Date/Time:**
+- `TO_DATE(val, fmt)` → `pd.to_datetime()` with Informatica→Python format conversion
+- `TO_CHAR(val, fmt)` → `.dt.strftime()`
+- `ADD_TO_DATE(date, part, amount)` → `date + pd.to_timedelta()` with full unit mapping (YY/MM/DD/HH/MI/SS)
+- `DATE_DIFF(date1, date2, part)` → `(date1 - date2).dt.days` / `.dt.total_seconds() / 3600` etc.
+- `SYSDATE/SYSTIMESTAMP` → `pd.Timestamp.now()`
+- `TRUNC(date, 'DD')` → date truncation via `.dt.floor()/.dt.to_period()`
+- `MAKE_DATE_TIME(y, m, d, h, mi, s)` → `pd.Timestamp()`
+**Numeric:**
+- `TO_INTEGER/TO_BIGINT/TO_FLOAT/TO_DECIMAL` → `pd.to_numeric()`
+- `TRUNC(val)` → `np.trunc()` for numeric truncation
+- `ROUND/ABS/CEIL/FLOOR/POWER/SQRT/MOD/LOG/SIGN` → `np.*` equivalents
+**Special:**
+- `:LKP.TABLE(args)` — Connected lookup references → `df_lkp_table` merge
+- `:PORT.FUNC(args)` — Unconnected lookups → `lookup_func("FUNC", args)` calls
+- Inline `--` comment stripping (respects string literals)
+- String-literal-aware field substitution
+### Expression Converter (90+ Row-Level Functions)
+All Informatica expression functions are available as row-level Python equivalents in `helper_functions.py`:
+- **String:** `substr`, `ltrim`, `rtrim`, `upper`, `lower`, `lpad`, `rpad`, `instr`, `length`, `concat`, `replacechr`, `replacestr`, `reg_extract`, `reg_replace`, `reg_match`, `reverse_str`, `initcap`, `chr_func`, `ascii_func`, `left_str`, `right_str`, `trim_func`, `indexof`, `metaphone_func`, `soundex_func`, `compress_func`, `decompress_func`
+- **Date:** `add_to_date`, `date_diff`, `date_compare`, `get_date_part`, `set_date_part`, `last_day`, `make_date_time`, `to_date`, `to_char`, `to_timestamp_func`, `current_timestamp`, `session_start_time`
+- **Numeric:** `round_val`, `trunc`, `mod_val`, `abs_val`, `ceil_val`, `floor_val`, `power_val`, `sqrt_val`, `log_val`, `ln_val`, `exp_val`, `sign_val`, `rand_val`, `greatest_val`, `least_val`
+- **Conversion:** `to_integer`, `to_bigint`, `to_float`, `to_decimal`, `cast_func`
+- **Null/Conditional:** `iif_expr`, `decode_expr`, `nvl`, `nvl2`, `isnull`, `is_spaces`, `is_number`, `is_date`, `in_expr`, `choose_expr`
+- **Aggregate:** `sum_val`, `avg_val`, `count_val`, `min_val`, `max_val`, `first_val`, `last_val`, `median_val`, `stddev_val`, `variance_val`, `percentile_val`
+- **Window/Analytic:** `moving_avg`, `moving_avg_df`, `moving_sum`, `moving_sum_df`, `cume`, `cume_df`, `percentile_df`
+- **Lookup:** `lookup_func` — Placeholder for runtime lookup resolution
+- **Variable:** `get_variable`, `set_variable`, `set_count_variable`
+- **Control:** `raise_error`, `abort_func`
 ### Row-Count Logging (v1.8+)
 Generated code automatically logs row counts at every step of the data pipeline:
@@ -165,8 +245,6 @@ AGG_TOTALS (Aggregator): 8542 input rows -> 150 output rows
 Target TGT_SUMMARY: 150 rows written
 ```
-All row-count operations are backend-safe (wrapped in try/except), so Dask and other lazy-evaluation backends won't fail.
 ### Generated Code Documentation (v1.8+)
 Every generated mapping function includes a rich docstring describing:
@@ -179,14 +257,6 @@ Each transformation block is annotated with:
 - Transform type and description (from Informatica XML)
 - Input and output field lists (truncated at 10 for readability)
-### Window / Analytic Functions (v1.7+)
-DataFrame-level analytic functions for aggregation transforms:
-- `moving_avg_df(df, col, window)` — rolling mean via `.rolling().mean()`
-- `moving_sum_df(df, col, window)` — rolling sum via `.rolling().sum()`
-- `cume_df(df, col)` — cumulative sum via `.expanding().sum()`
-- `percentile_df(df, col, pct)` — quantile via `.quantile()`
 ### Update Strategy with Target Operations (v1.7+)
 Update Strategy transforms now generate real INSERT/UPDATE/DELETE operations:
@@ -196,6 +266,14 @@ Update Strategy transforms now generate real INSERT/UPDATE/DELETE operations:
 - Dialect-aware SQL placeholders (`?` for MSSQL, `%s` for PostgreSQL/Oracle)
 - Primary key columns auto-detected from target field definitions
+### Window / Analytic Functions (v1.7+)
+DataFrame-level analytic functions for aggregation transforms:
+- `moving_avg_df(df, col, window)` — rolling mean via `.rolling().mean()`
+- `moving_sum_df(df, col, window)` — rolling sum via `.rolling().sum()`
+- `cume_df(df, col)` — cumulative sum via `.expanding().sum()`
+- `percentile_df(df, col, pct)` — quantile via `.quantile()`
 ### Stored Procedure Execution (v1.7+)
 Full stored procedure code generation (not just stubs):
@@ -241,19 +319,13 @@ Optional `--validate-casts` flag generates null-count checks before/after type c
 - Logs warnings when coercion introduces new nulls
 - Helps identify data quality issues during test runs
-### Vectorized Expression Generation (v1.5+)
-Column-level pandas operations instead of row-level iteration:
-- IIF → `np.where()`, NVL → `.fillna()`, UPPER/LOWER → `.str.upper()/.str.lower()`
-- SUBSTR → `.str[start:end]`, TO_INTEGER → `pd.to_numeric()`, TO_DATE → `pd.to_datetime()`
-- IS NULL/IS NOT NULL → `.isna()`/`.notna()`
 ### Parameter File Support (v1.5+)
 Standard Informatica `.param` file parsing:
 - `[Global]` and `[folder.WF:workflow.ST:session]` section support
 - `get_param(config, var_name)` resolution chain: config → env vars → defaults
 - CLI `--param-file` flag for specifying parameter files
+- `$$PARAM` variables in SQL automatically substituted with `.replace()` calls
 ### Session Connection Overrides (v1.4+)
@@ -283,18 +355,49 @@ Expands Mapplet instances into prefixed transforms, rewires connectors, and elim
 Converts Informatica decision conditions to Python if/else branches with proper variable substitution.
-### Expression Converter (80+ Functions)
-Converts Informatica expressions to Python equivalents:
-- **String:** SUBSTR, LTRIM, RTRIM, UPPER, LOWER, LPAD, RPAD, INSTR, LENGTH, CONCAT, REPLACE, REG_EXTRACT, REG_REPLACE, REVERSE, INITCAP, CHR, ASCII
-- **Date:** ADD_TO_DATE, DATE_DIFF, GET_DATE_PART, SYSDATE, SYSTIMESTAMP, TO_DATE, TO_CHAR, TRUNC (date)
-- **Numeric:** ROUND, TRUNC, MOD, ABS, CEIL, FLOOR, POWER, SQRT, LOG, EXP, SIGN
-- **Conversion:** TO_INTEGER, TO_BIGINT, TO_FLOAT, TO_DECIMAL, TO_CHAR, TO_DATE
-- **Null handling:** IIF, DECODE, NVL, NVL2, ISNULL, IS_SPACES, IS_NUMBER
-- **Aggregate:** SUM, AVG, COUNT, MIN, MAX, FIRST, LAST, MEDIAN, STDDEV, VARIANCE
-- **Lookup:** :LKP expressions with dynamic lookup references
-- **Variable:** SETVARIABLE / mapping variable assignment
+## Helper Functions Library
+The generated `helper_functions.py` provides a complete runtime library:
+### Configuration & Parameters
+| Function | Description |
+|----------|-------------|
+| `load_config(path, param_file)` | Load YAML config with optional `.param` file merge |
+| `parse_param_file(path)` | Parse Informatica `.param` files (`[Global]`, `[folder.WF:...]` sections) |
+| `get_param(config, var_name, default)` | Resolve parameter: config → env vars → default |
+| `get_variable(var_name, config)` | Get workflow/mapping variable from params, env vars, or param store |
+| `set_variable(var_name, value)` | Set workflow/mapping variable in param store and env |
+### Database Operations
+| Function | Description |
+|----------|-------------|
+| `get_db_connection(config, conn_name)` | Create DB connection (pyodbc/pymssql/sqlalchemy fallback for MSSQL) |
+| `read_from_db(config, query, conn_name)` | Execute SQL query and return DataFrame |
+| `write_to_db(config, df, table, conn_name)` | Write DataFrame to database table via `.to_sql()` |
+| `execute_sql(config, sql, conn_name)` | Execute DDL/DML statement (INSERT, UPDATE, DELETE) |
+| `write_with_update_strategy(config, df, table, ...)` | Split rows by `_update_strategy` column into INSERT/UPDATE/DELETE/REJECT operations |
+| `call_stored_procedure(config, proc, params, ...)` | Execute stored procedure with input/output parameter mapping (Oracle/MSSQL/generic) |
+### File Operations
+| Function | Description |
+|----------|-------------|
+| `read_file(path, file_config)` | Read CSV/DAT/TXT/XML/XLSX/JSON/Parquet with auto-detection |
+| `write_file(df, path, file_config)` | Write DataFrame to file with format auto-detection |
+### State Persistence
+| Function | Description |
+|----------|-------------|
+| `load_persistent_state(file)` | Load JSON state file for persistent variables |
+| `save_persistent_state(file)` | Save persistent variables to JSON state file |
+| `get_persistent_variable(scope, var, default)` | Get scoped persistent variable |
+| `set_persistent_variable(scope, var, value)` | Set scoped persistent variable |
+### Logging & Monitoring
+| Function | Description |
+|----------|-------------|
+| `log_mapping_start(name)` | Log mapping start with timestamp |
+| `log_mapping_end(name, start_time, row_count)` | Log mapping completion with elapsed time |
+| `validate_row_count(df, name, min_rows)` | Validate minimum row count threshold |
 ## Requirements
@@ -304,9 +407,40 @@ Converts Informatica expressions to Python equivalents:
 ## Changelog
-### v1.9.x (Phase 8)
+### v1.9.3 (Current)
+- **Smart target write detection**: Bare targets default to `write_to_db()` instead of `write_file()`; file extension allowlist (`.csv`, `.dat`, `.txt`, `.xml`, `.json`, `.parquet`, `.xlsx`, `.xls`, `.tsv`, `.avro`) for file targets; schema-qualified names (`dbo.TABLE`) correctly route to database
+- **DECODE vectorization**: `DECODE(TRUE, cond1, val1, ..., default)` → nested `np.where()` chains; value-matching DECODE; handles IN() conditions and complex boolean nesting
+- **IS_SPACES vectorization**: `IS_SPACES(field)` → `field.str.strip().eq("")`
+- **2-arg IIF**: `IIF(cond, val)` without else clause defaults to `None`
+- **REVERSE vectorization**: `REVERSE(field)` → `field.str[::-1]`
+- **IN() vectorization**: `IN(field, val1, val2, ...)` → `field.isin([...])`
+- **IS_NUMBER vectorization**: `IS_NUMBER(field)` → `pd.to_numeric(field, errors="coerce").notna()`
+- **SYSDATE/SYSTIMESTAMP**: Bare `SYSDATE`/`SYSTIMESTAMP` → `pd.Timestamp.now()` in vectorized mode
+- **TRUNC vectorization**: Numeric `TRUNC(field)` → `np.trunc()`; date `TRUNC(field, 'DD')` → `.dt.floor()`
+- **ADD_TO_DATE vectorization**: `ADD_TO_DATE(date, part, amount)` → `pd.to_timedelta()` with YY/MM/DD/HH/MI/SS units
+- **DATE_DIFF vectorization**: `DATE_DIFF(date1, date2, part)` → arithmetic on timedelta components
+- **Unconnected lookup support**: `:PORT.FUNC_NAME(args)` → `lookup_func("FUNC_NAME", args)`
+- **Inline comment stripping**: `--` comments removed from expressions (respects string literals)
+- **`$$PARAM` SQL substitution**: Source Qualifier, Lookup, and SQL Transform SQL strings auto-substitute `$$VAR` with `get_param(config, 'VAR')` calls
+- **Sorter direction**: Reads `SORTDIRECTION` from field attributes, generates per-field `ascending=[True, False, ...]`
+- **Pass-through optimization**: Identity expressions skip `.copy()` and use direct reference
+- **Duplicate lookup deduplication**: `_gen_lookup_transform` uses `seen_output_cols` set to avoid duplicate column checks
+- **Mapping-level error handling**: Generated function body wrapped in `try:/except` with `logger.error()`
+- **Update strategy vectorized**: Tries vectorized expression first, falls back to row-level `apply()`
+- **Generated code formatting**: Consistent `# ---` section headers for Source Qualifiers, Transforms, and Target Writes; metadata comments (database type, field lists); column mapping and write operation comments; clean blank line handling
+- **Source/target detection**: Case-insensitive instance type matching
+- **Session→mapping inference**: Longest-suffix-match strategy for ambiguous mapping names
+- **646 tests** across unit, integration, expression, and formatting test suites
+### v1.9.2 (Phase 8)
 - Mapping output files now use real mapping names (e.g., `mapping_m_customer_load.py`) instead of generic numeric indices (`mapping_1.py`)
 - Workflow imports automatically match the named mapping files
+- **Expression converter rewrite**: Recursive parenthesis-aware parser replacing simple regex; fixes nested IIF/INSTR/LTRIM/RTRIM/REPLACECHR/REPLACESTR/SUBSTR/TO_CHAR/CHR/MAKE_DATE_TIME
+- **`:LKP.` references** now properly converted to `lookup_func()` calls in vectorized mode
+- **String literal safety**: `||` concatenation no longer applies `.astype(str)` to string literals
+- **NULL/TRUE/FALSE**: Correctly resolved as `None`/`True`/`False` before field-name substitution
+- **`import pandas as pd`** and `from datetime import datetime` now included in generated mapping files
+- **MSSQL connection fallbacks**: `pymssql` and `sqlalchemy` tried when `pyodbc` unavailable
 ### v1.8.x (Phase 7)
 - Row-count logging at every pipeline step (source reads, transforms, target writes)
@@ -361,7 +495,7 @@ Converts Informatica expressions to Python equivalents:
 cd informatica_python
 pip install -e ".[dev]"
-# Run tests (136 tests)
+# Run tests (646 tests)
 pytest tests/ -v
 ```
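The vectorized mappings that the 1.9.3 README lists (`IIF` → `np.where()`, `NVL` → `.fillna()`, `IS_SPACES`, `IS_NUMBER`, `IN`) can be illustrated with a small hand-written sketch. This is not output of the generator: the DataFrame, its column names, and the sample values are assumptions made up for the illustration, and only the pandas/NumPy calls named in the README are used.

```python
import numpy as np
import pandas as pd

# Hypothetical input rows; column names are invented for this sketch.
df = pd.DataFrame({
    "FIRST_NAME": ["ada", None, "  "],
    "AMOUNT": ["10", "x", "7.5"],
    "STATUS": ["A", "B", "C"],
})

# IIF(STATUS = 'A', 'active', 'other') -> np.where()
df["STATUS_TEXT"] = np.where(df["STATUS"] == "A", "active", "other")

# NVL(FIRST_NAME, 'unknown') -> .fillna()
df["NAME_FILLED"] = df["FIRST_NAME"].fillna("unknown")

# IS_SPACES(FIRST_NAME) -> .str.strip().eq("")
df["NAME_BLANK"] = df["FIRST_NAME"].str.strip().eq("")

# IS_NUMBER(AMOUNT) -> pd.to_numeric(..., errors="coerce").notna()
df["AMOUNT_IS_NUMBER"] = pd.to_numeric(df["AMOUNT"], errors="coerce").notna()

# IN(STATUS, 'A', 'B') -> .isin([...])
df["STATUS_IN_AB"] = df["STATUS"].isin(["A", "B"])
```

Each statement operates on a whole column at once, which is the difference from the row-level helpers in `helper_functions.py` that the README describes as the fallback.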

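The target classification rules added under "Smart Target Write Detection" can be read as a small decision function. The sketch below is one possible reading, not the package's code: `classify_target` and its parameters are hypothetical names, and the order in which `database_type` and flat-file metadata are checked is an assumption, since the README only states that session file path overrides take priority.

```python
from typing import Optional

# File extension allowlist as given in the 1.9.3 changelog entry.
FILE_EXTENSIONS = (
    ".csv", ".dat", ".txt", ".xml", ".json",
    ".parquet", ".xlsx", ".xls", ".tsv", ".avro",
)


def classify_target(
    name: str,
    database_type: Optional[str] = None,
    is_flatfile: bool = False,
    session_file_path: Optional[str] = None,
) -> str:
    """Return "file" or "db" for a target (hypothetical helper)."""
    # A session-level file path override wins when present.
    if session_file_path:
        return "file"
    # Assumed precedence: an explicit database type comes next.
    if database_type:
        return "db"
    # Flat-file metadata or an allowlisted extension means a file write.
    if is_flatfile or name.lower().endswith(FILE_EXTENSIONS):
        return "file"
    # Bare targets, including schema-qualified names such as dbo.MY_TABLE,
    # default to a database write.
    return "db"
```

Matching on an explicit extension allowlist rather than on "contains a dot" is what keeps a name like `dbo.MY_TABLE` from being mistaken for a file.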

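The README describes a resolution chain (config → env vars → defaults) and `$$PARAM` substitution in SQL strings. A minimal sketch of both ideas follows. It assumes a config dictionary with a `params` key, and the names `resolve_param` and `substitute_params` are hypothetical; the real `get_param` helper and the generated `.replace()` calls may be structured differently.

```python
import os
import re
from typing import Any, Mapping, Optional

# Matches $$NAME tokens; the exact identifier rules are an assumption.
_PARAM_RE = re.compile(r"\$\$([A-Za-z_][A-Za-z0-9_]*)")


def resolve_param(
    config: Mapping[str, Any], var_name: str, default: Optional[str] = None
) -> Optional[str]:
    """Look up a parameter in config, then environment variables, then default."""
    params = config.get("params", {})  # assumed config layout
    if var_name in params:
        return str(params[var_name])
    if var_name in os.environ:
        return os.environ[var_name]
    return default


def substitute_params(sql: str, config: Mapping[str, Any]) -> str:
    """Replace each $$NAME in sql with its resolved value; leave unknown names as-is."""
    def _replace(match: "re.Match[str]") -> str:
        value = resolve_param(config, match.group(1))
        return match.group(0) if value is None else value

    return _PARAM_RE.sub(_replace, sql)
```

Leaving unresolved `$$NAME` tokens untouched is a design choice of this sketch: it makes a missing parameter visible in the final SQL instead of silently producing an empty string.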