pyteryx 0.1.0__tar.gz

@@ -0,0 +1,45 @@
1
+ # Environments
2
+ .env
3
+ .venv
4
+ env/
5
+ venv/
6
+ ENV/
7
+ env.bak/
8
+ venv.bak/
9
+
10
+ # Build and Distribution
11
+ build/
12
+ develop-eggs/
13
+ dist/
14
+ downloads/
15
+ eggs/
16
+ .eggs/
17
+ lib/
18
+ lib64/
19
+ parts/
20
+ sdist/
21
+ var/
22
+ wheels/
23
+ *.egg-info/
24
+ .installed.cfg
25
+ *.egg
26
+
27
+ # Python and Cache
28
+ __pycache__/
29
+ *.py[cod]
30
+ *$py.class
31
+ .pytest_cache/
32
+ .coverage
33
+ htmlcov/
34
+
35
+ # Editor and IDEs
36
+ .vscode/
37
+ .idea/
38
+ *.swp
39
+ *.swo
40
+
41
+ # Mock data generated during E2E test
42
+ sales.csv
43
+ regions.csv
44
+ cli_test_output.csv
45
+ e2e_demo.py
@@ -0,0 +1,19 @@
1
+ # Changelog
2
+
3
+ All notable changes to this project will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [0.1.0] — 2025-06-28
9
+
10
+ ### Added
11
+
12
+ - **InOut** class — `input_data`, `output_data`, `text_input`, `browse`, `directory`, `date_time_now`
13
+ - **Preparation** class — `filter`, `formula`, `select`, `data_cleansing`, `sort`, `unique`, `sample`, `record_id`, `generate_rows`, `auto_field`, `multi_field_formula`, `multi_row_formula`, `tile`, `imputation`
14
+ - **Join** class — `join`, `join_multiple`, `union`, `find_replace`, `append_fields`, `fuzzy_match`
15
+ - **Transform** class — `summarize`, `transpose`, `cross_tab`, `running_total`, `count_records`
16
+ - **Parse** class — `date_time`, `regex_match`, `regex_parse`, `regex_replace`, `regex_tokenize`, `text_to_columns`, `xml_parse`
17
+ - **Developer** class — `base64_encode`, `base64_decode`, `download`, `column_info`, `dynamic_rename`
18
+ - Full test suite with pytest
19
+ - Type hints and Google-style docstrings on all public methods
pyteryx-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 PyTeryx Contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
pyteryx-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,287 @@
1
+ Metadata-Version: 2.4
2
+ Name: pyteryx
3
+ Version: 0.1.0
4
+ Summary: Replicate Alteryx Designer tools as independent Python functions — migrate from Alteryx to pandas with a familiar API.
5
+ Project-URL: Homepage, https://github.com/pyteryx/pyteryx
6
+ Project-URL: Documentation, https://github.com/pyteryx/pyteryx#readme
7
+ Project-URL: Repository, https://github.com/pyteryx/pyteryx
8
+ Project-URL: Issues, https://github.com/pyteryx/pyteryx/issues
9
+ Project-URL: Changelog, https://github.com/pyteryx/pyteryx/blob/main/CHANGELOG.md
10
+ Author-email: PyTeryx Contributors <contributors@pyteryx.org>
11
+ License-Expression: MIT
12
+ License-File: LICENSE
13
+ Keywords: alteryx,data-engineering,dataframe,etl,migration,pandas
14
+ Classifier: Development Status :: 3 - Alpha
15
+ Classifier: Intended Audience :: Developers
16
+ Classifier: Intended Audience :: Science/Research
17
+ Classifier: License :: OSI Approved :: MIT License
18
+ Classifier: Operating System :: OS Independent
19
+ Classifier: Programming Language :: Python :: 3
20
+ Classifier: Programming Language :: Python :: 3.9
21
+ Classifier: Programming Language :: Python :: 3.10
22
+ Classifier: Programming Language :: Python :: 3.11
23
+ Classifier: Programming Language :: Python :: 3.12
24
+ Classifier: Programming Language :: Python :: 3.13
25
+ Classifier: Topic :: Scientific/Engineering
26
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
27
+ Requires-Python: >=3.9
28
+ Requires-Dist: pandas>=1.5.0
29
+ Requires-Dist: pyyaml>=6.0
30
+ Provides-Extra: dev
31
+ Requires-Dist: build>=1.0; extra == 'dev'
32
+ Requires-Dist: pytest-cov>=4.0; extra == 'dev'
33
+ Requires-Dist: pytest>=7.0; extra == 'dev'
34
+ Requires-Dist: ruff>=0.4.0; extra == 'dev'
35
+ Requires-Dist: twine>=5.0; extra == 'dev'
36
+ Description-Content-Type: text/markdown
37
+
38
+ # 🐦 PyTeryx: The Alteryx-to-Python Migration Engine
39
+
40
+ **Replicate Alteryx Designer tools as independent Python functions.**
41
+
42
+ [![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
43
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
44
+ [![PyPI version](https://img.shields.io/pypi/v/pyteryx.svg)](https://pypi.org/project/pyteryx/)
45
+
46
+ ---
47
+
48
+ PyTeryx is a Python package that mirrors every major **Alteryx Designer** tool as a simple, independent function. If you're migrating from Alteryx to Python, PyTeryx gives you a familiar API while leveraging the full power, speed, and ecosystem of **pandas**.
49
+
50
+ > **Note:** This package is not affiliated with Alteryx, Inc. "Alteryx" is a registered trademark of Alteryx, Inc. PyTeryx is an independent, community-driven open-source project.
51
+
52
+ ## 🌟 Why PyTeryx?
53
+
54
+ Migrating from a visual ETL tool like Alteryx to code-first Python is traditionally painful. Data engineers and analysts have to translate complex logic, visual anchors, and proprietary tool configurations into standard pandas operations.
55
+
56
+ **PyTeryx bridges this gap by providing:**
57
+ - **Zero-Friction Migration**: Translating an Alteryx workflow to Python becomes a straightforward 1:1 mapping exercise.
58
+ - **Escape Vendor Lock-In**: Execute your data pipelines anywhere Python runs—locally, on Airflow, AWS Lambda, or Databricks—without expensive proprietary licensing.
59
+ - **Pure Python, Pure Pandas**: Under the hood, PyTeryx leverages optimized `pandas` operations. No bloated dependencies, no proprietary file formats.
60
+ - **CI/CD Ready**: Since your workflows are now standard Python scripts or declarative YAML files, you can version control them, write unit tests, and integrate them into standard CI/CD pipelines.
61
+
62
+ ## ✨ Core Architecture & Features
63
+
64
+ - **1:1 Alteryx Tool Mapping**: Each core Alteryx tool has a corresponding Python function (e.g. `Preparation.filter`, `Join.join`, `Transform.summarize`).
65
+ - **Familiar Output Anchors**: Tools that have multiple output anchors in Alteryx return tuples in Python.
66
+ - `Filter` returns `(True_Data, False_Data)`
67
+ - `Join` returns `(Left_Unjoined, Joined, Right_Unjoined)`
68
+ - `Unique` returns `(Unique_Records, Duplicate_Records)`
69
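+ - **Pure Functions**: All methods are `@staticmethod`. There is no hidden instance state, and the only side effects come from tools whose job is I/O (for example, `InOut.output_data` writing a file or `InOut.browse` printing to stdout).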
70
+ - **Immutable by Default**: Original DataFrames are never mutated. Every tool returns a brand new DataFrame.
71
+ - **Type-Hinted & Documented**: Full type annotations and Google-style docstrings enable rich IDE autocomplete.
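The two-anchor pattern described above can be sketched in plain pandas. This is only an illustration of the idea, not PyTeryx's actual implementation: the helper name `filter_sketch` is hypothetical, and it assumes the condition string is something `DataFrame.eval` can evaluate to a boolean mask.

```python
import pandas as pd


def filter_sketch(df: pd.DataFrame, condition: str):
    """Hypothetical helper: split df into (true_df, false_df) by a condition string."""
    # Evaluate the expression against the column names; yields a boolean Series.
    mask = df.eval(condition)
    # Copy both halves so the caller's DataFrame is never mutated.
    true_df = df[mask].copy()
    false_df = df[~mask].copy()
    return true_df, false_df


sales = pd.DataFrame({"Revenue": [500, 1500, 2500], "Cost": [400, 900, 1000]})
high, low = filter_sketch(sales, "Revenue > 1000")
```

Returning a tuple keeps both anchors available at the call site, so the "False" rows can feed a separate branch of the workflow instead of being discarded.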
72
+
73
+ ## 📦 Installation
74
+
75
+ Install PyTeryx via pip:
76
+
77
+ ```bash
78
+ pip install pyteryx
79
+ ```
80
+
81
+ *(Requires Python 3.9+ and pandas 1.5.0+)*
82
+
83
+ ---
84
+
85
+ ## 🚀 How to Use PyTeryx
86
+
87
+ PyTeryx is designed for both software engineers (Python API) and data analysts (Declarative YAML).
88
+
89
+ ### Option 1: Python API (For Developers)
90
+
91
+ Call PyTeryx tools on pandas DataFrames using the familiar Alteryx tool names and behaviors. Each tool's output can be passed directly to the next tool.
92
+
93
+ ```python
+ from pyteryx import InOut, Preparation, Join, Transform, Parse, Developer
+
+ # 1. Read data (Alteryx: Input Data tool)
+ sales = InOut.input_data("sales.csv")
+ customers = InOut.input_data("customers.csv")
+
+ # 2. Filter rows (Alteryx: Filter tool → T/F anchors)
+ # Returns a tuple representing the True and False output anchors
+ high_revenue, low_revenue = Preparation.filter(sales, "Revenue > 1000")
+
+ # 3. Add a calculated column (Alteryx: Formula tool)
+ high_revenue = Preparation.formula(high_revenue, "Profit", "Revenue - Cost")
+
+ # 4. Join with another dataset (Alteryx: Join tool → L/J/R anchors)
+ # Returns (Left Unjoined, Joined, Right Unjoined)
+ left_only, joined, right_only = Join.join(high_revenue, customers, on="CustomerID")
+
+ # 5. Summarize (Alteryx: Summarize tool)
+ # Supports native Alteryx aggregation aliases!
+ summary = Transform.summarize(
+     joined,
+     group_by="Region",
+     aggregations={
+         "Profit": ["sum", "mean"],
+         "CustomerID": "count distinct"
+     }
+ )
+
+ # 6. Verify data integrity (Alteryx: Test / Expect Equal tool)
+ Developer.test(summary, lambda df: df["Sum_Profit"].sum() > 0, "Profit must be positive!")
+
+ # 7. Output results (Alteryx: Output Data tool)
+ InOut.output_data(summary, "summary.parquet")
+ ```
128
+
129
+ ### Option 2: YAML Pipelines (No-Code ETL)
130
+
131
+ If you prefer a declarative approach, you can build end-to-end data pipelines without writing any Python code by describing them as YAML configuration files and running them with PyTeryx's YAML engine.
132
+
133
+ 1. **Create a `pipeline.yaml`**
134
+ ```yaml
+ name: "Sales Filter Pipeline"
+ steps:
+   - id: "load"
+     tool: "InOut.input_data"
+     args:
+       path: "sales.csv"
+
+   - id: "filter_high"
+     tool: "Preparation.filter"
+     inputs:
+       df: "load"
+     args:
+       condition: "Revenue > 1000"
+
+   - id: "save"
+     tool: "InOut.output_data"
+     inputs:
+       df: "filter_high.0" # The '.0' grabs the first tuple output (the 'True' anchor)
+     args:
+       path: "high_revenue.csv"
+ ```
156
+
157
+ 2. **Run via Command Line Interface (CLI)**
158
+ ```bash
159
+ pyteryx run pipeline.yaml
160
+ ```
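To make the `inputs` references in the pipeline above concrete, here is a minimal sketch of how a reference such as `"load"` or `"filter_high.0"` could be resolved against earlier step results. The function `resolve_input` and the `results` dictionary are hypothetical names used for illustration; the real YAML engine may work differently.

```python
from typing import Any, Dict


def resolve_input(ref: str, results: Dict[str, Any]) -> Any:
    """Hypothetical resolver for 'step_id' or 'step_id.N' references."""
    # Split "filter_high.0" into the step id and an optional tuple index.
    step_id, _, index = ref.partition(".")
    value = results[step_id]
    if index == "":
        # Plain reference: the step produced a single output.
        return value
    # Indexed reference: pick one anchor out of a tuple of outputs.
    return value[int(index)]


# Stand-in strings take the place of DataFrames to keep the sketch dependency-free.
results = {"load": "sales_df", "filter_high": ("true_df", "false_df")}
first_anchor = resolve_input("filter_high.0", results)
```

Under this assumption, `.0` and `.1` on a `Preparation.filter` step would select the True and False anchors respectively.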
161
+
162
+ ---
163
+
164
+ ## 📚 Complete Tool Reference
165
+
166
+ PyTeryx implements the six core Alteryx data-manipulation palettes listed below: In/Out, Preparation, Join, Transform, Parse, and Developer.
167
+
168
+ ### 🔌 In/Out Palette
169
+ Read, write, browse, and generate data.
170
+
171
+ | PyTeryx Method | Alteryx Tool | Description |
172
+ |---|---|---|
173
+ | `InOut.input_data()` | Input Data | Read CSV, Excel, JSON, Parquet (auto-detect format) |
174
+ | `InOut.output_data()` | Output Data | Write DataFrame to CSV, Excel, JSON, Parquet |
175
+ | `InOut.text_input()` | Text Input | Create DataFrame from inline dictionary/list data |
176
+ | `InOut.browse()` | Browse | Display rich summary statistics, types, and head to stdout |
177
+ | `InOut.directory()` | Directory | List files in a path with creation/modification metadata |
178
+ | `InOut.date_time_now()` | DateTime Now | Return the current timestamp as a DataFrame |
179
+
180
+ ### 🔧 Preparation Palette
181
+ Cleanse, filter, sort, and transform rows/columns.
182
+
183
+ | PyTeryx Method | Alteryx Tool | Description |
184
+ |---|---|---|
185
+ | `Preparation.filter()` | Filter | Split data based on condition into `(true_df, false_df)` |
186
+ | `Preparation.formula()` | Formula | Add/update a column using a string expression |
187
+ | `Preparation.select()` | Select | Select, rename, or cast data types of columns |
188
+ | `Preparation.data_cleansing()` | Data Cleansing | Strip whitespace, modify case, remove punctuation/nulls |
189
+ | `Preparation.sort()` | Sort | Sort dataframe by one or multiple columns |
190
+ | `Preparation.unique()` | Unique | Split data into `(unique_df, duplicate_df)` |
191
+ | `Preparation.sample()` | Sample | Extract first N, last N, random N, or percentage |
192
+ | `Preparation.record_id()` | Record ID | Attach an auto-incrementing integer ID column |
193
+ | `Preparation.generate_rows()` | Generate Rows | Create sequential rows programmatically |
194
+ | `Preparation.auto_field()` | Auto Field | Optimize column data types to save memory |
195
+ | `Preparation.multi_field_formula()` | Multi-Field Formula | Apply a single formula across multiple columns |
196
+ | `Preparation.multi_row_formula()` | Multi-Row Formula | Apply formulas that reference previous/next rows |
197
+ | `Preparation.tile()` | Tile | Group data into quantiles or uniform bins |
198
+ | `Preparation.imputation()` | Imputation | Fill missing values (mean, median, mode, or custom value) |
199
+ | `Preparation.create_samples()` | Create Samples | Split data into estimation/validation/holdout sets |
200
+ | `Preparation.date_filter()` | Date Filter | Filter dataset by a specific date range |
201
+ | `Preparation.oversample_field()` | Oversample Field | Perform stratified sampling to balance a target class |
202
+ | `Preparation.rank()` | Rank | Assign numeric ranks (supports group-by ranking) |
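As an illustration of the `(unique_df, duplicate_df)` split listed for `Preparation.unique()`, the following sketch uses `DataFrame.duplicated`. The helper name `unique_sketch` and its `subset` parameter are assumptions made for this example, not the library's documented signature.

```python
import pandas as pd


def unique_sketch(df: pd.DataFrame, subset: list):
    """Hypothetical helper: first occurrence of each key goes to unique_df, the rest to duplicate_df."""
    # Flag every row whose key columns were already seen earlier in the frame.
    is_duplicate = df.duplicated(subset=subset, keep="first")
    unique_df = df[~is_duplicate].copy()
    duplicate_df = df[is_duplicate].copy()
    return unique_df, duplicate_df


orders = pd.DataFrame(
    {"CustomerID": [1, 2, 1, 3, 2], "Amount": [10, 20, 30, 40, 50]}
)
unique_df, duplicate_df = unique_sketch(orders, subset=["CustomerID"])
```

Because the split is by position, sorting the data beforehand controls which record counts as the "unique" one.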
203
+
204
+ ### 🔗 Join Palette
205
+ Blend and match multiple datasets together.
206
+
207
+ | PyTeryx Method | Alteryx Tool | Description |
208
+ |---|---|---|
209
+ | `Join.join()` | Join | Standard join. Returns `(Left_Unjoined, Joined, Right_Unjoined)` |
210
+ | `Join.join_multiple()` | Join Multiple | Merge 3 or more DataFrames on a common key |
211
+ | `Join.union()` | Union | Stack DataFrames vertically (by column name or position) |
212
+ | `Join.find_replace()` | Find Replace | Lookup-based string replacement (or append lookup value) |
213
+ | `Join.append_fields()` | Append Fields | Cross (Cartesian) join to append all rows |
214
+ | `Join.fuzzy_match()` | Fuzzy Match | Approximate string matching (Levenshtein/Jaro-Winkler) |
215
+ | `Join.make_group()` | Make Group | Group relationship keys using DFS connected-components |
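The three-anchor result of `Join.join()` can be approximated in plain pandas as shown below. This is a sketch under simplifying assumptions (a single key column, hypothetical helper name `join_sketch`) and is not taken from the PyTeryx source.

```python
import pandas as pd


def join_sketch(left: pd.DataFrame, right: pd.DataFrame, on: str):
    """Hypothetical helper: return (left_unjoined, joined, right_unjoined) for one key column."""
    # J anchor: rows whose key appears in both inputs.
    joined = left.merge(right, on=on, how="inner")
    # L anchor: left rows whose key has no match on the right.
    left_unjoined = left[~left[on].isin(right[on])].copy()
    # R anchor: right rows whose key has no match on the left.
    right_unjoined = right[~right[on].isin(left[on])].copy()
    return left_unjoined, joined, right_unjoined


sales = pd.DataFrame({"CustomerID": [1, 2, 3], "Revenue": [100, 200, 300]})
customers = pd.DataFrame({"CustomerID": [2, 3, 4], "Region": ["East", "West", "North"]})
left_only, joined, right_only = join_sketch(sales, customers, on="CustomerID")
```

Keeping the unjoined rows in their original column layout, rather than taking them from an outer merge, avoids padding them with null columns from the other side.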
216
+
217
+ ### 📊 Transform Palette
218
+ Reshape, pivot, and aggregate data.
219
+
220
+ | PyTeryx Method | Alteryx Tool | Description |
221
+ |---|---|---|
222
+ | `Transform.summarize()` | Summarize | GroupBy with named aggregations |
223
+ | `Transform.transpose()` | Transpose | Wide-to-long reshaping (unpivot) |
224
+ | `Transform.cross_tab()` | Cross Tab | Long-to-wide reshaping (pivot) |
225
+ | `Transform.running_total()` | Running Total | Cumulative sum calculation (supports grouping) |
226
+ | `Transform.count_records()` | Count Records | Output total row count as a single-value DataFrame |
227
+ | `Transform.arrange()` | Arrange | Manually transpose and rearrange multiple columns |
228
+ | `Transform.make_columns()` | Make Columns | Wrap sequential rows into multiple columns |
229
+ | `Transform.weighted_average()` | Weighted Average | Calculate weighted average (supports grouping) |
230
+
231
+ > **Pro-Tip**: `Transform.summarize()` and `Transform.cross_tab()` natively support familiar Alteryx aggregation aliases like `"count distinct"`, `"count null"`, `"count blank"`, `"concatenate"`, `"longest"`, `"shortest"`, and `"mode"` out-of-the-box (in addition to all standard pandas agg strings like `"sum"` and `"mean"`). Output columns are automatically named `Agg_ColumnName` (e.g. `Sum_Profit`), just like Alteryx.
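A rough sketch of how an alias such as `"count distinct"` and the `Agg_ColumnName` naming could be built on pandas named aggregation follows. The `ALIASES` table, the helper name `summarize_sketch`, and the exact prefix rule (for example `CountDistinct_`) are assumptions for illustration; the names PyTeryx actually emits may differ.

```python
import pandas as pd

# Hypothetical alias table: Alteryx-style name -> pandas aggregation string.
ALIASES = {"count distinct": "nunique"}


def summarize_sketch(df: pd.DataFrame, group_by: str, aggregations: dict) -> pd.DataFrame:
    """Hypothetical helper: group, aggregate, and name outputs as Prefix_Column."""
    named = {}
    for column, aggs in aggregations.items():
        # Accept either a single aggregation string or a list of them.
        for agg in ([aggs] if isinstance(aggs, str) else aggs):
            pandas_agg = ALIASES.get(agg, agg)
            # "count distinct" -> "CountDistinct", "sum" -> "Sum".
            prefix = "".join(word.capitalize() for word in agg.split())
            named[f"{prefix}_{column}"] = pd.NamedAgg(column=column, aggfunc=pandas_agg)
    return df.groupby(group_by, as_index=False).agg(**named)


data = pd.DataFrame(
    {"Region": ["East", "East", "West"], "Profit": [10, 20, 5], "CustomerID": [1, 1, 2]}
)
result = summarize_sketch(
    data, "Region", {"Profit": ["sum", "mean"], "CustomerID": "count distinct"}
)
```

Translating aliases before calling `groupby` keeps the aggregation itself in standard pandas, so any string pandas already understands passes through unchanged.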
232
+
233
+ ### 📝 Parse Palette
234
+ Extract and parse dates, XML, and text.
235
+
236
+ | PyTeryx Method | Alteryx Tool | Description |
237
+ |---|---|---|
238
+ | `Parse.date_time()` | DateTime | Convert strings to DateTime (or infer formats automatically) |
239
+ | `Parse.regex_match()` | RegEx (Match) | Create a boolean flag if a pattern is found |
240
+ | `Parse.regex_parse()` | RegEx (Parse) | Extract regex capture groups directly into new columns |
241
+ | `Parse.regex_replace()` | RegEx (Replace) | Replace text based on a regular expression |
242
+ | `Parse.regex_tokenize()` | RegEx (Tokenize) | Split a string into rows or columns using a regular expression |
243
+ | `Parse.text_to_columns()` | Text to Columns | Split delimited text into columns or rows |
244
+ | `Parse.xml_parse()` | XML Parse | Extract XML nodes, outer XML, and auto-flatten child tags |
245
+
246
+ ### 🛠️ Developer Palette
247
+ Advanced data manipulation utilities.
248
+
249
+ | PyTeryx Method | Alteryx Tool | Description |
250
+ |---|---|---|
251
+ | `Developer.base64_encode()` | Base64 Encoder | Encode string columns to Base64 |
252
+ | `Developer.base64_decode()` | Base64 Encoder | Decode Base64 columns to strings |
253
+ | `Developer.download()` | Download | Perform an HTTP GET request directly into a DataFrame |
254
+ | `Developer.column_info()` | Column Info | Generate a rich metadata/schema report of your data |
255
+ | `Developer.dynamic_rename()` | Dynamic Rename | Rename columns dynamically via a lookup table mapping |
256
+ | `Developer.json_parse()` | JSON Parse | Flatten a JSON string column dynamically into multiple columns |
257
+ | `Developer.dynamic_select()` | Dynamic Select | Subset columns dynamically by data type or regex pattern |
258
+ | `Developer.test()` | Test | Assert a condition evaluates to True on the dataset |
259
+ | `Developer.test_equal()` | Expect Equal | Strictly validate that two DataFrames are identical |
260
+
261
+ *(Note: PyTeryx intentionally skips Alteryx GUI workflow orchestrators like "Detour" and "Block Until Done", as well as external execution nodes like the "R Tool", since PyTeryx workflows natively leverage standard Python script execution flow).*
262
+
263
+ ---
264
+
265
+ ## 🧪 Testing & Development
266
+
267
+ PyTeryx ships with a pytest suite that checks each tool's behavior against its Alteryx counterpart.
268
+
269
+ ```bash
270
+ # Clone the repository
271
+ git clone https://github.com/pyteryx/pyteryx.git
272
+ cd pyteryx
273
+
274
+ # Install development dependencies
275
+ pip install -e ".[dev]"
276
+
277
+ # Run the test suite with coverage
278
+ pytest tests/ -v --cov=pyteryx --cov-report=term-missing
279
+ ```
280
+
281
+ ## 🤝 Contributing
282
+
283
+ Contributions are welcome. PyTeryx is community-driven: if you find an unhandled edge case, want to optimize a pandas operation, or want to add support for a new Alteryx Marketplace tool, please open an issue or submit a pull request on GitHub.
284
+
285
+ ## 📄 License
286
+
287
+ Released under the MIT License; see the [LICENSE](LICENSE) file for details.
@@ -0,0 +1,250 @@
1
+ # 🐦 PyTeryx: The Alteryx-to-Python Migration Engine
2
+
3
+ **Replicate Alteryx Designer tools as independent Python functions.**
4
+
5
+ [![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
6
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
7
+ [![PyPI version](https://img.shields.io/pypi/v/pyteryx.svg)](https://pypi.org/project/pyteryx/)
8
+
9
+ ---
10
+
11
+ PyTeryx is a Python package that mirrors every major **Alteryx Designer** tool as a simple, independent function. If you're migrating from Alteryx to Python, PyTeryx gives you a familiar API while leveraging the full power, speed, and ecosystem of **pandas**.
12
+
13
+ > **Note:** This package is not affiliated with Alteryx, Inc. "Alteryx" is a registered trademark of Alteryx, Inc. PyTeryx is an independent, community-driven open-source project.
14
+
15
+ ## 🌟 Why PyTeryx?
16
+
17
+ Migrating from a visual ETL tool like Alteryx to code-first Python is traditionally painful. Data engineers and analysts have to translate complex logic, visual anchors, and proprietary tool configurations into standard pandas operations.
18
+
19
+ **PyTeryx bridges this gap by providing:**
20
+ - **Zero-Friction Migration**: Translating an Alteryx workflow to Python becomes a straightforward 1:1 mapping exercise.
21
+ - **Escape Vendor Lock-In**: Execute your data pipelines anywhere Python runs—locally, on Airflow, AWS Lambda, or Databricks—without expensive proprietary licensing.
22
+ - **Pure Python, Pure Pandas**: Under the hood, PyTeryx leverages optimized `pandas` operations. No bloated dependencies, no proprietary file formats.
23
+ - **CI/CD Ready**: Since your workflows are now standard Python scripts or declarative YAML files, you can version control them, write unit tests, and integrate them into standard CI/CD pipelines.
24
+
25
+ ## ✨ Core Architecture & Features
26
+
27
+ - **1:1 Alteryx Tool Mapping**: Each core Alteryx tool has a corresponding Python function (e.g. `Preparation.filter`, `Join.join`, `Transform.summarize`).
28
+ - **Familiar Output Anchors**: Tools that have multiple output anchors in Alteryx return tuples in Python.
29
+ - `Filter` returns `(True_Data, False_Data)`
30
+ - `Join` returns `(Left_Unjoined, Joined, Right_Unjoined)`
31
+ - `Unique` returns `(Unique_Records, Duplicate_Records)`
32
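+ - **Pure Functions**: All methods are `@staticmethod`. There is no hidden instance state, and the only side effects come from tools whose job is I/O (for example, `InOut.output_data` writing a file or `InOut.browse` printing to stdout).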
33
+ - **Immutable by Default**: Original DataFrames are never mutated. Every tool returns a brand new DataFrame.
34
+ - **Type-Hinted & Documented**: Full type annotations and Google-style docstrings enable rich IDE autocomplete.
35
+
36
+ ## 📦 Installation
37
+
38
+ Install PyTeryx via pip:
39
+
40
+ ```bash
41
+ pip install pyteryx
42
+ ```
43
+
44
+ *(Requires Python 3.9+ and pandas 1.5.0+)*
45
+
46
+ ---
47
+
48
+ ## 🚀 How to Use PyTeryx
49
+
50
+ PyTeryx is designed for both software engineers (Python API) and data analysts (Declarative YAML).
51
+
52
+ ### Option 1: Python API (For Developers)
53
+
54
+ Call PyTeryx tools on pandas DataFrames using the familiar Alteryx tool names and behaviors. Each tool's output can be passed directly to the next tool.
55
+
56
+ ```python
+ from pyteryx import InOut, Preparation, Join, Transform, Parse, Developer
+
+ # 1. Read data (Alteryx: Input Data tool)
+ sales = InOut.input_data("sales.csv")
+ customers = InOut.input_data("customers.csv")
+
+ # 2. Filter rows (Alteryx: Filter tool → T/F anchors)
+ # Returns a tuple representing the True and False output anchors
+ high_revenue, low_revenue = Preparation.filter(sales, "Revenue > 1000")
+
+ # 3. Add a calculated column (Alteryx: Formula tool)
+ high_revenue = Preparation.formula(high_revenue, "Profit", "Revenue - Cost")
+
+ # 4. Join with another dataset (Alteryx: Join tool → L/J/R anchors)
+ # Returns (Left Unjoined, Joined, Right Unjoined)
+ left_only, joined, right_only = Join.join(high_revenue, customers, on="CustomerID")
+
+ # 5. Summarize (Alteryx: Summarize tool)
+ # Supports native Alteryx aggregation aliases!
+ summary = Transform.summarize(
+     joined,
+     group_by="Region",
+     aggregations={
+         "Profit": ["sum", "mean"],
+         "CustomerID": "count distinct"
+     }
+ )
+
+ # 6. Verify data integrity (Alteryx: Test / Expect Equal tool)
+ Developer.test(summary, lambda df: df["Sum_Profit"].sum() > 0, "Profit must be positive!")
+
+ # 7. Output results (Alteryx: Output Data tool)
+ InOut.output_data(summary, "summary.parquet")
+ ```
91
+
92
+ ### Option 2: YAML Pipelines (No-Code ETL)
93
+
94
+ If you prefer a declarative approach, you can build end-to-end data pipelines without writing any Python code by describing them as YAML configuration files and running them with PyTeryx's YAML engine.
95
+
96
+ 1. **Create a `pipeline.yaml`**
97
+ ```yaml
+ name: "Sales Filter Pipeline"
+ steps:
+   - id: "load"
+     tool: "InOut.input_data"
+     args:
+       path: "sales.csv"
+
+   - id: "filter_high"
+     tool: "Preparation.filter"
+     inputs:
+       df: "load"
+     args:
+       condition: "Revenue > 1000"
+
+   - id: "save"
+     tool: "InOut.output_data"
+     inputs:
+       df: "filter_high.0" # The '.0' grabs the first tuple output (the 'True' anchor)
+     args:
+       path: "high_revenue.csv"
+ ```
119
+
120
+ 2. **Run via Command Line Interface (CLI)**
121
+ ```bash
122
+ pyteryx run pipeline.yaml
123
+ ```
124
+
125
+ ---
126
+
127
+ ## 📚 Complete Tool Reference
128
+
129
+ PyTeryx implements the six core Alteryx data-manipulation palettes listed below: In/Out, Preparation, Join, Transform, Parse, and Developer.
130
+
131
+ ### 🔌 In/Out Palette
132
+ Read, write, browse, and generate data.
133
+
134
+ | PyTeryx Method | Alteryx Tool | Description |
135
+ |---|---|---|
136
+ | `InOut.input_data()` | Input Data | Read CSV, Excel, JSON, Parquet (auto-detect format) |
137
+ | `InOut.output_data()` | Output Data | Write DataFrame to CSV, Excel, JSON, Parquet |
138
+ | `InOut.text_input()` | Text Input | Create DataFrame from inline dictionary/list data |
139
+ | `InOut.browse()` | Browse | Display rich summary statistics, types, and head to stdout |
140
+ | `InOut.directory()` | Directory | List files in a path with creation/modification metadata |
141
+ | `InOut.date_time_now()` | DateTime Now | Return the current timestamp as a DataFrame |
142
+
143
+ ### 🔧 Preparation Palette
144
+ Cleanse, filter, sort, and transform rows/columns.
145
+
146
+ | PyTeryx Method | Alteryx Tool | Description |
147
+ |---|---|---|
148
+ | `Preparation.filter()` | Filter | Split data based on condition into `(true_df, false_df)` |
149
+ | `Preparation.formula()` | Formula | Add/update a column using a string expression |
150
+ | `Preparation.select()` | Select | Select, rename, or cast data types of columns |
151
+ | `Preparation.data_cleansing()` | Data Cleansing | Strip whitespace, modify case, remove punctuation/nulls |
152
+ | `Preparation.sort()` | Sort | Sort dataframe by one or multiple columns |
153
+ | `Preparation.unique()` | Unique | Split data into `(unique_df, duplicate_df)` |
154
+ | `Preparation.sample()` | Sample | Extract first N, last N, random N, or percentage |
155
+ | `Preparation.record_id()` | Record ID | Attach an auto-incrementing integer ID column |
156
+ | `Preparation.generate_rows()` | Generate Rows | Create sequential rows programmatically |
157
+ | `Preparation.auto_field()` | Auto Field | Optimize column data types to save memory |
158
+ | `Preparation.multi_field_formula()` | Multi-Field Formula | Apply a single formula across multiple columns |
159
+ | `Preparation.multi_row_formula()` | Multi-Row Formula | Apply formulas that reference previous/next rows |
160
+ | `Preparation.tile()` | Tile | Group data into quantiles or uniform bins |
161
+ | `Preparation.imputation()` | Imputation | Fill missing values (mean, median, mode, or custom value) |
162
+ | `Preparation.create_samples()` | Create Samples | Split data into estimation/validation/holdout sets |
163
+ | `Preparation.date_filter()` | Date Filter | Filter dataset by a specific date range |
164
+ | `Preparation.oversample_field()` | Oversample Field | Perform stratified sampling to balance a target class |
165
+ | `Preparation.rank()` | Rank | Assign numeric ranks (supports group-by ranking) |
166
+
167
+ ### 🔗 Join Palette
168
+ Blend and match multiple datasets together.
169
+
170
+ | PyTeryx Method | Alteryx Tool | Description |
171
+ |---|---|---|
172
+ | `Join.join()` | Join | Standard join. Returns `(Left_Unjoined, Joined, Right_Unjoined)` |
173
+ | `Join.join_multiple()` | Join Multiple | Merge 3 or more DataFrames on a common key |
174
+ | `Join.union()` | Union | Stack DataFrames vertically (by column name or position) |
175
+ | `Join.find_replace()` | Find Replace | Lookup-based string replacement (or append lookup value) |
176
+ | `Join.append_fields()` | Append Fields | Cross (Cartesian) join to append all rows |
177
+ | `Join.fuzzy_match()` | Fuzzy Match | Approximate string matching (Levenshtein/Jaro-Winkler) |
178
+ | `Join.make_group()` | Make Group | Group relationship keys using DFS connected-components |
179
+
180
+ ### 📊 Transform Palette
181
+ Reshape, pivot, and aggregate data.
182
+
183
+ | PyTeryx Method | Alteryx Tool | Description |
184
+ |---|---|---|
185
+ | `Transform.summarize()` | Summarize | GroupBy with named aggregations |
186
+ | `Transform.transpose()` | Transpose | Wide-to-long reshaping (unpivot) |
187
+ | `Transform.cross_tab()` | Cross Tab | Long-to-wide reshaping (pivot) |
188
+ | `Transform.running_total()` | Running Total | Cumulative sum calculation (supports grouping) |
189
+ | `Transform.count_records()` | Count Records | Output total row count as a single-value DataFrame |
190
+ | `Transform.arrange()` | Arrange | Manually transpose and rearrange multiple columns |
191
+ | `Transform.make_columns()` | Make Columns | Wrap sequential rows into multiple columns |
192
+ | `Transform.weighted_average()` | Weighted Average | Calculate weighted average (supports grouping) |
193
+
194
+ > **Pro-Tip**: `Transform.summarize()` and `Transform.cross_tab()` natively support familiar Alteryx aggregation aliases like `"count distinct"`, `"count null"`, `"count blank"`, `"concatenate"`, `"longest"`, `"shortest"`, and `"mode"` out-of-the-box (in addition to all standard pandas agg strings like `"sum"` and `"mean"`). Output columns are automatically named `Agg_ColumnName` (e.g. `Sum_Profit`), just like Alteryx.
195
+
196
+ ### 📝 Parse Palette
197
+ Extract and parse dates, XML, and text.
198
+
199
+ | PyTeryx Method | Alteryx Tool | Description |
200
+ |---|---|---|
201
+ | `Parse.date_time()` | DateTime | Convert strings to DateTime (or infer formats automatically) |
202
+ | `Parse.regex_match()` | RegEx (Match) | Create a boolean flag if a pattern is found |
203
+ | `Parse.regex_parse()` | RegEx (Parse) | Extract regex capture groups directly into new columns |
204
+ | `Parse.regex_replace()` | RegEx (Replace) | Replace text based on a regular expression |
205
+ | `Parse.regex_tokenize()` | RegEx (Tokenize) | Split a string into rows or columns using a regular expression |
206
+ | `Parse.text_to_columns()` | Text to Columns | Split delimited text into columns or rows |
207
+ | `Parse.xml_parse()` | XML Parse | Extract XML nodes, outer XML, and auto-flatten child tags |
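To show what "extract regex capture groups directly into new columns" can look like in pandas, here is a small sketch built on `Series.str.extract`. The helper name `regex_parse_sketch` and its `output_columns` parameter are hypothetical and chosen only for this example.

```python
import pandas as pd


def regex_parse_sketch(df: pd.DataFrame, column: str, pattern: str, output_columns: list) -> pd.DataFrame:
    """Hypothetical helper: append one new column per capture group in pattern."""
    # One output column per capture group; rows that do not match get NaN.
    extracted = df[column].str.extract(pattern, expand=True)
    extracted.columns = output_columns
    # concat builds a new DataFrame, so the input is left untouched.
    return pd.concat([df, extracted], axis=1)


codes = pd.DataFrame({"Code": ["AB-123", "CD-456", "bad"]})
parsed = regex_parse_sketch(codes, "Code", r"([A-Z]+)-(\d+)", ["Prefix", "Number"])
```

Extracted values stay strings, so a follow-up cast (for example with a Select-style step) is needed before doing arithmetic on `Number`.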
208
+
209
+ ### 🛠️ Developer Palette
210
+ Advanced data manipulation utilities.
211
+
212
+ | PyTeryx Method | Alteryx Tool | Description |
213
+ |---|---|---|
214
+ | `Developer.base64_encode()` | Base64 Encoder | Encode string columns to Base64 |
215
+ | `Developer.base64_decode()` | Base64 Encoder | Decode Base64 columns to strings |
216
+ | `Developer.download()` | Download | Perform an HTTP GET request directly into a DataFrame |
217
+ | `Developer.column_info()` | Column Info | Generate a rich metadata/schema report of your data |
218
+ | `Developer.dynamic_rename()` | Dynamic Rename | Rename columns dynamically via a lookup table mapping |
219
+ | `Developer.json_parse()` | JSON Parse | Flatten a JSON string column dynamically into multiple columns |
220
+ | `Developer.dynamic_select()` | Dynamic Select | Subset columns dynamically by data type or regex pattern |
221
+ | `Developer.test()` | Test | Assert a condition evaluates to True on the dataset |
222
+ | `Developer.test_equal()` | Expect Equal | Strictly validate that two DataFrames are identical |
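The Base64 rows in the table can be illustrated with the standard-library `base64` module applied column-wise. The helper names below are hypothetical, and the sketch assumes the column holds non-null UTF-8 strings.

```python
import base64

import pandas as pd


def base64_encode_sketch(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with one string column Base64-encoded."""
    out = df.copy()
    out[column] = out[column].map(
        lambda text: base64.b64encode(text.encode("utf-8")).decode("ascii")
    )
    return out


def base64_decode_sketch(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: inverse of base64_encode_sketch."""
    out = df.copy()
    out[column] = out[column].map(
        lambda text: base64.b64decode(text).decode("utf-8")
    )
    return out


raw = pd.DataFrame({"Token": ["hello", "hi"]})
encoded = base64_encode_sketch(raw, "Token")
decoded = base64_decode_sketch(encoded, "Token")
```

Working on a copy matches the "Immutable by Default" behavior described earlier: `raw` still holds the original strings after both calls.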
223
+
224
+ *(Note: PyTeryx intentionally skips Alteryx GUI workflow orchestrators like "Detour" and "Block Until Done", as well as external execution nodes like the "R Tool", since PyTeryx workflows natively leverage standard Python script execution flow).*
225
+
226
+ ---
227
+
228
+ ## 🧪 Testing & Development
229
+
230
+ PyTeryx ships with a pytest suite that checks each tool's behavior against its Alteryx counterpart.
231
+
232
+ ```bash
233
+ # Clone the repository
234
+ git clone https://github.com/pyteryx/pyteryx.git
235
+ cd pyteryx
236
+
237
+ # Install development dependencies
238
+ pip install -e ".[dev]"
239
+
240
+ # Run the test suite with coverage
241
+ pytest tests/ -v --cov=pyteryx --cov-report=term-missing
242
+ ```
243
+
244
+ ## 🤝 Contributing
245
+
246
+ Contributions are welcome. PyTeryx is community-driven: if you find an unhandled edge case, want to optimize a pandas operation, or want to add support for a new Alteryx Marketplace tool, please open an issue or submit a pull request on GitHub.
247
+
248
+ ## 📄 License
249
+
250
+ Released under the MIT License; see the [LICENSE](LICENSE) file for details.