pyteryx 0.1.0__tar.gz
- pyteryx-0.1.0/.gitignore +45 -0
- pyteryx-0.1.0/CHANGELOG.md +19 -0
- pyteryx-0.1.0/LICENSE +21 -0
- pyteryx-0.1.0/PKG-INFO +287 -0
- pyteryx-0.1.0/README.md +250 -0
- pyteryx-0.1.0/pyproject.toml +71 -0
- pyteryx-0.1.0/sample_pipeline.yaml +21 -0
- pyteryx-0.1.0/src/pyteryx/__init__.py +36 -0
- pyteryx-0.1.0/src/pyteryx/_validators.py +81 -0
- pyteryx-0.1.0/src/pyteryx/_version.py +3 -0
- pyteryx-0.1.0/src/pyteryx/developer.py +373 -0
- pyteryx-0.1.0/src/pyteryx/in_out.py +314 -0
- pyteryx-0.1.0/src/pyteryx/join.py +387 -0
- pyteryx-0.1.0/src/pyteryx/parse.py +325 -0
- pyteryx-0.1.0/src/pyteryx/pipeline.py +136 -0
- pyteryx-0.1.0/src/pyteryx/preparation.py +805 -0
- pyteryx-0.1.0/src/pyteryx/transform.py +437 -0
- pyteryx-0.1.0/tests/conftest.py +92 -0
- pyteryx-0.1.0/tests/test_developer.py +172 -0
- pyteryx-0.1.0/tests/test_in_out.py +138 -0
- pyteryx-0.1.0/tests/test_join.py +170 -0
- pyteryx-0.1.0/tests/test_parse.py +142 -0
- pyteryx-0.1.0/tests/test_pipeline.py +148 -0
- pyteryx-0.1.0/tests/test_preparation.py +456 -0
- pyteryx-0.1.0/tests/test_transform.py +218 -0
pyteryx-0.1.0/.gitignore
ADDED
|
@@ -0,0 +1,45 @@
|
|
|
1
|
+
# Environments
|
|
2
|
+
.env
|
|
3
|
+
.venv
|
|
4
|
+
env/
|
|
5
|
+
venv/
|
|
6
|
+
ENV/
|
|
7
|
+
env.bak/
|
|
8
|
+
venv.bak/
|
|
9
|
+
|
|
10
|
+
# Build and Distribution
|
|
11
|
+
build/
|
|
12
|
+
develop-eggs/
|
|
13
|
+
dist/
|
|
14
|
+
downloads/
|
|
15
|
+
eggs/
|
|
16
|
+
.eggs/
|
|
17
|
+
lib/
|
|
18
|
+
lib64/
|
|
19
|
+
parts/
|
|
20
|
+
sdist/
|
|
21
|
+
var/
|
|
22
|
+
wheels/
|
|
23
|
+
*.egg-info/
|
|
24
|
+
.installed.cfg
|
|
25
|
+
*.egg
|
|
26
|
+
|
|
27
|
+
# Python and Cache
|
|
28
|
+
__pycache__/
|
|
29
|
+
*.py[cod]
|
|
30
|
+
*$py.class
|
|
31
|
+
.pytest_cache/
|
|
32
|
+
.coverage
|
|
33
|
+
htmlcov/
|
|
34
|
+
|
|
35
|
+
# Editor and IDEs
|
|
36
|
+
.vscode/
|
|
37
|
+
.idea/
|
|
38
|
+
*.swp
|
|
39
|
+
*.swo
|
|
40
|
+
|
|
41
|
+
# Mock data generated during E2E test
|
|
42
|
+
sales.csv
|
|
43
|
+
regions.csv
|
|
44
|
+
cli_test_output.csv
|
|
45
|
+
e2e_demo.py
|
pyteryx-0.1.0/CHANGELOG.md
ADDED
|
@@ -0,0 +1,19 @@
|
|
|
1
|
+
# Changelog
|
|
2
|
+
|
|
3
|
+
All notable changes to this project will be documented in this file.
|
|
4
|
+
|
|
5
|
+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
|
|
6
|
+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
7
|
+
|
|
8
|
+
## [0.1.0] — 2025-06-28
|
|
9
|
+
|
|
10
|
+
### Added
|
|
11
|
+
|
|
12
|
+
- **InOut** class — `input_data`, `output_data`, `text_input`, `browse`, `directory`, `date_time_now`
|
|
13
|
+
- **Preparation** class — `filter`, `formula`, `select`, `data_cleansing`, `sort`, `unique`, `sample`, `record_id`, `generate_rows`, `auto_field`, `multi_field_formula`, `multi_row_formula`, `tile`, `imputation`
|
|
14
|
+
- **Join** class — `join`, `join_multiple`, `union`, `find_replace`, `append_fields`, `fuzzy_match`
|
|
15
|
+
- **Transform** class — `summarize`, `transpose`, `cross_tab`, `running_total`, `count_records`
|
|
16
|
+
- **Parse** class — `date_time`, `regex_match`, `regex_parse`, `regex_replace`, `regex_tokenize`, `text_to_columns`, `xml_parse`
|
|
17
|
+
- **Developer** class — `base64_encode`, `base64_decode`, `download`, `column_info`, `dynamic_rename`
|
|
18
|
+
- Full test suite with pytest
|
|
19
|
+
- Type hints and Google-style docstrings on all public methods
|
pyteryx-0.1.0/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2025 PyTeryx Contributors
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
pyteryx-0.1.0/PKG-INFO
ADDED
|
@@ -0,0 +1,287 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: pyteryx
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Replicate Alteryx Designer tools as independent Python functions — migrate from Alteryx to pandas with a familiar API.
|
|
5
|
+
Project-URL: Homepage, https://github.com/pyteryx/pyteryx
|
|
6
|
+
Project-URL: Documentation, https://github.com/pyteryx/pyteryx#readme
|
|
7
|
+
Project-URL: Repository, https://github.com/pyteryx/pyteryx
|
|
8
|
+
Project-URL: Issues, https://github.com/pyteryx/pyteryx/issues
|
|
9
|
+
Project-URL: Changelog, https://github.com/pyteryx/pyteryx/blob/main/CHANGELOG.md
|
|
10
|
+
Author-email: PyTeryx Contributors <contributors@pyteryx.org>
|
|
11
|
+
License-Expression: MIT
|
|
12
|
+
License-File: LICENSE
|
|
13
|
+
Keywords: alteryx,data-engineering,dataframe,etl,migration,pandas
|
|
14
|
+
Classifier: Development Status :: 3 - Alpha
|
|
15
|
+
Classifier: Intended Audience :: Developers
|
|
16
|
+
Classifier: Intended Audience :: Science/Research
|
|
17
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
18
|
+
Classifier: Operating System :: OS Independent
|
|
19
|
+
Classifier: Programming Language :: Python :: 3
|
|
20
|
+
Classifier: Programming Language :: Python :: 3.9
|
|
21
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
22
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
23
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
24
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
25
|
+
Classifier: Topic :: Scientific/Engineering
|
|
26
|
+
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
27
|
+
Requires-Python: >=3.9
|
|
28
|
+
Requires-Dist: pandas>=1.5.0
|
|
29
|
+
Requires-Dist: pyyaml>=6.0
|
|
30
|
+
Provides-Extra: dev
|
|
31
|
+
Requires-Dist: build>=1.0; extra == 'dev'
|
|
32
|
+
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
|
|
33
|
+
Requires-Dist: pytest>=7.0; extra == 'dev'
|
|
34
|
+
Requires-Dist: ruff>=0.4.0; extra == 'dev'
|
|
35
|
+
Requires-Dist: twine>=5.0; extra == 'dev'
|
|
36
|
+
Description-Content-Type: text/markdown
|
|
37
|
+
|
|
38
|
+
# 🐦 PyTeryx: The Alteryx-to-Python Migration Engine
|
|
39
|
+
|
|
40
|
+
**Replicate Alteryx Designer tools as independent Python functions.**
|
|
41
|
+
|
|
42
|
+
[](https://www.python.org/downloads/)
|
|
43
|
+
[](LICENSE)
|
|
44
|
+
[](https://pypi.org/project/pyteryx/)
|
|
45
|
+
|
|
46
|
+
---
|
|
47
|
+
|
|
48
|
+
PyTeryx is a Python package that mirrors every major **Alteryx Designer** tool as a simple, independent function. If you're migrating from Alteryx to Python, PyTeryx gives you a familiar API while leveraging the full power, speed, and ecosystem of **pandas**.
|
|
49
|
+
|
|
50
|
+
> **Note:** This package is not affiliated with Alteryx, Inc. "Alteryx" is a registered trademark of Alteryx, Inc. PyTeryx is an independent, community-driven open-source project.
|
|
51
|
+
|
|
52
|
+
## 🌟 Why PyTeryx?
|
|
53
|
+
|
|
54
|
+
Migrating from a visual ETL tool like Alteryx to code-first Python is traditionally painful. Data engineers and analysts have to translate complex logic, visual anchors, and proprietary tool configurations into standard Pandas operations.
|
|
55
|
+
|
|
56
|
+
**PyTeryx bridges this gap by providing:**
|
|
57
|
+
- **Zero-Friction Migration**: Translating an Alteryx workflow to Python becomes a straightforward 1:1 mapping exercise.
|
|
58
|
+
- **Escape Vendor Lock-In**: Execute your data pipelines anywhere Python runs—locally, on Airflow, AWS Lambda, or Databricks—without expensive proprietary licensing.
|
|
59
|
+
- **Pure Python, Pure Pandas**: Under the hood, PyTeryx leverages optimized `pandas` operations. No bloated dependencies, no proprietary file formats.
|
|
60
|
+
- **CI/CD Ready**: Since your workflows are now standard Python scripts or declarative YAML files, you can version control them, write unit tests, and integrate them into standard CI/CD pipelines.
|
|
61
|
+
|
|
62
|
+
## ✨ Core Architecture & Features
|
|
63
|
+
|
|
64
|
+
- **1:1 Alteryx Tool Mapping**: Each core Alteryx tool has a corresponding Python function (e.g. `Preparation.filter`, `Join.join`, `Transform.summarize`).
|
|
65
|
+
- **Familiar Output Anchors**: Tools that have multiple output anchors in Alteryx return tuples in Python.
|
|
66
|
+
- `Filter` returns `(True_Data, False_Data)`
|
|
67
|
+
- `Join` returns `(Left_Unjoined, Joined, Right_Unjoined)`
|
|
68
|
+
- `Unique` returns `(Unique_Records, Duplicate_Records)`
|
|
69
|
+
- **Pure Functions**: All methods are `@staticmethod`. There is no hidden instance state and no side effects.
|
|
70
|
+
- **Immutable by Default**: Original DataFrames are never mutated. Every tool returns a brand new DataFrame.
|
|
71
|
+
- **Type-Hinted & Documented**: Full type annotations and Google-style docstrings enable rich IDE autocomplete.
|
|
72
|
+
|
|
73
|
+
## 📦 Installation
|
|
74
|
+
|
|
75
|
+
Install PyTeryx via pip:
|
|
76
|
+
|
|
77
|
+
```bash
|
|
78
|
+
pip install pyteryx
|
|
79
|
+
```
|
|
80
|
+
|
|
81
|
+
*(Requires Python 3.9+ and pandas 1.5.0+)*
|
|
82
|
+
|
|
83
|
+
---
|
|
84
|
+
|
|
85
|
+
## 🚀 How to Use PyTeryx
|
|
86
|
+
|
|
87
|
+
PyTeryx is designed for both software engineers (Python API) and data analysts (Declarative YAML).
|
|
88
|
+
|
|
89
|
+
### Option 1: Python API (For Developers)
|
|
90
|
+
|
|
91
|
+
Use PyTeryx exactly like you use pandas, but with the familiar Alteryx tool names and behaviors. You can chain these together natively.
|
|
92
|
+
|
|
93
|
+
```python
|
|
94
|
+
from pyteryx import InOut, Preparation, Join, Transform, Parse, Developer
|
|
95
|
+
|
|
96
|
+
# 1. Read data (Alteryx: Input Data tool)
|
|
97
|
+
sales = InOut.input_data("sales.csv")
|
|
98
|
+
customers = InOut.input_data("customers.csv")
|
|
99
|
+
|
|
100
|
+
# 2. Filter rows (Alteryx: Filter tool → T/F anchors)
|
|
101
|
+
# Returns a tuple representing the True and False output anchors
|
|
102
|
+
high_revenue, low_revenue = Preparation.filter(sales, "Revenue > 1000")
|
|
103
|
+
|
|
104
|
+
# 3. Add a calculated column (Alteryx: Formula tool)
|
|
105
|
+
high_revenue = Preparation.formula(high_revenue, "Profit", "Revenue - Cost")
|
|
106
|
+
|
|
107
|
+
# 4. Join with another dataset (Alteryx: Join tool → L/J/R anchors)
|
|
108
|
+
# Returns (Left Unjoined, Joined, Right Unjoined)
|
|
109
|
+
left_only, joined, right_only = Join.join(high_revenue, customers, on="CustomerID")
|
|
110
|
+
|
|
111
|
+
# 5. Summarize (Alteryx: Summarize tool)
|
|
112
|
+
# Supports native Alteryx aggregation aliases!
|
|
113
|
+
summary = Transform.summarize(
|
|
114
|
+
joined,
|
|
115
|
+
group_by="Region",
|
|
116
|
+
aggregations={
|
|
117
|
+
"Profit": ["sum", "mean"],
|
|
118
|
+
"CustomerID": "count distinct"
|
|
119
|
+
}
|
|
120
|
+
)
|
|
121
|
+
|
|
122
|
+
# 6. Verify data integrity (Alteryx: Test / Expect Equal tool)
|
|
123
|
+
Developer.test(summary, lambda df: df["Sum_Profit"].sum() > 0, "Profit must be positive!")
|
|
124
|
+
|
|
125
|
+
# 7. Output results (Alteryx: Output Data tool)
|
|
126
|
+
InOut.output_data(summary, "summary.parquet")
|
|
127
|
+
```
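For readers mapping step 4 back to plain pandas, the three Join anchors can be approximated with an outer merge. This is a minimal sketch, not PyTeryx's implementation: `split_join` is a hypothetical helper, the sample data is invented, and it assumes the two inputs share no non-key column names.

```python
import pandas as pd


def split_join(left, right, on):
    """Hypothetical helper: emulate the L/J/R anchors with an outer merge."""
    # indicator=True adds a "_merge" column: left_only, both, or right_only.
    merged = left.merge(right, on=on, how="outer", indicator=True)

    def part(flag, columns):
        rows = merged[merged["_merge"] == flag]
        return rows[list(columns)].reset_index(drop=True)

    all_columns = [c for c in merged.columns if c != "_merge"]
    return (
        part("left_only", left.columns),    # L anchor: unmatched left rows
        part("both", all_columns),          # J anchor: matched rows
        part("right_only", right.columns),  # R anchor: unmatched right rows
    )


# Invented sample data for illustration only.
sales = pd.DataFrame({"CustomerID": [1, 2, 3], "Revenue": [1500, 2000, 1200]})
customers = pd.DataFrame({"CustomerID": [2, 3, 4], "Region": ["East", "West", "North"]})

left_only, joined, right_only = split_join(sales, customers, on="CustomerID")
```

The outer merge keeps every row from both sides, so the indicator column is enough to route each row to exactly one of the three outputs.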
|
|
128
|
+
|
|
129
|
+
### Option 2: YAML Pipelines (No-Code ETL)
|
|
130
|
+
|
|
131
|
+
For users who prefer a declarative approach, you can build end-to-end data pipelines without writing any Python code using PyTeryx's declarative YAML engine. This is perfect for defining pipelines as configuration files.
|
|
132
|
+
|
|
133
|
+
1. **Create a `pipeline.yaml`**
|
|
134
|
+
```yaml
|
|
135
|
+
name: "Sales Filter Pipeline"
|
|
136
|
+
steps:
|
|
137
|
+
- id: "load"
|
|
138
|
+
tool: "InOut.input_data"
|
|
139
|
+
args:
|
|
140
|
+
path: "sales.csv"
|
|
141
|
+
|
|
142
|
+
- id: "filter_high"
|
|
143
|
+
tool: "Preparation.filter"
|
|
144
|
+
inputs:
|
|
145
|
+
df: "load"
|
|
146
|
+
args:
|
|
147
|
+
condition: "Revenue > 1000"
|
|
148
|
+
|
|
149
|
+
- id: "save"
|
|
150
|
+
tool: "InOut.output_data"
|
|
151
|
+
inputs:
|
|
152
|
+
df: "filter_high.0" # The '.0' grabs the first tuple output (the 'True' anchor)
|
|
153
|
+
args:
|
|
154
|
+
path: "high_revenue.csv"
|
|
155
|
+
```
|
|
156
|
+
|
|
157
|
+
2. **Run via Command Line Interface (CLI)**
|
|
158
|
+
```bash
|
|
159
|
+
pyteryx run pipeline.yaml
|
|
160
|
+
```
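To make the `inputs` wiring and the `filter_high.0` reference concrete, here is a small sketch of how such references could be resolved. It is an assumption about the mechanism, not the code in `pipeline.py`: `resolve_ref`, `run_steps`, and the stand-in tools are hypothetical, and plain lists replace DataFrames so the sketch needs only the standard library.

```python
def resolve_ref(ref, results):
    """Resolve "step" or "step.N" against the outputs of earlier steps."""
    step_id, _, index = ref.partition(".")
    output = results[step_id]
    # "filter_high.0" selects element 0 of a tuple output (the True anchor).
    return output[int(index)] if index else output


def run_steps(steps, tools):
    """Run steps in order, feeding earlier outputs into later inputs."""
    results = {}
    for step in steps:
        kwargs = dict(step.get("args", {}))
        for param, ref in step.get("inputs", {}).items():
            kwargs[param] = resolve_ref(ref, results)
        results[step["id"]] = tools[step["tool"]](**kwargs)
    return results


# Hypothetical stand-in tools; they are not PyTeryx functions.
saved = {}


def fake_input(path):
    return [500, 1500, 2500]


def fake_filter(df, threshold):
    return ([v for v in df if v > threshold], [v for v in df if v <= threshold])


def fake_output(df, path):
    saved[path] = df


tools = {"load_rows": fake_input, "split_rows": fake_filter, "save_rows": fake_output}
steps = [
    {"id": "load", "tool": "load_rows", "args": {"path": "sales.csv"}},
    {"id": "filter_high", "tool": "split_rows",
     "inputs": {"df": "load"}, "args": {"threshold": 1000}},
    {"id": "save", "tool": "save_rows",
     "inputs": {"df": "filter_high.0"}, "args": {"path": "high_revenue.csv"}},
]
results = run_steps(steps, tools)
```

Keeping every step's output in a dictionary keyed by `id` is what lets a later step pick either the whole result or one element of a tuple result.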
|
|
161
|
+
|
|
162
|
+
---
|
|
163
|
+
|
|
164
|
+
## 📚 Complete Tool Reference
|
|
165
|
+
|
|
166
|
+
PyTeryx has fully audited and implemented the 6 core Alteryx data-manipulation palettes, giving you comprehensive parity for data transformation.
|
|
167
|
+
|
|
168
|
+
### 🔌 In/Out Palette
|
|
169
|
+
Read, write, browse, and generate data.
|
|
170
|
+
|
|
171
|
+
| PyTeryx Method | Alteryx Tool | Description |
|
|
172
|
+
|---|---|---|
|
|
173
|
+
| `InOut.input_data()` | Input Data | Read CSV, Excel, JSON, Parquet (auto-detect format) |
|
|
174
|
+
| `InOut.output_data()` | Output Data | Write DataFrame to CSV, Excel, JSON, Parquet |
|
|
175
|
+
| `InOut.text_input()` | Text Input | Create DataFrame from inline dictionary/list data |
|
|
176
|
+
| `InOut.browse()` | Browse | Display rich summary statistics, types, and head to stdout |
|
|
177
|
+
| `InOut.directory()` | Directory | List files in a path with creation/modification metadata |
|
|
178
|
+
| `InOut.date_time_now()` | DateTime Now | Return the current timestamp as a DataFrame |
|
|
179
|
+
|
|
180
|
+
### 🔧 Preparation Palette
|
|
181
|
+
Cleanse, filter, sort, and transform rows/columns.
|
|
182
|
+
|
|
183
|
+
| PyTeryx Method | Alteryx Tool | Description |
|
|
184
|
+
|---|---|---|
|
|
185
|
+
| `Preparation.filter()` | Filter | Split data based on condition into `(true_df, false_df)` |
|
|
186
|
+
| `Preparation.formula()` | Formula | Add/update a column using a string expression |
|
|
187
|
+
| `Preparation.select()` | Select | Select, rename, or cast data types of columns |
|
|
188
|
+
| `Preparation.data_cleansing()` | Data Cleansing | Strip whitespace, modify case, remove punctuation/nulls |
|
|
189
|
+
| `Preparation.sort()` | Sort | Sort dataframe by one or multiple columns |
|
|
190
|
+
| `Preparation.unique()` | Unique | Split data into `(unique_df, duplicate_df)` |
|
|
191
|
+
| `Preparation.sample()` | Sample | Extract first N, last N, random N, or percentage |
|
|
192
|
+
| `Preparation.record_id()` | Record ID | Attach an auto-incrementing integer ID column |
|
|
193
|
+
| `Preparation.generate_rows()` | Generate Rows | Create sequential rows programmatically |
|
|
194
|
+
| `Preparation.auto_field()` | Auto Field | Optimize column data types to save memory |
|
|
195
|
+
| `Preparation.multi_field_formula()` | Multi-Field Formula | Apply a single formula across multiple columns |

|
|
196
|
+
| `Preparation.multi_row_formula()` | Multi-Row Formula | Apply formulas that reference previous/next rows |
|
|
197
|
+
| `Preparation.tile()` | Tile | Group data into quantiles or uniform bins |
|
|
198
|
+
| `Preparation.imputation()` | Imputation | Fill missing values (mean, median, mode, or custom value) |
|
|
199
|
+
| `Preparation.create_samples()` | Create Samples | Split data into estimation/validation/holdout sets |
|
|
200
|
+
| `Preparation.date_filter()` | Date Filter | Filter dataset by a specific date range |
|
|
201
|
+
| `Preparation.oversample_field()` | Oversample Field | Perform stratified sampling to balance a target class |
|
|
202
|
+
| `Preparation.rank()` | Rank | Assign numeric ranks (supports group-by ranking) |
|
|
203
|
+
|
|
204
|
+
### 🔗 Join Palette
|
|
205
|
+
Blend and match multiple datasets together.
|
|
206
|
+
|
|
207
|
+
| PyTeryx Method | Alteryx Tool | Description |
|
|
208
|
+
|---|---|---|
|
|
209
|
+
| `Join.join()` | Join | Standard join. Returns `(Left_Unjoined, Joined, Right_Unjoined)` |
|
|
210
|
+
| `Join.join_multiple()` | Join Multiple | Merge 3 or more DataFrames on a common key |
|
|
211
|
+
| `Join.union()` | Union | Stack DataFrames vertically (by column name or position) |
|
|
212
|
+
| `Join.find_replace()` | Find Replace | Lookup-based string replacement (or append lookup value) |
|
|
213
|
+
| `Join.append_fields()` | Append Fields | Cross (Cartesian) join to append all rows |
|
|
214
|
+
| `Join.fuzzy_match()` | Fuzzy Match | Approximate string matching (Levenshtein/Jaro-Winkler) |
|
|
215
|
+
| `Join.make_group()` | Make Group | Group relationship keys using DFS connected-components |
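The "DFS connected-components" idea behind the Make Group row can be sketched with the standard library alone. This is an illustration of the general technique, not the code in `join.py`; `make_groups` is a hypothetical name, and labelling each group by its smallest key is an assumption.

```python
from collections import defaultdict


def make_groups(pairs):
    """Map each key to a group label using an iterative depth-first search."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)

    group_of = {}
    for start in sorted(graph):
        if start in group_of:
            continue
        # Every key reachable from `start` belongs to the same group.
        group_of[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in graph[node]:
                if neighbour not in group_of:
                    group_of[neighbour] = start
                    stack.append(neighbour)
    return group_of


groups = make_groups([("A", "B"), ("B", "C"), ("D", "E")])
```

Here `A`, `B`, and `C` end up in one group because `B` links the first two pairs, while `D` and `E` form a second group.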
|
|
216
|
+
|
|
217
|
+
### 📊 Transform Palette
|
|
218
|
+
Reshape, pivot, and aggregate data.
|
|
219
|
+
|
|
220
|
+
| PyTeryx Method | Alteryx Tool | Description |
|
|
221
|
+
|---|---|---|
|
|
222
|
+
| `Transform.summarize()` | Summarize | GroupBy with named aggregations |
|
|
223
|
+
| `Transform.transpose()` | Transpose | Wide-to-long reshaping (unpivot) |
|
|
224
|
+
| `Transform.cross_tab()` | Cross Tab | Long-to-wide reshaping (pivot) |
|
|
225
|
+
| `Transform.running_total()` | Running Total | Cumulative sum calculation (supports grouping) |
|
|
226
|
+
| `Transform.count_records()` | Count Records | Output total row count as a single-value DataFrame |
|
|
227
|
+
| `Transform.arrange()` | Arrange | Manually transpose and rearrange multiple columns |
|
|
228
|
+
| `Transform.make_columns()` | Make Columns | Wrap sequential rows into multiple columns |
|
|
229
|
+
| `Transform.weighted_average()` | Weighted Average | Calculate weighted average (supports grouping) |
|
|
230
|
+
|
|
231
|
+
> **Pro-Tip**: `Transform.summarize()` and `Transform.cross_tab()` natively support familiar Alteryx aggregation aliases like `"count distinct"`, `"count null"`, `"count blank"`, `"concatenate"`, `"longest"`, `"shortest"`, and `"mode"` out-of-the-box (in addition to all standard pandas agg strings like `"sum"` and `"mean"`). Output columns are automatically named `Agg_ColumnName` (e.g. `Sum_Profit`), just like Alteryx.
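For comparison, the same kind of result can be produced with pandas named aggregation. This is a sketch of the equivalent plain-pandas call, not PyTeryx internals: the sample data is invented, `Sum_Profit` follows the naming shown above, and the other two output names are assumptions chosen for illustration.

```python
import pandas as pd

# Invented sample data for illustration only.
df = pd.DataFrame({
    "Region": ["East", "East", "West"],
    "Profit": [100, 250, 400],
    "CustomerID": [1, 1, 2],
})

# Named aggregation: output_name=(input_column, aggregation).
summary = df.groupby("Region", as_index=False).agg(
    Sum_Profit=("Profit", "sum"),
    Mean_Profit=("Profit", "mean"),                      # assumed output name
    CountDistinct_CustomerID=("CustomerID", "nunique"),  # assumed output name
)
```

In plain pandas, "count distinct" corresponds to the `nunique` aggregation.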
|
|
232
|
+
|
|
233
|
+
### 📝 Parse Palette
|
|
234
|
+
Extract and parse dates, XML, and text.
|
|
235
|
+
|
|
236
|
+
| PyTeryx Method | Alteryx Tool | Description |
|
|
237
|
+
|---|---|---|
|
|
238
|
+
| `Parse.date_time()` | DateTime | Convert strings to DateTime (or infer formats automatically) |
|
|
239
|
+
| `Parse.regex_match()` | RegEx (Match) | Create a boolean flag if a pattern is found |
|
|
240
|
+
| `Parse.regex_parse()` | RegEx (Parse) | Extract regex capture groups directly into new columns |
|
|
241
|
+
| `Parse.regex_replace()` | RegEx (Replace) | Replace text based on a regular expression |
|
|
242
|
+
| `Parse.regex_tokenize()` | RegEx (Tokenize) | Split a string by a delimiter into rows or columns |
|
|
243
|
+
| `Parse.text_to_columns()` | Text to Columns | Split delimited text into columns or rows |
|
|
244
|
+
| `Parse.xml_parse()` | XML Parse | Extract XML nodes, outer XML, and auto-flatten child tags |
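The RegEx (Match) and RegEx (Parse) rows correspond closely to two pandas string methods. The sketch below uses only pandas; the column name `OrderCode` and the pattern are invented for illustration, and it does not show how `Parse.regex_parse` itself is called.

```python
import pandas as pd

# Invented sample data for illustration only.
df = pd.DataFrame({"OrderCode": ["AB-123", "XY-9", "bad"]})

# Match: a boolean flag per row.
matched = df["OrderCode"].str.match(r"^[A-Z]+-\d+$")

# Parse: named capture groups become new columns; non-matching rows get NaN.
parsed = df["OrderCode"].str.extract(r"^(?P<Prefix>[A-Z]+)-(?P<Number>\d+)$")
result = pd.concat([df, parsed], axis=1)
```

Note that `str.extract` returns strings, so `Number` would still need a cast before numeric use.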
|
|
245
|
+
|
|
246
|
+
### 🛠️ Developer Palette
|
|
247
|
+
Advanced data manipulation utilities.
|
|
248
|
+
|
|
249
|
+
| PyTeryx Method | Alteryx Tool | Description |
|
|
250
|
+
|---|---|---|
|
|
251
|
+
| `Developer.base64_encode()` | Base64 Encoder | Encode string columns to Base64 |
|
|
252
|
+
| `Developer.base64_decode()` | Base64 Encoder | Decode Base64 columns to strings |
|
|
253
|
+
| `Developer.download()` | Download | Perform an HTTP GET request directly into a DataFrame |
|
|
254
|
+
| `Developer.column_info()` | Column Info | Generate a rich metadata/schema report of your data |
|
|
255
|
+
| `Developer.dynamic_rename()` | Dynamic Rename | Rename columns dynamically via a lookup table mapping |
|
|
256
|
+
| `Developer.json_parse()` | JSON Parse | Flatten a JSON string column dynamically into multiple columns |
|
|
257
|
+
| `Developer.dynamic_select()` | Dynamic Select | Subset columns dynamically by data type or regex pattern |
|
|
258
|
+
| `Developer.test()` | Test | Assert a condition evaluates to True on the dataset |
|
|
259
|
+
| `Developer.test_equal()` | Expect Equal | Strictly validate that two DataFrames are identical |
|
|
260
|
+
|
|
261
|
+
*(Note: PyTeryx intentionally skips Alteryx GUI workflow orchestrators like "Detour" and "Block Until Done", as well as external execution nodes like the "R Tool", since PyTeryx workflows natively leverage standard Python script execution flow).*
|
|
262
|
+
|
|
263
|
+
---
|
|
264
|
+
|
|
265
|
+
## 🧪 Testing & Development
|
|
266
|
+
|
|
267
|
+
PyTeryx boasts an extensive test suite verifying 1:1 parity with Alteryx tools.
|
|
268
|
+
|
|
269
|
+
```bash
|
|
270
|
+
# Clone the repository
|
|
271
|
+
git clone https://github.com/pyteryx/pyteryx.git
|
|
272
|
+
cd pyteryx
|
|
273
|
+
|
|
274
|
+
# Install development dependencies
|
|
275
|
+
pip install -e ".[dev]"
|
|
276
|
+
|
|
277
|
+
# Run the test suite with coverage
|
|
278
|
+
pytest tests/ -v --cov=pyteryx --cov-report=term-missing
|
|
279
|
+
```
|
|
280
|
+
|
|
281
|
+
## 🤝 Contributing
|
|
282
|
+
|
|
283
|
+
Contributions are heavily encouraged! PyTeryx is community-driven. If you find a missing edge-case, want to optimize a pandas operation, or want to add support for a new Alteryx Marketplace tool, please open an issue or submit a pull request on GitHub!
|
|
284
|
+
|
|
285
|
+
## 📄 License
|
|
286
|
+
|
|
287
|
+
[MIT License](LICENSE) — see the [LICENSE](LICENSE) file for details.
|
pyteryx-0.1.0/README.md
ADDED
|
@@ -0,0 +1,250 @@
|
|
|
1
|
+
# 🐦 PyTeryx: The Alteryx-to-Python Migration Engine
|
|
2
|
+
|
|
3
|
+
**Replicate Alteryx Designer tools as independent Python functions.**
|
|
4
|
+
|
|
5
|
+
[](https://www.python.org/downloads/)
|
|
6
|
+
[](LICENSE)
|
|
7
|
+
[](https://pypi.org/project/pyteryx/)
|
|
8
|
+
|
|
9
|
+
---
|
|
10
|
+
|
|
11
|
+
PyTeryx is a Python package that mirrors every major **Alteryx Designer** tool as a simple, independent function. If you're migrating from Alteryx to Python, PyTeryx gives you a familiar API while leveraging the full power, speed, and ecosystem of **pandas**.
|
|
12
|
+
|
|
13
|
+
> **Note:** This package is not affiliated with Alteryx, Inc. "Alteryx" is a registered trademark of Alteryx, Inc. PyTeryx is an independent, community-driven open-source project.
|
|
14
|
+
|
|
15
|
+
## 🌟 Why PyTeryx?
|
|
16
|
+
|
|
17
|
+
Migrating from a visual ETL tool like Alteryx to code-first Python is traditionally painful. Data engineers and analysts have to translate complex logic, visual anchors, and proprietary tool configurations into standard Pandas operations.
|
|
18
|
+
|
|
19
|
+
**PyTeryx bridges this gap by providing:**
|
|
20
|
+
- **Zero-Friction Migration**: Translating an Alteryx workflow to Python becomes a straightforward 1:1 mapping exercise.
|
|
21
|
+
- **Escape Vendor Lock-In**: Execute your data pipelines anywhere Python runs—locally, on Airflow, AWS Lambda, or Databricks—without expensive proprietary licensing.
|
|
22
|
+
- **Pure Python, Pure Pandas**: Under the hood, PyTeryx leverages optimized `pandas` operations. No bloated dependencies, no proprietary file formats.
|
|
23
|
+
- **CI/CD Ready**: Since your workflows are now standard Python scripts or declarative YAML files, you can version control them, write unit tests, and integrate them into standard CI/CD pipelines.
|
|
24
|
+
|
|
25
|
+
## ✨ Core Architecture & Features
|
|
26
|
+
|
|
27
|
+
- **1:1 Alteryx Tool Mapping**: Each core Alteryx tool has a corresponding Python function (e.g. `Preparation.filter`, `Join.join`, `Transform.summarize`).
|
|
28
|
+
- **Familiar Output Anchors**: Tools that have multiple output anchors in Alteryx return tuples in Python.
|
|
29
|
+
- `Filter` returns `(True_Data, False_Data)`
|
|
30
|
+
- `Join` returns `(Left_Unjoined, Joined, Right_Unjoined)`
|
|
31
|
+
- `Unique` returns `(Unique_Records, Duplicate_Records)`
|
|
32
|
+
- **Pure Functions**: All methods are `@staticmethod`. There is no hidden instance state and no side effects.
|
|
33
|
+
- **Immutable by Default**: Original DataFrames are never mutated. Every tool returns a brand new DataFrame.
|
|
34
|
+
- **Type-Hinted & Documented**: Full type annotations and Google-style docstrings enable rich IDE autocomplete.
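The tuple-returning, non-mutating style described in the list above can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the code in `preparation.py`: `split_filter` and `split_unique` are hypothetical names, and the sample data is invented.

```python
import pandas as pd


def split_filter(df, condition):
    """Return (true_df, false_df) without modifying the input frame."""
    mask = df.eval(condition)
    return df[mask].copy(), df[~mask].copy()


def split_unique(df, subset):
    """Return (unique_df, duplicate_df), keeping the first occurrence as unique."""
    is_duplicate = df.duplicated(subset=subset, keep="first")
    return df[~is_duplicate].copy(), df[is_duplicate].copy()


# Invented sample data for illustration only.
sales = pd.DataFrame({"OrderID": [1, 2, 2, 3], "Revenue": [500, 1500, 1500, 2000]})

high, low = split_filter(sales, "Revenue > 1000")
unique_rows, duplicate_rows = split_unique(sales, subset=["OrderID"])
```

Returning copies means later edits to `high` or `low` cannot leak back into `sales`.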
|
|
35
|
+
|
|
36
|
+
## 📦 Installation
|
|
37
|
+
|
|
38
|
+
Install PyTeryx via pip:
|
|
39
|
+
|
|
40
|
+
```bash
|
|
41
|
+
pip install pyteryx
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
*(Requires Python 3.9+ and pandas 1.5.0+)*
|
|
45
|
+
|
|
46
|
+
---
|
|
47
|
+
|
|
48
|
+
## 🚀 How to Use PyTeryx
|
|
49
|
+
|
|
50
|
+
PyTeryx is designed for both software engineers (Python API) and data analysts (Declarative YAML).
|
|
51
|
+
|
|
52
|
+
### Option 1: Python API (For Developers)
|
|
53
|
+
|
|
54
|
+
Use PyTeryx exactly like you use pandas, but with the familiar Alteryx tool names and behaviors. You can chain these together natively.
|
|
55
|
+
|
|
56
|
+
```python
|
|
57
|
+
from pyteryx import InOut, Preparation, Join, Transform, Parse, Developer
|
|
58
|
+
|
|
59
|
+
# 1. Read data (Alteryx: Input Data tool)
|
|
60
|
+
sales = InOut.input_data("sales.csv")
|
|
61
|
+
customers = InOut.input_data("customers.csv")
|
|
62
|
+
|
|
63
|
+
# 2. Filter rows (Alteryx: Filter tool → T/F anchors)
|
|
64
|
+
# Returns a tuple representing the True and False output anchors
|
|
65
|
+
high_revenue, low_revenue = Preparation.filter(sales, "Revenue > 1000")
|
|
66
|
+
|
|
67
|
+
# 3. Add a calculated column (Alteryx: Formula tool)
|
|
68
|
+
high_revenue = Preparation.formula(high_revenue, "Profit", "Revenue - Cost")
|
|
69
|
+
|
|
70
|
+
# 4. Join with another dataset (Alteryx: Join tool → L/J/R anchors)
|
|
71
|
+
# Returns (Left Unjoined, Joined, Right Unjoined)
|
|
72
|
+
left_only, joined, right_only = Join.join(high_revenue, customers, on="CustomerID")
|
|
73
|
+
|
|
74
|
+
# 5. Summarize (Alteryx: Summarize tool)
|
|
75
|
+
# Supports native Alteryx aggregation aliases!
|
|
76
|
+
summary = Transform.summarize(
|
|
77
|
+
joined,
|
|
78
|
+
group_by="Region",
|
|
79
|
+
aggregations={
|
|
80
|
+
"Profit": ["sum", "mean"],
|
|
81
|
+
"CustomerID": "count distinct"
|
|
82
|
+
}
|
|
83
|
+
)
|
|
84
|
+
|
|
85
|
+
# 6. Verify data integrity (Alteryx: Test / Expect Equal tool)
|
|
86
|
+
Developer.test(summary, lambda df: df["Sum_Profit"].sum() > 0, "Profit must be positive!")
|
|
87
|
+
|
|
88
|
+
# 7. Output results (Alteryx: Output Data tool)
|
|
89
|
+
InOut.output_data(summary, "summary.parquet")
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
### Option 2: YAML Pipelines (No-Code ETL)
|
|
93
|
+
|
|
94
|
+
For users who prefer a declarative approach, you can build end-to-end data pipelines without writing any Python code using PyTeryx's declarative YAML engine. This is perfect for defining pipelines as configuration files.
|
|
95
|
+
|
|
96
|
+
1. **Create a `pipeline.yaml`**
|
|
97
|
+
```yaml
|
|
98
|
+
name: "Sales Filter Pipeline"
|
|
99
|
+
steps:
|
|
100
|
+
- id: "load"
|
|
101
|
+
tool: "InOut.input_data"
|
|
102
|
+
args:
|
|
103
|
+
path: "sales.csv"
|
|
104
|
+
|
|
105
|
+
- id: "filter_high"
|
|
106
|
+
tool: "Preparation.filter"
|
|
107
|
+
inputs:
|
|
108
|
+
df: "load"
|
|
109
|
+
args:
|
|
110
|
+
condition: "Revenue > 1000"
|
|
111
|
+
|
|
112
|
+
- id: "save"
|
|
113
|
+
tool: "InOut.output_data"
|
|
114
|
+
inputs:
|
|
115
|
+
df: "filter_high.0" # The '.0' grabs the first tuple output (the 'True' anchor)
|
|
116
|
+
args:
|
|
117
|
+
path: "high_revenue.csv"
|
|
118
|
+
```
|
|
119
|
+
|
|
120
|
+
2. **Run via Command Line Interface (CLI)**
|
|
121
|
+
```bash
|
|
122
|
+
pyteryx run pipeline.yaml
|
|
123
|
+
```
|
|
124
|
+
|
|
125
|
+
---
|
|
126
|
+
|
|
127
|
+
## 📚 Complete Tool Reference
|
|
128
|
+
|
|
129
|
+
PyTeryx has fully audited and implemented the 6 core Alteryx data-manipulation palettes, giving you comprehensive parity for data transformation.
|
|
130
|
+
|
|
131
|
+
### 🔌 In/Out Palette
|
|
132
|
+
Read, write, browse, and generate data.
|
|
133
|
+
|
|
134
|
+
| PyTeryx Method | Alteryx Tool | Description |
|
|
135
|
+
|---|---|---|
|
|
136
|
+
| `InOut.input_data()` | Input Data | Read CSV, Excel, JSON, Parquet (auto-detect format) |
|
|
137
|
+
| `InOut.output_data()` | Output Data | Write DataFrame to CSV, Excel, JSON, Parquet |
|
|
138
|
+
| `InOut.text_input()` | Text Input | Create DataFrame from inline dictionary/list data |
|
|
139
|
+
| `InOut.browse()` | Browse | Display rich summary statistics, types, and head to stdout |
|
|
140
|
+
| `InOut.directory()` | Directory | List files in a path with creation/modification metadata |
|
|
141
|
+
| `InOut.date_time_now()` | DateTime Now | Return the current timestamp as a DataFrame |
|
|
142
|
+
|
|
143
|
+
### 🔧 Preparation Palette
|
|
144
|
+
Cleanse, filter, sort, and transform rows/columns.
|
|
145
|
+
|
|
146
|
+
| PyTeryx Method | Alteryx Tool | Description |
|
|
147
|
+
|---|---|---|
|
|
148
|
+
| `Preparation.filter()` | Filter | Split data based on condition into `(true_df, false_df)` |
|
|
149
|
+
| `Preparation.formula()` | Formula | Add/update a column using a string expression |
|
|
150
|
+
| `Preparation.select()` | Select | Select, rename, or cast data types of columns |
|
|
151
|
+
| `Preparation.data_cleansing()` | Data Cleansing | Strip whitespace, modify case, remove punctuation/nulls |
|
|
152
|
+
| `Preparation.sort()` | Sort | Sort dataframe by one or multiple columns |
|
|
153
|
+
| `Preparation.unique()` | Unique | Split data into `(unique_df, duplicate_df)` |
|
|
154
|
+
| `Preparation.sample()` | Sample | Extract first N, last N, random N, or percentage |
|
|
155
|
+
| `Preparation.record_id()` | Record ID | Attach an auto-incrementing integer ID column |
|
|
156
|
+
| `Preparation.generate_rows()` | Generate Rows | Create sequential rows programmatically |
|
|
157
|
+
| `Preparation.auto_field()` | Auto Field | Optimize column data types to save memory |
|
|
158
|
+
| `Preparation.multi_field_formula()` | Multi-Field Formula | Apply a single formula across multiple columns |
|
|
159
|
+
| `Preparation.multi_row_formula()` | Multi-Row Formula | Apply formulas that reference previous/next rows |
|
|
160
|
+
| `Preparation.tile()` | Tile | Group data into quantiles or uniform bins |
|
|
161
|
+
| `Preparation.imputation()` | Imputation | Fill missing values (mean, median, mode, or custom value) |
|
|
162
|
+
| `Preparation.create_samples()` | Create Samples | Split data into estimation/validation/holdout sets |
|
|
163
|
+
| `Preparation.date_filter()` | Date Filter | Filter dataset by a specific date range |
|
|
164
|
+
| `Preparation.oversample_field()` | Oversample Field | Perform stratified sampling to balance a target class |
|
|
165
|
+
| `Preparation.rank()` | Rank | Assign numeric ranks (supports group-by ranking) |
|
|
166
|
+
|
|
167
|
+
### 🔗 Join Palette
|
|
168
|
+
Blend and match multiple datasets together.
|
|
169
|
+
|
|
170
|
+
| PyTeryx Method | Alteryx Tool | Description |
|
|
171
|
+
|---|---|---|
|
|
172
|
+
| `Join.join()` | Join | Standard join. Returns `(Left_Unjoined, Joined, Right_Unjoined)` |
|
|
173
|
+
| `Join.join_multiple()` | Join Multiple | Merge 3 or more DataFrames on a common key |
|
|
174
|
+
| `Join.union()` | Union | Stack DataFrames vertically (by column name or position) |
|
|
175
|
+
| `Join.find_replace()` | Find Replace | Lookup-based string replacement (or append lookup value) |
|
|
176
|
+
| `Join.append_fields()` | Append Fields | Cross (Cartesian) join that appends every source row to every target row |
|
|
177
|
+
| `Join.fuzzy_match()` | Fuzzy Match | Approximate string matching (Levenshtein/Jaro-Winkler) |
|
|
178
|
+
| `Join.make_group()` | Make Group | Group relationship keys using DFS connected-components |
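
The three outputs listed for `Join.join()` correspond to what pandas exposes through the `indicator` option of `DataFrame.merge`. The sketch below shows that pandas equivalent only; the function name `three_way_join` is hypothetical, PyTeryx's actual signature may differ, and the code assumes the two frames share no column names other than the join key.

```python
import pandas as pd


def three_way_join(left: pd.DataFrame, right: pd.DataFrame, on: str):
    """Hypothetical helper: return (left_unjoined, joined, right_unjoined)."""
    # An outer merge with indicator=True tags each row with its origin.
    merged = left.merge(right, on=on, how="outer", indicator=True)
    joined = merged[merged["_merge"] == "both"].drop(columns="_merge")
    left_only = merged[merged["_merge"] == "left_only"]
    right_only = merged[merged["_merge"] == "right_only"]
    # Unmatched rows keep only the columns of the side they came from.
    return (
        left_only[list(left.columns)].reset_index(drop=True),
        joined.reset_index(drop=True),
        right_only[list(right.columns)].reset_index(drop=True),
    )


customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
cities = pd.DataFrame({"id": [2, 3, 4], "city": ["Oslo", "Rome", "Lima"]})
left_unjoined, joined, right_unjoined = three_way_join(customers, cities, on="id")
```

Returning all three frames from one call mirrors the L, J, and R anchors of the Alteryx Join tool, so unmatched records stay available for inspection instead of being silently dropped.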
|
|
179
|
+
|
|
180
|
+
### 📊 Transform Palette
|
|
181
|
+
Reshape, pivot, and aggregate data.
|
|
182
|
+
|
|
183
|
+
| PyTeryx Method | Alteryx Tool | Description |
|
|
184
|
+
|---|---|---|
|
|
185
|
+
| `Transform.summarize()` | Summarize | GroupBy with named aggregations |
|
|
186
|
+
| `Transform.transpose()` | Transpose | Wide-to-long reshaping (unpivot) |
|
|
187
|
+
| `Transform.cross_tab()` | Cross Tab | Long-to-wide reshaping (pivot) |
|
|
188
|
+
| `Transform.running_total()` | Running Total | Cumulative sum calculation (supports grouping) |
|
|
189
|
+
| `Transform.count_records()` | Count Records | Output total row count as a single-value DataFrame |
|
|
190
|
+
| `Transform.arrange()` | Arrange | Manually transpose and rearrange multiple columns |
|
|
191
|
+
| `Transform.make_columns()` | Make Columns | Wrap sequential rows into multiple columns |
|
|
192
|
+
| `Transform.weighted_average()` | Weighted Average | Calculate weighted average (supports grouping) |
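
The wide-to-long and long-to-wide reshaping described for `Transform.transpose()` and `Transform.cross_tab()` map onto pandas `melt` and `pivot_table`. The following is a pandas-only sketch under that assumption; it does not call PyTeryx, and the `Name`/`Value` output column names are chosen to follow the Alteryx Transpose tool's convention.

```python
import pandas as pd

wide = pd.DataFrame({"Region": ["East", "West"], "Q1": [10, 30], "Q2": [20, 40]})

# Wide-to-long (unpivot): one row per (Region, quarter) pair.
long = wide.melt(id_vars="Region", var_name="Name", value_name="Value")

# Long-to-wide (pivot): quarters become column headers again.
back = long.pivot_table(
    index="Region", columns="Name", values="Value", aggfunc="sum"
).reset_index()
back.columns.name = None  # drop the leftover "Name" label on the column axis
```

Round-tripping through both steps is a quick way to check that a reshape loses no values.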
|
|
193
|
+
|
|
194
|
+
> **Pro-Tip**: `Transform.summarize()` and `Transform.cross_tab()` accept familiar Alteryx aggregation aliases such as `"count distinct"`, `"count null"`, `"count blank"`, `"concatenate"`, `"longest"`, `"shortest"`, and `"mode"`, in addition to all standard pandas agg strings like `"sum"` and `"mean"`. Output columns are automatically named `Agg_ColumnName` (e.g. `Sum_Profit`), just like Alteryx.
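
The `Agg_ColumnName` naming scheme can be reproduced in plain pandas with named aggregation. This sketch is an illustration, not PyTeryx's code: the `summarize` helper, the `ALIASES` table, and the `CountDistinct` prefix are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical alias table: Alteryx-style name -> (output prefix, pandas agg string).
ALIASES = {"count distinct": ("CountDistinct", "nunique")}


def summarize(df: pd.DataFrame, group_by: list[str], aggs: list[tuple[str, str]]) -> pd.DataFrame:
    """Hypothetical helper: group and aggregate, naming outputs Agg_ColumnName."""
    named = {}
    for column, agg in aggs:
        # Fall back to the pandas agg string itself, title-cased for the prefix.
        prefix, pandas_agg = ALIASES.get(agg, (agg.title(), agg))
        named[f"{prefix}_{column}"] = (column, pandas_agg)
    return df.groupby(group_by, as_index=False).agg(**named)


sales = pd.DataFrame(
    {"Region": ["East", "East", "West", "West", "West"], "Profit": [100, 150, 80, 80, 40]}
)
result = summarize(sales, ["Region"], [("Profit", "sum"), ("Profit", "count distinct")])
```

Keeping the alias lookup in one dictionary makes it easy to see which names are pass-through pandas strings and which are translated.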
|
|
195
|
+
|
|
196
|
+
### 📝 Parse Palette
|
|
197
|
+
Extract and parse dates, XML, and text.
|
|
198
|
+
|
|
199
|
+
| PyTeryx Method | Alteryx Tool | Description |
|
|
200
|
+
|---|---|---|
|
|
201
|
+
| `Parse.date_time()` | DateTime | Convert strings to DateTime (or infer formats automatically) |
|
|
202
|
+
| `Parse.regex_match()` | RegEx (Match) | Create a boolean flag if a pattern is found |
|
|
203
|
+
| `Parse.regex_parse()` | RegEx (Parse) | Extract regex capture groups directly into new columns |
|
|
204
|
+
| `Parse.regex_replace()` | RegEx (Replace) | Replace text based on a regular expression |
|
|
205
|
+
| `Parse.regex_tokenize()` | RegEx (Tokenize) | Split a string into rows or columns using a regex pattern |
|
|
206
|
+
| `Parse.text_to_columns()` | Text to Columns | Split delimited text into columns or rows |
|
|
207
|
+
| `Parse.xml_parse()` | XML Parse | Extract XML nodes, outer XML, and auto-flatten child tags |
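
Extracting capture groups into new columns, as described for `Parse.regex_parse()`, is what pandas' `Series.str.extract` does. The sketch below uses only pandas and a made-up `code` column; how PyTeryx names the output columns is not stated here, so the named groups `prefix` and `number` are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame({"code": ["AB-123", "CD-456", "bad"]})

# Each named capture group becomes one output column; non-matching rows yield NaN.
parsed = df["code"].str.extract(r"^(?P<prefix>[A-Z]{2})-(?P<number>\d+)$")
out = pd.concat([df, parsed], axis=1)
```

Rows that do not match the pattern are kept with missing values rather than dropped, which makes bad inputs easy to filter afterwards.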
|
|
208
|
+
|
|
209
|
+
### 🛠️ Developer Palette
|
|
210
|
+
Advanced data manipulation utilities.
|
|
211
|
+
|
|
212
|
+
| PyTeryx Method | Alteryx Tool | Description |
|
|
213
|
+
|---|---|---|
|
|
214
|
+
| `Developer.base64_encode()` | Base64 Encoder | Encode string columns to Base64 |
|
|
215
|
+
| `Developer.base64_decode()` | Base64 Encoder | Decode Base64 columns to strings |
|
|
216
|
+
| `Developer.download()` | Download | Perform an HTTP GET request and load the response into a DataFrame |
|
|
217
|
+
| `Developer.column_info()` | Column Info | Generate a rich metadata/schema report of your data |
|
|
218
|
+
| `Developer.dynamic_rename()` | Dynamic Rename | Rename columns dynamically via a lookup table mapping |
|
|
219
|
+
| `Developer.json_parse()` | JSON Parse | Flatten a JSON string column dynamically into multiple columns |
|
|
220
|
+
| `Developer.dynamic_select()` | Dynamic Select | Subset columns dynamically by data type or regex pattern |
|
|
221
|
+
| `Developer.test()` | Test | Assert a condition evaluates to True on the dataset |
|
|
222
|
+
| `Developer.test_equal()` | Expect Equal | Strictly validate that two DataFrames are identical |
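
Strict DataFrame comparison of the kind described for `Developer.test_equal()` is available in pandas as `pandas.testing.assert_frame_equal`. This sketch wraps that function in a hypothetical `frames_identical` helper; whether PyTeryx uses the same function or the same strictness settings internally is an assumption.

```python
import pandas as pd


def frames_identical(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Hypothetical helper: True only if values, column order, index, and dtypes all match."""
    try:
        pd.testing.assert_frame_equal(a, b, check_dtype=True)
        return True
    except AssertionError:
        return False


expected = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.75]})
same = frames_identical(expected, expected.copy())
# Same values, but "id" is stored as float instead of int.
dtype_mismatch = frames_identical(expected, expected.astype({"id": "float64"}))
```

A strict check like this catches silent dtype drift, which a value-only comparison would let through.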
|
|
223
|
+
|
|
224
|
+
*(Note: PyTeryx intentionally omits Alteryx GUI workflow orchestrators such as "Detour" and "Block Until Done", as well as external execution nodes such as the "R Tool", because a PyTeryx workflow is an ordinary Python script whose control flow already covers these cases.)*
|
|
225
|
+
|
|
226
|
+
---
|
|
227
|
+
|
|
228
|
+
## 🧪 Testing & Development
|
|
229
|
+
|
|
230
|
+
PyTeryx ships with a test suite under `tests/` that covers each palette module and aims for 1:1 parity with the corresponding Alteryx tools.
|
|
231
|
+
|
|
232
|
+
```bash
|
|
233
|
+
# Clone the repository
|
|
234
|
+
git clone https://github.com/pyteryx/pyteryx.git
|
|
235
|
+
cd pyteryx
|
|
236
|
+
|
|
237
|
+
# Install development dependencies
|
|
238
|
+
pip install -e ".[dev]"
|
|
239
|
+
|
|
240
|
+
# Run the test suite with coverage
|
|
241
|
+
pytest tests/ -v --cov=pyteryx --cov-report=term-missing
|
|
242
|
+
```
|
|
243
|
+
|
|
244
|
+
## 🤝 Contributing
|
|
245
|
+
|
|
246
|
+
Contributions are welcome! PyTeryx is community-driven. If you find a missing edge case, want to optimize a pandas operation, or want to add support for a new Alteryx Marketplace tool, please open an issue or submit a pull request on GitHub.
|
|
247
|
+
|
|
248
|
+
## 📄 License
|
|
249
|
+
|
|
250
|
+
Released under the MIT License; see the [LICENSE](LICENSE) file for details.
|