PyPI - mloda - Versions diffs - 0.2.11__py3-none-any.whl → 0.2.13__py3-none-any.whl - Mend

mloda 0.2.11__py3-none-any.whl → 0.2.13__py3-none-any.whl


Files changed (45)

{mloda-0.2.11.dist-info → mloda-0.2.13.dist-info}/METADATA RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: mloda
-Version: 0.2.11
+Version: 0.2.13
 Summary: Rethinking Data and Feature Engineering
 Author-email: Tom Kaltofen <info@mloda.ai>
 License:                                  Apache License
@@ -219,43 +219,18 @@ License-File: NOTICE.md
 Requires-Dist: pyarrow
 Dynamic: license-file
-# mloda: Revolutionary Process-Data Separation for Feature and Data Engineering
+# mloda: Make data and feature engineering shareable
+[![Website](https://img.shields.io/badge/website-mloda.ai-blue.svg)](https://mloda.ai)
 [![Documentation](https://img.shields.io/badge/docs-github.io-blue.svg)](https://mloda-ai.github.io/mloda/)
 [![PyPI version](https://badge.fury.io/py/mloda.svg)](https://badge.fury.io/py/mloda)
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/mloda-ai/mloda/blob/main/LICENSE.TXT)
--[![Tox](https://img.shields.io/badge/tested_with-tox-blue.svg)](https://tox.readthedocs.io/)
--[![Checked with mypy](https://img.shields.io/badge/type%20checked-mypy-blue.svg)](http://mypy-lang.org/)
--[![code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)
+[![Tox](https://img.shields.io/badge/tested_with-tox-blue.svg)](https://tox.readthedocs.io/)
+[![Checked with mypy](https://img.shields.io/badge/type%20checked-mypy-blue.svg)](http://mypy-lang.org/)
+[![code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)
 > **⚠️ Early Version Notice**: mloda is in active development. Some features described below are still being implemented. We're actively seeking feedback to shape the future of the framework. [Share your thoughts!](https://github.com/mloda-ai/mloda/issues/)
-## 🚀 Transforming Feature Engineering Through Process-Data Separation
-mloda **revolutionizes feature engineering** by separating **processes** (transformations) from **data**, enabling unprecedented flexibility, reusability, and scalability in machine learning workflows.
-**🤖 Built for the AI Era**: While others write code, AI writes mloda plugins. *Check the inline comments in our experimental plugin code - all AI written.*
-**🌐 Share Without Secrets**: Traditional pipelines lock business logic inside - mloda plugins separate transformations from business context, enabling safe community sharing.
-**🎯 Try the first example out NOW:** [sklearn Integration Example](https://mloda-ai.github.io/mloda/examples/sklearn_integration_basic/) - See mloda transform traditional sklearn pipelines!
-## 📋 Table of Contents
-- [🍳 Think of mloda Like Cooking Recipes](#-think-of-mloda-like-cooking-recipes)
-- [💡 The Value Proposition](#-the-value-proposition)
-- [📊 Why Process-Data Separation Changes Everything](#-why-process-data-separation-changes-everything)
-- [🚀 Quick Start](#-quick-start)
-- [🔄 Write Once, Run Anywhere](#-write-once-run-anywhere-environments--frameworks)
-- [🌍 Deploy Anywhere Python Runs](#-deploy-anywhere-python-runs)
-- [🎯 Minimal Dependencies](#-minimal-dependencies-maximum-compatibility)
-- [🔧 Complete Data Processing](#-complete-data-processing-capabilities)
-- [👥 Role-Based Governance](#-logical-role-based-data-governance)
-- [🌐 Community-Driven Plugin Ecosystem](#-community-driven-plugin-ecosystem)
-- [📖 Documentation](#-documentation)
-- [🤝 Contributing](#-contributing)
-- [📄 License](#-license)
 ## 🍳 Think of mloda Like Cooking Recipes
 **Traditional Data Pipelines** = Making everything from scratch
@@ -270,201 +245,215 @@ mloda **revolutionizes feature engineering** by separating **processes** (transf
 - Switch kitchens (home → restaurant → food truck) - same recipes work
 - Share your "tomato sauce" recipe with friends - they don't need your whole kitchen
-**Real Example**: You need to clean customer ages (remove outliers, fill missing values)
-- **Traditional**: Write age-cleaning code for training, testing, production separately
-- **mloda**: Create one "clean_age" plugin, use everywhere - development, testing, production, analysis
 **Result**: Instead of rebuilding the same thing 10 times, build once and reuse everywhere!
-## 💡 The Value Proposition
+### Installation
+```bash
+pip install mloda
+```
-**What mloda aims to enable:**
+### 1. The Core API Call - Your Starting Point
-| Challenge | Traditional Pain Point | mloda's Approach |
-|-----------|----------------------|------------------|
-| **⏰ Repetitive Work** | Rebuild same transformations for each environment | Write once, reuse across all environments |
-| **🐛 Consistency Issues** | Different implementations create bugs | Single implementation ensures consistency |
-| **👥 Knowledge Silos** | Senior expertise locked in complex pipelines | Reusable patterns everyone can use |
-| **🚀 Deployment Friction** | Train/serve skew causes production issues | Same logic guaranteed everywhere |
-| **💡 Innovation Bottleneck** | Time spent on solved problems | Focus energy on unique business value |
+**The One Command That Does Everything**
-**Vision**: Enable data teams to spend more time solving unique business problems and less time rebuilding common patterns, while reducing the risk of inconsistencies across environments.
+```python
+# This is the heart of mloda. You describe what you want and mloda resolves the dependencies.
+from mloda_core.api.request import mlodaAPI
-## 📊 Why Process-Data Separation Changes Everything
+result = mlodaAPI.run_all(
+    features=["age", "standard_scaled__weight"]
+)
-| Aspect | Traditional Approach | mloda Approach |
-|--------|---------------------|----------------|
-| **🔄 Reusability** | Transformations tied to specific datasets | Same feature definitions work across all contexts |
-| **⚡ Flexibility** | Locked to single compute framework | Multi-framework support with automatic optimization |
-| **📝 Maintainability** | Complex nested pipeline objects | Clean, declarative feature names |
-| **🏭 Scalability** | Framework-specific limitations | Horizontal scaling without architectural changes |
+# That's it! You get processed data back
+data = result[0]
+print(data.head())
+```
-> *For those who know: Want Iceberg-like metadata capabilities across your entire data and feature lifecycle? That's exactly what mloda aims for.*
+**What just happened?**
+- mloda found your data automatically
+- Applied transformations (scaling, encoding)
+- Returned clean, ready-to-use DataFrame
-## 🚀 Quick Start
+> **Key Insight**: As long as the plugins and data accesses exist, mloda can derive any feature automatically.
-### Installation
-```bash
-pip install mloda
-```
+### 2. Setting Up Your Data
-### Your First Feature Pipeline
-``` python
-import numpy as np
-from mloda_core.api.request import mlodaAPI
-from mloda_plugins.compute_framework.base_implementations.pandas.dataframe import PandasDataframe
+**Using DataCreator - The mloda Way**
+```python
+# DataCreator: Perfect for testing, demos, and prototyping
+# Use this when you need synthetic data or want to test mloda without external files
 from mloda_core.abstract_plugins.components.input_data.creator.data_creator import DataCreator
 from mloda_core.abstract_plugins.abstract_feature_group import AbstractFeatureGroup
-np.random.seed(42)
-n_samples = 1000
-class YourFirstSyntheticDataSet(AbstractFeatureGroup):
+class SampleDataFeature(AbstractFeatureGroup):
     @classmethod
     def input_data(cls):
-        return DataCreator({"age", "weight", "state", "gender"})
-    @classmethod
+        # Define what columns your data will have
+        return DataCreator({
+            "age", "weight", "state", "income", "target"
+        })
+    @classmethod
     def calculate_feature(cls, data, features):
+        # Generate sample data that matches your DataCreator specification
+        # This is where you'd normally load from files, databases, or APIs
         return {
-                "age": np.random.randint(25, 65, 500),
-                "weight": np.random.normal(80, 20, 500),  # Different distribution
-                "state": np.random.choice(["WA", "OR"], 500),  # Different states!
-                "gender": np.random.choice(["M", "F", "Other"], 500),  # New category!
-            }
+            'age': [25, 30, 35, None, 45, 28, 33],
+            'weight': [150, 180, None, 200, 165, 140, 175],
+            'state': ['CA', 'NY', 'TX', 'CA', 'FL', 'NY', 'TX'],
+            'income': [50000, 75000, 85000, 60000, None, 45000, 70000],
+            'target': [1, 0, 1, 0, 1, 0, 1]
+        }
+```
-# Define features with automatic dependency resolution
-features = [
-    "standard_scaled__mean_imputed__age",
-    "onehot_encoded__state",
-    "robust_scaled__weight"
-]
+**When to Use DataCreator vs Other Data Access Methods:**
-# Execute with automatic framework selection
-result = mlodaAPI.run_all(features, compute_frameworks={PandasDataframe})
-```
+- **DataCreator**: For testing, demos, synthetic data, or when you want to generate data programmatically within mloda
+- **File Access** (`DataAccessCollection` with files): When your data lives in CSV, JSON, Parquet, etc.
+- **Database Access** (`DataAccessCollection` with credentials): When connecting to SQL databases, data warehouses
+- **API Access**: When fetching data from REST APIs or other web services
-## 🔄 Write Once, Run Anywhere: Environments & Frameworks
+> **Key Insight**: DataCreator is mloda's built-in data generation tool - perfect for getting started without external dependencies. Once you're ready for production, switch to file or database access methods.
-**The Core Promise**: One plugin definition works across all environments and technologies.
+**Quick Start with Your Own Data:**
+```python
+# Replace DataCreator with real data access
+from mloda_core.abstract_plugins.components.data_access_collection import DataAccessCollection
-``` python
-# Traditional approach: Rebuild for each context
-def clean_age_training(data): ...      # Training pipeline
-def clean_age_testing(data): ...       # Testing pipeline
-def clean_age_production(data): ...    # Production API
-def clean_age_spark(data): ...         # Big data processing
-def clean_age_analysis(data): ...      # Analytics
+# For files
+data_access = DataAccessCollection(files={"your_data.csv"})
-# mloda approach: Write once, use everywhere
-class CleanAgePlugin(AbstractFeatureGroup):
-    @classmethod
-    def calculate_feature(cls, data, features):
-        # Single implementation for all contexts
-        return process_age_data(data["age"])
-# Same plugin, different environments & frameworks
-mlodaAPI.run_all(["clean_age"], compute_frameworks={PandasDataframe})  # Dev
-mlodaAPI.run_all(["clean_age"], compute_frameworks={SparkDataframe})   # Production
-mlodaAPI.run_all(["clean_age"], compute_frameworks={PolarsDataframe})  # High performance
-mlodaAPI.run_all(["clean_age"], compute_frameworks={DuckDBFramework})  # Analytics
+# For databases
+data_access = DataAccessCollection(
+    credential_dicts=[{"host": "your-db.com", "username": "user"}]
+)
 ```
-**Result**: 5+ implementations → 1 plugin that adapts automatically.
+### 3. Understanding What You Get Back
-### Different Data Scales, Same Processing Logic
+**The Result Structure**
-```mermaid
-graph TB
-    subgraph "📊 Data Scenarios"
-        CSV["📄 Development<br/>Small CSV files<br/>~1K rows"]
-        BATCH["🏋️ Training<br/>Full dataset<br/>~1M+ rows"]
-        SINGLE["⚡ Inference<br/>Single row<br/>Real-time"]
-        ANALYSIS["📈 Analysis<br/>Historical batch<br/>Post-deployment"]
-    end
-    subgraph "🎯 Same Features Applied"
-        RESULT["standard_scaled__mean_imputed__age<br/>onehot_encoded__state<br/>robust_scaled__weight<br/><br/>"]
-    end
-    CSV --> RESULT
-    BATCH --> RESULT
-    SINGLE --> RESULT
-    ANALYSIS --> RESULT
-    style CSV fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
-    style BATCH fill:#fff3e0,stroke:#f57c00,stroke-width:2px
-    style SINGLE fill:#e1f5fe,stroke:#0288d1,stroke-width:2px
-    style ANALYSIS fill:#fce4ec,stroke:#c2185b,stroke-width:2px
-    style RESULT fill:#e8f5e8,stroke:#4caf50,stroke-width:3px
+```python
+from mloda_core.api.request import mlodaAPI
+from mloda_plugins.compute_framework.base_implementations.pandas.dataframe import PandasDataframe
+result = mlodaAPI.run_all(features, compute_frameworks={PandasDataframe})
+# result is always a LIST of result objects
+data_list = result
+# Each object matches your compute framework type: pd.DataFrame, spark.DataFrame, etc.
+# Access your processed data
+data = result[0]  # Most common case: single result
+print(f"Shape: {data.shape}, Columns: {list(data.columns)}")
 ```
-## 🌍 Deploy Anywhere Python Runs
+> **Key Insight**: mloda returns a list of results. Most simple cases return a single DataFrame that you access with `result[0]`.
-**Universal Deployment**: mloda runs wherever Python runs - no special infrastructure needed.
+### 4. The Features Parameter
-| Environment | Use Case | Example |
-|-------------|----------|---------|
-| **💻 Local Development** | Prototyping & testing | Jupyter notebooks, VS Code |
-| **☁️ Any Cloud** | Production workloads | AWS, GCP, Azure, DigitalOcean |
-| **🏢 On-Premise** | Enterprise & compliance | Air-gapped environments |
-| **📊 Notebooks** | Data science workflows | Jupyter, Colab, Databricks |
-| **🌐 Web APIs** | Real-time serving | Flask, FastAPI, Django |
-| **⚙️ Orchestration** | Batch processing | Airflow, Prefect, Dagster |
-| **🐳 Containers** | Microservices | Docker, Kubernetes |
-| **⚡ Serverless** | Event-driven | AWS Lambda, Google Functions |
+**Feature Object Syntax**
-**No vendor lock-in. No special runtime. Just Python.**
+```python
+from mloda_core.abstract_plugins.components.feature import Feature
+from mloda_core.abstract_plugins.components.options import Options
+from mloda_core.abstract_plugins.plugin_loader.plugin_loader import PluginLoader
-## 🎯 Minimal Dependencies, Maximum Compatibility
+# Load all available plugins (required before using features)
+PluginLoader.all()
-**PyArrow-Only Core**: mloda uses only PyArrow as its core dependency - no other Python modules required.
+features = [
+    "age",                                    # Simple string
+    Feature(
+            "weight_replaced",
+            options=Options(
+                group={
+                    "imputation_method": "mean",
+                    "mloda_source_feature": "weight",
+                }
+            ),
+        ),
+    "onehot_encoded__state"                  # Chaining syntax
+]
+```
-**Why PyArrow?** It's the universal language of modern data:
-- **Interoperability**: Native bridge between Pandas, Polars, Spark, DuckDB
-- **Performance**: Zero-copy data sharing between frameworks
-- **Standards**: Apache Arrow is the foundation of modern data tools
-- **Future-Proof**: Industry standard for columnar data processing
+**Three Ways to Define Features:**
+- **Simple strings**: For basic columns like "age"
+- **Feature objects**: For explicit configuration and advanced options
+- **Chaining syntax**: Convenient shorthand for transformations
-This architectural choice enables mloda's seamless framework switching without dependency conflicts.
+### 5. Compute Frameworks
-## 🔧 Complete Data Processing Capabilities
+**Choose Your Processing Engine**
-**Beyond Feature Engineering**: mloda provides full data processing operations:
+```python
+# Different processing engines
+features = [
+    Feature("age", compute_framework=PandasDataframe.get_class_name()),
+    Feature("weight", compute_framework=PolarsDataframe.get_class_name()),
+]
-| Operation | Purpose | Example Use Case |
-|-----------|---------|------------------|
-| **🔗 Joins** | Combine datasets | User profiles + transaction history |
-| **🔀 Merges** | Consolidate data sources | Multiple feature tables into one |
-| **🔍 Filters** | Data selection & quality | Remove outliers, select time ranges |
-| **🏷️ Domain** | Data organization & governance | Logical data grouping and access control |
+# Mixed - familiar, extensive ecosystem
+result = mlodaAPI.run_all(features)
+```
-All operations work seamlessly across any compute framework with the same simple API.
+### 6. Data Access
-## 👥 Logical Role-Based Data Governance
+**Tell mloda Where Your Data Lives**
-**Clear Role Separation**: mloda logically splits data responsibilities into three distinct roles:
+```python
+from mloda_core.abstract_plugins.components.data_access_collection import DataAccessCollection
-| Role | Responsibility | Key Activities |
-|------|---------------|----------------|
-| **🏗️ Data Producer** | Create & maintain plugins | Define data access, implement feature groups, ensure quality |
-| **👤 Data User** | Consume features via API | Request features, configure workflows, build ML models |
-| **🛡️ Data Owner** | Governance & lifecycle | Control access, manage compliance, oversee data quality |
+# Configure data sources
+data_access = DataAccessCollection(
+    files={"data/customers.csv"},                    # Specific files
+    folders={"data/archive/"},                       # Entire directories
+    credential_dicts=[{"host": "db.example.com"}]    # Database credentials
+)
-**Organizational Clarity**: Each role has defined boundaries, enabling proper data governance while maintaining development flexibility. [Learn more about roles](https://mloda-ai.github.io/mloda/examples/mloda_basics/4_ml_data_producers_user_owner/)
+result = mlodaAPI.run_all(
+    features=["age", "standard_scaled__income"],
+    compute_frameworks={PandasDataframe},
+    data_access_collection=data_access
+)
+```
+> **Key Insight**: Configure data access once globally, and all features can use it automatically.
-## 🌐 Community-Driven Plugin Ecosystem
+### 7. Putting It All Together
-**Share Transformations, Keep Secrets**: Unlike traditional pipelines where business logic is embedded, mloda separates transformation patterns from business context.
+**Real-World Feature Engineering Pipeline**
-| Challenge | Traditional Pipelines | mloda Solution |
-|-----------|----------------------|----------------|
-| **🔒 Knowledge Sharing** | Business logic embedded - can't share | Transformations separated - safe to share |
-| **🔄 Reusability** | Rebuild common patterns everywhere | Community library of proven patterns |
-| **⚡ Innovation** | Everyone reinvents the wheel | Build on collective knowledge |
-| **🎯 Focus** | Waste time on solved problems | Focus on unique business value |
+```python
+# Complete mlodaAPI call
+result = mlodaAPI.run_all(
+    # What you want
+    features=[
+        "standard_scaled__age",
+        "onehot_encoded__state",
+        "mean_imputed__income"
+    ],
+    # How to process it
+    compute_frameworks={PandasDataframe},
+    # Where to get it
+    data_access_collection=DataAccessCollection(files={"data/customers.csv"})
+)
+# Get your results
+processed_data = result[0]
+print(f"✅ Created {len(processed_data.columns)} features from {len(processed_data)} rows")
+# Use in your ML pipeline
+from sklearn.model_selection import train_test_split
+X = processed_data.drop('target', axis=1)
+y = processed_data['target']
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
+```
-**Result**: A thriving ecosystem where data teams contribute transformation patterns while protecting their competitive advantages.
+> **🎉 You now understand mloda's core workflow!**
 ## 📖 Documentation
@@ -485,5 +474,4 @@ We welcome contributions! Whether you're building plugins, adding features, or i
 ## 📄 License
 This project is licensed under the [Apache License, Version 2.0](https://github.com/mloda-ai/mloda/blob/main/LICENSE.TXT).
 ---
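
A METADATA diff like the one above can be reproduced locally with the Python standard library, since a wheel is a zip archive with its metadata stored under `<name>-<version>.dist-info/METADATA`. The following is a minimal sketch, not the tool that produced this page: the helper names `read_wheel_metadata`, `metadata_diff`, and `_fake_wheel` are hypothetical, and the in-memory archives are tiny stand-ins for the real wheels, which you would pass as file paths instead.

```python
import difflib
import io
import zipfile


def read_wheel_metadata(wheel, name, version):
    """Return the METADATA text from a wheel (file path or file-like object)."""
    member = f"{name}-{version}.dist-info/METADATA"
    with zipfile.ZipFile(wheel) as archive:
        return archive.read(member).decode("utf-8")


def metadata_diff(old_text, new_text, old_label, new_label):
    """Return a unified diff of two METADATA texts as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


def _fake_wheel(name, version):
    """Build a tiny in-memory stand-in for a wheel (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            f"{name}-{version}.dist-info/METADATA",
            f"Metadata-Version: 2.4\nName: {name}\nVersion: {version}\n",
        )
    buffer.seek(0)
    return buffer


# With real wheels, replace the _fake_wheel(...) calls with paths to the .whl files.
old = read_wheel_metadata(_fake_wheel("mloda", "0.2.11"), "mloda", "0.2.11")
new = read_wheel_metadata(_fake_wheel("mloda", "0.2.13"), "mloda", "0.2.13")
diff_lines = metadata_diff(
    old,
    new,
    "mloda-0.2.11.dist-info/METADATA",
    "mloda-0.2.13.dist-info/METADATA",
)
print("\n".join(diff_lines))
```

Because the README is embedded in METADATA as the long description, the same comparison on the real wheels would also surface the README rewrite shown in the hunks above.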
