PyPI - additory - Versions diffs - 0.1.0a1__tar.gz → 0.1.0a3__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (129)

{additory-0.1.0a1 → additory-0.1.0a3}/PKG-INFO RENAMED

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: additory
-Version: 0.1.0a1
-Summary: A semantic, extensible dataframe transformation engine with expressions, lookup, synthetic data, and sample-data support.
+Version: 0.1.0a3
+Summary: A semantic, extensible dataframe transformation engine with expressions, lookup, and synthetic data generation support.
 Author: Krishnamoorthy Sankaran
 License: MIT
 Project-URL: homepage, https://github.com/sekarkrishna/additory
@@ -13,11 +13,14 @@ Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: pandas>=1.5
 Requires-Dist: polars>=0.20
+Requires-Dist: pyarrow>=10.0
 Requires-Dist: pyyaml>=6.0
 Requires-Dist: requests>=2.31
 Requires-Dist: toml>=0.10
 Requires-Dist: scipy>=1.9
 Requires-Dist: numpy>=1.21
+Requires-Dist: packaging>=21.0
+Requires-Dist: psutil>=5.8
 Provides-Extra: gpu
 Requires-Dist: cudf>=24.0; extra == "gpu"
 Provides-Extra: dev
@@ -32,11 +35,11 @@ Dynamic: license-file
 # Additory
-**A semantic, extensible dataframe transformation engine with expressions, lookup, synthetic data, and sample-data support.**
+**A semantic, extensible dataframe transformation engine with expressions, lookup, and augmentation support.**
 [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-[![Version](https://img.shields.io/badge/version-0.1.0a1-orange.svg)](https://github.com/sekarkrishna/additory/tree/main/V0.1.0a1/)
+[![Version](https://img.shields.io/badge/version-0.1.0a2-orange.svg)](https://github.com/sekarkrishna/additory/tree/main/V0.1.0a1/)
 **Author:** Krishnamoorthy Sankaran
@@ -49,17 +52,17 @@ Dynamic: license-file
 ## 📦 Installation
 ```bash
-pip install additory==0.1.0a1
+pip install additory==0.1.0a2
 ```
 **Optional GPU support:**
 ```bash
-pip install additory[gpu]==0.1.0a1  # Includes cuDF for GPU acceleration
+pip install additory[gpu]==0.1.0a2  # Includes cuDF for GPU acceleration
 ```
 **Development installation:**
 ```bash
-pip install additory[dev]==0.1.0a1  # Includes testing and development tools
+pip install additory[dev]==0.1.0a2  # Includes testing and development tools
 ```
 ## 🎯 Core Functions
@@ -68,7 +71,6 @@ pip install additory[dev]==0.1.0a1  # Includes testing and development tools
 |----------|---------|---------|
 | `add.to()` | Lookup/join operations | `add.to(df1, from_df=df2, bring='col', against='key')` |
 | `add.augment()` | Generate additional data | `add.augment(df, n_rows=1000)` |
-| `add.synth()` | Synthetic data from schemas | `add.synth("schema.toml", rows=5000)` |
 | `add.scan()` | Data profiling & analysis | `add.scan(df, preset="full")` |
 ## 🧬 Available Expressions
@@ -191,13 +193,9 @@ patients_with_bsa = add.bsa(patients)
 result = add.fitness_score(add.bmr(add.bmi(patients)))
 ```
-### 🔄 Augment and Synthetic Data
+### 🔄 Augment Data Generation
-**Augment** generates more data similar to your existing dataset, while **Synthetic** creates entirely new datasets from schema definitions.
-**Key Differences:**
-- **Augment**: Learns patterns from existing data to create similar rows
-- **Synthetic**: Uses predefined schemas to generate structured data
+**Augment** generates additional data similar to your existing dataset using inline strategies.
 ```python
 # Augment existing data (learns from patterns)
@@ -209,9 +207,6 @@ new_data = add.augment("@new", n_rows=500, strategy={
     'name': 'choice:[John,Jane,Bob]',
     'age': 'range:18-65'
 })
-# Generate from schema file (structured approach)
-customers = add.synth("customer_schema.toml", rows=10000)
 ```
 ## 🧪 Examples
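
The PKG-INFO hunk above adds three `Requires-Dist` entries (`pyarrow>=10.0`, `packaging>=21.0`, `psutil>=5.8`). As a rough sketch, and not something shipped in the package, the following reports which of those distributions are present in the current environment using only the standard library. The helper name `installed_versions` and the list `NEW_REQUIREMENTS` are made up for illustration, and the sketch does not check the minimum versions.

```python
from importlib import metadata

# Distributions added as Requires-Dist between 0.1.0a1 and 0.1.0a3 (from the diff).
NEW_REQUIREMENTS = ["pyarrow", "packaging", "psutil"]


def installed_versions(names):
    """Map each distribution name to its installed version string, or None if absent.

    Hypothetical helper for illustration; it only reports presence and does
    not compare against the minimum versions listed in PKG-INFO.
    """
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found
```

Calling `installed_versions(NEW_REQUIREMENTS)` returns a dictionary keyed by distribution name; a `None` value means the distribution is not installed.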

{additory-0.1.0a1 → additory-0.1.0a3}/README.md RENAMED

@@ -1,10 +1,10 @@
 # Additory
-**A semantic, extensible dataframe transformation engine with expressions, lookup, synthetic data, and sample-data support.**
+**A semantic, extensible dataframe transformation engine with expressions, lookup, and augmentation support.**
 [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-[![Version](https://img.shields.io/badge/version-0.1.0a1-orange.svg)](https://github.com/sekarkrishna/additory/tree/main/V0.1.0a1/)
+[![Version](https://img.shields.io/badge/version-0.1.0a2-orange.svg)](https://github.com/sekarkrishna/additory/tree/main/V0.1.0a1/)
 **Author:** Krishnamoorthy Sankaran
@@ -17,17 +17,17 @@
 ## 📦 Installation
 ```bash
-pip install additory==0.1.0a1
+pip install additory==0.1.0a2
 ```
 **Optional GPU support:**
 ```bash
-pip install additory[gpu]==0.1.0a1  # Includes cuDF for GPU acceleration
+pip install additory[gpu]==0.1.0a2  # Includes cuDF for GPU acceleration
 ```
 **Development installation:**
 ```bash
-pip install additory[dev]==0.1.0a1  # Includes testing and development tools
+pip install additory[dev]==0.1.0a2  # Includes testing and development tools
 ```
 ## 🎯 Core Functions
@@ -36,7 +36,6 @@ pip install additory[dev]==0.1.0a1  # Includes testing and development tools
 |----------|---------|---------|
 | `add.to()` | Lookup/join operations | `add.to(df1, from_df=df2, bring='col', against='key')` |
 | `add.augment()` | Generate additional data | `add.augment(df, n_rows=1000)` |
-| `add.synth()` | Synthetic data from schemas | `add.synth("schema.toml", rows=5000)` |
 | `add.scan()` | Data profiling & analysis | `add.scan(df, preset="full")` |
 ## 🧬 Available Expressions
@@ -159,13 +158,9 @@ patients_with_bsa = add.bsa(patients)
 result = add.fitness_score(add.bmr(add.bmi(patients)))
 ```
-### 🔄 Augment and Synthetic Data
+### 🔄 Augment Data Generation
-**Augment** generates more data similar to your existing dataset, while **Synthetic** creates entirely new datasets from schema definitions.
-**Key Differences:**
-- **Augment**: Learns patterns from existing data to create similar rows
-- **Synthetic**: Uses predefined schemas to generate structured data
+**Augment** generates additional data similar to your existing dataset using inline strategies.
 ```python
 # Augment existing data (learns from patterns)
@@ -177,9 +172,6 @@ new_data = add.augment("@new", n_rows=500, strategy={
     'name': 'choice:[John,Jane,Bob]',
     'age': 'range:18-65'
 })
-# Generate from schema file (structured approach)
-customers = add.synth("customer_schema.toml", rows=10000)
 ```
 ## 🧪 Examples

{additory-0.1.0a1 → additory-0.1.0a3}/additory/__init__.py RENAMED

@@ -2,6 +2,9 @@
 from .dynamic_api import add as _api_instance
+# Version information
+__version__ = "0.1.0a3"
 # Expose the API instance normally
 add = _api_instance
@@ -12,4 +15,5 @@ def __getattr__(name):
 __all__ = [
     "add",
+    "__version__",
 ]

{additory-0.1.0a1 → additory-0.1.0a3}/additory/common/__init__.py RENAMED

@@ -1,14 +1,14 @@
 """
 Common Utilities Module
-Shared functionality used by both augment and synthetic modules:
+Shared functionality used by both synthetic and expressions modules:
 - Distribution functions (normal, uniform, skewed, etc.)
 - List file management (.list format)
 - Pattern file management (.properties format)
 - Fallback resolution logic
 This module eliminates code duplication and provides consistent behavior
-across augment and synthetic data generation.
+across synthetic and expression data generation.
 """
 from .distributions import (

{additory-0.1.0a1 → additory-0.1.0a3}/additory/common/backend.py RENAMED

@@ -180,11 +180,14 @@ def get_arrow_bridge():
         - Use for all cross-backend conversions
         - Handles pandas/polars/cuDF via Arrow
     """
-    from additory.core.backends.arrow_bridge import EnhancedArrowBridge
+    from additory.core.backends.arrow_bridge import EnhancedArrowBridge, ArrowBridgeError
     # Singleton pattern
     if not hasattr(get_arrow_bridge, '_instance'):
-        get_arrow_bridge._instance = EnhancedArrowBridge()
+        try:
+            get_arrow_bridge._instance = EnhancedArrowBridge()
+        except ArrowBridgeError:
+            get_arrow_bridge._instance = None
     return get_arrow_bridge._instance
@@ -194,7 +197,7 @@ def to_polars(df: Any, backend_type: BackendType = None) -> 'pl.DataFrame':
     Convert any dataframe to Polars via Arrow bridge.
     This is the primary conversion function for the Polars-only architecture.
-    All operations (expressions, augment, etc.) use this to convert input
+    All operations (expressions, synthetic, etc.) use this to convert input
     dataframes to Polars for processing.
     Args:
@@ -224,7 +227,7 @@ def to_polars(df: Any, backend_type: BackendType = None) -> 'pl.DataFrame':
         )
     # Fast path: already Polars
-    if isinstance(df, pl.DataFrame):
+    if HAS_POLARS and isinstance(df, pl.DataFrame):
         return df
     # Validate input
@@ -240,6 +243,13 @@ def to_polars(df: Any, backend_type: BackendType = None) -> 'pl.DataFrame':
     # Convert via Arrow bridge
     try:
         bridge = get_arrow_bridge()
+        if bridge is None:
+            # Fallback: direct conversion for pandas
+            if backend_type == "pandas":
+                if isinstance(df, pd.DataFrame):
+                    return pl.from_pandas(df)
+            raise RuntimeError("Arrow bridge not available and cannot convert non-pandas DataFrame")
         arrow_table = bridge.to_arrow(df, backend_type)
         pl_df = bridge.from_arrow(arrow_table, "polars")
         return pl_df
@@ -309,6 +319,12 @@ def from_polars(pl_df: 'pl.DataFrame', target_backend: BackendType) -> Any:
     # Convert via Arrow bridge
     try:
         bridge = get_arrow_bridge()
+        if bridge is None:
+            # Fallback: direct conversion for pandas
+            if target_backend == "pandas":
+                return pl_df.to_pandas()
+            raise RuntimeError("Arrow bridge not available and cannot convert to non-pandas DataFrame")
         arrow_table = bridge.to_arrow(pl_df, "polars")
         result_df = bridge.from_arrow(arrow_table, target_backend)
         return result_df
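
The backend.py hunks above combine two ideas: a singleton cached on a function attribute that may end up as `None` when construction fails, and a direct pandas-only path taken when the bridge is `None`. A minimal sketch of that control flow follows, using stand-in classes so it runs without pandas, Polars, or pyarrow. `Bridge`, `BridgeError`, `get_bridge`, and `convert` are hypothetical names chosen for illustration and are not the package's API.

```python
class BridgeError(Exception):
    """Stand-in for ArrowBridgeError (hypothetical)."""


class Bridge:
    """Stand-in for EnhancedArrowBridge; construction fails when `available` is False."""

    available = True

    def __init__(self):
        if not Bridge.available:
            raise BridgeError("Arrow backend missing")

    def convert(self, data):
        return ("via-bridge", data)


def get_bridge():
    """Return the cached bridge, or None if constructing it failed."""
    # The instance (or None) is stored on the function object itself.
    if not hasattr(get_bridge, "_instance"):
        try:
            get_bridge._instance = Bridge()
        except BridgeError:
            get_bridge._instance = None
    return get_bridge._instance


def convert(data, backend_type):
    """Convert through the bridge, falling back to a direct path for pandas only."""
    bridge = get_bridge()
    if bridge is None:
        if backend_type == "pandas":
            return ("direct", data)  # stands in for pl.from_pandas / to_pandas
        raise RuntimeError("bridge not available and cannot convert non-pandas input")
    return bridge.convert(data)
```

One consequence visible in this sketch, and in the diff's `hasattr` check: a failed construction is cached as well, so later calls return `None` without retrying until the cached attribute is cleared.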

{additory-0.1.0a1 → additory-0.1.0a3}/additory/common/distributions.py RENAMED

@@ -1,5 +1,5 @@
 """
-Distribution Strategies for Data Augmentation
+Distribution Strategies for Synthetic Data Generation
 Provides statistical distribution-based data generation:
 - Normal (Gaussian) distribution

{additory-0.1.0a1 → additory-0.1.0a3}/additory/common/sample_data.py RENAMED

@@ -8,8 +8,8 @@ loaded on-demand using the existing .add file parser.
 Usage:
     from additory.common.sample_data import get_sample_dataset
-    # For augment
-    df = get_sample_dataset("augment", "sample")
+    # For synthetic
+    df = get_sample_dataset("synthetic", "sample")
     # For expressions (future)
     df = get_sample_dataset("expressions", "sample")
@@ -25,7 +25,7 @@ from additory.common.exceptions import ValidationError
 def get_sample_dataset(
-    module: str = "augment",
+    module: str = "synthetic",
     block: str = "sample",
     dataset_type: str = "clean"
 ) -> pl.DataFrame:
@@ -33,12 +33,12 @@ def get_sample_dataset(
     Load a sample dataset from .add files.
     This function provides centralized access to sample datasets across
-    all additory modules (augment, expressions, utilities). Sample datasets
+    all additory modules (synthetic, expressions, utilities). Sample datasets
     are stored as .add files in the reference/ directory structure.
     Args:
-        module: Module name ("augment", "expressions", "utilities")
-        block: Block name within the .add file ("sample" for augment)
+        module: Module name ("synthetic", "expressions", "utilities")
+        block: Block name within the .add file ("sample" for synthetic)
         dataset_type: Type of sample data ("clean" or "unclean")
     Returns:
@@ -48,8 +48,8 @@ def get_sample_dataset(
         ValidationError: If module, block, or dataset_type not found
     Examples:
-        >>> # Load augment sample dataset
-        >>> df = get_sample_dataset("augment", "sample")
+        >>> # Load synthetic sample dataset
+        >>> df = get_sample_dataset("synthetic", "sample")
         >>> print(df.shape)
         (50, 10)
@@ -57,7 +57,7 @@ def get_sample_dataset(
         >>> df = get_sample_dataset("expressions", "sample", "clean")
         >>> df_unclean = get_sample_dataset("expressions", "sample", "unclean")
-    Sample Dataset Structure (augment):
+    Sample Dataset Structure (synthetic):
         - id: Sequential numeric IDs (1-50)
         - emp_id: Employee IDs with pattern (EMP_001 - EMP_050)
         - order_id: Order IDs with different padding (ORD_0001 - ORD_0050)
@@ -72,8 +72,8 @@ def get_sample_dataset(
     # Construct path to .add file
     base_path = Path(__file__).parent.parent.parent / "reference"
-    if module == "augment":
-        add_file_path = base_path / "augment_definitions" / f"{block}_0.1.add"
+    if module == "synthetic":
+        add_file_path = base_path / "synthetic_definitions" / f"{block}_0.1.add"
     elif module == "expressions":
         add_file_path = base_path / "expressions_definitions" / f"{block}_0.1.add"
     elif module == "utilities":
@@ -81,7 +81,7 @@ def get_sample_dataset(
     else:
         raise ValidationError(
             f"Unknown module '{module}'. "
-            f"Valid modules: augment, expressions, utilities"
+            f"Valid modules: synthetic, expressions, utilities"
         )
     # Check if file exists
@@ -141,7 +141,7 @@ def list_available_samples() -> dict:
         >>> samples = list_available_samples()
         >>> print(samples)
         {
-            'augment': ['sample'],
+            'synthetic': ['sample'],
             'expressions': ['sample'],
             'utilities': []
         }
@@ -149,15 +149,15 @@ def list_available_samples() -> dict:
     base_path = Path(__file__).parent.parent.parent / "reference"
     available = {}
-    # Check augment
-    augment_path = base_path / "augment_definitions"
-    if augment_path.exists():
-        available['augment'] = [
+    # Check synthetic
+    synthetic_path = base_path / "synthetic_definitions"
+    if synthetic_path.exists():
+        available['synthetic'] = [
             f.stem.rsplit('_', 1)[0]  # Remove version suffix
-            for f in augment_path.glob("*.add")
+            for f in synthetic_path.glob("*.add")
         ]
     else:
-        available['augment'] = []
+        available['synthetic'] = []
     # Check expressions
     expressions_path = base_path / "expressions_definitions"
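
The sample_data.py hunks above build a path of the form `<module>_definitions/<block>_0.1.add` and later recover the block name with `f.stem.rsplit('_', 1)[0]`. The round trip can be sketched with `pathlib` alone. The function names `add_file_for` and `block_name` are hypothetical, only the two modules whose directory names appear in the diff are mapped, and `ValueError` stands in for the package's `ValidationError`.

```python
from pathlib import Path

# Directory names as they appear in the diff; other modules are not shown there.
DEFINITION_DIRS = {
    "synthetic": "synthetic_definitions",
    "expressions": "expressions_definitions",
}


def add_file_for(base: Path, module: str, block: str) -> Path:
    """Build the .add file path for a module and block (hypothetical helper)."""
    if module not in DEFINITION_DIRS:
        # Stand-in for the ValidationError raised in the package.
        raise ValueError(f"Unknown module '{module}'")
    return base / DEFINITION_DIRS[module] / f"{block}_0.1.add"


def block_name(path: Path) -> str:
    """Strip the extension and the trailing version suffix from an .add file name."""
    # "sample_0.1.add" has stem "sample_0.1"; splitting once from the right
    # on "_" leaves "sample" and drops "0.1".
    return path.stem.rsplit("_", 1)[0]
```

Because the split happens once from the right, a block name that itself contains underscores (for example `my_block_0.1.add`) keeps everything before the final underscore.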

{additory-0.1.0a1 → additory-0.1.0a3}/additory/core/backends/arrow_bridge.py RENAMED

@@ -16,6 +16,13 @@ try:
 except ImportError as e:
     ARROW_AVAILABLE = False
     IMPORT_ERROR = str(e)
+    # Create dummy classes for type annotations
+    class pa:
+        Table = Any
+    class pl:
+        DataFrame = Any
+    class pd:
+        DataFrame = Any
 from ..logging import log_info, log_warning
 from .cudf_bridge import get_cudf_bridge
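
The arrow_bridge.py hunk above defines placeholder classes in the `except ImportError` branch so that annotations such as `pa.Table` still resolve when the optional dependency is missing. A small sketch of the same idea follows; it runs whether or not pyarrow is installed. The function `row_count` is a hypothetical example and is not part of the package.

```python
from typing import Any

try:
    import pyarrow as pa
    ARROW_AVAILABLE = True
except ImportError:
    ARROW_AVAILABLE = False

    # Placeholder so that `pa.Table` in annotations below still resolves.
    class pa:
        Table = Any


def row_count(table: pa.Table) -> int:
    """Return the number of rows, refusing to run without the real library."""
    if not ARROW_AVAILABLE:
        raise RuntimeError("pyarrow is not installed")
    return table.num_rows
```

The placeholder only keeps the module importable. Any code path that needs the real library still has to check the availability flag first, as `row_count` does here.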

{additory-0.1.0a1 → additory-0.1.0a3}/additory/core/polars_expression_engine.py RENAMED

@@ -32,7 +32,10 @@ class PolarsExpressionEngine:
     """Exclusive Polars-based expression processing engine"""
     def __init__(self):
-        self.arrow_bridge = EnhancedArrowBridge()
+        try:
+            self.arrow_bridge = EnhancedArrowBridge()
+        except ArrowBridgeError:
+            self.arrow_bridge = None
         self.execution_stats = {
             "total_executions": 0,
             "total_time_ms": 0.0,
@@ -68,14 +71,28 @@ class PolarsExpressionEngine:
             try:
                 # Auto-detect backend if not specified
                 if backend_type is None:
-                    backend_type = self.arrow_bridge.detect_backend(df)
+                    if self.arrow_bridge:
+                        backend_type = self.arrow_bridge.detect_backend(df)
+                    else:
+                        backend_type = "pandas"  # fallback
                 # Get memory usage before processing
-                memory_before = self.arrow_bridge._get_memory_usage_mb()
+                if self.arrow_bridge:
+                    memory_before = self.arrow_bridge._get_memory_usage_mb()
+                else:
+                    memory_before = 0
                 # 1. Convert input to Arrow
                 log_info(f"[polars_engine] Converting {backend_type} to Arrow")
-                arrow_table = self.arrow_bridge.to_arrow(df, backend_type)
+                if self.arrow_bridge:
+                    arrow_table = self.arrow_bridge.to_arrow(df, backend_type)
+                else:
+                    # Fallback: assume pandas and convert directly
+                    import pandas as pd
+                    if isinstance(df, pd.DataFrame):
+                        arrow_table = pl.from_pandas(df).to_arrow()
+                    else:
+                        raise RuntimeError("Arrow bridge not available and input is not pandas DataFrame")
                 # 2. Convert Arrow to Polars
                 log_info("[polars_engine] Converting Arrow to Polars")
@@ -93,11 +110,18 @@ class PolarsExpressionEngine:
                 # 5. Convert to original backend format
                 log_info(f"[polars_engine] Converting Arrow to {backend_type}")
-                final_result = self.arrow_bridge.from_arrow(result_arrow, backend_type)
+                if self.arrow_bridge:
+                    final_result = self.arrow_bridge.from_arrow(result_arrow, backend_type)
+                else:
+                    # Fallback: convert back to pandas
+                    final_result = pl.from_arrow(result_arrow).to_pandas()
                 # Calculate execution statistics
                 execution_time = (datetime.now() - start_time).total_seconds() * 1000
-                memory_after = self.arrow_bridge._get_memory_usage_mb()
+                if self.arrow_bridge:
+                    memory_after = self.arrow_bridge._get_memory_usage_mb()
+                else:
+                    memory_after = 0
                 memory_used = max(0, memory_after - memory_before)
                 # Update global statistics
@@ -122,7 +146,8 @@ class PolarsExpressionEngine:
             finally:
                 # 6. Always cleanup Arrow memory
-                self.arrow_bridge.cleanup_arrow_memory()
+                if self.arrow_bridge:
+                    self.arrow_bridge.cleanup_arrow_memory()
     def _execute_polars_expression(self, polars_df: pl.DataFrame,
                                  expression: str, output_column: str) -> pl.DataFrame:
@@ -381,14 +406,28 @@ class PolarsExpressionEngine:
         try:
             # Auto-detect backend if not specified
             if backend_type is None:
-                backend_type = self.arrow_bridge.detect_backend(df)
+                if self.arrow_bridge:
+                    backend_type = self.arrow_bridge.detect_backend(df)
+                else:
+                    backend_type = "pandas"
             # Get memory usage before processing
-            memory_before = self.arrow_bridge._get_memory_usage_mb()
+            if self.arrow_bridge:
+                memory_before = self.arrow_bridge._get_memory_usage_mb()
+            else:
+                memory_before = 0
             # Convert to Polars via Arrow
-            arrow_table = self.arrow_bridge.to_arrow(df, backend_type)
-            polars_df = pl.from_arrow(arrow_table)
+            if self.arrow_bridge:
+                arrow_table = self.arrow_bridge.to_arrow(df, backend_type)
+                polars_df = pl.from_arrow(arrow_table)
+            else:
+                # Fallback: assume pandas
+                import pandas as pd
+                if isinstance(df, pd.DataFrame):
+                    polars_df = pl.from_pandas(df)
+                else:
+                    raise RuntimeError("Arrow bridge not available and input is not pandas DataFrame")
             # Execute using AST
             polars_expr = self._ast_to_polars_expr(ast_tree)
@@ -396,11 +435,17 @@ class PolarsExpressionEngine:
             # Convert back to original format
             result_arrow = result_df.to_arrow()
-            final_result = self.arrow_bridge.from_arrow(result_arrow, backend_type)
+            if self.arrow_bridge:
+                final_result = self.arrow_bridge.from_arrow(result_arrow, backend_type)
+            else:
+                final_result = pl.from_arrow(result_arrow).to_pandas()
             # Calculate statistics
             execution_time = (datetime.now() - start_time).total_seconds() * 1000
-            memory_after = self.arrow_bridge._get_memory_usage_mb()
+            if self.arrow_bridge:
+                memory_after = self.arrow_bridge._get_memory_usage_mb()
+            else:
+                memory_after = 0
             memory_used = max(0, memory_after - memory_before)
             # Update statistics
@@ -422,7 +467,8 @@ class PolarsExpressionEngine:
             raise PolarsExpressionError(f"AST execution failed: {e}")
         finally:
-            self.arrow_bridge.cleanup_arrow_memory()
+            if self.arrow_bridge:
+                self.arrow_bridge.cleanup_arrow_memory()
     def validate_expression(self, expression: str) -> bool:
         """
@@ -489,7 +535,10 @@ class PolarsExpressionEngine:
             Benchmark results
         """
         times = []
-        backend_type = self.arrow_bridge.detect_backend(df)
+        if self.arrow_bridge:
+            backend_type = self.arrow_bridge.detect_backend(df)
+        else:
+            backend_type = "pandas"
         for i in range(iterations):
             try:
@@ -532,7 +581,8 @@ class PolarsExpressionEngine:
         """Cleanup callback for memory manager"""
         try:
             # Cleanup Arrow bridge memory
-            self.arrow_bridge.cleanup_arrow_memory()
+            if self.arrow_bridge:
+                self.arrow_bridge.cleanup_arrow_memory()
             # Reset statistics if they get too large
             if self.execution_stats["total_executions"] > 10000:
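
The polars_expression_engine.py hunks above repeat one guard many times: use the bridge if it exists, otherwise report memory as 0 and skip cleanup. The sketch below shows that behaviour with stand-in classes. The diff writes the guard inline at each call site; folding it into a single `_memory_mb` helper is a restructuring made for this sketch, not the package's code. `FakeBridge`, `Engine`, and `run` are hypothetical names.

```python
from typing import Callable, Optional


class FakeBridge:
    """Hypothetical stand-in for the Arrow bridge."""

    def __init__(self):
        self.cleaned = 0
        self._mb = 10.0

    def memory_mb(self) -> float:
        # Pretend memory grows by 5 MB on every measurement.
        self._mb += 5.0
        return self._mb

    def cleanup(self) -> None:
        self.cleaned += 1


class Engine:
    """Runs work with an optional bridge, mirroring the guards in the diff."""

    def __init__(self, bridge: Optional[FakeBridge]):
        self.bridge = bridge

    def _memory_mb(self) -> float:
        # Without a bridge, memory usage is reported as 0, as in the diff.
        return self.bridge.memory_mb() if self.bridge else 0

    def run(self, work: Callable[[], object]) -> dict:
        try:
            before = self._memory_mb()
            result = work()
            after = self._memory_mb()
            return {"result": result, "memory_used_mb": max(0, after - before)}
        finally:
            # Cleanup runs on success and on failure, but only if a bridge exists.
            if self.bridge:
                self.bridge.cleanup()
```

With `Engine(None)` the reported memory delta is always 0, so statistics gathered in fallback mode say nothing about actual memory use.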
