PyPI - rand-engine - Versions diffs - 0.6.2__tar.gz → 0.6.3__tar.gz - Mend

rand-engine 0.6.2__tar.gz → 0.6.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (42)

rand_engine-0.6.3/PKG-INFO ADDED

@@ -0,0 +1,397 @@
+Metadata-Version: 2.4
+Name: rand-engine
+Version: 0.6.3
+Summary: Rand Engine v2. Package with some methods to generate random data in different formats. Great to mock data while testing or developing.
+Author: marcoaureliomenezes
+Author-email: marcoaurelioreislima@gmail.com
+Requires-Python: >=3.10,<4.0
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Requires-Dist: duckdb (>=1.4.1,<2.0.0)
+Requires-Dist: fastavro (>=1.10.0,<2.0.0)
+Requires-Dist: fastparquet (>=2024.11.0,<2025.0.0)
+Requires-Dist: numpy (>=2.1.1,<3.0.0)
+Requires-Dist: pandas (>=2.2.2,<3.0.0)
+Requires-Dist: pyarrow (>=19.0.0,<20.0.0)
+Project-URL: Repository, https://github.com/marcoaureliomenezes/rand_engine
+Description-Content-Type: text/markdown
+<div align="center">
+# 🎲 Rand Engine
+**Generate millions of rows of synthetic data in seconds**
+*High-performance random data generation for testing, development, and prototyping*
+[![Python](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
+[![Tests](https://img.shields.io/badge/tests-494%20passing-brightgreen.svg)]()
+[![License](https://img.shields.io/badge/license-MIT-blue.svg)]()
+[![Version](https://img.shields.io/badge/version-0.6.3-orange.svg)](https://pypi.org/project/rand-engine/)
+[![PyPI](https://img.shields.io/badge/PyPI-rand--engine-blue.svg)](https://pypi.org/project/rand-engine/)
+[Quick Start](#-quick-start) • [Features](#-key-features) • [Examples](#-usage-examples) • [Documentation](#-documentation) • [Benchmarks](#-performance-benchmarks)
+</div>
+---
+## 🎯 What is Rand Engine?
+**Rand Engine** is a Python library that generates **realistic synthetic data at scale** through simple declarative specifications. Built on NumPy and Pandas for maximum performance.
+**Perfect for:**
+- 🧪 Testing ETL/ELT pipelines without production data
+- 📊 Load testing and stress testing data systems
+- 🎓 Learning data engineering without complex setups
+- 🚀 Prototyping applications with realistic datasets
+- 🔐 Demos and POCs without exposing sensitive data
+---
+## 🚀 Quick Start
+### Installation
+```bash
+pip install rand-engine
+```
+### Generate Your First Dataset (3 Lines!)
+```python
+from rand_engine.main.data_generator import DataGenerator
+from rand_engine.examples.common_rand_specs import CommonRandSpecs
+# Generate 1 million customer records in seconds
+df = DataGenerator(CommonRandSpecs.customers(), seed=42).size(1_000_000).get_df()
+print(df.head())
+```
+**Output:**
+```
+   customer_id  age           city  total_spent  is_premium registration_date
+0    uuid-001    42      São Paulo      1523.50        True        2023-05-12
+1    uuid-002    28  Rio de Janeiro       872.33       False        2024-01-08
+2    uuid-003    56  Belo Horizonte      4215.89       False        2022-11-23
+```
+**That's it!** You just generated 1 million rows of realistic customer data. 🎉
+---
+## ✨ Key Features
+<table>
+<tr>
+<td width="50%">
+### 🐼 **Pandas DataFrames**
+```python
+from rand_engine.main.data_generator import DataGenerator
+df = DataGenerator(spec, seed=42).size(1_000_000).get_df()
+```
+✅ All methods (common + advanced)
+✅ Correlated columns
+✅ Complex patterns
+✅ PK/FK constraints
+</td>
+<td width="50%">
+### ⚡ **Spark DataFrames**
+```python
+from pyspark.sql import functions as F
+from rand_engine.main.spark_generator import SparkGenerator
+df = SparkGenerator(spark, F, spec).size(100_000_000).get_df()
+```
+✅ Native Spark generation
+✅ Databricks ready
+✅ Distributed at scale
+⚠️ Common methods only
+</td>
+</tr>
+</table>
+### 🎁 **17+ Pre-Built RandSpecs**
+No configuration needed! Start generating data immediately:
+| **CommonRandSpecs** (Work Everywhere) | **AdvancedRandSpecs** (Pandas Only) |
+|---------------------------------------|-------------------------------------|
+| `customers()` `products()` `orders()` | `employees()` `devices()` `invoices()` |
+| `transactions()` `sensors()` `users()` | `shipments()` `network_devices()` `vehicles()` |
+|  | `real_estate()` `healthcare()` |
+```python
+# Use any pre-built spec instantly
+from rand_engine.main.data_generator import DataGenerator
+from rand_engine.examples.common_rand_specs import CommonRandSpecs
+from rand_engine.examples.advanced_rand_specs import AdvancedRandSpecs
+df_orders = DataGenerator(CommonRandSpecs.orders()).size(50_000).get_df()
+df_employees = DataGenerator(AdvancedRandSpecs.employees()).size(1_000).get_df()
+```
+### 📝 **Write to Files**
+```python
+# Write to CSV, Parquet, JSON with compression
+DataGenerator(spec).size(1_000_000).write() \
+    .format("parquet") \
+    .compression("snappy") \
+    .mode("overwrite") \
+    .save("./data/customers")
+```
+### 🌊 **Stream Data**
+```python
+# Simulate real-time data streams
+DataGenerator(spec).stream() \
+    .throughput(min=1000, max=5000) \
+    .format("json") \
+    .start("./data/stream/events")
+```
+---
+## 💡 Usage Examples
+### 1️⃣ **Local Development (Pandas)**
+```python
+from rand_engine.main.data_generator import DataGenerator
+from rand_engine.examples.common_rand_specs import CommonRandSpecs
+# Generate and explore
+df = DataGenerator(CommonRandSpecs.transactions(), seed=42).size(100_000).get_df()
+print(df.describe())
+```
+### 2️⃣ **Databricks / Spark Environments**
+```python
+from rand_engine.main.spark_generator import SparkGenerator
+from rand_engine.examples.common_rand_specs import CommonRandSpecs
+from pyspark.sql import functions as F
+# Generate Spark DataFrame with 100M rows
+df_spark = SparkGenerator(spark, F, CommonRandSpecs.orders()).size(100_000_000).get_df()
+# Write to Delta Lake
+df_spark.write.format("delta").mode("overwrite").save("/path/to/delta/table")
+```
+### 3️⃣ **Custom Specifications**
+```python
+# Define your own data structure
+custom_spec = {
+    "user_id": {
+        "method": "unique_ids",
+        "kwargs": {"strategy": "uuid4"}
+    },
+    "age": {
+        "method": "integers",
+        "kwargs": {"min": 18, "max": 80}
+    },
+    "salary": {
+        "method": "floats",
+        "kwargs": {"min": 30000, "max": 150000, "round": 2}
+    }
+}
+df = DataGenerator(custom_spec).size(50_000).get_df()
+```
+📖 **Learn more:** [BUILD_RAND_SPECS.md](./docs/BUILD_RAND_SPECS.md) | [50+ Examples](./EXAMPLES.md)
+---
+## 📊 Performance Benchmarks
+Real-world performance tests across different environments:
+| Environment | Dataset | Rows | Time | Throughput |
+|------------|---------|------|------|------------|
+| **Local (Python 3.12)** | Customers | 1M | 81.5s | ~12K rows/sec |
+| **Databricks (Standard)** | Customers | 1M | 7.4s | ~135K rows/sec |
+| **Databricks (Spark)** | Orders | 100M | 19.4s | ~5.1M rows/sec |
+| **Databricks (Custom)** | Custom Spec | 100M | 19.4s | ~5.1M rows/sec |
+💡 **Tip:** Spark generation scales linearly with cluster size for massive datasets (100M+ rows).
+---
+## 🔑 Advanced Features
+### 🔗 **Constraints System** - Referential Integrity
+Generate **multiple related tables** with Primary Keys (PK) and Foreign Keys (FK):
+```python
+from rand_engine.main.data_generator import DataGenerator
+# Define specs with constraints
+customers_spec = {
+    "customer_id": {"method": "unique_ids", "kwargs": {"strategy": "sequence"}},
+    "name": {"method": "distincts", "kwargs": {"distincts": ["Alice", "Bob", "Charlie"]}},
+    "constraints": {
+        "pk_customer": {"tipo": "PK", "fields": ["customer_id"]}
+    }
+}
+orders_spec = {
+    "order_id": {"method": "unique_ids", "kwargs": {"strategy": "sequence"}},
+    "customer_id": {"method": "integers", "kwargs": {"min": 1, "max": 1000}},
+    "amount": {"method": "floats", "kwargs": {"min": 10, "max": 1000, "round": 2}},
+    "constraints": {
+        "fk_customer": {
+            "tipo": "FK",
+            "fields": ["customer_id"],
+            "references": {"spec_name": "customers", "pk_name": "pk_customer"}
+        }
+    }
+}
+# Generate with referential integrity
+generator = DataGenerator({"customers": customers_spec, "orders": orders_spec})
+dfs = generator.size({"customers": 1000, "orders": 5000}).get_dfs()
+```
+📖 **Complete guide:** [CONSTRAINTS.md](./docs/CONSTRAINTS.md)
+### 🎨 **Advanced Methods** - Correlated Data
+Generate correlated columns for realistic patterns:
+```python
+# Currency-Country correlations
+orders_spec = {
+    "order_id": {"method": "unique_ids", "kwargs": {"strategy": "sequence"}},
+    "currency_country": {
+        "method": "distincts_map",  # Correlated pairs
+        "splitable": True,
+        "cols": ["currency", "country"],
+        "sep": ";",
+        "kwargs": {"distincts": ["USD;US", "EUR;DE", "BRL;BR", "JPY;JP"]}
+    }
+}
+df = DataGenerator(orders_spec).size(10_000).get_df()
+# Result: USD always paired with US, EUR with DE, etc.
+```
+**Available Advanced Methods:**
+- `distincts_map` - Correlated pairs (currency ↔ country)
+- `distincts_multi_map` - Hierarchical combinations (dept → level → role)
+- `distincts_map_prop` - Weighted correlated pairs
+- `complex_distincts` - Pattern-based strings (IPs, SKUs, URLs)
+📖 **Complete guide:** [BUILD_RAND_SPECS.md](./docs/BUILD_RAND_SPECS.md)
+---
+## 💡 Quick Tips
+<table>
+<tr>
+<td width="50%">
+### 🎯 **For Data Engineers**
+- Use `seed` for reproducible tests
+- Export to Parquet for large datasets
+- Use constraints for multi-table integrity
+- Stream mode for real-time testing
+</td>
+<td width="50%">
+### 🧪 **For QA Engineers**
+- Start with pre-built specs
+- Generate edge cases with probabilities
+- Multiple seeds = multiple test scenarios
+- Test PK/FK relationships
+</td>
+</tr>
+</table>
+---
+## 📚 Documentation
+| Document | Description |
+|----------|-------------|
+| **[BUILD_RAND_SPECS.md](./docs/BUILD_RAND_SPECS.md)** | Complete guide to building custom specifications |
+| **[EXAMPLES.md](./EXAMPLES.md)** | 50+ production-ready examples |
+| **[CONSTRAINTS.md](./docs/CONSTRAINTS.md)** | PK/FK system and referential integrity |
+| **[API_REFERENCE.md](./docs/API_REFERENCE.md)** | Full method reference |
+| **[LOGGING.md](./docs/LOGGING.md)** | Logging configuration |
+---
+## 🧪 Testing
+**494 tests passing** with comprehensive coverage:
+```bash
+pytest                                    # Run all tests
+pytest tests/test_2_data_generator.py -v # Test DataGenerator
+pytest tests/test_3_spark_generator.py -v # Test SparkGenerator
+pytest tests/test_8_consistency.py -v    # Test constraints
+```
+---
+## 📦 Requirements
+- **Python** >= 3.10
+- **numpy** >= 2.1.1
+- **pandas** >= 2.2.2
+- **faker** >= 28.4.1 (optional)
+- **duckdb** >= 1.4.1
+---
+## 🤝 Contributing
+Contributions are welcome! Feel free to:
+- 🐛 Report bugs via [Issues](https://github.com/marcoaureliomenezes/rand_engine/issues)
+- 💡 Suggest features via [Discussions](https://github.com/marcoaureliomenezes/rand_engine/discussions)
+- 🔧 Submit pull requests
+---
+## 📞 Support
+- **GitHub Issues**: [Report bugs](https://github.com/marcoaureliomenezes/rand_engine/issues)
+- **GitHub Discussions**: [Ask questions](https://github.com/marcoaureliomenezes/rand_engine/discussions)
+- **Email**: marcoaurelioreislima@gmail.com
+---
+## 📄 License
+MIT License - see [LICENSE](LICENSE) for details.
+---
+<div align="center">
+### 🌟 Star the project if you find it useful!
+[![Star History Chart](https://api.star-history.com/svg?repos=marcoaureliomenezes/rand_engine&type=Date)](https://star-history.com/#marcoaureliomenezes/rand_engine&Date)
+**Built with ❤️ for Data Engineers and the data community**
+[⬆ Back to top](#-rand-engine)
+</div>
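
The throughput column in the benchmark table of the README added above can be checked from its own row counts and timings. The sketch below is not part of the package and does not call `rand_engine`; it only recomputes rows per second from the figures quoted in the table. The names `benchmarks`, `throughput`, and `results` are hypothetical.

```python
# Row counts and wall-clock seconds copied from the README's benchmark table.
benchmarks = [
    ("Local (Python 3.12)", 1_000_000, 81.5),
    ("Databricks (Standard)", 1_000_000, 7.4),
    ("Databricks (Spark)", 100_000_000, 19.4),
    ("Databricks (Custom)", 100_000_000, 19.4),
]


def throughput(rows: int, seconds: float) -> float:
    """Return the number of rows generated per second."""
    return rows / seconds


# Map each environment to its recomputed throughput.
results = {name: throughput(rows, secs) for name, rows, secs in benchmarks}

for name, rate in results.items():
    print(f"{name}: {rate:,.0f} rows/sec")
```

The recomputed rates land close to the rounded values in the table (roughly 12K, 135K, and 5.1M rows per second), so the table is internally consistent.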

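The README's constraints section describes PK/FK referential integrity between a `customers` and an `orders` table. As a rough illustration of what that property means, the sketch below builds two small tables with the standard library only and checks that no order points at a missing customer. It is an assumption-laden stand-in, not the library's implementation, and it does not use the `rand_engine` API; `orphaned_keys` is a hypothetical helper.

```python
import random

rng = random.Random(42)  # fixed seed so the run is reproducible

# Parent table: sequential primary keys 1..1000 (stand-in for customers).
customers = [{"customer_id": i} for i in range(1, 1001)]
pk_values = [row["customer_id"] for row in customers]

# Child table: every order draws its foreign key from the existing primary keys.
orders = [
    {
        "order_id": i,
        "customer_id": rng.choice(pk_values),
        "amount": round(rng.uniform(10, 1000), 2),
    }
    for i in range(1, 5001)
]


def orphaned_keys(child_rows, parent_keys, field):
    """Return the child values of `field` that have no matching parent key."""
    return {row[field] for row in child_rows} - set(parent_keys)


# A primary key must be unique within the parent table.
assert len(set(pk_values)) == len(pk_values)

# An empty set means every foreign key resolves to an existing customer.
print(orphaned_keys(orders, pk_values, "customer_id"))
```

Drawing foreign keys from the already generated parent keys is one simple way to get integrity by construction; a check like `orphaned_keys` is still useful when validating generated tables in a test.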