rand-engine 0.6.2__tar.gz → 0.6.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. rand_engine-0.6.3/PKG-INFO +397 -0
  2. rand_engine-0.6.3/README.md +374 -0
  3. {rand_engine-0.6.2 → rand_engine-0.6.3}/pyproject.toml +2 -2
  4. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/__init__.py +1 -1
  5. rand_engine-0.6.3/rand_engine/core/_np_core.py +101 -0
  6. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/core/_py_core.py +3 -3
  7. rand_engine-0.6.3/rand_engine/core/_spark_core.py +147 -0
  8. rand_engine-0.6.3/rand_engine/examples/__init__.py +31 -0
  9. rand_engine-0.6.3/rand_engine/examples/advanced_rand_specs.py +734 -0
  10. rand_engine-0.6.3/rand_engine/examples/common_rand_specs.py +225 -0
  11. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/main/_rand_generator.py +4 -4
  12. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/main/data_generator.py +8 -8
  13. rand_engine-0.6.3/rand_engine/main/spark_generator.py +60 -0
  14. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/templates/__init__.py +4 -1
  15. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/templates/web_server_logs.py +5 -5
  16. rand_engine-0.6.3/rand_engine/validators/__init__.py +29 -0
  17. rand_engine-0.6.2/rand_engine/validators/spec_validator.py → rand_engine-0.6.3/rand_engine/validators/advanced_validator.py +339 -414
  18. rand_engine-0.6.3/rand_engine/validators/common_validator.py +554 -0
  19. rand_engine-0.6.2/PKG-INFO +0 -710
  20. rand_engine-0.6.2/README.md +0 -687
  21. rand_engine-0.6.2/rand_engine/core/_ext_core.py +0 -21
  22. rand_engine-0.6.2/rand_engine/core/_np_core.py +0 -88
  23. rand_engine-0.6.2/rand_engine/core/_spark_core.py +0 -59
  24. rand_engine-0.6.2/rand_engine/main/examples.py +0 -495
  25. rand_engine-0.6.2/rand_engine/main/spark_generator.py +0 -36
  26. rand_engine-0.6.2/rand_engine/validators/__init__.py +0 -5
  27. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/file_handlers/_writer_batch.py +0 -0
  28. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/file_handlers/_writer_stream.py +0 -0
  29. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/file_handlers/file_handler.py +0 -0
  30. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/file_handlers/fs_utils.py +0 -0
  31. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/file_handlers/writer.py +0 -0
  32. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/integrations/__init__.py +0 -0
  33. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/integrations/_base_handler.py +0 -0
  34. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/integrations/_duckdb_handler.py +0 -0
  35. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/integrations/_sqlite_handler.py +0 -0
  36. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/main/_cdc_generator.py +0 -0
  37. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/main/_constraints_handler.py +0 -0
  38. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/templates/i_random_spec.py +0 -0
  39. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/utils/logger.py +0 -0
  40. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/utils/stream_handler.py +0 -0
  41. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/utils/update.py +0 -0
  42. {rand_engine-0.6.2 → rand_engine-0.6.3}/rand_engine/validators/exceptions.py +0 -0
@@ -0,0 +1,397 @@
1
+ Metadata-Version: 2.4
2
+ Name: rand-engine
3
+ Version: 0.6.3
4
+ Summary: Rand Engine v2. Package with some methods to generate random data in different formats. Great to mock data while testing or developing.
5
+ Author: marcoaureliomenezes
6
+ Author-email: marcoaurelioreislima@gmail.com
7
+ Requires-Python: >=3.10,<4.0
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: Programming Language :: Python :: 3.10
10
+ Classifier: Programming Language :: Python :: 3.11
11
+ Classifier: Programming Language :: Python :: 3.12
12
+ Classifier: Programming Language :: Python :: 3.13
13
+ Classifier: Programming Language :: Python :: 3.14
14
+ Requires-Dist: duckdb (>=1.4.1,<2.0.0)
15
+ Requires-Dist: fastavro (>=1.10.0,<2.0.0)
16
+ Requires-Dist: fastparquet (>=2024.11.0,<2025.0.0)
17
+ Requires-Dist: numpy (>=2.1.1,<3.0.0)
18
+ Requires-Dist: pandas (>=2.2.2,<3.0.0)
19
+ Requires-Dist: pyarrow (>=19.0.0,<20.0.0)
20
+ Project-URL: Repository, https://github.com/marcoaureliomenezes/rand_engine
21
+ Description-Content-Type: text/markdown
22
+
23
+ <div align="center">
24
+
25
+ # 🎲 Rand Engine
26
+
27
+ **Generate millions of rows of synthetic data in seconds**
28
+
29
+ *High-performance random data generation for testing, development, and prototyping*
30
+
31
+ [![Python](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
32
+ [![Tests](https://img.shields.io/badge/tests-494%20passing-brightgreen.svg)]()
33
+ [![License](https://img.shields.io/badge/license-MIT-blue.svg)]()
34
+ [![Version](https://img.shields.io/badge/version-0.7.0-orange.svg)](https://pypi.org/project/rand-engine/)
35
+ [![PyPI](https://img.shields.io/badge/PyPI-rand--engine-blue.svg)](https://pypi.org/project/rand-engine/)
36
+
37
+ [Quick Start](#-quick-start) • [Features](#-key-features) • [Examples](#-usage-examples) • [Documentation](#-documentation) • [Benchmarks](#-performance-benchmarks)
38
+
39
+ </div>
40
+
41
+ ---
42
+
43
+ ## 🎯 What is Rand Engine?
44
+
45
+ **Rand Engine** is a Python library that generates **realistic synthetic data at scale** through simple declarative specifications. It is built on NumPy and Pandas for performance.
46
+
47
+ **Perfect for:**
48
+ - 🧪 Testing ETL/ELT pipelines without production data
49
+ - 📊 Load testing and stress testing data systems
50
+ - 🎓 Learning data engineering without complex setups
51
+ - 🚀 Prototyping applications with realistic datasets
52
+ - 🔐 Demos and POCs without exposing sensitive data
53
+
54
+ ---
55
+
56
+ ## 🚀 Quick Start
57
+
58
+ ### Installation
59
+
60
+ ```bash
61
+ pip install rand-engine
62
+ ```
63
+
64
+ ### Generate Your First Dataset (3 Lines!)
65
+
66
+ ```python
67
+ from rand_engine.main.data_generator import DataGenerator
68
+ from rand_engine.examples.common_rand_specs import CommonRandSpecs
69
+
70
+ # Generate 1 million customer records in seconds
71
+ df = DataGenerator(CommonRandSpecs.customers(), seed=42).size(1_000_000).get_df()
72
+ print(df.head())
73
+ ```
74
+
75
+ **Output:**
76
+ ```
77
+ customer_id age city total_spent is_premium registration_date
78
+ 0 uuid-001 42 São Paulo 1523.50 True 2023-05-12
79
+ 1 uuid-002 28 Rio de Janeiro 872.33 False 2024-01-08
80
+ 2 uuid-003 56 Belo Horizonte 4215.89 False 2022-11-23
81
+ ```
82
+
83
+ **That's it!** You just generated 1 million rows of realistic customer data. 🎉
84
+
85
+ ---
86
+
87
+ ## ✨ Key Features
88
+
89
+ <table>
90
+ <tr>
91
+ <td width="50%">
92
+
93
+ ### 🐼 **Pandas DataFrames**
94
+ ```python
95
+ from rand_engine.main.data_generator import DataGenerator
96
+
97
+ df = DataGenerator(spec, seed=42).size(1_000_000).get_df()
98
+ ```
99
+ ✅ All methods (common + advanced)
100
+ ✅ Correlated columns
101
+ ✅ Complex patterns
102
+ ✅ PK/FK constraints
103
+
104
+ </td>
105
+ <td width="50%">
106
+
107
+ ### ⚡ **Spark DataFrames**
108
+ ```python
109
+ from rand_engine.main.spark_generator import SparkGenerator
110
+
111
+ df = SparkGenerator(spark, F, spec).size(100_000_000).get_df()
112
+ ```
113
+ ✅ Native Spark generation
114
+ ✅ Databricks ready
115
+ ✅ Distributed at scale
116
+ ⚠️ Common methods only
117
+
118
+ </td>
119
+ </tr>
120
+ </table>
121
+
122
+ ### 🎁 **17+ Pre-Built RandSpecs**
123
+
124
+ No configuration needed! Start generating data immediately:
125
+
126
+ | **CommonRandSpecs** (Work Everywhere) | **AdvancedRandSpecs** (Pandas Only) |
127
+ |---------------------------------------|-------------------------------------|
128
+ | `customers()` `products()` `orders()` | `employees()` `devices()` `invoices()` |
129
+ | `transactions()` `sensors()` `users()` | `shipments()` `network_devices()` `vehicles()` |
130
+ | | `real_estate()` `healthcare()` |
131
+
132
+ ```python
133
+ # Use any pre-built spec instantly
134
+ from rand_engine.examples.common_rand_specs import CommonRandSpecs
135
+ from rand_engine.examples.advanced_rand_specs import AdvancedRandSpecs
136
+
137
+ df_orders = DataGenerator(CommonRandSpecs.orders()).size(50_000).get_df()
138
+ df_employees = DataGenerator(AdvancedRandSpecs.employees()).size(1_000).get_df()
139
+ ```
140
+
141
+ ### 📝 **Write to Files**
142
+
143
+ ```python
144
+ # Write to CSV, Parquet, JSON with compression
145
+ DataGenerator(spec).size(1_000_000).write() \
146
+ .format("parquet") \
147
+ .compression("snappy") \
148
+ .mode("overwrite") \
149
+ .save("./data/customers")
150
+ ```
151
+
152
+ ### 🌊 **Stream Data**
153
+
154
+ ```python
155
+ # Simulate real-time data streams
156
+ DataGenerator(spec).stream() \
157
+ .throughput(min=1000, max=5000) \
158
+ .format("json") \
159
+ .start("./data/stream/events")
160
+ ```
161
+
162
+ ---
163
+
164
+ ## 💡 Usage Examples
165
+
166
+ ### 1️⃣ **Local Development (Pandas)**
167
+
168
+ ```python
169
+ from rand_engine.main.data_generator import DataGenerator
170
+ from rand_engine.examples.common_rand_specs import CommonRandSpecs
171
+
172
+ # Generate and explore
173
+ df = DataGenerator(CommonRandSpecs.transactions(), seed=42).size(100_000).get_df()
174
+ print(df.describe())
175
+ ```
176
+
177
+ ### 2️⃣ **Databricks / Spark Environments**
178
+
179
+ ```python
180
+ from rand_engine.main.spark_generator import SparkGenerator
181
+ from rand_engine.examples.common_rand_specs import CommonRandSpecs
182
+ from pyspark.sql import functions as F
183
+
184
+ # Generate Spark DataFrame with 100M rows
185
+ df_spark = SparkGenerator(spark, F, CommonRandSpecs.orders()).size(100_000_000).get_df()
186
+
187
+ # Write to Delta Lake
188
+ df_spark.write.format("delta").mode("overwrite").save("/path/to/delta/table")
189
+ ```
190
+
191
+ ### 3️⃣ **Custom Specifications**
192
+
193
+ ```python
194
+ # Define your own data structure
195
+ custom_spec = {
196
+ "user_id": {
197
+ "method": "unique_ids",
198
+ "kwargs": {"strategy": "uuid4"}
199
+ },
200
+ "age": {
201
+ "method": "integers",
202
+ "kwargs": {"min": 18, "max": 80}
203
+ },
204
+ "salary": {
205
+ "method": "floats",
206
+ "kwargs": {"min": 30000, "max": 150000, "round": 2}
207
+ }
208
+ }
209
+
210
+ df = DataGenerator(custom_spec).size(50_000).get_df()
211
+ ```
212
+
213
+ 📖 **Learn more:** [BUILD_RAND_SPECS.md](./docs/BUILD_RAND_SPECS.md) | [50+ Examples](./EXAMPLES.md)
214
+
215
+ ---
216
+
217
+ ## 📊 Performance Benchmarks
218
+
219
+ Real-world performance tests across different environments:
220
+
221
+ | Environment | Dataset | Rows | Time | Throughput |
222
+ |------------|---------|------|------|------------|
223
+ | **Local (Python 3.12)** | Customers | 1M | 81.5s | ~12K rows/sec |
224
+ | **Databricks (Standard)** | Customers | 1M | 7.4s | ~135K rows/sec |
225
+ | **Databricks (Spark)** | Orders | 100M | 19.4s | ~5.1M rows/sec |
226
+ | **Databricks (Custom)** | Custom Spec | 100M | 19.4s | ~5.1M rows/sec |
227
+
228
+ 💡 **Tip:** Spark generation scales linearly with cluster size for massive datasets (100M+ rows).
229
+
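The Throughput column in the benchmark table is rows divided by elapsed seconds. The snippet below is an illustrative aside and not part of the released PKG-INFO. It recomputes that column from the Rows and Time values quoted in the table; the `rows_per_second` helper is a hypothetical name introduced here.

```python
# Recompute throughput from the Rows and Time columns quoted in the table.
# `rows_per_second` is a hypothetical helper, not part of rand_engine.
benchmarks = [
    ("Local (Python 3.12)", 1_000_000, 81.5),
    ("Databricks (Standard)", 1_000_000, 7.4),
    ("Databricks (Spark)", 100_000_000, 19.4),
]


def rows_per_second(rows: int, seconds: float) -> float:
    """Return average throughput in rows per second."""
    return rows / seconds


for name, rows, seconds in benchmarks:
    print(f"{name}: {rows_per_second(rows, seconds):,.0f} rows/sec")
```

The results agree with the rounded figures in the table (about 12K, 135K, and just over 5.1M rows per second).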
230
+ ---
231
+
232
+ ## 🔑 Advanced Features
233
+
234
+ ### 🔗 **Constraints System** - Referential Integrity
235
+
236
+ Generate **multiple related tables** with Primary Keys (PK) and Foreign Keys (FK):
237
+
238
+ ```python
239
+ from rand_engine.main.data_generator import DataGenerator
240
+
241
+ # Define specs with constraints
242
+ customers_spec = {
243
+ "customer_id": {"method": "unique_ids", "kwargs": {"strategy": "sequence"}},
244
+ "name": {"method": "distincts", "kwargs": {"distincts": ["Alice", "Bob", "Charlie"]}},
245
+ "constraints": {
246
+ "pk_customer": {"tipo": "PK", "fields": ["customer_id"]}
247
+ }
248
+ }
249
+
250
+ orders_spec = {
251
+ "order_id": {"method": "unique_ids", "kwargs": {"strategy": "sequence"}},
252
+ "customer_id": {"method": "integers", "kwargs": {"min": 1, "max": 1000}},
253
+ "amount": {"method": "floats", "kwargs": {"min": 10, "max": 1000, "round": 2}},
254
+ "constraints": {
255
+ "fk_customer": {
256
+ "tipo": "FK",
257
+ "fields": ["customer_id"],
258
+ "references": {"spec_name": "customers", "pk_name": "pk_customer"}
259
+ }
260
+ }
261
+ }
262
+
263
+ # Generate with referential integrity
264
+ generator = DataGenerator({"customers": customers_spec, "orders": orders_spec})
265
+ dfs = generator.size({"customers": 1000, "orders": 5000}).get_dfs()
266
+ ```
267
+
268
+ 📖 **Complete guide:** [CONSTRAINTS.md](./docs/CONSTRAINTS.md)
269
+
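As an illustrative aside (not part of the released file), the idea behind PK/FK integrity can be sketched with the standard library alone: foreign-key values are drawn only from primary keys that already exist. This says nothing about how rand_engine implements its constraints; `generate_tables` and its parameters are hypothetical.

```python
import random


def generate_tables(n_customers: int, n_orders: int, seed=None):
    """Hypothetical sketch: orders reference only customer IDs that exist."""
    rng = random.Random(seed)
    # Primary keys: a simple 1..n sequence.
    customer_ids = list(range(1, n_customers + 1))
    customers = [
        {"customer_id": cid, "name": rng.choice(["Alice", "Bob", "Charlie"])}
        for cid in customer_ids
    ]
    # Foreign keys: sampled from the generated primary keys, never invented.
    orders = [
        {
            "order_id": oid,
            "customer_id": rng.choice(customer_ids),
            "amount": round(rng.uniform(10, 1000), 2),
        }
        for oid in range(1, n_orders + 1)
    ]
    return customers, orders


customers, orders = generate_tables(n_customers=1000, n_orders=5000, seed=42)
known_ids = {c["customer_id"] for c in customers}
# Every order points at an existing customer.
assert all(o["customer_id"] in known_ids for o in orders)
```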
270
+ ### 🎨 **Advanced Methods** - Correlated Data
271
+
272
+ Generate correlated columns for realistic patterns:
273
+
274
+ ```python
275
+ # Currency-Country correlations
276
+ orders_spec = {
277
+ "order_id": {"method": "unique_ids", "kwargs": {"strategy": "sequence"}},
278
+ "currency_country": {
279
+ "method": "distincts_map", # Correlated pairs
280
+ "splitable": True,
281
+ "cols": ["currency", "country"],
282
+ "sep": ";",
283
+ "kwargs": {"distincts": ["USD;US", "EUR;DE", "BRL;BR", "JPY;JP"]}
284
+ }
285
+ }
286
+
287
+ df = DataGenerator(orders_spec).size(10_000).get_df()
288
+ # Result: USD always paired with US, EUR with DE, etc.
289
+ ```
290
+
291
+ **Available Advanced Methods:**
292
+ - `distincts_map` - Correlated pairs (currency ↔ country)
293
+ - `distincts_multi_map` - Hierarchical combinations (dept → level → role)
294
+ - `distincts_map_prop` - Weighted correlated pairs
295
+ - `complex_distincts` - Pattern-based strings (IPs, SKUs, URLs)
296
+
297
+ 📖 **Complete guide:** [BUILD_RAND_SPECS.md](./docs/BUILD_RAND_SPECS.md)
298
+
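As an illustrative aside (not part of the released file), the correlated-pair idea described for `distincts_map` can be sketched without the library: pick one joined value such as `"USD;US"` per row, then split it on the separator into the target columns. This is a sketch under that assumption, not rand_engine's implementation; `sample_correlated_pairs` is a hypothetical name.

```python
import random


def sample_correlated_pairs(pairs, sep, cols, size, seed=None):
    """Hypothetical sketch: sample joined values, then split them into columns."""
    rng = random.Random(seed)
    rows = []
    for _ in range(size):
        joined = rng.choice(pairs)   # e.g. "USD;US"
        values = joined.split(sep)   # -> ["USD", "US"]
        rows.append(dict(zip(cols, values)))
    return rows


pairs = ["USD;US", "EUR;DE", "BRL;BR", "JPY;JP"]
rows = sample_correlated_pairs(pairs, ";", ["currency", "country"], size=5, seed=42)
for row in rows:
    print(row)
```

Because each row comes from a single joined value, a currency can never appear next to a country it was not paired with.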
299
+ ---
300
+
301
+ ## 💡 Quick Tips
302
+
303
+ <table>
304
+ <tr>
305
+ <td width="50%">
306
+
307
+ ### 🎯 **For Data Engineers**
308
+ - Use `seed` for reproducible tests
309
+ - Export to Parquet for large datasets
310
+ - Use constraints for multi-table integrity
311
+ - Stream mode for real-time testing
312
+
313
+ </td>
314
+ <td width="50%">
315
+
316
+ ### 🧪 **For QA Engineers**
317
+ - Start with pre-built specs
318
+ - Generate edge cases with probabilities
319
+ - Multiple seeds = multiple test scenarios
320
+ - Test PK/FK relationships
321
+
322
+ </td>
323
+ </tr>
324
+ </table>
325
+
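As an illustrative aside (not part of the released file), the tips about `seed` rest on a general property of seeded pseudo-random generators: the same seed reproduces the same sequence. The sketch below shows this with Python's standard `random` module only. It does not exercise rand_engine, and `generate_ages` is a hypothetical helper.

```python
import random


def generate_ages(size: int, seed=None):
    """Hypothetical helper: integers in [18, 80] from a seeded generator."""
    rng = random.Random(seed)
    return [rng.randint(18, 80) for _ in range(size)]


# Same seed, same data: useful for reproducible tests.
assert generate_ages(5, seed=42) == generate_ages(5, seed=42)

# Different seeds can serve as separate test scenarios.
scenario_a = generate_ages(5, seed=1)
scenario_b = generate_ages(5, seed=2)
print(scenario_a, scenario_b)
```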
326
+ ---
327
+
328
+ ## 📚 Documentation
329
+
330
+ | Document | Description |
331
+ |----------|-------------|
332
+ | **[BUILD_RAND_SPECS.md](./docs/BUILD_RAND_SPECS.md)** | Complete guide to building custom specifications |
333
+ | **[EXAMPLES.md](./EXAMPLES.md)** | 50+ production-ready examples |
334
+ | **[CONSTRAINTS.md](./docs/CONSTRAINTS.md)** | PK/FK system and referential integrity |
335
+ | **[API_REFERENCE.md](./docs/API_REFERENCE.md)** | Full method reference |
336
+ | **[LOGGING.md](./docs/LOGGING.md)** | Logging configuration |
337
+
338
+ ---
339
+
340
+ ## 🧪 Testing
341
+
342
+ **494 tests passing** with comprehensive coverage:
343
+
344
+ ```bash
345
+ pytest # Run all tests
346
+ pytest tests/test_2_data_generator.py -v # Test DataGenerator
347
+ pytest tests/test_3_spark_generator.py -v # Test SparkGenerator
348
+ pytest tests/test_8_consistency.py -v # Test constraints
349
+ ```
350
+
351
+ ---
352
+
353
+ ## 📦 Requirements
354
+
355
+ - **Python** >= 3.10
356
+ - **numpy** >= 2.1.1
357
+ - **pandas** >= 2.2.2
358
+ - **faker** >= 28.4.1 (optional)
359
+ - **duckdb** >= 1.4.1
360
+
361
+ ---
362
+
363
+ ## 🤝 Contributing
364
+
365
+ Contributions are welcome! Feel free to:
366
+ - 🐛 Report bugs via [Issues](https://github.com/marcoaureliomenezes/rand_engine/issues)
367
+ - 💡 Suggest features via [Discussions](https://github.com/marcoaureliomenezes/rand_engine/discussions)
368
+ - 🔧 Submit pull requests
369
+
370
+ ---
371
+
372
+ ## 📞 Support
373
+
374
+ - **GitHub Issues**: [Report bugs](https://github.com/marcoaureliomenezes/rand_engine/issues)
375
+ - **GitHub Discussions**: [Ask questions](https://github.com/marcoaureliomenezes/rand_engine/discussions)
376
+ - **Email**: marcoaurelioreislima@gmail.com
377
+
378
+ ---
379
+
380
+ ## 📄 License
381
+
382
+ MIT License - see [LICENSE](LICENSE) for details.
383
+
384
+ ---
385
+
386
+ <div align="center">
387
+
388
+ ### 🌟 Star the project if you find it useful!
389
+
390
+ [![Star History Chart](https://api.star-history.com/svg?repos=marcoaureliomenezes/rand_engine&type=Date)](https://star-history.com/#marcoaureliomenezes/rand_engine&Date)
391
+
392
+ **Built with ❤️ for Data Engineers and the data community**
393
+
394
+ [⬆ Back to top](#-rand-engine)
395
+
396
+ </div>
397
+