etl-forge 1.0.0__tar.gz

@@ -0,0 +1,433 @@
1
+ Metadata-Version: 2.4
2
+ Name: etl-forge
3
+ Version: 1.0.0
4
+ Summary: A Python library for generating synthetic test data and validating ETL outputs.
5
+ Author-email: Kyriakos Kartas <mail@kkartas.gr>
6
+ Project-URL: Homepage, https://github.com/kkartas/etl-forge
7
+ Project-URL: Bug Tracker, https://github.com/kkartas/etl-forge/issues
8
+ Keywords: etl,testing,data validation,synthetic data,data quality
9
+ Classifier: Development Status :: 5 - Production/Stable
10
+ Classifier: Intended Audience :: Developers
11
+ Classifier: Topic :: Software Development :: Testing
12
+ Classifier: License :: OSI Approved :: MIT License
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Programming Language :: Python :: 3.8
15
+ Classifier: Programming Language :: Python :: 3.9
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Requires-Python: >=3.8
19
+ Description-Content-Type: text/markdown
20
+ Requires-Dist: pandas>=1.3.0
21
+ Requires-Dist: pyyaml>=5.4.0
22
+ Requires-Dist: click>=8.0.0
23
+ Requires-Dist: openpyxl>=3.0.0
24
+ Requires-Dist: numpy>=1.21.0
25
+ Requires-Dist: psutil>=5.9.0
26
+ Provides-Extra: faker
27
+ Requires-Dist: faker>=15.0.0; extra == "faker"
28
+ Provides-Extra: dev
29
+ Requires-Dist: pytest>=6.0.0; extra == "dev"
30
+ Requires-Dist: pytest-cov>=2.0.0; extra == "dev"
31
+ Requires-Dist: black>=21.0.0; extra == "dev"
32
+ Requires-Dist: flake8>=3.8.0; extra == "dev"
33
+ Requires-Dist: mypy>=0.900; extra == "dev"
34
+ Requires-Dist: matplotlib>=3.5.0; extra == "dev"
35
+ Requires-Dist: coverage>=6.0; extra == "dev"
36
+ Requires-Dist: sphinx>=4.0.0; extra == "dev"
37
+ Requires-Dist: sphinx-rtd-theme>=1.0.0; extra == "dev"
38
+
39
+ # ETLForge
40
+
41
+ [![build](https://github.com/kkartas/etl-forge/actions/workflows/ci.yml/badge.svg)](https://github.com/kkartas/etl-forge/actions/workflows/ci.yml)
42
+ [![docs](https://readthedocs.org/projects/etl-forge/badge/?version=latest)](https://etl-forge.readthedocs.io/en/latest/)
43
+ [![PyPI version](https://badge.fury.io/py/etl-forge.svg)](https://badge.fury.io/py/etl-forge)
44
+ [![codecov](https://codecov.io/gh/kkartas/etl-forge/graph/badge.svg)](https://codecov.io/gh/kkartas/etl-forge)
45
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
46
+
47
+ A Python library for generating synthetic test data and validating ETL outputs. ETLForge provides both command-line tools and library functions to help you create realistic test datasets and validate data quality.
48
+
49
+ ## Features
50
+
51
+ ### 🎲 Test Data Generator
52
+ - Generate synthetic data based on YAML/JSON schema definitions
53
+ - Support for multiple data types: `int`, `float`, `string`, `date`, `category`
54
+ - Advanced constraints: ranges, uniqueness, nullable fields, categorical values
55
+ - Integration with Faker for realistic string generation
56
+ - Export to CSV or Excel formats
57
+
58
+ ### ✅ Data Validator
59
+ - Validate CSV/Excel files against schema definitions
60
+ - Comprehensive validation checks:
61
+   - Column existence
61
+   - Data type matching
62
+   - Value constraints (ranges, categories)
63
+   - Uniqueness validation
64
+   - Null value validation
65
+   - Date format validation
67
+ - Generate detailed reports of invalid rows
68
+
69
+ ### 🔧 Dual Interface
70
+ - **Command-line interface** for quick operations
71
+ - **Python library** for integration into existing workflows
72
+
73
+ ## Installation
74
+
75
+ ### Prerequisites
76
+ - Python 3.8 or higher
77
+ - pip package manager
78
+
79
+ ### Install from PyPI (Recommended)
80
+ ```bash
81
+ pip install etl-forge
82
+ ```
83
+
84
+ ### Install from Source
85
+ For development, or to get the latest unreleased features:
86
+ ```bash
87
+ git clone https://github.com/kkartas/etl-forge.git
88
+ cd etl-forge
89
+ pip install -e ".[dev]"
90
+ ```
91
+
92
+ ### Dependencies
93
+ **Core dependencies** (6 total, automatically installed):
94
+ - `pandas>=1.3.0` - Data manipulation and analysis
95
+ - `pyyaml>=5.4.0` - YAML parsing for schema files
96
+ - `click>=8.0.0` - Command-line interface framework
97
+ - `openpyxl>=3.0.0` - Excel file support
98
+ - `numpy>=1.21.0` - Numerical computing
99
+ - `psutil>=5.9.0` - System monitoring for benchmarks
100
+
101
+ **Optional dependencies** for enhanced features:
102
+ ```bash
103
+ # For realistic data generation using Faker templates
104
+ pip install "etl-forge[faker]"
105
+
106
+ # For development (testing, linting, documentation)
107
+ pip install "etl-forge[dev]"
108
+ ```
109
+
110
+ ### Verify Installation
111
+ ```bash
112
+ # CLI verification (may require adding Scripts directory to PATH on Windows)
113
+ etl-forge --version
114
+
115
+ # Alternative CLI access (works on all platforms)
116
+ python -m etl_forge.cli --version
117
+
118
+ # Library verification
119
+ python -c "from etl_forge import DataGenerator, DataValidator; print('✅ Installation verified')"
120
+ ```
121
+
122
+ ### CLI Access Note
123
+ On some systems (especially Windows), the `etl-forge` command may not be on your `PATH`. In that case, invoke the CLI as a module:
124
+ ```bash
125
+ python -m etl_forge.cli [command] [options]
126
+ ```
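When scripting around the CLI, the same fallback can be automated. The following is a minimal sketch using only the standard library; the helper name `etl_forge_command` is hypothetical and not part of ETLForge.

```python
import shutil
import sys


def etl_forge_command(args):
    """Build an argv list for the CLI, preferring the `etl-forge` script."""
    exe = shutil.which("etl-forge")
    if exe is not None:
        return [exe, *args]
    # Fall back to the module form, which does not depend on PATH.
    return [sys.executable, "-m", "etl_forge.cli", *args]


cmd = etl_forge_command(["--version"])
print(cmd)
# To actually run it: subprocess.run(cmd, check=True)
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues.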
127
+
128
+ ## Quick Start
129
+
130
+ ### 1. Create a Schema
131
+
132
+ Create a `schema.yaml` file defining your data structure:
133
+
134
+ ```yaml
135
+ fields:
136
+   - name: id
137
+     type: int
138
+     unique: true
139
+     nullable: false
140
+     range:
141
+       min: 1
142
+       max: 10000
143
+
144
+   - name: name
145
+     type: string
146
+     nullable: false
147
+     faker_template: name
148
+
149
+   - name: department
150
+     type: category
151
+     nullable: false
152
+     values:
153
+       - Engineering
154
+       - Marketing
155
+       - Sales
156
+ ```
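Schema files may also be JSON. As a sketch, and assuming the JSON layout mirrors the YAML keys one-to-one (an assumption, not confirmed here), the same schema can be produced from a Python dictionary with the standard `json` module:

```python
import json

# Same structure as the schema.yaml example, expressed as a dict.
schema = {
    "fields": [
        {"name": "id", "type": "int", "unique": True, "nullable": False,
         "range": {"min": 1, "max": 10000}},
        {"name": "name", "type": "string", "nullable": False,
         "faker_template": "name"},
        {"name": "department", "type": "category", "nullable": False,
         "values": ["Engineering", "Marketing", "Sales"]},
    ]
}

# Serialize; write this string to e.g. schema.json to use it as a schema file.
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```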
157
+
158
+ ### 2. Generate Test Data
159
+
160
+ **Command Line:**
161
+ ```bash
162
+ # Direct CLI command (if available)
163
+ etl-forge generate --schema schema.yaml --rows 500 --output sample.csv
164
+
165
+ # Alternative CLI access (works on all platforms)
166
+ python -m etl_forge.cli generate --schema schema.yaml --rows 500 --output sample.csv
167
+ ```
168
+
169
+ **Python Library:**
170
+ ```python
171
+ from etl_forge import DataGenerator
172
+
173
+ generator = DataGenerator('schema.yaml')
174
+ df = generator.generate_data(500)
175
+ generator.save_data(df, 'sample.csv')
176
+ ```
177
+
178
+ ### 3. Validate Data
179
+
180
+ **Command Line:**
181
+ ```bash
182
+ # Direct CLI command (if available)
183
+ etl-forge check --input sample.csv --schema schema.yaml --report invalid_rows.csv
184
+
185
+ # Alternative CLI access (works on all platforms)
186
+ python -m etl_forge.cli check --input sample.csv --schema schema.yaml --report invalid_rows.csv
187
+ ```
188
+
189
+ **Python Library:**
190
+ ```python
191
+ from etl_forge import DataValidator
192
+
193
+ validator = DataValidator('schema.yaml')
194
+ result = validator.validate('sample.csv')
195
+ print(f"Validation passed: {result.is_valid}")
196
+ ```
197
+
198
+ ## Schema Definition
199
+
200
+ ### Supported Field Types
201
+
202
+ #### Integer (`int`)
203
+ ```yaml
204
+ - name: age
205
+   type: int
205
+   nullable: false
206
+   range:
207
+     min: 18
208
+     max: 65
209
+   unique: false
211
+ ```
212
+
213
+ #### Float (`float`)
214
+ ```yaml
215
+ - name: salary
216
+   type: float
216
+   nullable: true
217
+   range:
218
+     min: 30000.0
219
+     max: 150000.0
220
+   precision: 2
221
+   null_rate: 0.1
223
+ ```
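To make `precision` and `null_rate` concrete, here is a standard-library sketch of how such a field could be sampled. It illustrates the semantics described in this README and is not ETLForge's actual implementation; `sample_float` is a hypothetical helper.

```python
import random


def sample_float(rng, lo, hi, precision, nullable, null_rate):
    """Draw one value: None with probability null_rate, else a rounded float."""
    if nullable and rng.random() < null_rate:
        return None
    return round(rng.uniform(lo, hi), precision)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
salaries = [sample_float(rng, 30000.0, 150000.0, 2, True, 0.1) for _ in range(1000)]
null_share = sum(v is None for v in salaries) / len(salaries)
print(f"share of nulls: {null_share:.2f}")
```

With `null_rate: 0.1`, roughly one value in ten should come out null over a large sample.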
224
+
225
+ #### String (`string`)
226
+ ```yaml
227
+ - name: email
228
+   type: string
228
+   nullable: false
229
+   unique: true
230
+   length:
231
+     min: 10
232
+     max: 50
233
+   faker_template: email  # Optional: uses Faker library
235
+ ```
236
+
237
+ #### Date (`date`)
238
+ ```yaml
239
+ - name: hire_date
240
+   type: date
240
+   nullable: false
241
+   range:
242
+     start: '2020-01-01'
243
+     end: '2024-12-31'
244
+   format: '%Y-%m-%d'
246
+ ```
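The `format` string uses Python's `strftime`/`strptime` codes. The sketch below shows what a format-and-range check amounts to for this field, using only `datetime`. It assumes the range bounds are inclusive, which this README does not state; `date_ok` is a hypothetical helper, not an ETLForge function.

```python
from datetime import datetime

FMT = "%Y-%m-%d"
START = datetime.strptime("2020-01-01", FMT)
END = datetime.strptime("2024-12-31", FMT)


def date_ok(text):
    """True if text parses with FMT and lies within [START, END] (assumed inclusive)."""
    try:
        value = datetime.strptime(text, FMT)
    except ValueError:
        return False  # wrong format or impossible date
    return START <= value <= END


print(date_ok("2021-06-15"))  # True
print(date_ok("15/06/2021"))  # False: does not match the format
print(date_ok("2025-01-01"))  # False: outside the range
```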
247
+
248
+ #### Category (`category`)
249
+ ```yaml
250
+ - name: status
251
+   type: category
251
+   nullable: false
252
+   values:
253
+     - Active
254
+     - Inactive
255
+     - Pending
257
+ ```
258
+
259
+ ### Schema Constraints
260
+
261
+ - **`nullable`**: Allow null values (default: `false`)
262
+ - **`unique`**: Ensure all values are unique (default: `false`)
263
+ - **`range`**: `min`/`max` bounds for numeric fields, or `start`/`end` bounds for date fields
264
+ - **`values`**: List of allowed values for categorical fields
265
+ - **`length`**: Min/max length for string fields
266
+ - **`precision`**: Decimal places for float fields
267
+ - **`format`**: Date format string (default: `'%Y-%m-%d'`)
268
+ - **`faker_template`**: Faker method name for realistic string generation
269
+ - **`null_rate`**: Probability of null values when `nullable: true` (default: 0.1)
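
As an illustration of how these constraints combine, the following standard-library sketch checks one column of values against a field definition. It is a simplified model of the listed checks, not ETLForge's validator; `check_column` is a hypothetical helper, and the range bounds are assumed to be inclusive.

```python
def check_column(values, field):
    """Return a list of (row_index, message) for one column of raw values."""
    problems = []
    seen = set()
    bounds = field.get("range", {})
    for i, v in enumerate(values):
        if v is None:
            # nullable defaults to false, as in the constraint list
            if not field.get("nullable", False):
                problems.append((i, "null not allowed"))
            continue
        if "min" in bounds and v < bounds["min"]:
            problems.append((i, "below range.min"))
        if "max" in bounds and v > bounds["max"]:
            problems.append((i, "above range.max"))
        if "values" in field and v not in field["values"]:
            problems.append((i, "not in values"))
        if field.get("unique", False):
            if v in seen:
                problems.append((i, "duplicate value"))
            seen.add(v)
    return problems


age = {"name": "age", "type": "int", "range": {"min": 18, "max": 65}}
problems = check_column([30, None, 70, 18], age)
print(problems)  # [(1, 'null not allowed'), (2, 'above range.max')]
```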
270
+
271
+ ## Command Line Interface
272
+
273
+ ### Generate Data
274
+ ```bash
275
+ # Direct CLI command (if available)
276
+ etl-forge generate [OPTIONS]
277
+
278
+ # Alternative CLI access (works on all platforms)
279
+ python -m etl_forge.cli generate [OPTIONS]
280
+
281
+ Options:
282
+   -s, --schema PATH         Path to schema file (YAML or JSON) [required]
283
+   -r, --rows INTEGER        Number of rows to generate (default: 100)
284
+   -o, --output PATH         Output file path (CSV or Excel) [required]
285
+   -f, --format [csv|excel]  Output format (auto-detected if not specified)
286
+ ```
287
+
288
+ ### Validate Data
289
+ ```bash
290
+ # Direct CLI command (if available)
291
+ etl-forge check [OPTIONS]
292
+
293
+ # Alternative CLI access (works on all platforms)
294
+ python -m etl_forge.cli check [OPTIONS]
295
+
296
+ Options:
297
+   -i, --input PATH    Path to input data file [required]
298
+   -s, --schema PATH   Path to schema file [required]
299
+   -r, --report PATH   Path to save invalid rows report (optional)
300
+   -v, --verbose       Show detailed validation errors
301
+ ```
302
+
303
+ ### Create Example Schema
304
+ ```bash
305
+ # Direct CLI command (if available)
306
+ etl-forge create-schema example_schema.yaml
307
+
308
+ # Alternative CLI access (works on all platforms)
309
+ python -m etl_forge.cli create-schema example_schema.yaml
310
+ ```
311
+
312
+ ## Library Usage
313
+
314
+ ### Data Generation
315
+
316
+ ```python
317
+ from etl_forge import DataGenerator
318
+
319
+ # Initialize with schema
320
+ generator = DataGenerator('schema.yaml')
321
+
322
+ # Generate data
323
+ df = generator.generate_data(1000)
324
+
325
+ # Save to file
326
+ generator.save_data(df, 'output.csv')
327
+
328
+ # Or do both in one step
329
+ df = generator.generate_and_save(1000, 'output.xlsx', 'excel')
330
+ ```
331
+
332
+ ### Data Validation
333
+
334
+ ```python
335
+ from etl_forge import DataValidator
336
+
337
+ # Initialize validator
338
+ validator = DataValidator('schema.yaml')
339
+
340
+ # Validate data
341
+ result = validator.validate('data.csv')
342
+
343
+ # Check results
344
+ if result.is_valid:
345
+     print("✅ Data is valid!")
345
+ else:
346
+     print(f"❌ Found {len(result.errors)} validation errors")
347
+     print(f"Invalid rows: {len(result.invalid_rows)}")
349
+
350
+ # Generate report
351
+ result = validator.validate_and_report('data.csv', 'errors.csv')
352
+
353
+ # Print summary
354
+ validator.print_validation_summary(result)
355
+ ```
356
+
357
+ ### Advanced Usage
358
+
359
+ ```python
360
+ from etl_forge import DataGenerator, DataValidator
+
+ # Use schema as dictionary
361
+ schema_dict = {
362
+     'fields': [
362
+         {'name': 'id', 'type': 'int', 'unique': True},
363
+         {'name': 'name', 'type': 'string', 'faker_template': 'name'}
364
+     ]
366
+ }
367
+
368
+ generator = DataGenerator(schema_dict)
369
+ validator = DataValidator(schema_dict)
370
+
371
+ # Validate DataFrame directly
372
+ import pandas as pd
373
+ df = pd.read_csv('data.csv')
374
+ result = validator.validate(df)
375
+ ```
376
+
377
+ ## Faker Integration
378
+
379
+ When the optional `faker` extra is installed (`pip install "etl-forge[faker]"`), string fields can use `faker_template` to generate realistic values:
380
+
381
+ ```yaml
382
+ - name: first_name
383
+   type: string
383
+   faker_template: first_name
384
+
385
+ - name: address
386
+   type: string
387
+   faker_template: address
388
+
389
+ - name: phone
390
+   type: string
391
+   faker_template: phone_number
393
+ ```
394
+
395
+ Common Faker templates:
396
+ - `name`, `first_name`, `last_name`
397
+ - `email`, `phone_number`
398
+ - `address`, `city`, `country`
399
+ - `company`, `job`
400
+ - `date`, `time`
401
+ - And many more! See [Faker documentation](https://faker.readthedocs.io/)
402
+
403
+ ## Testing
404
+
405
+ Run the test suite:
406
+
407
+ ```bash
408
+ pytest tests/
409
+ ```
410
+
411
+ Run with coverage:
412
+
413
+ ```bash
414
+ pytest tests/ --cov=etl_forge --cov-report=html
415
+ ```
416
+
417
+ ## Performance
418
+
419
+ Performance benchmarks are available in [`BENCHMARKS.md`](BENCHMARKS.md). To reproduce them, run:
420
+
421
+ ```bash
422
+ python benchmark.py
423
+ ```
424
+
425
+ Then, to visualize the results:
426
+
427
+ ```bash
428
+ python plot_benchmark.py
429
+ ```
430
+
431
+ ## Citation
432
+
433
+ If you use `ETLForge` in your research or work, please cite it using the information in `CITATION.cff`.