etl-forge 1.0.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- etl_forge-1.0.0/PKG-INFO +433 -0
- etl_forge-1.0.0/README.md +395 -0
- etl_forge-1.0.0/etl_forge/__init__.py +11 -0
- etl_forge-1.0.0/etl_forge/cli.py +228 -0
- etl_forge-1.0.0/etl_forge/exceptions.py +9 -0
- etl_forge-1.0.0/etl_forge/generator.py +464 -0
- etl_forge-1.0.0/etl_forge/validator.py +432 -0
- etl_forge-1.0.0/etl_forge.egg-info/PKG-INFO +433 -0
- etl_forge-1.0.0/etl_forge.egg-info/SOURCES.txt +17 -0
- etl_forge-1.0.0/etl_forge.egg-info/dependency_links.txt +1 -0
- etl_forge-1.0.0/etl_forge.egg-info/entry_points.txt +2 -0
- etl_forge-1.0.0/etl_forge.egg-info/requires.txt +20 -0
- etl_forge-1.0.0/etl_forge.egg-info/top_level.txt +1 -0
- etl_forge-1.0.0/pyproject.toml +59 -0
- etl_forge-1.0.0/setup.cfg +4 -0
- etl_forge-1.0.0/tests/test_errors.py +90 -0
- etl_forge-1.0.0/tests/test_generator.py +207 -0
- etl_forge-1.0.0/tests/test_integration.py +293 -0
- etl_forge-1.0.0/tests/test_validator.py +350 -0
etl_forge-1.0.0/PKG-INFO
ADDED
@@ -0,0 +1,433 @@
Metadata-Version: 2.4
Name: etl-forge
Version: 1.0.0
Summary: A Python library for generating synthetic test data and validating ETL outputs.
Author-email: Kyriakos Kartas <mail@kkartas.gr>
Project-URL: Homepage, https://github.com/kkartas/etl-forge
Project-URL: Bug Tracker, https://github.com/kkartas/etl-forge/issues
Keywords: etl,testing,data validation,synthetic data,data quality
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Testing
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pandas>=1.3.0
Requires-Dist: pyyaml>=5.4.0
Requires-Dist: click>=8.0.0
Requires-Dist: openpyxl>=3.0.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: psutil>=5.9.0
Provides-Extra: faker
Requires-Dist: faker>=15.0.0; extra == "faker"
Provides-Extra: dev
Requires-Dist: pytest>=6.0.0; extra == "dev"
Requires-Dist: pytest-cov>=2.0.0; extra == "dev"
Requires-Dist: black>=21.0.0; extra == "dev"
Requires-Dist: flake8>=3.8.0; extra == "dev"
Requires-Dist: mypy>=0.900; extra == "dev"
Requires-Dist: matplotlib>=3.5.0; extra == "dev"
Requires-Dist: coverage>=6.0; extra == "dev"
Requires-Dist: sphinx>=4.0.0; extra == "dev"
Requires-Dist: sphinx-rtd-theme>=1.0.0; extra == "dev"

# ETLForge

[CI status](https://github.com/kkartas/etl-forge/actions/workflows/ci.yml)
[Documentation](https://etl-forge.readthedocs.io/en/latest/)
[PyPI version](https://badge.fury.io/py/etl-forge)
[Code coverage](https://codecov.io/gh/kkartas/etl-forge)
[License: MIT](https://opensource.org/licenses/MIT)

A Python library for generating synthetic test data and validating ETL outputs. ETLForge provides both command-line tools and library functions to help you create realistic test datasets and validate data quality.

## Features

### 🎲 Test Data Generator
- Generate synthetic data based on YAML/JSON schema definitions
- Support for multiple data types: `int`, `float`, `string`, `date`, `category`
- Advanced constraints: ranges, uniqueness, nullable fields, categorical values
- Integration with Faker for realistic string generation
- Export to CSV or Excel formats

### ✅ Data Validator
- Validate CSV/Excel files against schema definitions
- Comprehensive validation checks:
  - Column existence
  - Data type matching
  - Value constraints (ranges, categories)
  - Uniqueness validation
  - Null value validation
  - Date format validation
- Generate detailed reports of invalid rows

### 🔧 Dual Interface
- **Command-line interface** for quick operations
- **Python library** for integration into existing workflows

## Installation

### Prerequisites
- Python 3.8 or higher
- pip package manager

### Install from PyPI (Recommended)
```bash
pip install etl-forge
```

### Install from Source
For development or the latest features:
```bash
git clone https://github.com/kkartas/etl-forge.git
cd etl-forge
pip install -e ".[dev]"
```

### Dependencies
**Core dependencies** (6 total, automatically installed):
- `pandas>=1.3.0` - Data manipulation and analysis
- `pyyaml>=5.4.0` - YAML parsing for schema files
- `click>=8.0.0` - Command-line interface framework
- `openpyxl>=3.0.0` - Excel file support
- `numpy>=1.21.0` - Numerical computing
- `psutil>=5.9.0` - System monitoring for benchmarks

**Optional dependencies** for enhanced features:
```bash
# For realistic data generation using Faker templates
pip install etl-forge[faker]

# For development (testing, linting, documentation)
pip install etl-forge[dev]
```

### Verify Installation
```bash
# CLI verification (may require adding Scripts directory to PATH on Windows)
etl-forge --version

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli --version

# Library verification
python -c "from etl_forge import DataGenerator, DataValidator; print('✅ Installation verified')"
```

### CLI Access Note
On some systems (especially Windows), the `etl-forge` command may not be directly accessible. In such cases, use:
```bash
python -m etl_forge.cli [command] [options]
```

## Quick Start

### 1. Create a Schema

Create a `schema.yaml` file defining your data structure:

```yaml
fields:
  - name: id
    type: int
    unique: true
    nullable: false
    range:
      min: 1
      max: 10000

  - name: name
    type: string
    nullable: false
    faker_template: name

  - name: department
    type: category
    nullable: false
    values:
      - Engineering
      - Marketing
      - Sales
```
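
Since schema files may be YAML or JSON, the same Quick Start schema can be written as a plain dictionary and serialized to JSON. The sketch below does that with the standard `json` module and runs a small structural check. The helper `check_schema_shape` is hypothetical and not part of ETLForge, and the rule that a `category` field needs `values` is an assumption drawn from the examples in this README.

```python
import json

# The Quick Start schema expressed as a plain dictionary.
schema = {
    "fields": [
        {"name": "id", "type": "int", "unique": True, "nullable": False,
         "range": {"min": 1, "max": 10000}},
        {"name": "name", "type": "string", "nullable": False,
         "faker_template": "name"},
        {"name": "department", "type": "category", "nullable": False,
         "values": ["Engineering", "Marketing", "Sales"]},
    ]
}

# Field types listed in the Features section of this README.
SUPPORTED_TYPES = {"int", "float", "string", "date", "category"}


def check_schema_shape(schema):
    """Hypothetical helper (not an ETLForge API): return a list of problems."""
    problems = []
    for i, field in enumerate(schema.get("fields", [])):
        if "name" not in field:
            problems.append(f"field {i}: missing 'name'")
        if field.get("type") not in SUPPORTED_TYPES:
            problems.append(f"field {i}: unsupported type {field.get('type')!r}")
        # Assumption: a category field is only meaningful with allowed values.
        if field.get("type") == "category" and not field.get("values"):
            problems.append(f"field {i}: category needs 'values'")
    return problems


# Serialize to JSON text (what a schema.json file would contain) and read it back.
schema_json = json.dumps(schema, indent=2)
reloaded = json.loads(schema_json)
problems = check_schema_shape(reloaded)
print(problems or "schema shape looks fine")
```

Writing `schema_json` to a file should give a JSON schema equivalent to the YAML one above, which can then be passed wherever a schema path is expected.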

### 2. Generate Test Data

**Command Line:**
```bash
# Direct CLI command (if available)
etl-forge generate --schema schema.yaml --rows 500 --output sample.csv

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli generate --schema schema.yaml --rows 500 --output sample.csv
```

**Python Library:**
```python
from etl_forge import DataGenerator

generator = DataGenerator('schema.yaml')
df = generator.generate_data(500)
generator.save_data(df, 'sample.csv')
```

### 3. Validate Data

**Command Line:**
```bash
# Direct CLI command (if available)
etl-forge check --input sample.csv --schema schema.yaml --report invalid_rows.csv

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli check --input sample.csv --schema schema.yaml --report invalid_rows.csv
```

**Python Library:**
```python
from etl_forge import DataValidator

validator = DataValidator('schema.yaml')
result = validator.validate('sample.csv')
print(f"Validation passed: {result.is_valid}")
```

## Schema Definition

### Supported Field Types

#### Integer (`int`)
```yaml
- name: age
  type: int
  nullable: false
  range:
    min: 18
    max: 65
  unique: false
```

#### Float (`float`)
```yaml
- name: salary
  type: float
  nullable: true
  range:
    min: 30000.0
    max: 150000.0
  precision: 2
  null_rate: 0.1
```

#### String (`string`)
```yaml
- name: email
  type: string
  nullable: false
  unique: true
  length:
    min: 10
    max: 50
  faker_template: email  # Optional: uses Faker library
```

#### Date (`date`)
```yaml
- name: hire_date
  type: date
  nullable: false
  range:
    start: '2020-01-01'
    end: '2024-12-31'
  format: '%Y-%m-%d'
```
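
The `format` value uses the `strftime`/`strptime` directives from Python's `datetime` module. As a rough illustration of what checking a value against such a date field involves, the sketch below parses a string and compares it to the range. The helper `date_in_range` is hypothetical and is not ETLForge's validator code; it also assumes the range bounds are written in the same format as the values.

```python
from datetime import datetime


def date_in_range(value, fmt="%Y-%m-%d", start="2020-01-01", end="2024-12-31"):
    """Hypothetical helper: True if `value` parses with `fmt` and lies in [start, end]."""
    try:
        parsed = datetime.strptime(value, fmt)
    except ValueError:
        return False  # wrong format or an impossible date such as 2023-02-30
    # Assumption: the range bounds use the same format string as the values.
    return datetime.strptime(start, fmt) <= parsed <= datetime.strptime(end, fmt)


print(date_in_range("2021-06-15"))   # inside the range
print(date_in_range("2019-12-31"))   # parses, but before `start`
print(date_in_range("15/06/2021"))   # does not match the format
```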

#### Category (`category`)
```yaml
- name: status
  type: category
  nullable: false
  values:
    - Active
    - Inactive
    - Pending
```

### Schema Constraints

- **`nullable`**: Allow null values (default: `false`)
- **`unique`**: Ensure all values are unique (default: `false`)
- **`range`**: Define min/max values for numeric types or start/end dates
- **`values`**: List of allowed values for categorical fields
- **`length`**: Min/max length for string fields
- **`precision`**: Decimal places for float fields
- **`format`**: Date format string (default: `'%Y-%m-%d'`)
- **`faker_template`**: Faker method name for realistic string generation
- **`null_rate`**: Probability of null values when `nullable: true` (default: 0.1)
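
To make `null_rate` concrete: one plausible reading is that each generated value is independently replaced by a null with that probability. The sketch below illustrates that reading with the standard `random` module. The helper `apply_null_rate` is hypothetical, and ETLForge's generator may implement this differently.

```python
import random


def apply_null_rate(values, null_rate=0.1, seed=None):
    """Hypothetical helper: replace each value with None with probability null_rate."""
    rng = random.Random(seed)  # local generator so the global random state is untouched
    return [None if rng.random() < null_rate else v for v in values]


# 10,000 made-up salary values, rounded to 2 decimal places (like `precision: 2`).
salaries = [round(30000.0 + i * 12.5, 2) for i in range(10_000)]
with_nulls = apply_null_rate(salaries, null_rate=0.1, seed=42)

observed = sum(v is None for v in with_nulls) / len(with_nulls)
print(f"observed null fraction: {observed:.3f}")
```

Under this reading the observed fraction of nulls only approximates `null_rate`, so a validator should not expect exactly 10% of a nullable column to be empty.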

## Command Line Interface

### Generate Data
```bash
# Direct CLI command (if available)
etl-forge generate [OPTIONS]

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli generate [OPTIONS]

Options:
  -s, --schema PATH         Path to schema file (YAML or JSON) [required]
  -r, --rows INTEGER        Number of rows to generate (default: 100)
  -o, --output PATH         Output file path (CSV or Excel) [required]
  -f, --format [csv|excel]  Output format (auto-detected if not specified)
```
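
The README does not say how the output format is auto-detected. One plausible reading is that it keys on the output file's extension, which the sketch below illustrates. The helper `infer_output_format` and the extension mapping are assumptions for illustration only, not the CLI's actual logic.

```python
from pathlib import Path


def infer_output_format(path, explicit=None):
    """Hypothetical helper: choose 'csv' or 'excel' as an auto-detect step might."""
    if explicit is not None:
        return explicit  # an explicit --format value takes priority
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return "csv"
    if suffix in {".xlsx", ".xls"}:  # assumed set of Excel extensions
        return "excel"
    raise ValueError(f"cannot infer output format from {path!r}")


print(infer_output_format("sample.csv"))
print(infer_output_format("report.XLSX"))
print(infer_output_format("data.out", explicit="csv"))
```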

### Validate Data
```bash
# Direct CLI command (if available)
etl-forge check [OPTIONS]

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli check [OPTIONS]

Options:
  -i, --input PATH   Path to input data file [required]
  -s, --schema PATH  Path to schema file [required]
  -r, --report PATH  Path to save invalid rows report (optional)
  -v, --verbose      Show detailed validation errors
```

### Create Example Schema
```bash
# Direct CLI command (if available)
etl-forge create-schema example_schema.yaml

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli create-schema example_schema.yaml
```

## Library Usage

### Data Generation

```python
from etl_forge import DataGenerator

# Initialize with schema
generator = DataGenerator('schema.yaml')

# Generate data
df = generator.generate_data(1000)

# Save to file
generator.save_data(df, 'output.csv')

# Or do both in one step
df = generator.generate_and_save(1000, 'output.xlsx', 'excel')
```

### Data Validation

```python
from etl_forge import DataValidator

# Initialize validator
validator = DataValidator('schema.yaml')

# Validate data
result = validator.validate('data.csv')

# Check results
if result.is_valid:
    print("✅ Data is valid!")
else:
    print(f"❌ Found {len(result.errors)} validation errors")
    print(f"Invalid rows: {len(result.invalid_rows)}")

# Generate report
result = validator.validate_and_report('data.csv', 'errors.csv')

# Print summary
validator.print_validation_summary(result)
```
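
In an automated pipeline it can be handy to turn a validation result into a process exit code. The sketch below relies only on the attributes used in the example above (`is_valid`, `errors`, `invalid_rows`). The helper `exit_code_for` is hypothetical, and a stand-in object replaces a real result so the snippet runs without ETLForge installed.

```python
import sys
from types import SimpleNamespace


def exit_code_for(result):
    """Hypothetical helper: map a validation result to a process exit code."""
    if result.is_valid:
        return 0
    # Attribute names follow the README example: errors, invalid_rows.
    print(
        f"{len(result.errors)} validation errors in {len(result.invalid_rows)} rows",
        file=sys.stderr,
    )
    return 1


# Stand-in for a real validation result (made-up values, for illustration only).
fake_result = SimpleNamespace(is_valid=False, errors=["id: duplicate value"], invalid_rows=[7])
code = exit_code_for(fake_result)
print(f"exit code: {code}")
```

With a real validator, passing the object returned by `validator.validate('data.csv')` to `sys.exit(exit_code_for(...))` would fail the job whenever validation does not pass.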

### Advanced Usage

```python
import pandas as pd

from etl_forge import DataGenerator, DataValidator

# Use schema as dictionary
schema_dict = {
    'fields': [
        {'name': 'id', 'type': 'int', 'unique': True},
        {'name': 'name', 'type': 'string', 'faker_template': 'name'}
    ]
}

generator = DataGenerator(schema_dict)
validator = DataValidator(schema_dict)

# Validate DataFrame directly
df = pd.read_csv('data.csv')
result = validator.validate(df)
```

## Faker Integration

When the `faker` library is installed, you can use realistic data generation:

```yaml
- name: first_name
  type: string
  faker_template: first_name

- name: address
  type: string
  faker_template: address

- name: phone
  type: string
  faker_template: phone_number
```

Common Faker templates:
- `name`, `first_name`, `last_name`
- `email`, `phone_number`
- `address`, `city`, `country`
- `company`, `job`
- `date`, `time`
- And many more! See [Faker documentation](https://faker.readthedocs.io/)
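
Because `faker_template` holds a Faker method name, a generator can look the method up by name and call it. The sketch below shows that dispatch pattern with a tiny stand-in class so it runs without Faker installed. Both `StandInFaker` and `generate_from_template` are hypothetical names, and ETLForge's own lookup may differ.

```python
class StandInFaker:
    """Tiny stand-in with two Faker-style methods (fixed return values)."""

    def first_name(self):
        return "Alex"

    def city(self):
        return "Athens"


def generate_from_template(provider, template):
    """Hypothetical helper: call the provider method named by `template`."""
    method = getattr(provider, template, None)
    if not callable(method):
        raise ValueError(f"unknown faker_template: {template!r}")
    return method()


provider = StandInFaker()
print(generate_from_template(provider, "first_name"))
print(generate_from_template(provider, "city"))
```

With the `faker` extra installed, passing a `faker.Faker()` instance as `provider` should work the same way, since names such as `first_name` and `phone_number` are Faker methods.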

## Testing

Run the test suite:

```bash
pytest tests/
```

Run with coverage:

```bash
pytest tests/ --cov=etl_forge --cov-report=html
```

## Performance

Performance benchmarks are available in [`BENCHMARKS.md`](BENCHMARKS.md). To reproduce them, run:

```bash
python benchmark.py
```

Then, to visualize the results:

```bash
python plot_benchmark.py
```

## Citation

If you use `ETLForge` in your research or work, please cite it using the information in `CITATION.cff`.