easy-data-loader 0.1.1__tar.gz → 0.1.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- easy_data_loader-0.1.2/PKG-INFO +110 -0
- easy_data_loader-0.1.2/README.md +84 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/pyproject.toml +11 -2
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/cli.py +14 -10
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/models.py +8 -1
- easy_data_loader-0.1.2/src/easy_data_loader.egg-info/PKG-INFO +110 -0
- easy_data_loader-0.1.2/src/easy_data_loader.egg-info/entry_points.txt +2 -0
- easy_data_loader-0.1.1/PKG-INFO +0 -81
- easy_data_loader-0.1.1/README.md +0 -62
- easy_data_loader-0.1.1/src/easy_data_loader.egg-info/PKG-INFO +0 -81
- easy_data_loader-0.1.1/src/easy_data_loader.egg-info/entry_points.txt +0 -2
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/LICENSE +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/setup.cfg +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/__init__.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/config_loader.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/custom_exceptions.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/data_inferrence.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/database_connector.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/database_operations.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/driver_detector.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/file_operations.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/log.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/orchestrator.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/pipeline.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/pipeline_base.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/procedure_pipeline.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader.egg-info/SOURCES.txt +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader.egg-info/dependency_links.txt +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader.egg-info/requires.txt +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader.egg-info/top_level.txt +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/tests/test_data_inference.py +0 -0
- {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/tests/test_imports.py +0 -0
|
@@ -0,0 +1,110 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: easy_data_loader
|
|
3
|
+
Version: 0.1.2
|
|
4
|
+
Summary: Data transfer utilities between files and databases
|
|
5
|
+
Author-email: Bojoi Gabriel <bojoigabriel@gmail.com>
|
|
6
|
+
Classifier: Development Status :: 3 - Alpha
|
|
7
|
+
Classifier: Intended Audience :: Developers
|
|
8
|
+
Classifier: Topic :: Database
|
|
9
|
+
Classifier: Topic :: Scientific/Engineering :: Information Analysis
|
|
10
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
11
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
12
|
+
Classifier: Operating System :: OS Independent
|
|
13
|
+
Requires-Python: >=3.13
|
|
14
|
+
Description-Content-Type: text/markdown
|
|
15
|
+
License-File: LICENSE
|
|
16
|
+
Requires-Dist: click>=8.3.0
|
|
17
|
+
Requires-Dist: openpyxl>=3.1.5
|
|
18
|
+
Requires-Dist: pandas>=2.3.3
|
|
19
|
+
Requires-Dist: pyarrow>=22.0.0
|
|
20
|
+
Requires-Dist: pydantic>=2.12.5
|
|
21
|
+
Requires-Dist: pydantic-settings>=2.12.0
|
|
22
|
+
Requires-Dist: pyodbc>=5.2.0
|
|
23
|
+
Requires-Dist: python-dotenv>=1.1.1
|
|
24
|
+
Requires-Dist: sqlalchemy>=2.0.43
|
|
25
|
+
Dynamic: license-file
|
|
26
|
+
|
|
27
|
+
# Easy Data Loader 🚀
|
|
28
|
+
|
|
29
|
+
|
|
30
|
+
[](https://badge.fury.io/py/easy-data-loader)
|
|
31
|
+
[](https://opensource.org/licenses/MIT)
|
|
32
|
+

|
|
33
|
+
|
|
34
|
+
**Easy Data Loader** is a flexible, modular Python library designed to streamline ETL (Extract, Transform, Load) processes between various file data sources (csv, xlsx, parquet, orc) and databases (MSSQL, PostgreSQL and others).
|
|
35
|
+
|
|
36
|
+
## ✨ Key Features
|
|
37
|
+
- **Declarative Configuration**: Manage connections and pipelines through simple python files and `.env` resources.
|
|
38
|
+
- **Integrated CLI**: Initialize a standardized project structure with a single command.
|
|
39
|
+
- **Custom Transformation Hooks**: Inject your own Pandas transformation logic directly into the pipeline execution.
|
|
40
|
+
- **Performance Optimized**: Built-in support for chunked loading and writing to handle large datasets efficiently.
|
|
41
|
+
- **Extensible Architecture**: Uses a Factory Pattern for database connectors, making it easy to support new drivers.
|
|
42
|
+
|
|
43
|
+
---
|
|
44
|
+
|
|
45
|
+
## 📦 Installation
|
|
46
|
+
|
|
47
|
+
Install directly via `pip` or `uv`:
|
|
48
|
+
|
|
49
|
+
```bash
|
|
50
|
+
pip install easy_data_loader
|
|
51
|
+
uv add easy_data_loader
|
|
52
|
+
```
|
|
53
|
+
|
|
54
|
+
## 🚀 Getting Started
|
|
55
|
+
|
|
56
|
+
1. Initialize a new project structure to generate template configurations:
|
|
57
|
+
```bash
|
|
58
|
+
easy-data-loader init
|
|
59
|
+
```
|
|
60
|
+
2. Review the generated `config/` folders for sample resources and pipelines.
|
|
61
|
+
3. Run all discovered pipelines across the active configurations:
|
|
62
|
+
```bash
|
|
63
|
+
easy-data-loader run_all
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
## ✔️ Generic concepts
|
|
67
|
+
|
|
68
|
+
`easy_data_loader` uses `resources` as a way to define a file or a database. The resources can represent either a source or a destination, making possible the following ETL scenarios: file -> file, file -> database, database -> file, database -> database.
|
|
69
|
+
|
|
70
|
+
The `easy_data_loader` project initializer will create the predefined folder structure `/config/resources`, where the resources are expected to be defined following the current convention: the file type is `.env` and the file name must be prefixed with the resource type, `file_` or `database_`. The predefined folder structure together with the naming convention enables `easy_data_loader` to find and load all resources.
|
|
71
|
+
|
|
72
|
+
A secondary predefined folder `/config/pipelines` will contain the pipeline definition files, which are regular Python files. There are 3 types of pipelines that can be defined:
|
|
73
|
+
- `LoadPipeline` the main pipeline type which transports data from source to destination
|
|
74
|
+
- `ProcedurePipeline` a pipeline dedicated for executing stored procedures inside a database
|
|
75
|
+
- `OrchestratorPipeline` a pipeline that can execute a group of pipelines sequentially
|
|
76
|
+
|
|
77
|
+
## LoadPipeline
|
|
78
|
+
|
|
79
|
+
In order to define a `LoadPipeline` we must use the `BasePipelineDefinition` from `easy_data_loader` as depicted in the example pipelines created by the initializer.
|
|
80
|
+
In the simplest form there are only a few mandatory parameters:
|
|
81
|
+
- `pipeline_name : str` - this name will be used to execute the pipeline
|
|
82
|
+
- `source : str` - the file name (without extension) corresponding to the resource to use as the data source
|
|
83
|
+
- `destination : str` - the file name (without extension) corresponding to the resource to use as the data destination
|
|
84
|
+
|
|
85
|
+
If either the source or the destination is a database, then additional parameters become mandatory:
|
|
86
|
+
- `source_sql : str` - can be a table name or a specific query in the SQL dialect of the source database flavor
|
|
87
|
+
- `destination_table : str` - table name where the data will be inserted
|
|
88
|
+
|
|
89
|
+
There are many other aspects of the pipeline that can be defined:
|
|
90
|
+
- `audit : str` - the pipeline has built-in audit functionality: it records certain information in a SQLite database after the pipeline completes. If the user desires, the same information can be recorded in a database `resource`
|
|
91
|
+
- `validator: Pydantic BaseModel` - the data read from the source `resource` can be validated using an arbitrarily defined Pydantic model before it is written to the destination
|
|
92
|
+
- `columns : Dict[str, ColumnDefinition]` - this parameter is used for strict control over how the data is written to the destination; it has the dual purpose of renaming the columns and explicitly defining the data types (mainly for inserting into a database table); the `ColumnDefinition` is constructed with an optional `target_name: str` for renaming columns and / or a `data_type : SQLAlchemy Type`, thus controlling column data types, lengths, precision, etc.
|
|
93
|
+
- `read_parameters : Dict[str, Any]` and `write_parameters : Dict[str, Any]` - these parameters control how the data is read from the source or written to the destination and provide an easy way to use special delimiters for files, drop and recreate the database table, etc. `easy_data_loader` uses pandas as the transport layer, therefore the read and write parameters will be passed to the corresponding read and write functions supported by pandas.
|
|
94
|
+
- the pipeline has a set of predefined hooks allowing the execution of functions at specific moments during the execution: `file_pre_process : Callable` - executed before the file is read into the pandas DataFrame (e.g. unzip the file); `transform : Callable` - perform data transformation over the data already in the pandas DataFrame (requires pandas methods); `file_post_process : Callable` - after the pipeline completes and the data is written to the destination perform post processing on the source file (e.g. move the file to another folder)
|
|
95
|
+
|
|
96
|
+
## ProcedurePipeline
|
|
97
|
+
|
|
98
|
+
This secondary pipeline type is responsible for executing one or more stored procedures inside a database.
|
|
99
|
+
To define one we need to use the `ProcedureDefinition` with the following parameters:
|
|
100
|
+
- `pipeline_name : str` - this name will be used to execute the pipeline
|
|
101
|
+
- `audit : str, optional` - database resource name where the audit info will be recorded
|
|
102
|
+
- `resource : str` - database resource name where the stored procedure(s) will be executed
|
|
103
|
+
- `procedures : List[Tuple[str, Optional[Dict[str, Any]]]]` - list of one or more stored procedures along with optional procedure parameters as dictionaries
|
|
104
|
+
|
|
105
|
+
## OrchestratorPipeline
|
|
106
|
+
|
|
107
|
+
This pipeline type is responsible for sequentially executing a set of pipelines, `LoadPipeline`s and / or `ProcedurePipeline`s. It is defined using the `OrchestratorDefinition` with:
|
|
108
|
+
- `orchestrator_name : str` - name by which the orchestrator is executed
|
|
109
|
+
- `pipelines : List[str]` - list of pipelines to execute sequentially
|
|
110
|
+
- `fail_fast : bool, Default True` - if any of the pipelines fail the rest of the pipelines in the list do not get executed
|
|
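The `fail_fast` behaviour described for `OrchestratorPipeline` can be pictured with a small, library-independent sketch. The function `run_sequentially` and the demo callables are hypothetical stand-ins written for illustration; they are not part of `easy_data_loader`, and the real orchestrator may track state differently.

```python
from typing import Callable, Dict, List


def run_sequentially(
    pipelines: Dict[str, Callable[[], None]],
    order: List[str],
    fail_fast: bool = True,
) -> Dict[str, str]:
    """Run named callables in the given order and report a status for each."""
    status: Dict[str, str] = {}
    for index, name in enumerate(order):
        try:
            pipelines[name]()
            status[name] = "ok"
        except Exception:
            status[name] = "failed"
            if fail_fast:
                # Mark everything after the failure as skipped and stop.
                for skipped in order[index + 1:]:
                    status[skipped] = "skipped"
                break
    return status


def _ok() -> None:
    """Hypothetical pipeline that succeeds."""


def _boom() -> None:
    """Hypothetical pipeline that fails."""
    raise RuntimeError("simulated failure")


demo = {"load_a": _ok, "proc_b": _boom, "load_c": _ok}
order = ["load_a", "proc_b", "load_c"]
result_fast = run_sequentially(demo, order, fail_fast=True)
result_slow = run_sequentially(demo, order, fail_fast=False)
```

With `fail_fast=True` the third entry is never run; with `fail_fast=False` the loop continues past the failure.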
@@ -0,0 +1,84 @@
|
|
|
1
|
+
# Easy Data Loader 🚀
|
|
2
|
+
|
|
3
|
+
|
|
4
|
+
[](https://badge.fury.io/py/easy-data-loader)
|
|
5
|
+
[](https://opensource.org/licenses/MIT)
|
|
6
|
+

|
|
7
|
+
|
|
8
|
+
**Easy Data Loader** is a flexible, modular Python library designed to streamline ETL (Extract, Transform, Load) processes between various file data sources (csv, xlsx, parquet, orc) and databases (MSSQL, PostgreSQL and others).
|
|
9
|
+
|
|
10
|
+
## ✨ Key Features
|
|
11
|
+
- **Declarative Configuration**: Manage connections and pipelines through simple python files and `.env` resources.
|
|
12
|
+
- **Integrated CLI**: Initialize a standardized project structure with a single command.
|
|
13
|
+
- **Custom Transformation Hooks**: Inject your own Pandas transformation logic directly into the pipeline execution.
|
|
14
|
+
- **Performance Optimized**: Built-in support for chunked loading and writing to handle large datasets efficiently.
|
|
15
|
+
- **Extensible Architecture**: Uses a Factory Pattern for database connectors, making it easy to support new drivers.
|
|
16
|
+
|
|
17
|
+
---
|
|
18
|
+
|
|
19
|
+
## 📦 Installation
|
|
20
|
+
|
|
21
|
+
Install directly via `pip` or `uv`:
|
|
22
|
+
|
|
23
|
+
```bash
|
|
24
|
+
pip install easy_data_loader
|
|
25
|
+
uv add easy_data_loader
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
## 🚀 Getting Started
|
|
29
|
+
|
|
30
|
+
1. Initialize a new project structure to generate template configurations:
|
|
31
|
+
```bash
|
|
32
|
+
easy-data-loader init
|
|
33
|
+
```
|
|
34
|
+
2. Review the generated `config/` folders for sample resources and pipelines.
|
|
35
|
+
3. Run all discovered pipelines across the active configurations:
|
|
36
|
+
```bash
|
|
37
|
+
easy-data-loader run_all
|
|
38
|
+
```
|
|
39
|
+
|
|
40
|
+
## ✔️ Generic concepts
|
|
41
|
+
|
|
42
|
+
`easy_data_loader` uses `resources` as a way to define a file or a database. The resources can represent either a source or a destination, making possible the following ETL scenarios: file -> file, file -> database, database -> file, database -> database.
|
|
43
|
+
|
|
44
|
+
The `easy_data_loader` project initializer will create the predefined folder structure `/config/resources`, where the resources are expected to be defined following the current convention: the file type is `.env` and the file name must be prefixed with the resource type, `file_` or `database_`. The predefined folder structure together with the naming convention enables `easy_data_loader` to find and load all resources.
|
|
45
|
+
|
|
46
|
+
A secondary predefined folder `/config/pipelines` will contain the pipeline definition files, which are regular Python files. There are 3 types of pipelines that can be defined:
|
|
47
|
+
- `LoadPipeline` the main pipeline type which transports data from source to destination
|
|
48
|
+
- `ProcedurePipeline` a pipeline dedicated for executing stored procedures inside a database
|
|
49
|
+
- `OrchestratorPipeline` a pipeline that can execute a group of pipelines sequentially
|
|
50
|
+
|
|
51
|
+
## LoadPipeline
|
|
52
|
+
|
|
53
|
+
In order to define a `LoadPipeline` we must use the `BasePipelineDefinition` from `easy_data_loader` as depicted in the example pipelines created by the initializer.
|
|
54
|
+
In the simplest form there are only a few mandatory parameters:
|
|
55
|
+
- `pipeline_name : str` - this name will be used to execute the pipeline
|
|
56
|
+
- `source : str` - the file name (without extension) corresponding to the resource to use as the data source
|
|
57
|
+
- `destination : str` - the file name (without extension) corresponding to the resource to use as the data destination
|
|
58
|
+
|
|
59
|
+
If either the source or the destination is a database, then additional parameters become mandatory:
|
|
60
|
+
- `source_sql : str` - can be a table name or a specific query in the SQL dialect of the source database flavor
|
|
61
|
+
- `destination_table : str` - table name where the data will be inserted
|
|
62
|
+
|
|
63
|
+
There are many other aspects of the pipeline that can be defined:
|
|
64
|
+
- `audit : str` - the pipeline has built-in audit functionality: it records certain information in a SQLite database after the pipeline completes. If the user desires, the same information can be recorded in a database `resource`
|
|
65
|
+
- `validator: Pydantic BaseModel` - the data read from the source `resource` can be validated using an arbitrarily defined Pydantic model before it is written to the destination
|
|
66
|
+
- `columns : Dict[str, ColumnDefinition]` - this parameter is used for strict control over how the data is written to the destination; it has the dual purpose of renaming the columns and explicitly defining the data types (mainly for inserting into a database table); the `ColumnDefinition` is constructed with an optional `target_name: str` for renaming columns and / or a `data_type : SQLAlchemy Type`, thus controlling column data types, lengths, precision, etc.
|
|
67
|
+
- `read_parameters : Dict[str, Any]` and `write_parameters : Dict[str, Any]` - these parameters control how the data is read from the source or written to the destination and provide an easy way to use special delimiters for files, drop and recreate the database table, etc. `easy_data_loader` uses pandas as the transport layer, therefore the read and write parameters will be passed to the corresponding read and write functions supported by pandas.
|
|
68
|
+
- the pipeline has a set of predefined hooks allowing the execution of functions at specific moments during the execution: `file_pre_process : Callable` - executed before the file is read into the pandas DataFrame (e.g. unzip the file); `transform : Callable` - perform data transformation over the data already in the pandas DataFrame (requires pandas methods); `file_post_process : Callable` - after the pipeline completes and the data is written to the destination perform post processing on the source file (e.g. move the file to another folder)
|
|
69
|
+
|
|
70
|
+
## ProcedurePipeline
|
|
71
|
+
|
|
72
|
+
This secondary pipeline type is responsible for executing one or more stored procedures inside a database.
|
|
73
|
+
To define one we need to use the `ProcedureDefinition` with the following parameters:
|
|
74
|
+
- `pipeline_name : str` - this name will be used to execute the pipeline
|
|
75
|
+
- `audit : str, optional` - database resource name where the audit info will be recorded
|
|
76
|
+
- `resource : str` - database resource name where the stored procedure(s) will be executed
|
|
77
|
+
- `procedures : List[Tuple[str, Optional[Dict[str, Any]]]]` - list of one or more stored procedures along with optional procedure parameters as dictionaries
|
|
78
|
+
|
|
79
|
+
## OrchestratorPipeline
|
|
80
|
+
|
|
81
|
+
This pipeline type is responsible for sequentially executing a set of pipelines, `LoadPipeline`s and / or `ProcedurePipeline`s. It is defined using the `OrchestratorDefinition` with:
|
|
82
|
+
- `orchestrator_name : str` - name by which the orchestrator is executed
|
|
83
|
+
- `pipelines : List[str]` - list of pipelines to execute sequentially
|
|
84
|
+
- `fail_fast : bool, Default True` - if any of the pipelines fail the rest of the pipelines in the list do not get executed
|
|
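The resource naming convention from the "Generic concepts" section (`.env` files prefixed with `file_` or `database_`) lends itself to a simple discovery routine. The sketch below is only an illustration under that stated convention: `discover_resources` is a hypothetical helper, not the library's loader, and it assumes the resource name is the file name without its extension.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def discover_resources(resources_dir: Path) -> Dict[str, List[str]]:
    """Group `.env` files in a folder by their `file_` / `database_` prefix."""
    found: Dict[str, List[str]] = {"file": [], "database": []}
    for env_file in sorted(resources_dir.glob("*.env")):
        for kind in found:
            if env_file.name.startswith(f"{kind}_"):
                # The stem (name without extension) acts as the resource name.
                found[kind].append(env_file.stem)
    return found


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("file_sales.env", "database_dwh.env", "other.env", "notes.txt"):
        (folder / name).write_text("")
    resources = discover_resources(folder)
```

Files that do not follow the prefix convention (`other.env`, `notes.txt`) are ignored.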
@@ -1,6 +1,6 @@
|
|
|
1
1
|
[project]
|
|
2
2
|
name = "easy_data_loader"
|
|
3
|
-
version = "0.1.1"
|
|
3
|
+
version = "0.1.2"
|
|
4
4
|
description = "Data transfer utilities between files and databases"
|
|
5
5
|
authors = [{ name = "Bojoi Gabriel", email = "bojoigabriel@gmail.com" }]
|
|
6
6
|
readme = "README.md"
|
|
@@ -16,12 +16,21 @@ dependencies = [
|
|
|
16
16
|
"python-dotenv>=1.1.1",
|
|
17
17
|
"sqlalchemy>=2.0.43",
|
|
18
18
|
]
|
|
19
|
+
classifiers = [
|
|
20
|
+
"Development Status :: 3 - Alpha",
|
|
21
|
+
"Intended Audience :: Developers",
|
|
22
|
+
"Topic :: Database",
|
|
23
|
+
"Topic :: Scientific/Engineering :: Information Analysis",
|
|
24
|
+
"License :: OSI Approved :: MIT License",
|
|
25
|
+
"Programming Language :: Python :: 3.13",
|
|
26
|
+
"Operating System :: OS Independent",
|
|
27
|
+
]
|
|
19
28
|
|
|
20
29
|
[dependency-groups]
|
|
21
30
|
dev = ["ipykernel>=7.1.0", "pytest>=8.4.2", "ruff", "mypy", "pre-commit"]
|
|
22
31
|
|
|
23
32
|
[project.scripts]
|
|
24
|
-
easy-loader = "easy_data_loader.cli:main"
|
|
33
|
+
easy-data-loader = "easy_data_loader.cli:main"
|
|
25
34
|
|
|
26
35
|
[tool.setuptools.packages.find]
|
|
27
36
|
where = ["src"]
|
|
@@ -89,26 +89,30 @@ CONN_PORT=1433
|
|
|
89
89
|
|
|
90
90
|
FILE_ENV = """
|
|
91
91
|
# file resource definition
|
|
92
|
-
FILE_TYPE=CSV
|
|
93
|
-
FOLDER_PATH=./data/imports
|
|
94
|
-
FILE_NAME=large_sales_data
|
|
92
|
+
FILE_TYPE=CSV # can also be XLSX, PARQUET, ORC
|
|
93
|
+
FOLDER_PATH=./data/imports # source folder where the file is located
|
|
94
|
+
FILE_NAME=large_sales_data # exact file name without extension
|
|
95
|
+
#FILE_PATTERN=large_sales # file pattern to search in the source folder
|
|
95
96
|
"""
|
|
96
97
|
|
|
97
98
|
MAIN = """
|
|
98
|
-
from easy_data_loader
|
|
99
|
+
from easy_data_loader import LoadPipeline, ProcedurePipeline
|
|
100
|
+
|
|
101
|
+
def main():
|
|
102
|
+
# Run an ETL pipeline
|
|
103
|
+
LoadPipeline(pipeline_name="example_pipeline").run()
|
|
99
104
|
|
|
100
|
-
# Run
|
|
101
|
-
|
|
105
|
+
# Run a procedure pipeline
|
|
106
|
+
ProcedurePipeline(pipeline_name="example_procedure").run()
|
|
102
107
|
|
|
103
|
-
|
|
104
|
-
|
|
105
|
-
# ProcedurePipeline(pipeline_name="example_procedure").run()
|
|
108
|
+
if __name__ == "__main__":
|
|
109
|
+
main()
|
|
106
110
|
"""
|
|
107
111
|
|
|
108
112
|
|
|
109
113
|
@click.group()
|
|
110
114
|
def main():
|
|
111
|
-
"""Easy Data Loader CLI - ETL instrument
|
|
115
|
+
"""Easy Data Loader CLI - ETL instrument for files and databases"""
|
|
112
116
|
pass
|
|
113
117
|
|
|
114
118
|
|
|
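The updated `FILE_ENV` template in this hunk adds inline `#` comments after each value. A minimal sketch of how such a file could be read is shown below, assuming that an inline comment starts at whitespace followed by `#` (the way common `.env` parsers treat unquoted values). `parse_env_text` is a hypothetical, hand-rolled helper for illustration; the diff does not show how `easy_data_loader` itself loads these files.

```python
import re
from typing import Dict

FILE_ENV = """
# file resource definition
FILE_TYPE=CSV               # can also be XLSX, PARQUET, ORC
FOLDER_PATH=./data/imports  # source folder where the file is located
FILE_NAME=large_sales_data  # exact file name without extension
#FILE_PATTERN=large_sales   # file pattern to search in the source folder
"""


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping comment lines and inline comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment introduced by whitespace + '#'.
        value = re.split(r"\s+#", value, maxsplit=1)[0]
        values[key.strip()] = value.strip()
    return values


settings = parse_env_text(FILE_ENV)
```

The commented-out `FILE_PATTERN` line is skipped entirely, so only the three active keys are returned.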
@@ -117,7 +117,7 @@ class ColumnDefinition(BaseModel):
|
|
|
117
117
|
|
|
118
118
|
|
|
119
119
|
class BasePipelineDefinition(BaseModel):
|
|
120
|
-
"""Base pipeline definition
|
|
120
|
+
"""Base pipeline definition"""
|
|
121
121
|
|
|
122
122
|
model_config = ConfigDict(arbitrary_types_allowed=True, extra="allow")
|
|
123
123
|
pipeline_name: str = "generic_pipeline"
|
|
@@ -149,6 +149,13 @@ class BasePipelineDefinition(BaseModel):
|
|
|
149
149
|
"""
|
|
150
150
|
return df
|
|
151
151
|
|
|
152
|
+
def file_post_process(self, file_path: Path) -> Path:
|
|
153
|
+
"""
|
|
154
|
+
Hook for postprocessing the file after the data is written to the destination.
|
|
155
|
+
Ex: archive file, etc.
|
|
156
|
+
"""
|
|
157
|
+
return file_path
|
|
158
|
+
|
|
152
159
|
def get_sql_columns(self) -> Dict[str, ColumnDefinition]:
|
|
153
160
|
"""
|
|
154
161
|
Return the sql column definition from the pipeline.
|
|
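The `file_post_process(self, file_path: Path) -> Path` hook added above returns the path unchanged by default. As a sketch of what an override body might do, the standalone function below moves a processed file into a sibling folder. `archive_file` and the `archive` folder name are assumptions made for illustration, not part of the package.

```python
import shutil
import tempfile
from pathlib import Path


def archive_file(file_path: Path, archive_dir_name: str = "archive") -> Path:
    """Move a file into an archive folder next to it and return the new path."""
    archive_dir = file_path.parent / archive_dir_name
    archive_dir.mkdir(exist_ok=True)
    target = archive_dir / file_path.name
    shutil.move(str(file_path), str(target))
    return target


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "large_sales_data.csv"
    source.write_text("id,amount\n1,10\n")
    archived = archive_file(source)
    source_still_exists = source.exists()
    archived_exists = archived.exists()
    archived_parent = archived.parent.name
```

Returning the new path keeps the same `Path -> Path` shape as the hook signature shown in the hunk.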
@@ -0,0 +1,110 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: easy_data_loader
|
|
3
|
+
Version: 0.1.2
|
|
4
|
+
Summary: Data transfer utilities between files and databases
|
|
5
|
+
Author-email: Bojoi Gabriel <bojoigabriel@gmail.com>
|
|
6
|
+
Classifier: Development Status :: 3 - Alpha
|
|
7
|
+
Classifier: Intended Audience :: Developers
|
|
8
|
+
Classifier: Topic :: Database
|
|
9
|
+
Classifier: Topic :: Scientific/Engineering :: Information Analysis
|
|
10
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
11
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
12
|
+
Classifier: Operating System :: OS Independent
|
|
13
|
+
Requires-Python: >=3.13
|
|
14
|
+
Description-Content-Type: text/markdown
|
|
15
|
+
License-File: LICENSE
|
|
16
|
+
Requires-Dist: click>=8.3.0
|
|
17
|
+
Requires-Dist: openpyxl>=3.1.5
|
|
18
|
+
Requires-Dist: pandas>=2.3.3
|
|
19
|
+
Requires-Dist: pyarrow>=22.0.0
|
|
20
|
+
Requires-Dist: pydantic>=2.12.5
|
|
21
|
+
Requires-Dist: pydantic-settings>=2.12.0
|
|
22
|
+
Requires-Dist: pyodbc>=5.2.0
|
|
23
|
+
Requires-Dist: python-dotenv>=1.1.1
|
|
24
|
+
Requires-Dist: sqlalchemy>=2.0.43
|
|
25
|
+
Dynamic: license-file
|
|
26
|
+
|
|
27
|
+
# Easy Data Loader 🚀
|
|
28
|
+
|
|
29
|
+
|
|
30
|
+
[](https://badge.fury.io/py/easy-data-loader)
|
|
31
|
+
[](https://opensource.org/licenses/MIT)
|
|
32
|
+

|
|
33
|
+
|
|
34
|
+
**Easy Data Loader** is a flexible, modular Python library designed to streamline ETL (Extract, Transform, Load) processes between various file data sources (csv, xlsx, parquet, orc) and databases (MSSQL, PostgreSQL and others).
|
|
35
|
+
|
|
36
|
+
## ✨ Key Features
|
|
37
|
+
- **Declarative Configuration**: Manage connections and pipelines through simple python files and `.env` resources.
|
|
38
|
+
- **Integrated CLI**: Initialize a standardized project structure with a single command.
|
|
39
|
+
- **Custom Transformation Hooks**: Inject your own Pandas transformation logic directly into the pipeline execution.
|
|
40
|
+
- **Performance Optimized**: Built-in support for chunked loading and writing to handle large datasets efficiently.
|
|
41
|
+
- **Extensible Architecture**: Uses a Factory Pattern for database connectors, making it easy to support new drivers.
|
|
42
|
+
|
|
43
|
+
---
|
|
44
|
+
|
|
45
|
+
## 📦 Installation
|
|
46
|
+
|
|
47
|
+
Install directly via `pip` or `uv`:
|
|
48
|
+
|
|
49
|
+
```bash
|
|
50
|
+
pip install easy_data_loader
|
|
51
|
+
uv add easy_data_loader
|
|
52
|
+
```
|
|
53
|
+
|
|
54
|
+
## 🚀 Getting Started
|
|
55
|
+
|
|
56
|
+
1. Initialize a new project structure to generate template configurations:
|
|
57
|
+
```bash
|
|
58
|
+
easy-data-loader init
|
|
59
|
+
```
|
|
60
|
+
2. Review the generated `config/` folders for sample resources and pipelines.
|
|
61
|
+
3. Run all discovered pipelines across the active configurations:
|
|
62
|
+
```bash
|
|
63
|
+
easy-data-loader run_all
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
## ✔️ Generic concepts
|
|
67
|
+
|
|
68
|
+
`easy_data_loader` uses `resources` as a way to define a file or a database. The resources can represent either a source or a destination, making possible the following ETL scenarios: file -> file, file -> database, database -> file, database -> database.
|
|
69
|
+
|
|
70
|
+
The `easy_data_loader` project initializer will create the predefined folder structure `/config/resources`, where the resources are expected to be defined following the current convention: the file type is `.env` and the file name must be prefixed with the resource type, `file_` or `database_`. The predefined folder structure together with the naming convention enables `easy_data_loader` to find and load all resources.
|
|
71
|
+
|
|
72
|
+
A secondary predefined folder `/config/pipelines` will contain the pipeline definition files, which are regular Python files. There are 3 types of pipelines that can be defined:
|
|
73
|
+
- `LoadPipeline` the main pipeline type which transports data from source to destination
|
|
74
|
+
- `ProcedurePipeline` a pipeline dedicated for executing stored procedures inside a database
|
|
75
|
+
- `OrchestratorPipeline` a pipeline that can execute a group of pipelines sequentially
|
|
76
|
+
|
|
77
|
+
## LoadPipeline
|
|
78
|
+
|
|
79
|
+
In order to define a `LoadPipeline` we must use the `BasePipelineDefinition` from `easy_data_loader` as depicted in the example pipelines created by the initializer.
|
|
80
|
+
In the simplest form there are only a few mandatory parameters:
|
|
81
|
+
- `pipeline_name : str` - this name will be used to execute the pipeline
|
|
82
|
+
- `source : str` - the file name (without extension) corresponding to the resource to use as the data source
|
|
83
|
+
- `destination : str` - the file name (without extension) corresponding to the resource to use as the data destination
|
|
84
|
+
|
|
85
|
+
If either the source or the destination is a database, then additional parameters become mandatory:
|
|
86
|
+
- `source_sql : str` - can be a table name or a specific query in the SQL dialect of the source database flavor
|
|
87
|
+
- `destination_table : str` - table name where the data will be inserted
|
|
88
|
+
|
|
89
|
+
There are many other aspects of the pipeline that can be defined:
|
|
90
|
+
- `audit : str` - the pipeline has built-in audit functionality: it records certain information in a SQLite database after the pipeline completes. If the user desires, the same information can be recorded in a database `resource`
|
|
91
|
+
- `validator: Pydantic BaseModel` - the data read from the source `resource` can be validated using an arbitrarily defined Pydantic model before it is written to the destination
|
|
92
|
+
- `columns : Dict[str, ColumnDefinition]` - this parameter is used for strict control over how the data is written to the destination; it has the dual purpose of renaming the columns and explicitly defining the data types (mainly for inserting into a database table); the `ColumnDefinition` is constructed with an optional `target_name: str` for renaming columns and / or a `data_type : SQLAlchemy Type`, thus controlling column data types, lengths, precision, etc.
|
|
93
|
+
- `read_parameters : Dict[str, Any]` and `write_parameters : Dict[str, Any]` - these parameters control how the data is read from the source or written to the destination and provide an easy way to use special delimiters for files, drop and recreate the database table, etc. `easy_data_loader` uses pandas as the transport layer, therefore the read and write parameters will be passed to the corresponding read and write functions supported by pandas.
|
|
94
|
+
- the pipeline has a set of predefined hooks allowing the execution of functions at specific moments during the execution: `file_pre_process : Callable` - executed before the file is read into the pandas DataFrame (e.g. unzip the file); `transform : Callable` - perform data transformation over the data already in the pandas DataFrame (requires pandas methods); `file_post_process : Callable` - after the pipeline completes and the data is written to the destination perform post processing on the source file (e.g. move the file to another folder)
|
|
95
|
+
|
|
96
|
+
## ProcedurePipeline
|
|
97
|
+
|
|
98
|
+
This secondary pipeline type is responsible for executing one or more stored procedures inside a database.
|
|
99
|
+
To define one we need to use the `ProcedureDefinition` with the following parameters:
|
|
100
|
+
- `pipeline_name : str` - this name will be used to execute the pipeline
|
|
101
|
+
- `audit : str, optional` - database resource name where the audit info will be recorded
|
|
102
|
+
- `resource : str` - database resource name where the stored procedure(s) will be executed
|
|
103
|
+
- `procedures : List[Tuple[str, Optional[Dict[str, Any]]]]` - list of one or more stored procedures along with optional procedure parameters as dictionaries
|
|
104
|
+
|
|
105
|
+
## OrchestratorPipeline
|
|
106
|
+
|
|
107
|
+
This pipeline type is responsible for sequentially executing a set of pipelines, `LoadPipeline`s and / or `ProcedurePipeline`s. It is defined using the `OrchestratorDefinition` with:
|
|
108
|
+
- `orchestrator_name : str` - name by which the orchestrator is executed
|
|
109
|
+
- `pipelines : List[str]` - list of pipelines to execute sequentially
|
|
110
|
+
- `fail_fast : bool, Default True` - if any of the pipelines fail the rest of the pipelines in the list do not get executed
|
easy_data_loader-0.1.1/PKG-INFO
DELETED
|
@@ -1,81 +0,0 @@
|
|
|
1
|
-
Metadata-Version: 2.4
|
|
2
|
-
Name: easy_data_loader
|
|
3
|
-
Version: 0.1.1
|
|
4
|
-
Summary: Data transfer utilities between files and databases
|
|
5
|
-
Author-email: Bojoi Gabriel <bojoigabriel@gmail.com>
|
|
6
|
-
Requires-Python: >=3.13
|
|
7
|
-
Description-Content-Type: text/markdown
|
|
8
|
-
License-File: LICENSE
|
|
9
|
-
Requires-Dist: click>=8.3.0
|
|
10
|
-
Requires-Dist: openpyxl>=3.1.5
|
|
11
|
-
Requires-Dist: pandas>=2.3.3
|
|
12
|
-
Requires-Dist: pyarrow>=22.0.0
|
|
13
|
-
Requires-Dist: pydantic>=2.12.5
|
|
14
|
-
Requires-Dist: pydantic-settings>=2.12.0
|
|
15
|
-
Requires-Dist: pyodbc>=5.2.0
|
|
16
|
-
Requires-Dist: python-dotenv>=1.1.1
|
|
17
|
-
Requires-Dist: sqlalchemy>=2.0.43
|
|
18
|
-
Dynamic: license-file
|
|
19
|
-
|
|
20
|
-
# Easy Data Loader 🚀
|
|
21
|
-
|
|
22
|
-
**Easy Data Loader** is a flexible, modular Python library designed to streamline ETL (Extract, Transform, Load) processes between various data sources (CSV, Excel, Parquet) and SQL databases (MSSQL, PostgreSQL, and others).
|
|
23
|
-
|
|
24
|
-
## ✨ Key Features
|
|
25
|
-
- **Declarative Configuration**: Manage connections and pipelines through simple python files and `.env` resources.
|
|
26
|
-
- **Integrated CLI**: Initialize a standardized project structure with a single command.
|
|
27
|
-
- **Custom Transformation Hooks**: Inject your own Pandas transformation logic directly into the pipeline execution.
|
|
28
|
-
- **Performance Optimized**: Built-in support for chunked loading and writing to handle large datasets efficiently.
|
|
29
|
-
- **Extensible Architecture**: Uses a Factory Pattern for database connectors, making it easy to support new drivers.
|
|
30
|
-
|
|
31
|
-
---
|
|
32
|
-
|
|
33
|
-
## 📦 Installation
|
|
34
|
-
|
|
35
|
-
Install directly via `pip` or `uv`:
|
|
36
|
-
|
|
37
|
-
```bash
|
|
38
|
-
pip install easy_data_loader
|
|
39
|
-
```
|
|
40
|
-
|
|
41
|
-
## 🚀 Getting Started
|
|
42
|
-
|
|
43
|
-
1. Initialize a new project structure to generate template configurations:
|
|
44
|
-
```bash
|
|
45
|
-
easy-loader init
|
|
46
|
-
```
|
|
47
|
-
2. Review the generated `config/` folders for sample resources and pipelines.
|
|
48
|
-
3. Run all discovered pipelines across the active configurations:
|
|
49
|
-
```bash
|
|
50
|
-
easy-loader run_all
|
|
51
|
-
```
|
|
52
|
-
|
|
53
|
-
## 🏗️ Architecture & Process Flow
|
|
54
|
-
|
|
55
|
-
The system is designed with modularity in mind. Object dependencies and their standard instantiation lifecycle are executed through the `Configuration` singleton and the `CONNECTOR_FACTORY`.
|
|
56
|
-
|
|
57
|
-
```mermaid
|
|
58
|
-
graph TD
|
|
59
|
-
%% Main Execution Flow
|
|
60
|
-
CLI[CLI Application] -->|Instantiates| Pipeline[Pipeline: Load, Procedure, Orchestrator]
|
|
61
|
-
|
|
62
|
-
%% Configuration and Definitions
|
|
63
|
-
Pipeline -->|Requests Definition & Resources| Config[Configuration Singleton]
|
|
64
|
-
Config -->|Reads| PipelineDef[Pipeline Definitions .py]
|
|
65
|
-
Config -->|Reads| ResourcesEnv[Resource Configs .env]
|
|
66
|
-
|
|
67
|
-
%% Instantiation of Operations
|
|
68
|
-
Pipeline -->|Uses ResourceConfig| Factory[CONNECTOR_FACTORY]
|
|
69
|
-
Factory -->|Creates| DBConn[DatabaseConnector: SqlServer, SQLite]
|
|
70
|
-
DBConn -->|Provides Engine to| DBOps[DatabaseOperations]
|
|
71
|
-
|
|
72
|
-
Pipeline -->|Uses FileSettings| FileOps[FileOperations]
|
|
73
|
-
|
|
74
|
-
%% Pipeline dependencies
|
|
75
|
-
Pipeline -->|Contains| DBOps
|
|
76
|
-
Pipeline -->|Contains| FileOps
|
|
77
|
-
|
|
78
|
-
%% Audit
|
|
79
|
-
Pipeline -->|Uses| AuditOps[Audit DatabaseOperations]
|
|
80
|
-
```
|
|
81
|
-
|
easy_data_loader-0.1.1/README.md
DELETED
|
@@ -1,62 +0,0 @@
|
|
|
1
|
-
# Easy Data Loader 🚀
|
|
2
|
-
|
|
3
|
-
**Easy Data Loader** is a flexible, modular Python library designed to streamline ETL (Extract, Transform, Load) processes between various data sources (CSV, Excel, Parquet) and SQL databases (MSSQL, PostgreSQL, and others).
|
|
4
|
-
|
|
5
|
-
## ✨ Key Features
|
|
6
|
-
- **Declarative Configuration**: Manage connections and pipelines through simple Python files and `.env` resources.
|
|
7
|
-
- **Integrated CLI**: Initialize a standardized project structure with a single command.
|
|
8
|
-
- **Custom Transformation Hooks**: Inject your own Pandas transformation logic directly into the pipeline execution.
|
|
9
|
-
- **Performance Optimized**: Built-in support for chunked loading and writing to handle large datasets efficiently.
|
|
10
|
-
- **Extensible Architecture**: Uses a Factory Pattern for database connectors, making it easy to support new drivers.
|
|
11
|
-
|
|
12
|
-
---
|
|
13
|
-
|
|
14
|
-
## 📦 Installation
|
|
15
|
-
|
|
16
|
-
Install directly via `pip` or `uv`:
|
|
17
|
-
|
|
18
|
-
```bash
|
|
19
|
-
pip install easy_data_loader
|
|
20
|
-
```
|
|
21
|
-
|
|
22
|
-
## 🚀 Getting Started
|
|
23
|
-
|
|
24
|
-
1. Initialize a new project structure to generate template configurations:
|
|
25
|
-
```bash
|
|
26
|
-
easy-loader init
|
|
27
|
-
```
|
|
28
|
-
2. Review the generated `config/` folders for sample resources and pipelines.
|
|
29
|
-
3. Run all discovered pipelines across the active configurations:
|
|
30
|
-
```bash
|
|
31
|
-
easy-loader run_all
|
|
32
|
-
```
|
|
33
|
-
|
|
34
|
-
## 🏗️ Architecture & Process Flow
|
|
35
|
-
|
|
36
|
-
The system is designed with modularity in mind. Object dependencies and their standard instantiation lifecycle are executed through the `Configuration` singleton and the `CONNECTOR_FACTORY`.
|
|
37
|
-
|
|
38
|
-
```mermaid
|
|
39
|
-
graph TD
|
|
40
|
-
%% Main Execution Flow
|
|
41
|
-
CLI[CLI Application] -->|Instantiates| Pipeline[Pipeline: Load, Procedure, Orchestrator]
|
|
42
|
-
|
|
43
|
-
%% Configuration and Definitions
|
|
44
|
-
Pipeline -->|Requests Definition & Resources| Config[Configuration Singleton]
|
|
45
|
-
Config -->|Reads| PipelineDef[Pipeline Definitions .py]
|
|
46
|
-
Config -->|Reads| ResourcesEnv[Resource Configs .env]
|
|
47
|
-
|
|
48
|
-
%% Instantiation of Operations
|
|
49
|
-
Pipeline -->|Uses ResourceConfig| Factory[CONNECTOR_FACTORY]
|
|
50
|
-
Factory -->|Creates| DBConn[DatabaseConnector: SqlServer, SQLite]
|
|
51
|
-
DBConn -->|Provides Engine to| DBOps[DatabaseOperations]
|
|
52
|
-
|
|
53
|
-
Pipeline -->|Uses FileSettings| FileOps[FileOperations]
|
|
54
|
-
|
|
55
|
-
%% Pipeline dependencies
|
|
56
|
-
Pipeline -->|Contains| DBOps
|
|
57
|
-
Pipeline -->|Contains| FileOps
|
|
58
|
-
|
|
59
|
-
%% Audit
|
|
60
|
-
Pipeline -->|Uses| AuditOps[Audit DatabaseOperations]
|
|
61
|
-
```
|
|
62
|
-
|
|
@@ -1,81 +0,0 @@
|
|
|
1
|
-
Metadata-Version: 2.4
|
|
2
|
-
Name: easy_data_loader
|
|
3
|
-
Version: 0.1.1
|
|
4
|
-
Summary: Data transfer utilities between files and databases
|
|
5
|
-
Author-email: Bojoi Gabriel <bojoigabriel@gmail.com>
|
|
6
|
-
Requires-Python: >=3.13
|
|
7
|
-
Description-Content-Type: text/markdown
|
|
8
|
-
License-File: LICENSE
|
|
9
|
-
Requires-Dist: click>=8.3.0
|
|
10
|
-
Requires-Dist: openpyxl>=3.1.5
|
|
11
|
-
Requires-Dist: pandas>=2.3.3
|
|
12
|
-
Requires-Dist: pyarrow>=22.0.0
|
|
13
|
-
Requires-Dist: pydantic>=2.12.5
|
|
14
|
-
Requires-Dist: pydantic-settings>=2.12.0
|
|
15
|
-
Requires-Dist: pyodbc>=5.2.0
|
|
16
|
-
Requires-Dist: python-dotenv>=1.1.1
|
|
17
|
-
Requires-Dist: sqlalchemy>=2.0.43
|
|
18
|
-
Dynamic: license-file
|
|
19
|
-
|
|
20
|
-
# Easy Data Loader 🚀
|
|
21
|
-
|
|
22
|
-
**Easy Data Loader** is a flexible, modular Python library designed to streamline ETL (Extract, Transform, Load) processes between various data sources (CSV, Excel, Parquet) and SQL databases (MSSQL, PostgreSQL, and others).
|
|
23
|
-
|
|
24
|
-
## ✨ Key Features
|
|
25
|
-
- **Declarative Configuration**: Manage connections and pipelines through simple Python files and `.env` resources.
|
|
26
|
-
- **Integrated CLI**: Initialize a standardized project structure with a single command.
|
|
27
|
-
- **Custom Transformation Hooks**: Inject your own Pandas transformation logic directly into the pipeline execution.
|
|
28
|
-
- **Performance Optimized**: Built-in support for chunked loading and writing to handle large datasets efficiently.
|
|
29
|
-
- **Extensible Architecture**: Uses a Factory Pattern for database connectors, making it easy to support new drivers.
|
|
30
|
-
|
|
31
|
-
---
|
|
32
|
-
|
|
33
|
-
## 📦 Installation
|
|
34
|
-
|
|
35
|
-
Install directly via `pip` or `uv`:
|
|
36
|
-
|
|
37
|
-
```bash
|
|
38
|
-
pip install easy_data_loader
|
|
39
|
-
```
|
|
40
|
-
|
|
41
|
-
## 🚀 Getting Started
|
|
42
|
-
|
|
43
|
-
1. Initialize a new project structure to generate template configurations:
|
|
44
|
-
```bash
|
|
45
|
-
easy-loader init
|
|
46
|
-
```
|
|
47
|
-
2. Review the generated `config/` folders for sample resources and pipelines.
|
|
48
|
-
3. Run all discovered pipelines across the active configurations:
|
|
49
|
-
```bash
|
|
50
|
-
easy-loader run_all
|
|
51
|
-
```
|
|
52
|
-
|
|
53
|
-
## 🏗️ Architecture & Process Flow
|
|
54
|
-
|
|
55
|
-
The system is designed with modularity in mind. Object dependencies and their standard instantiation lifecycle are executed through the `Configuration` singleton and the `CONNECTOR_FACTORY`.
|
|
56
|
-
|
|
57
|
-
```mermaid
|
|
58
|
-
graph TD
|
|
59
|
-
%% Main Execution Flow
|
|
60
|
-
CLI[CLI Application] -->|Instantiates| Pipeline[Pipeline: Load, Procedure, Orchestrator]
|
|
61
|
-
|
|
62
|
-
%% Configuration and Definitions
|
|
63
|
-
Pipeline -->|Requests Definition & Resources| Config[Configuration Singleton]
|
|
64
|
-
Config -->|Reads| PipelineDef[Pipeline Definitions .py]
|
|
65
|
-
Config -->|Reads| ResourcesEnv[Resource Configs .env]
|
|
66
|
-
|
|
67
|
-
%% Instantiation of Operations
|
|
68
|
-
Pipeline -->|Uses ResourceConfig| Factory[CONNECTOR_FACTORY]
|
|
69
|
-
Factory -->|Creates| DBConn[DatabaseConnector: SqlServer, SQLite]
|
|
70
|
-
DBConn -->|Provides Engine to| DBOps[DatabaseOperations]
|
|
71
|
-
|
|
72
|
-
Pipeline -->|Uses FileSettings| FileOps[FileOperations]
|
|
73
|
-
|
|
74
|
-
%% Pipeline dependencies
|
|
75
|
-
Pipeline -->|Contains| DBOps
|
|
76
|
-
Pipeline -->|Contains| FileOps
|
|
77
|
-
|
|
78
|
-
%% Audit
|
|
79
|
-
Pipeline -->|Uses| AuditOps[Audit DatabaseOperations]
|
|
80
|
-
```
|
|
81
|
-
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
{easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/database_connector.py
RENAMED
|
File without changes
|
{easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/database_operations.py
RENAMED
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
{easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/procedure_pipeline.py
RENAMED
|
File without changes
|
|
File without changes
|
{easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader.egg-info/dependency_links.txt
RENAMED
|
File without changes
|
{easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader.egg-info/requires.txt
RENAMED
|
File without changes
|
{easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader.egg-info/top_level.txt
RENAMED
|
File without changes
|
|
File without changes
|
|
File without changes
|