easy-data-loader 0.1.1__tar.gz → 0.1.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (32)
  1. easy_data_loader-0.1.2/PKG-INFO +110 -0
  2. easy_data_loader-0.1.2/README.md +84 -0
  3. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/pyproject.toml +11 -2
  4. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/cli.py +14 -10
  5. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/models.py +8 -1
  6. easy_data_loader-0.1.2/src/easy_data_loader.egg-info/PKG-INFO +110 -0
  7. easy_data_loader-0.1.2/src/easy_data_loader.egg-info/entry_points.txt +2 -0
  8. easy_data_loader-0.1.1/PKG-INFO +0 -81
  9. easy_data_loader-0.1.1/README.md +0 -62
  10. easy_data_loader-0.1.1/src/easy_data_loader.egg-info/PKG-INFO +0 -81
  11. easy_data_loader-0.1.1/src/easy_data_loader.egg-info/entry_points.txt +0 -2
  12. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/LICENSE +0 -0
  13. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/setup.cfg +0 -0
  14. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/__init__.py +0 -0
  15. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/config_loader.py +0 -0
  16. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/custom_exceptions.py +0 -0
  17. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/data_inferrence.py +0 -0
  18. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/database_connector.py +0 -0
  19. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/database_operations.py +0 -0
  20. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/driver_detector.py +0 -0
  21. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/file_operations.py +0 -0
  22. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/log.py +0 -0
  23. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/orchestrator.py +0 -0
  24. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/pipeline.py +0 -0
  25. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/pipeline_base.py +0 -0
  26. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader/procedure_pipeline.py +0 -0
  27. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader.egg-info/SOURCES.txt +0 -0
  28. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader.egg-info/dependency_links.txt +0 -0
  29. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader.egg-info/requires.txt +0 -0
  30. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/src/easy_data_loader.egg-info/top_level.txt +0 -0
  31. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/tests/test_data_inference.py +0 -0
  32. {easy_data_loader-0.1.1 → easy_data_loader-0.1.2}/tests/test_imports.py +0 -0
@@ -0,0 +1,110 @@
+ Metadata-Version: 2.4
+ Name: easy_data_loader
+ Version: 0.1.2
+ Summary: Data transfer utilities between files and databases
+ Author-email: Bojoi Gabriel <bojoigabriel@gmail.com>
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Intended Audience :: Developers
+ Classifier: Topic :: Database
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Operating System :: OS Independent
+ Requires-Python: >=3.13
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: click>=8.3.0
+ Requires-Dist: openpyxl>=3.1.5
+ Requires-Dist: pandas>=2.3.3
+ Requires-Dist: pyarrow>=22.0.0
+ Requires-Dist: pydantic>=2.12.5
+ Requires-Dist: pydantic-settings>=2.12.0
+ Requires-Dist: pyodbc>=5.2.0
+ Requires-Dist: python-dotenv>=1.1.1
+ Requires-Dist: sqlalchemy>=2.0.43
+ Dynamic: license-file
+
+ # Easy Data Loader 🚀
+
+
+ [![PyPI version](https://badge.fury.io/py/easy-data-loader.svg)](https://badge.fury.io/py/easy-data-loader)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+ ![Downloads](https://static.pepy.tech/badge/easy-data-loader)
+
+ **Easy Data Loader** is a flexible, modular Python library designed to streamline ETL (Extract, Transform, Load) processes between various file data sources (csv, xlsx, parquet, orc) and databases (MSSQL, PostgreSQL and others).
+
+ ## ✨ Key Features
+ - **Declarative Configuration**: Manage connections and pipelines through simple Python files and `.env` resources.
+ - **Integrated CLI**: Initialize a standardized project structure with a single command.
+ - **Custom Transformation Hooks**: Inject your own Pandas transformation logic directly into the pipeline execution.
+ - **Performance Optimized**: Built-in support for chunked loading and writing to handle large datasets efficiently.
+ - **Extensible Architecture**: Uses a Factory Pattern for database connectors, making it easy to support new drivers.
+
+ ---
+
+ ## 📦 Installation
+
+ Install directly via `pip` or `uv`:
+
+ ```bash
+ pip install easy_data_loader
+ uv add easy_data_loader
+ ```
+
+ ## 🚀 Getting Started
+
+ 1. Initialize a new project structure to generate template configurations:
+ ```bash
+ easy-data-loader init
+ ```
+ 2. Review the generated `config/` folders for sample resources and pipelines.
+ 3. Run all discovered pipelines across the active configurations:
+ ```bash
+ easy-data-loader run_all
+ ```
+
+ ## ✔️ Generic concepts
+
+ `easy_data_loader` uses `resources` as a way to define a file or a database. A resource can represent either a source or a destination, making possible the following ETL scenarios: file -> file, file -> database, database -> file, database -> database.
+
+ The `easy_data_loader` project initializer will create the predefined folder structure `/config/resources`, where the resources are expected to be defined following the current convention: the file type is `.env` and the file name must be prefixed with the resource type, `file_` or `database_`. The predefined folder structure together with the naming convention enables `easy_data_loader` to find and load all resources.
+
+ A secondary predefined folder `/config/pipelines` will contain the pipeline definition files, which are regular Python files. There are 3 types of pipelines that can be defined:
+ - `LoadPipeline`: the main pipeline type, which transports data from source to destination
+ - `ProcedurePipeline`: a pipeline dedicated to executing stored procedures inside a database
+ - `OrchestratorPipeline`: a pipeline that can execute a group of pipelines sequentially
+
+ ## LoadPipeline
+
+ In order to define a `LoadPipeline` we must use the `BasePipelineDefinition` from `easy_data_loader`, as depicted in the example pipelines created by the initializer.
+ In the simplest form there are only a few mandatory parameters:
+ - `pipeline_name : str` - this name will be used to execute the pipeline
+ - `source : str` - the file name (without extension) corresponding to the resource to use as the data source
+ - `destination : str` - the file name (without extension) corresponding to the resource to use as the data destination
+
+ If either the source or the destination is a database, then additional parameters become mandatory:
+ - `source_sql : str` - can be a table name or a specific query in the SQL dialect of the source database
+ - `destination_table : str` - table name where the data will be inserted
+
+ There are many other aspects of the pipeline that can be defined:
+ - `audit : str` - the pipeline has built-in audit functionality: after the pipeline completes, it records certain information in a SQLite database. If desired, the same information can be recorded in a database `resource`
+ - `validator : Pydantic BaseModel` - the data read from the source `resource` can be validated using an arbitrarily defined Pydantic model before it is written to the destination
+ - `columns : Dict[str, ColumnDefinition]` - this parameter gives strict control over how the data is written to the destination; it serves the dual purpose of renaming columns and explicitly defining data types (mainly for inserting into a database table); a `ColumnDefinition` is constructed with an optional `target_name : str` for renaming columns and / or a `data_type : SQLAlchemy Type`, thus controlling column data types, lengths, precision, etc.
+ - `read_parameters : Dict[str, Any]` and `write_parameters : Dict[str, Any]` - these parameters control how the data is read from the source and written to the destination, and provide an easy way to use special delimiters for files, drop and recreate the database table, etc. `easy_data_loader` uses pandas as the transport layer, so the read and write parameters are passed to the corresponding read and write functions supported by pandas.
+ - the pipeline has a set of predefined hooks that allow functions to be executed at specific moments during the run: `file_pre_process : Callable` - executed before the file is read into the pandas DataFrame (e.g. unzip the file); `transform : Callable` - transforms the data already in the pandas DataFrame (requires pandas methods); `file_post_process : Callable` - executed after the pipeline completes and the data is written to the destination, to post-process the source file (e.g. move the file to another folder)
+
+ ## ProcedurePipeline
+
+ This secondary pipeline type is responsible for executing one or more stored procedures inside a database.
+ To define one we need to use the `ProcedureDefinition` with the following parameters:
+ - `pipeline_name : str` - this name will be used to execute the pipeline
+ - `audit : str, optional` - database resource name where the audit info will be recorded
+ - `resource : str` - database resource name where the stored procedure(s) will be executed
+ - `procedures : List[Tuple[str, Optional[Dict[str, Any]]]]` - list of one or more stored procedures, each with optional procedure parameters as a dictionary
+
+ ## OrchestratorPipeline
+
+ This pipeline type is responsible for sequentially executing a set of pipelines, `LoadPipeline`s and / or `ProcedurePipeline`s. It is defined using the `OrchestratorDefinition` with:
+ - `orchestrator_name : str` - name by which the orchestrator is executed
+ - `pipelines : List[str]` - list of pipelines to execute sequentially
+ - `fail_fast : bool, default True` - if any of the pipelines fails, the rest of the pipelines in the list are not executed
@@ -0,0 +1,84 @@
+ # Easy Data Loader 🚀
+
+
+ [![PyPI version](https://badge.fury.io/py/easy-data-loader.svg)](https://badge.fury.io/py/easy-data-loader)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+ ![Downloads](https://static.pepy.tech/badge/easy-data-loader)
+
+ **Easy Data Loader** is a flexible, modular Python library designed to streamline ETL (Extract, Transform, Load) processes between various file data sources (csv, xlsx, parquet, orc) and databases (MSSQL, PostgreSQL and others).
+
+ ## ✨ Key Features
+ - **Declarative Configuration**: Manage connections and pipelines through simple Python files and `.env` resources.
+ - **Integrated CLI**: Initialize a standardized project structure with a single command.
+ - **Custom Transformation Hooks**: Inject your own Pandas transformation logic directly into the pipeline execution.
+ - **Performance Optimized**: Built-in support for chunked loading and writing to handle large datasets efficiently.
+ - **Extensible Architecture**: Uses a Factory Pattern for database connectors, making it easy to support new drivers.
+
+ ---
+
+ ## 📦 Installation
+
+ Install directly via `pip` or `uv`:
+
+ ```bash
+ pip install easy_data_loader
+ uv add easy_data_loader
+ ```
+
+ ## 🚀 Getting Started
+
+ 1. Initialize a new project structure to generate template configurations:
+ ```bash
+ easy-data-loader init
+ ```
+ 2. Review the generated `config/` folders for sample resources and pipelines.
+ 3. Run all discovered pipelines across the active configurations:
+ ```bash
+ easy-data-loader run_all
+ ```
+
+ ## ✔️ Generic concepts
+
+ `easy_data_loader` uses `resources` as a way to define a file or a database. A resource can represent either a source or a destination, making possible the following ETL scenarios: file -> file, file -> database, database -> file, database -> database.
+
+ The `easy_data_loader` project initializer will create the predefined folder structure `/config/resources`, where the resources are expected to be defined following the current convention: the file type is `.env` and the file name must be prefixed with the resource type, `file_` or `database_`. The predefined folder structure together with the naming convention enables `easy_data_loader` to find and load all resources.
+
+ A secondary predefined folder `/config/pipelines` will contain the pipeline definition files, which are regular Python files. There are 3 types of pipelines that can be defined:
+ - `LoadPipeline`: the main pipeline type, which transports data from source to destination
+ - `ProcedurePipeline`: a pipeline dedicated to executing stored procedures inside a database
+ - `OrchestratorPipeline`: a pipeline that can execute a group of pipelines sequentially
+
+ ## LoadPipeline
+
+ In order to define a `LoadPipeline` we must use the `BasePipelineDefinition` from `easy_data_loader`, as depicted in the example pipelines created by the initializer.
+ In the simplest form there are only a few mandatory parameters:
+ - `pipeline_name : str` - this name will be used to execute the pipeline
+ - `source : str` - the file name (without extension) corresponding to the resource to use as the data source
+ - `destination : str` - the file name (without extension) corresponding to the resource to use as the data destination
+
+ If either the source or the destination is a database, then additional parameters become mandatory:
+ - `source_sql : str` - can be a table name or a specific query in the SQL dialect of the source database
+ - `destination_table : str` - table name where the data will be inserted
+
+ There are many other aspects of the pipeline that can be defined:
+ - `audit : str` - the pipeline has built-in audit functionality: after the pipeline completes, it records certain information in a SQLite database. If desired, the same information can be recorded in a database `resource`
+ - `validator : Pydantic BaseModel` - the data read from the source `resource` can be validated using an arbitrarily defined Pydantic model before it is written to the destination
+ - `columns : Dict[str, ColumnDefinition]` - this parameter gives strict control over how the data is written to the destination; it serves the dual purpose of renaming columns and explicitly defining data types (mainly for inserting into a database table); a `ColumnDefinition` is constructed with an optional `target_name : str` for renaming columns and / or a `data_type : SQLAlchemy Type`, thus controlling column data types, lengths, precision, etc.
+ - `read_parameters : Dict[str, Any]` and `write_parameters : Dict[str, Any]` - these parameters control how the data is read from the source and written to the destination, and provide an easy way to use special delimiters for files, drop and recreate the database table, etc. `easy_data_loader` uses pandas as the transport layer, so the read and write parameters are passed to the corresponding read and write functions supported by pandas.
+ - the pipeline has a set of predefined hooks that allow functions to be executed at specific moments during the run: `file_pre_process : Callable` - executed before the file is read into the pandas DataFrame (e.g. unzip the file); `transform : Callable` - transforms the data already in the pandas DataFrame (requires pandas methods); `file_post_process : Callable` - executed after the pipeline completes and the data is written to the destination, to post-process the source file (e.g. move the file to another folder)
+
+ ## ProcedurePipeline
+
+ This secondary pipeline type is responsible for executing one or more stored procedures inside a database.
+ To define one we need to use the `ProcedureDefinition` with the following parameters:
+ - `pipeline_name : str` - this name will be used to execute the pipeline
+ - `audit : str, optional` - database resource name where the audit info will be recorded
+ - `resource : str` - database resource name where the stored procedure(s) will be executed
+ - `procedures : List[Tuple[str, Optional[Dict[str, Any]]]]` - list of one or more stored procedures, each with optional procedure parameters as a dictionary
+
+ ## OrchestratorPipeline
+
+ This pipeline type is responsible for sequentially executing a set of pipelines, `LoadPipeline`s and / or `ProcedurePipeline`s. It is defined using the `OrchestratorDefinition` with:
+ - `orchestrator_name : str` - name by which the orchestrator is executed
+ - `pipelines : List[str]` - list of pipelines to execute sequentially
+ - `fail_fast : bool, default True` - if any of the pipelines fails, the rest of the pipelines in the list are not executed
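The resource naming convention from the "Generic concepts" section of the README (`.env` files prefixed with `file_` or `database_` inside `config/resources`) could be sketched as follows. This is a minimal illustration under stated assumptions, not the package's loader: `discover_resources` and `RESOURCE_PREFIXES` are hypothetical names, and the sketch only classifies file names without reading their contents.

```python
from pathlib import Path
from typing import Dict, List

# Prefixes named in the README; the tuple itself is a hypothetical helper.
RESOURCE_PREFIXES = ("file_", "database_")


def discover_resources(resources_dir: Path) -> Dict[str, List[str]]:
    """Group `.env` files by resource type, based on their file-name prefix."""
    found: Dict[str, List[str]] = {"file": [], "database": []}
    for path in sorted(resources_dir.glob("*.env")):
        for prefix in RESOURCE_PREFIXES:
            if path.name.startswith(prefix):
                # The resource name is the file name without the extension,
                # which is how pipelines refer to `source` and `destination`.
                found[prefix.rstrip("_")].append(path.stem)
    return found


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        resources = Path(tmp)
        for name in ("file_sales.env", "database_dwh.env", "notes.txt"):
            (resources / name).write_text("")
        print(discover_resources(resources))
```

Files that do not follow the convention are simply ignored in this sketch; whether the package warns about them is not shown in the diff.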
@@ -1,6 +1,6 @@
  [project]
  name = "easy_data_loader"
- version = "0.1.1"
+ version = "0.1.2"
  description = "Data transfer utilities between files and databases"
  authors = [{ name = "Bojoi Gabriel", email = "bojoigabriel@gmail.com" }]
  readme = "README.md"
@@ -16,12 +16,21 @@ dependencies = [
      "python-dotenv>=1.1.1",
      "sqlalchemy>=2.0.43",
  ]
+ classifiers = [
+     "Development Status :: 3 - Alpha",
+     "Intended Audience :: Developers",
+     "Topic :: Database",
+     "Topic :: Scientific/Engineering :: Information Analysis",
+     "License :: OSI Approved :: MIT License",
+     "Programming Language :: Python :: 3.13",
+     "Operating System :: OS Independent",
+ ]
  
  [dependency-groups]
  dev = ["ipykernel>=7.1.0", "pytest>=8.4.2", "ruff", "mypy", "pre-commit"]
  
  [project.scripts]
- easy-loader = "easy_data_loader.cli:main"
+ easy-data-loader = "easy_data_loader.cli:main"
  
  [tool.setuptools.packages.find]
  where = ["src"]
@@ -89,26 +89,30 @@ CONN_PORT=1433
  
  FILE_ENV = """
  # file resource definition
- FILE_TYPE=CSV
- FOLDER_PATH=./data/imports
- FILE_NAME=large_sales_data
+ FILE_TYPE=CSV # can also be XLSX, PARQUET, ORC
+ FOLDER_PATH=./data/imports # source folder where the file is located
+ FILE_NAME=large_sales_data # exact file name without extension
+ #FILE_PATTERN=large_sales # file pattern to search in the source folder
  """
  
  MAIN = """
- from easy_data_loader.pipeline import LoadPipeline
+ from easy_data_loader import LoadPipeline, ProcedurePipeline
+
+ def main():
+     # Run an ETL pipeline
+     LoadPipeline(pipeline_name="example_pipeline").run()
  
- # Run an ETL pipeline
- LoadPipeline(pipeline_name="example_pipeline").run()
+     # Run a procedure pipeline
+     ProcedurePipeline(pipeline_name="example_procedure").run()
  
- # Run a procedure pipeline
- # from easy_data_loader.procedure_pipeline import ProcedurePipeline
- # ProcedurePipeline(pipeline_name="example_procedure").run()
+ if __name__ == "__main__":
+     main()
  """
  
  
  @click.group()
  def main():
-     """Easy Data Loader CLI - ETL instrument between files and databases"""
+     """Easy Data Loader CLI - ETL instrument for files and databases"""
      pass
  
  
@@ -117,7 +117,7 @@ class ColumnDefinition(BaseModel):
  
  
  class BasePipelineDefinition(BaseModel):
-     """Base pipeline definition. Used for user defined pipelines but also autogenerated"""
+     """Base pipeline definition"""
  
      model_config = ConfigDict(arbitrary_types_allowed=True, extra="allow")
      pipeline_name: str = "generic_pipeline"
@@ -149,6 +149,13 @@ class BasePipelineDefinition(BaseModel):
          """
          return df
  
+     def file_post_process(self, file_path: Path) -> Path:
+         """
+         Hook for postprocessing the file after the data is written.
+         Ex: archive file, etc.
+         """
+         return file_path
+
      def get_sql_columns(self) -> Dict[str, ColumnDefinition]:
          """
          Return the sql column definition from the pipeline.
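The new `file_post_process(self, file_path: Path) -> Path` hook returns the path unchanged by default. As a sketch of what an override might delegate to, the standalone helper below moves a processed file into an archive folder and returns its new location. `archive_file` and the `archive` folder name are assumptions made for illustration; how the package invokes the hook and what it does with the returned path is not shown in this diff.

```python
import shutil
from pathlib import Path


def archive_file(file_path: Path, archive_dir_name: str = "archive") -> Path:
    """Move a processed file into a sibling archive folder.

    Hypothetical helper: returns the new path, mirroring the
    Path-in / Path-out shape of the hook added in this release.
    """
    archive_dir = file_path.parent / archive_dir_name
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / file_path.name
    shutil.move(str(file_path), str(target))
    return target


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "large_sales_data.csv"
        source.write_text("id,amount\n1,10\n")
        moved = archive_file(source)
        print(moved.name, moved.parent.name, source.exists())
```

A subclass of `BasePipelineDefinition` could call such a helper from its own `file_post_process` and return the result, assuming the hook is called once per source file after a successful write.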
@@ -0,0 +1,110 @@
(110 added lines, identical to the `easy_data_loader-0.1.2/PKG-INFO` hunk above)
@@ -0,0 +1,2 @@
+ [console_scripts]
+ easy-data-loader = easy_data_loader.cli:main
@@ -1,81 +0,0 @@
- Metadata-Version: 2.4
- Name: easy_data_loader
- Version: 0.1.1
- Summary: Data transfer utilities between files and databases
- Author-email: Bojoi Gabriel <bojoigabriel@gmail.com>
- Requires-Python: >=3.13
- Description-Content-Type: text/markdown
- License-File: LICENSE
- Requires-Dist: click>=8.3.0
- Requires-Dist: openpyxl>=3.1.5
- Requires-Dist: pandas>=2.3.3
- Requires-Dist: pyarrow>=22.0.0
- Requires-Dist: pydantic>=2.12.5
- Requires-Dist: pydantic-settings>=2.12.0
- Requires-Dist: pyodbc>=5.2.0
- Requires-Dist: python-dotenv>=1.1.1
- Requires-Dist: sqlalchemy>=2.0.43
- Dynamic: license-file
-
- # Easy Data Loader 🚀
-
- **Easy Data Loader** is a flexible, modular Python library designed to streamline ETL (Extract, Transform, Load) processes between various data sources (CSV, Excel, Parquet) and SQL databases (MSSQL, PostgreSQL, and others).
-
- ## ✨ Key Features
- - **Declarative Configuration**: Manage connections and pipelines through simple python files and `.env` resources.
- - **Integrated CLI**: Initialize a standardized project structure with a single command.
- - **Custom Transformation Hooks**: Inject your own Pandas transformation logic directly into the pipeline execution.
- - **Performance Optimized**: Built-in support for chunked loading and writing to handle large datasets efficiently.
- - **Extensible Architecture**: Uses a Factory Pattern for database connectors, making it easy to support new drivers.
-
- ---
-
- ## 📦 Installation
-
- Install directly via `pip` or `uv`:
-
- ```bash
- pip install easy_data_loader
- ```
-
- ## 🚀 Getting Started
-
- 1. Initialize a new project structure to generate template configurations:
- ```bash
- easy-loader init
- ```
- 2. Review the generated `config/` folders for sample resources and pipelines.
- 3. Run all discovered pipelines across the active configurations:
- ```bash
- easy-loader run_all
- ```
-
- ## 🏗️ Architecture & Process Flow
-
- The system is designed with modularity in mind. Object dependencies and their standard instantiation lifecycle are executed through the `Configuration` singleton and the `CONNECTOR_FACTORY`.
-
- ```mermaid
- graph TD
- %% Main Execution Flow
- CLI[CLI Application] -->|Instantiates| Pipeline[Pipeline: Load, Procedure, Orchestrator]
-
- %% Configuration and Definitions
- Pipeline -->|Requests Definition & Resources| Config[Configuration Singleton]
- Config -->|Reads| PipelineDef[Pipeline Definitions .py]
- Config -->|Reads| ResourcesEnv[Resource Configs .env]
-
- %% Instantiation of Operations
- Pipeline -->|Uses ResourceConfig| Factory[CONNECTOR_FACTORY]
- Factory -->|Creates| DBConn[DatabaseConnector: SqlServer, SQLite]
- DBConn -->|Provides Engine to| DBOps[DatabaseOperations]
-
- Pipeline -->|Uses FileSettings| FileOps[FileOperations]
-
- %% Pipeline dependencies
- Pipeline -->|Contains| DBOps
- Pipeline -->|Contains| FileOps
-
- %% Audit
- Pipeline -->|Uses| AuditOps[Audit DatabaseOperations]
- ```
-
@@ -1,62 +0,0 @@
1
- # Easy Data Loader 🚀
2
-
3
- **Easy Data Loader** is a flexible, modular Python library designed to streamline ETL (Extract, Transform, Load) processes between various data sources (CSV, Excel, Parquet) and SQL databases (MSSQL, PostgreSQL, and others).
4
-
5
- ## ✨ Key Features
6
- - **Declarative Configuration**: Manage connections and pipelines through simple python files and `.env` resources.
7
- - **Integrated CLI**: Initialize a standardized project structure with a single command.
8
- - **Custom Transformation Hooks**: Inject your own Pandas transformation logic directly into the pipeline execution.
9
- - **Performance Optimized**: Built-in support for chunked loading and writing to handle large datasets efficiently.
10
- - **Extensible Architecture**: Uses a Factory Pattern for database connectors, making it easy to support new drivers.
11
-
12
- ---
13
-
14
- ## 📦 Installation
15
-
16
- Install directly via `pip` or `uv`:
17
-
18
- ```bash
19
- pip install easy_data_loader
20
- ```
21
-
22
- ## 🚀 Getting Started
23
-
24
- 1. Initialize a new project structure to generate template configurations:
25
- ```bash
26
- easy-loader init
27
- ```
28
- 2. Review the generated `config/` folders for sample resources and pipelines.
29
- 3. Run all discovered pipelines across the active configurations:
30
- ```bash
31
- easy-loader run_all
32
- ```
33
-
34
- ## 🏗️ Architecture & Process Flow
35
-
36
- The system is designed with modularity in mind. Object dependencies and their standard instantiation lifecycle are executed through the `Configuration` singleton and the `CONNECTOR_FACTORY`.
37
-
38
- ```mermaid
39
- graph TD
40
- %% Main Execution Flow
41
- CLI[CLI Application] -->|Instantiates| Pipeline[Pipeline: Load, Procedure, Orchestrator]
42
-
43
- %% Configuration and Definitions
44
- Pipeline -->|Requests Definition & Resources| Config[Configuration Singleton]
45
- Config -->|Reads| PipelineDef[Pipeline Definitions .py]
46
- Config -->|Reads| ResourcesEnv[Resource Configs .env]
47
-
48
- %% Instantiation of Operations
49
- Pipeline -->|Uses ResourceConfig| Factory[CONNECTOR_FACTORY]
50
- Factory -->|Creates| DBConn[DatabaseConnector: SqlServer, SQLite]
51
- DBConn -->|Provides Engine to| DBOps[DatabaseOperations]
52
-
53
- Pipeline -->|Uses FileSettings| FileOps[FileOperations]
54
-
55
- %% Pipeline dependencies
56
- Pipeline -->|Contains| DBOps
57
- Pipeline -->|Contains| FileOps
58
-
59
- %% Audit
60
- Pipeline -->|Uses| AuditOps[Audit DatabaseOperations]
61
- ```
62
-
@@ -1,81 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: easy_data_loader
3
- Version: 0.1.1
4
- Summary: Data transfer utilities between files and databases
5
- Author-email: Bojoi Gabriel <bojoigabriel@gmail.com>
6
- Requires-Python: >=3.13
7
- Description-Content-Type: text/markdown
8
- License-File: LICENSE
9
- Requires-Dist: click>=8.3.0
10
- Requires-Dist: openpyxl>=3.1.5
11
- Requires-Dist: pandas>=2.3.3
12
- Requires-Dist: pyarrow>=22.0.0
13
- Requires-Dist: pydantic>=2.12.5
14
- Requires-Dist: pydantic-settings>=2.12.0
15
- Requires-Dist: pyodbc>=5.2.0
16
- Requires-Dist: python-dotenv>=1.1.1
17
- Requires-Dist: sqlalchemy>=2.0.43
18
- Dynamic: license-file
19
-
20
- # Easy Data Loader 🚀
21
-
22
- **Easy Data Loader** is a flexible, modular Python library designed to streamline ETL (Extract, Transform, Load) processes between various data sources (CSV, Excel, Parquet) and SQL databases (MSSQL, PostgreSQL, and others).
23
-
24
- ## ✨ Key Features
25
- - **Declarative Configuration**: Manage connections and pipelines through simple python files and `.env` resources.
26
- - **Integrated CLI**: Initialize a standardized project structure with a single command.
27
- - **Custom Transformation Hooks**: Inject your own Pandas transformation logic directly into the pipeline execution.
28
- - **Performance Optimized**: Built-in support for chunked loading and writing to handle large datasets efficiently.
29
- - **Extensible Architecture**: Uses a Factory Pattern for database connectors, making it easy to support new drivers.
30
-
31
- ---
32
-
33
- ## 📦 Installation
34
-
35
- Install directly via `pip` or `uv`:
36
-
37
- ```bash
38
- pip install easy_data_loader
39
- ```
40
-
41
- ## 🚀 Getting Started
42
-
43
- 1. Initialize a new project structure to generate template configurations:
44
- ```bash
45
- easy-loader init
46
- ```
47
- 2. Review the generated `config/` folders for sample resources and pipelines.
48
- 3. Run all discovered pipelines across the active configurations:
49
- ```bash
50
- easy-loader run_all
51
- ```
52
-
53
- ## 🏗️ Architecture & Process Flow
54
-
55
- The system is designed with modularity in mind. Object dependencies and their standard instantiation lifecycle are executed through the `Configuration` singleton and the `CONNECTOR_FACTORY`.
56
-
57
- ```mermaid
58
- graph TD
59
- %% Main Execution Flow
60
- CLI[CLI Application] -->|Instantiates| Pipeline[Pipeline: Load, Procedure, Orchestrator]
61
-
62
- %% Configuration and Definitions
63
- Pipeline -->|Requests Definition & Resources| Config[Configuration Singleton]
64
- Config -->|Reads| PipelineDef[Pipeline Definitions .py]
65
- Config -->|Reads| ResourcesEnv[Resource Configs .env]
66
-
67
- %% Instantiation of Operations
68
- Pipeline -->|Uses ResourceConfig| Factory[CONNECTOR_FACTORY]
69
- Factory -->|Creates| DBConn[DatabaseConnector: SqlServer, SQLite]
70
- DBConn -->|Provides Engine to| DBOps[DatabaseOperations]
71
-
72
- Pipeline -->|Uses FileSettings| FileOps[FileOperations]
73
-
74
- %% Pipeline dependencies
75
- Pipeline -->|Contains| DBOps
76
- Pipeline -->|Contains| FileOps
77
-
78
- %% Audit
79
- Pipeline -->|Uses| AuditOps[Audit DatabaseOperations]
80
- ```
81
-
@@ -1,2 +0,0 @@
1
- [console_scripts]
2
- easy-loader = easy_data_loader.cli:main