snowforge-package 0.2.7__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
File without changes
@@ -0,0 +1,148 @@
+ Metadata-Version: 2.2
+ Name: snowforge-package
+ Version: 0.2.7
+ Summary: A Python package for supporting migration from on-prem to cloud
+ Home-page: https://github.com/yourusername/Snowforge
+ Author: Andreas Heggelund
+ Author-email: andreasheggelund@gmail.com
+ Classifier: Programming Language :: Python :: 3
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Requires-Python: >=3.12
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: boto3
+ Requires-Dist: snowflake-connector-python
+ Requires-Dist: coloredlogs
+ Requires-Dist: colored
+ Requires-Dist: tqdm
+ Requires-Dist: toml
+ Requires-Dist: argparse
+ Dynamic: author
+ Dynamic: author-email
+ Dynamic: classifier
+ Dynamic: description
+ Dynamic: description-content-type
+ Dynamic: home-page
+ Dynamic: requires-dist
+ Dynamic: requires-python
+ Dynamic: summary
+
+ # 🚀 Snowforge - Powerful Data Integration
+
+ **Snowforge** is a Python package designed to streamline data integration and transfer between **AWS**, **Snowflake**, and various **on-premise database systems**. It provides efficient data extraction, logging, configuration management, and AWS utilities to support robust data engineering workflows.
+
+ ---
+
+ ## ✨ Features
+
+ - **AWS Integration**: Manage AWS S3 and Secrets Manager operations.
+ - **Snowflake Connection**: Establish and manage Snowflake connections effortlessly.
+ - **Advanced Logging**: Centralized logging system with colored output for better visibility.
+ - **Configuration Management**: Load and manage credentials from a TOML configuration file.
+ - **Data Mover Engine**: Parallel data processing and extraction strategies for efficiency.
+ - **Extensible Database Extraction**: Uses a **strategy pattern** to support multiple **on-prem database systems** (e.g., Netezza, Oracle, PostgreSQL).
+
+ ---
+
+ ## 📥 Installation
+
+ Install Snowforge using pip:
+
+ ```sh
+ pip install snowforge-package
+ ```
+
+ ---
+
+ ## ⚙️ Configuration
+
+ Snowforge requires a configuration file (`snowforge_config.toml`) to manage credentials for AWS and Snowflake. The package searches for the config file in the following locations, in order:
+
+ 1. The path specified in the `SNOWFORGE_CONFIG_PATH` environment variable.
+ 2. The current working directory.
+ 3. `~/.config/snowforge_config.toml`
+ 4. The package directory.
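
The environment variable from step 1 can point Snowforge at an explicit file, bypassing the rest of the search order (the path below is just an example):

```sh
# Use a specific config file instead of relying on the search order (example path):
export SNOWFORGE_CONFIG_PATH="$HOME/.config/snowforge_config.toml"
```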
+
+ ### Example `snowforge_config.toml` File
+
+ ```toml
+ [AWS.default]
+ AWS_ACCESS_KEY = "your-access-key"
+ AWS_SECRET_KEY = "your-secret-key"
+ REGION = "us-east-1"
+
+ [SNOWFLAKE.default]
+ USERNAME = "your-username"
+ ACCOUNT = "your-account"
+ ```
+
+ ---
+
+ ## 🚀 Quick Start
+
+ ### 🔹 Initialize AWS Integration
+
+ ```python
+ from Snowforge.AWSIntegration import AWSIntegration
+
+ AWSIntegration.initialize(profile="default", verbose=True)
+ ```
+
+ ### 🔹 Connect to Snowflake
+
+ ```python
+ from Snowforge.SnowflakeConnect import SnowflakeConnection
+
+ conn = SnowflakeConnection.establish_connection(user_name="your-user", account="your-account")
+ ```
+
+ ### 🔹 Use Logging
+
+ ```python
+ from Snowforge.Logging import Debug
+
+ Debug.log("This is an info message", level='INFO')
+ Debug.log("This is an error message", level='ERROR')
+ ```
+
+ ### 🔹 Extract Data from an On-Prem Database
+
+ ```python
+ from Snowforge.AWSIntegration import AWSIntegration
+
+ def export_and_upload_table_data(extractor, output_path):
+     # Fetch data from an on-prem system and export it to a CSV file
+     # (an optional filter column/value pair narrows down the extract):
+     header, full_path_to_file = extractor.export_external_table(
+         output_path, "database.schema.table", "filter_column", "filter_value"
+     )
+
+     AWSIntegration.push_file_to_s3("bucket-name", full_path_to_file, "key/to/store/the/file/under")
+
+ def main():
+     from Snowforge.Extractors.NetezzaExtractor import NetezzaExtractor
+     from Snowforge.Extractors.OracleExtractor import OracleExtractor
+     from Snowforge.Extractors.PostgrSQLExtractor import PostgrSQLExtractor
+
+     # Export and upload data from different source systems by exchanging the extractor strategy
+     export_and_upload_table_data(NetezzaExtractor(), "exports/")
+     export_and_upload_table_data(OracleExtractor(), "exports/")
+     export_and_upload_table_data(PostgrSQLExtractor(), "exports/")
+ ```
+
+ Since **Snowforge** follows a **strategy pattern**, it can easily be extended to support other **database systems** by implementing new extractor classes that conform to the `ExtractorStrategy` interface.
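
For instance, a new extractor only needs to implement the abstract methods. The sketch below inlines a minimal stand-in for the `ExtractorStrategy` ABC so it runs standalone; the `SQLiteExtractor` class and its filter semantics are illustrative, not part of the package:

```python
from abc import ABC, abstractmethod

# Minimal stand-in for Snowforge's ExtractorStrategy interface (illustrative only)
class ExtractorStrategy(ABC):
    @abstractmethod
    def extract_table_query(self, fully_qualified_table_name, filter_column=None, filter_value=None):
        ...

# A hypothetical extractor for a new source system, following the same pattern
# as NetezzaExtractor: build a SELECT with an optional filter clause.
class SQLiteExtractor(ExtractorStrategy):
    def extract_table_query(self, fully_qualified_table_name, filter_column=None, filter_value=None):
        query = f"SELECT * FROM {fully_qualified_table_name}"
        if filter_column is not None and filter_value is not None:
            query += f" WHERE {filter_column} >= '{filter_value}'"
        return query

print(SQLiteExtractor().extract_table_query("db.schema.table", "updated_at", "2024-01-01"))
# -> SELECT * FROM db.schema.table WHERE updated_at >= '2024-01-01'
```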
+
+ ---
+
+ ## 📜 License
+
+ This project is licensed under the **MIT License**.
+
+ ---
+
+ ## 👤 Author
+
+ Developed by **andreasheggelund@gmail.com**. Feel free to reach out for support, suggestions, or collaboration!
@@ -0,0 +1,215 @@
+ import os
+ import sys
+ import json
+ import boto3
+ import botocore.exceptions as be
+ from boto3.s3.transfer import TransferConfig
+ from tqdm import tqdm
+ from .Logging import Debug  # Shared logging class
+ from .Config import Config  # Shared config loader
+
+ class ProgressPercentage:
+     """Reports upload progress to a tqdm progress bar as boto3 invokes the callback."""
+
+     def __init__(self, filename):
+         self._filename = filename
+         self._size = float(os.path.getsize(filename))
+         self._seen_so_far = 0
+         self._tqdm = tqdm(total=self._size, unit='B', unit_scale=True, desc=filename)
+
+     def __call__(self, bytes_amount):
+         self._seen_so_far += bytes_amount
+         self._tqdm.update(bytes_amount)
+
+ class AWSIntegration:
+     """Static AWS helper class for managing S3 and Secrets Manager operations."""
+
+     s3_client = None
+     secret_client = None
+     _current_profile = None
+
+     @staticmethod
+     def initialize(profile: str = "default", verbose: bool = False):
+         """Initializes AWS clients for S3 and Secrets Manager.
+
+         If authentication fails, resets the cached clients so that
+         `initialize()` can be called again for a retry.
+
+         Args:
+             profile (str, optional): The AWS profile (from the .toml config) to use. Defaults to "default".
+             verbose (bool, optional): Set True to enable DEBUG output. Defaults to False.
+
+         Raises:
+             SystemExit: If AWS authentication fails or the profile is not found in the .toml file.
+         """
+
+         # If already initialized successfully, return early
+         if AWSIntegration.s3_client is not None and AWSIntegration.secret_client is not None:
+             Debug.log(f"Already authenticated using profile '{AWSIntegration._current_profile}'!", 'DEBUG', verbose)
+             return
+
+         try:
+             aws_creds = Config.get_aws_credentials(profile=profile)
+             access_key = aws_creds["AWS_ACCESS_KEY"]
+             secret_key = aws_creds["AWS_SECRET_KEY"]
+             region = aws_creds["REGION"]
+         except TypeError:
+             Debug.log(f"No profile named '{profile}' in config file.", 'ERROR')
+             sys.exit(1)
+
+         # Avoid echoing the secret key itself into the logs
+         Debug.log(f"Credentials found in config.toml:\nAWS_ACCESS_KEY_ID: {access_key}\nAWS_SECRET_ACCESS_KEY: ***redacted***\n", 'DEBUG', verbose)
+
+         try:
+             identity = AWSIntegration.check_connection(access_key, secret_key, region)
+             AWSIntegration._current_profile = profile  # Persist the currently authenticated profile
+
+             Debug.log(f"Authenticated as: {identity['Arn'].split('/')[-1]}", 'SUCCESS')
+
+         except be.ClientError:
+             Debug.log("Invalid credentials. Please verify that your profile has the required permissions.", 'ERROR')
+             sys.exit(1)
+
+         try:
+             AWSIntegration.s3_client = boto3.client(
+                 "s3",
+                 aws_access_key_id=access_key,
+                 aws_secret_access_key=secret_key,
+                 region_name=region
+             )
+
+             AWSIntegration.secret_client = boto3.client(
+                 "secretsmanager",
+                 aws_access_key_id=access_key,
+                 aws_secret_access_key=secret_key,
+                 region_name=region
+             )
+
+             Debug.log("Successfully created AWS service clients!", 'DEBUG', verbose)
+
+         except be.ClientError as e:
+             error_code = e.response['Error']['Code']
+             # Reset class variables to allow retrying on the next call
+             AWSIntegration.s3_client = None
+             AWSIntegration.secret_client = None
+
+             # Reset environment variables so `initialize()` starts clean on the next call
+             os.environ.pop("AWS_ACCESS_KEY_ID", None)
+             os.environ.pop("AWS_SECRET_ACCESS_KEY", None)
+
+             if error_code == 'InvalidAccessKeyId':
+                 Debug.log("\n\nThe selected IAM user is not found.\n", 'ERROR')
+
+     @staticmethod
+     def check_connection(access_key: str, secret_key: str, region: str):
+         '''Validates the connection to AWS by fetching the caller identity.
+
+         Args:
+             access_key (str): The access key ID of the IAM user.
+             secret_key (str): The secret access key of the IAM user.
+             region (str): The AWS region to connect to.
+
+         Returns:
+             dict: The caller identity, if authenticated.
+         '''
+         sts_client = boto3.client(
+             "sts",
+             aws_access_key_id=access_key,
+             aws_secret_access_key=secret_key,
+             region_name=region
+         )
+
+         # get_caller_identity() raises botocore's ClientError on invalid
+         # credentials, which the caller (initialize) handles.
+         identity = sts_client.get_caller_identity()
+
+         return identity
+
+     @staticmethod
+     def define_s3_transfer_config(size_threshold: float, threads: int):
+         """Defines and returns an AWS S3 TransferConfig for efficient file uploads.
+
+         Args:
+             size_threshold (float): The file size (in GB) at which multipart upload should trigger.
+             threads (int): Number of concurrent threads for the upload.
+
+         Returns:
+             TransferConfig: Configured transfer settings for AWS S3 uploads.
+         """
+         GB = 1024 ** 3
+         Debug.log(f"Threshold for multithreaded upload to S3: {size_threshold}GB\n"
+                   f"Concurrent threads: {threads}", 'INFO')
+
+         # multipart_threshold expects an integer number of bytes
+         return TransferConfig(multipart_threshold=int(size_threshold * GB), max_concurrency=threads)
+
+     @staticmethod
+     def get_secret(secret_name: str, verbose: bool = False):
+         """Retrieves a secret from AWS Secrets Manager.
+
+         Args:
+             secret_name (str): The name of the secret to retrieve.
+             verbose (bool, optional): Set True to enable DEBUG output. Defaults to False.
+
+         Returns:
+             dict: The secret's value parsed as a dictionary.
+         """
+         AWSIntegration.initialize(verbose=verbose)
+         try:
+             response = AWSIntegration.secret_client.get_secret_value(SecretId=secret_name)
+             return json.loads(response['SecretString'])
+         except Exception as e:
+             Debug.log(f"Failed to retrieve secret: {e}", 'ERROR')
+
+     @staticmethod
+     def get_bucket_contents(bucket_name: str, verbose: bool = False):
+         """Lists all files in a given AWS S3 bucket.
+
+         Args:
+             bucket_name (str): The name of the S3 bucket.
+             verbose (bool, optional): Set True to enable DEBUG output. Defaults to False.
+
+         Returns:
+             list[str]: A list of filenames stored in the bucket.
+         """
+         AWSIntegration.initialize(verbose=verbose)
+         try:
+             response = AWSIntegration.s3_client.list_objects_v2(Bucket=bucket_name)
+             return [item['Key'] for item in response.get('Contents', [])]
+         except Exception as e:
+             Debug.log(f"Error fetching bucket contents: {e}", 'ERROR')
+
+     @staticmethod
+     def push_file_to_s3(bucket_name: str, file_to_upload: str, key: str, config: TransferConfig = None, verbose: bool = False):
+         """Uploads a file to an AWS S3 bucket.
+
+         Args:
+             bucket_name (str): The destination S3 bucket name.
+             file_to_upload (str): Path to the file to upload.
+             key (str): The S3 key (filename) to assign.
+             config (TransferConfig, optional): AWS S3 transfer configuration. Defaults to None.
+             verbose (bool, optional): Set True to enable DEBUG output. Defaults to False.
+         """
+         AWSIntegration.initialize(verbose=verbose)
+
+         if config is None:
+             config = AWSIntegration.define_s3_transfer_config(0.1, 10)
+
+         try:
+             Debug.log(f"Uploading {file_to_upload} to {bucket_name}/{key}...", 'INFO')
+
+             with open(file_to_upload, 'rb') as file_obj:
+                 AWSIntegration.s3_client.upload_fileobj(
+                     file_obj, bucket_name, key, Config=config, Callback=ProgressPercentage(file_to_upload)
+                 )
+
+             Debug.log(f"Successfully uploaded {file_to_upload} to {bucket_name}/{key}", 'SUCCESS')
+
+         except Exception as e:
+             Debug.log(f"Error uploading file: {e}", 'ERROR')
@@ -0,0 +1,86 @@
+ import os
+ import toml
+ from .Logging import Debug  # Use the shared logging system
+
+ class Config:
+     """Loads configuration settings from snowforge_config.toml and manages profiles globally."""
+
+     _config_data = {}
+
+     _aws_profile = "default"  # Stores the globally selected AWS profile
+     _snowflake_profile = "default"
+
+     CONFIG_FILE_PATHS = [
+         os.getenv("SNOWFORGE_CONFIG_PATH"),  # Custom path via environment variable
+         os.path.join(os.getcwd(), "snowforge_config.toml"),  # Current working directory
+         os.path.join(os.path.expanduser("~"), ".config", "snowforge_config.toml"),  # ~/.config/snowforge_config.toml
+         os.path.join(os.path.dirname(__file__), "snowforge_config.toml")  # Package directory
+     ]
+
+     @staticmethod
+     def find_config_file(config_paths: list = CONFIG_FILE_PATHS, verbose: bool = False):
+         """Finds the first available config file by searching the locations in 'config_paths'.
+
+         Args:
+             config_paths (list): Locations to search for the config file. Defaults to SNOWFORGE_CONFIG_PATH, the current working directory, ~/.config and the package directory. You must create this file yourself in one of these locations and name it 'snowforge_config.toml'.
+             verbose (bool): Enables/disables verbose logging. Defaults to False.
+         """
+         for path in config_paths:
+             if path and os.path.exists(path):
+                 Debug.log(f"Found config file at: {path}", 'DEBUG', verbose)
+                 return path
+
+         Debug.log("⚠️ No snowforge_config.toml file found.", "WARNING")
+         return None
+
+     @staticmethod
+     def load_config(config_paths: list = CONFIG_FILE_PATHS, verbose: bool = False):
+         """Loads the config from the first valid path in 'config_paths'."""
+         config_path = Config.find_config_file(config_paths, verbose)
+
+         # Fail early if no valid path is found
+         if not config_path:
+             Debug.log("No valid file found! Ensure you have created a config file named 'snowforge_config.toml'.", 'ERROR')
+             raise FileNotFoundError("snowforge_config.toml not found")
+
+         try:
+             Config._config_data = toml.load(config_path)
+             Debug.log(f"Successfully loaded config file from: {config_path}!", 'DEBUG', verbose)
+         except toml.TomlDecodeError as e:
+             Debug.log(f"Error loading config file\n{e.msg}", 'ERROR')
+
+     @staticmethod
+     def get_current_aws_profile() -> str:
+         """Returns the currently selected AWS profile."""
+         return Config._aws_profile
+
+     @staticmethod
+     def get_current_snowflake_profile() -> str:
+         """Returns the currently selected Snowflake profile."""
+         return Config._snowflake_profile
+
+     @staticmethod
+     def get_snowflake_credentials(config_paths: list = CONFIG_FILE_PATHS, profile: str = "default", verbose: bool = False) -> dict:
+         """Returns credentials for a given Snowflake profile specified in the snowforge_config.toml file."""
+         Config.load_config(config_paths, verbose)
+
+         sf_config = Config._config_data.get("SNOWFLAKE", {}).get(profile, {})
+         if not sf_config:
+             Debug.log(f"No profile '{profile}' in .toml file... Please provide a valid configuration.", 'WARNING')
+             return None
+
+         return sf_config
+
+     @staticmethod
+     def get_aws_credentials(config_paths: list = CONFIG_FILE_PATHS, profile: str = "default", verbose: bool = False) -> dict:
+         """Returns credentials for a given AWS profile specified in the snowforge_config.toml file."""
+         Config.load_config(config_paths, verbose)
+
+         aws_config = Config._config_data.get("AWS", {}).get(profile, {})
+         if not aws_config:
+             Debug.log(f"No profile '{profile}' in .toml file... Please provide a valid configuration.", 'WARNING')
+             return None
+
+         return aws_config
@@ -0,0 +1,95 @@
+ import os
+ import multiprocessing as mp
+ from concurrent.futures import ProcessPoolExecutor
+
+ from .Extractors.ExtractorStrategy import ExtractorStrategy
+ from ..Logging import Debug
+
+ class Engine():
+     """Engine for moving data across platforms and between on-prem and cloud."""
+
+     def __init__(self):
+         """Initialize the DataMover engine."""
+         self.cpu_count = os.cpu_count() or 4
+         # Cap the pool at 8 workers while leaving a couple of cores free for the system
+         self.pool = ProcessPoolExecutor(max_workers=min(8, max(1, self.cpu_count - 2)))
+
+     @staticmethod
+     def parallel_process(worker_func, args_list: list[tuple], num_workers: int = None, use_shared_queue: bool = False, queue=None):
+         '''
+         Executes a worker function 'worker_func' in parallel across multiple processes, the number of which is controlled by 'num_workers'.
+
+         Args:
+             worker_func (function): The function that each worker process should execute.
+             args_list (list): A list of tuples, where each tuple contains arguments for worker_func.
+             num_workers (int, optional): Number of parallel workers. Defaults to max(4, CPU count - 2).
+             use_shared_queue (bool, optional): If True, the supplied multiprocessing queue is passed to each worker.
+             queue (mp.Queue, optional): The shared queue to pass to workers when use_shared_queue is True.
+
+         Returns:
+             list: The list of started worker processes.
+         '''
+
+         # Determine the number of worker processes to use
+         if num_workers is None:
+             num_workers = max(4, os.cpu_count() - 2)  # Failsafe so that some cores remain available for the system
+
+         # Never spawn more processes than there are tasks, even if num_workers > len(args_list)
+         num_processes = min(num_workers, len(args_list))
+
+         process_list = []
+         if use_shared_queue and queue is None:
+             Debug.log("You HAVE to supply a queue as input to this function if you set 'use_shared_queue = True', otherwise the queue will not be reachable by producer/consumer processes on the other side!", 'WARNING')
+             raise ValueError("use_shared_queue=True requires a queue argument")
+
+         # Create and start all worker processes
+         for i in range(num_processes):
+             if use_shared_queue:
+                 process = mp.Process(target=worker_func, args=(*args_list[i], queue))
+             else:
+                 process = mp.Process(target=worker_func, args=args_list[i])
+
+             process.daemon = True  # Ensure processes exit when the main program exits, leaving no orphans or zombies
+             process_list.append(process)
+             process.start()
+
+         return process_list  # Return the list of running processes
+
+     @staticmethod
+     def determine_file_offsets(file_name: str, num_chunks: int):
+         """Determine file offsets for parallel reading, aligned to line breaks."""
+         file_size = os.path.getsize(file_name)
+         chunk_size = max(1, file_size // num_chunks)
+
+         offsets = [0]
+         with open(file_name, 'rb') as f:
+             for _ in range(num_chunks - 1):
+                 f.seek(offsets[-1] + chunk_size)
+                 f.readline()  # Advance to the next line break so chunks never split a line
+                 offsets.append(f.tell())
+         Debug.log(f"File offsets computed: {offsets}", 'DEBUG')
+         return offsets
+
+     @staticmethod
+     def export_to_file(extractor: ExtractorStrategy, output_path: str, fully_qualified_table_name: str, filter_column: str = None, filter_value: str = None, verbose: bool = False) -> tuple:
+         """Uses the selected extractor to export a database table to a CSV file."""
+
+         header, csv_file = extractor.export_external_table(output_path, fully_qualified_table_name, filter_column, filter_value, verbose)
+         return header, csv_file
+
+     @staticmethod
+     def calculate_chunks(external_table: str, compression: int = 4):
+         """Calculates how many chunks to split the file into."""
+         # 100 MB of compressed data per chunk, scaled by the expected gzip compression factor on table data
+         unzipped_chunk_filesize = 100 * 1024 * 1024 * compression
+
+         total_filesize = os.path.getsize(external_table)
+
+         if total_filesize > unzipped_chunk_filesize:
+             num_chunks = int(total_filesize // unzipped_chunk_filesize) + 2  # Ensures at least three chunks are created (one is subtracted further down in the code)
+         else:
+             num_chunks = 2
+
+         Debug.log(f"\nTotal filesize: {total_filesize // (1024*1024)} MB\nNumber of chunks: {num_chunks - 1}\n", 'INFO')
+
+         return num_chunks
@@ -0,0 +1,20 @@
+ # snowforge/DataMover/Extractors/ExtractorStrategy.py
+ from abc import ABC, abstractmethod
+
+ class ExtractorStrategy(ABC):
+     """Abstract base class for all extraction strategies. Think of this like an interface in C#."""
+
+     @abstractmethod
+     def extract_table_query(self, fully_qualified_table_name: str, filter_column: str, filter_value: str, verbose: bool = False):
+         """Builds the extraction query for a table based on the provided criteria."""
+         pass
+
+     @abstractmethod
+     def list_all_tables(self, database_name: str, verbose: bool = False):
+         """Lists all tables in a given database."""
+         pass
+
+     @abstractmethod
+     def export_external_table(self, output_path: str, fully_qualified_table_name: str, filter_column: str = None, filter_value: str = None, verbose: bool = False):
+         """Exports a database table to an external file; signature matches the Engine and the concrete extractors."""
+         pass
@@ -0,0 +1,77 @@
+ import subprocess
+ import os
+ from .ExtractorStrategy import ExtractorStrategy
+ from ...Logging import Debug
+
+ class NetezzaExtractor(ExtractorStrategy):
+     """Handles extraction from Netezza specifically."""
+
+     def extract_table_query(self, fully_qualified_table_name: str, filter_column: str = None, filter_value: str = None, verbose: bool = False) -> str:
+         """Builds a query against a database table. (Can be extended in a future version)."""
+
+         if filter_column is not None and filter_value is None:
+             Debug.log("You must provide a filter value in order to apply any filtering.", 'WARNING')
+             raise ValueError("filter_column was given without a filter_value")
+
+         elif filter_column is None and filter_value is not None:
+             Debug.log("You cannot supply a filter value without specifying a filter column (--filter option).", 'WARNING')
+             raise ValueError("filter_value was given without a filter_column")
+
+         if filter_value is None or filter_column is None:
+             query = f"SELECT * FROM {fully_qualified_table_name}"
+         else:
+             query = f"SELECT * FROM {fully_qualified_table_name} WHERE {filter_column} BETWEEN TO_DATE('{filter_value}', 'DD.MM.YYYY') AND CURRENT_DATE+1"
+
+         return query
+
+     def list_all_tables(self, database_name: str, verbose: bool = False) -> list:
+         """Queries all tables in the specified database and returns them as a list."""
+
+         command = f"nzsql -q -c \"SELECT TABLE_NAME FROM _V_TABLE WHERE TABLE_SCHEMA = '{database_name}';\""
+         output = subprocess.check_output(command, shell=True).decode('ISO-8859-1')
+         table_list = [line.strip() for line in output.split('\n') if line.strip()]
+
+         return table_list
+
+     def export_external_table(self, output_path: str, fully_qualified_table_name: str, filter_column: str = None, filter_value: str = None, verbose: bool = False):
+         """Runs the query on Netezza and exports the data to a CSV file."""
+
+         table_name = fully_qualified_table_name.split('.')[-1] if fully_qualified_table_name else None
+
+         query = self.extract_table_query(fully_qualified_table_name, filter_column, filter_value, verbose)
+
+         os.makedirs(output_path, exist_ok=True)
+         exported_csv_file = os.path.join(output_path, f"{table_name}_full.csv")
+
+         # Ensure the target file exists before the external table writes to it
+         with open(exported_csv_file, 'w'):
+             pass
+
+         external_table_query = f"""
+         CREATE EXTERNAL TABLE '/export/home/nz/{exported_csv_file}'
+         USING (
+             delimiter ','
+             escapeChar '\\'
+             nullValue 'NULL'
+             encoding 'internal'
+         )
+         AS {query};
+         """
+         nzsql_command = f"""nzsql -c "{external_table_query}" """
+         # Convert via a temporary file, since iconv cannot safely convert a file onto itself
+         encoding_command = f"iconv -f ISO-8859-1 -t UTF-8 {exported_csv_file} -o {exported_csv_file}.utf8 && mv {exported_csv_file}.utf8 {exported_csv_file}"
+
+         try:
+             # Export first, then convert the exported file to UTF-8
+             subprocess.run(nzsql_command, shell=True, check=True)
+
+             Debug.log(f"Running command: {encoding_command}", 'DEBUG', verbose)
+             subprocess.run(encoding_command, shell=True, check=True)
+
+         except subprocess.CalledProcessError as e:
+             Debug.log(f"Error executing Netezza command: {e}", 'ERROR')
+             return None
+
+         # FIXME: placeholder; the external-table export does not emit a header row
+         header = "ddd"
+         return header, exported_csv_file
+
+     def get_row_id_of_row(self, fully_qualified_table: str, header: str, row: str):
+         """Fetches the rowid of a given row. (Not yet implemented.)"""
+
+         rowid_query = ""  # TODO: build the rowid lookup query
@@ -0,0 +1,59 @@
+ import logging
+ from colored import Fore, Style
+
+ class Debug:
+     """Handles logging with colored output for better visibility."""
+
+     logger = logging.getLogger("SnowforgeLogger")
+     handler = logging.StreamHandler()
+     handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
+
+     # Ensure the logger only has one handler (avoid duplicate logs)
+     if not logger.hasHandlers():
+         logger.addHandler(handler)
+
+     @staticmethod
+     def log(message: str, level='INFO', verbose_logging: bool = False):
+         """Logs a message with a specified severity level and colored output.
+
+         Args:
+             message (str): The message to log.
+             level (str, optional): The log level (INFO, DEBUG, ERROR, etc.). Defaults to "INFO".
+             verbose_logging (bool, optional): Set to True to enable DEBUG output globally. Defaults to False.
+         """
+
+         # Adjust the log level based on the verbose flag
+         if verbose_logging:
+             Debug.logger.setLevel(logging.DEBUG)
+         else:
+             Debug.logger.setLevel(logging.INFO)
+
+         # Normalize the level to uppercase
+         level = level.upper()
+
+         log_levels = {
+             'DEBUG': logging.DEBUG,
+             'INFO': logging.INFO,
+             'WARNING': logging.WARNING,
+             'ERROR': logging.ERROR,
+             'CRITICAL': logging.CRITICAL
+         }
+
+         # Define the color mapping
+         color_map = {
+             'INFO': Fore.white,
+             'ERROR': Fore.red,
+             'DEBUG': Fore.blue,
+             'WARNING': Fore.yellow,
+             'SUCCESS': Fore.light_green,
+             'FAILURE': Fore.red,
+             'CRITICAL': Fore.light_red
+         }
+
+         colored_message = f"{color_map.get(level, Fore.white)}{message}{Style.reset}"
+
+         # Fall back to INFO for custom levels such as SUCCESS and FAILURE
+         if level in log_levels:
+             getattr(Debug.logger, level.lower())(colored_message)
+         else:
+             Debug.logger.info(colored_message)
@@ -0,0 +1,48 @@
+ import snowflake.connector as sf
+ from .Logging import Debug  # Import from the same package
+
+ class SnowflakeConnection:
+     """Handles establishing and managing connections to Snowflake."""
+
+     DEFAULTS = {
+         "snowflake_username": "snowflake username",
+         "snowflake_account": "snowflake account"
+     }
+
+     _connection = None  # Cached connection object
+
+     @staticmethod
+     def establish_connection(user_name: str = DEFAULTS["snowflake_username"], account: str = DEFAULTS["snowflake_account"]) -> sf.SnowflakeConnection:
+         """Establishes a connection to Snowflake.
+
+         Uses either a credentials file or manual login via username and account.
+
+         Args:
+             user_name (str, optional): The Snowflake username. Defaults to DEFAULTS["snowflake_username"].
+             account (str, optional): The Snowflake account ID. Defaults to DEFAULTS["snowflake_account"].
+
+         Returns:
+             sf.SnowflakeConnection: A Snowflake connection object.
+
+         Raises:
+             sf.errors.ConfigSourceError: If the connection fails.
+         """
+
+         if SnowflakeConnection._connection:
+             return SnowflakeConnection._connection
+
+         try:
+             if user_name == SnowflakeConnection.DEFAULTS["snowflake_username"] or account == SnowflakeConnection.DEFAULTS["snowflake_account"]:
+                 # No explicit credentials given; fall back to the connector's own config sources
+                 SnowflakeConnection._connection = sf.connect()
+             else:
+                 SnowflakeConnection._connection = sf.connect(
+                     user=user_name,
+                     account=account,
+                     authenticator="externalbrowser"
+                 )
+             return SnowflakeConnection._connection
+
+         except Exception as e:
+             Debug.log(f"\nCould not connect to Snowflake, did you create a .toml file?\nRemember you can always connect using account + username.\nError message: {e}", 'ERROR')
+             raise sf.errors.ConfigSourceError
@@ -0,0 +1,7 @@
+ # Snowforge/__init__.py
+
+ from .Logging import Debug
+ from .SnowflakeConnect import SnowflakeConnection
+ from .AWSIntegration import AWSIntegration
+ from .Config import Config
+
+ __all__ = ["Debug", "SnowflakeConnection", "AWSIntegration", "Config"]
@@ -0,0 +1,4 @@
+ # pyproject.toml
+ [build-system]
+ requires = ["setuptools", "wheel"]
+ build-backend = "setuptools.build_meta"
@@ -0,0 +1,4 @@
+ [egg_info]
+ tag_build =
+ tag_date = 0
+
@@ -0,0 +1,28 @@
+ from setuptools import setup, find_packages
+
+ setup(
+     name="snowforge-package",
+     version="0.2.7",  # Change this for new releases
+     author="Andreas Heggelund",
+     author_email="andreasheggelund@gmail.com",
+     description="A Python package for supporting migration from on-prem to cloud",
+     long_description=open("README.md", encoding="utf-8").read(),
+     long_description_content_type="text/markdown",
+     url="https://github.com/yourusername/Snowforge",  # Replace with your GitHub repo
+     packages=find_packages(),
+     install_requires=[
+         "boto3",
+         "snowflake-connector-python",
+         "coloredlogs",
+         "colored",
+         "tqdm",
+         "toml",
+         "argparse"  # Note: argparse ships with the standard library and could be dropped
+     ],
+     python_requires=">=3.12",
+     classifiers=[
+         "Programming Language :: Python :: 3",
+         "License :: OSI Approved :: MIT License",
+         "Operating System :: OS Independent",
+     ],
+ )
@@ -0,0 +1,148 @@
+ Metadata-Version: 2.2
+ Name: snowforge-package
+ Version: 0.2.7
+ Summary: A Python package for supporting migration from on-prem to cloud
+ Home-page: https://github.com/yourusername/Snowforge
+ Author: Andreas Heggelund
+ Author-email: andreasheggelund@gmail.com
+ Classifier: Programming Language :: Python :: 3
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Requires-Python: >=3.12
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: boto3
+ Requires-Dist: snowflake-connector-python
+ Requires-Dist: coloredlogs
+ Requires-Dist: colored
+ Requires-Dist: tqdm
+ Requires-Dist: toml
+ Requires-Dist: argparse
+ Dynamic: author
+ Dynamic: author-email
+ Dynamic: classifier
+ Dynamic: description
+ Dynamic: description-content-type
+ Dynamic: home-page
+ Dynamic: requires-dist
+ Dynamic: requires-python
+ Dynamic: summary
+
+ # 🚀 Snowforge - Powerful Data Integration
+
+ **Snowforge** is a Python package designed to streamline data integration and transfer between **AWS**, **Snowflake**, and various **on-premise database systems**. It provides efficient data extraction, logging, configuration management, and AWS utilities to support robust data engineering workflows.
+
+ ---
+
+ ## ✨ Features
+
+ - **AWS Integration**: Manage AWS S3 and Secrets Manager operations.
+ - **Snowflake Connection**: Establish and manage Snowflake connections effortlessly.
+ - **Advanced Logging**: Centralized logging system with colored output for better visibility.
+ - **Configuration Management**: Load and manage credentials from a TOML configuration file.
+ - **Data Mover Engine**: Parallel data processing and extraction strategies for efficiency.
+ - **Extensible Database Extraction**: Uses a **strategy pattern** to support multiple **on-prem database systems** (e.g., Netezza, Oracle, PostgreSQL).
+
+ ---
+
+ ## 📥 Installation
+
+ Install Snowforge using pip:
+
+ ```sh
+ pip install snowforge-package
+ ```
+
+ ---
+
+ ## ⚙️ Configuration
+
+ Snowforge requires a configuration file (`snowforge_config.toml`) to manage credentials for AWS and Snowflake. The package searches for the config file in the following locations, in order:
+
+ 1. Path specified in the `SNOWFORGE_CONFIG_PATH` environment variable.
+ 2. Current working directory.
+ 3. `~/.config/snowforge_config.toml`
+ 4. Package directory.
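
Because the `SNOWFORGE_CONFIG_PATH` environment variable is checked first, a config stored outside the usual locations can be selected explicitly. A minimal sketch for a POSIX shell (the path shown is purely illustrative):

```sh
# Point Snowforge at a config file outside the default search locations
export SNOWFORGE_CONFIG_PATH="$HOME/projects/snowforge/snowforge_config.toml"
```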
66
+
67
+ ### Example `snowforge_config.toml` File
68
+
69
+ ```toml
70
+ [AWS]
71
+ [default]
72
+ AWS_ACCESS_KEY = "your-access-key"
73
+ AWS_SECRET_KEY = "your-secret-key"
74
+ REGION = "us-east-1"
75
+
76
+ [SNOWFLAKE]
77
+ [default]
78
+ USERNAME = "your-username"
79
+ ACCOUNT = "your-account"
80
+ ```
81
+
82
+ ---
83
+
84
+ ## 🚀 Quick Start
85
+
86
+ ### 🔹 Initialize AWS Integration
87
+
88
+ ```python
89
+ from Snowforge.AWSIntegration import AWSIntegration
90
+
91
+ AWSIntegration.initialize(profile="default", verbose=True)
92
+ ```
93
+
94
+ ### 🔹 Connect to Snowflake
95
+
96
+ ```python
97
+ from Snowforge.SnowflakeConnect import SnowflakeConnection
98
+
99
+ conn = SnowflakeConnection.establish_connection(user_name="your-user", account="your-account")
100
+ ```
101
+
102
+ ### 🔹 Use Logging
103
+
104
+ ```python
105
+ from Snowforge.Logging import Debug
106
+
107
+ Debug.log("This is an info message", level='INFO')
108
+ Debug.log("This is an error message", level='ERROR')
109
+ ```
110
+
111
+ ### 🔹 Extract Data from an On-Prem Database
112
+
113
+ ```python
+ import Snowforge.AWSIntegration as aws
+ from Snowforge.DataMover.Extractors.ExtractorStrategy import ExtractorStrategy
+
+ def export_and_upload_table_data(extractor: ExtractorStrategy, output_path: str):
+
+     # Fetch data from an on-prem system:
+     query = extractor.extract_table_query("database.schema.table", "filter_column", "filter_value")
+     full_path_to_file = extractor.export_table_to_file(query, output_path)  # an optional file format can also be passed
+
+     aws.upload_to_s3("bucket name", full_path_to_file, "key to store the file under in S3")
+
+ def main():
+     from Snowforge.DataMover.Extractors.NetezzaExtractor import NetezzaExtractor
+     from Snowforge.DataMover.Extractors.OracleExtractor import OracleExtractor
+     from Snowforge.DataMover.Extractors.PostgrSQLExtractor import PostgrSQLExtractor
+
+     # Export and upload data from different source systems by exchanging the extractor strategy
+     export_and_upload_table_data(NetezzaExtractor(), "/tmp/exports")
+     export_and_upload_table_data(OracleExtractor(), "/tmp/exports")
+     export_and_upload_table_data(PostgrSQLExtractor(), "/tmp/exports")
+ ```
+
+ Since **Snowforge** follows a **strategy pattern**, it can be easily extended to support other **database systems** by implementing new extractor classes that conform to the `ExtractorStrategy` interface.
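
The `ExtractorStrategy` interface itself is not shown here, so the sketch below assumes it requires the two methods the Quick Start example calls (`extract_table_query` and `export_table_to_file`). `MySqlExtractor`, its query dialect, and the file-naming scheme are hypothetical placeholders, not part of the package:

```python
# Hypothetical sketch of an extractor for a new source system.
# A real implementation would conform to Snowforge's ExtractorStrategy
# interface; only the two methods used in the Quick Start are mirrored here.

class MySqlExtractor:
    def extract_table_query(self, fully_qualified_table: str,
                            filter_column: str, filter_value: str) -> str:
        """Builds the extraction query in the source system's SQL dialect."""
        return (f"SELECT * FROM {fully_qualified_table} "
                f"WHERE {filter_column} = '{filter_value}'")

    def export_table_to_file(self, query: str, output_path: str,
                             file_format: str = "csv") -> str:
        """Would run the query against the source database and write the
        result to a local file, returning the full path (I/O omitted)."""
        full_path = f"{output_path}/export.{file_format}"
        # ... execute `query` via the system's driver and write rows here ...
        return full_path

print(MySqlExtractor().extract_table_query("db.schema.table", "id", "42"))
# → SELECT * FROM db.schema.table WHERE id = '42'
```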
+
+ ---
+
+ ## 📜 License
+
+ This project is licensed under the **MIT License**.
+
+ ---
+
+ ## 👤 Author
+
+ Developed by **andreasheggelund@gmail.com**. Feel free to reach out for support, suggestions, or collaboration!
@@ -0,0 +1,19 @@
+ LICENSE
+ README.md
+ pyproject.toml
+ setup.py
+ Snowforge/AWSIntegration.py
+ Snowforge/Config.py
+ Snowforge/Logging.py
+ Snowforge/SnowflakeConnect.py
+ Snowforge/__init__.py
+ Snowforge/DataMover/DataMover.py
+ Snowforge/DataMover/__init__.py
+ Snowforge/DataMover/Extractors/ExtractorStrategy.py
+ Snowforge/DataMover/Extractors/NetezzaExtractor.py
+ Snowforge/DataMover/Extractors/__init__.py
+ snowforge_package.egg-info/PKG-INFO
+ snowforge_package.egg-info/SOURCES.txt
+ snowforge_package.egg-info/dependency_links.txt
+ snowforge_package.egg-info/requires.txt
+ snowforge_package.egg-info/top_level.txt
@@ -0,0 +1,7 @@
+ boto3
+ snowflake-connector-python
+ coloredlogs
+ colored
+ tqdm
+ toml
+ argparse