sqlServerConnector 0.1.0 (tar.gz)
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- sqlserverconnector-0.1.0/PKG-INFO +57 -0
- sqlserverconnector-0.1.0/README.md +38 -0
- sqlserverconnector-0.1.0/pyproject.toml +34 -0
- sqlserverconnector-0.1.0/setup.cfg +4 -0
- sqlserverconnector-0.1.0/src/__init__.py +6 -0
- sqlserverconnector-0.1.0/src/connector.py +406 -0
- sqlserverconnector-0.1.0/src/sqlServerConnector.egg-info/PKG-INFO +57 -0
- sqlserverconnector-0.1.0/src/sqlServerConnector.egg-info/SOURCES.txt +9 -0
- sqlserverconnector-0.1.0/src/sqlServerConnector.egg-info/dependency_links.txt +1 -0
- sqlserverconnector-0.1.0/src/sqlServerConnector.egg-info/requires.txt +7 -0
- sqlserverconnector-0.1.0/src/sqlServerConnector.egg-info/top_level.txt +2 -0
@@ -0,0 +1,57 @@
+Metadata-Version: 2.4
+Name: sqlServerConnector
+Version: 0.1.0
+Summary: A custom SQL Server Connector for ETL processes with Pandas
+Author-email: Nguyen Minh Son <nguyen.minhson1511@gmail.com>
+Project-URL: Homepage, https://github.com/johnnyb1509/sqlServerConnector
+Keywords: sql,etl,pandas,sqlalchemy
+Classifier: Programming Language :: Python :: 3
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+Requires-Dist: pandas>=1.5.0
+Requires-Dist: numpy
+Requires-Dist: sqlalchemy>=2.0.0
+Requires-Dist: pyodbc
+Requires-Dist: pyyaml
+Requires-Dist: loguru
+Requires-Dist: jupyterlab
+
+# SQL Data Connector Project
+
+This project provides a robust Python module (`dbConnector`) for interacting with Microsoft SQL Server, optimized for Data Engineering tasks such as ETL, bulk inserts, and synchronizing (upserting) data from Pandas DataFrames.
+
+## Key Features
+
+* **Highly automated**: Automatically creates tables, detects data types, and adds new columns when the DataFrame changes.
+* **High performance**: Uses SQLAlchemy/pyodbc's `fast_executemany=True` to speed up inserts.
+* **Data safety**: Supports transactions (commit/rollback) to guarantee data integrity.
+* **Smart sync**: The `check_and_update_table` function compares data, updating only changed rows and inserting new ones.
+* **Utilities**: Cleans numeric data (e.g., converts "1.5M" to 1,500,000).
+
+## Installation Requirements
+
+1. **Operating system**: Windows, Linux, or macOS.
+2. **Driver**: **ODBC Driver 17 for SQL Server** must be installed.
+   * [Download here (Microsoft)](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server)
+3. **Python libraries**:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+## Project Structure
+
+* `src/dbConnector.py`: Main module containing the `dbJob` class.
+* `config/db_config.yaml`: Database configuration file (create it yourself from the template).
+* `notebooks/demo_usage.ipynb`: Usage examples.
+
+## Quick Start
+
+### 1. Configure the connection
+Create `config/db_config.yaml`:
+```yaml
+db_info:
+  server: "localhost"
+  database: "MyDatabase"
+  username: "sa"
+  password: "mypassword"
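The quick start above stops at the connection config. As a rough sketch of how those values could be wired into the class this package actually ships (`SQLServerConnector` in `src/connector.py`, shown further down), assuming the installed package exposes the module as `src.connector`:

```python
# Hypothetical usage sketch -- the import path and config location are
# assumptions based on the sdist layout; only the 'db_info' keys come
# from the README's db_config.yaml example.
import yaml
from src.connector import SQLServerConnector  # assumed import path

with open("config/db_config.yaml", "r") as f:
    cfg = yaml.safe_load(f)["db_info"]

db = SQLServerConnector(
    server=cfg["server"],
    database=cfg["database"],
    username=cfg["username"],
    password=cfg["password"],
)
print(db.get_table_names())  # quick smoke test against MyDatabase
db.close()
```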
@@ -0,0 +1,38 @@
+# SQL Data Connector Project
+
+This project provides a robust Python module (`dbConnector`) for interacting with Microsoft SQL Server, optimized for Data Engineering tasks such as ETL, bulk inserts, and synchronizing (upserting) data from Pandas DataFrames.
+
+## Key Features
+
+* **Highly automated**: Automatically creates tables, detects data types, and adds new columns when the DataFrame changes.
+* **High performance**: Uses SQLAlchemy/pyodbc's `fast_executemany=True` to speed up inserts.
+* **Data safety**: Supports transactions (commit/rollback) to guarantee data integrity.
+* **Smart sync**: The `check_and_update_table` function compares data, updating only changed rows and inserting new ones.
+* **Utilities**: Cleans numeric data (e.g., converts "1.5M" to 1,500,000).
+
+## Installation Requirements
+
+1. **Operating system**: Windows, Linux, or macOS.
+2. **Driver**: **ODBC Driver 17 for SQL Server** must be installed.
+   * [Download here (Microsoft)](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server)
+3. **Python libraries**:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+## Project Structure
+
+* `src/dbConnector.py`: Main module containing the `dbJob` class.
+* `config/db_config.yaml`: Database configuration file (create it yourself from the template).
+* `notebooks/demo_usage.ipynb`: Usage examples.
+
+## Quick Start
+
+### 1. Configure the connection
+Create `config/db_config.yaml`:
+```yaml
+db_info:
+  server: "localhost"
+  database: "MyDatabase"
+  username: "sa"
+  password: "mypassword"
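The README never reaches a usage example, so here is a hedged sketch of the upsert flow, built on the `upsert_data` signature from `src/connector.py`; the table name and DataFrame contents are invented for illustration, and the import path is assumed:

```python
# Hypothetical upsert sketch; 'DailyPrices' and the sample rows are invented.
import pandas as pd
from src.connector import SQLServerConnector  # assumed import path

db = SQLServerConnector("localhost", "MyDatabase", "sa", "mypassword")

df = pd.DataFrame({
    "id_date": [20240101, 20240102],  # used as the match key below
    "Ticker": ["VNM", "FPT"],
    "Volume": ["1.5M", "2,000"],      # numeric strings get cleaned to floats
})

# On the first run the table is created (with id_date as PK); later runs
# MERGE via a staging table, updating matched rows and inserting new ones.
db.upsert_data(df, target_table="DailyPrices", primary_key="id_date")
db.dispose()
```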
@@ -0,0 +1,34 @@
+# File: pyproject.toml
+
+[build-system]
+requires = ["setuptools>=61.0"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "sqlServerConnector"
+version = "0.1.0"
+description = "A custom SQL Server Connector for ETL processes with Pandas"
+readme = "README.md"
+requires-python = ">=3.8"
+authors = [
+    { name="Nguyen Minh Son", email="nguyen.minhson1511@gmail.com" },
+]
+keywords = ["sql", "etl", "pandas", "sqlalchemy"]
+classifiers = [
+    "Programming Language :: Python :: 3",
+    "Operating System :: OS Independent",
+]
+
+# These libraries are installed automatically when users install this package
+dependencies = [
+    "pandas>=1.5.0",
+    "numpy",
+    "sqlalchemy>=2.0.0",
+    "pyodbc",
+    "pyyaml",
+    "loguru",
+    "jupyterlab"
+]
+
+[project.urls]
+"Homepage" = "https://github.com/johnnyb1509/sqlServerConnector"
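The `Requires-Dist` lines in the PKG-INFO hunks are generated from this `dependencies` list. A quick way to confirm the round trip after installing the package is the standard-library metadata API; this check is an aside, not part of the package:

```python
# Read back the installed distribution's metadata (importlib.metadata is
# in the standard library from Python 3.8, matching requires-python).
from importlib.metadata import requires, version

print(version("sqlServerConnector"))              # -> 0.1.0
for req in requires("sqlServerConnector") or []:  # mirrors the Requires-Dist lines
    print(req)
```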
@@ -0,0 +1,406 @@
+import os
+import numpy as np
+import pandas as pd
+import yaml
+from typing import List, Optional, Dict, Union, Any
+from loguru import logger
+from sqlalchemy import create_engine, inspect, text, URL
+from sqlalchemy.types import NVARCHAR, FLOAT, INTEGER, DATE, DATETIME, BIGINT
+from sqlalchemy.exc import SQLAlchemyError
+
+# ========================================================
+# SQL SERVER CONNECTOR (Standardized ETL Object)
+# Tech Stack: SQLAlchemy 2.0+, Pandas, PyODBC
+# Unicode Support: YES (Vietnamese/UTF-8)
+# ========================================================
+
+class SQLServerConnector:
+    """
+    A robust, SQLAlchemy 2.0 compliant connector for SQL Server designed for ETL processes.
+
+    Features:
+    - High-performance Upserts (Merge) using Staging Tables.
+    - Full Unicode/Vietnamese Support (NVARCHAR + UTF8).
+    - Automatic Schema Evolution (adds missing columns).
+    - Automatic Primary Key detection and creation.
+    """
+
+    def __init__(self, server: str, database: str, username: str, password: str, driver: str = 'ODBC Driver 17 for SQL Server'):
+        self.server = server
+        self.database = database
+        self.username = username
+        self.password = password
+        self.driver = driver
+
+        # Connection URL Construction
+        # CRITICAL: 'fast_executemany' is required for proper Unicode handling in bulk inserts with PyODBC
+        self.connection_url = URL.create(
+            "mssql+pyodbc",
+            query={
+                "odbc_connect": (
+                    f"DRIVER={self.driver};"
+                    f"SERVER={self.server};"
+                    f"DATABASE={self.database};"
+                    f"UID={self.username};"
+                    f"PWD={self.password};"
+                    "Charsets=UTF-8;"  # Explicitly request UTF-8
+                ),
+                "fast_executemany": "True"
+            }
+        )
+
+        # Create Engine
+        self.engine = create_engine(
+            self.connection_url,
+            pool_pre_ping=True,
+            pool_size=20,
+            max_overflow=10
+        )
+
+    def get_engine(self):
+        """Returns the SQLAlchemy engine object."""
+        return self.engine
+
+    def close(self):
+        """Alias for dispose(). Closes all connections in the pool."""
+        self.dispose()
+
+    def dispose(self):
+        """Dispose of the engine and close all connections."""
+        self.engine.dispose()
+        logger.info("Database engine disposed and connections closed.")
+
+    # ========================================================
+    # SCHEMA & METADATA METHODS
+    # ========================================================
+
+    def get_table_names(self) -> List[str]:
+        """Retrieve all table names in the database."""
+        try:
+            inspector = inspect(self.engine)
+            return inspector.get_table_names()
+        except SQLAlchemyError as e:
+            logger.error(f"Error retrieving table names: {e}")
+            return []
+
+    def check_table_exists(self, table_name: str) -> bool:
+        """Check if a specific table exists."""
+        return inspect(self.engine).has_table(table_name)
+
+    def get_primary_key(self, table_name: str) -> Optional[str]:
+        """Retrieve the primary key column name for a table."""
+        try:
+            inspector = inspect(self.engine)
+            pk_constraint = inspector.get_pk_constraint(table_name)
+            if pk_constraint and pk_constraint['constrained_columns']:
+                return pk_constraint['constrained_columns'][0]
+
+            # Fallback heuristic
+            columns = [col['name'] for col in inspector.get_columns(table_name)]
+            for candidate in ['id_date', 'id', 'ID', 'Date', 'date']:
+                if candidate in columns:
+                    return candidate
+            return None
+        except Exception as e:
+            logger.warning(f"Could not inspect PK for {table_name}: {e}")
+            return None
+
+    def get_columns_info(self, table_name: str) -> Dict[str, str]:
+        """Get column names and their SQL types."""
+        inspector = inspect(self.engine)
+        columns = inspector.get_columns(table_name)
+        return {col['name']: str(col['type']) for col in columns}
+
+    # ========================================================
+    # DATA RETRIEVAL METHODS
+    # ========================================================
+
+    def get_data(self, query_or_table: str, params: Optional[Dict] = None) -> pd.DataFrame:
+        """
+        Execute a raw SQL query or fetch a whole table.
+        Args:
+            query_or_table: SQL Query string OR Table Name.
+            params: Dictionary of parameters for the query.
+        """
+        if "SELECT" not in query_or_table.upper() and " " not in query_or_table:
+            query = text(f"SELECT * FROM {query_or_table}")
+        else:
+            query = text(query_or_table)
+
+        try:
+            with self.engine.connect() as conn:
+                df = pd.read_sql(query, conn, params=params)
+                return df
+        except Exception as e:
+            logger.error(f"Error fetching data: {e}")
+            return pd.DataFrame()
+
+    # ========================================================
+    # CORE ETL METHODS (Upsert Logic)
+    # ========================================================
+
+    def upsert_data(self,
+                    df: pd.DataFrame,
+                    target_table: str,
+                    primary_key: str = 'id_date',
+                    match_columns: Optional[List[str]] = None,
+                    auto_evolve_schema: bool = True):
+        """
+        Main ETL Function with Unicode Support.
+
+        Args:
+            df: The new data to push.
+            target_table: The SQL table name.
+            primary_key: The Database Primary Key.
+            match_columns: Columns to match on (e.g. ['Ticker', 'Date']) for detecting updates.
+            auto_evolve_schema: If True, adds missing columns to SQL automatically.
+        """
+        if df.empty:
+            logger.warning(f"Dataframe for {target_table} is empty. Skipping.")
+            return
+
+        # 1. PRE-PROCESS DATA
+        df_clean = self._sanitize_dataframe(df)
+
+        # 2. CHECK TARGET TABLE
+        if not self.check_table_exists(target_table):
+            logger.info(f"Table {target_table} does not exist. Creating new table.")
+            self._create_table_from_df(df_clean, target_table, primary_key)
+            return
+
+        # 3. SCHEMA EVOLUTION
+        if auto_evolve_schema:
+            self._sync_columns(df_clean, target_table)
+
+        # 4. DETERMINE MATCHING LOGIC
+        if match_columns:
+            join_keys = match_columns
+        elif primary_key in df_clean.columns:
+            join_keys = [primary_key]
+        else:
+            logger.error(f"CRITICAL: Primary Key '{primary_key}' is missing from DataFrame (likely Auto-Increment).")
+            logger.error("You MUST provide 'match_columns' to identify which rows to update.")
+            raise ValueError("Missing match keys for Identity Column Upsert.")
+
+        # 5. EXECUTE UPSERT VIA STAGING
+        self._execute_merge_upsert(df_clean, target_table, join_keys)
+
+    def _execute_merge_upsert(self, df: pd.DataFrame, target_table: str, join_keys: List[str]):
+        """Internal: Uploads to a temp table and runs a SQL MERGE."""
+        staging_table = f"##staging_{target_table}"
+
+        with self.engine.begin() as conn:
+            try:
+                # A. Upload to Staging
+                # IMPORTANT: We use NVARCHAR mapping implicitly here via pandas to_sql,
+                # but explicit dtype mapping is safer for Unicode preservation.
+                dtype_map = {}
+                for col in df.columns:
+                    if df[col].dtype == 'object':
+                        dtype_map[col] = NVARCHAR(None)  # Force Unicode (NVARCHAR) for all strings
+
+                df.to_sql(staging_table, conn, if_exists='replace', index=False, chunksize=5000, dtype=dtype_map)
+
+                # B. Build Dynamic SQL
+                source_cols = [col for col in df.columns]
+
+                # Join Condition: Target.Key = Source.Key AND ...
+                on_clause = " AND ".join([f"Target.[{k}] = Source.[{k}]" for k in join_keys])
+
+                # Update Clause
+                update_stmts = [f"Target.[{col}] = Source.[{col}]" for col in source_cols
+                                if col not in join_keys]
+
+                # Insert Clause
+                insert_cols_str = ", ".join([f"[{col}]" for col in source_cols])
+                insert_vals_str = ", ".join([f"Source.[{col}]" for col in source_cols])
+
+                # C. Construct MERGE Query
+                # Notice the N prefix is usually for literals, but since we are copying column-to-column
+                # from a staging table that is ALREADY NVARCHAR, we don't need N'' prefixes here.
+                if not update_stmts:
+                    merge_query = f"""
+                    MERGE [{target_table}] AS Target
+                    USING [{staging_table}] AS Source
+                    ON ({on_clause})
+                    WHEN NOT MATCHED BY TARGET THEN
+                        INSERT ({insert_cols_str}) VALUES ({insert_vals_str});
+                    """
+                else:
+                    merge_query = f"""
+                    MERGE [{target_table}] AS Target
+                    USING [{staging_table}] AS Source
+                    ON ({on_clause})
+                    WHEN MATCHED THEN
+                        UPDATE SET {", ".join(update_stmts)}
+                    WHEN NOT MATCHED BY TARGET THEN
+                        INSERT ({insert_cols_str}) VALUES ({insert_vals_str});
+                    """
+
+                conn.execute(text(merge_query))
+                logger.success(f"Upsert successful for {target_table}. Matched on {join_keys}.")
+
+                conn.execute(text(f"DROP TABLE [{staging_table}]"))
+
+            except SQLAlchemyError as e:
+                logger.error(f"Upsert failed for {target_table}: {e}")
+                raise
+
+    # ========================================================
+    # HELPER: SCHEMA & CREATION
+    # ========================================================
+
+    def _sync_columns(self, df: pd.DataFrame, table_name: str):
+        """Add missing columns to the SQL table."""
+        db_cols = self.get_columns_info(table_name)
+        existing_cols_lower = {k.lower() for k in db_cols.keys()}
+
+        new_cols = [col for col in df.columns if col.lower() not in existing_cols_lower]
+
+        if new_cols:
+            logger.info(f"Schema Evolution: Adding {len(new_cols)} new columns to {table_name}.")
+            with self.engine.connect() as conn:
+                for col in new_cols:
+                    dtype = df[col].dtype
+                    # VIETNAMESE SUPPORT: Default to NVARCHAR(MAX) for new string columns
+                    sql_type = "NVARCHAR(MAX)"
+
+                    if pd.api.types.is_integer_dtype(dtype):
+                        sql_type = "BIGINT"
+                    elif pd.api.types.is_float_dtype(dtype):
+                        sql_type = "FLOAT"
+                    elif pd.api.types.is_datetime64_any_dtype(dtype):
+                        sql_type = "DATETIME"
+
+                    try:
+                        conn.execute(text(f"ALTER TABLE [{table_name}] ADD [{col}] {sql_type} NULL"))
+                        conn.commit()
+                    except Exception as e:
+                        logger.warning(f"Failed to add column {col}: {e}")
+
+    def _create_table_from_df(self, df: pd.DataFrame, table_name: str, primary_key: Optional[str] = None):
+        """Create a new table with Unicode support (NVARCHAR)."""
+        dtype_map = {}
+        for col in df.columns:
+            # VIETNAMESE SUPPORT: Explicitly map all object columns to NVARCHAR
+            if df[col].dtype == 'object':
+                dtype_map[col] = NVARCHAR(None)  # None = MAX
+
+        df.to_sql(table_name, self.engine, index=False, dtype=dtype_map)
+
+        if primary_key:
+            if primary_key in df.columns:
+                pk_dtype = df[primary_key].dtype
+                self.set_primary_key(table_name, primary_key, source_dtype=pk_dtype)
+            else:
+                logger.warning(f"Skipping PK creation: Column '{primary_key}' not found in new data.")
+
+    def set_primary_key(self, table_name: str, column_name: str, source_dtype=None):
+        """Sets a primary key with type detection."""
+        sql_type = "INT"
+        if source_dtype is not None:
+            if pd.api.types.is_integer_dtype(source_dtype):
+                sql_type = "BIGINT"
+            elif pd.api.types.is_float_dtype(source_dtype):
+                sql_type = "BIGINT"
+            elif pd.api.types.is_string_dtype(source_dtype):
+                # VIETNAMESE SUPPORT: PKs that are strings must also be NVARCHAR
+                sql_type = "NVARCHAR(450)"
+            elif pd.api.types.is_datetime64_any_dtype(source_dtype):
+                sql_type = "DATE"
+
+        with self.engine.connect() as conn:
+            with conn.begin():
+                try:
+                    conn.execute(text(f"ALTER TABLE [{table_name}] ALTER COLUMN [{column_name}] {sql_type} NOT NULL"))
+                    conn.execute(text(f"ALTER TABLE [{table_name}] ADD PRIMARY KEY ([{column_name}])"))
+                    logger.info(f"Primary key set on {table_name}.{column_name}")
+                except SQLAlchemyError as e:
+                    logger.error(f"Failed to set PK on {table_name}: {e}")
+
+    # ========================================================
+    # HELPER: DATA CLEANING
+    # ========================================================
+
+    def _sanitize_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
+        """Cleans numeric strings, NaT, and NaN values."""
+        df = df.copy()
+
+        # 1. Clean Numeric Strings
+        for col in df.select_dtypes(include=['object']).columns:
+            # Only try to convert to float if it looks like a number (digits or currency symbols)
+            # Avoid trying to convert Vietnamese text columns
+            sample = df[col].dropna().head(10).astype(str).tolist()
+            if any(any(char.isdigit() for char in str(x)) for x in sample):
+                try:
+                    # Attempt conversion, but strictly ignore errors so we don't break text columns
+                    # We use a temp series to check if conversion is successful for majority
+                    temp = df[col].apply(self._clean_numeric_string)
+                    # If the column was actually text (e.g., "Hà Nội"), _clean_numeric_string returns the original string.
+                    # We trust _clean_numeric_string to be safe.
+                    df[col] = temp
+                except:
+                    pass
+
+        # 2. Clean Dates (NaT -> None)
+        for col in df.select_dtypes(include=['datetime', 'datetimetz']).columns:
+            df[col] = df[col].replace({pd.NaT: None})
+            df[col] = df[col].astype(object).where(df[col].notnull(), None)
+
+        # 3. Clean NaN -> None
+        df = df.replace({np.nan: None})
+        df = df.where(pd.notnull(df), None)
+        return df
+
+    @staticmethod
+    def _clean_numeric_string(value):
+        """Convert '2.5B', '1,000' to float. Safe for Vietnamese text."""
+        if pd.isna(value) or value is None: return None
+        if isinstance(value, (int, float)): return value
+
+        s = str(value).strip().upper()
+        if not s: return None
+
+        # Heuristic: If it contains many letters (excluding K,M,B,T for billions), it's probably text
+        # Count letters
+        alpha_count = sum(c.isalpha() for c in s)
+        if alpha_count > 1 and s[-1] not in ['K', 'M', 'B', 'T']:
+            return value  # It's likely text (e.g. "Cổ phiếu")
+
+        # Clean common financial chars
+        s_clean = s.replace(',', '').replace('%', '')
+
+        multipliers = {'K': 1e3, 'M': 1e6, 'B': 1e9, 'T': 1e12}
+        if s_clean and s_clean[-1] in multipliers:
+            try:
+                return float(s_clean[:-1]) * multipliers[s_clean[-1]]
+            except ValueError:
+                return value
+
+        try:
+            return float(s_clean)
+        except ValueError:
+            return value
+
+# ========================================================
+# ENTRY POINT
+# ========================================================
+
+# def get_db_connector(yaml_path: Optional[str] = None, env_prefix: str = "DB") -> SQLServerConnector:
+#     """Factory function to initialize connector."""
+#     if yaml_path and os.path.exists(yaml_path):
+#         with open(yaml_path, 'r') as f:
+#             config = yaml.safe_load(f).get('db_info', {})
+#         return SQLServerConnector(
+#             config.get('server'),
+#             config.get('database'),
+#             config.get('username'),
+#             config.get('password')
+#         )
+#     else:
+#         return SQLServerConnector(
+#             os.environ.get(f'{env_prefix}_SERVER'),
+#             os.environ.get(f'{env_prefix}_NAME'),
+#             os.environ.get(f'{env_prefix}_USER'),
+#             os.environ.get(f'{env_prefix}_PASS')
+#         )
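To make the cleaning heuristic above concrete, here is a small sketch of what `_clean_numeric_string` returns for the kinds of values the README mentions; calling the private static method directly is purely for illustration, and the import path is assumed:

```python
from src.connector import SQLServerConnector  # assumed import path

clean = SQLServerConnector._clean_numeric_string  # a @staticmethod, callable on the class
print(clean("1.5M"))      # 1500000.0 -- 'M' multiplier is 1e6
print(clean("1,000"))     # 1000.0    -- thousands separator stripped
print(clean("12.5%"))     # 12.5      -- '%' stripped, value not rescaled
print(clean("Cổ phiếu"))  # 'Cổ phiếu' -- multi-letter text passes through untouched
print(clean(None))        # None
```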
@@ -0,0 +1,57 @@
+Metadata-Version: 2.4
+Name: sqlServerConnector
+Version: 0.1.0
+Summary: A custom SQL Server Connector for ETL processes with Pandas
+Author-email: Nguyen Minh Son <nguyen.minhson1511@gmail.com>
+Project-URL: Homepage, https://github.com/johnnyb1509/sqlServerConnector
+Keywords: sql,etl,pandas,sqlalchemy
+Classifier: Programming Language :: Python :: 3
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+Requires-Dist: pandas>=1.5.0
+Requires-Dist: numpy
+Requires-Dist: sqlalchemy>=2.0.0
+Requires-Dist: pyodbc
+Requires-Dist: pyyaml
+Requires-Dist: loguru
+Requires-Dist: jupyterlab
+
+# SQL Data Connector Project
+
+This project provides a robust Python module (`dbConnector`) for interacting with Microsoft SQL Server, optimized for Data Engineering tasks such as ETL, bulk inserts, and synchronizing (upserting) data from Pandas DataFrames.
+
+## Key Features
+
+* **Highly automated**: Automatically creates tables, detects data types, and adds new columns when the DataFrame changes.
+* **High performance**: Uses SQLAlchemy/pyodbc's `fast_executemany=True` to speed up inserts.
+* **Data safety**: Supports transactions (commit/rollback) to guarantee data integrity.
+* **Smart sync**: The `check_and_update_table` function compares data, updating only changed rows and inserting new ones.
+* **Utilities**: Cleans numeric data (e.g., converts "1.5M" to 1,500,000).
+
+## Installation Requirements
+
+1. **Operating system**: Windows, Linux, or macOS.
+2. **Driver**: **ODBC Driver 17 for SQL Server** must be installed.
+   * [Download here (Microsoft)](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server)
+3. **Python libraries**:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+## Project Structure
+
+* `src/dbConnector.py`: Main module containing the `dbJob` class.
+* `config/db_config.yaml`: Database configuration file (create it yourself from the template).
+* `notebooks/demo_usage.ipynb`: Usage examples.
+
+## Quick Start
+
+### 1. Configure the connection
+Create `config/db_config.yaml`:
+```yaml
+db_info:
+  server: "localhost"
+  database: "MyDatabase"
+  username: "sa"
+  password: "mypassword"
@@ -0,0 +1,9 @@
+README.md
+pyproject.toml
+src/__init__.py
+src/connector.py
+src/sqlServerConnector.egg-info/PKG-INFO
+src/sqlServerConnector.egg-info/SOURCES.txt
+src/sqlServerConnector.egg-info/dependency_links.txt
+src/sqlServerConnector.egg-info/requires.txt
+src/sqlServerConnector.egg-info/top_level.txt
@@ -0,0 +1 @@
+