PyPI - sibi-dst - Versions diffs - 0.3.40__tar.gz → 0.3.43__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (65)

sibi_dst-0.3.43/PKG-INFO ADDED

@@ -0,0 +1,194 @@
+Metadata-Version: 2.1
+Name: sibi-dst
+Version: 0.3.43
+Summary: Data Science Toolkit
+Author: Luis Valverde
+Author-email: lvalverdeb@gmail.com
+Requires-Python: >=3.11,<4.0
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Requires-Dist: apache-airflow-client (>=2.10.0,<3.0.0)
+Requires-Dist: chardet (>=5.2.0,<6.0.0)
+Requires-Dist: charset-normalizer (>=3.4.0,<4.0.0)
+Requires-Dist: clickhouse-connect (>=0.8.7,<0.9.0)
+Requires-Dist: clickhouse-driver (>=0.2.9,<0.3.0)
+Requires-Dist: dask-expr (>=1.1.20,<2.0.0)
+Requires-Dist: dask[complete] (>=2024.11.1,<2025.0.0)
+Requires-Dist: django (>=5.1.4,<6.0.0)
+Requires-Dist: djangorestframework (>=3.15.2,<4.0.0)
+Requires-Dist: folium (>=0.19.4,<0.20.0)
+Requires-Dist: geopandas (>=1.0.1,<2.0.0)
+Requires-Dist: gunicorn (>=23.0.0,<24.0.0)
+Requires-Dist: httpx (>=0.27.2,<0.28.0)
+Requires-Dist: ipython (>=8.29.0,<9.0.0)
+Requires-Dist: jinja2 (>=3.1.4,<4.0.0)
+Requires-Dist: mysqlclient (>=2.2.6,<3.0.0)
+Requires-Dist: nltk (>=3.9.1,<4.0.0)
+Requires-Dist: openpyxl (>=3.1.5,<4.0.0)
+Requires-Dist: osmnx (>=2.0.1,<3.0.0)
+Requires-Dist: pandas (>=2.2.3,<3.0.0)
+Requires-Dist: paramiko (>=3.5.0,<4.0.0)
+Requires-Dist: psutil (>=6.1.0,<7.0.0)
+Requires-Dist: psycopg2 (>=2.9.10,<3.0.0)
+Requires-Dist: pyarrow (>=18.0.0,<19.0.0)
+Requires-Dist: pydantic (>=2.9.2,<3.0.0)
+Requires-Dist: pymysql (>=1.1.1,<2.0.0)
+Requires-Dist: pytest (>=8.3.3,<9.0.0)
+Requires-Dist: pytest-mock (>=3.14.0,<4.0.0)
+Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
+Requires-Dist: s3fs (>=2024.12.0,<2025.0.0)
+Requires-Dist: sqlalchemy (>=2.0.36,<3.0.0)
+Requires-Dist: tornado (>=6.4.1,<7.0.0)
+Requires-Dist: tqdm (>=4.67.0,<5.0.0)
+Requires-Dist: uvicorn (>=0.34.0,<0.35.0)
+Requires-Dist: uvicorn-worker (>=0.3.0,<0.4.0)
+Description-Content-Type: text/markdown
+# sibi-dst
+Data Science Toolkit
+---------------------
+Data Science Toolkit built with Python, Pandas, Dask, OpenStreetMaps, Scikit-Learn, XGBOOST, Django ORM, SQLAlchemy, DjangoRestFramework, FastAPI
+Major Functionality
+--------------------
+1) **Build DataCubes, DataSets, and DataObjects** from diverse data sources, including **relational databases, Parquet files, Excel (`.xlsx`), delimited tables (`.csv`, `.tsv`), JSON, and RESTful APIs (`JSON API REST`)**.
+2) **Comprehensive DataFrame Management** utilities for efficient data handling, transformation, and optimization using **Pandas** and **Dask**.
+3) **Flexible Data Sharing** with client applications by writing to **Data Warehouses, local filesystems, and cloud storage platforms** such as **Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage**.
+4) **Microservices for Data Access** – Build scalable **API-driven services** using **RESTful APIs (`Django REST Framework`, `FastAPI`) and gRPC** for high-performance data exchange.
+Supported Technologies
+--------------------
+- **Data Processing**: Pandas, Dask
+- **Machine Learning**: Scikit-Learn, XGBoost
+- **Databases & Storage**: SQLAlchemy, Django ORM, Parquet, Amazon S3, GCS, Azure Blob Storage
+- **Mapping & Geospatial Analysis**: OpenStreetMaps, OSMnx, Geopy
+- **API Development**: Django REST Framework, gRPC
+Installation
+---------------------
+```bash
+pip install sibi-dst
+```
+Usage
+---------------------
+### Loading Data from SQLAlchemy
+```python
+from sibi_dst.df_helper import DfHelper
+from conf.transforms.fields.crm import customer_fields
+from conf.credentials import replica_db_conf
+from conf.storage import get_fs_instance
+config = {
+    'backend': 'sqlalchemy',
+    'connection_url': replica_db_conf.get('db_url'),
+    'table': 'crm_clientes_archivo',
+    'field_map': customer_fields,
+    'legacy_filters': True,
+    'fs': get_fs_instance()
+}
+df_helper = DfHelper(**config)
+result = df_helper.load(id__gte=1)
+```
+### Saving Data to ClickHouse
+```python
+clk_creds = {
+    'host': '192.168.3.171',
+    'port': 18123,
+    'user': 'username',
+    'database': 'xxxxxxx',
+    'table': 'customer_file',
+    'order_by': 'id'
+}
+df_helper.save_to_clickhouse(**clk_creds)
+```
+### Saving Data to Parquet
+```python
+df_helper.save_to_parquet(
+    parquet_filename='filename.parquet',
+    parquet_storage_path='/path/to/my/files/'
+)
+```
+Backends Supported
+---------------------
+| Backend       | Description |
+|--------------|-------------|
+| `sqlalchemy` | Load data from SQL databases using SQLAlchemy. |
+| `django_db`  | Load data from Django ORM models. |
+| `parquet`    | Load and save data from Parquet files. |
+| `http`       | Fetch data from HTTP endpoints. |
+| `osmnx`      | Geospatial mapping and routing using OpenStreetMap. |
+| `geopy`      | Geolocation services for address lookup and reverse geocoding. |
+Geospatial Utilities
+---------------------
+### **OSMnx Helper (`sibi_dst.osmnx_helper`)
+**
+Provides **OpenStreetMap-based mapping utilities** using `osmnx` and `folium`.
+#### 🔹 Key Features
+- **BaseOsmMap**: Manages interactive Folium-based maps.
+- **PBFHandler**: Loads `.pbf` (Protocolbuffer Binary Format) files for network graphs.
+#### Example: Generating an OSM Map
+```python
+from sibi_dst.osmnx_helper import BaseOsmMap
+osm_map = BaseOsmMap(osmnx_graph=my_graph, df=my_dataframe)
+osm_map.generate_map()
+```
+### **Geopy Helper (`sibi_dst.geopy_helper`)
+**
+Provides **geolocation services** using `Geopy` for forward and reverse geocoding.
+#### 🔹 Key Features
+- **GeolocationService**: Interfaces with `Nominatim` API for geocoding.
+- **Error Handling**: Manages `GeocoderTimedOut` and `GeocoderServiceError` gracefully.
+- **Singleton Geolocator**: Efficiently reuses a global geolocator instance.
+#### Example: Reverse Geocoding
+```python
+from sibi_dst.geopy_helper import GeolocationService
+gs = GeolocationService()
+location = gs.reverse_geocode(lat=9.935, lon=-84.091)
+print(location)
+```
+Advanced Features
+---------------------
+### Querying with Custom Filters
+Filters can be applied dynamically using Django-style syntax:
+```python
+result = df_helper.load(date__gte='2023-01-01', status='active')
+```
+### Parallel Processing
+Leverage Dask for parallel execution:
+```python
+results = df_helper.load_parallel(status='active')
+```
+Testing
+---------------------
+To run unit tests, use:
+```bash
+pytest tests/
+```
+Contributing
+---------------------
+Contributions are welcome! Please submit pull requests or open issues for discussions.
+License
+---------------------
+sibi-dst is licensed under the MIT License.
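
The README above shows Django-style keyword filters such as `date__gte='2023-01-01'` but does not show how they are interpreted. The following is a minimal sketch of the general idea, using plain dictionaries instead of a database. The names `LOOKUPS`, `parse_filter`, and `apply_filters` are hypothetical and are not part of sibi-dst; the package's own filter handling may differ.

```python
import operator
from typing import Any, Callable

# Hypothetical lookup table: suffix after "__" -> comparison function.
LOOKUPS: dict[str, Callable[[Any, Any], bool]] = {
    "exact": operator.eq,
    "gte": operator.ge,
    "gt": operator.gt,
    "lte": operator.le,
    "lt": operator.lt,
}


def parse_filter(key: str) -> tuple[str, str]:
    """Split 'date__gte' into ('date', 'gte'); a bare 'status' means 'exact'."""
    field, sep, lookup = key.rpartition("__")
    if sep and lookup in LOOKUPS:
        return field, lookup
    return key, "exact"


def apply_filters(rows: list[dict[str, Any]], **filters: Any) -> list[dict[str, Any]]:
    """Keep the rows that satisfy every filter (filters are ANDed together)."""
    parsed = [(*parse_filter(k), v) for k, v in filters.items()]
    return [
        row for row in rows
        if all(LOOKUPS[lookup](row[field], value) for field, lookup, value in parsed)
    ]


rows = [
    {"id": 1, "date": "2022-12-31", "status": "active"},
    {"id": 2, "date": "2023-01-01", "status": "active"},
    {"id": 3, "date": "2023-06-15", "status": "inactive"},
]
result = apply_filters(rows, date__gte="2023-01-01", status="active")
print([row["id"] for row in result])  # prints [2]
```

ISO-formatted date strings compare correctly as plain strings, which is why the sketch needs no date parsing.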

sibi_dst-0.3.43/README.md ADDED

@@ -0,0 +1,145 @@
+# sibi-dst
+Data Science Toolkit
+---------------------
+Data Science Toolkit built with Python, Pandas, Dask, OpenStreetMaps, Scikit-Learn, XGBOOST, Django ORM, SQLAlchemy, DjangoRestFramework, FastAPI
+Major Functionality
+--------------------
+1) **Build DataCubes, DataSets, and DataObjects** from diverse data sources, including **relational databases, Parquet files, Excel (`.xlsx`), delimited tables (`.csv`, `.tsv`), JSON, and RESTful APIs (`JSON API REST`)**.
+2) **Comprehensive DataFrame Management** utilities for efficient data handling, transformation, and optimization using **Pandas** and **Dask**.
+3) **Flexible Data Sharing** with client applications by writing to **Data Warehouses, local filesystems, and cloud storage platforms** such as **Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage**.
+4) **Microservices for Data Access** – Build scalable **API-driven services** using **RESTful APIs (`Django REST Framework`, `FastAPI`) and gRPC** for high-performance data exchange.
+Supported Technologies
+--------------------
+- **Data Processing**: Pandas, Dask
+- **Machine Learning**: Scikit-Learn, XGBoost
+- **Databases & Storage**: SQLAlchemy, Django ORM, Parquet, Amazon S3, GCS, Azure Blob Storage
+- **Mapping & Geospatial Analysis**: OpenStreetMaps, OSMnx, Geopy
+- **API Development**: Django REST Framework, gRPC
+Installation
+---------------------
+```bash
+pip install sibi-dst
+```
+Usage
+---------------------
+### Loading Data from SQLAlchemy
+```python
+from sibi_dst.df_helper import DfHelper
+from conf.transforms.fields.crm import customer_fields
+from conf.credentials import replica_db_conf
+from conf.storage import get_fs_instance
+config = {
+    'backend': 'sqlalchemy',
+    'connection_url': replica_db_conf.get('db_url'),
+    'table': 'crm_clientes_archivo',
+    'field_map': customer_fields,
+    'legacy_filters': True,
+    'fs': get_fs_instance()
+}
+df_helper = DfHelper(**config)
+result = df_helper.load(id__gte=1)
+```
+### Saving Data to ClickHouse
+```python
+clk_creds = {
+    'host': '192.168.3.171',
+    'port': 18123,
+    'user': 'username',
+    'database': 'xxxxxxx',
+    'table': 'customer_file',
+    'order_by': 'id'
+}
+df_helper.save_to_clickhouse(**clk_creds)
+```
+### Saving Data to Parquet
+```python
+df_helper.save_to_parquet(
+    parquet_filename='filename.parquet',
+    parquet_storage_path='/path/to/my/files/'
+)
+```
+Backends Supported
+---------------------
+| Backend       | Description |
+|--------------|-------------|
+| `sqlalchemy` | Load data from SQL databases using SQLAlchemy. |
+| `django_db`  | Load data from Django ORM models. |
+| `parquet`    | Load and save data from Parquet files. |
+| `http`       | Fetch data from HTTP endpoints. |
+| `osmnx`      | Geospatial mapping and routing using OpenStreetMap. |
+| `geopy`      | Geolocation services for address lookup and reverse geocoding. |
+Geospatial Utilities
+---------------------
+### **OSMnx Helper (`sibi_dst.osmnx_helper`)
+**
+Provides **OpenStreetMap-based mapping utilities** using `osmnx` and `folium`.
+#### 🔹 Key Features
+- **BaseOsmMap**: Manages interactive Folium-based maps.
+- **PBFHandler**: Loads `.pbf` (Protocolbuffer Binary Format) files for network graphs.
+#### Example: Generating an OSM Map
+```python
+from sibi_dst.osmnx_helper import BaseOsmMap
+osm_map = BaseOsmMap(osmnx_graph=my_graph, df=my_dataframe)
+osm_map.generate_map()
+```
+### **Geopy Helper (`sibi_dst.geopy_helper`)
+**
+Provides **geolocation services** using `Geopy` for forward and reverse geocoding.
+#### 🔹 Key Features
+- **GeolocationService**: Interfaces with `Nominatim` API for geocoding.
+- **Error Handling**: Manages `GeocoderTimedOut` and `GeocoderServiceError` gracefully.
+- **Singleton Geolocator**: Efficiently reuses a global geolocator instance.
+#### Example: Reverse Geocoding
+```python
+from sibi_dst.geopy_helper import GeolocationService
+gs = GeolocationService()
+location = gs.reverse_geocode(lat=9.935, lon=-84.091)
+print(location)
+```
+Advanced Features
+---------------------
+### Querying with Custom Filters
+Filters can be applied dynamically using Django-style syntax:
+```python
+result = df_helper.load(date__gte='2023-01-01', status='active')
+```
+### Parallel Processing
+Leverage Dask for parallel execution:
+```python
+results = df_helper.load_parallel(status='active')
+```
+Testing
+---------------------
+To run unit tests, use:
+```bash
+pytest tests/
+```
+Contributing
+---------------------
+Contributions are welcome! Please submit pull requests or open issues for discussions.
+License
+---------------------
+sibi-dst is licensed under the MIT License.
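
The Geopy helper section lists a singleton geolocator and graceful handling of `GeocoderTimedOut` and `GeocoderServiceError`, without showing either. A rough sketch of that pattern follows. So that it runs without geopy installed, the exception classes and `FakeGeolocator` below are stand-ins, and `get_geolocator` and `reverse_geocode` are hypothetical names; none of this is the code of `sibi_dst.geopy_helper`.

```python
from functools import lru_cache
from typing import Optional


class GeocoderServiceError(Exception):
    """Stand-in for geopy.exc.GeocoderServiceError so the sketch runs without geopy."""


class GeocoderTimedOut(GeocoderServiceError):
    """Stand-in for geopy.exc.GeocoderTimedOut."""


class FakeGeolocator:
    """Hypothetical geolocator; a real one would call a geocoding service."""

    def reverse(self, query: tuple[float, float]) -> str:
        lat, lon = query
        if not -90 <= lat <= 90:
            raise GeocoderServiceError(f"invalid latitude: {lat}")
        return f"address near ({lat}, {lon})"


@lru_cache(maxsize=1)
def get_geolocator() -> FakeGeolocator:
    """Create the geolocator once and hand back the same instance afterwards."""
    return FakeGeolocator()


def reverse_geocode(lat: float, lon: float) -> Optional[str]:
    """Return an address string, or None when the lookup fails."""
    try:
        return get_geolocator().reverse((lat, lon))
    except (GeocoderTimedOut, GeocoderServiceError):
        # Swallow service-level failures so callers can test for None.
        return None


print(reverse_geocode(9.935, -84.091))
print(reverse_geocode(123.0, 0.0))
```

Returning `None` keeps batch geocoding loops simple, at the cost of hiding why a lookup failed; logging the exception before returning is a common refinement.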

{sibi_dst-0.3.40 → sibi_dst-0.3.43}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "sibi-dst"
-version = "0.3.40"
+version = "0.3.43"
 description = "Data Science Toolkit"
 authors = ["Luis Valverde <lvalverdeb@gmail.com>"]
 readme = "README.md"

{sibi_dst-0.3.40 → sibi_dst-0.3.43}/sibi_dst/df_helper/__init__.py RENAMED

@@ -3,11 +3,13 @@ from __future__ import annotations
 from ._df_helper import DfHelper
 from ._parquet_artifact import ParquetArtifact
 from ._parquet_reader import ParquetReader
+from ._artifact_updater_multi_wrapper import ArtifactUpdaterMultiWrapper
 #from .data_cleaner import DataCleaner
 __all__ = [
     'DfHelper',
     'ParquetArtifact',
     'ParquetReader',
+    'ArtifactUpdaterMultiWrapper',
     #'DataCleaner'
 ]
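
The newly exported `ArtifactUpdaterMultiWrapper` queues artifacts in an `asyncio.PriorityQueue` through its `prioritize_tasks` method. That method stores the raw complexity estimate as the priority, so the smallest estimate is dequeued first, even though a comment in the method talks about inverting it. The sketch below shows both orderings side by side. `PrioritizedItem` mirrors the class in this release; `drain_order` and the sample sizes are hypothetical.

```python
import asyncio
from functools import total_ordering


@total_ordering
class PrioritizedItem:
    """Same shape as the class in the diff: ordering looks only at `priority`."""

    def __init__(self, priority, artifact):
        self.priority = priority
        self.artifact = artifact

    def __lt__(self, other):
        return self.priority < other.priority

    def __eq__(self, other):
        return self.priority == other.priority


def drain_order(sizes: dict[str, int], largest_first: bool) -> list[str]:
    """Queue hypothetical artifacts by size estimate and return dequeue order."""
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    for name, size in sizes.items():
        # PriorityQueue pops the smallest priority first, so negate to invert.
        priority = -size if largest_first else size
        queue.put_nowait(PrioritizedItem(priority, name))
    order = []
    while not queue.empty():
        order.append(queue.get_nowait().artifact)
    return order


sizes = {"small": 10, "large": 500, "medium": 120}
print(drain_order(sizes, largest_first=False))  # ['small', 'medium', 'large']
print(drain_order(sizes, largest_first=True))   # ['large', 'medium', 'small']
```

Scheduling the largest artifacts first tends to shorten total wall-clock time when workers run in parallel, which may be what the comment in `prioritize_tasks` intended.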

sibi_dst-0.3.43/sibi_dst/df_helper/_artifact_updater_multi_wrapper.py ADDED

@@ -0,0 +1,262 @@
+import asyncio
+import logging
+import datetime
+import psutil
+import time
+from functools import total_ordering
+from collections import defaultdict
+from contextlib import asynccontextmanager
+import signal
+from sibi_dst.utils import Logger
+@total_ordering
+class PrioritizedItem:
+    def __init__(self, priority, artifact):
+        self.priority = priority
+        self.artifact = artifact
+    def __lt__(self, other):
+        return self.priority < other.priority
+    def __eq__(self, other):
+        return self.priority == other.priority
+class ArtifactUpdaterMultiWrapper:
+    def __init__(self, wrapped_classes=None, debug=False, **kwargs):
+        self.wrapped_classes = wrapped_classes or {}
+        self.debug = debug
+        self.logger = kwargs.setdefault('logger',Logger.default_logger(logger_name=self.__class__.__name__))
+        self.logger.set_level(logging.DEBUG if debug else logging.INFO)
+        today = datetime.datetime.today()
+        self.today_str = today.strftime('%Y-%m-%d')
+        self.current_year_starts_on_str = datetime.date(today.year, 1, 1).strftime('%Y-%m-%d')
+        self.parquet_start_date = kwargs.get('parquet_start_date', self.current_year_starts_on_str)
+        self.parquet_end_date = kwargs.get('parquet_end_date', self.today_str)
+        # track concurrency and locks
+        self.locks = {}
+        self.worker_heartbeat = defaultdict(float)
+        # graceful shutdown handling
+        loop = asyncio.get_event_loop()
+        self.register_signal_handlers(loop)
+        # dynamic scaling config
+        self.min_workers = kwargs.get('min_workers', 1)
+        self.max_workers = kwargs.get('max_workers', 8)
+        self.memory_per_worker_gb = kwargs.get('memory_per_worker_gb', 1)  # default 2GB per worker
+        self.monitor_interval = kwargs.get('monitor_interval', 10)  # default monitor interval in seconds
+        self.retry_attempts = kwargs.get('retry_attempts', 3)
+        self.update_timeout_seconds = kwargs.get('update_timeout_seconds', 600)
+        self.lock_acquire_timeout_seconds = kwargs.get('lock_acquire_timeout_seconds', 10)
+    def register_signal_handlers(self, loop):
+        for sig in (signal.SIGINT, signal.SIGTERM):
+            loop.add_signal_handler(sig, lambda: asyncio.create_task(self.shutdown()))
+    async def shutdown(self):
+        self.logger.info("Shutdown signal received. Cleaning up...")
+        tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
+        [task.cancel() for task in tasks]
+        await asyncio.gather(*tasks, return_exceptions=True)
+        self.logger.info("Shutdown complete.")
+    def get_lock_for_artifact(self, artifact):
+        artifact_key = artifact.__class__.__name__
+        if artifact_key not in self.locks:
+            self.locks[artifact_key] = asyncio.Lock()
+        return self.locks[artifact_key]
+    def get_artifacts(self, data_type):
+        if data_type not in self.wrapped_classes:
+            raise ValueError(f"Unsupported data type: {data_type}")
+        return [
+            artifact_class(
+                parquet_start_date=self.parquet_start_date,
+                parquet_end_date=self.parquet_end_date,
+                logger=self.logger,
+                debug=self.debug
+            )
+            for artifact_class in self.wrapped_classes[data_type]
+        ]
+    def estimate_complexity(self, artifact):
+        try:
+            if hasattr(artifact, 'get_size_estimate'):
+                return artifact.get_size_estimate()
+        except Exception as e:
+            self.logger.warning(f"Failed to estimate complexity for {artifact}: {e}")
+        return 1  # default
+    def prioritize_tasks(self, artifacts):
+        queue = asyncio.PriorityQueue()
+        for artifact in artifacts:
+            complexity = self.estimate_complexity(artifact)
+            # we invert the complexity to ensure higher complexity -> higher priority
+            # if you want high complexity first, store negative complexity in the priority queue
+            # or if the smaller number means earlier processing, just keep as is
+            queue.put_nowait(PrioritizedItem(complexity, artifact))
+        return queue
+    async def resource_monitor(self, queue, workers):
+        """Monitor system resources and adjust worker count while queue is not empty."""
+        while True:
+            # break if queue done
+            if queue.empty():
+                await asyncio.sleep(0.5)
+                if queue.empty():
+                    break
+            try:
+                available_memory = psutil.virtual_memory().available
+                worker_memory_bytes = self.memory_per_worker_gb * (1024 ** 3)
+                max_workers_by_memory = available_memory // worker_memory_bytes
+                # figure out how many workers we can sustain
+                # note: we also cap by self.max_workers
+                optimal_workers = min(psutil.cpu_count(), max_workers_by_memory, self.max_workers)
+                # ensure at least self.min_workers is used
+                optimal_workers = max(self.min_workers, optimal_workers)
+                current_worker_count = len(workers)
+                if optimal_workers > current_worker_count:
+                    # we can add more workers if queue is not empty
+                    diff = optimal_workers - current_worker_count
+                    for _ in range(diff):
+                        worker_id = len(workers)
+                        # create a new worker
+                        w = asyncio.create_task(self.worker(queue, worker_id))
+                        workers.append(w)
+                        self.logger.info(f"Added worker {worker_id}. Total workers: {len(workers)}")
+                elif optimal_workers < current_worker_count:
+                    # remove some workers
+                    diff = current_worker_count - optimal_workers
+                    for _ in range(diff):
+                        w = workers.pop()
+                        w.cancel()
+                        self.logger.info(f"Removed a worker. Total workers: {len(workers)}")
+                await asyncio.sleep(self.monitor_interval)
+            except asyncio.CancelledError:
+                # monitor is being shut down
+                break
+            except Exception as e:
+                self.logger.error(f"Error in resource_monitor: {e}")
+                await asyncio.sleep(self.monitor_interval)
+    @asynccontextmanager
+    async def artifact_lock(self, artifact):
+        lock = self.get_lock_for_artifact(artifact)
+        try:
+            await asyncio.wait_for(lock.acquire(), timeout=self.lock_acquire_timeout_seconds)
+            yield
+        except asyncio.TimeoutError:
+            self.logger.error(f"Timeout acquiring lock for artifact: {artifact.__class__.__name__}")
+            yield  # continue but no actual lock was acquired
+        finally:
+            if lock.locked():
+                lock.release()
+    async def async_update_artifact(self, artifact, **kwargs):
+        for attempt in range(self.retry_attempts):
+            try:
+                async with self.artifact_lock(artifact):
+                    self.logger.info(
+                        f"Updating artifact: {artifact.__class__.__name__}, Attempt: {attempt + 1} of {self.retry_attempts}" )
+                    start_time = time.time()
+                    await asyncio.wait_for(
+                        asyncio.to_thread(artifact.update_parquet, **kwargs),
+                        timeout=self.update_timeout_seconds
+                    )
+                    elapsed_time = time.time() - start_time
+                    self.logger.info(
+                        f"Successfully updated artifact: {artifact.__class__.__name__} in {elapsed_time:.2f}s." )
+                    return
+            except asyncio.TimeoutError:
+                self.logger.error(f"Timeout updating artifact {artifact.__class__.__name__}, Attempt: {attempt + 1}")
+            except Exception as e:
+                self.logger.error(
+                    f"Error updating artifact {artifact.__class__.__name__}, Attempt: {attempt + 1}: {e}" )
+            # exponential backoff
+            await asyncio.sleep(2 ** attempt)
+        self.logger.error(f"All retry attempts failed for artifact: {artifact.__class__.__name__}")
+    async def worker(self, queue, worker_id, **kwargs):
+        """A worker that dynamically pulls tasks from the queue."""
+        while True:
+            try:
+                prioritized_item = await queue.get()
+                if prioritized_item is None:
+                    break
+                artifact = prioritized_item.artifact
+                # heartbeat
+                self.worker_heartbeat[worker_id] = time.time()
+                await self.async_update_artifact(artifact, **kwargs)
+            except asyncio.CancelledError:
+                self.logger.info(f"Worker {worker_id} shutting down gracefully.")
+                break
+            except Exception as e:
+                self.logger.error(f"Error in worker {worker_id}: {e}")
+            finally:
+                queue.task_done()
+    async def process_tasks(self, queue, initial_workers, **kwargs):
+        """Start a set of workers and a resource monitor to dynamically adjust them."""
+        # create initial workers
+        workers = []
+        for worker_id in range(initial_workers):
+            w = asyncio.create_task(self.worker(queue, worker_id, **kwargs))
+            workers.append(w)
+        # start resource monitor
+        monitor_task = asyncio.create_task(self.resource_monitor(queue, workers))
+        # wait until queue is done
+        try:
+            await queue.join()
+        finally:
+            # cancel resource monitor
+            monitor_task.cancel()
+            # all workers done
+            for w in workers:
+                w.cancel()
+            await asyncio.gather(*workers, return_exceptions=True)
+    async def update_data(self, data_type, **kwargs):
+        self.logger.info(f"Processing wrapper group: {data_type} with {kwargs}")
+        artifacts = self.get_artifacts(data_type)
+        queue = self.prioritize_tasks(artifacts)
+        # compute initial worker count (this can be low if memory is low initially)
+        initial_workers = self.calculate_initial_workers(len(artifacts))
+        self.logger.info(f"Initial worker count: {initial_workers} for {len(artifacts)} artifacts")
+        total_start_time = time.time()
+        await self.process_tasks(queue, initial_workers, **kwargs)
+        total_time = time.time() - total_start_time
+        self.logger.info(f"Total processing time: {total_time:.2f} seconds.")
+    def calculate_initial_workers(self, artifact_count: int) -> int:
+        """Compute the initial number of workers before resource_monitor can adjust."""
+        self.logger.info("Calculating initial worker count...")
+        available_memory = psutil.virtual_memory().available
+        self.logger.info(f"Available memory: {available_memory / (1024 ** 3):.2f} GB")
+        worker_memory_bytes = self.memory_per_worker_gb * (1024 ** 3)
+        self.logger.info(f"Memory per worker: {worker_memory_bytes / (1024 ** 3):.2f} GB")
+        max_workers_by_memory = available_memory // worker_memory_bytes
+        self.logger.info(f"Max workers by memory: {max_workers_by_memory}")
+        # also consider CPU count and artifact_count
+        initial = min(psutil.cpu_count(), max_workers_by_memory, artifact_count, self.max_workers)
+        self.logger.info(f"Optimal workers: {initial} CPU: {psutil.cpu_count()} Max Workers: {self.max_workers}")
+        return max(self.min_workers, initial)
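
One detail in `artifact_lock` above deserves a closer look. Its `finally` clause releases the lock whenever `lock.locked()` is true, which also holds when a different task owns the lock after this one timed out. In addition, the first `yield` sits inside the `try`, so an `asyncio.TimeoutError` raised by the body (for example from the update's own `wait_for`) appears to reach the second `yield`, which `asynccontextmanager` reports as a `RuntimeError`. A possible alternative, sketched here under the assumption that proceeding without the lock is still the desired fallback, tracks whether this context actually acquired the lock. `lock_with_timeout` and `demo` are hypothetical names, not part of the package.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def lock_with_timeout(lock: asyncio.Lock, timeout: float):
    """Yield True if `lock` was acquired within `timeout` seconds, else False."""
    acquired = False
    try:
        await asyncio.wait_for(lock.acquire(), timeout=timeout)
        acquired = True
    except asyncio.TimeoutError:
        pass  # proceed unlocked; the caller can inspect the yielded flag
    try:
        # Exceptions from the body propagate; they are not caught above.
        yield acquired
    finally:
        # Release only a lock this context acquired, never someone else's.
        if acquired:
            lock.release()


async def demo() -> list:
    lock = asyncio.Lock()
    events = []
    async with lock_with_timeout(lock, timeout=0.05) as outer:
        events.append(("outer", outer))
        # The lock is already held, so this second attempt times out.
        async with lock_with_timeout(lock, timeout=0.05) as inner:
            events.append(("inner", inner))
        # The failed inner attempt must not have released the outer lock.
        events.append(("still_locked", lock.locked()))
    events.append(("locked_after", lock.locked()))
    return events


events = asyncio.run(demo())
print(events)
```

Yielding the flag lets the caller decide whether to skip the update entirely when the lock could not be obtained, rather than running unprotected.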

{sibi_dst-0.3.40 → sibi_dst-0.3.43}/sibi_dst/df_helper/_df_helper.py RENAMED

@@ -112,6 +112,7 @@ class DfHelper:
         :return: None
         """
         self.logger.debug(f"backend used: {self.backend}")
+        self.logger.debug(f"kwargs passed to backend plugins: {kwargs}")
         self._backend_query = self.__get_config(QueryConfig, kwargs)
         self._backend_params = self.__get_config(ParamsConfig, kwargs)
         if self.backend == 'django_db':
@@ -124,8 +125,8 @@ class DfHelper:
         elif self.backend == 'sqlalchemy':
             self.backend_sqlalchemy = self.__get_config(SqlAlchemyConnectionConfig, kwargs)
-    @staticmethod
-    def __get_config(model: [T], kwargs: Dict[str, Any]) -> Union[T]:
+    def __get_config(self, model: [T], kwargs: Dict[str, Any]) -> Union[T]:
         """
         Initializes a Pydantic model with the keys it recognizes from the kwargs,
         and removes those keys from the kwargs dictionary.
@@ -135,7 +136,9 @@ class DfHelper:
         """
         # Extract keys that the model can accept
         recognized_keys = set(model.__annotations__.keys())
+        self.logger.debug(f"recognized keys: {recognized_keys}")
         model_kwargs = {k: kwargs.pop(k) for k in list(kwargs.keys()) if k in recognized_keys}
+        self.logger.debug(f"model_kwargs: {model_kwargs}")
         return model(**model_kwargs)
     def load_parallel(self, **options):
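
The change above turns `__get_config` into an instance method so that it can log the recognized keys. The underlying technique, splitting one `kwargs` dictionary across several config models, can be illustrated without Pydantic. In this sketch the two dataclasses and their field names are hypothetical stand-ins for the real config models, and `dataclasses.fields` plays the role that `model.__annotations__` plays in the diff.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Type, TypeVar

T = TypeVar("T")


@dataclass
class QueryConfig:
    """Hypothetical stand-in for one of the config models."""
    use_exclude: bool = False
    n_records: int = 0


@dataclass
class ConnectionConfig:
    """Hypothetical stand-in for a connection config model."""
    connection_url: Optional[str] = None
    table: Optional[str] = None


def get_config(model: Type[T], kwargs: Dict[str, Any]) -> T:
    """Build `model` from the keys it recognizes, removing them from `kwargs`."""
    recognized = {f.name for f in fields(model)}
    # Iterate over a copy of the keys because pop() mutates the dict.
    model_kwargs = {k: kwargs.pop(k) for k in list(kwargs) if k in recognized}
    return model(**model_kwargs)


options = {"n_records": 100, "table": "customers", "status": "active"}
query = get_config(QueryConfig, options)
conn = get_config(ConnectionConfig, options)
print(query, conn, options)
```

Each call consumes its own keys, so whatever remains in `options` afterwards (here `status`) is left for later consumers. One caveat worth checking against the real models: a class's `__annotations__` lists only the fields declared on that class, so fields inherited from a base model would not be recognized by the approach in the diff.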

{sibi_dst-0.3.40 → sibi_dst-0.3.43}/sibi_dst/df_helper/_parquet_artifact.py RENAMED

@@ -1,11 +1,11 @@
+import logging
 from typing import Optional, Any, Dict
 import dask.dataframe as dd
 import fsspec
 from sibi_dst.df_helper import DfHelper
-from sibi_dst.utils import DataWrapper
-from sibi_dst.utils import DateUtils
+from sibi_dst.utils import DataWrapper, DateUtils, Logger
 class ParquetArtifact(DfHelper):
@@ -82,7 +82,12 @@ class ParquetArtifact(DfHelper):
             **kwargs,
         }
         self.df: Optional[dd.DataFrame] = None
+        self.debug = self.config.setdefault('debug', False)
+        self.logger = self.config.setdefault('logger',Logger.default_logger(logger_name=f'parquet_artifact_{__class__.__name__}'))
+        self.logger.set_level(logging.DEBUG if self.debug else logging.INFO)
         self.data_wrapper_class = data_wrapper_class
+        self.class_params = self.config.setdefault('class_params', None)
+        self.load_params = self.config.setdefault('load_params', None)
         self.date_field = self.config.setdefault('date_field', None)
         if self.date_field is None:
             raise ValueError('date_field must be set')
@@ -150,6 +155,7 @@ class ParquetArtifact(DfHelper):
     def _prepare_params(self, kwargs: Dict[str, Any]) -> Dict[str, Any]:
         """Prepare the parameters for generating the Parquet file."""
+        kwargs = {**self.config, **kwargs}
         return {
             'class_params': kwargs.pop('class_params', None),
             'date_field': kwargs.pop('date_field', self.date_field),
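
The added line `kwargs = {**self.config, **kwargs}` changes how `_prepare_params` resolves its values: instance-level config now supplies defaults, and per-call keyword arguments override them. Because the merge builds a new dictionary, the later `pop` calls no longer mutate the dictionary passed in by the caller. A small sketch of that precedence follows; `prepare_params` is a simplified, hypothetical version of the method that handles only the two keys visible in the hunk.

```python
from typing import Any, Dict


def prepare_params(config: Dict[str, Any], kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Merge instance-level defaults with per-call overrides."""
    # Later unpacking wins: keys in `kwargs` override the same keys in `config`.
    merged = {**config, **kwargs}
    return {
        "class_params": merged.pop("class_params", None),
        "date_field": merged.pop("date_field", None),
    }


config = {"class_params": {"debug": False}, "date_field": "created_at"}
call_kwargs = {"date_field": "updated_at"}
params = prepare_params(config, call_kwargs)
print(params)
print(config["date_field"], call_kwargs)
```

Note that the merge is shallow: the `class_params` dictionary in the result is the same object as the one stored in `config`, so mutating it downstream would also change the instance-level default.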
