imsciences 0.9.5.1__tar.gz → 0.9.5.5__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {imsciences-0.9.5.1/imsciences.egg-info → imsciences-0.9.5.5}/PKG-INFO +33 -23
- imsciences-0.9.5.1/PKG-INFO → imsciences-0.9.5.5/README.md +30 -47
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/imsciences/geo.py +13 -11
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/imsciences/mmm.py +152 -10
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/imsciences/pull.py +726 -577
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/imsciences/unittesting.py +0 -1
- imsciences-0.9.5.1/README.md → imsciences-0.9.5.5/imsciences.egg-info/PKG-INFO +57 -22
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/imsciences.egg-info/requires.txt +2 -0
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/setup.py +3 -3
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/LICENSE.txt +0 -0
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/imsciences/__init__.py +0 -0
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/imsciences/vis.py +0 -0
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/imsciences.egg-info/PKG-INFO-IMS-24Ltp-3 +0 -0
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/imsciences.egg-info/SOURCES.txt +0 -0
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/imsciences.egg-info/dependency_links.txt +0 -0
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/imsciences.egg-info/top_level.txt +0 -0
- {imsciences-0.9.5.1 → imsciences-0.9.5.5}/setup.cfg +0 -0
{imsciences-0.9.5.1/imsciences.egg-info → imsciences-0.9.5.5}/PKG-INFO

@@ -1,10 +1,10 @@
 Metadata-Version: 2.1
 Name: imsciences
-Version: 0.9.5.1
+Version: 0.9.5.5
 Summary: IMS Data Processing Package
 Author: IMS
 Author-email: cam@im-sciences.com
-Keywords: python,data processing,apis
+Keywords: data processing,apis,data analysis,data visualization,machine learning
 Classifier: Development Status :: 3 - Alpha
 Classifier: Intended Audience :: Developers
 Classifier: Programming Language :: Python :: 3
@@ -17,6 +17,8 @@ Requires-Dist: pandas
 Requires-Dist: plotly
 Requires-Dist: numpy
 Requires-Dist: fredapi
+Requires-Dist: xgboost
+Requires-Dist: scikit-learn
 Requires-Dist: bs4
 Requires-Dist: yfinance
 Requires-Dist: holidays
@@ -33,23 +35,33 @@ The **Independent Marketing Sciences** package is a Python library designed to p
 - Seamless data processing for time series workflows.
 - Aggregation, filtering, and transformation of time series data.
 - Visualising Data
-- Integration with external data sources like FRED, Bank of England
+- Integration with external data sources like FRED, Bank of England and ONS.
 
 ---
 
 Table of Contents
 =================
 
-1. [
-2. [Data Processing for
-3. [Data
-4. [Data
-5. [
-6. [
+1. [Usage](#usage)
+2. [Data Processing for Time Series](#data-processing-for-time-series)
+3. [Data Processing for Incrementality Testing](#data-processing-for-incrementality-testing)
+4. [Data Visualisations](#data-visualisations)
+5. [Data Pulling](#data-pulling)
+6. [Installation](#installation)
 7. [License](#license)
 
 ---
 
+## Usage
+
+```bash
+from imsciences import dataprocessing, geoprocessing, datapull, datavis
+ims_proc = dataprocessing()
+ims_geo = geoprocessing()
+ims_pull = datapull()
+ims_vis = datavis()
+```
+
 ## Data Processing for Time Series
 
 ## 1. `get_wd_levels`
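As an illustration of the new explicit-import style introduced above, the following sketch combines it with two of the methods documented later in this README (`get_wd_levels` and `pull_yfinance`); the surrounding variable names and arguments are assumptions, not package code.

```python
from imsciences import dataprocessing, datapull

ims_proc = dataprocessing()
ims_pull = datapull()

# Resolve a directory two levels above the current working directory (see get_wd_levels below).
project_root = ims_proc.get_wd_levels(2)

# Pull weekly FTSE 250 and NASDAQ series, weeks commencing Monday (see pull_yfinance below).
prices = ims_pull.pull_yfinance(['^FTMC', '^IXIC'], 'mon')
```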
@@ -222,6 +234,11 @@ Table of Contents
 - **Usage**: `week_commencing_2_week_commencing_conversion_isoweekday(df, date_col, week_commencing='mon')`
 - **Example**: `week_commencing_2_week_commencing_conversion_isoweekday(df, 'date_col', week_commencing='fri')`
 
+## 35. `seasonality_feature_extraction`
+- **Description**: Splits data into train/test sets, trains XGBoost and Random Forest on all features, extracts top features based on feature importance, merges them, optionally retrains models on top and combined features, and returns a dict of results.
+- **Usage**: `seasonality_feature_extraction(df, kpi_var, n_features=10, test_size=0.1, random_state=42, shuffle=False)`
+- **Example**: `seasonality_feature_extraction(df, 'kpi_total_sales', n_features=5, test_size=0.2, random_state=123, shuffle=True)`
+
 ---
 
 ## Data Processing for Incrementality Testing
@@ -291,8 +308,8 @@ Table of Contents
 
 ## 6. `pull_weather`
 - **Description**: Fetch and process historical weather data for the specified country.
-- **Usage**: `pull_weather(week_commencing, country)`
-- **Example**: `pull_weather('mon', 'GBR')`
+- **Usage**: `pull_weather(week_commencing, start_date, country)`
+- **Example**: `pull_weather('mon', '2020-01-01', 'GBR')`
 
 ## 7. `pull_macro_ons_uk`
 - **Description**: Fetch and process time series data from the Beta ONS API.
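The hunk above adds a `start_date` parameter to `pull_weather`, so existing calls need an extra argument. A minimal before/after sketch, assuming `datapull()` exposes the method exactly as documented here:

```python
from imsciences import datapull

ims_pull = datapull()

# 0.9.5.1 call (no start date):
# weather = ims_pull.pull_weather('mon', 'GBR')

# 0.9.5.5 call: weekly weather for Great Britain, weeks commencing Monday,
# with history starting from 1 January 2020.
weather = ims_pull.pull_weather('mon', '2020-01-01', 'GBR')
```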
@@ -304,6 +321,11 @@ Table of Contents
 - **Usage**: `pull_yfinance(tickers, week_start_day)`
 - **Example**: `pull_yfinance(['^FTMC', '^IXIC'], 'mon')`
 
+## 9. `pull_sports_events`
+- **Description**: Pull a variety of sports events, primarily football and rugby.
+- **Usage**: `pull_sports_events(start_date, week_commencing)`
+- **Example**: `pull_sports_events('2020-01-01', 'mon')`
+
 ---
 
 ## Installation
@@ -316,18 +338,6 @@ pip install imsciences
 
 ---
 
-## Usage
-
-```bash
-from imsciences import *
-ims_proc = dataprocessing()
-ims_geo = geoprocessing()
-ims_pull = datapull()
-ims_vis = datavis()
-```
-
----
-
 ## License
 
 This project is licensed under the MIT License. 
imsciences-0.9.5.1/PKG-INFO → imsciences-0.9.5.5/README.md

@@ -1,28 +1,3 @@
-Metadata-Version: 2.1
-Name: imsciences
-Version: 0.9.5.1
-Summary: IMS Data Processing Package
-Author: IMS
-Author-email: cam@im-sciences.com
-Keywords: python,data processing,apis
-Classifier: Development Status :: 3 - Alpha
-Classifier: Intended Audience :: Developers
-Classifier: Programming Language :: Python :: 3
-Classifier: Operating System :: Unix
-Classifier: Operating System :: MacOS :: MacOS X
-Classifier: Operating System :: Microsoft :: Windows
-Description-Content-Type: text/markdown
-License-File: LICENSE.txt
-Requires-Dist: pandas
-Requires-Dist: plotly
-Requires-Dist: numpy
-Requires-Dist: fredapi
-Requires-Dist: bs4
-Requires-Dist: yfinance
-Requires-Dist: holidays
-Requires-Dist: google-analytics-data
-Requires-Dist: geopandas
-
 # IMS Package Documentation
 
 The **Independent Marketing Sciences** package is a Python library designed to process incoming data into a format tailored for projects, particularly those utilising weekly time series data. This package offers a suite of functions for efficient data collection, manipulation, visualisation and analysis.
@@ -33,23 +8,33 @@ The **Independent Marketing Sciences** package is a Python library designed to p
 - Seamless data processing for time series workflows.
 - Aggregation, filtering, and transformation of time series data.
 - Visualising Data
-- Integration with external data sources like FRED, Bank of England
+- Integration with external data sources like FRED, Bank of England and ONS.
 
 ---
 
 Table of Contents
 =================
 
-1. [
-2. [Data Processing for
-3. [Data
-4. [Data
-5. [
-6. [
+1. [Usage](#usage)
+2. [Data Processing for Time Series](#data-processing-for-time-series)
+3. [Data Processing for Incrementality Testing](#data-processing-for-incrementality-testing)
+4. [Data Visualisations](#data-visualisations)
+5. [Data Pulling](#data-pulling)
+6. [Installation](#installation)
 7. [License](#license)
 
 ---
 
+## Usage
+
+```bash
+from imsciences import dataprocessing, geoprocessing, datapull, datavis
+ims_proc = dataprocessing()
+ims_geo = geoprocessing()
+ims_pull = datapull()
+ims_vis = datavis()
+```
+
 ## Data Processing for Time Series
 
 ## 1. `get_wd_levels`
@@ -222,6 +207,11 @@ Table of Contents
 - **Usage**: `week_commencing_2_week_commencing_conversion_isoweekday(df, date_col, week_commencing='mon')`
 - **Example**: `week_commencing_2_week_commencing_conversion_isoweekday(df, 'date_col', week_commencing='fri')`
 
+## 35. `seasonality_feature_extraction`
+- **Description**: Splits data into train/test sets, trains XGBoost and Random Forest on all features, extracts top features based on feature importance, merges them, optionally retrains models on top and combined features, and returns a dict of results.
+- **Usage**: `seasonality_feature_extraction(df, kpi_var, n_features=10, test_size=0.1, random_state=42, shuffle=False)`
+- **Example**: `seasonality_feature_extraction(df, 'kpi_total_sales', n_features=5, test_size=0.2, random_state=123, shuffle=True)`
+
 ---
 
 ## Data Processing for Incrementality Testing
@@ -291,8 +281,8 @@ Table of Contents
 
 ## 6. `pull_weather`
 - **Description**: Fetch and process historical weather data for the specified country.
-- **Usage**: `pull_weather(week_commencing, country)`
-- **Example**: `pull_weather('mon', 'GBR')`
+- **Usage**: `pull_weather(week_commencing, start_date, country)`
+- **Example**: `pull_weather('mon', '2020-01-01', 'GBR')`
 
 ## 7. `pull_macro_ons_uk`
 - **Description**: Fetch and process time series data from the Beta ONS API.
@@ -304,6 +294,11 @@ Table of Contents
 - **Usage**: `pull_yfinance(tickers, week_start_day)`
 - **Example**: `pull_yfinance(['^FTMC', '^IXIC'], 'mon')`
 
+## 9. `pull_sports_events`
+- **Description**: Pull a variety of sports events, primarily football and rugby.
+- **Usage**: `pull_sports_events(start_date, week_commencing)`
+- **Example**: `pull_sports_events('2020-01-01', 'mon')`
+
 ---
 
 ## Installation
@@ -316,20 +311,8 @@ pip install imsciences
 
 ---
 
-## Usage
-
-```bash
-from imsciences import *
-ims_proc = dataprocessing()
-ims_geo = geoprocessing()
-ims_pull = datapull()
-ims_vis = datavis()
-```
-
----
-
 ## License
 
 This project is licensed under the MIT License. 
 
----
+---
{imsciences-0.9.5.1 → imsciences-0.9.5.5}/imsciences/geo.py

@@ -199,13 +199,13 @@ class geoprocessing:
 
         return analysis_df
 
-    def process_city_analysis(self,
+    def process_city_analysis(self, raw_data, spend_data, output_path, group1, group2, response_column):
         """
         Process city analysis by grouping data, analyzing user metrics, and merging with spend data.
 
         Parameters:
-
-
+        raw_data (str or pd.DataFrame): Raw input data as a file path (CSV/XLSX) or DataFrame.
+        spend_data (str or pd.DataFrame): Spend data as a file path (CSV/XLSX) or DataFrame.
         output_path (str): Path to save the final output file (CSV or XLSX).
         group1 (list): List of city regions for group 1.
         group2 (list): List of city regions for group 2.
@@ -217,13 +217,15 @@ class geoprocessing:
         import pandas as pd
         import os
 
-        def read_file(
-            """Helper function to
-
+        def read_file(data):
+            """Helper function to handle file paths or return DataFrame directly."""
+            if isinstance(data, pd.DataFrame):
+                return data
+            ext = os.path.splitext(data)[1].lower()
             if ext == '.csv':
-                return pd.read_csv(
+                return pd.read_csv(data)
             elif ext in ['.xlsx', '.xls']:
-                return pd.read_excel(
+                return pd.read_excel(data)
             else:
                 raise ValueError("Unsupported file type. Please use a CSV or XLSX file.")
 
@@ -237,9 +239,9 @@ class geoprocessing:
         else:
             raise ValueError("Unsupported file type. Please use a CSV or XLSX file.")
 
-        # Read
-        raw_df = read_file(
-        spend_df = read_file(
+        # Read data
+        raw_df = read_file(raw_data)
+        spend_df = read_file(spend_data)
 
         # Ensure necessary columns are present
         required_columns = {'date', 'city', response_column}
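The geo.py hunks above let `process_city_analysis` accept `raw_data` and `spend_data` either as CSV/XLSX file paths or as in-memory DataFrames. A hedged sketch of a call passing DataFrames directly; the 'date', 'city' and response columns follow the `required_columns` check shown above, while the example values, group lists, spend columns and output path are invented for illustration.

```python
import pandas as pd
from imsciences import geoprocessing

ims_geo = geoprocessing()

# Raw city-level observations: must contain 'date', 'city' and the chosen response column.
raw_df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01"],
    "city": ["Manchester", "Leeds"],
    "new_users": [120, 95],
})

# Spend data can now also be passed as a DataFrame (a CSV/XLSX path is still accepted).
spend_df = pd.DataFrame({"date": ["2024-01-01"], "spend": [1000.0]})

ims_geo.process_city_analysis(
    raw_data=raw_df,
    spend_data=spend_df,
    output_path="city_analysis.xlsx",   # illustrative output location
    group1=["Manchester"],              # test-region list (illustrative)
    group2=["Leeds"],                   # control-region list (illustrative)
    response_column="new_users",
)
```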
{imsciences-0.9.5.1 → imsciences-0.9.5.5}/imsciences/mmm.py

@@ -6,6 +6,9 @@ import re
 from datetime import datetime, timedelta
 import subprocess
 import json
+from sklearn.model_selection import train_test_split
+import xgboost as xgb
+from sklearn.ensemble import RandomForestRegressor
 
 class dataprocessing:
 
@@ -180,7 +183,12 @@ class dataprocessing:
         print(" - Description: Maps dates to the start of the current ISO week based on a specified weekday.")
         print(" - Usage: week_commencing_2_week_commencing_conversion_isoweekday(df, date_col, week_commencing='mon')")
         print(" - Example: week_commencing_2_week_commencing_conversion_isoweekday(df, 'date_col', week_commencing='fri')")
-
+
+        print("\n35. seasonality_feature_extraction")
+        print(" - Description: Splits data into train/test sets, trains XGBoost and Random Forest on all features, extracts top features based on feature importance, merges them, optionally retrains models on top and combined features, and returns a dict of results.")
+        print(" - Usage: seasonality_feature_extraction(df, kpi_var, n_features=10, test_size=0.1, random_state=42, shuffle=False)")
+        print(" - Example: seasonality_feature_extraction(df, 'kpi_total_sales', n_features=5, test_size=0.2, random_state=123, shuffle=True)")
+
     def get_wd_levels(self, levels):
         """
         Gets the current wd of whoever is working on it and gives the options to move the number of levels up.
@@ -492,15 +500,15 @@ class dataprocessing:
 
         return combined_df
 
-    def pivot_table(self, df, index_col, columns, values_col, filters_dict=None, fill_value=0, aggfunc="sum", margins=False, margins_name="Total", datetime_trans_needed=True, date_format="%Y-%m-%d", reverse_header_order=False, fill_missing_weekly_dates=
+    def pivot_table(self, df, index_col, columns, values_col, filters_dict=None, fill_value=0, aggfunc="sum", margins=False, margins_name="Total", datetime_trans_needed=True, date_format="%Y-%m-%d", reverse_header_order=False, fill_missing_weekly_dates=True, week_commencing="W-MON"):
         """
         Provides the ability to create pivot tables, filtering the data to get to data you want and then pivoting on certain columns
 
         Args:
             df (pandas.DataFrame): The DataFrame containing the data.
             index_col (str): Name of Column for your pivot table to index on
-            columns (str): Name of
-            values_col (str): Name of Values
+            columns (str or list): Name of Column(s) for your pivot table. Can be a single column or a list of columns.
+            values_col (str or list): Name of Values Column(s) for your pivot table. Can be a single column or a list of columns.
             filters_dict (dict, optional): Dictionary of conditions for the boolean mask i.e. what to filter your df on to get to your chosen cell. Defaults to None
             fill_value (int, optional): The value to replace nan with. Defaults to 0.
             aggfunc (str, optional): The method on which to aggregate the values column. Defaults to sum.
@@ -514,14 +522,19 @@
         Returns:
             pandas.DataFrame: The pivot table specified
         """
-
         # Validate inputs
         if index_col not in df.columns:
             raise ValueError(f"index_col '{index_col}' not found in DataFrame.")
-
-
-
-
+
+        columns = [columns] if isinstance(columns, str) else columns
+        for col in columns:
+            if col not in df.columns:
+                raise ValueError(f"columns '{col}' not found in DataFrame.")
+
+        values_col = [values_col] if isinstance(values_col, str) else values_col
+        for col in values_col:
+            if col not in df.columns:
+                raise ValueError(f"values_col '{col}' not found in DataFrame.")
 
         # Apply filters if provided
         if filters_dict:
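The two pivot_table hunks above extend `columns` and `values_col` to accept either a single column name or a list of columns, with per-column validation. A short sketch exercising the list form; the DataFrame and its column names are invented for illustration.

```python
import pandas as pd
from imsciences import dataprocessing

ims_proc = dataprocessing()

df = pd.DataFrame({
    "obs": ["2024-01-01", "2024-01-01", "2024-01-08"],
    "channel": ["tv", "search", "tv"],
    "region": ["north", "south", "north"],
    "spend": [100.0, 50.0, 120.0],
    "impressions": [1000, 400, 1100],
})

# columns and values_col may now be lists rather than single column names.
pivot = ims_proc.pivot_table(
    df,
    index_col="obs",
    columns=["channel", "region"],
    values_col=["spend", "impressions"],
    aggfunc="sum",
)
```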
@@ -1412,4 +1425,133 @@
         new_col = f"week_start_{week_commencing}"
         df[new_col] = df[date_col].apply(map_to_week_start)
 
-        return df
+        return df
+
+    def seasonality_feature_extraction(self, df, kpi_var, n_features=10, test_size=0.1, random_state=42, shuffle=False):
+        """
+        1) Uses the provided dataframe (df), where:
+           - df['kpi_total_sales'] is the target (y).
+           - df['OBS'] is a date or index column (excluded from features).
+
+        2) Splits data into train/test using the specified test_size, random_state, and shuffle.
+        3) Trains XGBoost and Random Forest on all features.
+        4) Extracts the top n_features from each model.
+        5) Merges their unique top features.
+        6) Optionally retrains each model on the combined top features.
+        7) Returns performance metrics and the fitted models.
+
+        Parameters
+        ----------
+        df : pd.DataFrame
+            The input dataframe that contains kpi_var (target) and 'OBS' (date/index).
+        n_features : int, optional
+            Number of top features to extract from each model (default=10).
+        test_size : float, optional
+            Test size for train_test_split (default=0.1).
+        random_state : int, optional
+            Random state for reproducibility (default=42).
+        shuffle : bool, optional
+            Whether to shuffle the data before splitting (default=False).
+
+        Returns
+        -------
+        dict
+            A dictionary containing:
+            - "top_features_xgb": list of top n_features from XGBoost
+            - "top_features_rf": list of top n_features from Random Forest
+            - "combined_features": merged unique feature list
+            - "performance": dictionary of performance metrics
+            - "models": dictionary of fitted models
+        """
+        # ---------------------------------------------------------------------
+        # 1. Prepare your data (X, y)
+        # ---------------------------------------------------------------------
+        # Extract target and features
+        y = df[kpi_var]
+        X = df.drop(columns=['OBS', kpi_var])
+
+        # Split into train/test
+        X_train, X_test, y_train, y_test = train_test_split(
+            X, y,
+            test_size=test_size,
+            random_state=random_state,
+            shuffle=shuffle
+        )
+
+        # ---------------------------------------------------------------------
+        # 2. XGBoost Approach (on all features)
+        # ---------------------------------------------------------------------
+        # (A) Train full model on ALL features
+        xgb_model_full = xgb.XGBRegressor(random_state=random_state)
+        xgb_model_full.fit(X_train, y_train)
+
+        # (B) Get feature importances
+        xgb_importances = xgb_model_full.feature_importances_
+        xgb_feat_importance_df = (
+            pd.DataFrame({
+                'feature': X.columns,
+                'importance': xgb_importances
+            })
+            .sort_values('importance', ascending=False)
+            .reset_index(drop=True)
+        )
+
+        # (C) Select top N features
+        top_features_xgb = xgb_feat_importance_df['feature'].head(n_features).tolist()
+
+        # (D) Subset data to top N features
+        X_train_xgb_topN = X_train[top_features_xgb]
+
+        # (E) Retrain XGBoost on these top N features
+        xgb_model_topN = xgb.XGBRegressor(random_state=random_state)
+        xgb_model_topN.fit(X_train_xgb_topN, y_train)
+
+        # ---------------------------------------------------------------------
+        # 3. Random Forest Approach (on all features)
+        # ---------------------------------------------------------------------
+        rf_model_full = RandomForestRegressor(random_state=random_state)
+        rf_model_full.fit(X_train, y_train)
+
+        # (B) Get feature importances
+        rf_importances = rf_model_full.feature_importances_
+        rf_feat_importance_df = (
+            pd.DataFrame({
+                'feature': X.columns,
+                'importance': rf_importances
+            })
+            .sort_values('importance', ascending=False)
+            .reset_index(drop=True)
+        )
+
+        # (C) Select top N features
+        top_features_rf = rf_feat_importance_df['feature'].head(n_features).tolist()
+
+        # (D) Subset data to top N features
+        X_train_rf_topN = X_train[top_features_rf]
+
+        # (E) Retrain Random Forest on these top N features
+        rf_model_topN = RandomForestRegressor(random_state=random_state)
+        rf_model_topN.fit(X_train_rf_topN, y_train)
+
+        # ---------------------------------------------------------------------
+        # 4. Combine top features from both models
+        # ---------------------------------------------------------------------
+        combined_features = list(set(top_features_xgb + top_features_rf))
+
+        # Create new training/testing data with the combined features
+        X_train_combined = X_train[combined_features]
+
+        # (Optional) Retrain XGBoost on combined features
+        xgb_model_combined = xgb.XGBRegressor(random_state=random_state)
+        xgb_model_combined.fit(X_train_combined, y_train)
+
+        # (Optional) Retrain Random Forest on combined features
+        rf_model_combined = RandomForestRegressor(random_state=random_state)
+        rf_model_combined.fit(X_train_combined, y_train)
+
+        # Organize all results to return
+        output = {
+            "combined_features": combined_features,
+        }
+
+        return output
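To make the new method's expected input concrete, here is a hedged usage sketch: a weekly dataframe with an 'OBS' date column, a KPI target column and candidate seasonality features, as the docstring above describes. The dataset is synthetic, and note that the released code returns only the "combined_features" key despite the docstring listing additional keys.

```python
import numpy as np
import pandas as pd
from imsciences import dataprocessing

ims_proc = dataprocessing()

# Synthetic weekly dataset: 'OBS' is the date column excluded from features,
# 'kpi_total_sales' is the target, the remaining columns are candidate seasonal features.
obs = pd.date_range("2021-01-04", periods=156, freq="W-MON")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "OBS": obs,
    "kpi_total_sales": rng.normal(1000, 100, len(obs)),
    "seas_week_of_year": obs.isocalendar().week.astype(int).to_numpy(),
    "seas_month": obs.month,
    "dummy_xmas": (obs.month == 12).astype(int),
})

results = ims_proc.seasonality_feature_extraction(
    df, "kpi_total_sales", n_features=2, test_size=0.1, shuffle=False
)
print(results["combined_features"])  # merged top features from XGBoost and Random Forest
```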