lecrapaud 0.21.0__tar.gz → 0.21.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of lecrapaud might be problematic.

Files changed (50)
  1. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/PKG-INFO +18 -5
  2. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/README.md +17 -4
  3. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/config.py +1 -1
  4. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/feature_engineering.py +189 -3
  5. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/feature_selection.py +20 -28
  6. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/model_selection.py +11 -8
  7. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/pyproject.toml +1 -1
  8. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/LICENSE +0 -0
  9. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/__init__.py +0 -0
  10. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/api.py +0 -0
  11. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/__init__.py +0 -0
  12. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic/README +0 -0
  13. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic/env.py +0 -0
  14. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic/script.py.mako +0 -0
  15. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic/versions/2025_06_23_1748-f089dfb7e3ba_.py +0 -0
  16. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic/versions/2025_06_24_1216-c62251b129ed_.py +0 -0
  17. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic/versions/2025_06_24_1711-86457e2f333f_.py +0 -0
  18. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic/versions/2025_06_25_1759-72aa496ca65b_.py +0 -0
  19. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic/versions/2025_08_25_1434-7ed9963e732f_add_best_score_to_model_selection.py +0 -0
  20. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic/versions/2025_08_28_1516-c36e9fee22b9_add_avg_precision_to_score.py +0 -0
  21. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic/versions/2025_08_28_1622-8b11c1ba982e_change_name_column.py +0 -0
  22. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic/versions/2025_10_25_0635-07e303521594_add_unique_constraint_to_score.py +0 -0
  23. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic/versions/2025_10_26_1727-033e0f7eca4f_merge_score_and_model_trainings_into_.py +0 -0
  24. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic/versions/2025_10_28_2006-0a8fb7826e9b_add_number_of_targets_and_remove_other_.py +0 -0
  25. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/alembic.ini +0 -0
  26. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/models/__init__.py +0 -0
  27. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/models/base.py +0 -0
  28. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/models/experiment.py +0 -0
  29. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/models/feature.py +0 -0
  30. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/models/feature_selection.py +0 -0
  31. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/models/feature_selection_rank.py +0 -0
  32. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/models/model.py +0 -0
  33. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/models/model_selection.py +0 -0
  34. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/models/model_selection_score.py +0 -0
  35. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/models/target.py +0 -0
  36. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/models/utils.py +0 -0
  37. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/db/session.py +0 -0
  38. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/directories.py +0 -0
  39. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/experiment.py +0 -0
  40. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/integrations/openai_integration.py +0 -0
  41. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/jobs/__init__.py +0 -0
  42. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/jobs/config.py +0 -0
  43. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/jobs/scheduler.py +0 -0
  44. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/jobs/tasks.py +0 -0
  45. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/misc/tabpfn_tests.ipynb +0 -0
  46. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/misc/test-gpu-bilstm.ipynb +0 -0
  47. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/misc/test-gpu-resnet.ipynb +0 -0
  48. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/misc/test-gpu-transformers.ipynb +0 -0
  49. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/search_space.py +0 -0
  50. {lecrapaud-0.21.0 → lecrapaud-0.21.2}/lecrapaud/utils.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: lecrapaud
-Version: 0.21.0
+Version: 0.21.2
 Summary: Framework for machine and deep learning, with regression, classification and time series analysis
 License: Apache License
 License-File: LICENSE
@@ -218,7 +218,11 @@ context = {
     "val_size": 0.2,
     "test_size": 0.2,
     "pca_temporal": [
-        {"name": "LAST_20_RET", "columns": [f"RET_-{i}" for i in range(1, 21)]},
+        # Old format (still supported)
+        # {"name": "LAST_20_RET", "columns": [f"RET_-{i}" for i in range(1, 21)]},
+        # New simplified format - automatically creates lag columns
+        {"name": "LAST_20_RET", "column": "RET", "lags": 20},
+        {"name": "LAST_10_VOL", "column": "VOLUME", "lags": 10},
     ],
     "pca_cross_sectional": [
         {
@@ -255,11 +259,20 @@ experiment = app.create_experiment(data=your_dataframe, **context)
 
 2. **Parameter Precedence**: When loading an existing experiment, the stored context takes precedence over any parameters passed to the constructor.
 
-3. **PCA Time Series**: For time series data with `pca_cross_sectional` where index equals `date_column`, the system automatically uses an expanding window approach to prevent data leakage.
+3. **PCA Time Series**:
+   - For time series data, both `pca_cross_sectional` and `pca_temporal` automatically use an expanding window approach with periodic refresh (default: every 90 days) to prevent data leakage.
+   - The system fits PCA only on historical data (lookback window of 365 days by default) and avoids look-ahead bias.
+   - For panel data (e.g., multiple stocks), lag features are created per group when using the simplified `pca_temporal` format.
+   - Missing PCA values are handled with forward-fill followed by zero-fill to ensure compatibility with downstream models.
 
-4. **OpenAI Embeddings**: If using `columns_pca` with text columns, ensure `OPENAI_API_KEY` is set as an environment variable.
+4. **PCA Temporal Simplified Format**:
+   - Instead of manually listing lag columns: `{"name": "LAST_20_RET", "columns": ["RET_-1", "RET_-2", ..., "RET_-20"]}`
+   - Use the simplified format: `{"name": "LAST_20_RET", "column": "RET", "lags": 20}`
+   - The system automatically creates the lag columns, handling panel data correctly with `group_column`.
 
-5. **Model Indices**: The `models_idx` parameter accepts both integer indices and string names (e.g., `'xgb'`, `'lgb'`, `'catboost'`).
+5. **OpenAI Embeddings**: If using `columns_pca` with text columns, ensure `OPENAI_API_KEY` is set as an environment variable.
+
+6. **Model Indices**: The `models_idx` parameter accepts both integer indices and string names (e.g., `'xgb'`, `'lgb'`, `'catboost'`).
 
 
@@ -179,7 +179,11 @@ context = {
     "val_size": 0.2,
     "test_size": 0.2,
     "pca_temporal": [
-        {"name": "LAST_20_RET", "columns": [f"RET_-{i}" for i in range(1, 21)]},
+        # Old format (still supported)
+        # {"name": "LAST_20_RET", "columns": [f"RET_-{i}" for i in range(1, 21)]},
+        # New simplified format - automatically creates lag columns
+        {"name": "LAST_20_RET", "column": "RET", "lags": 20},
+        {"name": "LAST_10_VOL", "column": "VOLUME", "lags": 10},
     ],
     "pca_cross_sectional": [
         {
@@ -216,11 +220,20 @@ experiment = app.create_experiment(data=your_dataframe, **context)
 
 2. **Parameter Precedence**: When loading an existing experiment, the stored context takes precedence over any parameters passed to the constructor.
 
-3. **PCA Time Series**: For time series data with `pca_cross_sectional` where index equals `date_column`, the system automatically uses an expanding window approach to prevent data leakage.
+3. **PCA Time Series**:
+   - For time series data, both `pca_cross_sectional` and `pca_temporal` automatically use an expanding window approach with periodic refresh (default: every 90 days) to prevent data leakage.
+   - The system fits PCA only on historical data (lookback window of 365 days by default) and avoids look-ahead bias.
+   - For panel data (e.g., multiple stocks), lag features are created per group when using the simplified `pca_temporal` format.
+   - Missing PCA values are handled with forward-fill followed by zero-fill to ensure compatibility with downstream models.
 
-4. **OpenAI Embeddings**: If using `columns_pca` with text columns, ensure `OPENAI_API_KEY` is set as an environment variable.
+4. **PCA Temporal Simplified Format**:
+   - Instead of manually listing lag columns: `{"name": "LAST_20_RET", "columns": ["RET_-1", "RET_-2", ..., "RET_-20"]}`
+   - Use the simplified format: `{"name": "LAST_20_RET", "column": "RET", "lags": 20}`
+   - The system automatically creates the lag columns, handling panel data correctly with `group_column`.
 
-5. **Model Indices**: The `models_idx` parameter accepts both integer indices and string names (e.g., `'xgb'`, `'lgb'`, `'catboost'`).
+5. **OpenAI Embeddings**: If using `columns_pca` with text columns, ensure `OPENAI_API_KEY` is set as an environment variable.
+
+6. **Model Indices**: The `models_idx` parameter accepts both integer indices and string names (e.g., `'xgb'`, `'lgb'`, `'catboost'`).
 
 
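The two formats are equivalent; as a quick illustration, a minimal sketch in plain Python (the expansion mirrors the lag-column construction added to feature_engineering.py further down in this diff):

    # New simplified entry
    simplified = {"name": "LAST_20_RET", "column": "RET", "lags": 20}

    # ...expands to the old explicit form
    explicit = {
        "name": "LAST_20_RET",
        "columns": [f"RET_-{i}" for i in range(1, 21)],  # RET_-1 ... RET_-20
    }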
@@ -34,5 +34,5 @@ OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
 LECRAPAUD_LOGFILE = os.getenv("LECRAPAUD_LOGFILE")
 LECRAPAUD_TABLE_PREFIX = os.getenv("LECRAPAUD_TABLE_PREFIX", "lecrapaud")
 LECRAPAUD_OPTIMIZATION_BACKEND = os.getenv(
-    "LECRAPAUD_OPTIMIZATION_BACKEND", "ray"
+    "LECRAPAUD_OPTIMIZATION_BACKEND", "hyperopt"
 ).lower()
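The default optimization backend thus changes from Ray Tune to HyperOpt. A small sketch of opting back into Ray, assuming (as the `os.getenv` call above implies) the value is read once at import time:

    import os

    # Must be set before lecrapaud is imported, since config.py reads it at import time
    os.environ["LECRAPAUD_OPTIMIZATION_BACKEND"] = "ray"

    import lecrapaud  # any value other than "ray" or "hyperopt" now raises ValueError
                      # (see the model_selection.py hunk below)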
@@ -605,7 +605,7 @@ class PreprocessFeature:
 
         return df, pcas_dict
 
-    def add_pca_feature_cross_sectional(
+    def add_pca_feature_cross_sectional_old(
         self,
         df: pd.DataFrame,
         *,
@@ -657,7 +657,7 @@ class PreprocessFeature:
 
         return df, pcas_dict
 
-    def add_pca_feature_cross_sectional_time_series(
+    def add_pca_feature_cross_sectional(
         self,
         df: pd.DataFrame,
         *,
@@ -840,6 +840,11 @@ class PreprocessFeature:
                 # Merge the scores
                 df = df.merge(scores_df, on=index_col, how="left")
                 df.index = index_saved
+
+                # Forward-fill then 0 to avoid NaN
+                pca_cols = [col for col in df.columns if col.startswith(prefix)]
+                df[pca_cols] = df[pca_cols].fillna(method='ffill').fillna(0)
+
                 pcas_dict.update({name: pipe})
 
             else:
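The fill chain behaves as in this standalone sketch; note that `fillna(method='ffill')` is deprecated in pandas >= 2.1, so `.ffill()` is the forward-compatible spelling of the same operation:

    import pandas as pd

    s = pd.Series([float("nan"), 1.0, float("nan"), 2.0, float("nan")])

    # Same result as s.fillna(method='ffill').fillna(0)
    filled = s.ffill().fillna(0)
    print(filled.tolist())  # [0.0, 1.0, 1.0, 2.0, 2.0]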
@@ -873,7 +878,7 @@ class PreprocessFeature:
 
         return df, pcas_dict
 
     # ----------------- 2) TEMPORAL PCA (list of lag columns) ----------------
-    def add_pca_feature_temporal(
+    def add_pca_feature_temporal_old(
         self,
         df: pd.DataFrame,
         *,
@@ -936,6 +941,187 @@ class PreprocessFeature:
 
         return df, pcas_dict
 
+    def add_pca_feature_temporal(
+        self,
+        df: pd.DataFrame,
+        *,
+        n_components: int = 5,
+        pcas: dict[str, Pipeline] | None = None,
+        impute_strategy: str = "median",
+        standardize: bool = True,
+        lookback_days: int = 365,
+        refresh_frequency: int = 90,
+    ) -> tuple[pd.DataFrame, dict[str, Pipeline]]:
+        """
+        Temporal PCA for time series, with panel data support.
+        Automatically creates the lag columns and avoids look-ahead bias.
+
+        Simplified pca_temporal format:
+        [{"name": "LAST_20_RET", "column": "RET", "lags": 20}]
+        """
+        pcas_dict = {}
+
+        for pca_config in self.pca_temporal:
+            # Support both old and new format
+            if "columns" in pca_config:
+                # Old format: use existing columns
+                name = pca_config["name"]
+                lag_columns = pca_config["columns"]
+                base_column = None
+                num_lags = len(lag_columns)
+            else:
+                # New format: create lag columns
+                name = pca_config["name"]
+                base_column = pca_config["column"].upper()
+                num_lags = pca_config.get("lags", 20)
+
+                # Create lag columns if they don't exist
+                if self.group_column:
+                    # Panel data: create lags by group
+                    for lag in range(1, num_lags + 1):
+                        lag_col = f"{base_column}_-{lag}"
+                        if lag_col not in df.columns:
+                            df[lag_col] = df.groupby(self.group_column)[base_column].shift(lag)
+                else:
+                    # Simple time series
+                    for lag in range(1, num_lags + 1):
+                        lag_col = f"{base_column}_-{lag}"
+                        if lag_col not in df.columns:
+                            df[lag_col] = df[base_column].shift(lag)
+
+                lag_columns = [f"{base_column}_-{i}" for i in range(1, num_lags + 1)]
+
+            prefix = f"TMP_PC_{name}"
+
+            # For time series: avoid look-ahead bias
+            if self.time_series and self.date_column:
+                all_scores = []
+                unique_dates = sorted(df[self.date_column].unique())
+
+                if pcas is not None:
+                    # Inference: use provided PCA
+                    pipe = pcas[name]
+
+                    # Apply to all data at once
+                    mask = df[lag_columns].notna().all(axis=1)
+                    if mask.any():
+                        X_transform = df.loc[mask, lag_columns]
+                        scores = pipe.transform(X_transform)
+
+                        for i in range(n_components):
+                            df.loc[mask, f"{prefix}_{i}"] = scores[:, i]
+
+                    # Fill NaN with forward fill then 0
+                    pca_cols = [f"{prefix}_{i}" for i in range(n_components)]
+                    df[pca_cols] = df[pca_cols].fillna(method='ffill').fillna(0)
+
+                else:
+                    # Training: expanding window with periodic refresh
+                    pipe = None
+                    last_fit_date = None
+
+                    for current_date_ordinal in unique_dates:
+                        current_date = pd.Timestamp.fromordinal(int(current_date_ordinal))
+
+                        # Determine if we should refit
+                        should_refit = pipe is None or (
+                            last_fit_date is not None
+                            and (current_date - last_fit_date).days >= refresh_frequency
+                        )
+
+                        if should_refit and len(df[df[self.date_column] < current_date_ordinal]) > num_lags * 2:
+                            # Get historical data for fitting
+                            lookback_start = current_date - pd.Timedelta(days=lookback_days)
+                            lookback_start_ordinal = pd.Timestamp.toordinal(lookback_start)
+
+                            mask_fit = (
+                                (df[self.date_column] >= lookback_start_ordinal) &
+                                (df[self.date_column] < current_date_ordinal) &
+                                df[lag_columns].notna().all(axis=1)
+                            )
+
+                            if mask_fit.sum() >= n_components:
+                                X_fit = df.loc[mask_fit, lag_columns]
+
+                                # Create pipeline
+                                steps = []
+                                if impute_strategy is not None:
+                                    steps.append(("imputer", SimpleImputer(strategy=impute_strategy)))
+                                if standardize:
+                                    steps.append(("scaler", StandardScaler()))
+                                steps.append(("pca", PCA(n_components=n_components, random_state=0)))
+
+                                pipe = Pipeline(steps)
+                                pipe.fit(X_fit)
+                                last_fit_date = current_date
+
+                                logger.debug(
+                                    f"Temporal PCA {name} refitted at {current_date.strftime('%Y-%m-%d')} "
+                                    f"using {len(X_fit)} samples"
+                                )
+
+                        # Transform current date data
+                        if pipe is not None:
+                            mask_current = (
+                                (df[self.date_column] == current_date_ordinal) &
+                                df[lag_columns].notna().all(axis=1)
+                            )
+
+                            if mask_current.any():
+                                X_current = df.loc[mask_current, lag_columns]
+                                scores = pipe.transform(X_current)
+
+                                for i in range(n_components):
+                                    df.loc[mask_current, f"{prefix}_{i}"] = scores[:, i]
+
+                    # Fill NaN with forward fill then 0
+                    pca_cols = [f"{prefix}_{i}" for i in range(n_components)]
+                    for col in pca_cols:
+                        if col not in df.columns:
+                            df[col] = 0
+                    df[pca_cols] = df[pca_cols].fillna(method='ffill').fillna(0)
+
+                pcas_dict[name] = pipe
+
+            else:
+                # Non time-series: use original approach
+                mask = df[lag_columns].notna().all(axis=1)
+
+                if pcas is None and mask.any():
+                    X_fit = df.loc[mask, lag_columns]
+
+                    steps = []
+                    if impute_strategy is not None:
+                        steps.append(("imputer", SimpleImputer(strategy=impute_strategy)))
+                    if standardize:
+                        steps.append(("scaler", StandardScaler()))
+                    steps.append(("pca", PCA(n_components=n_components, random_state=0)))
+
+                    pipe = Pipeline(steps)
+                    pipe.fit(X_fit)
+                    pcas_dict[name] = pipe
+                elif pcas is not None:
+                    pipe = pcas[name]
+                    pcas_dict[name] = pipe
+                else:
+                    continue
+
+                if mask.any():
+                    X_transform = df.loc[mask, lag_columns]
+                    scores = pipe.transform(X_transform)
+
+                    for i in range(n_components):
+                        df.loc[mask, f"{prefix}_{i}"] = scores[:, i]
+
+                # Fill missing values
+                pca_cols = [f"{prefix}_{i}" for i in range(n_components)]
+                for col in pca_cols:
+                    if col not in df.columns:
+                        df[col] = 0
+                df[pca_cols] = df[pca_cols].fillna(0)
+
+        return df, pcas_dict
+
     # encoding categorical features
     def encode_categorical_features(
         self,
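The per-group lag creation in the panel branch reduces to a groupby/shift; a minimal standalone sketch with hypothetical TICKER and RET columns:

    import pandas as pd

    df = pd.DataFrame({
        "TICKER": ["A", "A", "A", "B", "B", "B"],
        "RET": [0.10, -0.20, 0.05, 0.30, -0.10, 0.20],
    })

    # Lags never cross group boundaries: each ticker's first rows stay NaN
    for lag in range(1, 3):
        df[f"RET_-{lag}"] = df.groupby("TICKER")["RET"].shift(lag)

    print(df)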
@@ -278,24 +278,32 @@ class FeatureSelectionEngine:
 
         features_selected_list = features_selected["features"].values.tolist()
 
-        # Save ensemble features before correlation (aggregated features)
-        logger.info("Saving ensemble features before correlation...")
-        all_features_in_data = self.X.columns.tolist()
+        # Save ensemble features for all numerical features with global ranking
+        logger.info("Saving ensemble features with global ranking for all numerical features...")
+        numerical_features_in_data = self.X_numerical.columns.tolist()
         ensemble_rows = []
 
-        # Add global rank for selected features
-        features_selected_with_global_rank = features_selected.copy()
-        features_selected_with_global_rank["global_rank"] = range(1, len(features_selected_with_global_rank) + 1)
+        # Create global ranking for ALL numerical features (1 to n, no null values)
+        all_numerical_scores = pd.concat(results, axis=0)
+        all_numerical_scores = all_numerical_scores.groupby("features").agg({
+            "rank": "mean"  # Average rank across all methods
+        }).reset_index()
+        all_numerical_scores.sort_values("rank", inplace=True)
+        all_numerical_scores["global_rank"] = range(1, len(all_numerical_scores) + 1)
 
-        for feature in all_features_in_data:
+        for feature in numerical_features_in_data:
             feature_id = feature_map.get(feature)
             if feature_id:
                 is_selected = feature in features_selected_list
-                global_rank = None
-                if is_selected:
-                    global_rank = features_selected_with_global_rank[
-                        features_selected_with_global_rank["features"] == feature
+
+                # Get global rank (no null values - all features get a rank)
+                if feature in all_numerical_scores["features"].values:
+                    global_rank = all_numerical_scores[
+                        all_numerical_scores["features"] == feature
                     ]["global_rank"].values[0]
+                else:
+                    # Fallback: assign last rank + position for features not in results
+                    global_rank = len(all_numerical_scores) + numerical_features_in_data.index(feature) + 1
 
                 ensemble_rows.append({
                     "feature_selection_id": feature_selection.id,
@@ -353,28 +361,12 @@ class FeatureSelectionEngine:
         )
 
         # Final update for features after max limitation (final selection)
-        logger.info("Finalizing ensemble features with categorical features...")
+        logger.info("Finalizing ensemble features...")
         for row in ensemble_rows:
             feature = Feature.get(row["feature_id"]).name
             if feature in features and row["support"] == 1:
                 row["support"] = 2  # 2 = in final selection
 
-        # Add categorical features to ensemble if not already present
-        if target_type == "classification":
-            for cat_feature in categorical_features_selected:
-                feature_id = feature_map.get(cat_feature)
-                if feature_id and not any(row["feature_id"] == feature_id for row in ensemble_rows):
-                    ensemble_rows.append({
-                        "feature_selection_id": feature_selection.id,
-                        "feature_id": feature_id,
-                        "method": "ensemble",
-                        "score": None,
-                        "pvalue": None,
-                        "support": 2,  # 2 = in final selection (categorical)
-                        "rank": None,  # No rank for categorical features added at the end
-                        "training_time": 0,
-                    })
-
         # Re-save all ensemble data with updated support values
         FeatureSelectionRank.bulk_upsert(rows=ensemble_rows)
         logger.debug(
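The global ranking introduced above is a mean-rank aggregation over the per-method result frames; a small sketch with made-up method tables standing in for `results`:

    import pandas as pd

    # Two hypothetical selection methods ranking the same features
    results = [
        pd.DataFrame({"features": ["f1", "f2", "f3"], "rank": [1, 2, 3]}),
        pd.DataFrame({"features": ["f1", "f2", "f3"], "rank": [2, 1, 3]}),
    ]

    scores = pd.concat(results, axis=0).groupby("features").agg({"rank": "mean"}).reset_index()
    scores.sort_values("rank", inplace=True)
    scores["global_rank"] = range(1, len(scores) + 1)
    print(scores)  # dense 1..n ranking with no nulls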
@@ -55,8 +55,7 @@ from tensorboardX import SummaryWriter
 
 # Optimization
 import ray
-from ray.tune import Tuner, TuneConfig, with_parameters
-from ray.train import RunConfig
+from ray.tune import Tuner, TuneConfig, with_parameters, RunConfig
 from ray.tune.search.hyperopt import HyperOptSearch
 from ray.tune.search.bayesopt import BayesOptSearch
 from ray.tune.logger import TBXLoggerCallback
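The import move tracks Ray's API: recent Ray releases re-export RunConfig from ray.tune, while older ones ship it in ray.train. If supporting both were needed, a defensive sketch (an assumption, not lecrapaud's code) could be:

    try:
        from ray.tune import RunConfig  # newer Ray
    except ImportError:
        from ray.train import RunConfig  # older Ray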
@@ -1357,8 +1356,12 @@ class ModelSelectionEngine:
         """Choose between Ray Tune and HyperOpt standalone based on configuration."""
         if LECRAPAUD_OPTIMIZATION_BACKEND == "hyperopt":
             return self.hyperoptimize_hyperopt(x_train, y_train, x_val, y_val, model)
-        else:
+        elif LECRAPAUD_OPTIMIZATION_BACKEND == "ray":
             return self.hyperoptimize_ray(x_train, y_train, x_val, y_val, model)
+        else:
+            raise ValueError(
+                f"Invalid optimization backend: {LECRAPAUD_OPTIMIZATION_BACKEND}."
+            )
 
     def hyperoptimize_hyperopt(
         self, x_train, y_train, x_val, y_val, model: ModelEngine
@@ -1746,11 +1749,11 @@ def evaluate(
         y_pred_proba = (
             prediction[1] if num_classes == 2 else prediction.iloc[:, 2:].values
         )
-        if num_classes > 2:
-            lb = LabelBinarizer(sparse_output=False)  # Change to True for sparse matrix
-            lb.fit(labels)
-            y_true_onhot = lb.transform(y_true)
-            y_pred_onehot = lb.transform(y_pred)
+        # if num_classes > 2:
+        #     lb = LabelBinarizer(sparse_output=False)  # Change to True for sparse matrix
+        #     lb.fit(labels)
+        #     y_true_onhot = lb.transform(y_true)
+        #     y_pred_onehot = lb.transform(y_pred)
 
         score["LOGLOSS"] = log_loss(y_true, y_pred_proba)
         score["ACCURACY"] = accuracy_score(y_true, y_pred)
@@ -1,6 +1,6 @@
 [project]
 name = "lecrapaud"
-version = "0.21.0"
+version = "0.21.2"
 description = "Framework for machine and deep learning, with regression, classification and time series analysis"
 authors = [
     {name = "Pierre H. Gallet"}