warpgbm 0.1.27__tar.gz → 1.0.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {warpgbm-0.1.27/warpgbm.egg-info → warpgbm-1.0.0}/PKG-INFO +94 -20
- {warpgbm-0.1.27 → warpgbm-1.0.0}/README.md +94 -20
- {warpgbm-0.1.27 → warpgbm-1.0.0}/pyproject.toml +1 -1
- warpgbm-1.0.0/tests/test_invariant.py +100 -0
- warpgbm-1.0.0/version.txt +1 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm/core.py +30 -32
- warpgbm-1.0.0/warpgbm/cuda/best_split_kernel.cu +89 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm/cuda/histogram_kernel.cu +24 -15
- {warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm/cuda/node_kernel.cpp +9 -8
- {warpgbm-0.1.27 → warpgbm-1.0.0/warpgbm.egg-info}/PKG-INFO +94 -20
- {warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm.egg-info/SOURCES.txt +1 -0
- warpgbm-0.1.27/version.txt +0 -1
- warpgbm-0.1.27/warpgbm/cuda/best_split_kernel.cu +0 -79
- {warpgbm-0.1.27 → warpgbm-1.0.0}/LICENSE +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/MANIFEST.in +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/setup.cfg +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/setup.py +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/tests/__init__.py +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/tests/full_numerai_test.py +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/tests/numerai_test.py +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/tests/test_fit_predict_corr.py +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm/__init__.py +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm/cuda/__init__.py +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm/cuda/binner.cu +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm/cuda/predict.cu +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm/metrics.py +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm.egg-info/dependency_links.txt +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm.egg-info/requires.txt +0 -0
- {warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm.egg-info/top_level.txt +0 -0
{warpgbm-0.1.27/warpgbm.egg-info → warpgbm-1.0.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: warpgbm
-Version: 0.1.27
+Version: 1.0.0
 Summary: A fast GPU-accelerated Gradient Boosted Decision Tree library with PyTorch + CUDA
 License: GNU GENERAL PUBLIC LICENSE
         Version 3, 29 June 2007
@@ -686,21 +686,46 @@ Requires-Dist: tqdm
 Requires-Dist: scikit-learn
 Dynamic: license-file
 
-![…](…)
 
 # WarpGBM
 
 WarpGBM is a high-performance, GPU-accelerated Gradient Boosted Decision Tree (GBDT) library built with PyTorch and CUDA. It offers blazing-fast histogram-based training and efficient prediction, with compatibility for research and production workflows.
 
+**New in v1.0.0:** WarpGBM introduces *Invariant Gradient Boosting* — a powerful approach to learning signals that remain stable across shifting environments (e.g., time, regimes, or datasets). Powered by a novel algorithm called **[Directional Era-Splitting (DES)](https://arxiv.org/abs/2309.14496)**, WarpGBM doesn't just train faster than other leading GBDT libraries — it trains smarter.
+
+If your data evolves over time, WarpGBM is the only GBDT library designed to *adapt and generalize*.
 ---
 
+## Contents
+
+- [Features](#features)
+- [Benchmarks](#benchmarks)
+- [Installation](#installation)
+- [Learning Invariant Signals Across Environments](#learning-invariant-signals-across-environments)
+- [Why This Matters](#why-this-matters)
+- [Visual Intuition](#visual-intuition)
+- [Key References](#key-references)
+- [Examples](#examples)
+- [Quick Comparison with LightGBM CPU version](#quick-comparison-with-lightgbm-cpu-version)
+- [Pre-binned Data Example (Numerai)](#pre-binned-data-example-numerai)
+- [Documentation](#documentation)
+- [Acknowledgements](#acknowledgements)
+- [Version Notes](#version-notes)
+
+
 ## Features
 
-- …
-- …
-- …
-- …
+- **Blazing-fast GPU training** with custom CUDA kernels for binning, histogram building, split finding, and prediction
+- **Invariant signal learning** via [Directional Era-Splitting (DES)](https://arxiv.org/abs/2309.14496) — designed for datasets with shifting environments (e.g., time, regimes, experimental settings)
+- Drop-in **scikit-learn style interface** for easy adoption
+- Supports **pre-binned data** or **automatic quantile binning**
+- Works with `float32` or `int8` inputs
+- Built-in **validation and early stopping** support with MSE, RMSLE, or correlation metrics
+- Simple install with `pip`, no custom drivers required
+
+> 💡 **Note:** WarpGBM v1.0.0 is a *generalization* of the traditional GBDT algorithm.
+> To run standard GBM training at maximum speed, simply omit the `era_id` argument — WarpGBM will behave like a traditional booster but with industry-leading performance.
 
 ---
 
@@ -762,7 +787,62 @@ Before either method, make sure you’ve installed PyTorch with GPU support:\
 
 ---
 
-## …
+## Learning Invariant Signals Across Environments
+
+Most supervised learning models rely on an assumption known as the **Empirical Risk Minimization (ERM)** principle. Under ERM, the data distribution connecting inputs \( X \) and targets \( Y \) is assumed to be **fixed** and **stationary** across training, validation, and test splits. That is:
+
+> The patterns you learn from the training set are expected to generalize out-of-sample — *as long as the test data follows the same distribution as the training data.*
+
+However, this assumption is often violated in real-world settings. Data frequently shifts across time, geography, experimental conditions, or other hidden factors. This phenomenon is known as **distribution shift**, and it leads to models that perform well in-sample but fail catastrophically out-of-sample.
+
+This challenge motivates the field of **Out-of-Distribution (OOD) Generalization**, which assumes your data is drawn from **distinct environments or eras** — e.g., time periods, customer segments, experimental trials. Some signals may appear predictive within specific environments but vanish or reverse in others. These are called **spurious signals**. On the other hand, signals that remain consistently predictive across all environments are called **invariant signals**.
+
+WarpGBM v1.0.0 introduces **Directional Era-Splitting (DES)**, a new algorithm designed to identify and learn from invariant signals — ignoring signals that fail to generalize across environments.
+
+---
+
+### Why This Matters
+
+- Standard models trained via ERM can learn to exploit **spurious correlations** that only hold in some parts of the data.
+- DES explicitly tests whether a feature's split is **directionally consistent** across all eras — only such *invariant splits* are kept.
+- This approach has been shown to reduce overfitting and improve out-of-sample generalization, particularly in financial and scientific datasets.
+
+---
+
+### Visual Intuition
+
+We contrast two views of the data:
+
+- **ERM Setting**: All data is assumed to come from the same source (single distribution).\
+No awareness of environments — spurious signals can dominate.
+
+- **OOD Setting (Era-Splitting)**: Data is explicitly grouped by environment (era).\
+The model checks whether a signal holds across all groups — enforcing **robustness**.
+
+*📷 [Placeholder for future visual illustration]*
+
+
+---
+
+### Key References
+
+- **Invariant Risk Minimization (IRM)**: [Arjovsky et al., 2019](https://arxiv.org/abs/1907.02893)
+- **Learning Explanations That Are Hard to Vary**: [Parascandolo et al., 2020](https://arxiv.org/abs/2009.00329)
+- **Era Splitting: Invariant Learning for Decision Trees**: [DeLise, 2023](https://arxiv.org/abs/2309.14496)
+
+---
+
+WarpGBM is the **first open-source GBDT framework to integrate this OOD-aware approach natively**, using efficient CUDA kernels to evaluate per-era consistency during tree growth. It’s not just faster — it’s smarter.
+
+---
+
+## Examples
+
+WarpGBM is easy to drop into any supervised learning workflow and comes with curated examples in the `examples/` folder.
+
+- `Spiral Data.ipynb`: synthetic OOD benchmark from Learning Explanations That Are Hard to Vary
+
+### Quick Comparison with LightGBM CPU version
 
 ```python
 import numpy as np
@@ -804,7 +884,7 @@ WarpGBM: corr = 0.8621, time = 5.40s
 
 ---
 
-…
+### Pre-binned Data Example (Numerai)
 
 WarpGBM can save additional training time if your dataset is already pre-binned. The Numerai tournament data is a great example:
 
@@ -854,16 +934,6 @@ WarpGBM: corr = 0.0660, time = 49.16s
 
 ---
 
-### Run it live in Colab
-
-You can try WarpGBM in a live Colab notebook using real pre-binned Numerai tournament data:
-
-[Open in Colab](https://colab.research.google.com/drive/10mKSjs9UvmMgM5_lOXAylq5LUQAnNSi7?usp=sharing)
-
-No installation required — just press **"Open in Playground"**, then **Run All**!
-
----
-
 ## Documentation
 
 ### `WarpGBM` Parameters:
@@ -889,7 +959,7 @@ No installation required — just press **"Open in Playground"**, then **Run All
     y_eval=None,                # numpy array (float or int) 1 dimension (eval_num_samples)
     eval_every_n_trees=None,    # const (int) >= 1
     early_stopping_rounds=None, # const (int) >= 1
-    eval_metric='mse'           # string, one of 'mse' or 'corr'. For corr, loss is 1 - correlation(y_true, preds)
+    eval_metric='mse'           # string, one of 'mse', 'rmsle' or 'corr'. For corr, loss is 1 - correlation(y_true, preds)
 )
 ```
 Train with optional validation set and early stopping.
@@ -927,3 +997,7 @@ WarpGBM builds on the shoulders of PyTorch, scikit-learn, LightGBM, and the CUDA
 ### v0.1.26
 
 - Fix Memory bugs in prediction and colsample bytree logic. Added "corr" eval metric.
+
+### v1.0.0
+
+- Introduce invariant learning via directional era-splitting (DES). Also streamlines VRAM usage compared to previous sub-versions.
{warpgbm-0.1.27 → warpgbm-1.0.0}/README.md

@@ -1,18 +1,43 @@
-![…](…)
 
 # WarpGBM
 
 WarpGBM is a high-performance, GPU-accelerated Gradient Boosted Decision Tree (GBDT) library built with PyTorch and CUDA. It offers blazing-fast histogram-based training and efficient prediction, with compatibility for research and production workflows.
 
+**New in v1.0.0:** WarpGBM introduces *Invariant Gradient Boosting* — a powerful approach to learning signals that remain stable across shifting environments (e.g., time, regimes, or datasets). Powered by a novel algorithm called **[Directional Era-Splitting (DES)](https://arxiv.org/abs/2309.14496)**, WarpGBM doesn't just train faster than other leading GBDT libraries — it trains smarter.
+
+If your data evolves over time, WarpGBM is the only GBDT library designed to *adapt and generalize*.
 ---
 
+## Contents
+
+- [Features](#features)
+- [Benchmarks](#benchmarks)
+- [Installation](#installation)
+- [Learning Invariant Signals Across Environments](#learning-invariant-signals-across-environments)
+- [Why This Matters](#why-this-matters)
+- [Visual Intuition](#visual-intuition)
+- [Key References](#key-references)
+- [Examples](#examples)
+- [Quick Comparison with LightGBM CPU version](#quick-comparison-with-lightgbm-cpu-version)
+- [Pre-binned Data Example (Numerai)](#pre-binned-data-example-numerai)
+- [Documentation](#documentation)
+- [Acknowledgements](#acknowledgements)
+- [Version Notes](#version-notes)
+
+
 ## Features
 
-- …
-- …
-- …
-- …
+- **Blazing-fast GPU training** with custom CUDA kernels for binning, histogram building, split finding, and prediction
+- **Invariant signal learning** via [Directional Era-Splitting (DES)](https://arxiv.org/abs/2309.14496) — designed for datasets with shifting environments (e.g., time, regimes, experimental settings)
+- Drop-in **scikit-learn style interface** for easy adoption
+- Supports **pre-binned data** or **automatic quantile binning**
+- Works with `float32` or `int8` inputs
+- Built-in **validation and early stopping** support with MSE, RMSLE, or correlation metrics
+- Simple install with `pip`, no custom drivers required
+
+> 💡 **Note:** WarpGBM v1.0.0 is a *generalization* of the traditional GBDT algorithm.
+> To run standard GBM training at maximum speed, simply omit the `era_id` argument — WarpGBM will behave like a traditional booster but with industry-leading performance.
 
 ---
 
@@ -74,7 +99,62 @@ Before either method, make sure you’ve installed PyTorch with GPU support:\
 
 ---
 
-## …
+## Learning Invariant Signals Across Environments
+
+Most supervised learning models rely on an assumption known as the **Empirical Risk Minimization (ERM)** principle. Under ERM, the data distribution connecting inputs \( X \) and targets \( Y \) is assumed to be **fixed** and **stationary** across training, validation, and test splits. That is:
+
+> The patterns you learn from the training set are expected to generalize out-of-sample — *as long as the test data follows the same distribution as the training data.*
+
+However, this assumption is often violated in real-world settings. Data frequently shifts across time, geography, experimental conditions, or other hidden factors. This phenomenon is known as **distribution shift**, and it leads to models that perform well in-sample but fail catastrophically out-of-sample.
+
+This challenge motivates the field of **Out-of-Distribution (OOD) Generalization**, which assumes your data is drawn from **distinct environments or eras** — e.g., time periods, customer segments, experimental trials. Some signals may appear predictive within specific environments but vanish or reverse in others. These are called **spurious signals**. On the other hand, signals that remain consistently predictive across all environments are called **invariant signals**.
+
+WarpGBM v1.0.0 introduces **Directional Era-Splitting (DES)**, a new algorithm designed to identify and learn from invariant signals — ignoring signals that fail to generalize across environments.
+
+---
+
+### Why This Matters
+
+- Standard models trained via ERM can learn to exploit **spurious correlations** that only hold in some parts of the data.
+- DES explicitly tests whether a feature's split is **directionally consistent** across all eras — only such *invariant splits* are kept.
+- This approach has been shown to reduce overfitting and improve out-of-sample generalization, particularly in financial and scientific datasets.
+
+---
+
+### Visual Intuition
+
+We contrast two views of the data:
+
+- **ERM Setting**: All data is assumed to come from the same source (single distribution).\
+No awareness of environments — spurious signals can dominate.
+
+- **OOD Setting (Era-Splitting)**: Data is explicitly grouped by environment (era).\
+The model checks whether a signal holds across all groups — enforcing **robustness**.
+
+*📷 [Placeholder for future visual illustration]*
+
+
+---
+
+### Key References
+
+- **Invariant Risk Minimization (IRM)**: [Arjovsky et al., 2019](https://arxiv.org/abs/1907.02893)
+- **Learning Explanations That Are Hard to Vary**: [Parascandolo et al., 2020](https://arxiv.org/abs/2009.00329)
+- **Era Splitting: Invariant Learning for Decision Trees**: [DeLise, 2023](https://arxiv.org/abs/2309.14496)
+
+---
+
+WarpGBM is the **first open-source GBDT framework to integrate this OOD-aware approach natively**, using efficient CUDA kernels to evaluate per-era consistency during tree growth. It’s not just faster — it’s smarter.
+
+---
+
+## Examples
+
+WarpGBM is easy to drop into any supervised learning workflow and comes with curated examples in the `examples/` folder.
+
+- `Spiral Data.ipynb`: synthetic OOD benchmark from Learning Explanations That Are Hard to Vary
+
+### Quick Comparison with LightGBM CPU version
 
 ```python
 import numpy as np
@@ -116,7 +196,7 @@ WarpGBM: corr = 0.8621, time = 5.40s
 
 ---
 
-…
+### Pre-binned Data Example (Numerai)
 
 WarpGBM can save additional training time if your dataset is already pre-binned. The Numerai tournament data is a great example:
 
@@ -166,16 +246,6 @@ WarpGBM: corr = 0.0660, time = 49.16s
 
 ---
 
-### Run it live in Colab
-
-You can try WarpGBM in a live Colab notebook using real pre-binned Numerai tournament data:
-
-[Open in Colab](https://colab.research.google.com/drive/10mKSjs9UvmMgM5_lOXAylq5LUQAnNSi7?usp=sharing)
-
-No installation required — just press **"Open in Playground"**, then **Run All**!
-
----
-
 ## Documentation
 
 ### `WarpGBM` Parameters:
@@ -201,7 +271,7 @@ No installation required — just press **"Open in Playground"**, then **Run All
     y_eval=None,                # numpy array (float or int) 1 dimension (eval_num_samples)
     eval_every_n_trees=None,    # const (int) >= 1
     early_stopping_rounds=None, # const (int) >= 1
-    eval_metric='mse'           # string, one of 'mse' or 'corr'. For corr, loss is 1 - correlation(y_true, preds)
+    eval_metric='mse'           # string, one of 'mse', 'rmsle' or 'corr'. For corr, loss is 1 - correlation(y_true, preds)
 )
 ```
 Train with optional validation set and early stopping.
@@ -238,4 +308,8 @@ WarpGBM builds on the shoulders of PyTorch, scikit-learn, LightGBM, and the CUDA
 
 ### v0.1.26
 
-- Fix Memory bugs in prediction and colsample bytree logic. Added "corr" eval metric.
+- Fix Memory bugs in prediction and colsample bytree logic. Added "corr" eval metric.
+
+### v1.0.0
+
+- Introduce invariant learning via directional era-splitting (DES). Also streamlines VRAM usage compared to previous sub-versions.
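A minimal end-to-end sketch of the workflow the README above documents. Constructor arguments follow `tests/test_invariant.py` below and the `fit` signature follows the Documentation section; the synthetic arrays, sizes, and hyperparameter values are illustrative assumptions, and a CUDA-capable GPU is required.

```python
import numpy as np
from warpgbm import WarpGBM

# Illustrative synthetic data: 10 eras of 1,000 rows each (assumed shapes, not from the release).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20)).astype(np.float32)
y = rng.normal(size=10_000).astype(np.float32)
era = np.repeat(np.arange(10, dtype=np.int32), 1_000)

model = WarpGBM(
    max_depth=6,
    num_bins=127,
    n_estimators=100,
    learning_rate=0.1,
    colsample_bytree=0.9,
)

# Passing era_id enables Directional Era-Splitting (DES); omitting it falls back
# to standard GBDT behaviour, per the README note above.
model.fit(X, y, era_id=era)
preds = model.predict(X)
```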
warpgbm-1.0.0/tests/test_invariant.py
ADDED
@@ -0,0 +1,100 @@
+import numpy as np
+from warpgbm import WarpGBM
+import time
+
+import os
+import requests
+
+def download_file_if_missing(url, local_dir):
+    filename = os.path.basename(url)
+    local_path = os.path.join(local_dir, filename)
+
+    if os.path.exists(local_path):
+        print(f"✅ {filename} already exists, skipping download.")
+        return
+
+    # Convert GitHub blob URL to raw URL
+    raw_url = url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")
+
+    print(f"⬇️ Downloading {filename}...")
+    response = requests.get(raw_url)
+    response.raise_for_status()
+
+    os.makedirs(local_dir, exist_ok=True)
+    with open(local_path, "wb") as f:
+        f.write(response.content)
+    print(f"✅ Saved to {local_path}")
+
+# === Usage ===
+
+urls = [
+    "https://github.com/jefferythewind/era-splitting-notebook-examples/blob/main/Synthetic%20Memorization%20Data%20Set/X_train.npy",
+    "https://github.com/jefferythewind/era-splitting-notebook-examples/blob/main/Synthetic%20Memorization%20Data%20Set/y_train.npy",
+    "https://github.com/jefferythewind/era-splitting-notebook-examples/blob/main/Synthetic%20Memorization%20Data%20Set/X_test.npy",
+    "https://github.com/jefferythewind/era-splitting-notebook-examples/blob/main/Synthetic%20Memorization%20Data%20Set/y_test.npy",
+    "https://github.com/jefferythewind/era-splitting-notebook-examples/blob/main/Synthetic%20Memorization%20Data%20Set/X_eras.npy",
+]
+
+
+local_folder = "./synthetic_data"
+
+for url in urls:
+    download_file_if_missing(url, local_folder)
+
+def test_fit_predictpytee_correlation():
+    import numpy as np
+    import os
+    from warpgbm import WarpGBM
+    from sklearn.metrics import mean_squared_error
+    import time
+
+    # Load the real dataset from local .npy files
+    data_dir = "./synthetic_data"
+    X = np.load(os.path.join(data_dir, "X_train.npy"))
+    y = np.load(os.path.join(data_dir, "y_train.npy"))
+    # era = np.zeros(X.shape[0], dtype=np.int32)  # one era for default GBDT equivalence
+    era = np.load(os.path.join(data_dir, "X_eras.npy"))
+
+    X_test = np.load(os.path.join(data_dir, "X_test.npy"))
+    y_test = np.load(os.path.join(data_dir, "y_test.npy"))
+
+    print(f"X shape: {X.shape}, y shape: {y.shape}")
+
+    model = WarpGBM(
+        max_depth=10,
+        num_bins=127,
+        n_estimators=50,
+        learning_rate=1,
+        threads_per_block=128,
+        rows_per_thread=4,
+        colsample_bytree=0.9,
+        min_child_weight=4
+    )
+
+    start_fit = time.time()
+    model.fit(
+        X,
+        y,
+        era_id=era,
+        X_eval=X_test,
+        y_eval=y_test,
+        eval_every_n_trees=10,
+        early_stopping_rounds=100,
+        eval_metric="corr",
+    )
+    fit_time = time.time() - start_fit
+    print(f" Fit time: {fit_time:.3f} seconds")
+
+    start_pred = time.time()
+    preds = model.predict(X_test)
+    pred_time = time.time() - start_pred
+    print(f" Predict time: {pred_time:.3f} seconds")
+
+    corr = np.corrcoef(preds, y_test)[0, 1]
+    mse = mean_squared_error(preds, y_test)
+    print(f" Correlation: {corr:.4f}")
+    print(f" MSE: {mse:.4f}")
+
+    assert corr > 0.95, f"Out-of-sample correlation too low: {corr:.4f}"
+    assert mse < 0.02, f"Out-of-sample MSE too high: {mse:.4f}"
+
warpgbm-1.0.0/version.txt
ADDED
@@ -0,0 +1 @@
+1.0.0
{warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm/core.py

@@ -219,10 +219,12 @@ class WarpGBM(BaseEstimator, RegressorMixin):
             era_id = np.ones(X.shape[0], dtype="int32")
 
         # Train data preprocessing
-        self.bin_indices, era_indices, self.bin_edges, self.unique_eras, self.Y_gpu = (
+        self.bin_indices, self.era_indices, self.bin_edges, self.unique_eras, self.Y_gpu = (
            self.preprocess_gpu_data(X, y, era_id)
        )
        self.num_samples, self.num_features = X.shape
+        self.num_eras = len(self.unique_eras)
+        self.era_indices = self.era_indices.to(dtype=torch.int32)
        self.gradients = torch.zeros_like(self.Y_gpu)
        self.root_node_indices = torch.arange(self.num_samples, device=self.device, dtype=torch.int32)
        self.base_prediction = self.Y_gpu.mean().item()
@@ -231,8 +233,6 @@ class WarpGBM(BaseEstimator, RegressorMixin):
            k = max(1, int(self.colsample_bytree * self.num_features))
        else:
            k = self.num_features
-        self.best_gains = torch.zeros(k, device=self.device)
-        self.best_bins = torch.zeros(k, device=self.device, dtype=torch.int32)
        self.feature_indices = torch.arange(self.num_features, device=self.device, dtype=torch.int32)
 
        # ─── Optional Eval Set ───
@@ -275,9 +275,7 @@ class WarpGBM(BaseEstimator, RegressorMixin):
        max_vals = X_np.max(axis=0)
 
        if is_integer_type and np.all(max_vals < self.num_bins):
-            print(
-                "Detected pre-binned integer input — skipping quantile binning."
-            )
+            print("Detected pre-binned integer input — skipping quantile binning.")
            for f in range(self.num_features):
                bin_indices[:,f] = torch.as_tensor( X_np[:, f], device=self.device).contiguous()
            # bin_indices = X_np.to("cuda", non_blocking=True).contiguous()
@@ -319,10 +317,10 @@ class WarpGBM(BaseEstimator, RegressorMixin):
 
    def compute_histograms(self, sample_indices, feature_indices):
        grad_hist = torch.zeros(
-            (len(feature_indices), self.num_bins), device=self.device, dtype=torch.float32
+            ( self.num_eras, len(feature_indices), self.num_bins), device=self.device, dtype=torch.float32
        )
        hess_hist = torch.zeros(
-            (len(feature_indices), self.num_bins), device=self.device, dtype=torch.float32
+            ( self.num_eras, len(feature_indices), self.num_bins), device=self.device, dtype=torch.float32
        )
 
        node_kernel.compute_histogram3(
@@ -330,6 +328,7 @@ class WarpGBM(BaseEstimator, RegressorMixin):
            self.residual,
            sample_indices,
            feature_indices,
+            self.era_indices,
            grad_hist,
            hess_hist,
            self.num_bins,
@@ -345,21 +344,30 @@ class WarpGBM(BaseEstimator, RegressorMixin):
            self.min_split_gain,
            self.min_child_weight,
            self.L2_reg,
-            self.best_gains,
-            self.best_bins,
-            self.threads_per_block
+            self.per_era_gain,
+            self.per_era_direction,
+            self.threads_per_block
        )
 
-        if …
-        …
+        if self.num_eras == 1:
+            era_splitting_criterion = self.per_era_gain[0,:,:]  # [F, B-1]
+            dir_score_mask = era_splitting_criterion > self.min_split_gain
+        else:
+            directional_agreement = self.per_era_direction.mean(dim=0).abs()  # [F, B-1]
+            era_splitting_criterion = self.per_era_gain.mean(dim=0)  # [F, B-1]
+            dir_score_mask = ( directional_agreement == directional_agreement.max() ) & (era_splitting_criterion > self.min_split_gain)
+
+        if not dir_score_mask.any():
+            return -1, -1
 
-        …
-        …
+        era_splitting_criterion[dir_score_mask == 0] = float("-inf")
+        best_idx = torch.argmax(era_splitting_criterion)  # index of flattened tensor
+        split_bins = self.num_bins - 1
+        best_feature = best_idx // split_bins
+        best_bin = best_idx % split_bins
 
-        …
-        b = self.best_bins[f].item()
+        return best_feature.item(), best_bin.item()
 
-        return f, b
 
    def grow_tree(self, gradient_histogram, hessian_histogram, node_indices, depth):
        if depth == self.max_depth:
@@ -372,29 +380,15 @@ class WarpGBM(BaseEstimator, RegressorMixin):
            gradient_histogram, hessian_histogram
        )
 
-        # print(local_feature, best_bin)
-
        if local_feature == -1:
            leaf_value = self.residual[node_indices].mean()
            self.gradients[node_indices] += self.learning_rate * leaf_value
            return {"leaf_value": leaf_value.item(), "samples": parent_size}
 
-        # print("DEBUG SHAPES -> bin_indices:", self.bin_indices.shape,
-        #       "| node_indices max:", node_indices.max().item(),
-        #       "| local_feature:", local_feature,
-        #       "| feat_indices_tree len:", len(self.feat_indices_tree),
-        #       "| feat index:", self.feat_indices_tree[local_feature])
-
        split_mask = self.bin_indices[node_indices, self.feat_indices_tree[local_feature]] <= best_bin
        left_indices = node_indices[split_mask]
        right_indices = node_indices[~split_mask]
 
-        # print("DEBUG SHAPES -> left_indices:", left_indices.shape,
-        #       "| right_indices:", right_indices.shape,
-        #       "| parent_size:", parent_size,
-        #       "| local_feature:", local_feature,
-        #       "| best_bin:", best_bin)
-
        left_size = left_indices.numel()
        right_size = right_indices.numel()
 
@@ -463,6 +457,10 @@ class WarpGBM(BaseEstimator, RegressorMixin):
            k = max(1, int(self.colsample_bytree * self.num_features))
        else:
            self.feat_indices_tree = self.feature_indices
+            k = self.num_features
+
+        self.per_era_gain = torch.zeros(self.num_eras, k, self.num_bins-1, device=self.device, dtype=torch.float32)
+        self.per_era_direction = torch.zeros(self.num_eras, k, self.num_bins-1, device=self.device, dtype=torch.float32)
 
        for i in range(self.n_estimators):
            self.residual = self.Y_gpu - self.gradients
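As a standalone restatement of the split-selection logic added to `find_best_split` above, the following PyTorch sketch applies the same directional-agreement mask to dummy per-era tensors; the tensor sizes and `min_split_gain` value are invented for illustration.

```python
import torch

E, F, B = 4, 3, 8            # eras, features, bins (illustrative sizes)
min_split_gain = 0.0

per_era_gain = torch.rand(E, F, B - 1)                      # gain of each candidate split, per era
per_era_direction = torch.sign(torch.randn(E, F, B - 1))    # +1 / -1 split direction, per era

# Same criterion as core.py: keep only splits whose direction agreement across eras
# is maximal and whose mean gain clears the threshold.
directional_agreement = per_era_direction.mean(dim=0).abs()  # [F, B-1]
era_splitting_criterion = per_era_gain.mean(dim=0)           # [F, B-1]
mask = (directional_agreement == directional_agreement.max()) & (
    era_splitting_criterion > min_split_gain
)

if not mask.any():
    best_feature, best_bin = -1, -1
else:
    era_splitting_criterion[~mask] = float("-inf")
    best_idx = torch.argmax(era_splitting_criterion)   # flattened [F * (B-1)] index
    best_feature = (best_idx // (B - 1)).item()
    best_bin = (best_idx % (B - 1)).item()

print(best_feature, best_bin)
```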
warpgbm-1.0.0/warpgbm/cuda/best_split_kernel.cu
ADDED
@@ -0,0 +1,89 @@
+#include <torch/extension.h>
+#include <cuda.h>
+#include <cuda_runtime.h>
+
+__global__ void directional_split_kernel(
+    const float *__restrict__ G, // [E * F * B]
+    const float *__restrict__ H, // [E * F * B]
+    int E, int F, int B,
+    float min_split_gain,
+    float min_child_samples,
+    float eps,
+    float *__restrict__ per_era_gain,      // [E * F * (B-1)]
+    float *__restrict__ per_era_direction  // [E * F * (B-1)]
+)
+{
+    int f = blockIdx.x * blockDim.x + threadIdx.x; // feature index
+    int e = blockIdx.y;                            // era index
+
+    if (f >= F || e >= E) return;
+
+    // Access base offset for this (era, feature)
+    int base = e * F * B + f * B;
+    int base_gain = e * F * (B - 1) + f * (B - 1);
+
+    float G_total = 0.0f, H_total = 0.0f;
+    for (int b = 0; b < B; ++b) {
+        G_total += G[base + b];
+        H_total += H[base + b];
+    }
+
+    float G_L = 0.0f, H_L = 0.0f;
+    for (int b = 0; b < B - 1; ++b) {
+        G_L += G[base + b];
+        H_L += H[base + b];
+
+        float G_R = G_total - G_L;
+        float H_R = H_total - H_L;
+
+        float gain = 0.0f;
+        float dir = 0.0f;
+
+        if (H_L >= min_child_samples && H_R >= min_child_samples) {
+            gain = (G_L * G_L) / (H_L + eps)
+                 + (G_R * G_R) / (H_R + eps)
+                 - (G_total * G_total) / (H_total + eps);
+
+            float left_val = G_L / (H_L + eps);
+            float right_val = G_R / (H_R + eps);
+            dir = (left_val > right_val) ? 1.0f : -1.0f;
+        }
+
+        per_era_gain[base_gain + b] = gain;
+        per_era_direction[base_gain + b] = dir;
+    }
+}
+
+void launch_directional_split_kernel(
+    const at::Tensor &G, // [E, F, B]
+    const at::Tensor &H, // [E, F, B]
+    float min_split_gain,
+    float min_child_samples,
+    float eps,
+    at::Tensor &per_era_gain,      // [E, F, B]
+    at::Tensor &per_era_direction, // [E, F, B]
+    int threads = 128)
+{
+    int E = G.size(0);
+    int F = G.size(1);
+    int B = G.size(2);
+
+    dim3 blocks((F + threads - 1) / threads, E); // (feature blocks, era grid)
+    dim3 thread_dims(threads);
+
+    directional_split_kernel<<<blocks, thread_dims>>>(
+        G.data_ptr<float>(),
+        H.data_ptr<float>(),
+        E, F, B,
+        min_split_gain,
+        min_child_samples,
+        eps,
+        per_era_gain.data_ptr<float>(),
+        per_era_direction.data_ptr<float>());
+
+    cudaError_t err = cudaGetLastError();
+    if (err != cudaSuccess) {
+        printf("Directional split kernel launch failed: %s\n", cudaGetErrorString(err));
+    }
+}
+
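For readers following the kernel above, this NumPy sketch restates the same per-(era, feature) gain and direction arithmetic on the CPU; the bin count, `eps`, `min_child_samples`, and histogram values are illustrative assumptions.

```python
import numpy as np

B = 6                       # number of bins (illustrative)
eps = 1e-6
min_child_samples = 1.0

G = np.random.randn(B).astype(np.float32)        # gradient histogram for one (era, feature)
H = (np.random.rand(B) * 10).astype(np.float32)  # hessian/count histogram

G_total, H_total = G.sum(), H.sum()
gain = np.zeros(B - 1, dtype=np.float32)
direction = np.zeros(B - 1, dtype=np.float32)

G_L = H_L = 0.0
for b in range(B - 1):                            # candidate split after bin b
    G_L += G[b]
    H_L += H[b]
    G_R, H_R = G_total - G_L, H_total - H_L
    if H_L >= min_child_samples and H_R >= min_child_samples:
        gain[b] = (G_L**2) / (H_L + eps) + (G_R**2) / (H_R + eps) - (G_total**2) / (H_total + eps)
        # Direction records which child would receive the larger leaf value in this era.
        direction[b] = 1.0 if G_L / (H_L + eps) > G_R / (H_R + eps) else -1.0

print(gain, direction)
```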
{warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm/cuda/histogram_kernel.cu

@@ -3,13 +3,14 @@
 #include <torch/extension.h>
 
 __global__ void histogram_tiled_configurable_kernel(
-    const int8_t *__restrict__ bin_indices,      // [N, F]
+    const int8_t *__restrict__ bin_indices,      // [N, F_master]
     const float *__restrict__ residuals,         // [N]
     const int32_t *__restrict__ sample_indices,  // [N]
     const int32_t *__restrict__ feature_indices, // [F]
+    const int32_t *__restrict__ era_indices,     // [N]
     float *__restrict__ grad_hist,               // [F * B]
     float *__restrict__ hess_hist,               // [F * B]
-    int64_t N, int64_t F, int64_t B,
+    int64_t N, int64_t F_master, int64_t F, int64_t B, int64_t num_eras,
     int rows_per_thread)
 {
     int hist_feat_idx = blockIdx.x;
@@ -17,15 +18,15 @@ __global__ void histogram_tiled_configurable_kernel(
     int row_start = (blockIdx.y * blockDim.x + threadIdx.x) * rows_per_thread;
 
     extern __shared__ float shmem[];
-    float *sh_grad = shmem;
-    float *sh_hess = &sh_grad[B];
+    float *sh_grad = shmem;                  // [num_eras * B]
+    float *sh_hess = &sh_grad[num_eras * B]; // [num_eras * B]
 
     // Initialize shared memory histograms
-    for (int b = threadIdx.x; b < B; b += blockDim.x)
-    {
-        sh_grad[b] = 0.0f;
-        sh_hess[b] = 0.0f;
+    for (int i = threadIdx.x; i < num_eras * B; i += blockDim.x) {
+        sh_grad[i] = 0.0f;
+        sh_hess[i] = 0.0f;
     }
+
     __syncthreads();
 
     // Each thread processes multiple rows
@@ -35,23 +36,28 @@ __global__ void histogram_tiled_configurable_kernel(
         if (row < N)
         {
             int sample = sample_indices[row];
-            int8_t bin = bin_indices[sample * F + feat];
+            int8_t bin = bin_indices[sample * F_master + feat];
+            int32_t era = era_indices[sample];
             if (bin >= 0 && bin < B)
             {
-                atomicAdd(&sh_grad[bin], residuals[sample]);
-                atomicAdd(&sh_hess[bin], 1.0f);
+                atomicAdd(&sh_grad[era * B + bin], residuals[sample]);
+                atomicAdd(&sh_hess[era * B + bin], 1.0f);
             }
         }
     }
     __syncthreads();
 
     // One thread per bin writes results back to global memory
-    for (int b = threadIdx.x; b < B; b += blockDim.x)
+    for (int b = threadIdx.x; b < num_eras * B; b += blockDim.x)
     {
-        int64_t idx = hist_feat_idx * B + b;
+        int e = b / B;
+        int bin = b % B;
+        int64_t idx = e * F * B + hist_feat_idx * B + bin;
+
         atomicAdd(&grad_hist[idx], sh_grad[b]);
         atomicAdd(&hess_hist[idx], sh_hess[b]);
     }
+
 }
 
 void launch_histogram_kernel_cuda_configurable(
@@ -59,6 +65,7 @@ void launch_histogram_kernel_cuda_configurable(
     const at::Tensor &residuals,
     const at::Tensor &sample_indices,
     const at::Tensor &feature_indices,
+    const at::Tensor &era_indices,
     at::Tensor &grad_hist,
     at::Tensor &hess_hist,
     int num_bins,
@@ -75,16 +82,18 @@ void launch_histogram_kernel_cuda_configurable(
 
     dim3 blocks(F, row_tiles); // grid.x = F, grid.y = row_tiles
     dim3 threads(threads_per_block);
-    int shared_mem_bytes = 2 * num_bins * sizeof(float);
+    int num_eras = grad_hist.size(0); // inferred from output tensor
+    int shared_mem_bytes = 2 * num_eras * num_bins * sizeof(float);
 
     histogram_tiled_configurable_kernel<<<blocks, threads, shared_mem_bytes>>>(
         bin_indices.data_ptr<int8_t>(),
         residuals.data_ptr<float>(),
         sample_indices.data_ptr<int32_t>(),
         feature_indices.data_ptr<int32_t>(),
+        era_indices.data_ptr<int32_t>(),
         grad_hist.data_ptr<float>(),
         hess_hist.data_ptr<float>(),
-        N, num_features_master, num_bins,
+        N, num_features_master, F, num_bins, num_eras,
         rows_per_thread);
 
     cudaError_t err = cudaGetLastError();
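For reference, the launcher above sizes dynamic shared memory as two `float` arrays of `num_eras * num_bins` and infers the era count from the output tensor. Below is a small host-side sketch of the matching allocation, mirroring `compute_histograms` in `core.py`; the sizes are illustrative and a CUDA device is assumed.

```python
import torch

num_eras, num_features, num_bins = 4, 16, 127   # illustrative sizes

# Host-side output tensors: one gradient and one hessian histogram per (era, feature, bin),
# matching the [E, F, B] layout the kernel accumulates into.
grad_hist = torch.zeros(num_eras, num_features, num_bins, device="cuda", dtype=torch.float32)
hess_hist = torch.zeros_like(grad_hist)

# Per-block shared memory requested by the launcher: two float arrays of num_eras * num_bins.
shared_mem_bytes = 2 * num_eras * num_bins * 4   # sizeof(float) == 4
print(shared_mem_bytes)                          # 2 * 4 * 127 * 4 = 4064 bytes
```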
{warpgbm-0.1.27 → warpgbm-1.0.0}/warpgbm/cuda/node_kernel.cpp

@@ -3,21 +3,22 @@
 
 // Declare the function from histogram_kernel.cu
 
-void launch_best_split_kernel_cuda(
-    const at::Tensor &G, // [F x B]
-    const at::Tensor &H, // [F x B]
+void launch_directional_split_kernel(
+    const at::Tensor &G, // [E, F, B]
+    const at::Tensor &H, // [E, F, B]
     float min_split_gain,
     float min_child_samples,
     float eps,
-    at::Tensor &best_gains,
-    at::Tensor &best_bins,
-    int threads);
+    at::Tensor &per_era_gain,      // [E, F, B]
+    at::Tensor &per_era_direction, // [E, F, B]
+    int threads = 128);
 
 void launch_histogram_kernel_cuda_configurable(
     const at::Tensor &bin_indices,
-    const at::Tensor &residuals,
+    const at::Tensor &residuals,
     const at::Tensor &sample_indices,
     const at::Tensor &feature_indices,
+    const at::Tensor &era_indices,
     at::Tensor &grad_hist,
     at::Tensor &hess_hist,
     int num_bins,
@@ -40,7 +41,7 @@ void predict_with_forest(
 PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
 {
     m.def("compute_histogram3", &launch_histogram_kernel_cuda_configurable, "Histogram Feature Shared Mem");
-    m.def("compute_split", &launch_best_split_kernel_cuda, "Best Split (CUDA)");
+    m.def("compute_split", &launch_directional_split_kernel, "Best Split (CUDA)");
     m.def("custom_cuda_binner", &launch_bin_column_kernel, "Custom CUDA binning kernel");
     m.def("predict_forest", &predict_with_forest, "CUDA Predictions");
 }
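The `compute_split` binding registered above maps onto `launch_directional_split_kernel`, whose argument order appears in the declaration. The sketch below calls it from Python with dummy `[E, F, B]` histograms; the import path and all sizes are assumptions for illustration (`core.py` refers to the module as `node_kernel`), and a built extension plus a CUDA device are required.

```python
import torch
from warpgbm.cuda import node_kernel  # assumed import path for the compiled extension

E, F, B = 4, 16, 127                                   # illustrative sizes
G = torch.rand(E, F, B, device="cuda")                 # per-era gradient histograms
H = torch.rand(E, F, B, device="cuda") * 10            # per-era hessian/count histograms

per_era_gain = torch.zeros(E, F, B - 1, device="cuda")
per_era_direction = torch.zeros(E, F, B - 1, device="cuda")

# Argument order follows the declaration in node_kernel.cpp:
# (G, H, min_split_gain, min_child_samples, eps, per_era_gain, per_era_direction, threads)
node_kernel.compute_split(G, H, 0.0, 1.0, 1e-6, per_era_gain, per_era_direction, 128)
```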
{warpgbm-0.1.27 → warpgbm-1.0.0/warpgbm.egg-info}/PKG-INFO
Same changes as the PKG-INFO diff shown at the top of this listing (version bump to 1.0.0 and the README content update).
warpgbm-0.1.27/version.txt
DELETED
@@ -1 +0,0 @@
-0.1.27
warpgbm-0.1.27/warpgbm/cuda/best_split_kernel.cu
DELETED
@@ -1,79 +0,0 @@
-#include <torch/extension.h>
-#include <cuda.h>
-#include <cuda_runtime.h>
-
-__global__ void best_split_kernel_global_only(
-    const float *__restrict__ G, // [F x B]
-    const float *__restrict__ H, // [F x B]
-    int F,
-    int B,
-    float min_split_gain,
-    float min_child_samples,
-    float eps,
-    float *__restrict__ best_gains, // [F]
-    int *__restrict__ best_bins     // [F]
-)
-{
-    int f = blockIdx.x * blockDim.x + threadIdx.x;
-    if (f >= F)
-        return;
-
-    float G_total = 0.0f, H_total = 0.0f;
-    for (int b = 0; b < B; ++b)
-    {
-        G_total += G[f * B + b];
-        H_total += H[f * B + b];
-    }
-
-    float G_L = 0.0f, H_L = 0.0f;
-    float best_gain = min_split_gain;
-    int best_bin = -1;
-
-    for (int b = 0; b < B - 1; ++b)
-    {
-        G_L += G[f * B + b];
-        H_L += H[f * B + b];
-        float G_R = G_total - G_L;
-        float H_R = H_total - H_L;
-
-        if (H_L >= min_child_samples && H_R >= min_child_samples)
-        {
-            float gain = (G_L * G_L) / (H_L + eps) + (G_R * G_R) / (H_R + eps) - (G_total * G_total) / (H_total + eps);
-            if (gain > best_gain)
-            {
-                best_gain = gain;
-                best_bin = b;
-            }
-        }
-    }
-
-    best_gains[f] = best_gain;
-    best_bins[f] = best_bin;
-}
-
-void launch_best_split_kernel_cuda(
-    const at::Tensor &G, // [F x B]
-    const at::Tensor &H, // [F x B]
-    float min_split_gain,
-    float min_child_samples,
-    float eps,
-    at::Tensor &best_gains, // [F], float32
-    at::Tensor &best_bins,  // [F], int32
-    int threads)
-{
-    int F = G.size(0);
-    int B = G.size(1);
-
-    int blocks = (F + threads - 1) / threads;
-
-    best_split_kernel_global_only<<<blocks, threads>>>(
-        G.data_ptr<float>(),
-        H.data_ptr<float>(),
-        F,
-        B,
-        min_split_gain,
-        min_child_samples,
-        eps,
-        best_gains.data_ptr<float>(),
-        best_bins.data_ptr<int>());
-}