warpgbm 0.1.21__tar.gz → 0.1.22__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (24)
  1. {warpgbm-0.1.21/warpgbm.egg-info → warpgbm-0.1.22}/PKG-INFO +9 -16
  2. {warpgbm-0.1.21 → warpgbm-0.1.22}/README.md +8 -15
  3. {warpgbm-0.1.21 → warpgbm-0.1.22}/pyproject.toml +1 -1
  4. {warpgbm-0.1.21 → warpgbm-0.1.22}/tests/test_fit_predict_corr.py +11 -8
  5. warpgbm-0.1.22/version.txt +1 -0
  6. {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/core.py +152 -92
  7. {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/cuda/histogram_kernel.cu +0 -14
  8. {warpgbm-0.1.21 → warpgbm-0.1.22/warpgbm.egg-info}/PKG-INFO +9 -16
  9. warpgbm-0.1.21/version.txt +0 -1
  10. {warpgbm-0.1.21 → warpgbm-0.1.22}/LICENSE +0 -0
  11. {warpgbm-0.1.21 → warpgbm-0.1.22}/MANIFEST.in +0 -0
  12. {warpgbm-0.1.21 → warpgbm-0.1.22}/setup.cfg +0 -0
  13. {warpgbm-0.1.21 → warpgbm-0.1.22}/setup.py +0 -0
  14. {warpgbm-0.1.21 → warpgbm-0.1.22}/tests/__init__.py +0 -0
  15. {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/__init__.py +0 -0
  16. {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/cuda/__init__.py +0 -0
  17. {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/cuda/best_split_kernel.cu +0 -0
  18. {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/cuda/binner.cu +0 -0
  19. {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/cuda/node_kernel.cpp +0 -0
  20. {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/cuda/predict.cu +0 -0
  21. {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm.egg-info/SOURCES.txt +0 -0
  22. {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm.egg-info/dependency_links.txt +0 -0
  23. {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm.egg-info/requires.txt +0 -0
  24. {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: warpgbm
- Version: 0.1.21
+ Version: 0.1.22
  Summary: A fast GPU-accelerated Gradient Boosted Decision Tree library with PyTorch + CUDA
  License: GNU GENERAL PUBLIC LICENSE
  Version 3, 29 June 2007
@@ -704,26 +704,20 @@ WarpGBM is a high-performance, GPU-accelerated Gradient Boosted Decision Tree (G

  ---

- ## Performance Note
-
- In our initial tests on an NVIDIA 3090 (local) and A100 (Google Colab Pro), WarpGBM achieves **14x to 20x faster training times** compared to LightGBM's CPU version and **2x faster** on the GPU version using default configurations. Speed also outperforms XGBoost and CatBoost on regression problems. It also consumes **significantly less RAM and CPU**. These early results hint at more thorough benchmarking to come.
-
- ---
-
  ## Benchmarks

  ### Scikit-Learn Synthetic Data: 1 Million Rows and 1,000 Features

- In this benchmark we compare the speed and in-sample correlation of **WarpGBM v0.1.19** against LightGBM, XGBoost and CatBoost, all with their GPU-enabled versions. This benchmark runs on Google Colab with the L4 GPU environment. The CPU versions don't even come close to the speed here so we didn't test them.
+ In this benchmark we compare the speed and in-sample correlation of **WarpGBM v0.1.21** against LightGBM, XGBoost and CatBoost, all with their GPU-enabled versions. This benchmark runs on Google Colab with the L4 GPU environment.

  ```
- WarpGBM: corr = 0.8882, train = 21.8s, infer = 11.6s
- XGBoost: corr = 0.8877, train = 33.4s, infer = 8.1s
- LightGBM: corr = 0.8604, train = 30.2s, infer = 1.4s
- CatBoost: corr = 0.8935, train = 377.9s, infer = 375.8s
+ WarpGBM: corr = 0.8882, train = 18.7s, infer = 4.9s
+ XGBoost: corr = 0.8877, train = 33.1s, infer = 8.1s
+ LightGBM: corr = 0.8604, train = 30.3s, infer = 1.4s
+ CatBoost: corr = 0.8935, train = 400.0s, infer = 382.6s
  ```

- Colab Notebook: https://colab.research.google.com/drive/16U1kbYlD5HibGbnF5NGsjChZ1p1IA2pK
+ Colab Notebook: https://colab.research.google.com/drive/16U1kbYlD5HibGbnF5NGsjChZ1p1IA2pK?usp=sharing

  ---

@@ -746,7 +740,7 @@ pip install warpgbm
  This installs from PyPI and also compiles CUDA code locally during installation. This method works well **if your environment already has PyTorch with GPU support** installed and configured.

  > **Tip:**\
- > If you encounter an error related to mismatched or missing CUDA versions, try installing with the following flag:
+ > If you encounter an error related to mismatched or missing CUDA versions, try installing with the following flag. This is currently required in the Colab environments.
  >
  > ```bash
  > pip install warpgbm --no-build-isolation
@@ -886,8 +880,7 @@ No installation required — just press **"Open in Playground"**, then **Run All

  ### Methods:
  - `.fit(X, y, era_id=None)`: Train the model. `X` can be raw floats or pre-binned `int8` data. `era_id` is optional and used internally.
- - `.predict(X, chunksize=50_000)`: Predict on new raw float or pre-binned data.
- - `.predict_numpy(X, chunksize=50_000)`: Same as `.predict(X)` but without using the GPU.
+ - `.predict(X)`: Predict on new data, using parallelized CUDA kernel.

  ---

@@ -16,26 +16,20 @@ WarpGBM is a high-performance, GPU-accelerated Gradient Boosted Decision Tree (G

  ---

- ## Performance Note
-
- In our initial tests on an NVIDIA 3090 (local) and A100 (Google Colab Pro), WarpGBM achieves **14x to 20x faster training times** compared to LightGBM's CPU version and **2x faster** on the GPU version using default configurations. Speed also outperforms XGBoost and CatBoost on regression problems. It also consumes **significantly less RAM and CPU**. These early results hint at more thorough benchmarking to come.
-
- ---
-
  ## Benchmarks

  ### Scikit-Learn Synthetic Data: 1 Million Rows and 1,000 Features

- In this benchmark we compare the speed and in-sample correlation of **WarpGBM v0.1.19** against LightGBM, XGBoost and CatBoost, all with their GPU-enabled versions. This benchmark runs on Google Colab with the L4 GPU environment. The CPU versions don't even come close to the speed here so we didn't test them.
+ In this benchmark we compare the speed and in-sample correlation of **WarpGBM v0.1.21** against LightGBM, XGBoost and CatBoost, all with their GPU-enabled versions. This benchmark runs on Google Colab with the L4 GPU environment.

  ```
- WarpGBM: corr = 0.8882, train = 21.8s, infer = 11.6s
- XGBoost: corr = 0.8877, train = 33.4s, infer = 8.1s
- LightGBM: corr = 0.8604, train = 30.2s, infer = 1.4s
- CatBoost: corr = 0.8935, train = 377.9s, infer = 375.8s
+ WarpGBM: corr = 0.8882, train = 18.7s, infer = 4.9s
+ XGBoost: corr = 0.8877, train = 33.1s, infer = 8.1s
+ LightGBM: corr = 0.8604, train = 30.3s, infer = 1.4s
+ CatBoost: corr = 0.8935, train = 400.0s, infer = 382.6s
  ```

- Colab Notebook: https://colab.research.google.com/drive/16U1kbYlD5HibGbnF5NGsjChZ1p1IA2pK
+ Colab Notebook: https://colab.research.google.com/drive/16U1kbYlD5HibGbnF5NGsjChZ1p1IA2pK?usp=sharing

  ---

@@ -58,7 +52,7 @@ pip install warpgbm
  This installs from PyPI and also compiles CUDA code locally during installation. This method works well **if your environment already has PyTorch with GPU support** installed and configured.

  > **Tip:**\
- > If you encounter an error related to mismatched or missing CUDA versions, try installing with the following flag:
+ > If you encounter an error related to mismatched or missing CUDA versions, try installing with the following flag. This is currently required in the Colab environments.
  >
  > ```bash
  > pip install warpgbm --no-build-isolation
@@ -198,8 +192,7 @@ No installation required — just press **"Open in Playground"**, then **Run All

  ### Methods:
  - `.fit(X, y, era_id=None)`: Train the model. `X` can be raw floats or pre-binned `int8` data. `era_id` is optional and used internally.
- - `.predict(X, chunksize=50_000)`: Predict on new raw float or pre-binned data.
- - `.predict_numpy(X, chunksize=50_000)`: Same as `.predict(X)` but without using the GPU.
+ - `.predict(X)`: Predict on new data, using parallelized CUDA kernel.

  ---

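The Methods list in the README diff above documents `.fit(X, y, era_id=None)` and the new single `.predict(X)` entry point. For reference, here is a minimal usage sketch consistent with that API and with the constructor arguments exercised in `tests/test_fit_predict_corr.py`; the synthetic data and parameter values are illustrative and not part of this release.

```python
# Minimal usage sketch of the documented WarpGBM API (illustrative only).
# Assumes a CUDA-capable environment; the synthetic dataset below is just an example.
import numpy as np
from sklearn.datasets import make_regression
from warpgbm import WarpGBM

X, y = make_regression(n_samples=100_000, n_features=100, noise=0.1, random_state=0)

model = WarpGBM(
    max_depth=10,
    num_bins=10,
    n_estimators=100,
    learning_rate=1,
    histogram_computer="hist3",  # one of "hist1", "hist2", "hist3"
    threads_per_block=64,
    rows_per_thread=4,
)

model.fit(X, y)            # era_id defaults to a single era
preds = model.predict(X)   # runs the parallelized CUDA predict kernel
print(np.corrcoef(preds, y)[0, 1])
```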
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

  [project]
  name = "warpgbm"
- version = "0.1.21"
+ version = "0.1.22"
  description = "A fast GPU-accelerated Gradient Boosted Decision Tree library with PyTorch + CUDA"
  readme = "README.md"
  requires-python = ">=3.8"
@@ -1,11 +1,9 @@
  import numpy as np
  from warpgbm import WarpGBM
  from sklearn.datasets import make_regression
-
- import numpy as np
  import time
- from warpgbm import WarpGBM
- from sklearn.datasets import make_regression
+ from sklearn.metrics import mean_squared_error
+

  def test_fit_predictpytee_correlation():
  np.random.seed(42)
@@ -14,19 +12,20 @@ def test_fit_predictpytee_correlation():
  X, y = make_regression(n_samples=N, n_features=F, noise=0.1, random_state=42)
  era = np.zeros(N, dtype=np.int32)
  corrs = []
+ mses = []

- for hist_type in ['hist1', 'hist2', 'hist3']:
+ for hist_type in ["hist1", "hist2", "hist3"]:
  print(f"\nTesting histogram method: {hist_type}")

  model = WarpGBM(
  max_depth=10,
  num_bins=10,
- n_estimators=10,
+ n_estimators=100,
  learning_rate=1,
  verbosity=False,
  histogram_computer=hist_type,
  threads_per_block=64,
- rows_per_thread=4
+ rows_per_thread=4,
  )

  start_fit = time.time()
@@ -40,7 +39,11 @@ def test_fit_predictpytee_correlation():
  print(f" Predict time: {pred_time:.3f} seconds")

  corr = np.corrcoef(preds, y)[0, 1]
+ mse = mean_squared_error(preds, y)
  print(f" Correlation: {corr:.4f}")
+ print(f" MSE: {mse:.4f}")
  corrs.append(corr)
+ mses.append(mse)

- assert (np.array(corrs) > 0.95).all(), f"In-sample correlation too low: {corrs}"
+ assert (np.array(corrs) > 0.9).all(), f"In-sample correlation too low: {corrs}"
+ assert (np.array(mses) < 2).all(), f"In-sample mse too high: {mses}"
@@ -0,0 +1 @@
+ 0.1.22
@@ -7,11 +7,12 @@ from typing import Tuple
  from torch import Tensor

  histogram_kernels = {
- 'hist1': node_kernel.compute_histogram,
- 'hist2': node_kernel.compute_histogram2,
- 'hist3': node_kernel.compute_histogram3
+ "hist1": node_kernel.compute_histogram,
+ "hist2": node_kernel.compute_histogram2,
+ "hist3": node_kernel.compute_histogram3,
  }

+
  class WarpGBM(BaseEstimator, RegressorMixin):
  def __init__(
  self,
@@ -22,12 +23,12 @@ class WarpGBM(BaseEstimator, RegressorMixin):
  min_child_weight=20,
  min_split_gain=0.0,
  verbosity=True,
- histogram_computer='hist3',
+ histogram_computer="hist3",
  threads_per_block=64,
  rows_per_thread=4,
  L2_reg=1e-6,
  L1_reg=0.0,
- device='cuda'
+ device="cuda",
  ):
  # Validate arguments
  self._validate_hyperparams(
@@ -41,7 +42,7 @@ class WarpGBM(BaseEstimator, RegressorMixin):
  threads_per_block=threads_per_block,
  rows_per_thread=rows_per_thread,
  L2_reg=L2_reg,
- L1_reg=L1_reg
+ L1_reg=L1_reg,
  )

  self.num_bins = num_bins
@@ -73,22 +74,28 @@ class WarpGBM(BaseEstimator, RegressorMixin):
  def _validate_hyperparams(self, **kwargs):
  # Type checks
  int_params = [
- "num_bins", "max_depth", "n_estimators", "min_child_weight",
- "threads_per_block", "rows_per_thread"
- ]
- float_params = [
- "learning_rate", "min_split_gain", "L2_reg", "L1_reg"
+ "num_bins",
+ "max_depth",
+ "n_estimators",
+ "min_child_weight",
+ "threads_per_block",
+ "rows_per_thread",
  ]
+ float_params = ["learning_rate", "min_split_gain", "L2_reg", "L1_reg"]

  for param in int_params:
  if not isinstance(kwargs[param], int):
- raise TypeError(f"{param} must be an integer, got {type(kwargs[param])}.")
+ raise TypeError(
+ f"{param} must be an integer, got {type(kwargs[param])}."
+ )

  for param in float_params:
- if not isinstance(kwargs[param], (float, int)): # Accept ints as valid floats
+ if not isinstance(
+ kwargs[param], (float, int)
+ ): # Accept ints as valid floats
  raise TypeError(f"{param} must be a float, got {type(kwargs[param])}.")
-
- if not ( 2 <= kwargs["num_bins"] <= 127 ):
+
+ if not (2 <= kwargs["num_bins"] <= 127):
  raise ValueError("num_bins must be between 2 and 127 inclusive.")
  if kwargs["max_depth"] < 1:
  raise ValueError("max_depth must be at least 1.")
@@ -101,29 +108,39 @@ class WarpGBM(BaseEstimator, RegressorMixin):
  if kwargs["min_split_gain"] < 0:
  raise ValueError("min_split_gain must be non-negative.")
  if kwargs["threads_per_block"] <= 0 or kwargs["threads_per_block"] % 32 != 0:
- raise ValueError("threads_per_block should be a positive multiple of 32 (warp size).")
- if not ( 1 <= kwargs["rows_per_thread"] <= 16 ):
- raise ValueError("rows_per_thread must be positive between 1 and 16 inclusive.")
+ raise ValueError(
+ "threads_per_block should be a positive multiple of 32 (warp size)."
+ )
+ if not (1 <= kwargs["rows_per_thread"] <= 16):
+ raise ValueError(
+ "rows_per_thread must be positive between 1 and 16 inclusive."
+ )
  if kwargs["L2_reg"] < 0 or kwargs["L1_reg"] < 0:
  raise ValueError("L2_reg and L1_reg must be non-negative.")
  if kwargs["histogram_computer"] not in histogram_kernels:
- raise ValueError(f"Invalid histogram_computer: {kwargs['histogram_computer']}. Choose from {list(histogram_kernels.keys())}.")
+ raise ValueError(
+ f"Invalid histogram_computer: {kwargs['histogram_computer']}. Choose from {list(histogram_kernels.keys())}."
+ )

  def fit(self, X, y, era_id=None):
  if era_id is None:
- era_id = np.ones(X.shape[0], dtype='int32')
- self.bin_indices, era_indices, self.bin_edges, self.unique_eras, self.Y_gpu = self.preprocess_gpu_data(X, y, era_id)
+ era_id = np.ones(X.shape[0], dtype="int32")
+ self.bin_indices, era_indices, self.bin_edges, self.unique_eras, self.Y_gpu = (
+ self.preprocess_gpu_data(X, y, era_id)
+ )
  self.num_samples, self.num_features = X.shape
  self.gradients = torch.zeros_like(self.Y_gpu)
  self.root_node_indices = torch.arange(self.num_samples, device=self.device)
  self.base_prediction = self.Y_gpu.mean().item()
  self.gradients += self.base_prediction
  self.best_gains = torch.zeros(self.num_features, device=self.device)
- self.best_bins = torch.zeros(self.num_features, device=self.device, dtype=torch.int32)
+ self.best_bins = torch.zeros(
+ self.num_features, device=self.device, dtype=torch.int32
+ )
  with torch.no_grad():
  self.forest = self.grow_forest()
  return self
-
+
  def preprocess_gpu_data(self, X_np, Y_np, era_id_np):
  with torch.no_grad():
  self.num_samples, self.num_features = X_np.shape
@@ -133,39 +150,66 @@ class WarpGBM(BaseEstimator, RegressorMixin):
  if is_integer_type:
  max_vals = X_np.max(axis=0)
  if np.all(max_vals < self.num_bins):
- print("Detected pre-binned integer input — skipping quantile binning.")
- bin_indices = torch.from_numpy(X_np).to(self.device).contiguous().to(torch.int8)
-
+ print(
+ "Detected pre-binned integer input — skipping quantile binning."
+ )
+ bin_indices = (
+ torch.from_numpy(X_np)
+ .to(self.device)
+ .contiguous()
+ .to(torch.int8)
+ )
+
  # We'll store None or an empty tensor in self.bin_edges
  # to indicate that we skip binning at predict-time
- bin_edges = torch.arange(1, self.num_bins, dtype=torch.float32).repeat(self.num_features, 1)
+ bin_edges = torch.arange(
+ 1, self.num_bins, dtype=torch.float32
+ ).repeat(self.num_features, 1)
  bin_edges = bin_edges.to(self.device)
- unique_eras, era_indices = torch.unique(era_id_gpu, return_inverse=True)
+ unique_eras, era_indices = torch.unique(
+ era_id_gpu, return_inverse=True
+ )
  return bin_indices, era_indices, bin_edges, unique_eras, Y_gpu
  else:
- print("Integer input detected, but values exceed num_bins — falling back to quantile binning.")
-
- bin_indices = torch.empty((self.num_samples, self.num_features), dtype=torch.int8, device='cuda')
- bin_edges = torch.empty((self.num_features, self.num_bins - 1), dtype=torch.float32, device='cuda')
+ print(
+ "Integer input detected, but values exceed num_bins — falling back to quantile binning."
+ )
+
+ bin_indices = torch.empty(
+ (self.num_samples, self.num_features), dtype=torch.int8, device="cuda"
+ )
+ bin_edges = torch.empty(
+ (self.num_features, self.num_bins - 1),
+ dtype=torch.float32,
+ device="cuda",
+ )

  X_np = torch.from_numpy(X_np).to(torch.float32).pin_memory()

  for f in range(self.num_features):
- X_f = X_np[:, f].to('cuda', non_blocking=True)
- quantiles = torch.linspace(0, 1, self.num_bins + 1, device='cuda', dtype=X_f.dtype)[1:-1]
- bin_edges_f = torch.quantile(X_f, quantiles, dim=0).contiguous() # shape: [B-1] for 1D input
+ X_f = X_np[:, f].to("cuda", non_blocking=True)
+ quantiles = torch.linspace(
+ 0, 1, self.num_bins + 1, device="cuda", dtype=X_f.dtype
+ )[1:-1]
+ bin_edges_f = torch.quantile(
+ X_f, quantiles, dim=0
+ ).contiguous() # shape: [B-1] for 1D input
  bin_indices_f = bin_indices[:, f].contiguous() # view into output
  node_kernel.custom_cuda_binner(X_f, bin_edges_f, bin_indices_f)
- bin_indices[:,f] = bin_indices_f
- bin_edges[f,:] = bin_edges_f
+ bin_indices[:, f] = bin_indices_f
+ bin_edges[f, :] = bin_edges_f

  unique_eras, era_indices = torch.unique(era_id_gpu, return_inverse=True)
  return bin_indices, era_indices, bin_edges, unique_eras, Y_gpu

  def compute_histograms(self, bin_indices_sub, gradients):
- grad_hist = torch.zeros((self.num_features, self.num_bins), device=self.device, dtype=torch.float32)
- hess_hist = torch.zeros((self.num_features, self.num_bins), device=self.device, dtype=torch.float32)
-
+ grad_hist = torch.zeros(
+ (self.num_features, self.num_bins), device=self.device, dtype=torch.float32
+ )
+ hess_hist = torch.zeros(
+ (self.num_features, self.num_bins), device=self.device, dtype=torch.float32
+ )
+
  self.compute_histogram(
  bin_indices_sub,
  gradients,
@@ -173,7 +217,7 @@ class WarpGBM(BaseEstimator, RegressorMixin):
  hess_hist,
  self.num_bins,
  self.threads_per_block,
- self.rows_per_thread
+ self.rows_per_thread,
  )
  return grad_hist, hess_hist

@@ -186,7 +230,7 @@ class WarpGBM(BaseEstimator, RegressorMixin):
  self.L2_reg,
  self.best_gains,
  self.best_bins,
- self.threads_per_block
+ self.threads_per_block,
  )

  if torch.all(self.best_bins == -1):
@@ -196,59 +240,74 @@ class WarpGBM(BaseEstimator, RegressorMixin):
  b = self.best_bins[f].item()

  return f, b
-
+
  def grow_tree(self, gradient_histogram, hessian_histogram, node_indices, depth):
  if depth == self.max_depth:
  leaf_value = self.residual[node_indices].mean()
  self.gradients[node_indices] += self.learning_rate * leaf_value
  return {"leaf_value": leaf_value.item(), "samples": node_indices.numel()}
-
+
  parent_size = node_indices.numel()
- best_feature, best_bin = self.find_best_split(gradient_histogram, hessian_histogram)
-
+ best_feature, best_bin = self.find_best_split(
+ gradient_histogram, hessian_histogram
+ )
+
  if best_feature == -1:
  leaf_value = self.residual[node_indices].mean()
  self.gradients[node_indices] += self.learning_rate * leaf_value
  return {"leaf_value": leaf_value.item(), "samples": parent_size}
-
- split_mask = (self.bin_indices[node_indices, best_feature] <= best_bin)
+
+ split_mask = self.bin_indices[node_indices, best_feature] <= best_bin
  left_indices = node_indices[split_mask]
  right_indices = node_indices[~split_mask]

  left_size = left_indices.numel()
  right_size = right_indices.numel()

-
  if left_size <= right_size:
- grad_hist_left, hess_hist_left = self.compute_histograms( self.bin_indices[left_indices], self.residual[left_indices] )
+ grad_hist_left, hess_hist_left = self.compute_histograms(
+ self.bin_indices[left_indices], self.residual[left_indices]
+ )
  grad_hist_right = gradient_histogram - grad_hist_left
  hess_hist_right = hessian_histogram - hess_hist_left
  else:
- grad_hist_right, hess_hist_right = self.compute_histograms( self.bin_indices[right_indices], self.residual[right_indices] )
+ grad_hist_right, hess_hist_right = self.compute_histograms(
+ self.bin_indices[right_indices], self.residual[right_indices]
+ )
  grad_hist_left = gradient_histogram - grad_hist_right
  hess_hist_left = hessian_histogram - hess_hist_right

  new_depth = depth + 1
- left_child = self.grow_tree(grad_hist_left, hess_hist_left, left_indices, new_depth)
- right_child = self.grow_tree(grad_hist_right, hess_hist_right, right_indices, new_depth)
-
- return { "feature": best_feature, "bin": best_bin, "left": left_child, "right": right_child }
+ left_child = self.grow_tree(
+ grad_hist_left, hess_hist_left, left_indices, new_depth
+ )
+ right_child = self.grow_tree(
+ grad_hist_right, hess_hist_right, right_indices, new_depth
+ )
+
+ return {
+ "feature": best_feature,
+ "bin": best_bin,
+ "left": left_child,
+ "right": right_child,
+ }

  def grow_forest(self):
  forest = [{} for _ in range(self.n_estimators)]
  self.training_loss = []
-
- for i in tqdm( range(self.n_estimators) ):
+
+ for i in tqdm(range(self.n_estimators)):
  self.residual = self.Y_gpu - self.gradients
-
- self.root_gradient_histogram, self.root_hessian_histogram = \
+
+ self.root_gradient_histogram, self.root_hessian_histogram = (
  self.compute_histograms(self.bin_indices, self.residual)
-
+ )
+
  tree = self.grow_tree(
  self.root_gradient_histogram,
  self.root_hessian_histogram,
  self.root_node_indices,
- depth=0
+ depth=0,
  )
  forest[i] = tree
  # loss = ((self.Y_gpu - self.gradients) ** 2).mean().item()
@@ -261,7 +320,9 @@ class WarpGBM(BaseEstimator, RegressorMixin):
  def predict(self, X_np):
  X_tensor = torch.from_numpy(X_np).to(torch.float32).pin_memory()
  num_samples = X_tensor.size(0)
- bin_indices = torch.zeros((num_samples, self.num_features), dtype=torch.int8, device=self.device)
+ bin_indices = torch.zeros(
+ (num_samples, self.num_features), dtype=torch.int8, device=self.device
+ )

  with torch.no_grad():
  for f in range(self.num_features):
@@ -271,17 +332,16 @@ class WarpGBM(BaseEstimator, RegressorMixin):
  node_kernel.custom_cuda_binner(X_f, bin_edges_f, bin_indices_f)
  bin_indices[:, f] = bin_indices_f

- tree_tensor = torch.stack([
- self.flatten_tree(tree, max_nodes=2**(self.max_depth + 1))
- for tree in self.forest
- ]).to(self.device)
+ tree_tensor = torch.stack(
+ [
+ self.flatten_tree(tree, max_nodes=2 ** (self.max_depth + 1))
+ for tree in self.forest
+ ]
+ ).to(self.device)

- out = torch.zeros(num_samples, device=self.device)
+ out = torch.zeros(num_samples, device=self.device) + self.base_prediction
  node_kernel.predict_forest(
- bin_indices.contiguous(),
- tree_tensor.contiguous(),
- self.learning_rate,
- out
+ bin_indices.contiguous(), tree_tensor.contiguous(), self.learning_rate, out
  )

  return out.cpu().numpy()
@@ -289,20 +349,20 @@ class WarpGBM(BaseEstimator, RegressorMixin):
  def flatten_tree(self, tree, max_nodes):
  """
  Convert a recursive tree structure into a flat matrix format.
-
+
  Each row in the output represents a node:
  - Columns: [feature, bin, left_id, right_id, is_leaf, value]
  - Internal nodes fill columns 0–3 and set is_leaf = 0
  - Leaf nodes fill only value and set is_leaf = 1
-
+
  Args:
  tree (list): A list containing a single root node (recursive dict form).
  max_nodes (int): Max number of nodes to allocate in the flat matrix.
-
+
  Returns:
  torch.Tensor: [max_nodes x 6] matrix representing the flattened tree.
  """
- flat = torch.full((max_nodes, 6), float('nan'), dtype=torch.float32)
+ flat = torch.full((max_nodes, 6), float("nan"), dtype=torch.float32)
  node_counter = [0]
  node_list = []

@@ -310,16 +370,16 @@ class WarpGBM(BaseEstimator, RegressorMixin):
  curr_id = node_counter[0]
  node_counter[0] += 1

- new_node = {'node_id': curr_id}
- if 'leaf_value' in node:
- new_node['leaf_value'] = float(node['leaf_value'])
+ new_node = {"node_id": curr_id}
+ if "leaf_value" in node:
+ new_node["leaf_value"] = float(node["leaf_value"])
  else:
- new_node['best_feature'] = float(node['feature'])
- new_node['split_bin'] = float(node['bin'])
- new_node['left_id'] = node_counter[0]
- walk(node['left'])
- new_node['right_id'] = node_counter[0]
- walk(node['right'])
+ new_node["best_feature"] = float(node["feature"])
+ new_node["split_bin"] = float(node["bin"])
+ new_node["left_id"] = node_counter[0]
+ walk(node["left"])
+ new_node["right_id"] = node_counter[0]
+ walk(node["right"])

  node_list.append(new_node)
  return new_node
@@ -327,15 +387,15 @@ class WarpGBM(BaseEstimator, RegressorMixin):
  walk(tree)

  for node in node_list:
- i = node['node_id']
- if 'leaf_value' in node:
+ i = node["node_id"]
+ if "leaf_value" in node:
  flat[i, 4] = 1.0
- flat[i, 5] = node['leaf_value']
+ flat[i, 5] = node["leaf_value"]
  else:
- flat[i, 0] = node['best_feature']
- flat[i, 1] = node['split_bin']
- flat[i, 2] = node['left_id']
- flat[i, 3] = node['right_id']
+ flat[i, 0] = node["best_feature"]
+ flat[i, 1] = node["split_bin"]
+ flat[i, 2] = node["left_id"]
+ flat[i, 3] = node["right_id"]
  flat[i, 4] = 0.0

- return flat
+ return flat
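The reworked `grow_tree` above keeps the standard GBDT histogram-subtraction optimization: histograms are recomputed only for the smaller child, and the sibling's histograms are recovered by subtracting from the parent. The sketch below restates that idea in plain PyTorch, outside WarpGBM's CUDA path; the function and variable names are ad hoc and purely illustrative.

```python
# Illustrative sketch of the histogram-subtraction trick used in grow_tree above,
# written in plain PyTorch rather than WarpGBM's CUDA kernels (names are ad hoc).
import torch

def child_histograms(bin_indices, residual, parent_grad_hist, left_idx, right_idx, num_bins):
    """Build per-feature gradient histograms for both children while scanning only the smaller one."""
    smaller, _ = (left_idx, right_idx) if left_idx.numel() <= right_idx.numel() else (right_idx, left_idx)
    num_features = bin_indices.shape[1]
    small_hist = torch.zeros(num_features, num_bins, dtype=residual.dtype)
    for f in range(num_features):
        # scatter-add the residuals of the smaller child into that feature's bins
        small_hist[f].index_add_(0, bin_indices[smaller, f].long(), residual[smaller])
    other_hist = parent_grad_hist - small_hist  # the sibling's histogram comes for free
    if smaller is left_idx:
        return small_hist, other_hist
    return other_hist, small_hist
```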
@@ -107,12 +107,6 @@ void launch_histogram_kernel_cuda(
  N, F, B);
  }

- #define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
- #define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
- #define CHECK_INPUT(x) \
- CHECK_CUDA(x); \
- CHECK_CONTIGUOUS(x)
-
  // CUDA kernel: tiled, 64-bit safe
  __global__ void histogram_tiled_kernel(
  const int8_t *__restrict__ bin_indices, // [N, F]
@@ -148,10 +142,6 @@ void launch_histogram_kernel_cuda_2(
  int threads_per_block = 256,
  int rows_per_thread = 1)
  {
- CHECK_INPUT(bin_indices);
- CHECK_INPUT(gradients);
- CHECK_INPUT(grad_hist);
- CHECK_INPUT(hess_hist);

  int64_t N = bin_indices.size(0);
  int64_t F = bin_indices.size(1);
@@ -233,10 +223,6 @@ void launch_histogram_kernel_cuda_configurable(
  int threads_per_block = 256,
  int rows_per_thread = 1)
  {
- CHECK_INPUT(bin_indices);
- CHECK_INPUT(gradients);
- CHECK_INPUT(grad_hist);
- CHECK_INPUT(hess_hist);

  int64_t N = bin_indices.size(0);
  int64_t F = bin_indices.size(1);
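The removed `CHECK_CUDA` / `CHECK_CONTIGUOUS` / `CHECK_INPUT` macros had guarded the histogram launchers against non-CUDA or non-contiguous tensors. If equivalent validation is wanted at the Python call sites instead, a minimal sketch could look like the following; this is an assumption about where such checks might live, not code shipped in 0.1.22.

```python
# Illustrative Python-side equivalent of the removed CHECK_INPUT guards
# (an assumption for discussion, not part of this release).
import torch

def check_input(tensor: torch.Tensor, name: str) -> None:
    # Mirrors CHECK_CUDA and CHECK_CONTIGUOUS from the old CUDA source.
    if not tensor.is_cuda:
        raise TypeError(f"{name} must be a CUDA tensor")
    if not tensor.is_contiguous():
        raise TypeError(f"{name} must be contiguous")
```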
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: warpgbm
- Version: 0.1.21
+ Version: 0.1.22
  Summary: A fast GPU-accelerated Gradient Boosted Decision Tree library with PyTorch + CUDA
  License: GNU GENERAL PUBLIC LICENSE
  Version 3, 29 June 2007
@@ -704,26 +704,20 @@ WarpGBM is a high-performance, GPU-accelerated Gradient Boosted Decision Tree (G

  ---

- ## Performance Note
-
- In our initial tests on an NVIDIA 3090 (local) and A100 (Google Colab Pro), WarpGBM achieves **14x to 20x faster training times** compared to LightGBM's CPU version and **2x faster** on the GPU version using default configurations. Speed also outperforms XGBoost and CatBoost on regression problems. It also consumes **significantly less RAM and CPU**. These early results hint at more thorough benchmarking to come.
-
- ---
-
  ## Benchmarks

  ### Scikit-Learn Synthetic Data: 1 Million Rows and 1,000 Features

- In this benchmark we compare the speed and in-sample correlation of **WarpGBM v0.1.19** against LightGBM, XGBoost and CatBoost, all with their GPU-enabled versions. This benchmark runs on Google Colab with the L4 GPU environment. The CPU versions don't even come close to the speed here so we didn't test them.
+ In this benchmark we compare the speed and in-sample correlation of **WarpGBM v0.1.21** against LightGBM, XGBoost and CatBoost, all with their GPU-enabled versions. This benchmark runs on Google Colab with the L4 GPU environment.

  ```
- WarpGBM: corr = 0.8882, train = 21.8s, infer = 11.6s
- XGBoost: corr = 0.8877, train = 33.4s, infer = 8.1s
- LightGBM: corr = 0.8604, train = 30.2s, infer = 1.4s
- CatBoost: corr = 0.8935, train = 377.9s, infer = 375.8s
+ WarpGBM: corr = 0.8882, train = 18.7s, infer = 4.9s
+ XGBoost: corr = 0.8877, train = 33.1s, infer = 8.1s
+ LightGBM: corr = 0.8604, train = 30.3s, infer = 1.4s
+ CatBoost: corr = 0.8935, train = 400.0s, infer = 382.6s
  ```

- Colab Notebook: https://colab.research.google.com/drive/16U1kbYlD5HibGbnF5NGsjChZ1p1IA2pK
+ Colab Notebook: https://colab.research.google.com/drive/16U1kbYlD5HibGbnF5NGsjChZ1p1IA2pK?usp=sharing

  ---

@@ -746,7 +740,7 @@ pip install warpgbm
  This installs from PyPI and also compiles CUDA code locally during installation. This method works well **if your environment already has PyTorch with GPU support** installed and configured.

  > **Tip:**\
- > If you encounter an error related to mismatched or missing CUDA versions, try installing with the following flag:
+ > If you encounter an error related to mismatched or missing CUDA versions, try installing with the following flag. This is currently required in the Colab environments.
  >
  > ```bash
  > pip install warpgbm --no-build-isolation
@@ -886,8 +880,7 @@ No installation required — just press **"Open in Playground"**, then **Run All

  ### Methods:
  - `.fit(X, y, era_id=None)`: Train the model. `X` can be raw floats or pre-binned `int8` data. `era_id` is optional and used internally.
- - `.predict(X, chunksize=50_000)`: Predict on new raw float or pre-binned data.
- - `.predict_numpy(X, chunksize=50_000)`: Same as `.predict(X)` but without using the GPU.
+ - `.predict(X)`: Predict on new data, using parallelized CUDA kernel.

  ---

@@ -1 +0,0 @@
- 0.1.21