warpgbm 0.1.21__tar.gz → 0.1.22__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {warpgbm-0.1.21/warpgbm.egg-info → warpgbm-0.1.22}/PKG-INFO +9 -16
- {warpgbm-0.1.21 → warpgbm-0.1.22}/README.md +8 -15
- {warpgbm-0.1.21 → warpgbm-0.1.22}/pyproject.toml +1 -1
- {warpgbm-0.1.21 → warpgbm-0.1.22}/tests/test_fit_predict_corr.py +11 -8
- warpgbm-0.1.22/version.txt +1 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/core.py +152 -92
- {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/cuda/histogram_kernel.cu +0 -14
- {warpgbm-0.1.21 → warpgbm-0.1.22/warpgbm.egg-info}/PKG-INFO +9 -16
- warpgbm-0.1.21/version.txt +0 -1
- {warpgbm-0.1.21 → warpgbm-0.1.22}/LICENSE +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/MANIFEST.in +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/setup.cfg +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/setup.py +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/tests/__init__.py +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/__init__.py +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/cuda/__init__.py +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/cuda/best_split_kernel.cu +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/cuda/binner.cu +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/cuda/node_kernel.cpp +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/cuda/predict.cu +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm.egg-info/SOURCES.txt +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm.egg-info/dependency_links.txt +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm.egg-info/requires.txt +0 -0
- {warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm.egg-info/top_level.txt +0 -0
{warpgbm-0.1.21/warpgbm.egg-info → warpgbm-0.1.22}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: warpgbm
-Version: 0.1.21
+Version: 0.1.22
 Summary: A fast GPU-accelerated Gradient Boosted Decision Tree library with PyTorch + CUDA
 License: GNU GENERAL PUBLIC LICENSE
         Version 3, 29 June 2007
@@ -704,26 +704,20 @@ WarpGBM is a high-performance, GPU-accelerated Gradient Boosted Decision Tree (G

 ---

-## Performance Note
-
-In our initial tests on an NVIDIA 3090 (local) and A100 (Google Colab Pro), WarpGBM achieves **14x to 20x faster training times** compared to LightGBM's CPU version and **2x faster** on the GPU version using default configurations. Speed also outperforms XGBoost and CatBoost on regression problems. It also consumes **significantly less RAM and CPU**. These early results hint at more thorough benchmarking to come.
-
----
-
 ## Benchmarks

 ### Scikit-Learn Synthetic Data: 1 Million Rows and 1,000 Features

-In this benchmark we compare the speed and in-sample correlation of **WarpGBM v0.1.
+In this benchmark we compare the speed and in-sample correlation of **WarpGBM v0.1.21** against LightGBM, XGBoost and CatBoost, all with their GPU-enabled versions. This benchmark runs on Google Colab with the L4 GPU environment.

 ```
-WarpGBM: corr = 0.8882, train =
-XGBoost: corr = 0.8877, train = 33.
-LightGBM: corr = 0.8604, train = 30.
-CatBoost: corr = 0.8935, train =
+WarpGBM: corr = 0.8882, train = 18.7s, infer = 4.9s
+XGBoost: corr = 0.8877, train = 33.1s, infer = 8.1s
+LightGBM: corr = 0.8604, train = 30.3s, infer = 1.4s
+CatBoost: corr = 0.8935, train = 400.0s, infer = 382.6s
 ```

-Colab Notebook: https://colab.research.google.com/drive/16U1kbYlD5HibGbnF5NGsjChZ1p1IA2pK
+Colab Notebook: https://colab.research.google.com/drive/16U1kbYlD5HibGbnF5NGsjChZ1p1IA2pK?usp=sharing

 ---

@@ -746,7 +740,7 @@ pip install warpgbm
 This installs from PyPI and also compiles CUDA code locally during installation. This method works well **if your environment already has PyTorch with GPU support** installed and configured.

 > **Tip:**\
-> If you encounter an error related to mismatched or missing CUDA versions, try installing with the following flag
+> If you encounter an error related to mismatched or missing CUDA versions, try installing with the following flag. This is currently required in the Colab environments.
 >
 > ```bash
 > pip install warpgbm --no-build-isolation
@@ -886,8 +880,7 @@ No installation required — just press **"Open in Playground"**, then **Run All

 ### Methods:
 - `.fit(X, y, era_id=None)`: Train the model. `X` can be raw floats or pre-binned `int8` data. `era_id` is optional and used internally.
-- `.predict(X
-- `.predict_numpy(X, chunksize=50_000)`: Same as `.predict(X)` but without using the GPU.
+- `.predict(X)`: Predict on new data, using parallelized CUDA kernel.

 ---

{warpgbm-0.1.21 → warpgbm-0.1.22}/README.md

@@ -16,26 +16,20 @@ WarpGBM is a high-performance, GPU-accelerated Gradient Boosted Decision Tree (G

 ---

-## Performance Note
-
-In our initial tests on an NVIDIA 3090 (local) and A100 (Google Colab Pro), WarpGBM achieves **14x to 20x faster training times** compared to LightGBM's CPU version and **2x faster** on the GPU version using default configurations. Speed also outperforms XGBoost and CatBoost on regression problems. It also consumes **significantly less RAM and CPU**. These early results hint at more thorough benchmarking to come.
-
----
-
 ## Benchmarks

 ### Scikit-Learn Synthetic Data: 1 Million Rows and 1,000 Features

-In this benchmark we compare the speed and in-sample correlation of **WarpGBM v0.1.
+In this benchmark we compare the speed and in-sample correlation of **WarpGBM v0.1.21** against LightGBM, XGBoost and CatBoost, all with their GPU-enabled versions. This benchmark runs on Google Colab with the L4 GPU environment.

 ```
-WarpGBM: corr = 0.8882, train =
-XGBoost: corr = 0.8877, train = 33.
-LightGBM: corr = 0.8604, train = 30.
-CatBoost: corr = 0.8935, train =
+WarpGBM: corr = 0.8882, train = 18.7s, infer = 4.9s
+XGBoost: corr = 0.8877, train = 33.1s, infer = 8.1s
+LightGBM: corr = 0.8604, train = 30.3s, infer = 1.4s
+CatBoost: corr = 0.8935, train = 400.0s, infer = 382.6s
 ```

-Colab Notebook: https://colab.research.google.com/drive/16U1kbYlD5HibGbnF5NGsjChZ1p1IA2pK
+Colab Notebook: https://colab.research.google.com/drive/16U1kbYlD5HibGbnF5NGsjChZ1p1IA2pK?usp=sharing

 ---

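For context on what these figures measure, the sketch below shows the general shape of such a timing run. It is an illustrative reconstruction, not the notebook's code: it assumes the synthetic data comes from scikit-learn's `make_regression` (as in the package's own test suite) and uses placeholder hyperparameters; the exact configuration lives in the linked Colab notebook.

```python
# Illustrative benchmark harness (assumed setup, not the actual Colab notebook).
import time

import numpy as np
from sklearn.datasets import make_regression
from warpgbm import WarpGBM

# 1M rows x 1,000 features as described above (the raw float64 matrix alone is ~8 GB).
X, y = make_regression(n_samples=1_000_000, n_features=1000, noise=0.1, random_state=42)

model = WarpGBM(max_depth=10, num_bins=10, n_estimators=100)  # hypothetical settings

t0 = time.time()
model.fit(X, y)
train_time = time.time() - t0

t0 = time.time()
preds = model.predict(X)
infer_time = time.time() - t0

corr = np.corrcoef(preds, y)[0, 1]  # in-sample correlation, as reported in the table
print(f"WarpGBM: corr = {corr:.4f}, train = {train_time:.1f}s, infer = {infer_time:.1f}s")
```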
@@ -58,7 +52,7 @@ pip install warpgbm
 This installs from PyPI and also compiles CUDA code locally during installation. This method works well **if your environment already has PyTorch with GPU support** installed and configured.

 > **Tip:**\
-> If you encounter an error related to mismatched or missing CUDA versions, try installing with the following flag
+> If you encounter an error related to mismatched or missing CUDA versions, try installing with the following flag. This is currently required in the Colab environments.
 >
 > ```bash
 > pip install warpgbm --no-build-isolation
@@ -198,8 +192,7 @@ No installation required — just press **"Open in Playground"**, then **Run All

 ### Methods:
 - `.fit(X, y, era_id=None)`: Train the model. `X` can be raw floats or pre-binned `int8` data. `era_id` is optional and used internally.
-- `.predict(X
-- `.predict_numpy(X, chunksize=50_000)`: Same as `.predict(X)` but without using the GPU.
+- `.predict(X)`: Predict on new data, using parallelized CUDA kernel.

 ---

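Putting the updated method list together with the constructor arguments visible in the `warpgbm/core.py` diff further down, the 0.1.22 API reads roughly as follows. This is a minimal usage sketch inferred from this diff; the hyperparameter values are taken from the bundled test rather than from any recommended defaults.

```python
# Minimal usage sketch of the API surface shown in this diff.
import numpy as np
from sklearn.datasets import make_regression
from warpgbm import WarpGBM

X, y = make_regression(n_samples=100_000, n_features=100, random_state=0)

model = WarpGBM(
    num_bins=10,
    max_depth=10,
    n_estimators=100,
    learning_rate=1,
    histogram_computer="hist3",  # "hist1" / "hist2" / "hist3", per histogram_kernels below
    threads_per_block=64,
    rows_per_thread=4,
    device="cuda",
)
model.fit(X, y)           # era_id is optional; omitted here
preds = model.predict(X)  # runs the parallelized CUDA predict kernel
print(np.corrcoef(preds, y)[0, 1])
```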
{warpgbm-0.1.21 → warpgbm-0.1.22}/tests/test_fit_predict_corr.py

@@ -1,11 +1,9 @@
 import numpy as np
 from warpgbm import WarpGBM
 from sklearn.datasets import make_regression
-
-import numpy as np
 import time
-from
-
+from sklearn.metrics import mean_squared_error
+


 def test_fit_predictpytee_correlation():
     np.random.seed(42)
@@ -14,19 +12,20 @@ def test_fit_predictpytee_correlation():
     X, y = make_regression(n_samples=N, n_features=F, noise=0.1, random_state=42)
     era = np.zeros(N, dtype=np.int32)
     corrs = []
+    mses = []

-    for hist_type in [
+    for hist_type in ["hist1", "hist2", "hist3"]:
         print(f"\nTesting histogram method: {hist_type}")

         model = WarpGBM(
             max_depth=10,
             num_bins=10,
-            n_estimators=
+            n_estimators=100,
             learning_rate=1,
             verbosity=False,
             histogram_computer=hist_type,
             threads_per_block=64,
-            rows_per_thread=4
+            rows_per_thread=4,
         )

         start_fit = time.time()
@@ -40,7 +39,11 @@ def test_fit_predictpytee_correlation():
         print(f" Predict time: {pred_time:.3f} seconds")

         corr = np.corrcoef(preds, y)[0, 1]
+        mse = mean_squared_error(preds, y)
         print(f" Correlation: {corr:.4f}")
+        print(f" MSE: {mse:.4f}")
         corrs.append(corr)
+        mses.append(mse)

-    assert (np.array(corrs) > 0.
+    assert (np.array(corrs) > 0.9).all(), f"In-sample correlation too low: {corrs}"
+    assert (np.array(mses) < 2).all(), f"In-sample mse too high: {mses}"
warpgbm-0.1.22/version.txt

@@ -0,0 +1 @@
+0.1.22
{warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/core.py

@@ -7,11 +7,12 @@ from typing import Tuple
 from torch import Tensor

 histogram_kernels = {
-
-
-
+    "hist1": node_kernel.compute_histogram,
+    "hist2": node_kernel.compute_histogram2,
+    "hist3": node_kernel.compute_histogram3,
 }

+
 class WarpGBM(BaseEstimator, RegressorMixin):
     def __init__(
         self,
@@ -22,12 +23,12 @@ class WarpGBM(BaseEstimator, RegressorMixin):
         min_child_weight=20,
         min_split_gain=0.0,
         verbosity=True,
-        histogram_computer=
+        histogram_computer="hist3",
         threads_per_block=64,
         rows_per_thread=4,
         L2_reg=1e-6,
         L1_reg=0.0,
-        device=
+        device="cuda",
     ):
         # Validate arguments
         self._validate_hyperparams(
@@ -41,7 +42,7 @@ class WarpGBM(BaseEstimator, RegressorMixin):
             threads_per_block=threads_per_block,
             rows_per_thread=rows_per_thread,
             L2_reg=L2_reg,
-            L1_reg=L1_reg
+            L1_reg=L1_reg,
         )

         self.num_bins = num_bins
@@ -73,22 +74,28 @@ class WarpGBM(BaseEstimator, RegressorMixin):
     def _validate_hyperparams(self, **kwargs):
         # Type checks
         int_params = [
-            "num_bins",
-            "
-
-
-            "
+            "num_bins",
+            "max_depth",
+            "n_estimators",
+            "min_child_weight",
+            "threads_per_block",
+            "rows_per_thread",
         ]
+        float_params = ["learning_rate", "min_split_gain", "L2_reg", "L1_reg"]

         for param in int_params:
             if not isinstance(kwargs[param], int):
-                raise TypeError(
+                raise TypeError(
+                    f"{param} must be an integer, got {type(kwargs[param])}."
+                )

         for param in float_params:
-            if not isinstance(
+            if not isinstance(
+                kwargs[param], (float, int)
+            ):  # Accept ints as valid floats
                 raise TypeError(f"{param} must be a float, got {type(kwargs[param])}.")
-
-        if not (
+
+        if not (2 <= kwargs["num_bins"] <= 127):
             raise ValueError("num_bins must be between 2 and 127 inclusive.")
         if kwargs["max_depth"] < 1:
             raise ValueError("max_depth must be at least 1.")
@@ -101,29 +108,39 @@ class WarpGBM(BaseEstimator, RegressorMixin):
         if kwargs["min_split_gain"] < 0:
             raise ValueError("min_split_gain must be non-negative.")
         if kwargs["threads_per_block"] <= 0 or kwargs["threads_per_block"] % 32 != 0:
-            raise ValueError(
-
-
+            raise ValueError(
+                "threads_per_block should be a positive multiple of 32 (warp size)."
+            )
+        if not (1 <= kwargs["rows_per_thread"] <= 16):
+            raise ValueError(
+                "rows_per_thread must be positive between 1 and 16 inclusive."
+            )
         if kwargs["L2_reg"] < 0 or kwargs["L1_reg"] < 0:
             raise ValueError("L2_reg and L1_reg must be non-negative.")
         if kwargs["histogram_computer"] not in histogram_kernels:
-            raise ValueError(
+            raise ValueError(
+                f"Invalid histogram_computer: {kwargs['histogram_computer']}. Choose from {list(histogram_kernels.keys())}."
+            )

     def fit(self, X, y, era_id=None):
         if era_id is None:
-            era_id = np.ones(X.shape[0], dtype=
-        self.bin_indices, era_indices, self.bin_edges, self.unique_eras, self.Y_gpu =
+            era_id = np.ones(X.shape[0], dtype="int32")
+        self.bin_indices, era_indices, self.bin_edges, self.unique_eras, self.Y_gpu = (
+            self.preprocess_gpu_data(X, y, era_id)
+        )
         self.num_samples, self.num_features = X.shape
         self.gradients = torch.zeros_like(self.Y_gpu)
         self.root_node_indices = torch.arange(self.num_samples, device=self.device)
         self.base_prediction = self.Y_gpu.mean().item()
         self.gradients += self.base_prediction
         self.best_gains = torch.zeros(self.num_features, device=self.device)
-        self.best_bins = torch.zeros(
+        self.best_bins = torch.zeros(
+            self.num_features, device=self.device, dtype=torch.int32
+        )
         with torch.no_grad():
             self.forest = self.grow_forest()
         return self
-
+
     def preprocess_gpu_data(self, X_np, Y_np, era_id_np):
         with torch.no_grad():
             self.num_samples, self.num_features = X_np.shape
@@ -133,39 +150,66 @@ class WarpGBM(BaseEstimator, RegressorMixin):
             if is_integer_type:
                 max_vals = X_np.max(axis=0)
                 if np.all(max_vals < self.num_bins):
-                    print(
-
-
+                    print(
+                        "Detected pre-binned integer input — skipping quantile binning."
+                    )
+                    bin_indices = (
+                        torch.from_numpy(X_np)
+                        .to(self.device)
+                        .contiguous()
+                        .to(torch.int8)
+                    )
+
                     # We'll store None or an empty tensor in self.bin_edges
                     # to indicate that we skip binning at predict-time
-                    bin_edges = torch.arange(
+                    bin_edges = torch.arange(
+                        1, self.num_bins, dtype=torch.float32
+                    ).repeat(self.num_features, 1)
                     bin_edges = bin_edges.to(self.device)
-                    unique_eras, era_indices = torch.unique(
+                    unique_eras, era_indices = torch.unique(
+                        era_id_gpu, return_inverse=True
+                    )
                     return bin_indices, era_indices, bin_edges, unique_eras, Y_gpu
                 else:
-                    print(
-
-
-
+                    print(
+                        "Integer input detected, but values exceed num_bins — falling back to quantile binning."
+                    )
+
+            bin_indices = torch.empty(
+                (self.num_samples, self.num_features), dtype=torch.int8, device="cuda"
+            )
+            bin_edges = torch.empty(
+                (self.num_features, self.num_bins - 1),
+                dtype=torch.float32,
+                device="cuda",
+            )

             X_np = torch.from_numpy(X_np).to(torch.float32).pin_memory()

             for f in range(self.num_features):
-                X_f = X_np[:, f].to(
-                quantiles = torch.linspace(
-
+                X_f = X_np[:, f].to("cuda", non_blocking=True)
+                quantiles = torch.linspace(
+                    0, 1, self.num_bins + 1, device="cuda", dtype=X_f.dtype
+                )[1:-1]
+                bin_edges_f = torch.quantile(
+                    X_f, quantiles, dim=0
+                ).contiguous()  # shape: [B-1] for 1D input
                 bin_indices_f = bin_indices[:, f].contiguous()  # view into output
                 node_kernel.custom_cuda_binner(X_f, bin_edges_f, bin_indices_f)
-                bin_indices[:,f] = bin_indices_f
-                bin_edges[f
+                bin_indices[:, f] = bin_indices_f
+                bin_edges[f, :] = bin_edges_f

             unique_eras, era_indices = torch.unique(era_id_gpu, return_inverse=True)
             return bin_indices, era_indices, bin_edges, unique_eras, Y_gpu

     def compute_histograms(self, bin_indices_sub, gradients):
-        grad_hist = torch.zeros(
-
-
+        grad_hist = torch.zeros(
+            (self.num_features, self.num_bins), device=self.device, dtype=torch.float32
+        )
+        hess_hist = torch.zeros(
+            (self.num_features, self.num_bins), device=self.device, dtype=torch.float32
+        )
+
         self.compute_histogram(
             bin_indices_sub,
             gradients,
@@ -173,7 +217,7 @@ class WarpGBM(BaseEstimator, RegressorMixin):
             hess_hist,
             self.num_bins,
             self.threads_per_block,
-            self.rows_per_thread
+            self.rows_per_thread,
         )
         return grad_hist, hess_hist

@@ -186,7 +230,7 @@ class WarpGBM(BaseEstimator, RegressorMixin):
             self.L2_reg,
             self.best_gains,
             self.best_bins,
-            self.threads_per_block
+            self.threads_per_block,
         )

         if torch.all(self.best_bins == -1):
@@ -196,59 +240,74 @@ class WarpGBM(BaseEstimator, RegressorMixin):
             b = self.best_bins[f].item()

         return f, b
-
+
     def grow_tree(self, gradient_histogram, hessian_histogram, node_indices, depth):
         if depth == self.max_depth:
             leaf_value = self.residual[node_indices].mean()
             self.gradients[node_indices] += self.learning_rate * leaf_value
             return {"leaf_value": leaf_value.item(), "samples": node_indices.numel()}
-
+
         parent_size = node_indices.numel()
-        best_feature, best_bin = self.find_best_split(
-
+        best_feature, best_bin = self.find_best_split(
+            gradient_histogram, hessian_histogram
+        )
+
         if best_feature == -1:
             leaf_value = self.residual[node_indices].mean()
             self.gradients[node_indices] += self.learning_rate * leaf_value
             return {"leaf_value": leaf_value.item(), "samples": parent_size}
-
-        split_mask =
+
+        split_mask = self.bin_indices[node_indices, best_feature] <= best_bin
         left_indices = node_indices[split_mask]
         right_indices = node_indices[~split_mask]

         left_size = left_indices.numel()
         right_size = right_indices.numel()

-
         if left_size <= right_size:
-            grad_hist_left, hess_hist_left = self.compute_histograms(
+            grad_hist_left, hess_hist_left = self.compute_histograms(
+                self.bin_indices[left_indices], self.residual[left_indices]
+            )
             grad_hist_right = gradient_histogram - grad_hist_left
             hess_hist_right = hessian_histogram - hess_hist_left
         else:
-            grad_hist_right, hess_hist_right = self.compute_histograms(
+            grad_hist_right, hess_hist_right = self.compute_histograms(
+                self.bin_indices[right_indices], self.residual[right_indices]
+            )
             grad_hist_left = gradient_histogram - grad_hist_right
             hess_hist_left = hessian_histogram - hess_hist_right

         new_depth = depth + 1
-        left_child = self.grow_tree(
-
-
-
+        left_child = self.grow_tree(
+            grad_hist_left, hess_hist_left, left_indices, new_depth
+        )
+        right_child = self.grow_tree(
+            grad_hist_right, hess_hist_right, right_indices, new_depth
+        )
+
+        return {
+            "feature": best_feature,
+            "bin": best_bin,
+            "left": left_child,
+            "right": right_child,
+        }

     def grow_forest(self):
         forest = [{} for _ in range(self.n_estimators)]
         self.training_loss = []
-
-        for i in tqdm(
+
+        for i in tqdm(range(self.n_estimators)):
             self.residual = self.Y_gpu - self.gradients
-
-            self.root_gradient_histogram, self.root_hessian_histogram =
+
+            self.root_gradient_histogram, self.root_hessian_histogram = (
                 self.compute_histograms(self.bin_indices, self.residual)
-
+            )
+
             tree = self.grow_tree(
                 self.root_gradient_histogram,
                 self.root_hessian_histogram,
                 self.root_node_indices,
-                depth=0
+                depth=0,
             )
             forest[i] = tree
             # loss = ((self.Y_gpu - self.gradients) ** 2).mean().item()
@@ -261,7 +320,9 @@ class WarpGBM(BaseEstimator, RegressorMixin):
     def predict(self, X_np):
         X_tensor = torch.from_numpy(X_np).to(torch.float32).pin_memory()
         num_samples = X_tensor.size(0)
-        bin_indices = torch.zeros(
+        bin_indices = torch.zeros(
+            (num_samples, self.num_features), dtype=torch.int8, device=self.device
+        )

         with torch.no_grad():
             for f in range(self.num_features):
@@ -271,17 +332,16 @@ class WarpGBM(BaseEstimator, RegressorMixin):
                 node_kernel.custom_cuda_binner(X_f, bin_edges_f, bin_indices_f)
                 bin_indices[:, f] = bin_indices_f

-            tree_tensor = torch.stack(
-
-
-
+            tree_tensor = torch.stack(
+                [
+                    self.flatten_tree(tree, max_nodes=2 ** (self.max_depth + 1))
+                    for tree in self.forest
+                ]
+            ).to(self.device)

-            out = torch.zeros(num_samples, device=self.device)
+            out = torch.zeros(num_samples, device=self.device) + self.base_prediction
             node_kernel.predict_forest(
-                bin_indices.contiguous(),
-                tree_tensor.contiguous(),
-                self.learning_rate,
-                out
+                bin_indices.contiguous(), tree_tensor.contiguous(), self.learning_rate, out
             )

             return out.cpu().numpy()
@@ -289,20 +349,20 @@ class WarpGBM(BaseEstimator, RegressorMixin):
     def flatten_tree(self, tree, max_nodes):
         """
         Convert a recursive tree structure into a flat matrix format.
-
+
         Each row in the output represents a node:
         - Columns: [feature, bin, left_id, right_id, is_leaf, value]
         - Internal nodes fill columns 0–3 and set is_leaf = 0
         - Leaf nodes fill only value and set is_leaf = 1
-
+
         Args:
             tree (list): A list containing a single root node (recursive dict form).
             max_nodes (int): Max number of nodes to allocate in the flat matrix.
-
+
         Returns:
             torch.Tensor: [max_nodes x 6] matrix representing the flattened tree.
         """
-        flat = torch.full((max_nodes, 6), float(
+        flat = torch.full((max_nodes, 6), float("nan"), dtype=torch.float32)
         node_counter = [0]
         node_list = []

@@ -310,16 +370,16 @@ class WarpGBM(BaseEstimator, RegressorMixin):
             curr_id = node_counter[0]
             node_counter[0] += 1

-            new_node = {
-            if
-                new_node[
+            new_node = {"node_id": curr_id}
+            if "leaf_value" in node:
+                new_node["leaf_value"] = float(node["leaf_value"])
             else:
-                new_node[
-                new_node[
-                new_node[
-                walk(node[
-                new_node[
-                walk(node[
+                new_node["best_feature"] = float(node["feature"])
+                new_node["split_bin"] = float(node["bin"])
+                new_node["left_id"] = node_counter[0]
+                walk(node["left"])
+                new_node["right_id"] = node_counter[0]
+                walk(node["right"])

             node_list.append(new_node)
             return new_node
@@ -327,15 +387,15 @@ class WarpGBM(BaseEstimator, RegressorMixin):
         walk(tree)

         for node in node_list:
-            i = node[
-            if
+            i = node["node_id"]
+            if "leaf_value" in node:
                 flat[i, 4] = 1.0
-                flat[i, 5] = node[
+                flat[i, 5] = node["leaf_value"]
             else:
-                flat[i, 0] = node[
-                flat[i, 1] = node[
-                flat[i, 2] = node[
-                flat[i, 3] = node[
+                flat[i, 0] = node["best_feature"]
+                flat[i, 1] = node["split_bin"]
+                flat[i, 2] = node["left_id"]
+                flat[i, 3] = node["right_id"]
                 flat[i, 4] = 0.0

-        return flat
+        return flat
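As a worked example of the flat node layout that the `flatten_tree` docstring above describes, here is my own trace of the code in this hunk (not output copied from the package), assuming an already constructed `WarpGBM` instance named `model`:

```python
# A depth-1 tree as grow_tree returns it: split on feature 3 at bin 2, two leaf children.
tree = {
    "feature": 3,
    "bin": 2,
    "left": {"leaf_value": -0.5, "samples": 10},
    "right": {"leaf_value": 0.7, "samples": 12},
}

# walk() assigns node ids in pre-order (root=0, left=1, right=2), so the
# [max_nodes x 6] matrix begins with these rows (all unused cells stay NaN):
#   row 0: [feature=3, bin=2, left_id=1, right_id=2, is_leaf=0, value=NaN]
#   row 1: [NaN, NaN, NaN, NaN, is_leaf=1, value=-0.5]
#   row 2: [NaN, NaN, NaN, NaN, is_leaf=1, value= 0.7]
flat = model.flatten_tree(tree, max_nodes=2 ** (model.max_depth + 1))
```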
{warpgbm-0.1.21 → warpgbm-0.1.22}/warpgbm/cuda/histogram_kernel.cu

@@ -107,12 +107,6 @@ void launch_histogram_kernel_cuda(
         N, F, B);
 }

-#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
-#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
-#define CHECK_INPUT(x) \
-    CHECK_CUDA(x);     \
-    CHECK_CONTIGUOUS(x)
-
 // CUDA kernel: tiled, 64-bit safe
 __global__ void histogram_tiled_kernel(
     const int8_t *__restrict__ bin_indices, // [N, F]
@@ -148,10 +142,6 @@ void launch_histogram_kernel_cuda_2(
     int threads_per_block = 256,
     int rows_per_thread = 1)
 {
-    CHECK_INPUT(bin_indices);
-    CHECK_INPUT(gradients);
-    CHECK_INPUT(grad_hist);
-    CHECK_INPUT(hess_hist);

     int64_t N = bin_indices.size(0);
     int64_t F = bin_indices.size(1);
@@ -233,10 +223,6 @@ void launch_histogram_kernel_cuda_configurable(
     int threads_per_block = 256,
     int rows_per_thread = 1)
 {
-    CHECK_INPUT(bin_indices);
-    CHECK_INPUT(gradients);
-    CHECK_INPUT(grad_hist);
-    CHECK_INPUT(hess_hist);

     int64_t N = bin_indices.size(0);
     int64_t F = bin_indices.size(1);
{warpgbm-0.1.21 → warpgbm-0.1.22/warpgbm.egg-info}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: warpgbm
-Version: 0.1.21
+Version: 0.1.22
 Summary: A fast GPU-accelerated Gradient Boosted Decision Tree library with PyTorch + CUDA
 License: GNU GENERAL PUBLIC LICENSE
         Version 3, 29 June 2007
@@ -704,26 +704,20 @@ WarpGBM is a high-performance, GPU-accelerated Gradient Boosted Decision Tree (G

 ---

-## Performance Note
-
-In our initial tests on an NVIDIA 3090 (local) and A100 (Google Colab Pro), WarpGBM achieves **14x to 20x faster training times** compared to LightGBM's CPU version and **2x faster** on the GPU version using default configurations. Speed also outperforms XGBoost and CatBoost on regression problems. It also consumes **significantly less RAM and CPU**. These early results hint at more thorough benchmarking to come.
-
----
-
 ## Benchmarks

 ### Scikit-Learn Synthetic Data: 1 Million Rows and 1,000 Features

-In this benchmark we compare the speed and in-sample correlation of **WarpGBM v0.1.
+In this benchmark we compare the speed and in-sample correlation of **WarpGBM v0.1.21** against LightGBM, XGBoost and CatBoost, all with their GPU-enabled versions. This benchmark runs on Google Colab with the L4 GPU environment.

 ```
-WarpGBM: corr = 0.8882, train =
-XGBoost: corr = 0.8877, train = 33.
-LightGBM: corr = 0.8604, train = 30.
-CatBoost: corr = 0.8935, train =
+WarpGBM: corr = 0.8882, train = 18.7s, infer = 4.9s
+XGBoost: corr = 0.8877, train = 33.1s, infer = 8.1s
+LightGBM: corr = 0.8604, train = 30.3s, infer = 1.4s
+CatBoost: corr = 0.8935, train = 400.0s, infer = 382.6s
 ```

-Colab Notebook: https://colab.research.google.com/drive/16U1kbYlD5HibGbnF5NGsjChZ1p1IA2pK
+Colab Notebook: https://colab.research.google.com/drive/16U1kbYlD5HibGbnF5NGsjChZ1p1IA2pK?usp=sharing

 ---

@@ -746,7 +740,7 @@ pip install warpgbm
 This installs from PyPI and also compiles CUDA code locally during installation. This method works well **if your environment already has PyTorch with GPU support** installed and configured.

 > **Tip:**\
-> If you encounter an error related to mismatched or missing CUDA versions, try installing with the following flag
+> If you encounter an error related to mismatched or missing CUDA versions, try installing with the following flag. This is currently required in the Colab environments.
 >
 > ```bash
 > pip install warpgbm --no-build-isolation
@@ -886,8 +880,7 @@ No installation required — just press **"Open in Playground"**, then **Run All

 ### Methods:
 - `.fit(X, y, era_id=None)`: Train the model. `X` can be raw floats or pre-binned `int8` data. `era_id` is optional and used internally.
-- `.predict(X
-- `.predict_numpy(X, chunksize=50_000)`: Same as `.predict(X)` but without using the GPU.
+- `.predict(X)`: Predict on new data, using parallelized CUDA kernel.

 ---

warpgbm-0.1.21/version.txt
DELETED
@@ -1 +0,0 @@
-0.1.21