warpgbm 0.1.14__tar.gz → 0.1.16__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: warpgbm
- Version: 0.1.14
+ Version: 0.1.16
  Summary: A fast GPU-accelerated Gradient Boosted Decision Tree library with PyTorch + CUDA
  License: GNU GENERAL PUBLIC LICENSE
  Version 3, 29 June 2007
@@ -735,6 +735,17 @@ This installs from PyPI and also compiles CUDA code locally during installation.
  > pip install warpgbm --no-build-isolation
  > ```

+ ### Windows
+
+ Thank you, ShatteredX, for providing working instructions for a Windows installation.
+
+ ```
+ git clone https://github.com/jefferythewind/warpgbm.git
+ cd warpgbm
+ python setup.py bdist_wheel
+ pip install .\dist\warpgbm-0.1.15-cp310-cp310-win_amd64.whl
+ ```
+
  Before either method, make sure you’ve installed PyTorch with GPU support:\
  [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/)

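Both install routes compile CUDA extensions locally, so a GPU-enabled PyTorch build must already be importable before running them. A minimal pre-flight check (a sketch; only `torch` itself is assumed):

```
# Sanity-check the PyTorch install before building warpgbm's CUDA extensions.
import torch

print(torch.__version__)          # a CUDA build shows a tag such as "+cu121"
print(torch.cuda.is_available())  # must print True for GPU training
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the local GPU's name
```
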
@@ -851,18 +862,15 @@ No installation required — just press **"Open in Playground"**, then **Run All
  - `n_estimators`: Number of boosting iterations (default: 100)
  - `min_child_weight`: Minimum sum of instance weight needed in a child (default: 20)
  - `min_split_gain`: Minimum loss reduction required to make a further partition (default: 0.0)
- - `verbosity`: Whether to print training logs (default: True)
  - `histogram_computer`: Choice of histogram kernel (`'hist1'`, `'hist2'`, `'hist3'`) (default: `'hist3'`)
  - `threads_per_block`: CUDA threads per block (default: 32)
  - `rows_per_thread`: Number of training rows processed per thread (default: 4)
- - `device`: Device to train on (`'cuda'` or `'cpu'`, default: `'cuda'`)
- - `split_type`: Algorithm used to choose best split (`'v1'` = CUDA kernel, `'v2'` = torch-based) (default: `'v2'`)
+ - `L2_reg`: L2 regularizer (default: 1e-6)

  ### Methods:
  - `.fit(X, y, era_id=None)`: Train the model. `X` can be raw floats or pre-binned `int8` data. `era_id` is optional and used internally.
- - `.predict(X)`: Predict on new raw float or pre-binned data.
- - `.predict_data(bin_indices)`: Predict from binned data directly (NumPy `int8` matrix).
- - `.grow_forest()`: Manually triggers tree construction loop (usually not needed).
+ - `.predict(X, chunksize=50_000)`: Predict on new raw float or pre-binned data.
+ - `.predict_numpy(X, chunksize=50_000)`: Same as `.predict(X)` but without using the GPU.

  ---

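The parameter and method lists above imply the usual scikit-learn workflow. A hedged usage sketch based only on the signatures visible in this diff (the toy data is illustrative; note the README writes the keyword as `chunksize`, while the 0.1.16 source further down defines it as `chunk_size`):

```
import numpy as np
from warpgbm import WarpGBM

# Illustrative data: raw float32 features; pre-binned int8 would also work.
X = np.random.randn(10_000, 20).astype(np.float32)
y = (X[:, 0] - 2.0 * X[:, 1] + 0.1 * np.random.randn(10_000)).astype(np.float32)

model = WarpGBM(num_bins=10, max_depth=3, n_estimators=100, L2_reg=1e-6)
model.fit(X, y)                                        # era_id defaults to one era
preds_gpu = model.predict(X, chunk_size=50_000)        # chunked GPU traversal
preds_cpu = model.predict_numpy(X, chunk_size=50_000)  # NumPy-only fallback
```
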
@@ -47,6 +47,17 @@ This installs from PyPI and also compiles CUDA code locally during installation.
  > pip install warpgbm --no-build-isolation
  > ```

+ ### Windows
+
+ Thank you, ShatteredX, for providing working instructions for a Windows installation.
+
+ ```
+ git clone https://github.com/jefferythewind/warpgbm.git
+ cd warpgbm
+ python setup.py bdist_wheel
+ pip install .\dist\warpgbm-0.1.15-cp310-cp310-win_amd64.whl
+ ```
+
  Before either method, make sure you’ve installed PyTorch with GPU support:\
  [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/)

@@ -163,18 +174,15 @@ No installation required — just press **"Open in Playground"**, then **Run All
  - `n_estimators`: Number of boosting iterations (default: 100)
  - `min_child_weight`: Minimum sum of instance weight needed in a child (default: 20)
  - `min_split_gain`: Minimum loss reduction required to make a further partition (default: 0.0)
- - `verbosity`: Whether to print training logs (default: True)
  - `histogram_computer`: Choice of histogram kernel (`'hist1'`, `'hist2'`, `'hist3'`) (default: `'hist3'`)
  - `threads_per_block`: CUDA threads per block (default: 32)
  - `rows_per_thread`: Number of training rows processed per thread (default: 4)
- - `device`: Device to train on (`'cuda'` or `'cpu'`, default: `'cuda'`)
- - `split_type`: Algorithm used to choose best split (`'v1'` = CUDA kernel, `'v2'` = torch-based) (default: `'v2'`)
+ - `L2_reg`: L2 regularizer (default: 1e-6)

  ### Methods:
  - `.fit(X, y, era_id=None)`: Train the model. `X` can be raw floats or pre-binned `int8` data. `era_id` is optional and used internally.
- - `.predict(X)`: Predict on new raw float or pre-binned data.
- - `.predict_data(bin_indices)`: Predict from binned data directly (NumPy `int8` matrix).
- - `.grow_forest()`: Manually triggers tree construction loop (usually not needed).
+ - `.predict(X, chunksize=50_000)`: Predict on new raw float or pre-binned data.
+ - `.predict_numpy(X, chunksize=50_000)`: Same as `.predict(X)` but without using the GPU.

  ---

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

  [project]
  name = "warpgbm"
- version = "0.1.14"
+ version = "0.1.16"
  description = "A fast GPU-accelerated Gradient Boosted Decision Tree library with PyTorch + CUDA"
  readme = "README.md"
  requires-python = ">=3.8"
@@ -1,14 +1,12 @@
  import numpy as np
  from warpgbm import WarpGBM
+ from sklearn.datasets import make_regression

  def test_fit_predict_correlation():
      np.random.seed(42)
-     N = 500
-     F = 5
-     X = np.random.randn(N, F).astype(np.float32)
-     true_weights = np.array([0.5, -1.0, 2.0, 0.0, 1.0])
-     noise = 0.1 * np.random.randn(N)
-     y = (X @ true_weights + noise).astype(np.float32)
+     N = 1_000_000
+     F = 100
+     X, y = make_regression(n_samples=N, n_features=F, noise=0.1, random_state=42)
      era = np.zeros(N, dtype=np.int32)
      corrs = []

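The hunk cuts off after `corrs = []`, so the loop body is not shown. A plausible continuation, purely as a sketch (the kernel list, tree count, and 0.9 threshold are assumptions, not taken from the package):

```
# Hypothetical remainder of the truncated test: fit with each histogram
# kernel and check that predictions correlate strongly with the target.
for hist in ['hist1', 'hist2', 'hist3']:
    model = WarpGBM(histogram_computer=hist, n_estimators=10)
    model.fit(X, y, era_id=era)
    preds = model.predict(X)
    corrs.append(np.corrcoef(preds, y)[0, 1])

assert all(c > 0.9 for c in corrs)  # threshold chosen for illustration
```
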
@@ -0,0 +1 @@
+ 0.1.16
@@ -0,0 +1,522 @@
+ import torch
+ import numpy as np
+ from sklearn.base import BaseEstimator, RegressorMixin
+ from warpgbm.cuda import node_kernel
+ from tqdm import tqdm
+ from typing import Tuple
+ from torch import Tensor
+
+ histogram_kernels = {
+     'hist1': node_kernel.compute_histogram,
+     'hist2': node_kernel.compute_histogram2,
+     'hist3': node_kernel.compute_histogram3
+ }
+
+ @torch.jit.script
+ def jit_find_best_split(
+     G: Tensor, H: Tensor,
+     lambda_l2: float,
+     lambda_l1: float,  # unused placeholder for now
+     min_split_gain: float,
+     min_child_weight: float
+ ) -> Tuple[int, int]:
+     F, B = G.size()
+     Bm1 = B - 1
+     eps = 0
+
+     GH = torch.stack([G, H], dim=0).cumsum(dim=2)  # [2, F, B]
+     GL, HL_raw = GH[0, :, :-1], GH[1, :, :-1]  # [F, B-1]
+     GP, HP = GH[0, :, -1:], GH[1, :, -1:]  # [F, 1]
+     H_R_raw = HP - HL_raw
+
+     # Validity mask using raw child hessians
+     valid = (HL_raw >= min_child_weight) & (H_R_raw >= min_child_weight)
+
+     # Closed-form gain
+     HL, HP = HL_raw + lambda_l2, HP + lambda_l2
+     num = (HP * GL - HL * GP).pow(2)
+     denom = HP * HL * (HP - HL) + eps
+     gain = torch.where(valid & (num / denom >= min_split_gain), num / denom, torch.full_like(num, -float("inf")))
+
+     gain_flat = gain.view(-1)
+     best_idx = torch.argmax(gain_flat)
+
+     if gain_flat[best_idx].item() == float('-inf'):
+         return -1, -1
+
+     return best_idx // Bm1, best_idx % Bm1
+
+ class WarpGBM(BaseEstimator, RegressorMixin):
+     def __init__(
+         self,
+         num_bins=10,
+         max_depth=3,
+         learning_rate=0.1,
+         n_estimators=100,
+         min_child_weight=20,
+         min_split_gain=0.0,
+         verbosity=True,
+         histogram_computer='hist3',
+         threads_per_block=64,
+         rows_per_thread=4,
+         L2_reg = 1e-6,
+         L1_reg = 0.0,
+         device = 'cuda'
+     ):
+         self.num_bins = num_bins
+         self.max_depth = max_depth
+         self.learning_rate = learning_rate
+         self.n_estimators = n_estimators
+         self.forest = None
+         self.bin_edges = None  # shape: [num_features, num_bins-1] if using quantile binning
+         self.base_prediction = None
+         self.unique_eras = None
+         self.device = device
+         self.root_gradient_histogram = None
+         self.root_hessian_histogram = None
+         self.gradients = None
+         self.root_node_indices = None
+         self.bin_indices = None
+         self.Y_gpu = None
+         self.num_features = None
+         self.num_samples = None
+         self.out_feature = torch.zeros(1, device=self.device, dtype=torch.int32)
+         self.out_bin = torch.zeros(1, device=self.device, dtype=torch.int32)
+         self.min_child_weight = min_child_weight
+         self.min_split_gain = min_split_gain
+         self.best_gain = torch.tensor([-float('inf')], dtype=torch.float32, device=self.device)
+         self.best_feature = torch.tensor([-1], dtype=torch.int32, device=self.device)
+         self.best_bin = torch.tensor([-1], dtype=torch.int32, device=self.device)
+         self.compute_histogram = histogram_kernels[histogram_computer]
+         self.threads_per_block = threads_per_block
+         self.rows_per_thread = rows_per_thread
+         self.L2_reg = L2_reg
+         self.L1_reg = L1_reg
+
+     def fit(self, X, y, era_id=None):
+         if era_id is None:
+             era_id = np.ones(X.shape[0], dtype='int32')
+         self.bin_indices, era_indices, self.bin_edges, self.unique_eras, self.Y_gpu = self.preprocess_gpu_data(X, y, era_id)
+         self.num_samples, self.num_features = X.shape
+         self.gradients = torch.zeros_like(self.Y_gpu)
+         self.root_node_indices = torch.arange(self.num_samples, device=self.device)
+         self.base_prediction = self.Y_gpu.mean().item()
+         self.gradients += self.base_prediction
+         self.split_gains = torch.zeros((self.num_features, self.num_bins - 1), device=self.device)
+         self.forest = self.grow_forest()
+         return self
+
+     def compute_quantile_bins(self, X, num_bins):
+         quantiles = torch.linspace(0, 1, num_bins + 1)[1:-1]  # exclude 0% and 100%
+         bin_edges = torch.quantile(X, quantiles, dim=0)  # shape: [B-1, F]
+         return bin_edges.T  # shape: [F, B-1]
+
+     def preprocess_gpu_data(self, X_np, Y_np, era_id_np):
+         self.num_samples, self.num_features = X_np.shape
+         Y_gpu = torch.from_numpy(Y_np).type(torch.float32).to(self.device)
+         era_id_gpu = torch.from_numpy(era_id_np).type(torch.int32).to(self.device)
+         is_integer_type = np.issubdtype(X_np.dtype, np.integer)
+         if is_integer_type:
+             max_vals = X_np.max(axis=0)
+             if np.all(max_vals < self.num_bins):
+                 print("Detected pre-binned integer input — skipping quantile binning.")
+                 bin_indices = torch.from_numpy(X_np).to(self.device).contiguous().to(torch.int8)
+
+                 # We'll store None or an empty tensor in self.bin_edges
+                 # to indicate that we skip binning at predict-time
+                 bin_edges = torch.arange(1, self.num_bins, dtype=torch.float32).repeat(self.num_features, 1)
+                 bin_edges = bin_edges.to(self.device)
+                 unique_eras, era_indices = torch.unique(era_id_gpu, return_inverse=True)
+                 return bin_indices, era_indices, bin_edges, unique_eras, Y_gpu
+             else:
+                 print("Integer input detected, but values exceed num_bins — falling back to quantile binning.")
+
+         print("Performing quantile binning on CPU...")
+         X_cpu = torch.from_numpy(X_np).type(torch.float32)  # CPU tensor
+         bin_edges_cpu = self.compute_quantile_bins(X_cpu, self.num_bins).type(torch.float32).contiguous()
+         bin_indices_cpu = torch.empty((self.num_samples, self.num_features), dtype=torch.int8)
+         for f in range(self.num_features):
+             bin_indices_cpu[:, f] = torch.bucketize(X_cpu[:, f], bin_edges_cpu[f], right=False).type(torch.int8)
+         bin_indices = bin_indices_cpu.to(self.device).contiguous()
+         bin_edges = bin_edges_cpu.to(self.device)
+         unique_eras, era_indices = torch.unique(era_id_gpu, return_inverse=True)
+         return bin_indices, era_indices, bin_edges, unique_eras, Y_gpu
+
+     def compute_histograms(self, bin_indices_sub, gradients):
+         grad_hist = torch.zeros((self.num_features, self.num_bins), device=self.device, dtype=torch.float32)
+         hess_hist = torch.zeros((self.num_features, self.num_bins), device=self.device, dtype=torch.float32)
+
+         self.compute_histogram(
+             bin_indices_sub,
+             gradients,
+             grad_hist,
+             hess_hist,
+             self.num_bins,
+             self.threads_per_block,
+             self.rows_per_thread
+         )
+         return grad_hist, hess_hist
+
+     def find_best_split(self, gradient_histogram, hessian_histogram):
+         f, b = jit_find_best_split(
+             gradient_histogram,
+             hessian_histogram,
+             self.L2_reg,
+             self.L1_reg,
+             self.min_split_gain,
+             self.min_child_weight,
+         )
+         return (f, b)
+
+     def grow_tree(self, gradient_histogram, hessian_histogram, node_indices, depth):
+         if depth == self.max_depth:
+             leaf_value = self.residual[node_indices].mean()
+             self.gradients[node_indices] += self.learning_rate * leaf_value
+             return {"leaf_value": leaf_value.item(), "samples": node_indices.numel()}
+
+         parent_size = node_indices.numel()
+         best_feature, best_bin = self.find_best_split(gradient_histogram, hessian_histogram)
+
+         if best_feature == -1:
+             leaf_value = self.residual[node_indices].mean()
+             self.gradients[node_indices] += self.learning_rate * leaf_value
+             return {"leaf_value": leaf_value.item(), "samples": parent_size}
+
+         split_mask = (self.bin_indices[node_indices, best_feature] <= best_bin)
+         left_indices = node_indices[split_mask]
+         right_indices = node_indices[~split_mask]
+
+         left_size = left_indices.numel()
+         right_size = right_indices.numel()
+
+         if left_size == 0 or right_size == 0:
+             leaf_value = self.residual[node_indices].mean()
+             self.gradients[node_indices] += self.learning_rate * leaf_value
+             return {"leaf_value": leaf_value.item(), "samples": parent_size}
+
+         if left_size <= right_size:
+             grad_hist_left, hess_hist_left = self.compute_histograms( self.bin_indices[left_indices], self.residual[left_indices] )
+             grad_hist_right = gradient_histogram - grad_hist_left
+             hess_hist_right = hessian_histogram - hess_hist_left
+         else:
+             grad_hist_right, hess_hist_right = self.compute_histograms( self.bin_indices[right_indices], self.residual[right_indices] )
+             grad_hist_left = gradient_histogram - grad_hist_right
+             hess_hist_left = hessian_histogram - hess_hist_right
+
+         new_depth = depth + 1
+         left_child = self.grow_tree(grad_hist_left, hess_hist_left, left_indices, new_depth)
+         right_child = self.grow_tree(grad_hist_right, hess_hist_right, right_indices, new_depth)
+
+         return { "feature": best_feature, "bin": best_bin, "left": left_child, "right": right_child }
+
+     def grow_forest(self):
+         forest = [{} for _ in range(self.n_estimators)]
+         self.training_loss = []
+
+         for i in tqdm( range(self.n_estimators) ):
+             self.residual = self.Y_gpu - self.gradients
+
+             self.root_gradient_histogram, self.root_hessian_histogram = \
+                 self.compute_histograms(self.bin_indices, self.residual)
+
+             tree = self.grow_tree(
+                 self.root_gradient_histogram,
+                 self.root_hessian_histogram,
+                 self.root_node_indices,
+                 depth=0
+             )
+             forest[i] = tree
+             # loss = ((self.Y_gpu - self.gradients) ** 2).mean().item()
+             # self.training_loss.append(loss)
+             # print(f"🌲 Tree {i+1}/{self.n_estimators} - MSE: {loss:.6f}")
+
+         print("Finished training forest.")
+         return forest
+
+     def predict(self, X_np, chunk_size=50000):
+         """
+         Vectorized predict using a padded layer-by-layer approach.
+         We assume `flatten_forest_to_tensors` has produced self.flat_forest with
+         "features", "thresholds", "leaf_values", all shaped [n_trees, max_nodes].
+         """
+         # 1) Convert X_np -> bin_indices
+         is_integer_type = np.issubdtype(X_np.dtype, np.integer)
+         if is_integer_type:
+             max_vals = X_np.max(axis=0)
+             if np.all(max_vals < self.num_bins):
+                 bin_indices = X_np.astype(np.int8)
+             else:
+                 raise ValueError("Pre-binned integers must be < num_bins")
+         else:
+             X_cpu = torch.from_numpy(X_np).type(torch.float32)
+             bin_indices = torch.empty((X_np.shape[0], X_np.shape[1]), dtype=torch.int8)
+             bin_edges_cpu = self.bin_edges.to('cpu')
+             for f in range(self.num_features):
+                 bin_indices[:, f] = torch.bucketize(X_cpu[:, f], bin_edges_cpu[f], right=False).type(torch.int8)
+             bin_indices = bin_indices.numpy()
+
+         # 2) Ensure we have a padded representation
+         self.flat_forest = self.flatten_forest_to_tensors(self.forest)
+
+         features_t = self.flat_forest["features"]  # [n_trees, max_nodes], int16
+         thresholds_t = self.flat_forest["thresholds"]  # [n_trees, max_nodes], int16
+         values_t = self.flat_forest["leaf_values"]  # [n_trees, max_nodes], float32
+         max_nodes = self.flat_forest["max_nodes"]
+
+         n_trees = features_t.shape[0]
+         N = bin_indices.shape[0]
+         out = np.zeros(N, dtype=np.float32)
+
+         # 3) Process rows in chunks
+         for start in tqdm(range(0, N, chunk_size)):
+             end = min(start + chunk_size, N)
+             chunk_np = bin_indices[start:end]  # shape [chunk_size, F]
+             chunk_gpu = torch.from_numpy(chunk_np).to(self.device)  # [chunk_size, F], int8
+
+             # Accumulate raw (unscaled) leaf sums
+             chunk_preds = torch.zeros((end - start,), dtype=torch.float32, device=self.device)
+
+             # node_idx[i] tracks the current node index in the padded tree for row i
+             node_idx = torch.zeros((end - start,), dtype=torch.int32, device=self.device)
+
+             # 'active' is a boolean mask over [0..(end-start-1)], indicating which rows haven't reached a leaf
+             active = torch.ones((end - start,), dtype=torch.bool, device=self.device)
+
+             for t in range(n_trees):
+                 # Reset for each tree (each tree is independent)
+                 node_idx.fill_(0)
+                 active.fill_(True)
+
+                 tree_features = features_t[t]  # shape [max_nodes], int16
+                 tree_thresh = thresholds_t[t]  # shape [max_nodes], int16
+                 tree_values = values_t[t]  # shape [max_nodes], float32
+
+                 # Up to self.max_depth+1 layers
+                 for _level in range(self.max_depth + 1):
+                     active_idx = active.nonzero(as_tuple=True)[0]
+                     if active_idx.numel() == 0:
+                         break  # all rows are done in this tree
+
+                     current_node_idx = node_idx[active_idx]
+                     f = tree_features[current_node_idx]  # shape [#active], int16
+                     thr = tree_thresh[current_node_idx]  # shape [#active], int16
+                     vals = tree_values[current_node_idx]  # shape [#active], float32
+
+                     mask_no_node = (f == -2)
+                     mask_leaf = (f == -1)
+
+                     # If leaf, add leaf value and mark inactive.
+                     if mask_leaf.any():
+                         leaf_rows = active_idx[mask_leaf]
+                         chunk_preds[leaf_rows] += vals[mask_leaf]
+                         active[leaf_rows] = False
+
+                     # If no node, mark inactive.
+                     if mask_no_node.any():
+                         no_node_rows = active_idx[mask_no_node]
+                         active[no_node_rows] = False
+
+                     # For internal nodes, perform bin comparison.
+                     mask_internal = (~mask_leaf & ~mask_no_node)
+                     if mask_internal.any():
+                         internal_rows = active_idx[mask_internal]
+                         act_f = f[mask_internal].long()
+                         act_thr = thr[mask_internal]
+                         binvals = chunk_gpu[internal_rows, act_f]
+                         go_left = (binvals <= act_thr)
+                         new_left_idx = current_node_idx[mask_internal] * 2 + 1
+                         new_right_idx = current_node_idx[mask_internal] * 2 + 2
+                         node_idx[internal_rows[go_left]] = new_left_idx[go_left]
+                         node_idx[internal_rows[~go_left]] = new_right_idx[~go_left]
+                 # end per-tree layer loop
+             # end for each tree
+
+             out[start:end] = (
+                 self.base_prediction + self.learning_rate * chunk_preds
+             ).cpu().numpy()
+
+         return out
+
+     def flatten_forest_to_tensors(self, forest):
+         """
+         Convert a list of dict-based trees into a fixed-size array representation
+         for each tree, up to max_depth. Each tree is stored in a 'perfect binary tree'
+         layout:
+           - node 0 is the root
+           - node i has children (2*i + 1) and (2*i + 2), if within range
+           - feature = -2 indicates no node / invalid
+           - feature = -1 indicates a leaf node
+           - otherwise, an internal node with that feature.
+         """
+         n_trees = len(forest)
+         max_nodes = 2 ** (self.max_depth + 1) - 1  # total array slots per tree
+
+         # Allocate padded arrays (on CPU for ease of indexing).
+         feat_arr = np.full((n_trees, max_nodes), -2, dtype=np.int16)
+         thresh_arr = np.full((n_trees, max_nodes), -2, dtype=np.int16)
+         value_arr = np.zeros((n_trees, max_nodes), dtype=np.float32)
+
+         def fill_padded(tree, tree_idx, node_idx, depth):
+             """
+             Recursively fill feat_arr, thresh_arr, value_arr for a single tree.
+             If depth == self.max_depth, no children are added.
+             If there's no node, feature remains -2.
+             """
+             if "leaf_value" in tree:
+                 feat_arr[tree_idx, node_idx] = -1
+                 thresh_arr[tree_idx, node_idx] = -1
+                 value_arr[tree_idx, node_idx] = tree["leaf_value"]
+                 return
+
+             feat = tree["feature"]
+             bin_th = tree["bin"]
+
+             feat_arr[tree_idx, node_idx] = feat
+             thresh_arr[tree_idx, node_idx] = bin_th
+             # Internal nodes keep a 0 value.
+
+             if depth < self.max_depth:
+                 left_idx = 2 * node_idx + 1
+                 right_idx = 2 * node_idx + 2
+                 fill_padded(tree["left"], tree_idx, left_idx, depth + 1)
+                 fill_padded(tree["right"], tree_idx, right_idx, depth + 1)
+             # At max depth, children remain unfilled (-2).
+
+         for t, root in enumerate(forest):
+             fill_padded(root, t, 0, 0)
+
+         # Convert to torch Tensors on the proper device.
+         features_t = torch.from_numpy(feat_arr).to(self.device)
+         thresholds_t = torch.from_numpy(thresh_arr).to(self.device)
+         leaf_values_t = torch.from_numpy(value_arr).to(self.device)
+
+         return {
+             "features": features_t,  # [n_trees, max_nodes]
+             "thresholds": thresholds_t,  # [n_trees, max_nodes]
+             "leaf_values": leaf_values_t,  # [n_trees, max_nodes]
+             "max_nodes": max_nodes
+         }
+
+     def predict_numpy(self, X_np, chunk_size=50000):
+         """
+         Fully NumPy-based version of predict_fast.
+         Assumes flatten_forest_to_tensors has been called and `self.flat_forest` is ready.
+         """
+         # 1) Convert X_np -> bin_indices
+         is_integer_type = np.issubdtype(X_np.dtype, np.integer)
+         if is_integer_type:
+             max_vals = X_np.max(axis=0)
+             if np.all(max_vals < self.num_bins):
+                 bin_indices = X_np.astype(np.int8)
+             else:
+                 raise ValueError("Pre-binned integers must be < num_bins")
+         else:
+             bin_indices = np.empty_like(X_np, dtype=np.int8)
+             # Ensure bin_edges are NumPy arrays
+             if isinstance(self.bin_edges[0], torch.Tensor):
+                 bin_edges_np = [be.cpu().numpy() for be in self.bin_edges]
+             else:
+                 bin_edges_np = self.bin_edges
+
+             for f in range(self.num_features):
+                 bin_indices[:, f] = np.searchsorted(bin_edges_np[f], X_np[:, f], side='left')
+
+         # Ensure we have a padded representation
+         self.flat_forest = self.flatten_forest(self.forest)
+
+         # 2) Padded forest arrays (already NumPy now)
+         features_t = self.flat_forest["features"]  # [n_trees, max_nodes], int16
+         thresholds_t = self.flat_forest["thresholds"]  # [n_trees, max_nodes], int16
+         values_t = self.flat_forest["leaf_values"]  # [n_trees, max_nodes], float32
+         max_nodes = self.flat_forest["max_nodes"]
+         n_trees = features_t.shape[0]
+         N = bin_indices.shape[0]
+         out = np.zeros(N, dtype=np.float32)
+
+         # 3) Process in chunks
+         for start in tqdm( range(0, N, chunk_size) ):
+             end = min(start + chunk_size, N)
+             chunk = bin_indices[start:end]  # [chunk_size, F]
+             chunk_preds = np.zeros(end - start, dtype=np.float32)
+
+             for t in range(n_trees):
+                 node_idx = np.zeros(end - start, dtype=np.int32)
+                 active = np.ones(end - start, dtype=bool)
+
+                 tree_features = features_t[t]  # [max_nodes]
+                 tree_thresh = thresholds_t[t]  # [max_nodes]
+                 tree_values = values_t[t]  # [max_nodes]
+
+                 for _level in range(self.max_depth + 1):
+                     active_idx = np.nonzero(active)[0]
+                     if active_idx.size == 0:
+                         break
+
+                     current_node_idx = node_idx[active_idx]
+                     f = tree_features[current_node_idx]
+                     thr = tree_thresh[current_node_idx]
+                     vals = tree_values[current_node_idx]
+
+                     mask_no_node = (f == -2)
+                     mask_leaf = (f == -1)
+                     mask_internal = ~(mask_leaf | mask_no_node)
+
+                     if np.any(mask_leaf):
+                         leaf_rows = active_idx[mask_leaf]
+                         chunk_preds[leaf_rows] += vals[mask_leaf]
+                         active[leaf_rows] = False
+
+                     if np.any(mask_no_node):
+                         no_node_rows = active_idx[mask_no_node]
+                         active[no_node_rows] = False
+
+                     if np.any(mask_internal):
+                         internal_rows = active_idx[mask_internal]
+                         act_f = f[mask_internal].astype(np.int32)
+                         act_thr = thr[mask_internal]
+                         binvals = chunk[internal_rows, act_f]
+                         go_left = binvals <= act_thr
+
+                         new_left_idx = current_node_idx[mask_internal] * 2 + 1
+                         new_right_idx = current_node_idx[mask_internal] * 2 + 2
+                         node_idx[internal_rows[go_left]] = new_left_idx[go_left]
+                         node_idx[internal_rows[~go_left]] = new_right_idx[~go_left]
+
+             out[start:end] = self.base_prediction + self.learning_rate * chunk_preds
+
+         return out
+
+     def flatten_forest(self, forest):
+         n_trees = len(forest)
+         max_nodes = 2 ** (self.max_depth + 1) - 1
+
+         feat_arr = np.full((n_trees, max_nodes), -2, dtype=np.int16)
+         thresh_arr = np.full((n_trees, max_nodes), -2, dtype=np.int16)
+         value_arr = np.zeros((n_trees, max_nodes), dtype=np.float32)
+
+         def fill_padded(tree, tree_idx, node_idx, depth):
+             if "leaf_value" in tree:
+                 feat_arr[tree_idx, node_idx] = -1
+                 thresh_arr[tree_idx, node_idx] = -1
+                 value_arr[tree_idx, node_idx] = tree["leaf_value"]
+                 return
+             feat = tree["feature"]
+             bin_th = tree["bin"]
+             feat_arr[tree_idx, node_idx] = feat
+             thresh_arr[tree_idx, node_idx] = bin_th
+
+             if depth < self.max_depth:
+                 left_idx = 2 * node_idx + 1
+                 right_idx = 2 * node_idx + 2
+                 fill_padded(tree["left"], tree_idx, left_idx, depth + 1)
+                 fill_padded(tree["right"], tree_idx, right_idx, depth + 1)
+
+         for t, root in enumerate(forest):
+             fill_padded(root, t, 0, 0)
+
+         return {
+             "features": feat_arr,
+             "thresholds": thresh_arr,
+             "leaf_values": value_arr,
+             "max_nodes": max_nodes
+         }
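
For reference, the `num`/`denom` pair in `jit_find_best_split` is the familiar three-term second-order split gain collapsed into a single fraction. Substituting G_R = G_P − G_L and H_R = H_P − H_L (with `lambda_l2` already folded into H_L and H_P, as the code does) gives, dropping the constant factor 1/2:

```
\[
\frac{G_L^2}{H_L} + \frac{(G_P - G_L)^2}{H_P - H_L} - \frac{G_P^2}{H_P}
  \;=\; \frac{(H_P\, G_L - H_L\, G_P)^2}{H_P \, H_L \, (H_P - H_L)}
\]
```

One side effect of this substitution, visible in the code: since H_L and H_P each absorb one `lambda_l2`, the right-child term H_P − H_L equals the raw right-child hessian and carries no regularizer of its own; at the default 1e-6 the difference is negligible.
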
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: warpgbm
- Version: 0.1.14
+ Version: 0.1.16
  Summary: A fast GPU-accelerated Gradient Boosted Decision Tree library with PyTorch + CUDA
  License: GNU GENERAL PUBLIC LICENSE
  Version 3, 29 June 2007
@@ -735,6 +735,17 @@ This installs from PyPI and also compiles CUDA code locally during installation.
  > pip install warpgbm --no-build-isolation
  > ```

+ ### Windows
+
+ Thank you, ShatteredX, for providing working instructions for a Windows installation.
+
+ ```
+ git clone https://github.com/jefferythewind/warpgbm.git
+ cd warpgbm
+ python setup.py bdist_wheel
+ pip install .\dist\warpgbm-0.1.15-cp310-cp310-win_amd64.whl
+ ```
+
  Before either method, make sure you’ve installed PyTorch with GPU support:\
  [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/)

@@ -851,18 +862,15 @@ No installation required — just press **"Open in Playground"**, then **Run All
  - `n_estimators`: Number of boosting iterations (default: 100)
  - `min_child_weight`: Minimum sum of instance weight needed in a child (default: 20)
  - `min_split_gain`: Minimum loss reduction required to make a further partition (default: 0.0)
- - `verbosity`: Whether to print training logs (default: True)
  - `histogram_computer`: Choice of histogram kernel (`'hist1'`, `'hist2'`, `'hist3'`) (default: `'hist3'`)
  - `threads_per_block`: CUDA threads per block (default: 32)
  - `rows_per_thread`: Number of training rows processed per thread (default: 4)
- - `device`: Device to train on (`'cuda'` or `'cpu'`, default: `'cuda'`)
- - `split_type`: Algorithm used to choose best split (`'v1'` = CUDA kernel, `'v2'` = torch-based) (default: `'v2'`)
+ - `L2_reg`: L2 regularizer (default: 1e-6)

  ### Methods:
  - `.fit(X, y, era_id=None)`: Train the model. `X` can be raw floats or pre-binned `int8` data. `era_id` is optional and used internally.
- - `.predict(X)`: Predict on new raw float or pre-binned data.
- - `.predict_data(bin_indices)`: Predict from binned data directly (NumPy `int8` matrix).
- - `.grow_forest()`: Manually triggers tree construction loop (usually not needed).
+ - `.predict(X, chunksize=50_000)`: Predict on new raw float or pre-binned data.
+ - `.predict_numpy(X, chunksize=50_000)`: Same as `.predict(X)` but without using the GPU.

  ---

@@ -1 +0,0 @@
- 0.1.14
@@ -1,241 +0,0 @@
- import torch
- import numpy as np
- from sklearn.base import BaseEstimator, RegressorMixin
- from warpgbm.cuda import node_kernel
- from tqdm import tqdm
-
- histogram_kernels = {
-     'hist1': node_kernel.compute_histogram,
-     'hist2': node_kernel.compute_histogram2,
-     'hist3': node_kernel.compute_histogram3
- }
-
- class WarpGBM(BaseEstimator, RegressorMixin):
-     def __init__(
-         self,
-         num_bins=10,
-         max_depth=3,
-         learning_rate=0.1,
-         n_estimators=100,
-         min_child_weight=20,
-         min_split_gain=0.0,
-         verbosity=True,
-         histogram_computer='hist3',
-         threads_per_block=64,
-         rows_per_thread=4,
-         L2_reg = 1e-6,
-         device = 'cuda'
-     ):
-         self.num_bins = num_bins
-         self.max_depth = max_depth
-         self.learning_rate = learning_rate
-         self.n_estimators = n_estimators
-         self.forest = None
-         self.bin_edges = None  # shape: [num_features, num_bins-1] if using quantile binning
-         self.base_prediction = None
-         self.unique_eras = None
-         self.device = device
-         self.root_gradient_histogram = None
-         self.root_hessian_histogram = None
-         self.gradients = None
-         self.root_node_indices = None
-         self.bin_indices = None
-         self.Y_gpu = None
-         self.num_features = None
-         self.num_samples = None
-         self.out_feature = torch.zeros(1, device=self.device, dtype=torch.int32)
-         self.out_bin = torch.zeros(1, device=self.device, dtype=torch.int32)
-         self.min_child_weight = min_child_weight
-         self.min_split_gain = min_split_gain
-         self.best_gain = torch.tensor([-float('inf')], dtype=torch.float32, device=self.device)
-         self.best_feature = torch.tensor([-1], dtype=torch.int32, device=self.device)
-         self.best_bin = torch.tensor([-1], dtype=torch.int32, device=self.device)
-         self.compute_histogram = histogram_kernels[histogram_computer]
-         self.threads_per_block = threads_per_block
-         self.rows_per_thread = rows_per_thread
-         self.L2_reg = L2_reg
-
-
-     def fit(self, X, y, era_id=None):
-         if era_id is None:
-             era_id = np.ones(X.shape[0], dtype='int32')
-         self.bin_indices, era_indices, self.bin_edges, self.unique_eras, self.Y_gpu = self.preprocess_gpu_data(X, y, era_id)
-         self.num_samples, self.num_features = X.shape
-         self.gradients = torch.zeros_like(self.Y_gpu)
-         self.root_node_indices = torch.arange(self.num_samples, device=self.device)
-         self.base_prediction = self.Y_gpu.mean().item()
-         self.gradients += self.base_prediction
-         self.split_gains = torch.zeros((self.num_features, self.num_bins - 1), device=self.device)
-         self.forest = self.grow_forest()
-         return self
-
-     def compute_quantile_bins(self, X, num_bins):
-         quantiles = torch.linspace(0, 1, num_bins + 1)[1:-1]  # exclude 0% and 100%
-         bin_edges = torch.quantile(X, quantiles, dim=0)  # shape: [B-1, F]
-         return bin_edges.T  # shape: [F, B-1]
-
-     def preprocess_gpu_data(self, X_np, Y_np, era_id_np):
-         self.num_samples, self.num_features = X_np.shape
-         Y_gpu = torch.from_numpy(Y_np).type(torch.float32).to(self.device)
-         era_id_gpu = torch.from_numpy(era_id_np).type(torch.int32).to(self.device)
-         is_integer_type = np.issubdtype(X_np.dtype, np.integer)
-         if is_integer_type:
-             max_vals = X_np.max(axis=0)
-             if np.all(max_vals < self.num_bins):
-                 print("Detected pre-binned integer input — skipping quantile binning.")
-                 bin_indices = torch.from_numpy(X_np).to(self.device).contiguous().to(torch.int8)
-
-                 # We'll store None or an empty tensor in self.bin_edges
-                 # to indicate that we skip binning at predict-time
-                 bin_edges = torch.arange(1, self.num_bins, dtype=torch.float32).repeat(self.num_features, 1)
-                 bin_edges = bin_edges.to(self.device)
-                 unique_eras, era_indices = torch.unique(era_id_gpu, return_inverse=True)
-                 return bin_indices, era_indices, bin_edges, unique_eras, Y_gpu
-             else:
-                 print("Integer input detected, but values exceed num_bins — falling back to quantile binning.")
-
-         print("Performing quantile binning on CPU...")
-         X_cpu = torch.from_numpy(X_np).type(torch.float32)  # CPU tensor
-         bin_edges_cpu = self.compute_quantile_bins(X_cpu, self.num_bins).type(torch.float32).contiguous()
-         bin_indices_cpu = torch.empty((self.num_samples, self.num_features), dtype=torch.int8)
-         for f in range(self.num_features):
-             bin_indices_cpu[:, f] = torch.bucketize(X_cpu[:, f], bin_edges_cpu[f], right=False).type(torch.int8)
-         bin_indices = bin_indices_cpu.to(self.device).contiguous()
-         bin_edges = bin_edges_cpu.to(self.device)
-         unique_eras, era_indices = torch.unique(era_id_gpu, return_inverse=True)
-         return bin_indices, era_indices, bin_edges, unique_eras, Y_gpu
-
-     def compute_histograms(self, bin_indices_sub, gradients):
-         grad_hist = torch.zeros((self.num_features, self.num_bins), device=self.device, dtype=torch.float32)
-         hess_hist = torch.zeros((self.num_features, self.num_bins), device=self.device, dtype=torch.float32)
-
-         self.compute_histogram(
-             bin_indices_sub,
-             gradients,
-             grad_hist,
-             hess_hist,
-             self.num_bins,
-             self.threads_per_block,
-             self.rows_per_thread
-         )
-         return grad_hist, hess_hist
-
-     def find_best_split(self, gradient_histogram, hessian_histogram):
-         node_kernel.compute_split(
-             gradient_histogram.contiguous(),
-             hessian_histogram.contiguous(),
-             self.num_features,
-             self.num_bins,
-             self.min_split_gain,
-             self.min_child_weight,
-             self.L2_reg,
-             self.out_feature,
-             self.out_bin
-         )
-
-         f = int(self.out_feature[0])
-         b = int(self.out_bin[0])
-         return (f, b)
-
-     def grow_tree(self, gradient_histogram, hessian_histogram, node_indices, depth):
-         if depth == self.max_depth:
-             leaf_value = self.residual[node_indices].mean()
-             self.gradients[node_indices] += self.learning_rate * leaf_value
-             return {"leaf_value": leaf_value.item(), "samples": node_indices.numel()}
-
-         parent_size = node_indices.numel()
-         best_feature, best_bin = self.find_best_split(gradient_histogram, hessian_histogram)
-
-         if best_feature == -1:
-             leaf_value = self.residual[node_indices].mean()
-             self.gradients[node_indices] += self.learning_rate * leaf_value
-             return {"leaf_value": leaf_value.item(), "samples": parent_size}
-
-         split_mask = (self.bin_indices[node_indices, best_feature] <= best_bin)
-         left_indices = node_indices[split_mask]
-         right_indices = node_indices[~split_mask]
-
-         left_size = left_indices.numel()
-         right_size = right_indices.numel()
-
-         if left_size == 0 or right_size == 0:
-             leaf_value = self.residual[node_indices].mean()
-             self.gradients[node_indices] += self.learning_rate * leaf_value
-             return {"leaf_value": leaf_value.item(), "samples": parent_size}
-
-         if left_size <= right_size:
-             grad_hist_left, hess_hist_left = self.compute_histograms( self.bin_indices[left_indices], self.residual[left_indices] )
-             grad_hist_right = gradient_histogram - grad_hist_left
-             hess_hist_right = hessian_histogram - hess_hist_left
-         else:
-             grad_hist_right, hess_hist_right = self.compute_histograms( self.bin_indices[right_indices], self.residual[right_indices] )
-             grad_hist_left = gradient_histogram - grad_hist_right
-             hess_hist_left = hessian_histogram - hess_hist_right
-
-         new_depth = depth + 1
-         left_child = self.grow_tree(grad_hist_left, hess_hist_left, left_indices, new_depth)
-         right_child = self.grow_tree(grad_hist_right, hess_hist_right, right_indices, new_depth)
-
-         return { "feature": best_feature, "bin": best_bin, "left": left_child, "right": right_child }
-
-     def grow_forest(self):
-         forest = [{} for _ in range(self.n_estimators)]
-         self.training_loss = []
-
-         for i in range(self.n_estimators):
-             self.residual = self.Y_gpu - self.gradients
-
-             self.root_gradient_histogram, self.root_hessian_histogram = \
-                 self.compute_histograms(self.bin_indices, self.residual)
-
-             tree = self.grow_tree(
-                 self.root_gradient_histogram,
-                 self.root_hessian_histogram,
-                 self.root_node_indices,
-                 depth=0
-             )
-             forest[i] = tree
-             loss = ((self.Y_gpu - self.gradients) ** 2).mean().item()
-             self.training_loss.append(loss)
-             # print(f"🌲 Tree {i+1}/{self.n_estimators} - MSE: {loss:.6f}")
-
-         print("Finished training forest.")
-         return forest
-
-     def predict(self, X_np, era_id_np=None):
-         is_integer_type = np.issubdtype(X_np.dtype, np.integer)
-         if is_integer_type:
-             max_vals = X_np.max(axis=0)
-             if np.all(max_vals < self.num_bins):
-                 bin_indices = X_np.astype(np.int8)
-                 return self.predict_data(bin_indices)
-
-         X_cpu = torch.from_numpy(X_np).type(torch.float32)  # CPU tensor
-         bin_indices_cpu = torch.empty((X_np.shape[0], X_np.shape[1]), dtype=torch.int8)
-         bin_edges_cpu = self.bin_edges.to('cpu')
-         for f in range(self.num_features):
-             bin_indices_cpu[:, f] = torch.bucketize(X_cpu[:, f], bin_edges_cpu[f], right=False).type(torch.int8)
-
-         bin_indices = bin_indices_cpu.numpy()  # Use CPU numpy array for predict_data
-         return self.predict_data(bin_indices)
-
-     @staticmethod
-     def process_node(node, data_idx, bin_indices):
-         while 'leaf_value' not in node:
-             if bin_indices[data_idx, node['feature']] <= node['bin']:
-                 node = node['left']
-             else:
-                 node = node['right']
-         return node['leaf_value']
-
-     def predict_data(self, bin_indices):
-         n = bin_indices.shape[0]
-         preds = np.zeros(n)
-         proc = self.process_node  # local var for speed
-         lr = self.learning_rate
-         base = self.base_prediction
-         forest = self.forest
-
-         for i in tqdm( range(n) ):
-             preds[i] = base + lr * np.sum([proc( tree, i, bin_indices ) for tree in forest])
-         return preds
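
The removed `process_node`/`predict_data` pair above walked the dict-based trees one row at a time in Python; the 0.1.16 code instead flattens each tree into the implicit-heap layout that both `flatten_forest` variants build. A toy sketch of that layout and the descent rule (the values here are made up for illustration):

```
import numpy as np

max_depth = 2
max_nodes = 2 ** (max_depth + 1) - 1  # 7 slots; node i has children 2i+1, 2i+2

# Sentinels from the flattened layout: -2 = no node, -1 = leaf.
features = np.full(max_nodes, -2, dtype=np.int16)
thresholds = np.full(max_nodes, -2, dtype=np.int16)
values = np.zeros(max_nodes, dtype=np.float32)

features[0], thresholds[0] = 3, 5   # root: split on feature 3 at bin 5
features[1], values[1] = -1, -0.25  # left child (slot 1) is a leaf
features[2], values[2] = -1, 0.40   # right child (slot 2) is a leaf

row_bins = np.array([1, 9, 9, 7], dtype=np.int8)  # one row's bin indices
node = 0
while features[node] >= 0:  # stop at a leaf (-1) or an empty slot (-2)
    go_left = row_bins[features[node]] <= thresholds[node]
    node = 2 * node + 1 if go_left else 2 * node + 2
print(values[node])  # 0.40: bin 7 > threshold 5 sends the row right
```

Because every tree occupies the same fixed-size arrays, the vectorized `predict` can advance whole chunks of rows one tree level at a time instead of chasing Python dict pointers per row.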