torch-ddp 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 01d7ef15d46b7d5ce121a60f0293d96300674a08ad1ae955f9732bd3d60ecdbd
+   data.tar.gz: f1a67feaa2252260a2b549c4022f0db06cdc816c520ec414a0a1a68b3a26b1e9
+ SHA512:
+   metadata.gz: d166257ca64d58ad7783745fdcf8f0606f18bceb8ca07fa01fff12d4acf10976564a6d8764098afffecc0befebc8f38dd88bbaef61984af6bffb7d13493ce056
+   data.tar.gz: 43d0f74ef949c8c4ac9c37c28ae03df2d2cbba9a4a10442a198a11a520018cdd4bd69992a791e1e9419715816d6b97a1dd07ada5e1d5ed04b6f83dc2227fa4c8
data/LICENSE.txt ADDED
@@ -0,0 +1,46 @@
+ BSD 3-Clause License
+
+ From Torch-rb:
+
+ Copyright (c) 2019- Andrew Kane
+
+ From PyTorch (for ported code):
+
+ Copyright (c) 2016- Facebook, Inc (Adam Paszke)
+ Copyright (c) 2014- Facebook, Inc (Soumith Chintala)
+ Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
+ Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
+ Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
+ Copyright (c) 2011-2013 NYU (Clement Farabet)
+ Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
+ Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
+ Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
+
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions are met:
+
+ 1. Redistributions of source code must retain the above copyright
+    notice, this list of conditions and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright
+    notice, this list of conditions and the following disclaimer in the
+    documentation and/or other materials provided with the distribution.
+
+ 3. Neither the names of Facebook, Deepmind Technologies, NYU, NEC Laboratories America
+    and IDIAP Research Institute nor the names of its contributors may be
+    used to endorse or promote products derived from this software without
+    specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ POSSIBILITY OF SUCH DAMAGE.
data/README.md ADDED
@@ -0,0 +1,114 @@
+ # Torch DDP
+
+ Optional distributed data parallel support for [`torch-rb`](https://github.com/ankane/torch.rb). It adds the `Torch::Distributed` API, a `DistributedDataParallel` wrapper, and a `torchrun` launcher that mirrors the PyTorch CLI.
+
+ Note: This gem has only seen testing across a narrow set of multi-GPU setups (limited Linux versions, drivers, and interconnects), so expect rough edges and please report issues you find.
+
+ ## Installation
+
+ Build LibTorch with distributed backends (Gloo for CPU, NCCL for CUDA). Point the extension at your LibTorch, CUDA, and optional Gloo includes:
+
+ ```sh
+ bundle config build.torch-ddp --with-torch-dir=/path/to/libtorch --with-gloo-include=/path/to/gloo
+ ```
+
+ If your CUDA or Gloo headers aren't in standard locations, extend the build config:
+
+ ```sh
+ bundle config build.torch-ddp --with-torch-dir=/path/to/libtorch --with-cuda-include=/path/to/cuda/include --with-gloo-include=/path/to/gloo/repo
+ ```
+
+ Add the gem next to `torch-rb`:
+
+ ```ruby
+ gem "torch-rb"
+ gem "torch-ddp"
+ ```
+
+ ## Usage
+
+ Initialize a process group and wrap your module:
+
+ ```ruby
+ require "torch/distributed"
+
+ Torch::Distributed.init_process_group
+ ddp = Torch::NN::Parallel::DistributedDataParallel.new(model)
+ ```
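+
+ A fuller end-to-end sketch (illustrative only; it reuses the training calls from the bundled benchmark example, and `MyModel` and `loader` are placeholders for your own module and data loader) for a worker launched by `torchrun`:
+
+ ```ruby
+ require "torch"
+ require "torch/distributed"
+
+ Torch::Distributed.init_process_group
+
+ # torchrun exports LOCAL_RANK; use it to pin each worker to one GPU.
+ local_rank = Integer(ENV.fetch("LOCAL_RANK", "0"))
+ device = Torch::CUDA.available? ? Torch.device("cuda:#{local_rank}") : Torch.device("cpu")
+
+ model = MyModel.new.to(device)
+ ddp_devices = device.type == "cuda" ? [local_rank] : nil
+ ddp = Torch::NN::Parallel::DistributedDataParallel.new(model, device_ids: ddp_devices)
+ optimizer = Torch::Optim::SGD.new(ddp.parameters, lr: 0.01)
+
+ loader.each do |data, target|
+   data = data.to(device)
+   target = target.to(device)
+
+   optimizer.zero_grad
+   loss = Torch::NN::F.nll_loss(ddp.call(data), target)
+   loss.backward   # the DDP wrapper synchronizes gradients across ranks during backward
+   optimizer.step
+ end
+
+ Torch::Distributed.destroy_process_group
+ ```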
+
+ For single-node launches, `torchrun` will set ranks and world size for you:
+
+ ```sh
+ bundle exec torchrun --standalone --nproc-per-node=gpu path/to/training_script.rb
+ ```
+
+ This gem ships with a `torchrun` launcher that handles process orchestration and sets the `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` environment variables expected by `Torch::Distributed.init_process_group`.
+
+ For multi-node runs, launch the same command on every node with matching rendezvous settings:
+
+ ```sh
+ bundle exec torchrun \
+   --nnodes=2 \
+   --node-rank=0 \
+   --rdzv-backend=c10d \
+   --rdzv-endpoint=host0.example.com:29503 \
+   --rdzv-id=my-job \
+   --nproc-per-node=4 \
+   path/to/training_script.rb
+ ```
+
+ On node 1, change `--node-rank=1`. The launcher restarts workers up to `--max-restarts` times and can be combined with tools like `bundle exec` or custom scripts via `--no-ruby`.
+
+ Use `Torch::Distributed.fork_world` for test helpers and small experiments without a launcher. Set `start_method: :spawn` to launch fresh worker processes instead of forking (avoids CUDA fork issues).
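+
+ For example, a rough smoke test (not an official sample; it follows the pattern in `examples/benchmark/training.rb`, where `fork_world` yields each worker's `rank` plus a port for the rendezvous store and collects the block's return values per rank):
+
+ ```ruby
+ require "torch"
+ require "torch/distributed"
+
+ world_size = 2
+
+ results = Torch::Distributed.fork_world(world_size, start_method: :spawn) do |rank, port|
+   store = Torch::Distributed::TCPStore.new("127.0.0.1", port, world_size, rank.zero?)
+   Torch::Distributed.init_process_group("gloo", store: store, rank: rank, world_size: world_size)
+
+   # Each rank contributes rank + 1; all_reduce sums the values in place.
+   tensor = Torch.tensor([rank + 1.0])
+   Torch::Distributed.all_reduce(tensor)
+
+   Torch::Distributed.destroy_process_group
+   tensor.item
+ end
+
+ p results # expected: [3.0, 3.0] with two ranks (1.0 + 2.0 on every rank)
+ ```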
+
+ ## Examples
+
+ Run the distributed MNIST sample (spawns one process per GPU):
+
+ ```sh
+ bundle exec ruby examples/mnist/distributed.rb --gpus 2
+ ```
+
+ or
+
+ ```sh
+ bundle exec torchrun --standalone --nproc-per-node=gpu examples/mnist/distributed.rb
+ ```
+
+ Run the training benchmark (variable batch size / GPU count):
+
+ ```sh
+ bundle exec ruby examples/benchmark/training.rb --arch mnist_cnn --batch-size 256 --gpus 1 --steps 50
+ ```
+
+ Set `--gpus` to 2 or more to enable distributed training; `--steps` sets the number of timed steps and `--warmup` the number of untimed warmup iterations.
+
+ Generate a comparison table across backends, group sizes, and batch sizes:
+
+ ```sh
+ bundle exec ruby examples/benchmark/training.rb --backends gloo,nccl --batch-sizes 32,64,128,256 --gpus 2 --steps 50
+ ```
+
+ Example results on dual RTX 3090s (processing speed in images per second; the script also reports convergence speed as average loss reduction per step and per second):
+
+ ```text
+ Backend | Proc Group | Batch | Images/s |
+ --------+------------+-------+----------|
+ gloo    | 1          | 32    | 1724.4   |
+ gloo    | 1          | 64    | 1941.8   |
+ gloo    | 1          | 128   | 2038.7   |
+ gloo    | 1          | 256   | 2171.8   |
+ gloo    | 2          | 32    | 2261.0   |
+ gloo    | 2          | 64    | 2870.6   |
+ gloo    | 2          | 128   | 3398.4   |
+ gloo    | 2          | 256   | 3743.1   |
+ nccl    | 1          | 32    | 1804.8   |
+ nccl    | 1          | 64    | 1963.0   |
+ nccl    | 1          | 128   | 2051.5   |
+ nccl    | 1          | 256   | 2143.3   |
+ nccl    | 2          | 32    | 3046.1   |
+ nccl    | 2          | 64    | 3513.6   |
+ nccl    | 2          | 128   | 3892.1   |
+ nccl    | 2          | 256   | 4024.5   |
+ --------+------------+-------+----------|
+ ```
data/bin/torchrun ADDED
@@ -0,0 +1,6 @@
+ #!/usr/bin/env ruby
+ # frozen_string_literal: true
+
+ require "torch/torchrun"
+
+ Torch::TorchRun.start(ARGV)
data/examples/benchmark/training.rb ADDED
@@ -0,0 +1,374 @@
+ # Benchmark training throughput for common architectures/datasets.
+ # Usage examples:
+ #   ruby examples/benchmark/training.rb --arch mnist_cnn --batch-size 128 --gpus 1
+ #   ruby examples/benchmark/training.rb --arch mnist_cnn --batch-size 128 --gpus 2 --steps 50
+
+ require "bundler/setup"
+ require "optparse"
+ require "torch"
+ require "torch/distributed"
+ require "torch/nn/parallel/distributed_data_parallel"
+ require "torchvision"
+
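+ # Prefer NCCL when CUDA is available; otherwise use the accelerator's default backend or fall back to Gloo.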
+ DEFAULT_BACKEND = if Torch.const_defined?(:CUDA) && Torch::CUDA.respond_to?(:available?) && Torch::CUDA.available?
+   "nccl"
+ else
+   Torch::Distributed.get_default_backend_for_device(Torch::Accelerator.current_accelerator) || "gloo"
+ end
+ SPAWN_BACKEND_ENV = "TORCH_RB_BENCH_BACKEND".freeze
+ SPAWN_GROUP_ENV = "TORCH_RB_BENCH_GROUP_SIZE".freeze
+ SPAWN_BATCH_ENV = "TORCH_RB_BENCH_BATCH_SIZE".freeze
+
+ def parse_list(value)
+   value.split(",").map(&:strip).reject(&:empty?)
+ end
+
+ def backend_supported?(backend)
+   return true unless backend == "nccl"
+
+   Torch.const_defined?(:CUDA) && Torch::CUDA.respond_to?(:available?) && Torch::CUDA.available?
+ end
+
+ def usable_cuda_device_count
+   return 0 unless Torch.const_defined?(:CUDA) && Torch::CUDA.respond_to?(:available?) && Torch::CUDA.available?
+
+   Torch::CUDA.respond_to?(:device_count) ? Torch::CUDA.device_count : 0
+ rescue
+   0
+ end
+
+ def spawn_worker_process?
+   ENV[Torch::Distributed::SPAWN_ENV_KEY] == "1"
+ end
+
+ def apply_spawn_overrides!(options)
+   return unless ENV[Torch::Distributed::SPAWN_ENV_KEY] == "1"
+
+   if ENV[SPAWN_BACKEND_ENV]
+     options[:backends] = [ENV[SPAWN_BACKEND_ENV]]
+   end
+
+   if ENV[SPAWN_GROUP_ENV]
+     group_size = ENV[SPAWN_GROUP_ENV].to_i
+     if group_size.positive?
+       options[:group_sizes] = [group_size]
+       options[:gpus] = group_size
+     end
+   end
+
+   if ENV[SPAWN_BATCH_ENV]
+     batch_size = ENV[SPAWN_BATCH_ENV].to_i
+     options[:batch_sizes] = [batch_size] if batch_size.positive?
+   end
+ end
+
+ def with_spawn_env(backend:, group_size:, batch_size:)
+   previous = {
+     SPAWN_BACKEND_ENV => ENV[SPAWN_BACKEND_ENV],
+     SPAWN_GROUP_ENV => ENV[SPAWN_GROUP_ENV],
+     SPAWN_BATCH_ENV => ENV[SPAWN_BATCH_ENV]
+   }
+
+   ENV[SPAWN_BACKEND_ENV] = backend
+   ENV[SPAWN_GROUP_ENV] = group_size.to_s
+   ENV[SPAWN_BATCH_ENV] = batch_size.to_s
+
+   yield
+ ensure
+   ENV[SPAWN_BACKEND_ENV] = previous[SPAWN_BACKEND_ENV]
+   ENV[SPAWN_GROUP_ENV] = previous[SPAWN_GROUP_ENV]
+   ENV[SPAWN_BATCH_ENV] = previous[SPAWN_BATCH_ENV]
+ end
+
+ class MnistCnn < Torch::NN::Module
+   def initialize
+     super()
+     @conv1 = Torch::NN::Conv2d.new(1, 32, 3, stride: 1)
+     @conv2 = Torch::NN::Conv2d.new(32, 64, 3, stride: 1)
+     @dropout1 = Torch::NN::Dropout2d.new(p: 0.25)
+     @dropout2 = Torch::NN::Dropout2d.new(p: 0.5)
+     @fc1 = Torch::NN::Linear.new(9216, 128)
+     @fc2 = Torch::NN::Linear.new(128, 10)
+   end
+
+   def forward(x)
+     x = Torch::NN::F.relu(@conv1.call(x))
+     x = Torch::NN::F.relu(@conv2.call(x))
+     x = Torch::NN::F.max_pool2d(x, 2)
+     x = @dropout1.call(x)
+     x = Torch.flatten(x, start_dim: 1)
+     x = Torch::NN::F.relu(@fc1.call(x))
+     x = @dropout2.call(x)
+     Torch::NN::F.log_softmax(@fc2.call(x), 1)
+   end
+ end
+
+ ARCH_CONFIGS = {
+   "mnist_cnn" => {
+     model: -> { MnistCnn.new },
+     dataset: :mnist
+   }
+ }.freeze
+
+ def parse_options
+   defaults = {
+     arch: "mnist_cnn",
+     batch_sizes: [128],
+     steps: 100,
+     warmup: 10,
+     backends: [DEFAULT_BACKEND],
+     gpus: Torch::CUDA.available? ? [Torch::CUDA.device_count, 1].max : 1,
+     group_sizes: nil,
+     data_dir: File.join(__dir__, "data"),
+     lr: 0.01
+   }
+
+   OptionParser.new do |opts|
+     opts.banner = "Usage: ruby examples/benchmark/training.rb [options]"
+     opts.on("--arch NAME", "Architecture to benchmark (#{ARCH_CONFIGS.keys.join(', ')}, default: #{defaults[:arch]})") { |v| defaults[:arch] = v }
+     opts.on("--batch-size N", Integer, "Batch size per process (default: #{defaults[:batch_sizes].first})") { |v| defaults[:batch_sizes] = [v] }
+     opts.on("--batch-sizes LIST", String, "Comma-separated batch sizes per process") { |v| defaults[:batch_sizes] = parse_list(v).map(&:to_i) }
+     opts.on("--steps N", Integer, "Number of timed training steps (default: #{defaults[:steps]})") { |v| defaults[:steps] = v }
+     opts.on("--warmup N", Integer, "Number of warmup steps not included in timing (default: #{defaults[:warmup]})") { |v| defaults[:warmup] = v }
+     opts.on("--backend NAME", String, "Process group backend (default: #{defaults[:backends].first})") { |v| defaults[:backends] = [v] }
+     opts.on("--backends LIST", String, "Comma-separated list of backends to benchmark (gloo,nccl)") { |v| defaults[:backends] = parse_list(v) }
+     opts.on("--gpus N", Integer, "Number of GPUs/processes to use (1 for non-distributed)") { |v| defaults[:gpus] = v }
+     opts.on("--group-sizes LIST", String, "Process group sizes to benchmark (default: 1..gpus)") { |v| defaults[:group_sizes] = parse_list(v).map(&:to_i) }
+     opts.on("--data-dir PATH", String, "Directory for cached datasets (default: #{defaults[:data_dir]})") { |v| defaults[:data_dir] = v }
+     opts.on("--lr FLOAT", Float, "Learning rate (default: #{defaults[:lr]})") { |v| defaults[:lr] = v }
+   end.parse!(ARGV)
+
+   defaults[:group_sizes] ||= (1..defaults[:gpus]).to_a
+   defaults
+ end
+
+ def dataset_for(name, data_dir, distributed:, rank:, world_size:)
+   case name
+   when :mnist
+     transforms = TorchVision::Transforms::Compose.new([
+       TorchVision::Transforms::ToTensor.new,
+       TorchVision::Transforms::Normalize.new([0.1307], [0.3081])
+     ])
+
+     if distributed
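+       # Rank 0 downloads the dataset while the other ranks wait at the barrier, then load the cached copy.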
+       if rank.zero?
+         train = TorchVision::Datasets::MNIST.new(data_dir, train: true, download: true, transform: transforms)
+         Torch::Distributed.barrier
+       else
+         Torch::Distributed.barrier
+         train = TorchVision::Datasets::MNIST.new(data_dir, train: true, download: false, transform: transforms)
+       end
+       indices = rank.step(train.size - 1, world_size).to_a
+       Torch::Utils::Data::Subset.new(train, indices)
+     else
+       TorchVision::Datasets::MNIST.new(data_dir, train: true, download: true, transform: transforms)
+     end
+   else
+     raise ArgumentError, "Unknown dataset: #{name}"
+   end
+ end
+
+ def sync_cuda_if_needed(device)
+   return unless device && device.type == "cuda"
+   return unless Torch.const_defined?(:CUDA) && Torch::CUDA.respond_to?(:synchronize)
+
+   Torch::CUDA.synchronize
+ end
+
+ def benchmark_worker(rank, world_size, port, options)
+   arch = options.fetch(:arch)
+   config = ARCH_CONFIGS[arch]
+   raise ArgumentError, "Unsupported architecture #{arch.inspect}" unless config
+
+   distributed = world_size > 1
+   accelerator = Torch::Accelerator.current_accelerator
+   selected_backend = options[:backend] || Torch::Distributed.get_default_backend_for_device(accelerator) || DEFAULT_BACKEND
+   if distributed
+     store = Torch::Distributed::TCPStore.new("127.0.0.1", port, world_size, rank.zero?)
+     Torch::Distributed.init_process_group(selected_backend, store: store, rank: rank, world_size: world_size)
+   end
+
+   cuda_devices = usable_cuda_device_count
+   device = if cuda_devices.positive? && options[:gpus] > 0
+     Torch.device("cuda:#{rank % cuda_devices}")
+   else
+     Torch.device("cpu")
+   end
+
+   model = config[:model].call.to(device)
+   if distributed
+     ddp_devices = device.type == "cuda" ? [device.index] : nil
+     model = Torch::NN::Parallel::DistributedDataParallel.new(model, device_ids: ddp_devices)
+   end
+   optimizer = Torch::Optim::SGD.new(model.parameters, lr: options[:lr])
+
+   loader = Torch::Utils::Data::DataLoader.new(
+     dataset_for(config[:dataset], options[:data_dir], distributed: distributed, rank: rank, world_size: world_size),
+     batch_size: options[:batch_size],
+     shuffle: true
+   )
+
+   warmup_steps = options[:warmup]
+   timed_steps = options[:steps]
+   total_steps = warmup_steps + timed_steps
+   losses = []
+
+   # Warm up the model (including one full timed-length pass) to avoid init overhead in measurements.
+   step_idx = 0
+   loader.each do |data, target|
+     data = data.to(device)
+     target = target.to(device)
+
+     optimizer.zero_grad
+     loss = Torch::NN::F.nll_loss(model.call(data), target)
+     loss.backward
+     optimizer.step
+
+     step_idx += 1
+     break if step_idx >= total_steps
+   end
+
+   sync_cuda_if_needed(device)
+   Torch::Distributed.barrier if distributed
+
+   timed = 0
+   step_idx = 0
+   start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+   loader.each do |data, target|
+     data = data.to(device)
+     target = target.to(device)
+
+     optimizer.zero_grad
+     loss = Torch::NN::F.nll_loss(model.call(data), target)
+     loss.backward
+     optimizer.step
+
+     loss_value = loss.item
+     if distributed
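+       # Sum the per-rank loss with all_reduce and divide by world_size so rank 0 records the group mean.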
+       loss_tensor = Torch.tensor([loss_value], device: device)
+       Torch::Distributed.all_reduce(loss_tensor)
+       loss_value = loss_tensor.item / world_size.to_f
+     end
+     losses << loss_value if !distributed || rank.zero?
+
+     step_idx += 1
+     break if step_idx >= timed_steps
+   end
+
+   sync_cuda_if_needed(device)
+   Torch::Distributed.barrier if distributed
+   elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
+   timed = step_idx
+
+   images = timed * options[:batch_size] * world_size
+   throughput = elapsed.positive? ? images.to_f / elapsed : 0.0
+   initial_loss = losses.first || 0.0
+   final_loss = losses.last || initial_loss
+   loss_delta = initial_loss - final_loss
+   loss_delta_per_step = timed.zero? ? 0.0 : loss_delta / timed
+   loss_delta_per_sec = elapsed.zero? ? 0.0 : loss_delta / elapsed
+
+   result = if !distributed || rank.zero?
+     {
+       backend: selected_backend,
+       world_size: world_size,
+       batch_size: options[:batch_size],
+       arch: arch,
+       dataset: config[:dataset],
+       elapsed: elapsed,
+       timed_steps: timed,
+       images: images,
+       throughput: throughput,
+       initial_loss: initial_loss,
+       final_loss: final_loss,
+       loss_delta: loss_delta,
+       loss_delta_per_step: loss_delta_per_step,
+       loss_delta_per_sec: loss_delta_per_sec
+     }
+   end
+
+   Torch::Distributed.destroy_process_group if distributed
+   result
+ end
+
+ def run_benchmark_case(world_size, options)
+   if world_size > 1
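+     # Spawn fresh worker processes (start_method: :spawn) rather than forking, so each worker can initialize CUDA safely.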
+     outputs = Torch::Distributed.fork_world(world_size, start_method: :spawn) do |rank, port|
+       benchmark_worker(rank, world_size, port, options)
+     end
+     outputs.compact.first
+   else
+     benchmark_worker(0, 1, Torch::Distributed.free_port, options)
+   end
+ end
+
+ def print_summary_table(results)
+   puts "\nBenchmark comparison (processing vs convergence)"
+   puts "Processing speed: images per second. Convergence speed: average loss reduction per step and per second.\n"
+
+   headers = ["Backend", "Proc Group", "Batch", "Images/s", "Loss delta/step", "Loss delta/s", "Final loss"]
+   formatters = [
+     ->(r) { r[:backend] },
+     ->(r) { r[:world_size] },
+     ->(r) { r[:batch_size] },
+     ->(r) { format("%.1f", r[:throughput]) },
+     ->(r) { format("%.4f", r[:loss_delta_per_step]) },
+     ->(r) { format("%.4f", r[:loss_delta_per_sec]) },
+     ->(r) { format("%.4f", r[:final_loss]) }
+   ]
+
+   widths = headers.each_with_index.map do |header, idx|
+     [header.length, results.map { |r| formatters[idx].call(r).to_s.length }.max].compact.max
+   end
+
+   header_line = headers.each_with_index.map { |h, idx| h.ljust(widths[idx]) }.join(" | ")
+   divider = widths.map { |w| "-" * w }.join("-+-")
+   puts header_line
+   puts divider
+
+   results.sort_by { |r| [r[:backend], r[:world_size], r[:batch_size]] }.each do |result|
+     row = formatters.each_with_index.map { |formatter, idx| formatter.call(result).to_s.ljust(widths[idx]) }
+     puts row.join(" | ")
+   end
+ end
+
+ options = parse_options
+ apply_spawn_overrides!(options)
+ max_world_size = options[:gpus]
+ raise "Number of GPUs requested must be >= 1" if max_world_size < 1
+ Torch.manual_seed(1)
+
+ group_sizes = options[:group_sizes].map { |v| [v, max_world_size].min }.select { |v| v >= 1 }.uniq.sort
+ batch_sizes = options[:batch_sizes].map { |v| [v, 1].max }.uniq
+ backends = options[:backends].map(&:downcase).uniq
+
+ if group_sizes.any? { |size| size > 1 }
+   raise "torch.distributed is not available" unless Torch::Distributed.available?
+ end
+
+ results = []
+
+ backends.each do |backend|
+   unless backend_supported?(backend)
+     warn "Skipping backend=#{backend} because required accelerator support is unavailable."
+     next
+   end
+
+   group_sizes.each do |world_size|
+     batch_sizes.each do |batch_size|
+       run_options = options.merge(batch_size: batch_size, backend: backend, gpus: world_size)
+       puts "Running backend=#{backend}, group_size=#{world_size}, batch_size=#{batch_size}..." unless spawn_worker_process?
+       with_spawn_env(backend: backend, group_size: world_size, batch_size: batch_size) do
+         results << run_benchmark_case(world_size, run_options)
+       end
+     end
+   end
+ end
+
+ results.compact!
+
+ if results.empty?
+   puts "No benchmark results to report."
+ else
+   print_summary_table(results)
+ end