EvoScientist 0.0.1.dev2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (107)
  1. EvoScientist/EvoScientist.py +157 -0
  2. EvoScientist/__init__.py +24 -0
  3. EvoScientist/__main__.py +4 -0
  4. EvoScientist/backends.py +392 -0
  5. EvoScientist/cli.py +1553 -0
  6. EvoScientist/middleware.py +35 -0
  7. EvoScientist/prompts.py +277 -0
  8. EvoScientist/skills/accelerate/SKILL.md +332 -0
  9. EvoScientist/skills/accelerate/references/custom-plugins.md +453 -0
  10. EvoScientist/skills/accelerate/references/megatron-integration.md +489 -0
  11. EvoScientist/skills/accelerate/references/performance.md +525 -0
  12. EvoScientist/skills/bitsandbytes/SKILL.md +411 -0
  13. EvoScientist/skills/bitsandbytes/references/memory-optimization.md +521 -0
  14. EvoScientist/skills/bitsandbytes/references/qlora-training.md +521 -0
  15. EvoScientist/skills/bitsandbytes/references/quantization-formats.md +447 -0
  16. EvoScientist/skills/find-skills/SKILL.md +133 -0
  17. EvoScientist/skills/find-skills/scripts/install_skill.py +211 -0
  18. EvoScientist/skills/flash-attention/SKILL.md +367 -0
  19. EvoScientist/skills/flash-attention/references/benchmarks.md +215 -0
  20. EvoScientist/skills/flash-attention/references/transformers-integration.md +293 -0
  21. EvoScientist/skills/llama-cpp/SKILL.md +258 -0
  22. EvoScientist/skills/llama-cpp/references/optimization.md +89 -0
  23. EvoScientist/skills/llama-cpp/references/quantization.md +213 -0
  24. EvoScientist/skills/llama-cpp/references/server.md +125 -0
  25. EvoScientist/skills/lm-evaluation-harness/SKILL.md +490 -0
  26. EvoScientist/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  27. EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  28. EvoScientist/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  29. EvoScientist/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  30. EvoScientist/skills/ml-paper-writing/SKILL.md +937 -0
  31. EvoScientist/skills/ml-paper-writing/references/checklists.md +361 -0
  32. EvoScientist/skills/ml-paper-writing/references/citation-workflow.md +562 -0
  33. EvoScientist/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
  34. EvoScientist/skills/ml-paper-writing/references/sources.md +159 -0
  35. EvoScientist/skills/ml-paper-writing/references/writing-guide.md +476 -0
  36. EvoScientist/skills/ml-paper-writing/templates/README.md +251 -0
  37. EvoScientist/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
  38. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
  39. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
  40. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
  41. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
  42. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
  43. EvoScientist/skills/ml-paper-writing/templates/acl/README.md +50 -0
  44. EvoScientist/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
  45. EvoScientist/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
  46. EvoScientist/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
  47. EvoScientist/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
  48. EvoScientist/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
  49. EvoScientist/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
  50. EvoScientist/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
  51. EvoScientist/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
  52. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
  53. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
  54. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
  55. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
  56. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
  57. EvoScientist/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
  58. EvoScientist/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
  59. EvoScientist/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
  60. EvoScientist/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
  61. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
  62. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
  63. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
  64. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
  65. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
  66. EvoScientist/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
  67. EvoScientist/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
  68. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
  69. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
  70. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
  71. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
  72. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
  73. EvoScientist/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
  74. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
  75. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
  76. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
  77. EvoScientist/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
  78. EvoScientist/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
  79. EvoScientist/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
  80. EvoScientist/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
  81. EvoScientist/skills/peft/SKILL.md +431 -0
  82. EvoScientist/skills/peft/references/advanced-usage.md +514 -0
  83. EvoScientist/skills/peft/references/troubleshooting.md +480 -0
  84. EvoScientist/skills/ray-data/SKILL.md +326 -0
  85. EvoScientist/skills/ray-data/references/integration.md +82 -0
  86. EvoScientist/skills/ray-data/references/transformations.md +83 -0
  87. EvoScientist/skills/skill-creator/LICENSE.txt +202 -0
  88. EvoScientist/skills/skill-creator/SKILL.md +356 -0
  89. EvoScientist/skills/skill-creator/references/output-patterns.md +82 -0
  90. EvoScientist/skills/skill-creator/references/workflows.md +28 -0
  91. EvoScientist/skills/skill-creator/scripts/init_skill.py +303 -0
  92. EvoScientist/skills/skill-creator/scripts/package_skill.py +110 -0
  93. EvoScientist/skills/skill-creator/scripts/quick_validate.py +95 -0
  94. EvoScientist/stream/__init__.py +53 -0
  95. EvoScientist/stream/emitter.py +94 -0
  96. EvoScientist/stream/formatter.py +168 -0
  97. EvoScientist/stream/tracker.py +115 -0
  98. EvoScientist/stream/utils.py +255 -0
  99. EvoScientist/subagent.yaml +147 -0
  100. EvoScientist/tools.py +135 -0
  101. EvoScientist/utils.py +207 -0
  102. evoscientist-0.0.1.dev2.dist-info/METADATA +227 -0
  103. evoscientist-0.0.1.dev2.dist-info/RECORD +107 -0
  104. evoscientist-0.0.1.dev2.dist-info/WHEEL +5 -0
  105. evoscientist-0.0.1.dev2.dist-info/entry_points.txt +5 -0
  106. evoscientist-0.0.1.dev2.dist-info/licenses/LICENSE +21 -0
  107. evoscientist-0.0.1.dev2.dist-info/top_level.txt +1 -0
@@ -0,0 +1,525 @@
# Accelerate Performance Tuning

## Profiling

### Basic Profiling

```python
from accelerate import Accelerator
import time

accelerator = Accelerator()

# Warmup (model, optimizer, and dataloader are assumed to be prepared already)
for _ in range(10):
    batch = next(iter(dataloader))  # Re-fetches the first batch; fine for warmup
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

# Profile training loop
start = time.time()
total_batches = 100

for i, batch in enumerate(dataloader):
    if i >= total_batches:
        break

    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

accelerator.wait_for_everyone()  # Sync all processes
elapsed = time.time() - start

# Metrics (batch_size is your per-device batch size)
batches_per_sec = total_batches / elapsed
samples_per_sec = (total_batches * batch_size * accelerator.num_processes) / elapsed

print(f"Throughput: {samples_per_sec:.2f} samples/sec")
print(f"Batches/sec: {batches_per_sec:.2f}")
```

### PyTorch Profiler Integration

```python
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for i, batch in enumerate(dataloader):
        if i >= 10:  # Profile first 10 batches
            break

        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

# Print profiling results
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=20
))

# Export to Chrome tracing
prof.export_chrome_trace("trace.json")
# View at chrome://tracing
```

## Memory Optimization

### 1. Gradient Accumulation

**Problem**: Large batch sizes cause OOM.

**Solution**: Accumulate gradients across micro-batches.

```python
accelerator = Accelerator(gradient_accumulation_steps=8)

# Effective batch = batch_size × accumulation_steps × num_gpus
# Example: 4 × 8 × 8 = 256

for batch in dataloader:
    with accelerator.accumulate(model):  # Handles accumulation logic
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```

**Memory savings**: activation memory scales with the micro-batch, so 8 accumulation steps need roughly 8× less activation memory than running the same effective batch in a single step.

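The effective-batch formula above can be inverted to size the accumulation. A minimal sketch (mine, not an Accelerate API; `accumulation_steps` is a hypothetical helper):

```python
def accumulation_steps(target_effective_batch: int,
                       micro_batch: int,
                       world_size: int) -> int:
    """Derive gradient_accumulation_steps from
    effective_batch = micro_batch * steps * world_size."""
    per_step = micro_batch * world_size
    if target_effective_batch % per_step != 0:
        raise ValueError("target batch not divisible by micro_batch * world_size")
    return target_effective_batch // per_step

# The example from the code above: 256 = 4 (micro-batch) × 8 (steps) × 8 (GPUs)
assert accumulation_steps(256, micro_batch=4, world_size=8) == 8
```
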
### 2. Gradient Checkpointing

**Enable in model**:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    use_cache=False  # The KV cache is incompatible with gradient checkpointing
)

# Enable checkpointing
model.gradient_checkpointing_enable()

# Prepare with Accelerate
model = accelerator.prepare(model)
```

**Memory savings**: 30-50%, at the cost of a 10-15% slowdown from recomputing activations in the backward pass

### 3. Mixed Precision

**BF16 (A100/H100)**:
```python
accelerator = Accelerator(mixed_precision='bf16')

# Automatic mixed precision
for batch in dataloader:
    outputs = model(**batch)  # Forward (and backward) run under BF16 autocast
    loss = outputs.loss
    accelerator.backward(loss)  # Master weights and optimizer states stay FP32
    optimizer.step()
```

**FP16 (V100, older GPUs)**:
```python
from accelerate.utils import GradScalerKwargs

scaler_kwargs = GradScalerKwargs(
    init_scale=2.**16,
    growth_interval=2000
)

accelerator = Accelerator(
    mixed_precision='fp16',
    kwargs_handlers=[scaler_kwargs]
)
```

**Memory savings**: activations take roughly half the memory of FP32; total savings depend on how much of the footprint is weights and optimizer states, which remain FP32

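One practical refinement, not from the original text: pick the precision at runtime so the same script runs on both Ampere-class and older GPUs. `torch.cuda.is_bf16_supported()` is a real PyTorch call; falling back to FP16 is my assumption:

```python
import torch
from accelerate import Accelerator

# Prefer BF16 (no loss scaling needed) when the hardware supports it,
# otherwise fall back to FP16 with the default GradScaler.
precision = "bf16" if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else "fp16"
accelerator = Accelerator(mixed_precision=precision)
print(f"Using mixed precision: {precision}")
```
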
### 4. CPU Offloading (DeepSpeed)

```python
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=3,
    offload_optimizer_device="cpu",  # Offload optimizer to CPU
    offload_param_device="cpu",      # Offload parameters to CPU
)

accelerator = Accelerator(
    deepspeed_plugin=ds_plugin,
    mixed_precision='bf16'
)
```

**Memory savings**: 10-20× for optimizer state, 5-10× for parameters

**Trade-off**: 20-30% slower due to CPU-GPU transfers

### 5. Flash Attention

```python
# Install flash-attn
# pip install flash-attn

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    torch_dtype=torch.bfloat16,              # Flash Attention requires FP16/BF16 weights
    attn_implementation="flash_attention_2"  # Enable Flash Attention 2
)

model = accelerator.prepare(model)
```

**Memory savings**: attention memory drops from quadratic to linear in sequence length (the full attention matrix is never materialized), with roughly 2× faster attention

**Requirements**: Ampere or newer GPU (e.g., A100/H100), model loaded in FP16/BF16

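The doc doesn't include it, but a small capability check before opting in avoids hard failures on older GPUs; compute capability 8.x is Ampere, and `"sdpa"` is a standard transformers fallback (the helper itself is mine):

```python
import torch

def flash_attention_ok() -> bool:
    """Heuristic check: Flash Attention 2 needs an Ampere-or-newer GPU."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8

attn_impl = "flash_attention_2" if flash_attention_ok() else "sdpa"
print(f"Selected attention implementation: {attn_impl}")
```
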
## Communication Optimization

### 1. Gradient Bucketing (DDP)

```python
from accelerate.utils import DistributedDataParallelKwargs

ddp_kwargs = DistributedDataParallelKwargs(
    bucket_cap_mb=25,              # Bucket size for gradient reduction
    gradient_as_bucket_view=True,  # Reduce memory copies
    static_graph=False             # Set True if the graph topology never changes
)

accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
```

**Recommended bucket sizes** (a sizing sketch follows the list):
- Small models (<1B): 25 MB
- Medium models (1-10B): 50-100 MB
- Large models (>10B): 100-200 MB

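A minimal sketch of the table above as code; `bucket_cap_for` is an illustrative helper of mine, not an Accelerate API, and `num_params` is whatever `sum(p.numel() for p in model.parameters())` gives you:

```python
from accelerate.utils import DistributedDataParallelKwargs

def bucket_cap_for(num_params: int) -> int:
    """Map model size to a DDP bucket size (MB), per the table above."""
    if num_params < 1e9:
        return 25
    if num_params < 10e9:
        return 100
    return 200

ddp_kwargs = DistributedDataParallelKwargs(
    bucket_cap_mb=bucket_cap_for(7_000_000_000)  # e.g., a 7B model -> 100 MB
)
```
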
### 2. Find Unused Parameters

```python
# Only enable if the model has unused parameters (slower!)
ddp_kwargs = DistributedDataParallelKwargs(
    find_unused_parameters=True
)
```

**Use case**: models with conditional branches (e.g., mixture of experts)

**Cost**: 10-20% slower

### 3. NCCL Tuning

```bash
# Set environment variables before launch
export NCCL_DEBUG=INFO          # Debug info
export NCCL_IB_DISABLE=0        # Enable InfiniBand
export NCCL_SOCKET_IFNAME=eth0  # Network interface
export NCCL_P2P_LEVEL=NVL       # Allow P2P over NVLink

accelerate launch train.py
```

**NCCL_P2P_LEVEL options**:
- `NVL`: P2P only over NVLink (fastest path)
- `PIX`: P2P across a single PCIe bridge (fast)
- `PHB`: P2P through the CPU's PCIe host bridge (slowest P2P path)

## Data Loading Optimization

### 1. DataLoader Workers

```python
from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,            # Parallel data loading
    pin_memory=True,          # Pin memory for faster GPU transfer
    prefetch_factor=2,        # Batches prefetched per worker
    persistent_workers=True   # Keep workers alive between epochs
)

train_loader = accelerator.prepare(train_loader)
```

**Recommendations** (a heuristic sketch follows the list):
- `num_workers`: 2-4 per GPU process (8 GPUs → 16-32 workers in total)
- `pin_memory`: always True for GPU training
- `prefetch_factor`: 2-4 (higher for slow data loading)

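One way to apply the 2-4-workers-per-GPU rule without oversubscribing the CPU; this heuristic is mine, not an Accelerate recommendation:

```python
import os
import torch

def workers_per_process(per_gpu: int = 4) -> int:
    """Cap DataLoader workers at the CPU budget divided across GPU processes."""
    cpus = os.cpu_count() or 1
    gpus = max(torch.cuda.device_count(), 1)
    return max(1, min(per_gpu, cpus // gpus))

print(f"num_workers per process: {workers_per_process()}")
```
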
### 2. Data Preprocessing

```python
from datasets import load_dataset, load_from_disk

# Bad: preprocess during training (slow)
dataset = load_dataset("openwebtext")

for batch in dataset:
    tokens = tokenizer(batch['text'])  # Slow!
    ...

# Good: preprocess once, save
dataset = load_dataset("openwebtext")
tokenized = dataset.map(
    lambda x: tokenizer(x['text']),
    batched=True,
    num_proc=8,  # Parallel preprocessing
    remove_columns=['text']
)
tokenized.save_to_disk("preprocessed_data")

# Load the preprocessed copy in the training script
dataset = load_from_disk("preprocessed_data")
```

### 3. Faster Tokenization

```python
import os

# Let the fast tokenizers use multiple threads
os.environ["TOKENIZERS_PARALLELISM"] = "true"

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "gpt2",
    use_fast=True  # Rust-based fast tokenizer (roughly 10× faster)
)
```

## Compilation (PyTorch 2.0+)

### Compile Model

```python
import torch

# Compile model for faster execution
model = torch.compile(
    model,
    mode="reduce-overhead",  # Options: default, reduce-overhead, max-autotune
    fullgraph=False,         # True forces a single graph with no breaks (stricter)
    dynamic=True             # Support dynamic shapes
)

model = accelerator.prepare(model)
```

**Speedup**: 10-50% depending on the model

**Compilation modes**:
- `default`: balanced (best for most cases)
- `reduce-overhead`: minimal overhead (best for small batches)
- `max-autotune`: maximum performance (slow to compile; best for production)

### Compilation Best Practices

```python
# Bad: compiling after prepare can break or lose optimizations
model = accelerator.prepare(model)
model = torch.compile(model)  # Compiles the DDP/FSDP wrapper, not the bare model

# Good: compile before prepare
model = torch.compile(model)
model = accelerator.prepare(model)

# Training loop
for batch in dataloader:
    # First iteration: slow (compilation)
    # Subsequent iterations: fast (compiled)
    outputs = model(**batch)
    ...
```

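To see the compile cost the comments mention, here is a small timing sketch of mine (it assumes `dataloader`, `model`, `optimizer`, and `accelerator` from the earlier examples, and at least six batches):

```python
import time
import torch

step_times = []
for i, batch in enumerate(dataloader):
    torch.cuda.synchronize()
    t0 = time.time()
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.synchronize()  # Make GPU time observable before reading the clock
    step_times.append(time.time() - t0)
    if i >= 5:
        break

# Step 0 should dominate: it includes graph capture and code generation.
print(f"first step: {step_times[0]:.2f}s, fastest later step: {min(step_times[1:]):.2f}s")
```
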
## Benchmarking Different Strategies

### Script Template

```python
import time
import torch
from accelerate import Accelerator

def benchmark_strategy(strategy_name, accelerator_kwargs):
    """Benchmark a specific training strategy."""
    accelerator = Accelerator(**accelerator_kwargs)

    # Setup (create_model/create_dataloader are your own factory functions)
    model = create_model()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    dataloader = create_dataloader()

    model, optimizer, dataloader = accelerator.prepare(
        model, optimizer, dataloader
    )

    # Warmup
    for i, batch in enumerate(dataloader):
        if i >= 10:
            break
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

    # Benchmark
    accelerator.wait_for_everyone()
    torch.cuda.synchronize()
    start = time.time()

    num_batches = 100
    for i, batch in enumerate(dataloader):
        if i >= num_batches:
            break

        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

    accelerator.wait_for_everyone()
    torch.cuda.synchronize()
    elapsed = time.time() - start

    # Metrics (batch_size is the per-device batch size used above)
    throughput = (num_batches * batch_size * accelerator.num_processes) / elapsed
    memory_used = torch.cuda.max_memory_allocated() / 1e9  # GB

    if accelerator.is_main_process:
        print(f"\n{strategy_name}:")
        print(f"  Throughput: {throughput:.2f} samples/sec")
        print(f"  Memory: {memory_used:.2f} GB")
        print(f"  Time: {elapsed:.2f} sec")

    torch.cuda.reset_peak_memory_stats()

# Benchmark different strategies
# (fsdp_plugin and the DeepSpeed plugins must be constructed beforehand)
strategies = [
    ("DDP + FP32", {}),
    ("DDP + BF16", {"mixed_precision": "bf16"}),
    ("DDP + BF16 + GradAccum", {"mixed_precision": "bf16", "gradient_accumulation_steps": 4}),
    ("FSDP", {"fsdp_plugin": fsdp_plugin}),
    ("DeepSpeed ZeRO-2", {"deepspeed_plugin": ds_plugin_stage2}),
    ("DeepSpeed ZeRO-3", {"deepspeed_plugin": ds_plugin_stage3}),
]

for name, kwargs in strategies:
    benchmark_strategy(name, kwargs)
```

## Performance Checklist

**Before training**:
- [ ] Use BF16/FP16 mixed precision
- [ ] Enable gradient checkpointing (if OOM)
- [ ] Set appropriate `num_workers` (2-4 per GPU)
- [ ] Enable `pin_memory=True`
- [ ] Preprocess data once, not during training
- [ ] Compile model with `torch.compile` (PyTorch 2.0+)

**For large models**:
- [ ] Use FSDP or DeepSpeed ZeRO-3
- [ ] Enable CPU offloading (if still OOM)
- [ ] Use Flash Attention
- [ ] Increase gradient accumulation

**For multi-node**:
- [ ] Check network topology (InfiniBand > Ethernet)
- [ ] Tune NCCL settings
- [ ] Use larger bucket sizes for DDP
- [ ] Verify NVLink for tensor parallelism

**Profiling**:
- [ ] Profile the first 10-100 batches
- [ ] Check GPU utilization (`nvidia-smi dmon`)
- [ ] Check data loading time (should be <5% of each iteration; see the sketch after this list)
- [ ] Identify communication bottlenecks

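A minimal way to check the data-loading budget from the checklist: time the fetch separately from the compute step. This sketch is mine, assuming `dataloader`, `model`, `optimizer`, and `accelerator` from the earlier examples and at least 50 batches:

```python
import time
import torch

fetch_time, step_time = 0.0, 0.0
it = iter(dataloader)
for _ in range(50):
    t0 = time.time()
    batch = next(it)            # Data loading only
    fetch_time += time.time() - t0

    t0 = time.time()
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.synchronize()    # Make GPU time observable
    step_time += time.time() - t0

frac = fetch_time / (fetch_time + step_time)
print(f"data loading is {frac:.1%} of each iteration (target: <5%)")
```
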
## Common Performance Issues

### Issue: Low GPU Utilization (<80%)

**Cause 1**: Data loading bottleneck
```python
# Solution: increase workers and prefetch
num_workers=8
prefetch_factor=4
```

**Cause 2**: Small batch size
```python
# Solution: increase the batch size, or use gradient accumulation
batch_size=32                   # Increase
gradient_accumulation_steps=4   # Or accumulate
```

### Issue: High Memory Usage

**Solution 1**: Gradient checkpointing
```python
model.gradient_checkpointing_enable()
```

**Solution 2**: Reduce batch size, increase accumulation
```python
batch_size=8                     # Reduce from 32
gradient_accumulation_steps=16   # Maintain the effective batch
```

**Solution 3**: Use FSDP or DeepSpeed ZeRO-3
```python
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```

### Issue: Slow Multi-GPU Training

**Cause**: Communication bottleneck

**Check 1**: Gradient bucket size
```python
ddp_kwargs = DistributedDataParallelKwargs(bucket_cap_mb=100)
```

**Check 2**: NCCL settings
```bash
export NCCL_DEBUG=INFO
# Inspect the log for the transport NCCL selects: NVLink P2P is fast,
# paths staged through the PCIe host bridge are slow
```

**Check 3**: Network bandwidth
```bash
# Show NVLink link status and per-link speeds
nvidia-smi nvlink -s
```

## Resources

- Accelerate Performance: https://huggingface.co/docs/accelerate/usage_guides/performance
- PyTorch Profiler: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
- NCCL Tuning: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
- Flash Attention: https://github.com/Dao-AILab/flash-attention