EvoScientist 0.0.1.dev3__py3-none-any.whl → 0.1.0rc1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (108)
  1. EvoScientist/EvoScientist.py +17 -49
  2. EvoScientist/backends.py +0 -26
  3. EvoScientist/cli.py +1109 -255
  4. EvoScientist/middleware.py +8 -61
  5. EvoScientist/stream/__init__.py +0 -25
  6. EvoScientist/stream/utils.py +16 -23
  7. EvoScientist/tools.py +0 -64
  8. evoscientist-0.1.0rc1.dist-info/METADATA +199 -0
  9. evoscientist-0.1.0rc1.dist-info/RECORD +21 -0
  10. evoscientist-0.1.0rc1.dist-info/entry_points.txt +2 -0
  11. EvoScientist/memory.py +0 -715
  12. EvoScientist/paths.py +0 -45
  13. EvoScientist/skills/accelerate/SKILL.md +0 -332
  14. EvoScientist/skills/accelerate/references/custom-plugins.md +0 -453
  15. EvoScientist/skills/accelerate/references/megatron-integration.md +0 -489
  16. EvoScientist/skills/accelerate/references/performance.md +0 -525
  17. EvoScientist/skills/bitsandbytes/SKILL.md +0 -411
  18. EvoScientist/skills/bitsandbytes/references/memory-optimization.md +0 -521
  19. EvoScientist/skills/bitsandbytes/references/qlora-training.md +0 -521
  20. EvoScientist/skills/bitsandbytes/references/quantization-formats.md +0 -447
  21. EvoScientist/skills/find-skills/SKILL.md +0 -133
  22. EvoScientist/skills/find-skills/scripts/install_skill.py +0 -211
  23. EvoScientist/skills/flash-attention/SKILL.md +0 -367
  24. EvoScientist/skills/flash-attention/references/benchmarks.md +0 -215
  25. EvoScientist/skills/flash-attention/references/transformers-integration.md +0 -293
  26. EvoScientist/skills/llama-cpp/SKILL.md +0 -258
  27. EvoScientist/skills/llama-cpp/references/optimization.md +0 -89
  28. EvoScientist/skills/llama-cpp/references/quantization.md +0 -213
  29. EvoScientist/skills/llama-cpp/references/server.md +0 -125
  30. EvoScientist/skills/lm-evaluation-harness/SKILL.md +0 -490
  31. EvoScientist/skills/lm-evaluation-harness/references/api-evaluation.md +0 -490
  32. EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md +0 -488
  33. EvoScientist/skills/lm-evaluation-harness/references/custom-tasks.md +0 -602
  34. EvoScientist/skills/lm-evaluation-harness/references/distributed-eval.md +0 -519
  35. EvoScientist/skills/ml-paper-writing/SKILL.md +0 -937
  36. EvoScientist/skills/ml-paper-writing/references/checklists.md +0 -361
  37. EvoScientist/skills/ml-paper-writing/references/citation-workflow.md +0 -562
  38. EvoScientist/skills/ml-paper-writing/references/reviewer-guidelines.md +0 -367
  39. EvoScientist/skills/ml-paper-writing/references/sources.md +0 -159
  40. EvoScientist/skills/ml-paper-writing/references/writing-guide.md +0 -476
  41. EvoScientist/skills/ml-paper-writing/templates/README.md +0 -251
  42. EvoScientist/skills/ml-paper-writing/templates/aaai2026/README.md +0 -534
  43. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +0 -144
  44. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +0 -952
  45. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +0 -111
  46. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +0 -1493
  47. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +0 -315
  48. EvoScientist/skills/ml-paper-writing/templates/acl/README.md +0 -50
  49. EvoScientist/skills/ml-paper-writing/templates/acl/acl.sty +0 -312
  50. EvoScientist/skills/ml-paper-writing/templates/acl/acl_latex.tex +0 -377
  51. EvoScientist/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +0 -101
  52. EvoScientist/skills/ml-paper-writing/templates/acl/acl_natbib.bst +0 -1940
  53. EvoScientist/skills/ml-paper-writing/templates/acl/anthology.bib.txt +0 -26
  54. EvoScientist/skills/ml-paper-writing/templates/acl/custom.bib +0 -70
  55. EvoScientist/skills/ml-paper-writing/templates/acl/formatting.md +0 -326
  56. EvoScientist/skills/ml-paper-writing/templates/colm2025/README.md +0 -3
  57. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +0 -11
  58. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +0 -1440
  59. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
  60. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +0 -218
  61. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +0 -305
  62. EvoScientist/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +0 -485
  63. EvoScientist/skills/ml-paper-writing/templates/colm2025/math_commands.tex +0 -508
  64. EvoScientist/skills/ml-paper-writing/templates/colm2025/natbib.sty +0 -1246
  65. EvoScientist/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +0 -485
  66. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +0 -24
  67. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +0 -1440
  68. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
  69. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +0 -246
  70. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +0 -414
  71. EvoScientist/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +0 -508
  72. EvoScientist/skills/ml-paper-writing/templates/iclr2026/natbib.sty +0 -1246
  73. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithm.sty +0 -79
  74. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +0 -201
  75. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.bib +0 -75
  76. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
  77. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.tex +0 -662
  78. EvoScientist/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +0 -864
  79. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.bst +0 -1443
  80. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.sty +0 -767
  81. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
  82. EvoScientist/skills/ml-paper-writing/templates/neurips2025/Makefile +0 -36
  83. EvoScientist/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +0 -53
  84. EvoScientist/skills/ml-paper-writing/templates/neurips2025/main.tex +0 -38
  85. EvoScientist/skills/ml-paper-writing/templates/neurips2025/neurips.sty +0 -382
  86. EvoScientist/skills/peft/SKILL.md +0 -431
  87. EvoScientist/skills/peft/references/advanced-usage.md +0 -514
  88. EvoScientist/skills/peft/references/troubleshooting.md +0 -480
  89. EvoScientist/skills/ray-data/SKILL.md +0 -326
  90. EvoScientist/skills/ray-data/references/integration.md +0 -82
  91. EvoScientist/skills/ray-data/references/transformations.md +0 -83
  92. EvoScientist/skills/skill-creator/LICENSE.txt +0 -202
  93. EvoScientist/skills/skill-creator/SKILL.md +0 -356
  94. EvoScientist/skills/skill-creator/references/output-patterns.md +0 -82
  95. EvoScientist/skills/skill-creator/references/workflows.md +0 -28
  96. EvoScientist/skills/skill-creator/scripts/init_skill.py +0 -303
  97. EvoScientist/skills/skill-creator/scripts/package_skill.py +0 -110
  98. EvoScientist/skills/skill-creator/scripts/quick_validate.py +0 -95
  99. EvoScientist/skills_manager.py +0 -392
  100. EvoScientist/stream/display.py +0 -604
  101. EvoScientist/stream/events.py +0 -415
  102. EvoScientist/stream/state.py +0 -343
  103. evoscientist-0.0.1.dev3.dist-info/METADATA +0 -321
  104. evoscientist-0.0.1.dev3.dist-info/RECORD +0 -113
  105. evoscientist-0.0.1.dev3.dist-info/entry_points.txt +0 -5
  106. {evoscientist-0.0.1.dev3.dist-info → evoscientist-0.1.0rc1.dist-info}/WHEEL +0 -0
  107. {evoscientist-0.0.1.dev3.dist-info → evoscientist-0.1.0rc1.dist-info}/licenses/LICENSE +0 -0
  108. {evoscientist-0.0.1.dev3.dist-info → evoscientist-0.1.0rc1.dist-info}/top_level.txt +0 -0
@@ -1,489 +0,0 @@
- # Megatron Integration with Accelerate
-
- ## Overview
-
- Accelerate supports Megatron-LM for massive model training with tensor parallelism and pipeline parallelism.
-
- **Megatron capabilities**:
- - **Tensor Parallelism (TP)**: Split layers across GPUs
- - **Pipeline Parallelism (PP)**: Split model depth across GPUs
- - **Data Parallelism (DP)**: Replicate model across GPU groups
- - **Sequence Parallelism**: Split sequences for long contexts
-
- ## Setup
-
- ### Install Megatron-LM
-
- ```bash
- # Clone Megatron-LM repository
- git clone https://github.com/NVIDIA/Megatron-LM.git
- cd Megatron-LM
- pip install -e .
-
- # Install Apex (NVIDIA optimizations)
- git clone https://github.com/NVIDIA/apex
- cd apex
- pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
-     --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
- ```
-
- ### Accelerate Configuration
-
- ```bash
- accelerate config
- ```
-
- **Questions**:
- ```
- In which compute environment are you running?
- > This machine
-
- Which type of machine are you using?
- > Multi-GPU
-
- How many different machines will you use?
- > 1
-
- Do you want to use DeepSpeed/FSDP?
- > No
-
- Do you want to use Megatron-LM?
- > Yes
-
- What is the Tensor Parallelism degree? [1-8]
- > 2
-
- Do you want to enable Sequence Parallelism?
- > No
-
- What is the Pipeline Parallelism degree? [1-8]
- > 2
-
- What is the Data Parallelism degree? [1-8]
- > 2
-
- Where to perform activation checkpointing? ['SELECTIVE', 'FULL', 'NONE']
- > SELECTIVE
-
- Where to perform activation partitioning? ['SEQUENTIAL', 'UNIFORM']
- > SEQUENTIAL
- ```
-
- **Generated config** (`~/.cache/huggingface/accelerate/default_config.yaml`):
- ```yaml
- compute_environment: LOCAL_MACHINE
- distributed_type: MEGATRON_LM
- downcast_bf16: 'no'
- machine_rank: 0
- main_training_function: main
- megatron_lm_config:
-   megatron_lm_gradient_clipping: 1.0
-   megatron_lm_learning_rate_decay_iters: 320000
-   megatron_lm_num_micro_batches: 1
-   megatron_lm_pp_degree: 2
-   megatron_lm_recompute_activations: true
-   megatron_lm_sequence_parallelism: false
-   megatron_lm_tp_degree: 2
- mixed_precision: bf16
- num_machines: 1
- num_processes: 8
- rdzv_backend: static
- same_network: true
- tpu_env: []
- tpu_use_cluster: false
- tpu_use_sudo: false
- use_cpu: false
- ```
-
- ## Parallelism Strategies
-
- ### Tensor Parallelism (TP)
-
- **Splits each transformer layer across GPUs**:
-
- ```python
- # Layer split across 2 GPUs
- # GPU 0: First half of attention heads
- # GPU 1: Second half of attention heads
-
- # Each GPU computes partial outputs
- # All-reduce combines results
- ```
-
- **TP degree recommendations**:
- - **TP=1**: No tensor parallelism (single GPU per layer)
- - **TP=2**: 2 GPUs per layer (good for 7-13B models)
- - **TP=4**: 4 GPUs per layer (good for 20-40B models)
- - **TP=8**: 8 GPUs per layer (good for 70B+ models)
-
- **Benefits**:
- - Reduces memory per GPU
- - All-reduce communication (fast); the partial-output math is sketched below
-
- **Drawbacks**:
- - Requires fast inter-GPU bandwidth (NVLink)
- - Communication overhead per layer
-
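- The partial-output idea can be checked numerically. The single-process sketch below is illustrative and not part of Megatron or Accelerate: it shards a linear layer's weight along the input dimension into two pieces (standing in for two TP ranks) and sums the partial products, which is exactly the reduction an all-reduce performs.
-
- ```python
- import torch
-
- torch.manual_seed(0)
-
- # Full linear layer: y = x @ W, with W of shape (in_features, out_features)
- x = torch.randn(4, 8)    # batch of 4, hidden size 8
- W = torch.randn(8, 16)
-
- # "TP=2": shard W along the input dimension; each shard stands in for one GPU
- W0, W1 = W[:4, :], W[4:, :]
- x0, x1 = x[:, :4], x[:, 4:]
-
- partial0 = x0 @ W0       # what rank 0 would compute locally
- partial1 = x1 @ W1       # what rank 1 would compute locally
-
- # The all-reduce step: summing partial outputs reproduces the full result
- y_tp = partial0 + partial1
- y_full = x @ W
- print(torch.allclose(y_tp, y_full, atol=1e-6))  # True
- ```
-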
- ### Pipeline Parallelism (PP)
-
- **Splits model depth across GPUs**:
-
- ```python
- # 12-layer model, PP=4
- # GPU 0: Layers 0-2
- # GPU 1: Layers 3-5
- # GPU 2: Layers 6-8
- # GPU 3: Layers 9-11
- ```
-
- **PP degree recommendations**:
- - **PP=1**: No pipeline parallelism
- - **PP=2**: 2 pipeline stages (good for 20-40B models)
- - **PP=4**: 4 pipeline stages (good for 70B+ models)
- - **PP=8**: 8 pipeline stages (good for 175B+ models)
-
- **Benefits**:
- - Near-linear reduction in per-GPU weight memory (PP=4 → roughly 4× fewer parameters per GPU)
- - Works across nodes (slower interconnect OK)
-
- **Drawbacks**:
- - Pipeline bubbles (idle time)
- - Requires micro-batching
-
- ### Data Parallelism (DP)
-
- **Replicates model across GPU groups**:
-
- ```python
- # 8 GPUs, TP=2, PP=2, DP=2
- # Group 0 (GPUs 0-3): Full model replica
- # Group 1 (GPUs 4-7): Full model replica
- ```
-
- **DP degree**:
- - `DP = total_gpus / (TP × PP)`
- - Example: 8 GPUs, TP=2, PP=2 → DP=2 (checked in the sketch below)
-
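- As a quick sanity check (illustrative only, not part of the original reference), the identity above can be applied to the 8-GPU example; it also catches layouts where TP × PP does not divide the GPU count.
-
- ```python
- def dp_degree(total_gpus: int, tp: int, pp: int) -> int:
-     """DP = total_gpus / (TP × PP); raises if the layout does not tile the GPUs."""
-     if total_gpus % (tp * pp) != 0:
-         raise ValueError(f"TP×PP={tp * pp} does not divide {total_gpus} GPUs")
-     return total_gpus // (tp * pp)
-
- print(dp_degree(8, tp=2, pp=2))  # 2, matching the example above
- ```
-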
- **Benefits**:
- - Increases throughput
- - Scales batch size
-
- ### Sequence Parallelism
-
- **Splits long sequences across GPUs** (extends TP):
-
- ```python
- # 8K sequence, TP=2, Sequence Parallel=True
- # GPU 0: Tokens 0-4095
- # GPU 1: Tokens 4096-8191
- ```
-
- **Benefits**:
- - Enables very long sequences (100K+ tokens)
- - Reduces activation memory
-
- **Requirements**:
- - Must use with TP > 1
- - RoPE/ALiBi position encodings work best
-
- ## Accelerate Code Example
-
- ### Basic Setup
-
- ```python
- from accelerate import Accelerator
- from accelerate.utils import MegatronLMPlugin
-
- # Configure Megatron
- megatron_plugin = MegatronLMPlugin(
-     tp_degree=2,                     # Tensor parallelism degree
-     pp_degree=2,                     # Pipeline parallelism degree
-     num_micro_batches=4,             # Micro-batches for pipeline
-     gradient_clipping=1.0,           # Gradient clipping value
-     sequence_parallelism=False,      # Sequence parallelism (disabled here)
-     recompute_activations=True,      # Activation checkpointing
-     use_distributed_optimizer=True,  # Distributed optimizer
-     custom_prepare_model_function=None,  # Custom model prep
- )
-
- # Initialize accelerator
- accelerator = Accelerator(
-     mixed_precision='bf16',
-     megatron_lm_plugin=megatron_plugin
- )
-
- # Prepare model and optimizer (assumes model, optimizer, train_dataloader are already defined)
- model, optimizer, train_dataloader = accelerator.prepare(
-     model, optimizer, train_dataloader
- )
-
- # Training loop (same as DDP!)
- for batch in train_dataloader:
-     optimizer.zero_grad()
-     outputs = model(**batch)
-     loss = outputs.loss
-     accelerator.backward(loss)
-     optimizer.step()
- ```
-
- ### Full Training Script
-
- ```python
- import torch
- from accelerate import Accelerator
- from accelerate.utils import MegatronLMPlugin
- from transformers import GPT2Config, GPT2LMHeadModel
-
- def main():
-     num_epochs = 3  # placeholder; set to your training budget
-
-     # Megatron configuration
-     megatron_plugin = MegatronLMPlugin(
-         tp_degree=2,
-         pp_degree=2,
-         num_micro_batches=4,
-         gradient_clipping=1.0,
-     )
-
-     accelerator = Accelerator(
-         mixed_precision='bf16',
-         gradient_accumulation_steps=8,
-         megatron_lm_plugin=megatron_plugin
-     )
-
-     # Model
-     config = GPT2Config(
-         n_layer=24,
-         n_head=16,
-         n_embd=1024,
-     )
-     model = GPT2LMHeadModel(config)
-
-     # Optimizer
-     optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
-
-     # Data (placeholder: random token batches; replace with a real tokenized dataset)
-     input_ids = torch.randint(0, config.vocab_size, (1024, 512))
-     dataset = torch.utils.data.TensorDataset(input_ids)
-
-     def collate(examples):
-         ids = torch.stack([ex[0] for ex in examples])
-         return {'input_ids': ids, 'labels': ids}
-
-     train_loader = torch.utils.data.DataLoader(
-         dataset, batch_size=4, shuffle=True, collate_fn=collate
-     )
-
-     # Prepare
-     model, optimizer, train_loader = accelerator.prepare(
-         model, optimizer, train_loader
-     )
-
-     # Training loop
-     for epoch in range(num_epochs):
-         for batch in train_loader:
-             with accelerator.accumulate(model):
-                 outputs = model(**batch)
-                 loss = outputs.loss
-                 accelerator.backward(loss)
-                 optimizer.step()
-                 optimizer.zero_grad()
-
-         # Save checkpoint
-         accelerator.wait_for_everyone()
-         accelerator.save_state(f'checkpoint-epoch-{epoch}')
-
- if __name__ == '__main__':
-     main()
- ```
-
- ### Launch Command
-
- ```bash
- # 8 GPUs, TP=2, PP=2, DP=2
- accelerate launch --multi_gpu --num_processes 8 train.py
-
- # Multi-node (2 nodes, 8 GPUs each)
- # Node 0
- accelerate launch --multi_gpu --num_processes 16 \
-     --num_machines 2 --machine_rank 0 \
-     --main_process_ip $MASTER_ADDR \
-     --main_process_port 29500 \
-     train.py
-
- # Node 1
- accelerate launch --multi_gpu --num_processes 16 \
-     --num_machines 2 --machine_rank 1 \
-     --main_process_ip $MASTER_ADDR \
-     --main_process_port 29500 \
-     train.py
- ```
-
- ## Activation Checkpointing
-
- **Reduces memory by recomputing activations**:
-
- ```python
- megatron_plugin = MegatronLMPlugin(
-     recompute_activations=True,                 # Enable checkpointing
-     checkpoint_num_layers=1,                    # Checkpoint every N layers
-     distribute_checkpointed_activations=True,   # Distribute across TP
-     partition_activations=True,                 # Partition in PP
-     check_for_nan_in_loss_and_grad=True,        # Stability check
- )
- ```
-
- **Strategies**:
- - `SELECTIVE`: Checkpoint transformer blocks only
- - `FULL`: Checkpoint all layers
- - `NONE`: No checkpointing
-
- **Memory savings**: typically 30-50%, at a 10-15% slowdown
-
- ## Distributed Optimizer
-
- **Shards optimizer state across DP ranks**:
-
- ```python
- megatron_plugin = MegatronLMPlugin(
-     use_distributed_optimizer=True,  # Enable sharded optimizer
- )
- ```
-
- **Benefits**:
- - Reduces optimizer memory by the DP degree
- - Example: DP=4 → 4× less optimizer memory per GPU (rough arithmetic below)
-
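- To make the 4× figure concrete, here is a rough, illustrative estimate (not from this reference): it counts only the two fp32 Adam moment tensors (8 bytes per parameter) and ignores fp32 master weights and gradients, so real numbers will be higher.
-
- ```python
- def adam_state_gb(n_params: float, dp_degree: int = 1) -> float:
-     """Rough AdamW state size: two fp32 moments = 8 bytes/param, sharded across DP ranks."""
-     return n_params * 8 / dp_degree / 1e9
-
- print(round(adam_state_gb(7e9), 1))               # ~56 GB unsharded for a 7B model
- print(round(adam_state_gb(7e9, dp_degree=4), 1))  # ~14 GB per GPU with DP=4
- ```
-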
- **Compatible with**:
- - AdamW, Adam, SGD
- - Mixed precision training
-
- ## Performance Tuning
-
- ### Micro-Batch Size
-
- ```python
- # Pipeline parallelism requires micro-batching
- megatron_plugin = MegatronLMPlugin(
-     pp_degree=4,
-     num_micro_batches=16,  # 16 micro-batches per pipeline
- )
-
- # Effective batch = num_micro_batches × micro_batch_size × DP
- # Example: 16 × 2 × 4 = 128
- ```
-
- **Recommendations**:
- - More micro-batches → less pipeline bubble (see the estimate below)
- - Typical: 4-16 micro-batches
-
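- A back-of-the-envelope way to see why more micro-batches help (not from this reference): for synchronous schedules such as GPipe/1F1B, the idle "bubble" fraction is roughly (PP − 1) / (m + PP − 1), where m is the number of micro-batches.
-
- ```python
- def bubble_fraction(pp: int, num_micro_batches: int) -> float:
-     """Idle fraction of a synchronous pipeline schedule (GPipe/1F1B estimate)."""
-     return (pp - 1) / (num_micro_batches + pp - 1)
-
- for m in (4, 16, 64):
-     print(m, round(bubble_fraction(pp=4, num_micro_batches=m), 2))
- # 4 -> 0.43, 16 -> 0.16, 64 -> 0.04: more micro-batches shrink the bubble
- ```
-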
- ### Sequence Length
-
- ```python
- # For long sequences, enable sequence parallelism
- megatron_plugin = MegatronLMPlugin(
-     tp_degree=4,
-     sequence_parallelism=True,  # Requires TP > 1
- )
-
- # Roughly enables sequences up to TP × the usual limit when activation memory is the bottleneck
- # Example: TP=4, 8K normal → ~32K with sequence parallelism
- ```
-
- ### GPU Topology
-
- **NVLink required for TP**:
- ```bash
- # Check NVLink topology
- nvidia-smi topo -m
-
- # Good topology (NVLink between all GPUs)
- # GPU0 - GPU1: NV12 (fast)
- # GPU0 - GPU2: NV12 (fast)
-
- # Bad topology (PCIe only)
- # GPU0 - GPU4: PHB (slow, avoid TP across these)
- ```
-
- **Recommendations**:
- - **TP**: Within the same node (NVLink)
- - **PP**: Across nodes (slower interconnect OK)
- - **DP**: Any topology (an illustrative rank layout follows)
-
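- To make the placement advice concrete, here is an illustrative rank layout for 2 nodes × 8 GPUs with TP=2, PP=2, DP=4, assuming the TP dimension varies fastest in the rank order (one common convention; the real mapping depends on the launcher and Megatron's initialization). Under that assumption, each TP pair stays inside one node while PP peers sit on different nodes.
-
- ```python
- TP, PP, DP, GPUS_PER_NODE = 2, 2, 4, 8
-
- def coords(rank: int) -> dict:
-     """Illustrative ordering: TP varies fastest, then DP, then PP."""
-     tp = rank % TP
-     dp = (rank // TP) % DP
-     pp = rank // (TP * DP)
-     return {"node": rank // GPUS_PER_NODE, "tp": tp, "dp": dp, "pp": pp}
-
- for rank in range(TP * PP * DP):
-     print(rank, coords(rank))
- # TP pairs (0,1), (2,3), ... share a node (NVLink); PP peers (r, r+8) are on different nodes
- ```
-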
- ## Model Size Guidelines
-
- | Model Size | GPUs | TP | PP | DP | Micro-Batches |
- |------------|------|----|----|----|---------------|
- | 7B         | 8    | 1  | 1  | 8  | 1             |
- | 13B        | 8    | 2  | 1  | 4  | 1             |
- | 20B        | 16   | 4  | 1  | 4  | 1             |
- | 40B        | 32   | 4  | 2  | 4  | 4             |
- | 70B        | 64   | 8  | 2  | 4  | 8             |
- | 175B       | 128  | 8  | 4  | 4  | 16            |
-
- **Assumptions**: BF16, 2K sequence length, A100 80GB. A small helper that turns a row of this table into a plugin configuration is sketched below.
-
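- As a convenience sketch (not part of the original reference), the table can be encoded as a lookup so a launch script stays consistent with it; the `MegatronLMPlugin` arguments mirror the ones used earlier in this guide.
-
- ```python
- from accelerate.utils import MegatronLMPlugin
-
- # (GPUs, TP, PP, DP, micro-batches) from the table above
- GUIDELINES = {
-     "7B":   (8,   1, 1, 8, 1),
-     "13B":  (8,   2, 1, 4, 1),
-     "20B":  (16,  4, 1, 4, 1),
-     "40B":  (32,  4, 2, 4, 4),
-     "70B":  (64,  8, 2, 4, 8),
-     "175B": (128, 8, 4, 4, 16),
- }
-
- def plugin_for(model_size: str) -> MegatronLMPlugin:
-     gpus, tp, pp, dp, micro = GUIDELINES[model_size]
-     assert gpus == tp * pp * dp, "table row must tile the GPU count"
-     return MegatronLMPlugin(
-         tp_degree=tp,
-         pp_degree=pp,
-         num_micro_batches=micro,
-         gradient_clipping=1.0,
-     )
-
- plugin = plugin_for("40B")  # launch with --num_processes 32 for this row
- ```
-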
- ## Checkpointing
-
- ### Save Checkpoint
-
- ```python
- # Save full model state
- accelerator.save_state('checkpoint-1000')
-
- # Megatron saves separate files per rank
- # checkpoint-1000/
- #   pytorch_model_tp_0_pp_0.bin
- #   pytorch_model_tp_0_pp_1.bin
- #   pytorch_model_tp_1_pp_0.bin
- #   pytorch_model_tp_1_pp_1.bin
- #   optimizer_tp_0_pp_0.bin
- #   ...
- ```
-
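- If a resume job fails with missing shards, a quick completeness check along these lines can help. It is an illustrative sketch that assumes the per-rank naming pattern shown in the comment above, which may differ across Megatron/Accelerate versions.
-
- ```python
- import itertools
- import os
-
- def missing_shards(ckpt_dir: str, tp_degree: int, pp_degree: int) -> list[str]:
-     """List model shard files absent from ckpt_dir, using the naming pattern above."""
-     expected = [
-         f"pytorch_model_tp_{tp}_pp_{pp}.bin"
-         for tp, pp in itertools.product(range(tp_degree), range(pp_degree))
-     ]
-     return [name for name in expected if not os.path.exists(os.path.join(ckpt_dir, name))]
-
- print(missing_shards("checkpoint-1000", tp_degree=2, pp_degree=2))
- ```
-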
- ### Load Checkpoint
-
- ```python
- # Resume training
- accelerator.load_state('checkpoint-1000')
-
- # Automatically loads correct shard per rank
- ```
-
- ### Convert to Standard PyTorch
-
- ```bash
- # Merge Megatron checkpoint to single file
- python merge_megatron_checkpoint.py \
-     --checkpoint-dir checkpoint-1000 \
-     --output pytorch_model.bin
- ```
-
- ## Common Issues
-
- ### Issue: OOM with Pipeline Parallelism
-
- **Solution**: Increase micro-batches
- ```python
- megatron_plugin = MegatronLMPlugin(
-     pp_degree=4,
-     num_micro_batches=16,  # Increase from 4
- )
- ```
-
- ### Issue: Slow Training
-
- **Check 1**: Pipeline bubbles (PP too high)
- ```python
- # Reduce PP, increase TP
- tp_degree=4  # Increase
- pp_degree=2  # Decrease
- ```
-
- **Check 2**: Micro-batch size too small
- ```python
- num_micro_batches=8  # Increase
- ```
-
- ### Issue: NVLink Not Detected
-
- ```bash
- # Verify NVLink
- nvidia-smi nvlink -s
-
- # If no NVLink, avoid TP > 1
- # Use PP or DP instead
- ```
-
- ## Resources
-
- - Megatron-LM: https://github.com/NVIDIA/Megatron-LM
- - Accelerate Megatron docs: https://huggingface.co/docs/accelerate/usage_guides/megatron_lm
- - Paper: "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism"
- - NVIDIA Apex: https://github.com/NVIDIA/apex