EvoScientist 0.0.1.dev4-py3-none-any.whl → 0.1.0rc1-py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (113)
  1. EvoScientist/EvoScientist.py +26 -62
  2. EvoScientist/__init__.py +0 -19
  3. EvoScientist/backends.py +0 -26
  4. EvoScientist/cli.py +1111 -498
  5. EvoScientist/middleware.py +8 -61
  6. EvoScientist/stream/__init__.py +0 -25
  7. EvoScientist/stream/utils.py +16 -23
  8. EvoScientist/tools.py +2 -75
  9. evoscientist-0.1.0rc1.dist-info/METADATA +199 -0
  10. evoscientist-0.1.0rc1.dist-info/RECORD +21 -0
  11. evoscientist-0.1.0rc1.dist-info/entry_points.txt +2 -0
  12. EvoScientist/config.py +0 -274
  13. EvoScientist/llm/__init__.py +0 -21
  14. EvoScientist/llm/models.py +0 -99
  15. EvoScientist/memory.py +0 -715
  16. EvoScientist/onboard.py +0 -725
  17. EvoScientist/paths.py +0 -44
  18. EvoScientist/skills/accelerate/SKILL.md +0 -332
  19. EvoScientist/skills/accelerate/references/custom-plugins.md +0 -453
  20. EvoScientist/skills/accelerate/references/megatron-integration.md +0 -489
  21. EvoScientist/skills/accelerate/references/performance.md +0 -525
  22. EvoScientist/skills/bitsandbytes/SKILL.md +0 -411
  23. EvoScientist/skills/bitsandbytes/references/memory-optimization.md +0 -521
  24. EvoScientist/skills/bitsandbytes/references/qlora-training.md +0 -521
  25. EvoScientist/skills/bitsandbytes/references/quantization-formats.md +0 -447
  26. EvoScientist/skills/find-skills/SKILL.md +0 -133
  27. EvoScientist/skills/find-skills/scripts/install_skill.py +0 -211
  28. EvoScientist/skills/flash-attention/SKILL.md +0 -367
  29. EvoScientist/skills/flash-attention/references/benchmarks.md +0 -215
  30. EvoScientist/skills/flash-attention/references/transformers-integration.md +0 -293
  31. EvoScientist/skills/llama-cpp/SKILL.md +0 -258
  32. EvoScientist/skills/llama-cpp/references/optimization.md +0 -89
  33. EvoScientist/skills/llama-cpp/references/quantization.md +0 -213
  34. EvoScientist/skills/llama-cpp/references/server.md +0 -125
  35. EvoScientist/skills/lm-evaluation-harness/SKILL.md +0 -490
  36. EvoScientist/skills/lm-evaluation-harness/references/api-evaluation.md +0 -490
  37. EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md +0 -488
  38. EvoScientist/skills/lm-evaluation-harness/references/custom-tasks.md +0 -602
  39. EvoScientist/skills/lm-evaluation-harness/references/distributed-eval.md +0 -519
  40. EvoScientist/skills/ml-paper-writing/SKILL.md +0 -937
  41. EvoScientist/skills/ml-paper-writing/references/checklists.md +0 -361
  42. EvoScientist/skills/ml-paper-writing/references/citation-workflow.md +0 -562
  43. EvoScientist/skills/ml-paper-writing/references/reviewer-guidelines.md +0 -367
  44. EvoScientist/skills/ml-paper-writing/references/sources.md +0 -159
  45. EvoScientist/skills/ml-paper-writing/references/writing-guide.md +0 -476
  46. EvoScientist/skills/ml-paper-writing/templates/README.md +0 -251
  47. EvoScientist/skills/ml-paper-writing/templates/aaai2026/README.md +0 -534
  48. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +0 -144
  49. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +0 -952
  50. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +0 -111
  51. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +0 -1493
  52. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +0 -315
  53. EvoScientist/skills/ml-paper-writing/templates/acl/README.md +0 -50
  54. EvoScientist/skills/ml-paper-writing/templates/acl/acl.sty +0 -312
  55. EvoScientist/skills/ml-paper-writing/templates/acl/acl_latex.tex +0 -377
  56. EvoScientist/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +0 -101
  57. EvoScientist/skills/ml-paper-writing/templates/acl/acl_natbib.bst +0 -1940
  58. EvoScientist/skills/ml-paper-writing/templates/acl/anthology.bib.txt +0 -26
  59. EvoScientist/skills/ml-paper-writing/templates/acl/custom.bib +0 -70
  60. EvoScientist/skills/ml-paper-writing/templates/acl/formatting.md +0 -326
  61. EvoScientist/skills/ml-paper-writing/templates/colm2025/README.md +0 -3
  62. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +0 -11
  63. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +0 -1440
  64. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
  65. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +0 -218
  66. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +0 -305
  67. EvoScientist/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +0 -485
  68. EvoScientist/skills/ml-paper-writing/templates/colm2025/math_commands.tex +0 -508
  69. EvoScientist/skills/ml-paper-writing/templates/colm2025/natbib.sty +0 -1246
  70. EvoScientist/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +0 -485
  71. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +0 -24
  72. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +0 -1440
  73. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
  74. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +0 -246
  75. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +0 -414
  76. EvoScientist/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +0 -508
  77. EvoScientist/skills/ml-paper-writing/templates/iclr2026/natbib.sty +0 -1246
  78. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithm.sty +0 -79
  79. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +0 -201
  80. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.bib +0 -75
  81. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
  82. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.tex +0 -662
  83. EvoScientist/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +0 -864
  84. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.bst +0 -1443
  85. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.sty +0 -767
  86. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
  87. EvoScientist/skills/ml-paper-writing/templates/neurips2025/Makefile +0 -36
  88. EvoScientist/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +0 -53
  89. EvoScientist/skills/ml-paper-writing/templates/neurips2025/main.tex +0 -38
  90. EvoScientist/skills/ml-paper-writing/templates/neurips2025/neurips.sty +0 -382
  91. EvoScientist/skills/peft/SKILL.md +0 -431
  92. EvoScientist/skills/peft/references/advanced-usage.md +0 -514
  93. EvoScientist/skills/peft/references/troubleshooting.md +0 -480
  94. EvoScientist/skills/ray-data/SKILL.md +0 -326
  95. EvoScientist/skills/ray-data/references/integration.md +0 -82
  96. EvoScientist/skills/ray-data/references/transformations.md +0 -83
  97. EvoScientist/skills/skill-creator/LICENSE.txt +0 -202
  98. EvoScientist/skills/skill-creator/SKILL.md +0 -356
  99. EvoScientist/skills/skill-creator/references/output-patterns.md +0 -82
  100. EvoScientist/skills/skill-creator/references/workflows.md +0 -28
  101. EvoScientist/skills/skill-creator/scripts/init_skill.py +0 -303
  102. EvoScientist/skills/skill-creator/scripts/package_skill.py +0 -110
  103. EvoScientist/skills/skill-creator/scripts/quick_validate.py +0 -95
  104. EvoScientist/skills_manager.py +0 -391
  105. EvoScientist/stream/display.py +0 -604
  106. EvoScientist/stream/events.py +0 -415
  107. EvoScientist/stream/state.py +0 -343
  108. evoscientist-0.0.1.dev4.dist-info/METADATA +0 -367
  109. evoscientist-0.0.1.dev4.dist-info/RECORD +0 -117
  110. evoscientist-0.0.1.dev4.dist-info/entry_points.txt +0 -5
  111. {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc1.dist-info}/WHEEL +0 -0
  112. {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc1.dist-info}/licenses/LICENSE +0 -0
  113. {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc1.dist-info}/top_level.txt +0 -0
@@ -1,521 +0,0 @@
1
- # Memory Optimization
2
-
3
- Complete guide to CPU offloading, gradient checkpointing, memory profiling, and advanced memory-saving strategies with bitsandbytes.
4
-
5
- ## Overview
6
-
7
- Memory optimization techniques for fitting large models:
8
- - **Quantization**: 50-75% reduction (covered in other docs)
9
- - **CPU offloading**: Move weights to CPU/disk
10
- - **Gradient checkpointing**: Trade compute for memory
11
- - **Optimizer strategies**: 8-bit, paged optimizers
12
- - **Mixed precision**: FP16/BF16 training
13
-
14
- ## CPU Offloading
15
-
16
- ### Basic CPU Offloading
17
-
18
- Move parts of the model to CPU RAM when not in use.
19
-
20
- ```python
21
- from transformers import AutoModelForCausalLM, BitsAndBytesConfig
22
- import torch
23
-
24
- config = BitsAndBytesConfig(
25
- load_in_4bit=True,
26
- bnb_4bit_compute_dtype=torch.bfloat16
27
- )
28
-
29
- model = AutoModelForCausalLM.from_pretrained(
30
- "meta-llama/Llama-2-70b-hf",
31
- quantization_config=config,
32
- device_map="auto", # Automatic device placement
33
- max_memory={0: "40GB", "cpu": "100GB"} # 40GB GPU, 100GB CPU
34
- )
35
- ```
36
-
37
- **How it works**:
38
- - Weights stored on CPU
39
- - Moved to GPU only when needed for computation
40
- - Automatically managed by `accelerate`
41
-
42
- **Trade-off**: ~5-10× slower but enables larger models
43
-
44
- ### Multi-GPU Offloading
45
-
46
- Distribute across multiple GPUs + CPU:
47
-
48
- ```python
49
- model = AutoModelForCausalLM.from_pretrained(
50
- "meta-llama/Llama-2-405b-hf",
51
- quantization_config=config,
52
- device_map="auto",
53
- max_memory={
54
- 0: "70GB", # GPU 0
55
- 1: "70GB", # GPU 1
56
- 2: "70GB", # GPU 2
57
- 3: "70GB", # GPU 3
58
- "cpu": "200GB" # CPU RAM
59
- }
60
- )
61
- ```
62
-
63
- **Result**: 405B model (4-bit = ~200GB) fits on 4×80GB GPUs + CPU
64
-
65
- ### Disk Offloading
66
-
67
- For models too large even for CPU RAM:
68
-
69
- ```python
70
- model = AutoModelForCausalLM.from_pretrained(
71
- "meta-llama/Llama-2-405b-hf",
72
- quantization_config=config,
73
- device_map="auto",
74
- offload_folder="./offload", # Disk offload directory
75
- offload_state_dict=True,
76
- max_memory={0: "40GB", "cpu": "50GB"}
77
- )
78
- ```
79
-
80
- **Trade-off**: Extremely slow (~100× slower) but works
81
-
82
- ### Manual Device Mapping
83
-
84
- For precise control:
85
-
86
- ```python
87
- device_map = {
88
- "model.embed_tokens": 0, # GPU 0
89
- "model.layers.0": 0,
90
- "model.layers.1": 0,
91
- # ...
92
- "model.layers.40": 1, # GPU 1
93
- "model.layers.41": 1,
94
- # ...
95
- "model.layers.79": "cpu", # CPU
96
- "model.norm": "cpu",
97
- "lm_head": "cpu"
98
- }
99
-
100
- model = AutoModelForCausalLM.from_pretrained(
101
- "meta-llama/Llama-2-70b-hf",
102
- quantization_config=config,
103
- device_map=device_map
104
- )
105
- ```
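
After loading with either an automatic or a manual `device_map`, it is worth confirming where each module actually landed; `transformers` records the final placement on the model as `hf_device_map`. A minimal check:

```python
# Inspect the final module-to-device assignment chosen during loading.
# hf_device_map is set by transformers when a device_map is used.
for module_name, device in model.hf_device_map.items():
    print(f"{module_name} -> {device}")
```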
106
-
107
- ## Gradient Checkpointing
108
-
109
- Recompute activations during the backward pass instead of storing them.
110
-
111
- ### Enable for HuggingFace Models
112
-
113
- ```python
114
- from transformers import AutoModelForCausalLM
115
-
116
- model = AutoModelForCausalLM.from_pretrained(
117
- "meta-llama/Llama-2-13b-hf",
118
- quantization_config=config
119
- )
120
-
121
- # Enable gradient checkpointing
122
- model.gradient_checkpointing_enable()
123
- ```
124
-
125
- **Memory savings**: ~30-50% activation memory
126
- **Cost**: ~20% slower training
127
-
128
- ### With QLoRA
129
-
130
- ```python
131
- from peft import prepare_model_for_kbit_training
132
-
133
- # Enable gradient checkpointing before preparing for training
134
- model.gradient_checkpointing_enable()
135
- model = prepare_model_for_kbit_training(
136
- model,
137
- use_gradient_checkpointing=True
138
- )
139
- ```
140
-
141
- ### Non-Reentrant Checkpointing
143
-
144
- ```python
145
- # Use the non-reentrant implementation (recommended; more robust with frozen/quantized base weights)
146
- model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
146
- ```
147
-
148
- ### Memory Breakdown
149
-
150
- Example: Llama 2 13B forward pass
151
-
152
- | Component | Without Checkpointing | With Checkpointing |
153
- |-----------|----------------------|-------------------|
154
- | Model weights | 26 GB | 26 GB |
155
- | Activations | 12 GB | **3 GB** |
156
- | Gradients | 26 GB | 26 GB |
157
- | Optimizer (AdamW, FP32 states) | 104 GB | 104 GB |
158
- | **Total** | 168 GB | **159 GB** |
159
-
160
- **Savings**: ~9GB for 13B model
161
-
162
- ## 8-Bit Optimizers
163
-
164
- Use 8-bit optimizer states instead of 32-bit.
165
-
166
- ### Standard AdamW Memory
167
-
168
- ```
169
- Optimizer memory = 2 × model_params × 4 bytes (FP32)
170
- = 8 × model_params
171
-
172
- Example (Llama 2 70B):
173
- = 8 × 70B = 560 GB
174
- ```
175
-
176
- ### 8-Bit AdamW Memory
177
-
178
- ```
179
- Optimizer memory = 2 × model_params × 1 byte (INT8)
180
- = 2 × model_params
181
-
182
- Example (Llama 2 70B):
183
- = 2 × 70B = 140 GB
184
-
185
- Savings: 420 GB (75% reduction!)
186
- ```
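
The same arithmetic as a quick sketch (illustrative only; real usage also depends on paging overhead and which parameters the optimizer actually tracks):

```python
# Rough AdamW optimizer-state memory: two states per parameter.
def adamw_state_gb(n_params: float, bytes_per_state: int) -> float:
    return 2 * n_params * bytes_per_state / 1e9

n = 70e9  # Llama 2 70B
print(adamw_state_gb(n, 4))  # FP32 states: ~560 GB
print(adamw_state_gb(n, 1))  # 8-bit states: ~140 GB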
187
-
188
- ### Enable in Transformers
189
-
190
- ```python
191
- from transformers import TrainingArguments
192
-
193
- training_args = TrainingArguments(
194
- output_dir="./output",
195
- per_device_train_batch_size=4,
196
- optim="paged_adamw_8bit", # 8-bit optimizer
197
- learning_rate=2e-4
198
- )
199
- ```
200
-
201
- ### Available 8-Bit Optimizers
202
-
203
- | Optimizer | Name | Use Case |
204
- |-----------|------|----------|
205
- | AdamW 8-bit | `adamw_8bit` | General training |
206
- | Paged AdamW 8-bit | `paged_adamw_8bit` | **Recommended** (prevents OOM) |
207
- | Paged AdamW 32-bit | `paged_adamw_32bit` | High accuracy needed |
208
-
209
- **Recommendation**: Always use `paged_adamw_8bit`
210
-
211
- ### Manual Usage
212
-
213
- ```python
214
- import bitsandbytes as bnb
215
-
216
- optimizer = bnb.optim.PagedAdamW8bit(
217
- model.parameters(),
218
- lr=1e-4,
219
- betas=(0.9, 0.999),
220
- eps=1e-8
221
- )
222
- ```
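
Because the bitsandbytes optimizers follow the standard `torch.optim` interface, they drop into an ordinary training loop unchanged. A minimal sketch, assuming `model` and a `dataloader` that yields batches with labels already exist:

```python
import bitsandbytes as bnb

optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-4)

for batch in dataloader:  # assumed: dicts of tensors already on the right device
    outputs = model(**batch)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```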
223
-
224
- ## Paged Optimizers
225
-
226
- Paged optimizers use unified memory (GPU + CPU) to prevent OOM.
227
-
228
- ### How It Works
229
-
230
- - Optimizer states stored in paged memory
231
- - Pages swap between GPU and CPU as needed
232
- - Prevents hard OOM crashes
233
-
234
- ### Configuration
235
-
236
- ```python
237
- from transformers import TrainingArguments
238
-
239
- training_args = TrainingArguments(
240
- optim="paged_adamw_8bit", # Enables paging
241
- # Paging happens automatically
242
- )
243
- ```
244
-
245
- ### Benefits
246
-
247
- ✅ No hard OOM (graceful degradation)
248
- ✅ Enables larger batch sizes
249
- ✅ Combines with 8-bit for maximum savings
250
-
251
- ### Performance
252
-
253
- **Speed**: ~5-10% slower than standard optimizer
254
- **Memory**: Bounded by CPU RAM rather than GPU memory (optimizer states page to host as needed)
255
-
256
- ## Mixed Precision Training
257
-
258
- Use lower precision for faster training and less memory.
259
-
260
- ### BF16 Training (Recommended)
261
-
262
- ```python
263
- training_args = TrainingArguments(
264
- bf16=True, # BFloat16 training
265
- bf16_full_eval=True
266
- )
267
- ```
268
-
269
- **Requirements**: Ampere+ GPUs (A100, H100, RTX 3090+)
270
-
271
- **Benefits**:
272
- - 2× faster training
273
- - 50% less activation memory
274
- - Better stability than FP16
275
-
276
- ### FP16 Training
277
-
278
- ```python
279
- training_args = TrainingArguments(
280
- fp16=True, # Float16 training
281
- fp16_full_eval=True
282
- )
283
- ```
284
-
285
- **Requirements**: Volta+ GPUs (V100, A100, RTX 2080+)
286
-
287
- **Benefits**:
288
- - 2× faster training
289
- - 50% less activation memory
290
- - Slightly less stable than BF16
291
-
292
- ### Precision Comparison
293
-
294
- | Precision | Speed | Memory | Stability | Use Case |
295
- |-----------|-------|--------|-----------|----------|
296
- | FP32 | 1× | 100% | Best | Debugging |
297
- | BF16 | 2× | 50% | Good | **Recommended** |
298
- | FP16 | 2× | 50% | Fair | Pre-Ampere GPUs |
299
-
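
When in doubt which row of the table applies to your hardware, PyTorch can report BF16 support at runtime; a small sketch:

```python
import torch

# Prefer BF16 on Ampere+ GPUs, fall back to FP16 elsewhere.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
print("bf16" if use_bf16 else "fp16")
```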
300
- ## Complete Memory Optimization Stack
301
-
302
- ### Maximum Optimization (Llama 2 70B on Single A100 80GB)
303
-
304
- ```python
305
- from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
306
- from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
307
- import torch
308
-
309
- # Step 1: 4-bit quantization
310
- bnb_config = BitsAndBytesConfig(
311
- load_in_4bit=True,
312
- bnb_4bit_compute_dtype=torch.bfloat16,
313
- bnb_4bit_quant_type="nf4",
314
- bnb_4bit_use_double_quant=True
315
- )
316
-
317
- model = AutoModelForCausalLM.from_pretrained(
318
- "meta-llama/Llama-2-70b-hf",
319
- quantization_config=bnb_config,
320
- device_map="auto",
321
- max_memory={0: "70GB", "cpu": "100GB"} # CPU offload if needed
322
- )
323
-
324
- # Step 2: Gradient checkpointing
325
- model.gradient_checkpointing_enable()
326
-
327
- # Step 3: Prepare for training
328
- model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
329
-
330
- # Step 4: LoRA adapters
331
- lora_config = LoraConfig(
332
- r=16, # Lower rank for memory
333
- lora_alpha=32,
334
- target_modules="all-linear",
335
- lora_dropout=0.05,
336
- bias="none",
337
- task_type="CAUSAL_LM"
338
- )
339
-
340
- model = get_peft_model(model, lora_config)
341
-
342
- # Step 5: Training arguments
343
- training_args = TrainingArguments(
344
- output_dir="./output",
345
- per_device_train_batch_size=1, # Small batch
346
- gradient_accumulation_steps=16, # Effective batch = 16
347
- bf16=True, # Mixed precision
348
- optim="paged_adamw_8bit", # 8-bit optimizer
349
- max_grad_norm=0.3,
350
- learning_rate=2e-4
351
- )
352
-
353
- # Peak memory: roughly 60-75 GB depending on batch size and sequence length (fits on A100 80GB)
354
- ```
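
To actually launch training with this stack, the pieces above plug into the standard `Trainer`. A minimal sketch, assuming a pre-tokenized causal-LM dataset named `train_dataset` (hypothetical here):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, Trainer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
tokenizer.pad_token = tokenizer.eos_token

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumed: already tokenized
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```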
355
-
356
- ### Memory Breakdown
357
-
358
- | Component | Memory |
359
- |-----------|--------|
360
- | Model (4-bit) | 35 GB |
361
- | LoRA adapters | 0.5 GB |
362
- | Activations (with checkpointing) | 8 GB |
363
- | Gradients | 0.5 GB |
364
- | Optimizer (8-bit paged) | 1 GB |
365
- | Batch buffer | 10 GB |
366
- | CUDA overhead | 5 GB |
367
- | **Total** | **~60 GB** |
368
-
369
- ## Memory Profiling
370
-
371
- ### PyTorch Memory Profiler
372
-
373
- ```python
374
- import torch
375
-
376
- # Start profiling
377
- torch.cuda.empty_cache()
378
- torch.cuda.reset_peak_memory_stats()
379
-
380
- # Your code here
381
- model = AutoModelForCausalLM.from_pretrained(...)
382
- model.generate(...)
383
-
384
- # Check memory
385
- print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
386
- print(f"Peak: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")
387
- print(f"Cached: {torch.cuda.memory_reserved()/1e9:.2f} GB")
388
- ```
389
-
390
- ### Detailed Memory Summary
391
-
392
- ```python
393
- print(torch.cuda.memory_summary())
394
- ```
395
-
396
- Output:
397
- ```
398
- |===========================================================================|
399
- | PyTorch CUDA memory summary |
400
- |---------------------------------------------------------------------------|
401
- | Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
402
- |---------------------------------------------------------------------------|
403
- | Allocated memory | 45.2 GB | 52.3 GB | 156.8 GB | 111.6 GB |
404
- | Active memory | 45.2 GB | 52.3 GB | 156.8 GB | 111.6 GB |
405
- | GPU reserved | 46.0 GB | 54.0 GB | 54.0 GB | 8.0 GB |
406
- |===========================================================================|
407
- ```
408
-
409
- ### Track Memory During Training
410
-
411
- ```python
412
- import torch
- from transformers import Trainer, TrainerCallback
413
-
414
- class MemoryCallback(TrainerCallback):
415
- def on_step_end(self, args, state, control, **kwargs):
416
- if state.global_step % 10 == 0:
417
- allocated = torch.cuda.memory_allocated() / 1e9
418
- reserved = torch.cuda.memory_reserved() / 1e9
419
- print(f"Step {state.global_step}: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")
420
-
421
- trainer = Trainer(
422
- model=model,
423
- args=training_args,
424
- callbacks=[MemoryCallback()]
425
- )
426
- ```
427
-
428
- ## Troubleshooting OOM
429
-
430
- ### Diagnostic Steps
431
-
432
- 1. **Check current memory**:
433
- ```python
434
- print(torch.cuda.memory_summary())
435
- ```
436
-
437
- 2. **Try smaller batch**:
438
- ```python
439
- per_device_train_batch_size=1
440
- ```
441
-
442
- 3. **Enable gradient checkpointing**:
443
- ```python
444
- model.gradient_checkpointing_enable()
445
- ```
446
-
447
- 4. **Use 8-bit optimizer**:
448
- ```python
449
- optim="paged_adamw_8bit"
450
- ```
451
-
452
- 5. **Add CPU offloading**:
453
- ```python
454
- max_memory={0: "70GB", "cpu": "100GB"}
455
- ```
456
-
457
- 6. **Reduce LoRA rank**:
458
- ```python
459
- r=8 # Instead of 16
460
- ```
461
-
462
- ### Emergency: Last Resort
463
-
464
- ```python
465
- from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
-
- # Absolute minimum memory config
466
- model = AutoModelForCausalLM.from_pretrained(
467
- "model-name",
468
- quantization_config=BitsAndBytesConfig(load_in_4bit=True),
469
- device_map="auto",
470
- max_memory={0: "20GB", "cpu": "200GB"},
471
- offload_folder="./offload"
472
- )
473
-
474
- model.gradient_checkpointing_enable()
475
-
476
- training_args = TrainingArguments(
477
- per_device_train_batch_size=1,
478
- gradient_accumulation_steps=64,
479
- bf16=True,
480
- optim="paged_adamw_8bit"
481
- )
482
- ```
483
-
484
- **Result**: Extremely slow but will probably work
485
-
486
- ## Best Practices
487
-
488
- 1. **Start with quantization**: 4-bit gives 75% savings
489
- 2. **Add gradient checkpointing**: 30-50% activation savings
490
- 3. **Use 8-bit optimizer**: 75% optimizer savings
491
- 4. **Enable mixed precision**: 50% activation savings
492
- 5. **CPU offload only if needed**: Slow but enables larger models
493
- 6. **Profile regularly**: Identify memory bottlenecks
494
- 7. **Test with small batches**: Prevent OOM during development
495
-
496
- ## Memory Estimation Formula
497
-
498
- ```
499
- Total Memory = Model + Activations + Gradients + Optimizer + Buffer
500
-
501
- Model = Parameters × Bytes per param
502
- Activations = Batch × Seq × Hidden × Layers × Bytes per activation
503
- Gradients = Parameters × Bytes per gradient
504
- Optimizer = Parameters × State count × Bytes per state
505
- Buffer = 2-5 GB (CUDA overhead)
506
- ```
507
-
508
- **With all optimizations**:
509
- ```
510
- Model = Total params × 0.5 bytes (4-bit)
511
- Activations ≈ 0.3 × unoptimized activations (checkpointing + BF16)
512
- Gradients = Trainable (LoRA) params × 2 bytes (BF16)
513
- Optimizer = Trainable params × 2 bytes (8-bit: two 1-byte states)
514
- ```
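
As a worked example, the formulas above can be turned into a rough calculator. This is only a sketch: the 30% activation factor, the buffer size, and the trainable-parameter count are assumptions, not measurements.

```python
def estimate_qlora_gb(total_params, trainable_params, base_activations_gb, buffer_gb=4):
    """Rough QLoRA memory estimate (GB): 4-bit weights, checkpointing + BF16,
    BF16 LoRA gradients, and an 8-bit optimizer over LoRA params only."""
    model = total_params * 0.5 / 1e9          # 4-bit weights
    activations = base_activations_gb * 0.3   # checkpointing + BF16 (assumed factor)
    gradients = trainable_params * 2 / 1e9    # BF16 grads for trainable params
    optimizer = trainable_params * 2 / 1e9    # 8-bit Adam: two 1-byte states
    return model + activations + gradients + optimizer + buffer_gb

# Illustrative numbers for a 70B model with ~250M LoRA parameters.
print(round(estimate_qlora_gb(70e9, 250e6, base_activations_gb=25), 1))
```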
515
-
516
- ## References
517
-
518
- - PyTorch memory management: https://pytorch.org/docs/stable/notes/cuda.html
519
- - Accelerate device_map: https://huggingface.co/docs/accelerate/usage_guides/big_modeling
520
- - Gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html
521
- - bitsandbytes optimizers: https://github.com/bitsandbytes-foundation/bitsandbytes#optimizer