EvoScientist 0.0.1.dev2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (107)
  1. EvoScientist/EvoScientist.py +157 -0
  2. EvoScientist/__init__.py +24 -0
  3. EvoScientist/__main__.py +4 -0
  4. EvoScientist/backends.py +392 -0
  5. EvoScientist/cli.py +1553 -0
  6. EvoScientist/middleware.py +35 -0
  7. EvoScientist/prompts.py +277 -0
  8. EvoScientist/skills/accelerate/SKILL.md +332 -0
  9. EvoScientist/skills/accelerate/references/custom-plugins.md +453 -0
  10. EvoScientist/skills/accelerate/references/megatron-integration.md +489 -0
  11. EvoScientist/skills/accelerate/references/performance.md +525 -0
  12. EvoScientist/skills/bitsandbytes/SKILL.md +411 -0
  13. EvoScientist/skills/bitsandbytes/references/memory-optimization.md +521 -0
  14. EvoScientist/skills/bitsandbytes/references/qlora-training.md +521 -0
  15. EvoScientist/skills/bitsandbytes/references/quantization-formats.md +447 -0
  16. EvoScientist/skills/find-skills/SKILL.md +133 -0
  17. EvoScientist/skills/find-skills/scripts/install_skill.py +211 -0
  18. EvoScientist/skills/flash-attention/SKILL.md +367 -0
  19. EvoScientist/skills/flash-attention/references/benchmarks.md +215 -0
  20. EvoScientist/skills/flash-attention/references/transformers-integration.md +293 -0
  21. EvoScientist/skills/llama-cpp/SKILL.md +258 -0
  22. EvoScientist/skills/llama-cpp/references/optimization.md +89 -0
  23. EvoScientist/skills/llama-cpp/references/quantization.md +213 -0
  24. EvoScientist/skills/llama-cpp/references/server.md +125 -0
  25. EvoScientist/skills/lm-evaluation-harness/SKILL.md +490 -0
  26. EvoScientist/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  27. EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  28. EvoScientist/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  29. EvoScientist/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  30. EvoScientist/skills/ml-paper-writing/SKILL.md +937 -0
  31. EvoScientist/skills/ml-paper-writing/references/checklists.md +361 -0
  32. EvoScientist/skills/ml-paper-writing/references/citation-workflow.md +562 -0
  33. EvoScientist/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
  34. EvoScientist/skills/ml-paper-writing/references/sources.md +159 -0
  35. EvoScientist/skills/ml-paper-writing/references/writing-guide.md +476 -0
  36. EvoScientist/skills/ml-paper-writing/templates/README.md +251 -0
  37. EvoScientist/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
  38. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
  39. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
  40. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
  41. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
  42. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
  43. EvoScientist/skills/ml-paper-writing/templates/acl/README.md +50 -0
  44. EvoScientist/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
  45. EvoScientist/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
  46. EvoScientist/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
  47. EvoScientist/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
  48. EvoScientist/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
  49. EvoScientist/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
  50. EvoScientist/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
  51. EvoScientist/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
  52. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
  53. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
  54. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
  55. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
  56. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
  57. EvoScientist/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
  58. EvoScientist/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
  59. EvoScientist/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
  60. EvoScientist/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
  61. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
  62. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
  63. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
  64. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
  65. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
  66. EvoScientist/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
  67. EvoScientist/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
  68. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
  69. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
  70. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
  71. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
  72. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
  73. EvoScientist/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
  74. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
  75. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
  76. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
  77. EvoScientist/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
  78. EvoScientist/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
  79. EvoScientist/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
  80. EvoScientist/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
  81. EvoScientist/skills/peft/SKILL.md +431 -0
  82. EvoScientist/skills/peft/references/advanced-usage.md +514 -0
  83. EvoScientist/skills/peft/references/troubleshooting.md +480 -0
  84. EvoScientist/skills/ray-data/SKILL.md +326 -0
  85. EvoScientist/skills/ray-data/references/integration.md +82 -0
  86. EvoScientist/skills/ray-data/references/transformations.md +83 -0
  87. EvoScientist/skills/skill-creator/LICENSE.txt +202 -0
  88. EvoScientist/skills/skill-creator/SKILL.md +356 -0
  89. EvoScientist/skills/skill-creator/references/output-patterns.md +82 -0
  90. EvoScientist/skills/skill-creator/references/workflows.md +28 -0
  91. EvoScientist/skills/skill-creator/scripts/init_skill.py +303 -0
  92. EvoScientist/skills/skill-creator/scripts/package_skill.py +110 -0
  93. EvoScientist/skills/skill-creator/scripts/quick_validate.py +95 -0
  94. EvoScientist/stream/__init__.py +53 -0
  95. EvoScientist/stream/emitter.py +94 -0
  96. EvoScientist/stream/formatter.py +168 -0
  97. EvoScientist/stream/tracker.py +115 -0
  98. EvoScientist/stream/utils.py +255 -0
  99. EvoScientist/subagent.yaml +147 -0
  100. EvoScientist/tools.py +135 -0
  101. EvoScientist/utils.py +207 -0
  102. evoscientist-0.0.1.dev2.dist-info/METADATA +227 -0
  103. evoscientist-0.0.1.dev2.dist-info/RECORD +107 -0
  104. evoscientist-0.0.1.dev2.dist-info/WHEEL +5 -0
  105. evoscientist-0.0.1.dev2.dist-info/entry_points.txt +5 -0
  106. evoscientist-0.0.1.dev2.dist-info/licenses/LICENSE +21 -0
  107. evoscientist-0.0.1.dev2.dist-info/top_level.txt +1 -0
EvoScientist/skills/bitsandbytes/references/qlora-training.md
@@ -0,0 +1,521 @@
# QLoRA Training

Complete guide to fine-tuning large language models using 4-bit quantization with QLoRA (Quantized Low-Rank Adaptation).

## Overview

QLoRA enables fine-tuning 70B+ parameter models on consumer GPUs by:
- Loading the base model in 4-bit (75% memory reduction vs FP16)
- Training only small LoRA adapters (~20MB)
- Maintaining near-full-precision quality

**Memory savings**:
- Llama 2 70B: 140GB (FP16) → 35GB (4-bit) + 20MB (LoRA) = **35GB total**
- Fits on a single A100 80GB!

**Accuracy**: typically <1% degradation vs full fine-tuning (Dettmers et al., 2023)

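The arithmetic behind these numbers is worth sanity-checking. A minimal weights-only sketch (real usage adds activations, gradients, and CUDA overhead):

```python
# Weights-only memory estimate for a 70B-parameter model
params = 70e9
fp16_gb = params * 2 / 1e9    # 2 bytes/param -> ~140 GB
nf4_gb = params * 0.5 / 1e9   # 4 bits/param  -> ~35 GB
print(fp16_gb, nf4_gb)        # 140.0 35.0
```
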
## Quick Start

### Basic QLoRA Fine-tuning

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# Step 1: Load model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Step 2: Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Step 3: Add LoRA adapters
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 335M || all params: 70B || trainable%: 0.48%

# Step 4: Train (assumes `dataset` and `tokenizer` are already prepared;
# see Workflow 1 below)
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./qlora-70b",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    bf16=True,
    optim="paged_adamw_8bit",
    logging_steps=10,
    save_strategy="epoch"
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer
)

trainer.train()
```

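As a quick sanity check after loading, Hugging Face models expose `get_memory_footprint()`; for a 4-bit 70B model this should report roughly 35 GB:

```python
# Weight memory of the quantized model (excludes activations and optimizer state)
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
```
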
## Complete Training Workflows

### Workflow 1: Single GPU Training (Consumer GPU)

Train Llama 2 13B on an RTX 4090 (24GB).

**Step 1: Prepare dataset**

```python
from datasets import load_dataset

# Load instruction dataset. The guanaco "text" column already uses the
# "### Human: ... ### Assistant: ..." format, so it can be used as-is.
dataset = load_dataset("timdettmers/openassistant-guanaco")

# For datasets with separate columns (field names here are illustrative),
# map them into a single formatted "text" column instead:
def format_instruction(example):
    return {
        "text": f"### Human: {example['prompt']}\n### Assistant: {example['response']}"
    }

# dataset = dataset.map(format_instruction)
```

**Step 2: Configure quantization**

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # BF16 for stability
    bnb_4bit_quant_type="nf4",              # NormalFloat4 (recommended)
    bnb_4bit_use_double_quant=True          # Nested quantization
)
```

**Step 3: Load and prepare model**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Enable gradient checkpointing (further memory savings)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
```

**Step 4: Configure LoRA**

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                         # LoRA rank (lower = less memory)
    lora_alpha=32,                # Scaling factor
    target_modules="all-linear",  # Apply to all linear layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
```

**Step 5: Train**

```python
training_args = TrainingArguments(
    output_dir="./qlora-13b-results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch = 16
    warmup_steps=100,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    optim="paged_adamw_8bit",       # 8-bit optimizer
    max_grad_norm=0.3,
    max_steps=1000
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    max_seq_length=512
)

trainer.train()
```

**Memory usage**: ~18GB on an RTX 4090 (24GB)

### Workflow 2: Multi-GPU Training (FSDP + QLoRA)

Train Llama 2 70B on 8×A100 (80GB each).

**Step 1: Configure FSDP-compatible quantization**

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16  # CRITICAL for FSDP!
)
```

**Important**: `bnb_4bit_quant_storage=torch.bfloat16` ensures 4-bit layers are wrapped identically to regular BF16 layers, so FSDP can shard them.

**Step 2: Launch with accelerate**

Create `fsdp_config.yaml`:
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_sharding_strategy: 1  # FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
mixed_precision: bf16
num_processes: 8
```

**Launch training**:
```bash
accelerate launch --config_file fsdp_config.yaml train_qlora.py
```

**train_qlora.py**:
```python
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16
)

# Rest same as single-GPU workflow
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

trainer = SFTTrainer(...)
trainer.train()
```

**Memory per GPU**: ~40GB (70B model sharded across 8 GPUs)

### Workflow 3: Extremely Large Models (405B)

Train Llama 3.1 405B on 8×H100 (80GB each).

**Requirements**:
- 8×H100 80GB GPUs
- 256GB+ system RAM
- FSDP + QLoRA

**Configuration**:
```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16
)

lora_config = LoraConfig(
    r=32,                         # Higher rank for 405B
    lora_alpha=64,
    target_modules="all-linear",
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

training_args = TrainingArguments(
    output_dir="./qlora-405b",
    per_device_train_batch_size=1,   # Small batch
    gradient_accumulation_steps=32,  # Effective batch = 1 × 32 × 8 GPUs = 256
    learning_rate=1e-4,              # Lower LR for large model
    bf16=True,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True
)
```

**Memory per GPU**: ~70GB (4-bit weights sharded across 8 GPUs, plus activations and optimizer state)

## Hyperparameter Tuning

### LoRA Rank (r)

Controls adapter capacity:

| Model Size | Recommended r | Trainable Params | Use Case |
|------------|---------------|------------------|----------|
| 7B | 8-16 | ~4M | Simple tasks |
| 13B | 16-32 | ~8M | General fine-tuning |
| 70B | 32-64 | ~80M | Complex tasks |
| 405B | 64-128 | ~300M | Maximum capacity |

**Trade-off**: Higher r = more capacity, but more memory and slower training

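The parameter counts above follow from LoRA adding `r × (d_in + d_out)` weights per adapted linear layer. A rough check for the 7B row (a sketch assuming Llama 2 7B's 32 decoder layers and hidden size 4096, with adapters on `q_proj` and `v_proj` only):

```python
# LoRA adds r * (d_in + d_out) params per adapted linear layer
d, layers, adapted_per_layer, r = 4096, 32, 2, 8
trainable = r * (d + d) * adapted_per_layer * layers
print(f"{trainable / 1e6:.1f}M")  # ~4.2M, in line with the ~4M row above
```
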
### LoRA Alpha

Scaling factor for LoRA updates. The adapter output is multiplied by `lora_alpha / r`, so as a rule of thumb:

```python
effective_update_scale = learning_rate * (lora_alpha / r)
```

**Recommended**: `lora_alpha = 2 × r`
- r=16 → alpha=32
- r=64 → alpha=128

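A quick illustration of why `lora_alpha = 2 × r` is a convenient default: scaling alpha with r keeps the update scale constant as rank varies:

```python
# With alpha = 2 * r, the LoRA scale lora_alpha / r stays fixed at 2.0
for r in (8, 16, 32, 64):
    lora_alpha = 2 * r
    print(r, lora_alpha, lora_alpha / r)  # scale is always 2.0
```
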
### Target Modules

**Options**:
- `"all-linear"`: All linear layers (recommended for QLoRA)
- `["q_proj", "v_proj"]`: Only attention Q/V (minimal)
- `["q_proj", "k_proj", "v_proj", "o_proj"]`: All attention
- `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`: Attention + FFN

**Trade-off**: More modules = better performance, but more memory

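To pick `target_modules` explicitly, it helps to list which linear layers the loaded model actually contains. A minimal sketch (bitsandbytes' 4-bit layers subclass `nn.Linear`, so the isinstance check should cover them, though this is worth verifying on your version):

```python
import torch.nn as nn

# Collect leaf names of all linear layers; exclude lm_head, which is
# normally not adapted
linear_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
}
print(sorted(linear_names - {"lm_head"}))
# e.g. ['down_proj', 'gate_proj', 'k_proj', 'o_proj', 'q_proj', 'up_proj', 'v_proj']
```
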
### Learning Rate

| Model Size | Recommended LR |
|------------|----------------|
| 7-13B | 2e-4 to 3e-4 |
| 70B | 1e-4 to 2e-4 |
| 405B | 5e-5 to 1e-4 |

**Rule**: Larger models need lower learning rates

### Batch Size

```python
# e.g. 4 * 4 * 8 GPUs = 128
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
```

**Recommended effective batch sizes**:
- Instruction tuning: 64-128
- Continued pretraining: 256-512

### Quantization Dtype

These are values for `bnb_4bit_compute_dtype` (the dtype used for matmuls; weights stay 4-bit):

| Dtype | Speed | Accuracy | Use Case |
|-------|-------|----------|----------|
| `torch.float32` | Slow | Best | Debugging |
| `torch.bfloat16` | Fast | Good | **Recommended** |
| `torch.float16` | Fastest | Risky | May have precision issues |

## Advanced Techniques

### Gradient Checkpointing

Save memory by recomputing activations during the backward pass:

```python
model.gradient_checkpointing_enable()
# use_gradient_checkpointing=True also enables checkpointing, so the
# explicit call above is redundant but harmless
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
```

**Memory savings**: ~30-40% of activation memory
**Cost**: ~20% slower training

### Nested Quantization

Quantize the quantization constants themselves:

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True  # Enable nested (double) quantization
)
```

**Memory savings**: Additional ~2-3% reduction
**Accuracy**: Minimal impact

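The saving is small but essentially free. Per the QLoRA paper, double quantization saves roughly 0.37 bits per parameter by quantizing the per-block FP32 quantization constants down to 8-bit:

```python
# ~0.37 bits/param saved by double quantization (QLoRA paper figure)
params = 65e9
print(f"{params * 0.37 / 8 / 1e9:.1f} GB saved")  # ~3.0 GB for a 65B model
```
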
### CPU Offloading

For models that still don't fit:

```python
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=bnb_config,
    device_map="auto",
    max_memory={0: "40GB", "cpu": "100GB"}
)
```

**Trade-off**: Much slower, but enables larger models

### Paged Optimizers

Use paged memory for optimizer states:

```python
training_args = TrainingArguments(
    output_dir="./out",
    optim="paged_adamw_8bit"  # Or paged_adamw_32bit
)
```

**Benefit**: Prevents OOM spikes from optimizer states

## Deployment

### Save LoRA Adapters

```python
# Save only the adapters (~20MB)
model.save_pretrained("./qlora-adapters")
tokenizer.save_pretrained("./qlora-adapters")
```

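A quick way to confirm that only the adapters were saved (the checkpoint should be tens of MB, not tens of GB), sketched below:

```python
import os

# Total on-disk size of the saved adapter directory
total_bytes = sum(
    os.path.getsize(os.path.join(root, f))
    for root, _, files in os.walk("./qlora-adapters")
    for f in files
)
print(f"{total_bytes / 1e6:.0f} MB")
```
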
### Load for Inference

```python
from peft import PeftModel

# Load base model in 4-bit
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Load adapters
model = PeftModel.from_pretrained(base_model, "./qlora-adapters")

# Inference
inputs = tokenizer("Question here", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Merge Adapters (Optional)

```python
# Merge LoRA into the base weights
model = model.merge_and_unload()

# Save merged model
model.save_pretrained("./merged-model")
```

**Note**: The merged model is saved dequantized (FP16/BF16), not 4-bit; merging directly into a quantized base can also introduce small rounding errors.

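Because of that, a common pattern (a sketch, not the only option) is to reload the base model unquantized in BF16 and merge the adapters there:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in BF16, attach the saved adapters, then merge
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base, "./qlora-adapters").merge_and_unload()
merged.save_pretrained("./merged-model-bf16")
```
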
## Troubleshooting

### OOM During Training

1. Reduce batch size:
   ```python
   per_device_train_batch_size=1
   ```

2. Increase gradient accumulation:
   ```python
   gradient_accumulation_steps=16
   ```

3. Lower LoRA rank:
   ```python
   r=8  # Instead of 16
   ```

4. Enable gradient checkpointing

5. Use CPU offloading

### Low Quality Results

1. Increase LoRA rank:
   ```python
   r=64  # Instead of 16
   ```

2. Train longer:
   ```python
   num_train_epochs=3  # Instead of 1
   ```

3. Use more target modules:
   ```python
   target_modules="all-linear"
   ```

4. Check the learning rate (try 1e-4 to 3e-4)

### Slow Training

1. Disable gradient checkpointing (if memory allows)

2. Increase batch size

3. Use BF16:
   ```python
   bf16=True
   ```

4. Use a paged optimizer

## Best Practices

1. **Start small**: Test on 7B before 70B
2. **Monitor loss**: It should decrease steadily
3. **Use validation**: Track eval loss to detect overfitting
4. **Save checkpoints**: Every 100-500 steps
5. **Log hyperparameters**: For reproducibility
6. **Test inference**: Verify quality before full training

## Example: Complete Training Script

See the full working example at `examples/qlora_training.py` in the repository.

## References

- QLoRA paper: Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs", 2023
- bitsandbytes GitHub: https://github.com/bitsandbytes-foundation/bitsandbytes
- PEFT documentation: https://huggingface.co/docs/peft
- FSDP+QLoRA guide: https://huggingface.co/blog/fsdp-qlora