EvoScientist 0.0.1.dev1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (107)
  1. EvoScientist/EvoScientist.py +157 -0
  2. EvoScientist/__init__.py +24 -0
  3. EvoScientist/__main__.py +4 -0
  4. EvoScientist/backends.py +392 -0
  5. EvoScientist/cli.py +1553 -0
  6. EvoScientist/middleware.py +35 -0
  7. EvoScientist/prompts.py +277 -0
  8. EvoScientist/skills/accelerate/SKILL.md +332 -0
  9. EvoScientist/skills/accelerate/references/custom-plugins.md +453 -0
  10. EvoScientist/skills/accelerate/references/megatron-integration.md +489 -0
  11. EvoScientist/skills/accelerate/references/performance.md +525 -0
  12. EvoScientist/skills/bitsandbytes/SKILL.md +411 -0
  13. EvoScientist/skills/bitsandbytes/references/memory-optimization.md +521 -0
  14. EvoScientist/skills/bitsandbytes/references/qlora-training.md +521 -0
  15. EvoScientist/skills/bitsandbytes/references/quantization-formats.md +447 -0
  16. EvoScientist/skills/find-skills/SKILL.md +133 -0
  17. EvoScientist/skills/find-skills/scripts/install_skill.py +211 -0
  18. EvoScientist/skills/flash-attention/SKILL.md +367 -0
  19. EvoScientist/skills/flash-attention/references/benchmarks.md +215 -0
  20. EvoScientist/skills/flash-attention/references/transformers-integration.md +293 -0
  21. EvoScientist/skills/llama-cpp/SKILL.md +258 -0
  22. EvoScientist/skills/llama-cpp/references/optimization.md +89 -0
  23. EvoScientist/skills/llama-cpp/references/quantization.md +213 -0
  24. EvoScientist/skills/llama-cpp/references/server.md +125 -0
  25. EvoScientist/skills/lm-evaluation-harness/SKILL.md +490 -0
  26. EvoScientist/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  27. EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  28. EvoScientist/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  29. EvoScientist/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  30. EvoScientist/skills/ml-paper-writing/SKILL.md +937 -0
  31. EvoScientist/skills/ml-paper-writing/references/checklists.md +361 -0
  32. EvoScientist/skills/ml-paper-writing/references/citation-workflow.md +562 -0
  33. EvoScientist/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
  34. EvoScientist/skills/ml-paper-writing/references/sources.md +159 -0
  35. EvoScientist/skills/ml-paper-writing/references/writing-guide.md +476 -0
  36. EvoScientist/skills/ml-paper-writing/templates/README.md +251 -0
  37. EvoScientist/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
  38. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
  39. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
  40. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
  41. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
  42. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
  43. EvoScientist/skills/ml-paper-writing/templates/acl/README.md +50 -0
  44. EvoScientist/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
  45. EvoScientist/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
  46. EvoScientist/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
  47. EvoScientist/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
  48. EvoScientist/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
  49. EvoScientist/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
  50. EvoScientist/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
  51. EvoScientist/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
  52. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
  53. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
  54. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
  55. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
  56. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
  57. EvoScientist/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
  58. EvoScientist/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
  59. EvoScientist/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
  60. EvoScientist/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
  61. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
  62. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
  63. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
  64. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
  65. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
  66. EvoScientist/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
  67. EvoScientist/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
  68. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
  69. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
  70. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
  71. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
  72. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
  73. EvoScientist/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
  74. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
  75. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
  76. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
  77. EvoScientist/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
  78. EvoScientist/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
  79. EvoScientist/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
  80. EvoScientist/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
  81. EvoScientist/skills/peft/SKILL.md +431 -0
  82. EvoScientist/skills/peft/references/advanced-usage.md +514 -0
  83. EvoScientist/skills/peft/references/troubleshooting.md +480 -0
  84. EvoScientist/skills/ray-data/SKILL.md +326 -0
  85. EvoScientist/skills/ray-data/references/integration.md +82 -0
  86. EvoScientist/skills/ray-data/references/transformations.md +83 -0
  87. EvoScientist/skills/skill-creator/LICENSE.txt +202 -0
  88. EvoScientist/skills/skill-creator/SKILL.md +356 -0
  89. EvoScientist/skills/skill-creator/references/output-patterns.md +82 -0
  90. EvoScientist/skills/skill-creator/references/workflows.md +28 -0
  91. EvoScientist/skills/skill-creator/scripts/init_skill.py +303 -0
  92. EvoScientist/skills/skill-creator/scripts/package_skill.py +110 -0
  93. EvoScientist/skills/skill-creator/scripts/quick_validate.py +95 -0
  94. EvoScientist/stream/__init__.py +53 -0
  95. EvoScientist/stream/emitter.py +94 -0
  96. EvoScientist/stream/formatter.py +168 -0
  97. EvoScientist/stream/tracker.py +115 -0
  98. EvoScientist/stream/utils.py +255 -0
  99. EvoScientist/subagent.yaml +147 -0
  100. EvoScientist/tools.py +135 -0
  101. EvoScientist/utils.py +207 -0
  102. evoscientist-0.0.1.dev1.dist-info/METADATA +222 -0
  103. evoscientist-0.0.1.dev1.dist-info/RECORD +107 -0
  104. evoscientist-0.0.1.dev1.dist-info/WHEEL +5 -0
  105. evoscientist-0.0.1.dev1.dist-info/entry_points.txt +2 -0
  106. evoscientist-0.0.1.dev1.dist-info/licenses/LICENSE +21 -0
  107. evoscientist-0.0.1.dev1.dist-info/top_level.txt +1 -0
@@ -0,0 +1,490 @@
---
name: lm-evaluation-harness
description: Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Evaluation, LM Evaluation Harness, Benchmarking, MMLU, HumanEval, GSM8K, EleutherAI, Model Quality, Academic Benchmarks, Industry Standard]
dependencies: [lm-eval, transformers, vllm]
---

# lm-evaluation-harness - LLM Benchmarking

## Quick start

lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.

**Installation**:
```bash
pip install lm-eval
```

**Evaluate any HuggingFace model**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu,gsm8k,hellaswag \
    --device cuda:0 \
    --batch_size 8
```

**View available tasks**:
```bash
lm_eval --tasks list
```

## Common workflows

### Workflow 1: Standard benchmark evaluation

Evaluate model on core benchmarks (MMLU, GSM8K, HumanEval).

Copy this checklist:

```
Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model
- [ ] Step 3: Run evaluation
- [ ] Step 4: Analyze results
```

**Step 1: Choose benchmark suite**

**Core reasoning benchmarks**:
- **MMLU** (Massive Multitask Language Understanding) - 57 subjects, multiple choice
- **GSM8K** - Grade school math word problems
- **HellaSwag** - Common sense reasoning
- **TruthfulQA** - Truthfulness and factuality
- **ARC** (AI2 Reasoning Challenge) - Science questions

**Code benchmarks**:
- **HumanEval** - Python code generation (164 problems)
- **MBPP** (Mostly Basic Python Problems) - Python coding

**Standard suite** (recommended for model releases):
```bash
--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge
```

**Step 2: Configure model**

**HuggingFace model**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
    --tasks mmlu \
    --device cuda:0 \
    --batch_size auto  # Auto-detect optimal batch size
```

**Quantized model (4-bit/8-bit)**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
    --tasks mmlu \
    --device cuda:0
```

**Custom checkpoint**:
```bash
lm_eval --model hf \
    --model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
    --tasks mmlu \
    --device cuda:0
```

**Step 3: Run evaluation**

```bash
# Full MMLU evaluation (57 subjects), 5-shot (the standard setting)
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size 8 \
    --output_path results/ \
    --log_samples  # Save individual predictions

# Multiple benchmarks at once
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
    --num_fewshot 5 \
    --batch_size 8 \
    --output_path results/llama2-7b-eval.json
```

**Step 4: Analyze results**

Results saved to `results/llama2-7b-eval.json`:

```json
{
  "results": {
    "mmlu": {
      "acc": 0.459,
      "acc_stderr": 0.004
    },
    "gsm8k": {
      "exact_match": 0.142,
      "exact_match_stderr": 0.006
    },
    "hellaswag": {
      "acc_norm": 0.765,
      "acc_norm_stderr": 0.004
    }
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=meta-llama/Llama-2-7b-hf",
    "num_fewshot": 5
  }
}
```
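
To pull the headline numbers out of a results file programmatically, a minimal sketch like this works against the JSON layout shown above (the path and the choice of primary metric are assumptions based on this example, not a fixed harness schema):

```python
import json

# Path from the example above; adjust to your --output_path
with open("results/llama2-7b-eval.json") as f:
    data = json.load(f)

# Print the first recognized metric per task, with its standard error when present
for task, metrics in data["results"].items():
    for name in ("acc", "acc_norm", "exact_match"):
        if name in metrics:
            stderr = metrics.get(f"{name}_stderr")
            suffix = f" (±{stderr:.3f})" if stderr is not None else ""
            print(f"{task}: {name}={metrics[name]:.3f}{suffix}")
            break
```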

### Workflow 2: Track training progress

Evaluate checkpoints during training.

```
Training Progress Tracking:
- [ ] Step 1: Set up periodic evaluation
- [ ] Step 2: Choose quick benchmarks
- [ ] Step 3: Automate evaluation
- [ ] Step 4: Plot learning curves
```

**Step 1: Set up periodic evaluation**

Evaluate every N training steps:

```bash
#!/bin/bash
# eval_checkpoint.sh

CHECKPOINT_DIR=$1
STEP=$2

# 0-shot for speed
lm_eval --model hf \
    --model_args pretrained=$CHECKPOINT_DIR/checkpoint-$STEP \
    --tasks gsm8k,hellaswag \
    --num_fewshot 0 \
    --batch_size 16 \
    --output_path results/step-$STEP.json
```

**Step 2: Choose quick benchmarks**

Fast benchmarks for frequent evaluation (example command after this list):
- **HellaSwag**: ~10 minutes on 1 GPU
- **GSM8K**: ~5 minutes
- **PIQA**: ~2 minutes

Avoid for frequent eval (too slow):
- **MMLU**: ~2 hours (57 subjects)
- **HumanEval**: Requires code execution

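For example, a quick 0-shot pass over the fast tasks above might look like the following (checkpoint path and batch size are placeholders; `--limit` caps the number of examples per task for smoke tests, but the resulting scores are then not comparable to full runs):

```bash
lm_eval --model hf \
    --model_args pretrained=checkpoints/checkpoint-1000 \
    --tasks hellaswag,gsm8k,piqa \
    --num_fewshot 0 \
    --batch_size 16 \
    --limit 500 \
    --output_path results/quick-step-1000.json
```
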
**Step 3: Automate evaluation**

Integrate with the training script:

```python
import os

# In the training loop: save a checkpoint and launch evaluation every eval_interval steps
if step % eval_interval == 0:
    model.save_pretrained(f"checkpoints/checkpoint-{step}")

    # Run evaluation (eval_checkpoint.sh expects <checkpoint_dir> <step>)
    os.system(f"./eval_checkpoint.sh checkpoints {step}")
```
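
Instead of shelling out, recent releases of the harness (0.4+) also expose a Python entry point, `lm_eval.simple_evaluate`. A sketch of calling it in-process; treat the exact argument names as something to confirm against your installed version:

```python
import lm_eval

# Evaluate a saved checkpoint in-process (0-shot quick suite); the checkpoint
# path is a placeholder and the keyword names follow the upstream README example
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=checkpoints/checkpoint-1000",
    tasks=["gsm8k", "hellaswag"],
    num_fewshot=0,
    batch_size=16,
)
print(results["results"])
```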

Or use PyTorch Lightning callbacks:

```python
import os

from pytorch_lightning import Callback

class EvalHarnessCallback(Callback):
    def on_validation_epoch_end(self, trainer, pl_module):
        step = trainer.global_step
        checkpoint_path = f"checkpoints/step-{step}"

        # Save checkpoint
        trainer.save_checkpoint(checkpoint_path)

        # Run lm-eval on the saved checkpoint
        os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
```

**Step 4: Plot learning curves**

```python
import glob
import json

import matplotlib.pyplot as plt

# Load all per-step results (the quick suite above reports HellaSwag acc_norm)
steps = []
scores = []

for file in glob.glob("results/step-*.json"):
    with open(file) as f:
        data = json.load(f)
    step = int(file.split("-")[1].split(".")[0])
    steps.append(step)
    scores.append(data["results"]["hellaswag"]["acc_norm"])

# Plot in step order
steps, scores = zip(*sorted(zip(steps, scores)))
plt.plot(steps, scores)
plt.xlabel("Training Step")
plt.ylabel("HellaSwag Accuracy (acc_norm)")
plt.title("Training Progress")
plt.savefig("training_curve.png")
```

### Workflow 3: Compare multiple models

Benchmark suite for model comparison.

```
Model Comparison:
- [ ] Step 1: Define model list
- [ ] Step 2: Run evaluations
- [ ] Step 3: Generate comparison table
```

**Step 1: Define model list**

```bash
# models.txt
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-13b-hf
mistralai/Mistral-7B-v0.1
microsoft/phi-2
```

**Step 2: Run evaluations**

```bash
#!/bin/bash
# eval_all_models.sh

TASKS="mmlu,gsm8k,hellaswag,truthfulqa"

while read model; do
    echo "Evaluating $model"

    # Extract model name for output file
    model_name=$(echo $model | sed 's/\//-/g')

    lm_eval --model hf \
        --model_args pretrained=$model,dtype=bfloat16 \
        --tasks $TASKS \
        --num_fewshot 5 \
        --batch_size auto \
        --output_path results/$model_name.json

done < models.txt
```

**Step 3: Generate comparison table**

```python
import json

import pandas as pd

# Model IDs as they appear in models.txt; result files have "/" replaced by "-"
models = [
    "meta-llama/Llama-2-7b-hf",
    "meta-llama/Llama-2-13b-hf",
    "mistralai/Mistral-7B-v0.1",
    "microsoft/phi-2",
]

tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]

results = []
for model in models:
    with open(f"results/{model.replace('/', '-')}.json") as f:
        data = json.load(f)
    row = {"Model": model}
    for task in tasks:
        # Get primary metric for each task
        metrics = data["results"][task]
        if "acc" in metrics:
            row[task.upper()] = f"{metrics['acc']:.3f}"
        elif "exact_match" in metrics:
            row[task.upper()] = f"{metrics['exact_match']:.3f}"
    results.append(row)

df = pd.DataFrame(results)
print(df.to_markdown(index=False))
```

Output:
```
| Model                  | MMLU  | GSM8K | HELLASWAG | TRUTHFULQA |
|------------------------|-------|-------|-----------|------------|
| meta-llama/Llama-2-7b  | 0.459 | 0.142 | 0.765     | 0.391      |
| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801     | 0.430      |
| mistralai/Mistral-7B   | 0.626 | 0.395 | 0.812     | 0.428      |
| microsoft/phi-2        | 0.560 | 0.613 | 0.682     | 0.447      |
```
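
Since these numbers usually end up in a paper, the same DataFrame can also be exported to LaTeX with pandas' `to_latex`. A self-contained sketch reusing the scores shown above:

```python
import pandas as pd

# Scores copied from the comparison table above
df = pd.DataFrame([
    {"Model": "meta-llama/Llama-2-7b",  "MMLU": 0.459, "GSM8K": 0.142, "HellaSwag": 0.765, "TruthfulQA": 0.391},
    {"Model": "meta-llama/Llama-2-13b", "MMLU": 0.549, "GSM8K": 0.287, "HellaSwag": 0.801, "TruthfulQA": 0.430},
    {"Model": "mistralai/Mistral-7B",   "MMLU": 0.626, "GSM8K": 0.395, "HellaSwag": 0.812, "TruthfulQA": 0.428},
    {"Model": "microsoft/phi-2",        "MMLU": 0.560, "GSM8K": 0.613, "HellaSwag": 0.682, "TruthfulQA": 0.447},
])

# Produces a tabular environment ready to paste into the paper
print(df.to_latex(index=False, float_format="%.3f"))
```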

### Workflow 4: Evaluate with vLLM (faster inference)

Use vLLM backend for 5-10× faster evaluation.

```
vLLM Evaluation:
- [ ] Step 1: Install vLLM
- [ ] Step 2: Configure vLLM backend
- [ ] Step 3: Run evaluation
```

**Step 1: Install vLLM**

```bash
pip install vllm
```

**Step 2: Configure vLLM backend**

```bash
lm_eval --model vllm \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
    --tasks mmlu \
    --batch_size auto
```

**Step 3: Run evaluation**

vLLM is 5-10× faster than standard HuggingFace:

```bash
# Standard HF: ~2 hours for MMLU on 7B model
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu \
    --batch_size 8

# vLLM: ~15-20 minutes for MMLU on 7B model
lm_eval --model vllm \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
    --tasks mmlu \
    --batch_size auto
```

## When to use vs alternatives

**Use lm-evaluation-harness when:**
- Benchmarking models for academic papers
- Comparing model quality across standard tasks
- Tracking training progress
- Reporting standardized metrics (everyone uses same prompts)
- Need reproducible evaluation

**Use alternatives instead:**
- **HELM** (Stanford): Broader evaluation (fairness, efficiency, calibration)
- **AlpacaEval**: Instruction-following evaluation with LLM judges
- **MT-Bench**: Conversational multi-turn evaluation
- **Custom scripts**: Domain-specific evaluation

## Common issues

**Issue: Evaluation too slow**

Use vLLM backend:
```bash
lm_eval --model vllm \
    --model_args pretrained=model-name,tensor_parallel_size=2
```

Or reduce fewshot examples:
```bash
--num_fewshot 0  # Instead of 5
```

Or evaluate subset of MMLU:
```bash
--tasks mmlu_stem  # Only STEM subjects
```

**Issue: Out of memory**

Reduce batch size:
```bash
--batch_size 1  # Or --batch_size auto
```

Use quantization:
```bash
--model_args pretrained=model-name,load_in_8bit=True
```

Enable CPU offloading:
```bash
--model_args pretrained=model-name,device_map=auto,offload_folder=offload
```

**Issue: Different results than reported**

Check fewshot count:
```bash
--num_fewshot 5  # Most papers use 5-shot
```

Check exact task name:
```bash
--tasks mmlu  # Not mmlu_direct or mmlu_fewshot
```

Verify model and tokenizer match:
```bash
--model_args pretrained=model-name,tokenizer=same-model-name
```

**Issue: HumanEval not executing code**

Install execution dependencies:
```bash
pip install human-eval
```

Enable code execution:
```bash
lm_eval --model hf \
    --model_args pretrained=model-name \
    --tasks humaneval \
    --allow_code_execution  # Required for HumanEval
```

## Advanced topics

**Benchmark descriptions**: See [references/benchmark-guide.md](references/benchmark-guide.md) for detailed descriptions of all 60+ tasks, what they measure, and how to interpret results.

**Custom tasks**: See [references/custom-tasks.md](references/custom-tasks.md) for creating domain-specific evaluation tasks.

**API evaluation**: See [references/api-evaluation.md](references/api-evaluation.md) for evaluating OpenAI, Anthropic, and other API models.

**Multi-GPU strategies**: See [references/distributed-eval.md](references/distributed-eval.md) for data-parallel and tensor-parallel evaluation.

## Hardware requirements

- **GPU**: NVIDIA (CUDA 11.8+), works on CPU (very slow)
- **VRAM** (rough estimates; see the sketch after this list):
  - 7B model: 16GB (bf16) or 8GB (8-bit)
  - 13B model: 28GB (bf16) or 14GB (8-bit)
  - 70B model: Requires multi-GPU or quantization
- **Time** (7B model, single A100):
  - HellaSwag: 10 minutes
  - GSM8K: 5 minutes
  - MMLU (full): 2 hours
  - HumanEval: 20 minutes

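The VRAM figures above follow roughly from parameter count × bytes per parameter, plus headroom for activations and the KV cache. A back-of-the-envelope sketch (the ~15% overhead factor is an assumption, not a measured constant):

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float, overhead: float = 1.15) -> float:
    """Rough VRAM estimate: weight memory padded by an assumed overhead factor."""
    return params_billion * bytes_per_param * overhead  # 1B params * 1 byte ~ 1 GB

# bf16 = 2 bytes/param, 8-bit = 1 byte/param, 4-bit = 0.5 bytes/param
print(f"7B  bf16 : {estimate_vram_gb(7, 2):.0f} GB")   # ~16 GB, matches the figure above
print(f"13B bf16 : {estimate_vram_gb(13, 2):.0f} GB")  # ~30 GB, close to the 28 GB above
print(f"7B  8-bit: {estimate_vram_gb(7, 1):.0f} GB")   # ~8 GB
```
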
## Resources

- GitHub: https://github.com/EleutherAI/lm-evaluation-harness
- Docs: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs
- Task library: 60+ tasks including MMLU, GSM8K, HumanEval, TruthfulQA, HellaSwag, ARC, WinoGrande, etc.
- Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (uses this harness)