EvoScientist 0.0.1.dev2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (107)
  1. EvoScientist/EvoScientist.py +157 -0
  2. EvoScientist/__init__.py +24 -0
  3. EvoScientist/__main__.py +4 -0
  4. EvoScientist/backends.py +392 -0
  5. EvoScientist/cli.py +1553 -0
  6. EvoScientist/middleware.py +35 -0
  7. EvoScientist/prompts.py +277 -0
  8. EvoScientist/skills/accelerate/SKILL.md +332 -0
  9. EvoScientist/skills/accelerate/references/custom-plugins.md +453 -0
  10. EvoScientist/skills/accelerate/references/megatron-integration.md +489 -0
  11. EvoScientist/skills/accelerate/references/performance.md +525 -0
  12. EvoScientist/skills/bitsandbytes/SKILL.md +411 -0
  13. EvoScientist/skills/bitsandbytes/references/memory-optimization.md +521 -0
  14. EvoScientist/skills/bitsandbytes/references/qlora-training.md +521 -0
  15. EvoScientist/skills/bitsandbytes/references/quantization-formats.md +447 -0
  16. EvoScientist/skills/find-skills/SKILL.md +133 -0
  17. EvoScientist/skills/find-skills/scripts/install_skill.py +211 -0
  18. EvoScientist/skills/flash-attention/SKILL.md +367 -0
  19. EvoScientist/skills/flash-attention/references/benchmarks.md +215 -0
  20. EvoScientist/skills/flash-attention/references/transformers-integration.md +293 -0
  21. EvoScientist/skills/llama-cpp/SKILL.md +258 -0
  22. EvoScientist/skills/llama-cpp/references/optimization.md +89 -0
  23. EvoScientist/skills/llama-cpp/references/quantization.md +213 -0
  24. EvoScientist/skills/llama-cpp/references/server.md +125 -0
  25. EvoScientist/skills/lm-evaluation-harness/SKILL.md +490 -0
  26. EvoScientist/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  27. EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  28. EvoScientist/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  29. EvoScientist/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  30. EvoScientist/skills/ml-paper-writing/SKILL.md +937 -0
  31. EvoScientist/skills/ml-paper-writing/references/checklists.md +361 -0
  32. EvoScientist/skills/ml-paper-writing/references/citation-workflow.md +562 -0
  33. EvoScientist/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
  34. EvoScientist/skills/ml-paper-writing/references/sources.md +159 -0
  35. EvoScientist/skills/ml-paper-writing/references/writing-guide.md +476 -0
  36. EvoScientist/skills/ml-paper-writing/templates/README.md +251 -0
  37. EvoScientist/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
  38. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
  39. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
  40. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
  41. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
  42. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
  43. EvoScientist/skills/ml-paper-writing/templates/acl/README.md +50 -0
  44. EvoScientist/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
  45. EvoScientist/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
  46. EvoScientist/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
  47. EvoScientist/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
  48. EvoScientist/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
  49. EvoScientist/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
  50. EvoScientist/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
  51. EvoScientist/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
  52. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
  53. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
  54. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
  55. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
  56. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
  57. EvoScientist/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
  58. EvoScientist/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
  59. EvoScientist/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
  60. EvoScientist/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
  61. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
  62. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
  63. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
  64. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
  65. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
  66. EvoScientist/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
  67. EvoScientist/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
  68. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
  69. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
  70. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
  71. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
  72. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
  73. EvoScientist/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
  74. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
  75. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
  76. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
  77. EvoScientist/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
  78. EvoScientist/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
  79. EvoScientist/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
  80. EvoScientist/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
  81. EvoScientist/skills/peft/SKILL.md +431 -0
  82. EvoScientist/skills/peft/references/advanced-usage.md +514 -0
  83. EvoScientist/skills/peft/references/troubleshooting.md +480 -0
  84. EvoScientist/skills/ray-data/SKILL.md +326 -0
  85. EvoScientist/skills/ray-data/references/integration.md +82 -0
  86. EvoScientist/skills/ray-data/references/transformations.md +83 -0
  87. EvoScientist/skills/skill-creator/LICENSE.txt +202 -0
  88. EvoScientist/skills/skill-creator/SKILL.md +356 -0
  89. EvoScientist/skills/skill-creator/references/output-patterns.md +82 -0
  90. EvoScientist/skills/skill-creator/references/workflows.md +28 -0
  91. EvoScientist/skills/skill-creator/scripts/init_skill.py +303 -0
  92. EvoScientist/skills/skill-creator/scripts/package_skill.py +110 -0
  93. EvoScientist/skills/skill-creator/scripts/quick_validate.py +95 -0
  94. EvoScientist/stream/__init__.py +53 -0
  95. EvoScientist/stream/emitter.py +94 -0
  96. EvoScientist/stream/formatter.py +168 -0
  97. EvoScientist/stream/tracker.py +115 -0
  98. EvoScientist/stream/utils.py +255 -0
  99. EvoScientist/subagent.yaml +147 -0
  100. EvoScientist/tools.py +135 -0
  101. EvoScientist/utils.py +207 -0
  102. evoscientist-0.0.1.dev2.dist-info/METADATA +227 -0
  103. evoscientist-0.0.1.dev2.dist-info/RECORD +107 -0
  104. evoscientist-0.0.1.dev2.dist-info/WHEEL +5 -0
  105. evoscientist-0.0.1.dev2.dist-info/entry_points.txt +5 -0
  106. evoscientist-0.0.1.dev2.dist-info/licenses/LICENSE +21 -0
  107. evoscientist-0.0.1.dev2.dist-info/top_level.txt +1 -0
EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md
@@ -0,0 +1,488 @@
# Benchmark Guide

A complete guide to the 60+ evaluation tasks in lm-evaluation-harness: what they measure and how to interpret the results.

## Overview

The lm-evaluation-harness includes 60+ benchmarks spanning:
- Language understanding (MMLU, GLUE)
- Mathematical reasoning (GSM8K, MATH)
- Code generation (HumanEval, MBPP)
- Instruction following (IFEval, AlpacaEval)
- Long-context understanding (LongBench)
- Multilingual capabilities (AfroBench, NorEval)
- Reasoning (BBH, ARC)
- Truthfulness (TruthfulQA)

**List all tasks**:
```bash
lm_eval --tasks list
```
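
**Filter the list**: The full task list is long, so piping it through `grep` is a quick way to find related task names (the exact output format of `--tasks list` varies between harness versions):

```bash
# Show only task names that mention "mmlu" (output format may differ across versions)
lm_eval --tasks list | grep -i mmlu
```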

## Major Benchmarks

### MMLU (Massive Multitask Language Understanding)

**What it measures**: Broad knowledge across 57 subjects (STEM, humanities, social sciences, law).

**Task variants**:
- `mmlu`: Original 57-subject benchmark
- `mmlu_pro`: More challenging version with reasoning-focused questions
- `mmlu_prox`: Multilingual extension

**Format**: Multiple choice (4 options)

**Example**:
```
Question: What is the capital of France?
A. Berlin
B. Paris
C. London
D. Madrid
Answer: B
```

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu \
    --num_fewshot 5
```

**Interpretation**:
- Random: 25% (chance)
- GPT-3 (175B): 43.9%
- GPT-4: 86.4%
- Human expert: ~90%

**Good for**: Assessing general knowledge and domain expertise.
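
**Saving results**: To compare your own runs against the reference numbers above, persist the aggregated scores with `--output_path` and inspect the JSON afterwards. A minimal sketch; the output directory name is arbitrary, and the exact file layout of the results JSON can differ between harness versions:

```bash
# Run MMLU and write the aggregated scores to disk
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu \
    --num_fewshot 5 \
    --output_path results/llama2-7b-mmlu

# Inspect the aggregated metrics (adjust the path to wherever the harness wrote the file)
jq '.results' results/llama2-7b-mmlu/*.json
```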

### GSM8K (Grade School Math 8K)

**What it measures**: Mathematical reasoning on grade-school level word problems.

**Task variants**:
- `gsm8k`: Base task
- `gsm8k_cot`: With chain-of-thought prompting
- `gsm_plus`: Adversarial variant with perturbations

**Format**: Free-form generation; the numerical answer is extracted from the model's output

**Example**:
```
Question: A baker made 200 cookies. He sold 3/5 of them in the morning and 1/4 of the remaining in the afternoon. How many cookies does he have left?
Answer: 60
```

(Working: 3/5 of 200 = 120 sold in the morning, leaving 80; 1/4 of 80 = 20 sold in the afternoon, leaving 60.)

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks gsm8k \
    --num_fewshot 5
```

**Interpretation**:
- Random: ~0%
- GPT-3 (175B): 17.0%
- GPT-4: 92.0%
- Llama 2 70B: 56.8%

**Good for**: Testing multi-step reasoning and arithmetic.
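
**Debugging answer extraction**: Because GSM8K scoring depends on pulling the final number out of free-form output, it is worth inspecting a few raw generations when scores look suspiciously low. A sketch using the chain-of-thought variant; `--log_samples` writes per-example generations next to the aggregated results and, in recent harness versions, requires `--output_path`:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks gsm8k_cot \
    --num_fewshot 5 \
    --output_path results/gsm8k-cot \
    --log_samples
```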

### HumanEval

**What it measures**: Python code generation from docstrings (functional correctness).

**Task variants**:
- `humaneval`: Standard benchmark
- `humaneval_instruct`: For instruction-tuned models

**Format**: Code generation, execution-based evaluation

**Example**:
```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
```

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=codellama/CodeLlama-7b-hf \
    --tasks humaneval \
    --batch_size 1
```

**Interpretation**:
- Random: 0%
- GPT-3 (175B): 0%
- Codex: 28.8%
- GPT-4: 67.0%
- Code Llama 34B: 53.7%

**Good for**: Evaluating code generation capabilities.
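
**Code execution**: HumanEval is scored by executing model-generated code, which the harness disables by default. Depending on your lm-eval version, you typically opt in by setting the Hugging Face code-eval environment variable and passing a confirmation flag (treat the exact names below as version-dependent, and prefer running inside a container or sandbox):

```bash
# Opt in to executing untrusted generated code (ideally in an isolated environment)
HF_ALLOW_CODE_EVAL=1 lm_eval --model hf \
    --model_args pretrained=codellama/CodeLlama-7b-hf \
    --tasks humaneval \
    --batch_size 1 \
    --confirm_run_unsafe_code
```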

### BBH (BIG-Bench Hard)

**What it measures**: 23 challenging BIG-Bench tasks on which earlier language models failed to outperform the average human rater.

**Categories**:
- Logical reasoning
- Math word problems
- Social understanding
- Algorithmic reasoning

**Format**: Multiple choice and free-form

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks bbh \
    --num_fewshot 3
```

**Interpretation**:
- Random: ~25%
- GPT-3 (175B): 33.9%
- PaLM 540B: 58.3%
- GPT-4: 86.7%

**Good for**: Testing advanced reasoning capabilities.
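
**Quick sanity check**: BBH is a large suite, so a full run takes a while. Before committing to it, `--limit` caps the number of examples per task; use it only for debugging, since limited runs are not comparable to published numbers:

```bash
# Debug run: evaluate only the first 20 examples of each BBH subtask
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks bbh \
    --num_fewshot 3 \
    --limit 20
```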

### IFEval (Instruction-Following Evaluation)

**What it measures**: Ability to follow specific, verifiable instructions.

**Instruction types**:
- Format constraints (e.g., "answer in 3 sentences")
- Length constraints (e.g., "use at least 100 words")
- Content constraints (e.g., "include the word 'banana'")
- Structural constraints (e.g., "use bullet points")

**Format**: Free-form generation with rule-based verification

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
    --tasks ifeval \
    --batch_size auto
```

**Interpretation**:
- Measures instruction adherence, not response quality
- GPT-4: 86% instruction following
- Claude 2: 84%

**Good for**: Evaluating chat/instruct models.
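
**Chat templates**: Chat-tuned models are normally evaluated with their chat template applied, so the prompt matches the format the model was trained on. Recent harness versions expose this via `--apply_chat_template` (a sketch; check whether your installed version supports the flag):

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
    --tasks ifeval \
    --batch_size auto \
    --apply_chat_template
```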

### GLUE (General Language Understanding Evaluation)

**What it measures**: Natural language understanding across 9 tasks.

**Tasks**:
- `cola`: Grammatical acceptability
- `sst2`: Sentiment analysis
- `mrpc`: Paraphrase detection
- `qqp`: Question pairs
- `stsb`: Semantic similarity
- `mnli`: Natural language inference
- `qnli`: Question answering NLI
- `rte`: Recognizing textual entailment
- `wnli`: Winograd schemas

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=bert-base-uncased \
    --tasks glue \
    --num_fewshot 0
```

**Interpretation**:
- BERT Base: 78.3 (GLUE score)
- RoBERTa Large: 88.5
- Human baseline: 87.1
- Note: the published scores above come from fine-tuned models; zero-shot prompting scores will be much lower

**Good for**: Encoder-only models, fine-tuning baselines.

### LongBench

**What it measures**: Long-context understanding (4K-32K tokens).

**21 tasks covering**:
- Single-document QA
- Multi-document QA
- Summarization
- Few-shot learning
- Code completion
- Synthetic tasks

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks longbench \
    --batch_size 1
```

**Interpretation**:
- Tests context utilization
- Many models struggle beyond 4K tokens
- GPT-4 Turbo: 54.3%

**Good for**: Evaluating long-context models.
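
**Context length**: When evaluating long-context tasks, make sure the backend's maximum sequence length matches what the model actually supports; otherwise inputs are silently truncated and scores collapse. With the HF backend this can usually be set through `model_args` (a sketch; the `max_length` argument and the value you can use depend on the model and harness version):

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B,max_length=32768 \
    --tasks longbench \
    --batch_size 1
```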

## Additional Benchmarks

### TruthfulQA

**What it measures**: Whether a model answers truthfully rather than reproducing plausible-sounding falsehoods and common misconceptions.

**Format**: Multiple choice (typically 4-5 options); `truthfulqa_mc1` scores the single best answer, while `truthfulqa_mc2` scores the probability mass assigned to the true answers

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks truthfulqa_mc2 \
    --batch_size auto
```

**Interpretation**:
- Larger models often score worse, because they reproduce common misconceptions more fluently
- GPT-3: 58.8%
- GPT-4: 59.0%
- Human: ~94%

### ARC (AI2 Reasoning Challenge)

**What it measures**: Grade-school science questions.

**Variants**:
- `arc_easy`: Easier questions
- `arc_challenge`: Harder questions requiring reasoning

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks arc_challenge \
    --num_fewshot 25
```

**Interpretation**:
- ARC-Easy: Most models >80%
- ARC-Challenge random: 25%
- GPT-4: 96.3%

### HellaSwag

**What it measures**: Commonsense reasoning about everyday situations.

**Format**: Choose the most plausible continuation

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks hellaswag \
    --num_fewshot 10
```

**Interpretation**:
- Random: 25%
- GPT-3: 78.9%
- Llama 2 70B: 85.3%

### WinoGrande

**What it measures**: Commonsense reasoning via pronoun resolution.

**Example**:
```
The trophy doesn't fit in the brown suitcase because _ is too large.
A. the trophy
B. the suitcase
Answer: A
```

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks winogrande \
    --num_fewshot 5
```

### PIQA

**What it measures**: Physical commonsense reasoning.

**Example**: "To clean a keyboard, use compressed air or..."

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks piqa
```

## Multilingual Benchmarks

### AfroBench

**What it measures**: Performance across 64 African languages.

**15 tasks**: NLU, text generation, knowledge, QA, math reasoning

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks afrobench
```

### NorEval

**What it measures**: Norwegian language understanding (9 task categories).

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=NbAiLab/nb-gpt-j-6B \
    --tasks noreval
```

## Domain-Specific Benchmarks

### MATH

**What it measures**: High-school competition math problems.

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks math \
    --num_fewshot 4
```

**Interpretation**:
- Very challenging
- GPT-4: 42.5%
- Minerva 540B: 33.6%

### MBPP (Mostly Basic Python Problems)

**What it measures**: Python programming from natural language descriptions.

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=codellama/CodeLlama-7b-hf \
    --tasks mbpp \
    --batch_size 1
```

### DROP

**What it measures**: Reading comprehension requiring discrete reasoning.

**Command**:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks drop
```

## Benchmark Selection Guide

### For General Purpose Models

Run this suite:
```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu,gsm8k,hellaswag,arc_challenge,truthfulqa_mc2 \
    --num_fewshot 5
```

Note that a single `--num_fewshot` value applies to every task in the list; omit the flag to use each task's default few-shot setting.

### For Code Models

```bash
lm_eval --model hf \
    --model_args pretrained=codellama/CodeLlama-7b-hf \
    --tasks humaneval,mbpp \
    --batch_size 1
```

### For Chat/Instruct Models

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
    --tasks ifeval,mmlu,gsm8k_cot \
    --batch_size auto
```

### For Long Context Models

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B \
    --tasks longbench \
    --batch_size 1
```

## Interpreting Results

### Understanding Metrics

**Accuracy**: Percentage of correct answers (most common)

**Exact Match (EM)**: Requires the prediction to match the reference string exactly (strict)

**F1 Score**: Balances precision and recall

**BLEU/ROUGE**: Text generation similarity

**Pass@k**: Fraction of problems for which at least one of k generated samples passes the tests

### Typical Score Ranges

| Model Size | MMLU | GSM8K | HumanEval | HellaSwag |
|------------|------|-------|-----------|-----------|
| 7B | 40-50% | 10-20% | 5-15% | 70-80% |
| 13B | 45-55% | 20-35% | 15-25% | 75-82% |
| 70B | 60-70% | 50-65% | 35-50% | 82-87% |
| GPT-4 | 86% | 92% | 67% | 95% |

### Red Flags

- **All tasks at random chance**: Model weights, tokenizer, or prompt format are likely misconfigured
- **Exact 0% on generation tasks**: Likely a format/parsing issue
- **Huge variance across runs**: Check seed/sampling settings
- **Better than GPT-4 on everything**: Likely contamination

## Best Practices

1. **Always report few-shot setting**: 0-shot, 5-shot, etc.
2. **Run multiple seeds**: Report mean ± std (see the loop sketch after this list)
3. **Check for data contamination**: Search training data for benchmark examples
4. **Compare to published baselines**: Validate your setup
5. **Report all hyperparameters**: Model, batch size, max tokens, temperature
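
**Multi-seed runs** (the sketch referenced in item 2 above): a simple loop over seeds, writing each run to its own output directory so the scores can be averaged afterwards. This assumes a recent harness version where `--seed` controls the random seeds; the directory names are arbitrary:

```bash
# Repeat the same evaluation with three seeds and keep the results separate
for seed in 0 1 2; do
    lm_eval --model hf \
        --model_args pretrained=meta-llama/Llama-2-7b-hf \
        --tasks gsm8k \
        --num_fewshot 5 \
        --seed "$seed" \
        --output_path "results/gsm8k-seed-$seed"
done
```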

## References

- Task list: `lm_eval --tasks list`
- Task README: `lm_eval/tasks/README.md`
- Papers: See individual benchmark papers