# Benchmark Guide Complete guide to all 60+ evaluation tasks in lm-evaluation-harness, what they measure, and how to interpret results. ## Overview The lm-evaluation-harness includes 60+ benchmarks spanning: - Language understanding (MMLU, GLUE) - Mathematical reasoning (GSM8K, MATH) - Code generation (HumanEval, MBPP) - Instruction following (IFEval, AlpacaEval) - Long-context understanding (LongBench) - Multilingual capabilities (AfroBench, NorEval) - Reasoning (BBH, ARC) - Truthfulness (TruthfulQA) **List all tasks**: ```bash lm_eval --tasks list ``` ## Major Benchmarks ### MMLU (Massive Multitask Language Understanding) **What it measures**: Broad knowledge across 57 subjects (STEM, humanities, social sciences, law). **Task variants**: - `mmlu`: Original 57-subject benchmark - `mmlu_pro`: More challenging version with reasoning-focused questions - `mmlu_prox`: Multilingual extension **Format**: Multiple choice (4 options) **Example**: ``` Question: What is the capital of France? A. Berlin B. Paris C. London D. Madrid Answer: B ``` **Command**: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-hf \ --tasks mmlu \ --num_fewshot 5 ``` **Interpretation**: - Random: 25% (chance) - GPT-3 (175B): 43.9% - GPT-4: 86.4% - Human expert: ~90% **Good for**: Assessing general knowledge and domain expertise. ### GSM8K (Grade School Math 8K) **What it measures**: Mathematical reasoning on grade-school level word problems. **Task variants**: - `gsm8k`: Base task - `gsm8k_cot`: With chain-of-thought prompting - `gsm_plus`: Adversarial variant with perturbations **Format**: Free-form generation, extract numerical answer **Example**: ``` Question: A baker made 200 cookies. He sold 3/5 of them in the morning and 1/4 of the remaining in the afternoon. How many cookies does he have left? Answer: 60 ``` **Command**: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-hf \ --tasks gsm8k \ --num_fewshot 5 ``` **Interpretation**: - Random: ~0% - GPT-3 (175B): 17.0% - GPT-4: 92.0% - Llama 2 70B: 56.8% **Good for**: Testing multi-step reasoning and arithmetic. ### HumanEval **What it measures**: Python code generation from docstrings (functional correctness). **Task variants**: - `humaneval`: Standard benchmark - `humaneval_instruct`: For instruction-tuned models **Format**: Code generation, execution-based evaluation **Example**: ```python def has_close_elements(numbers: List[float], threshold: float) -> bool: """ Check if in given list of numbers, are any two numbers closer to each other than given threshold. >>> has_close_elements([1.0, 2.0, 3.0], 0.5) False >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) True """ ``` **Command**: ```bash lm_eval --model hf \ --model_args pretrained=codellama/CodeLlama-7b-hf \ --tasks humaneval \ --batch_size 1 ``` **Interpretation**: - Random: 0% - GPT-3 (175B): 0% - Codex: 28.8% - GPT-4: 67.0% - Code Llama 34B: 53.7% **Good for**: Evaluating code generation capabilities. ### BBH (BIG-Bench Hard) **What it measures**: 23 challenging reasoning tasks where models previously failed to beat humans. **Categories**: - Logical reasoning - Math word problems - Social understanding - Algorithmic reasoning **Format**: Multiple choice and free-form **Command**: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-hf \ --tasks bbh \ --num_fewshot 3 ``` **Interpretation**: - Random: ~25% - GPT-3 (175B): 33.9% - PaLM 540B: 58.3% - GPT-4: 86.7% **Good for**: Testing advanced reasoning capabilities. ### IFEval (Instruction-Following Evaluation) **What it measures**: Ability to follow specific, verifiable instructions. **Instruction types**: - Format constraints (e.g., "answer in 3 sentences") - Length constraints (e.g., "use at least 100 words") - Content constraints (e.g., "include the word 'banana'") - Structural constraints (e.g., "use bullet points") **Format**: Free-form generation with rule-based verification **Command**: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \ --tasks ifeval \ --batch_size auto ``` **Interpretation**: - Measures: Instruction adherence (not quality) - GPT-4: 86% instruction following - Claude 2: 84% **Good for**: Evaluating chat/instruct models. ### GLUE (General Language Understanding Evaluation) **What it measures**: Natural language understanding across 9 tasks. **Tasks**: - `cola`: Grammatical acceptability - `sst2`: Sentiment analysis - `mrpc`: Paraphrase detection - `qqp`: Question pairs - `stsb`: Semantic similarity - `mnli`: Natural language inference - `qnli`: Question answering NLI - `rte`: Recognizing textual entailment - `wnli`: Winograd schemas **Command**: ```bash lm_eval --model hf \ --model_args pretrained=bert-base-uncased \ --tasks glue \ --num_fewshot 0 ``` **Interpretation**: - BERT Base: 78.3 (GLUE score) - RoBERTa Large: 88.5 - Human baseline: 87.1 **Good for**: Encoder-only models, fine-tuning baselines. ### LongBench **What it measures**: Long-context understanding (4K-32K tokens). **21 tasks covering**: - Single-document QA - Multi-document QA - Summarization - Few-shot learning - Code completion - Synthetic tasks **Command**: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-hf \ --tasks longbench \ --batch_size 1 ``` **Interpretation**: - Tests context utilization - Many models struggle beyond 4K tokens - GPT-4 Turbo: 54.3% **Good for**: Evaluating long-context models. ## Additional Benchmarks ### TruthfulQA **What it measures**: Model's propensity to be truthful vs. generate plausible-sounding falsehoods. **Format**: Multiple choice with 4-5 options **Command**: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-hf \ --tasks truthfulqa_mc2 \ --batch_size auto ``` **Interpretation**: - Larger models often score worse (more convincing lies) - GPT-3: 58.8% - GPT-4: 59.0% - Human: ~94% ### ARC (AI2 Reasoning Challenge) **What it measures**: Grade-school science questions. **Variants**: - `arc_easy`: Easier questions - `arc_challenge`: Harder questions requiring reasoning **Command**: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-hf \ --tasks arc_challenge \ --num_fewshot 25 ``` **Interpretation**: - ARC-Easy: Most models >80% - ARC-Challenge random: 25% - GPT-4: 96.3% ### HellaSwag **What it measures**: Commonsense reasoning about everyday situations. **Format**: Choose most plausible continuation **Command**: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-hf \ --tasks hellaswag \ --num_fewshot 10 ``` **Interpretation**: - Random: 25% - GPT-3: 78.9% - Llama 2 70B: 85.3% ### WinoGrande **What it measures**: Commonsense reasoning via pronoun resolution. **Example**: ``` The trophy doesn't fit in the brown suitcase because _ is too large. A. the trophy B. the suitcase ``` **Command**: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-hf \ --tasks winogrande \ --num_fewshot 5 ``` ### PIQA **What it measures**: Physical commonsense reasoning. **Example**: "To clean a keyboard, use compressed air or..." **Command**: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-hf \ --tasks piqa ``` ## Multilingual Benchmarks ### AfroBench **What it measures**: Performance across 64 African languages. **15 tasks**: NLU, text generation, knowledge, QA, math reasoning **Command**: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-hf \ --tasks afrobench ``` ### NorEval **What it measures**: Norwegian language understanding (9 task categories). **Command**: ```bash lm_eval --model hf \ --model_args pretrained=NbAiLab/nb-gpt-j-6B \ --tasks noreval ``` ## Domain-Specific Benchmarks ### MATH **What it measures**: High-school competition math problems. **Command**: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-hf \ --tasks math \ --num_fewshot 4 ``` **Interpretation**: - Very challenging - GPT-4: 42.5% - Minerva 540B: 33.6% ### MBPP (Mostly Basic Python Problems) **What it measures**: Python programming from natural language descriptions. **Command**: ```bash lm_eval --model hf \ --model_args pretrained=codellama/CodeLlama-7b-hf \ --tasks mbpp \ --batch_size 1 ``` ### DROP **What it measures**: Reading comprehension requiring discrete reasoning. **Command**: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-hf \ --tasks drop ``` ## Benchmark Selection Guide ### For General Purpose Models Run this suite: ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-hf \ --tasks mmlu,gsm8k,hellaswag,arc_challenge,truthfulqa_mc2 \ --num_fewshot 5 ``` ### For Code Models ```bash lm_eval --model hf \ --model_args pretrained=codellama/CodeLlama-7b-hf \ --tasks humaneval,mbpp \ --batch_size 1 ``` ### For Chat/Instruct Models ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \ --tasks ifeval,mmlu,gsm8k_cot \ --batch_size auto ``` ### For Long Context Models ```bash lm_eval --model hf \ --model_args pretrained=meta-llama/Llama-3.1-8B \ --tasks longbench \ --batch_size 1 ``` ## Interpreting Results ### Understanding Metrics **Accuracy**: Percentage of correct answers (most common) **Exact Match (EM)**: Requires exact string match (strict) **F1 Score**: Balances precision and recall **BLEU/ROUGE**: Text generation similarity **Pass@k**: Percentage passing when generating k samples ### Typical Score Ranges | Model Size | MMLU | GSM8K | HumanEval | HellaSwag | |------------|------|-------|-----------|-----------| | 7B | 40-50% | 10-20% | 5-15% | 70-80% | | 13B | 45-55% | 20-35% | 15-25% | 75-82% | | 70B | 60-70% | 50-65% | 35-50% | 82-87% | | GPT-4 | 86% | 92% | 67% | 95% | ### Red Flags - **All tasks at random chance**: Model not trained properly - **Exact 0% on generation tasks**: Likely format/parsing issue - **Huge variance across runs**: Check seed/sampling settings - **Better than GPT-4 on everything**: Likely contamination ## Best Practices 1. **Always report few-shot setting**: 0-shot, 5-shot, etc. 2. **Run multiple seeds**: Report mean ± std 3. **Check for data contamination**: Search training data for benchmark examples 4. **Compare to published baselines**: Validate your setup 5. **Report all hyperparameters**: Model, batch size, max tokens, temperature ## References - Task list: `lm_eval --tasks list` - Task README: `lm_eval/tasks/README.md` - Papers: See individual benchmark papers