Sync all skills and memories 2026-04-14 07:27

2026-04-14 07:27:20 +09:00
parent 516bb44fe6
commit 1eba2bca95
386 changed files with 167655 additions and 0 deletions
--- a/skills/mlops/evaluation/lm-evaluation-harness/references/benchmark-guide.md
+++ b/skills/mlops/evaluation/lm-evaluation-harness/references/benchmark-guide.md
@@ -0,0 +1,488 @@
+# Benchmark Guide
+
+Complete guide to all 60+ evaluation tasks in lm-evaluation-harness, what they measure, and how to interpret results.
+
+## Overview
+
+The lm-evaluation-harness includes 60+ benchmarks spanning:
+- Language understanding (MMLU, GLUE)
+- Mathematical reasoning (GSM8K, MATH)
+- Code generation (HumanEval, MBPP)
+- Instruction following (IFEval, AlpacaEval)
+- Long-context understanding (LongBench)
+- Multilingual capabilities (AfroBench, NorEval)
+- Reasoning (BBH, ARC)
+- Truthfulness (TruthfulQA)
+
+**List all tasks**:
+```bash
+lm_eval --tasks list
+```
+
+## Major Benchmarks
+
+### MMLU (Massive Multitask Language Understanding)
+
+**What it measures**: Broad knowledge across 57 subjects (STEM, humanities, social sciences, law).
+
+**Task variants**:
+- `mmlu`: Original 57-subject benchmark
+- `mmlu_pro`: More challenging version with reasoning-focused questions
+- `mmlu_prox`: Multilingual extension
+
+**Format**: Multiple choice (4 options)
+
+**Example**:
+```
+Question: What is the capital of France?
+A. Berlin
+B. Paris
+C. London
+D. Madrid
+Answer: B
+```
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks mmlu \
+  --num_fewshot 5
+```
+
+**Interpretation**:
+- Random: 25% (chance)
+- GPT-3 (175B): 43.9%
+- GPT-4: 86.4%
+- Human expert: ~90%
+
+**Good for**: Assessing general knowledge and domain expertise.
+
+### GSM8K (Grade School Math 8K)
+
+**What it measures**: Mathematical reasoning on grade-school level word problems.
+
+**Task variants**:
+- `gsm8k`: Base task
+- `gsm8k_cot`: With chain-of-thought prompting
+- `gsm_plus`: Adversarial variant with perturbations
+
+**Format**: Free-form generation, extract numerical answer
+
+**Example**:
+```
+Question: A baker made 200 cookies. He sold 3/5 of them in the morning and 1/4 of the remaining in the afternoon. How many cookies does he have left?
+Answer: 60
+```
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks gsm8k \
+  --num_fewshot 5
+```
+
+**Interpretation**:
+- Random: ~0%
+- GPT-3 (175B): 17.0%
+- GPT-4: 92.0%
+- Llama 2 70B: 56.8%
+
+**Good for**: Testing multi-step reasoning and arithmetic.
+
+### HumanEval
+
+**What it measures**: Python code generation from docstrings (functional correctness).
+
+**Task variants**:
+- `humaneval`: Standard benchmark
+- `humaneval_instruct`: For instruction-tuned models
+
+**Format**: Code generation, execution-based evaluation
+
+**Example**:
+```python
+def has_close_elements(numbers: List[float], threshold: float) -> bool:
+    """ Check if in given list of numbers, are any two numbers closer to each other than
+    given threshold.
+    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
+    False
+    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
+    True
+    """
+```
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=codellama/CodeLlama-7b-hf \
+  --tasks humaneval \
+  --batch_size 1
+```
+
+**Interpretation**:
+- Random: 0%
+- GPT-3 (175B): 0%
+- Codex: 28.8%
+- GPT-4: 67.0%
+- Code Llama 34B: 53.7%
+
+**Good for**: Evaluating code generation capabilities.
+
+### BBH (BIG-Bench Hard)
+
+**What it measures**: 23 challenging reasoning tasks where models previously failed to beat humans.
+
+**Categories**:
+- Logical reasoning
+- Math word problems
+- Social understanding
+- Algorithmic reasoning
+
+**Format**: Multiple choice and free-form
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks bbh \
+  --num_fewshot 3
+```
+
+**Interpretation**:
+- Random: ~25%
+- GPT-3 (175B): 33.9%
+- PaLM 540B: 58.3%
+- GPT-4: 86.7%
+
+**Good for**: Testing advanced reasoning capabilities.
+
+### IFEval (Instruction-Following Evaluation)
+
+**What it measures**: Ability to follow specific, verifiable instructions.
+
+**Instruction types**:
+- Format constraints (e.g., "answer in 3 sentences")
+- Length constraints (e.g., "use at least 100 words")
+- Content constraints (e.g., "include the word 'banana'")
+- Structural constraints (e.g., "use bullet points")
+
+**Format**: Free-form generation with rule-based verification
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
+  --tasks ifeval \
+  --batch_size auto
+```
+
+**Interpretation**:
+- Measures: Instruction adherence (not quality)
+- GPT-4: 86% instruction following
+- Claude 2: 84%
+
+**Good for**: Evaluating chat/instruct models.
+
+### GLUE (General Language Understanding Evaluation)
+
+**What it measures**: Natural language understanding across 9 tasks.
+
+**Tasks**:
+- `cola`: Grammatical acceptability
+- `sst2`: Sentiment analysis
+- `mrpc`: Paraphrase detection
+- `qqp`: Question pairs
+- `stsb`: Semantic similarity
+- `mnli`: Natural language inference
+- `qnli`: Question answering NLI
+- `rte`: Recognizing textual entailment
+- `wnli`: Winograd schemas
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=bert-base-uncased \
+  --tasks glue \
+  --num_fewshot 0
+```
+
+**Interpretation**:
+- BERT Base: 78.3 (GLUE score)
+- RoBERTa Large: 88.5
+- Human baseline: 87.1
+
+**Good for**: Encoder-only models, fine-tuning baselines.
+
+### LongBench
+
+**What it measures**: Long-context understanding (4K-32K tokens).
+
+**21 tasks covering**:
+- Single-document QA
+- Multi-document QA
+- Summarization
+- Few-shot learning
+- Code completion
+- Synthetic tasks
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks longbench \
+  --batch_size 1
+```
+
+**Interpretation**:
+- Tests context utilization
+- Many models struggle beyond 4K tokens
+- GPT-4 Turbo: 54.3%
+
+**Good for**: Evaluating long-context models.
+
+## Additional Benchmarks
+
+### TruthfulQA
+
+**What it measures**: Model's propensity to be truthful vs. generate plausible-sounding falsehoods.
+
+**Format**: Multiple choice with 4-5 options
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks truthfulqa_mc2 \
+  --batch_size auto
+```
+
+**Interpretation**:
+- Larger models often score worse (more convincing lies)
+- GPT-3: 58.8%
+- GPT-4: 59.0%
+- Human: ~94%
+
+### ARC (AI2 Reasoning Challenge)
+
+**What it measures**: Grade-school science questions.
+
+**Variants**:
+- `arc_easy`: Easier questions
+- `arc_challenge`: Harder questions requiring reasoning
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks arc_challenge \
+  --num_fewshot 25
+```
+
+**Interpretation**:
+- ARC-Easy: Most models >80%
+- ARC-Challenge random: 25%
+- GPT-4: 96.3%
+
+### HellaSwag
+
+**What it measures**: Commonsense reasoning about everyday situations.
+
+**Format**: Choose most plausible continuation
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks hellaswag \
+  --num_fewshot 10
+```
+
+**Interpretation**:
+- Random: 25%
+- GPT-3: 78.9%
+- Llama 2 70B: 85.3%
+
+### WinoGrande
+
+**What it measures**: Commonsense reasoning via pronoun resolution.
+
+**Example**:
+```
+The trophy doesn't fit in the brown suitcase because _ is too large.
+A. the trophy
+B. the suitcase
+```
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks winogrande \
+  --num_fewshot 5
+```
+
+### PIQA
+
+**What it measures**: Physical commonsense reasoning.
+
+**Example**: "To clean a keyboard, use compressed air or..."
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks piqa
+```
+
+## Multilingual Benchmarks
+
+### AfroBench
+
+**What it measures**: Performance across 64 African languages.
+
+**15 tasks**: NLU, text generation, knowledge, QA, math reasoning
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks afrobench
+```
+
+### NorEval
+
+**What it measures**: Norwegian language understanding (9 task categories).
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=NbAiLab/nb-gpt-j-6B \
+  --tasks noreval
+```
+
+## Domain-Specific Benchmarks
+
+### MATH
+
+**What it measures**: High-school competition math problems.
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks math \
+  --num_fewshot 4
+```
+
+**Interpretation**:
+- Very challenging
+- GPT-4: 42.5%
+- Minerva 540B: 33.6%
+
+### MBPP (Mostly Basic Python Problems)
+
+**What it measures**: Python programming from natural language descriptions.
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=codellama/CodeLlama-7b-hf \
+  --tasks mbpp \
+  --batch_size 1
+```
+
+### DROP
+
+**What it measures**: Reading comprehension requiring discrete reasoning.
+
+**Command**:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks drop
+```
+
+## Benchmark Selection Guide
+
+### For General Purpose Models
+
+Run this suite:
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-hf \
+  --tasks mmlu,gsm8k,hellaswag,arc_challenge,truthfulqa_mc2 \
+  --num_fewshot 5
+```
+
+### For Code Models
+
+```bash
+lm_eval --model hf \
+  --model_args pretrained=codellama/CodeLlama-7b-hf \
+  --tasks humaneval,mbpp \
+  --batch_size 1
+```
+
+### For Chat/Instruct Models
+
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
+  --tasks ifeval,mmlu,gsm8k_cot \
+  --batch_size auto
+```
+
+### For Long Context Models
+
+```bash
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-3.1-8B \
+  --tasks longbench \
+  --batch_size 1
+```
+
+## Interpreting Results
+
+### Understanding Metrics
+
+**Accuracy**: Percentage of correct answers (most common)
+
+**Exact Match (EM)**: Requires exact string match (strict)
+
+**F1 Score**: Balances precision and recall
+
+**BLEU/ROUGE**: Text generation similarity
+
+**Pass@k**: Percentage passing when generating k samples
+
+### Typical Score Ranges
+
+| Model Size | MMLU | GSM8K | HumanEval | HellaSwag |
+|------------|------|-------|-----------|-----------|
+| 7B | 40-50% | 10-20% | 5-15% | 70-80% |
+| 13B | 45-55% | 20-35% | 15-25% | 75-82% |
+| 70B | 60-70% | 50-65% | 35-50% | 82-87% |
+| GPT-4 | 86% | 92% | 67% | 95% |
+
+### Red Flags
+
+- **All tasks at random chance**: Model not trained properly
+- **Exact 0% on generation tasks**: Likely format/parsing issue
+- **Huge variance across runs**: Check seed/sampling settings
+- **Better than GPT-4 on everything**: Likely contamination
+
+## Best Practices
+
+1. **Always report few-shot setting**: 0-shot, 5-shot, etc.
+2. **Run multiple seeds**: Report mean ± std
+3. **Check for data contamination**: Search training data for benchmark examples
+4. **Compare to published baselines**: Validate your setup
+5. **Report all hyperparameters**: Model, batch size, max tokens, temperature
+
+## References
+
+- Task list: `lm_eval --tasks list`
+- Task README: `lm_eval/tasks/README.md`
+- Papers: See individual benchmark papers