Sync all skills and memories 2026-04-14 07:27

2026-04-14 07:27:20 +09:00
parent 516bb44fe6
commit 1eba2bca95
386 changed files with 167655 additions and 0 deletions
--- a/skills/mlops/evaluation/lm-evaluation-harness/references/api-evaluation.md
+++ b/skills/mlops/evaluation/lm-evaluation-harness/references/api-evaluation.md
@@ -0,0 +1,490 @@
+# API Evaluation
+
+Guide to evaluating OpenAI, Anthropic, and other API-based language models.
+
+## Overview
+
+The lm-evaluation-harness supports evaluating API-based models through a unified `TemplateAPI` interface. This allows benchmarking of:
+- OpenAI models (GPT-4, GPT-3.5, etc.)
+- Anthropic models (Claude 3, Claude 2, etc.)
+- Local OpenAI-compatible APIs
+- Custom API endpoints
+
+**Why evaluate API models**:
+- Benchmark closed-source models
+- Compare API models to open models
+- Validate API performance
+- Track model updates over time
+
+## Supported API Models
+
+| Provider | Model Type | Request Types | Logprobs |
+|----------|------------|---------------|----------|
+| OpenAI (completions) | `openai-completions` | All | ✅ Yes |
+| OpenAI (chat) | `openai-chat-completions` | `generate_until` only | ❌ No |
+| Anthropic (completions) | `anthropic-completions` | `generate_until` only | ❌ No |
+| Anthropic (chat) | `anthropic-chat` | `generate_until` only | ❌ No |
+| Local (OpenAI-compatible) | `local-completions` | Depends on server | Varies |
+
+**Note**: Models without logprobs can only be evaluated on generation tasks, not perplexity or loglikelihood tasks.
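
The rule in the note can be sketched as a simple lookup. This is an illustrative helper, not part of lm-evaluation-harness: `SUPPORTED` and `can_run` are hypothetical names, and the mapping only transcribes three rows of the table above.

```python
# Request types per model type, transcribed from three rows of the table.
# SUPPORTED and can_run are hypothetical names used only for illustration.
SUPPORTED = {
    "openai-completions": {"generate_until", "loglikelihood", "loglikelihood_rolling"},
    "openai-chat-completions": {"generate_until"},
    "anthropic-chat": {"generate_until"},
}


def can_run(model_type: str, request_type: str) -> bool:
    """Return True if the model type is listed as supporting the request type."""
    return request_type in SUPPORTED.get(model_type, set())


for model_type in SUPPORTED:
    ok = can_run(model_type, "loglikelihood")
    print(f"{model_type}: loglikelihood tasks {'supported' if ok else 'not supported'}")
```

A check like this before launching a run avoids paying for a job that fails on the first loglikelihood request.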
+
+## OpenAI Models
+
+### Setup
+
+```bash
+export OPENAI_API_KEY=sk-...
+```
+
+### Completion Models (Legacy)
+
+**Available models**: `davinci-002`, `babbage-002`
+
+```bash
+lm_eval --model openai-completions \
+  --model_args model=davinci-002 \
+  --tasks lambada_openai,hellaswag \
+  --batch_size auto
+```
+
+**Supports**:
+- `generate_until`: ✅
+- `loglikelihood`: ✅
+- `loglikelihood_rolling`: ✅
+
+### Chat Models
+
+**Available models**: `gpt-4`, `gpt-4-turbo`, `gpt-3.5-turbo`
+
+```bash
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
+  --tasks mmlu,gsm8k,humaneval \
+  --num_fewshot 5 \
+  --batch_size auto
+```
+
+**Supports**:
+- `generate_until`: ✅
+- `loglikelihood`: ❌ (no logprobs)
+- `loglikelihood_rolling`: ❌
+
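+**Important**: Chat models don't return logprobs, so they only work with tasks built on `generate_until` (for example GSM8K and HumanEval). Tasks scored by loglikelihood, including perplexity tasks and the default multiple-choice `mmlu` configuration, need a generative variant of the task instead.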
+
+### Configuration Options
+
+```bash
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo,base_url=https://api.openai.com/v1/chat/completions,num_concurrent=5,max_retries=3,timeout=60,batch_size=auto
+```
+
+**Parameters**:
+- `model`: Model identifier (required)
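+- `base_url`: Full endpoint URL including the route, e.g. `https://api.openai.com/v1/chat/completions` (default: OpenAI's endpoint for the chosen model type)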
+- `num_concurrent`: Number of concurrent requests (default: 1)
+- `max_retries`: Retry failed requests (default: 3)
+- `timeout`: Request timeout in seconds (default: 300 in recent releases; check `lm_eval/models/api_models.py` for your version)
+- `tokenizer`: Tokenizer to use (default: matches model)
+- `tokenizer_backend`: `"tiktoken"` or `"huggingface"`
+
+### Cost Management
+
+OpenAI charges per token. Estimate costs before running:
+
+```python
+# Rough estimate
+num_samples = 1000
+avg_tokens_per_sample = 500  # input + output
+cost_per_1k_tokens = 0.01  # Illustrative blended USD rate; check current pricing for your model
+
+total_cost = (num_samples * avg_tokens_per_sample / 1000) * cost_per_1k_tokens
+print(f"Estimated cost: ${total_cost:.2f}")
+```
+
+**Cost-saving tips**:
+- Use `--limit N` for testing
+- Start with `gpt-3.5-turbo` before `gpt-4`
+- Set `max_gen_toks` to minimum needed
+- Use `num_fewshot=0` for zero-shot when possible
+
+## Anthropic Models
+
+### Setup
+
+```bash
+export ANTHROPIC_API_KEY=sk-ant-...
+```
+
+### Completion Models (Legacy)
+
+```bash
+lm_eval --model anthropic-completions \
+  --model_args model=claude-2.1 \
+  --tasks lambada_openai,hellaswag \
+  --batch_size auto
+```
+
+### Chat Models (Recommended)
+
+**Available models**: `claude-3-5-sonnet-20241022`, `claude-3-opus-20240229`, `claude-3-sonnet-20240229`, `claude-3-haiku-20240307`
+
+```bash
+lm_eval --model anthropic-chat \
+  --model_args model=claude-3-5-sonnet-20241022 \
+  --tasks mmlu,gsm8k,humaneval \
+  --num_fewshot 5 \
+  --batch_size auto
+```
+
+**Aliases**: `anthropic-chat-completions` (same as `anthropic-chat`)
+
+### Configuration Options
+
+```bash
+lm_eval --model anthropic-chat \
+  --model_args model=claude-3-5-sonnet-20241022,base_url=https://api.anthropic.com/v1/messages,num_concurrent=5,max_retries=3,timeout=60
+```
+
+### Cost Management
+
+Anthropic pricing (as of 2024):
+- Claude 3.5 Sonnet: $3.00 / 1M input, $15.00 / 1M output
+- Claude 3 Opus: $15.00 / 1M input, $75.00 / 1M output
+- Claude 3 Haiku: $0.25 / 1M input, $1.25 / 1M output
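
Because input and output tokens are priced differently, a cost estimate is more useful when it keeps the two apart. The sketch below uses the per-million-token prices listed above; the dictionary keys, the function name, and the token counts in the example are illustrative assumptions, not measured values.

```python
# USD per 1M tokens as (input, output), copied from the list above (2024 prices).
PRICES_PER_MILLION = {
    "claude-3-5-sonnet": (3.00, 15.00),
    "claude-3-opus": (15.00, 75.00),
    "claude-3-haiku": (0.25, 1.25),
}


def estimate_cost(model: str, num_samples: int, input_tokens: int, output_tokens: int) -> float:
    """Estimate total USD cost given average input/output tokens per sample."""
    input_rate, output_rate = PRICES_PER_MILLION[model]
    total_input = num_samples * input_tokens
    total_output = num_samples * output_tokens
    return (total_input * input_rate + total_output * output_rate) / 1_000_000


# Assumed workload: 1000 samples, 400 input and 100 output tokens each.
for name in PRICES_PER_MILLION:
    cost = estimate_cost(name, num_samples=1000, input_tokens=400, output_tokens=100)
    print(f"{name}: about ${cost:.2f}")
```

Few-shot prompts mostly inflate the input side, so raising `num_fewshot` increases `input_tokens` far more than `output_tokens`.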
+
+**Budget-friendly strategy**:
+```bash
+# Test on small sample first
+lm_eval --model anthropic-chat \
+  --model_args model=claude-3-haiku-20240307 \
+  --tasks mmlu \
+  --limit 100
+
+# Then run full eval on best model
+lm_eval --model anthropic-chat \
+  --model_args model=claude-3-5-sonnet-20241022 \
+  --tasks mmlu \
+  --num_fewshot 5
+```
+
+## Local OpenAI-Compatible APIs
+
+Many local inference servers expose OpenAI-compatible APIs (vLLM, Text Generation Inference, llama.cpp, Ollama).
+
+### vLLM Local Server
+
+**Start server**:
+```bash
+vllm serve meta-llama/Llama-2-7b-hf \
+  --host 0.0.0.0 \
+  --port 8000
+```
+
+**Evaluate**:
+```bash
+lm_eval --model local-completions \
+  --model_args model=meta-llama/Llama-2-7b-hf,base_url=http://localhost:8000/v1/completions,num_concurrent=1 \
+  --tasks mmlu,gsm8k \
+  --batch_size auto
+```
+
+### Text Generation Inference (TGI)
+
+**Start server**:
+```bash
+docker run --gpus all --shm-size 1g -p 8080:80 \
+  ghcr.io/huggingface/text-generation-inference:latest \
+  --model-id meta-llama/Llama-2-7b-hf
+```
+
+**Evaluate**:
+```bash
+lm_eval --model local-completions \
+  --model_args model=meta-llama/Llama-2-7b-hf,base_url=http://localhost:8080/v1/completions \
+  --tasks hellaswag,arc_challenge
+```
+
+### Ollama
+
+**Start server**:
+```bash
+ollama serve
+ollama pull llama2:7b
+```
+
+**Evaluate**:
+```bash
+lm_eval --model local-completions \
+  --model_args model=llama2:7b,base_url=http://localhost:11434/v1/completions \
+  --tasks mmlu
+```
+
+### llama.cpp Server
+
+**Start server**:
+```bash
+./server -m models/llama-2-7b.gguf --host 0.0.0.0 --port 8080
+```
+
+**Evaluate**:
+```bash
+lm_eval --model local-completions \
+  --model_args model=llama2,base_url=http://localhost:8080/v1/completions \
+  --tasks gsm8k
+```
+
+## Custom API Implementation
+
+For custom API endpoints, subclass `TemplateAPI`:
+
+### Create `my_api.py`
+
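+```python
+from typing import List, Tuple
+
+from lm_eval.models.api_models import TemplateAPI
+
+
+class MyCustomAPI(TemplateAPI):
+    """Custom API model for an endpoint that returns OpenAI-style `choices`."""
+
+    def __init__(self, base_url, api_key, **kwargs):
+        super().__init__(base_url=base_url, **kwargs)
+        # In recent releases the base class's default `header` sends this
+        # as "Authorization: Bearer <key>"; verify against your version.
+        self.api_key = api_key
+
+    def _create_payload(self, messages, *, generate=True, gen_kwargs=None, **kwargs) -> dict:
+        """Create the JSON request body. Credentials belong in the header, not here."""
+        return {"messages": messages, **(gen_kwargs or {})}
+
+    @staticmethod
+    def parse_generations(outputs, **kwargs) -> List[str]:
+        """Extract generated text; `outputs` is decoded JSON (one dict or a list of dicts)."""
+        if not isinstance(outputs, list):
+            outputs = [outputs]
+        return [choice["text"] for out in outputs for choice in out["choices"]]
+
+    @staticmethod
+    def parse_logprobs(outputs, tokens=None, ctxlens=None, **kwargs) -> List[Tuple[float, bool]]:
+        """Return (continuation logprob, is_greedy) pairs for loglikelihood tasks."""
+        # This sketch assumes the API exposes no logprobs, so only generation tasks work.
+        raise NotImplementedError("This API does not expose logprobs.")
+```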
+
+### Register and Use
+
+```python
+from lm_eval import evaluator
+from my_api import MyCustomAPI
+
+model = MyCustomAPI(
+    base_url="https://api.example.com/v1",
+    api_key="your-key"
+)
+
+results = evaluator.simple_evaluate(
+    model=model,
+    tasks=["mmlu", "gsm8k"],
+    num_fewshot=5,
+    batch_size="auto"
+)
+```
+
+## Comparing API and Open Models
+
+### Side-by-Side Evaluation
+
+```bash
+# Evaluate OpenAI GPT-4
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
+  --tasks mmlu,gsm8k,hellaswag \
+  --num_fewshot 5 \
+  --output_path results/gpt4.json
+
+# Evaluate open Llama 2 70B
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Llama-2-70b-hf,dtype=bfloat16 \
+  --tasks mmlu,gsm8k,hellaswag \
+  --num_fewshot 5 \
+  --output_path results/llama2-70b.json
+
+# Compare results
+python scripts/compare_results.py \
+  results/gpt4.json \
+  results/llama2-70b.json
+```
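
A comparison step can also be sketched in a few lines of Python. The sketch assumes each results file holds a top-level `"results"` object mapping task name to metric values, which is how the harness lays out its JSON output in the versions I am aware of; `load_results` and `compare` are hypothetical helper names, and the numbers in the demo are placeholders, not benchmark results.

```python
import json
from pathlib import Path


def load_results(path):
    """Load the per-task metrics from a results JSON file (assumed layout)."""
    return json.loads(Path(path).read_text())["results"]


def compare(results_a, results_b):
    """Return (task, metric, a, b, a - b) for numeric metrics present in both runs."""
    rows = []
    for task in sorted(set(results_a) & set(results_b)):
        for metric, value_a in results_a[task].items():
            value_b = results_b[task].get(metric)
            numeric = isinstance(value_a, (int, float)) and isinstance(value_b, (int, float))
            if numeric and "stderr" not in metric:
                rows.append((task, metric, value_a, value_b, value_a - value_b))
    return rows


# Placeholder data standing in for two loaded result files.
run_a = {"demo_task": {"acc": 0.5, "acc_stderr": 0.1, "alias": "demo_task"}}
run_b = {"demo_task": {"acc": 0.25, "acc_stderr": 0.1, "alias": "demo_task"}}
for task, metric, a, b, delta in compare(run_a, run_b):
    print(f"{task:12s} {metric:10s} {a:.3f} {b:.3f} {delta:+.3f}")
```

Standard-error entries are skipped so that the table shows only the headline metrics; keep them if you want to judge whether a difference is meaningful.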
+
+### Typical Comparisons
+
+| Model | MMLU | GSM8K | HumanEval | Cost |
+|-------|------|-------|-----------|------|
+| GPT-4 | 86.4% | 92.0% | 67.0% | $$$$ |
+| Claude 3 Opus | 86.8% | 95.0% | 84.9% | $$$$ |
+| GPT-3.5 Turbo | 70.0% | 57.1% | 48.1% | $$ |
+| Llama 2 70B | 68.9% | 56.8% | 29.9% | Free (self-host) |
+| Mixtral 8x7B | 70.6% | 58.4% | 40.2% | Free (self-host) |
+
+## Best Practices
+
+### Rate Limiting
+
+Respect API rate limits by lowering concurrency and allowing a longer timeout:
+```bash
+# num_concurrent=3 lowers concurrency; timeout=120 allows slower responses
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo,num_concurrent=3,timeout=120 \
+  --tasks mmlu
+```
+
+### Reproducibility
+
+Set temperature to 0 so results are as repeatable as the API allows:
+```bash
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
+  --tasks mmlu \
+  --gen_kwargs temperature=0.0
+```
+
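+Or, when sampling, pass `seed` if the provider supports it. OpenAI's chat API accepts `seed` on a best-effort basis; Anthropic's Messages API has no such parameter:
+```bash
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
+  --tasks gsm8k \
+  --gen_kwargs temperature=0.7,seed=42
+```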
+
+### Caching
+
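+Response caching is opt-in. Pass `--use_cache <path>` and the harness stores model responses in a SQLite database at that path, so a repeated run reuses them instead of calling the API again:
+```bash
+# First run: makes API calls and fills the cache
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
+  --tasks mmlu \
+  --limit 100 \
+  --use_cache cache/gpt-4-turbo
+
+# Second run: identical requests are served from the cache
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
+  --tasks mmlu \
+  --limit 100 \
+  --use_cache cache/gpt-4-turbo
+```
+
+Cache location: the path given to `--use_cache`.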
+
+### Error Handling
+
+APIs can fail. Use retries:
+```bash
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo,max_retries=5,timeout=120 \
+  --tasks mmlu
+```
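
To make concrete what a retry setting buys, here is a generic retry loop with exponential backoff. It is a sketch of the general technique, not the harness's own implementation; `call_with_retries`, `base_delay`, and the injectable `sleep` argument are assumptions made for illustration.

```python
import time


def call_with_retries(request_fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, retrying up to max_retries times with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error to the caller.
            sleep(base_delay * 2 ** attempt)  # Wait 1x, 2x, 4x, ... base_delay.


# Demo: a request that fails twice before succeeding. Delays are recorded
# instead of slept so the example finishes immediately.
state = {"calls": 0}


def flaky_request():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


recorded_delays = []
print(call_with_retries(flaky_request, sleep=recorded_delays.append), recorded_delays)
```

Doubling the delay after each failure gives a rate-limited or overloaded endpoint time to recover instead of hammering it at a fixed interval.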
+
+## Troubleshooting
+
+### "Authentication failed"
+
+Check API key:
+```bash
+echo $OPENAI_API_KEY  # Should print sk-...
+echo $ANTHROPIC_API_KEY  # Should print sk-ant-...
+```
+
+### "Rate limit exceeded"
+
+Reduce concurrency:
+```bash
+--model_args num_concurrent=1
+```
+
+Or add delays between requests.
+
+### "Timeout error"
+
+Increase timeout:
+```bash
+--model_args timeout=180
+```
+
+### "Model not found"
+
+For local APIs, verify server is running:
+```bash
+curl http://localhost:8000/v1/models
+```
+
+### Cost Runaway
+
+Use `--limit` for testing:
+```bash
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo \
+  --tasks mmlu \
+  --limit 50  # Only 50 samples
+```
+
+## Advanced Features
+
+### Custom Headers
+
+```bash
+lm_eval --model local-completions \
+  --model_args base_url=http://api.example.com/v1/completions,header="Authorization: Bearer token,X-Custom: value"
+```
+
+### Disable SSL Verification (Development Only)
+
+```bash
+lm_eval --model local-completions \
+  --model_args base_url=https://localhost:8000/v1/completions,verify_certificate=false
+```
+
+### Custom Tokenizer
+
+```bash
+lm_eval --model openai-chat-completions \
+  --model_args model=gpt-4-turbo,tokenizer=gpt2,tokenizer_backend=huggingface
+```
+
+## References
+
+- OpenAI API: https://platform.openai.com/docs/api-reference
+- Anthropic API: https://docs.anthropic.com/claude/reference
+- TemplateAPI: `lm_eval/models/api_models.py`
+- OpenAI models: `lm_eval/models/openai_completions.py`
+- Anthropic models: `lm_eval/models/anthropic_llms.py`