API Evaluation
Guide to evaluating OpenAI, Anthropic, and other API-based language models.
Overview
The lm-evaluation-harness supports evaluating API-based models through a unified TemplateAPI interface. This allows benchmarking of:
- OpenAI models (GPT-4, GPT-3.5, etc.)
- Anthropic models (Claude 3, Claude 2, etc.)
- Local OpenAI-compatible APIs
- Custom API endpoints
Why evaluate API models:
- Benchmark closed-source models
- Compare API models to open models
- Validate API performance
- Track model updates over time
Supported API Models
| Provider | Model Type | Request Types | Logprobs |
|---|---|---|---|
| OpenAI (completions) | openai-completions | All | ✅ Yes |
| OpenAI (chat) | openai-chat-completions | generate_until only | ❌ No |
| Anthropic (completions) | anthropic-completions | All | ❌ No |
| Anthropic (chat) | anthropic-chat | generate_until only | ❌ No |
| Local (OpenAI-compatible) | local-completions | Depends on server | Varies |
Note: Models without logprobs can only be evaluated on generation tasks, not perplexity or loglikelihood tasks.
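The note above can be encoded as a small capability lookup when scripting evals across providers. The mapping mirrors the logprobs column of the table; the dictionary and function names here are illustrative, not part of the harness API:

```python
# Which API model types return logprobs (from the table above).
# Task types that need logprobs (loglikelihood, perplexity) can only
# run on model types marked True; generate_until works everywhere.
HAS_LOGPROBS = {
    "openai-completions": True,
    "openai-chat-completions": False,
    "anthropic-completions": False,
    "anthropic-chat": False,
}

def can_run(model_type, request_type):
    """Return True if the model type can serve the given request type."""
    if request_type == "generate_until":
        return True
    return HAS_LOGPROBS.get(model_type, False)
```

For example, `can_run("anthropic-chat", "loglikelihood")` is False, which is why chat models are restricted to generation tasks.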
OpenAI Models
Setup
export OPENAI_API_KEY=sk-...
Completion Models (Legacy)
Available models: davinci-002, babbage-002
lm_eval --model openai-completions \
--model_args model=davinci-002 \
--tasks lambada_openai,hellaswag \
--batch_size auto
Supports:
- generate_until: ✅
- loglikelihood: ✅
- loglikelihood_rolling: ✅
Chat Models
Available models: gpt-4, gpt-4-turbo, gpt-3.5-turbo
lm_eval --model openai-chat-completions \
--model_args model=gpt-4-turbo \
--tasks mmlu,gsm8k,humaneval \
--num_fewshot 5 \
--batch_size auto
Supports:
- generate_until: ✅
- loglikelihood: ❌ (no logprobs)
- loglikelihood_rolling: ❌
Important: Chat models don't provide logprobs, so they can only be used with generation tasks (MMLU, GSM8K, HumanEval), not perplexity tasks.
Configuration Options
lm_eval --model openai-chat-completions \
--model_args \
model=gpt-4-turbo,\
base_url=https://api.openai.com/v1,\
num_concurrent=5,\
max_retries=3,\
timeout=60,\
batch_size=auto
Parameters:
- model: Model identifier (required)
- base_url: API endpoint (default: OpenAI)
- num_concurrent: Concurrent requests (default: 5)
- max_retries: Retry failed requests (default: 3)
- timeout: Request timeout in seconds (default: 60)
- tokenizer: Tokenizer to use (default: matches model)
- tokenizer_backend: "tiktoken" or "huggingface"
Cost Management
OpenAI charges per token. Estimate costs before running:
# Rough estimate
num_samples = 1000
avg_tokens_per_sample = 500 # input + output
cost_per_1k_tokens = 0.01 # GPT-3.5 Turbo
total_cost = (num_samples * avg_tokens_per_sample / 1000) * cost_per_1k_tokens
print(f"Estimated cost: ${total_cost:.2f}")
Cost-saving tips:
- Use --limit N for testing
- Start with gpt-3.5-turbo before gpt-4
- Set max_gen_toks to the minimum needed
- Use num_fewshot=0 for zero-shot when possible
Anthropic Models
Setup
export ANTHROPIC_API_KEY=sk-ant-...
Completion Models (Legacy)
lm_eval --model anthropic-completions \
--model_args model=claude-2.1 \
--tasks lambada_openai,hellaswag \
--batch_size auto
Chat Models (Recommended)
Available models: claude-3-5-sonnet-20241022, claude-3-opus-20240229, claude-3-sonnet-20240229, claude-3-haiku-20240307
lm_eval --model anthropic-chat \
--model_args model=claude-3-5-sonnet-20241022 \
--tasks mmlu,gsm8k,humaneval \
--num_fewshot 5 \
--batch_size auto
Alias: anthropic-chat-completions (equivalent to anthropic-chat)
Configuration Options
lm_eval --model anthropic-chat \
--model_args \
model=claude-3-5-sonnet-20241022,\
base_url=https://api.anthropic.com,\
num_concurrent=5,\
max_retries=3,\
timeout=60
Cost Management
Anthropic pricing (as of 2024):
- Claude 3.5 Sonnet: $3.00 / 1M input, $15.00 / 1M output
- Claude 3 Opus: $15.00 / 1M input, $75.00 / 1M output
- Claude 3 Haiku: $0.25 / 1M input, $1.25 / 1M output
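Using the per-token prices above, a quick sketch for comparing expected eval cost across the Claude models. The token counts per sample are illustrative assumptions, not measured values:

```python
# Rough cost comparison across Claude models, using the per-million-token
# prices listed above (USD input, USD output).
PRICES = {
    "claude-3-5-sonnet": (3.00, 15.00),
    "claude-3-opus": (15.00, 75.00),
    "claude-3-haiku": (0.25, 1.25),
}

def estimate_cost(model, num_samples, in_tokens, out_tokens):
    """Estimated USD cost for an eval run with the given per-sample tokens."""
    in_price, out_price = PRICES[model]
    per_sample = in_tokens * in_price + out_tokens * out_price
    return num_samples * per_sample / 1_000_000

for model in PRICES:
    cost = estimate_cost(model, num_samples=1000, in_tokens=400, out_tokens=100)
    print(f"{model}: ${cost:.2f}")
```

At these assumed token counts, a 1000-sample run on Haiku costs well under a dollar, which is why the budget-friendly strategy below tests on Haiku first.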
Budget-friendly strategy:
# Test on small sample first
lm_eval --model anthropic-chat \
--model_args model=claude-3-haiku-20240307 \
--tasks mmlu \
--limit 100
# Then run full eval on best model
lm_eval --model anthropic-chat \
--model_args model=claude-3-5-sonnet-20241022 \
--tasks mmlu \
--num_fewshot 5
Local OpenAI-Compatible APIs
Many local inference servers expose OpenAI-compatible APIs (vLLM, Text Generation Inference, llama.cpp, Ollama).
vLLM Local Server
Start server:
vllm serve meta-llama/Llama-2-7b-hf \
--host 0.0.0.0 \
--port 8000
Evaluate:
lm_eval --model local-completions \
--model_args \
model=meta-llama/Llama-2-7b-hf,\
base_url=http://localhost:8000/v1,\
num_concurrent=1 \
--tasks mmlu,gsm8k \
--batch_size auto
Text Generation Inference (TGI)
Start server:
docker run --gpus all --shm-size 1g -p 8080:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-2-7b-hf
Evaluate:
lm_eval --model local-completions \
--model_args \
model=meta-llama/Llama-2-7b-hf,\
base_url=http://localhost:8080/v1 \
--tasks hellaswag,arc_challenge
Ollama
Start server:
ollama serve
ollama pull llama2:7b
Evaluate:
lm_eval --model local-completions \
--model_args \
model=llama2:7b,\
base_url=http://localhost:11434/v1 \
--tasks mmlu
llama.cpp Server
Start server:
./server -m models/llama-2-7b.gguf --host 0.0.0.0 --port 8080
Evaluate:
lm_eval --model local-completions \
--model_args \
model=llama2,\
base_url=http://localhost:8080/v1 \
--tasks gsm8k
Custom API Implementation
For custom API endpoints, subclass TemplateAPI:
Create my_api.py
from lm_eval.models.api_models import TemplateAPI

class MyCustomAPI(TemplateAPI):
    """Custom API model."""

    def __init__(self, base_url, api_key, **kwargs):
        super().__init__(base_url=base_url, **kwargs)
        self.api_key = api_key

    def _create_payload(self, messages, gen_kwargs):
        """Create API request payload."""
        return {
            "messages": messages,
            "api_key": self.api_key,
            **gen_kwargs,
        }

    def parse_generations(self, response):
        """Parse generation response."""
        return response.json()["choices"][0]["text"]

    def parse_logprobs(self, response):
        """Parse logprobs (if available)."""
        # Return None if the API doesn't provide logprobs
        logprobs = response.json().get("logprobs")
        if logprobs:
            return logprobs["token_logprobs"]
        return None
Register and Use
from lm_eval import evaluator
from my_api import MyCustomAPI
model = MyCustomAPI(
base_url="https://api.example.com/v1",
api_key="your-key"
)
results = evaluator.simple_evaluate(
model=model,
tasks=["mmlu", "gsm8k"],
num_fewshot=5,
batch_size="auto"
)
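simple_evaluate returns a nested dict whose "results" key maps task names to metric dicts. A minimal sketch of summarizing it follows; the exact metric key names (e.g. "acc,none") vary across harness versions, so the mock payload here is an illustration, not a guaranteed schema:

```python
# Sketch: extracting per-task scores from an lm-eval results dict.
# Filters to numeric metrics, dropping bookkeeping fields like "alias".
def summarize(results):
    """Return {task: {metric: value}} for numeric metrics only."""
    summary = {}
    for task, metrics in results["results"].items():
        summary[task] = {
            k: v for k, v in metrics.items() if isinstance(v, (int, float))
        }
    return summary

# Mock payload in the shape described above (metric names assumed):
mock = {"results": {"mmlu": {"acc,none": 0.70, "alias": "mmlu"},
                    "gsm8k": {"exact_match,none": 0.57, "alias": "gsm8k"}}}
print(summarize(mock))
```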
Comparing API and Open Models
Side-by-Side Evaluation
# Evaluate OpenAI GPT-4
lm_eval --model openai-chat-completions \
--model_args model=gpt-4-turbo \
--tasks mmlu,gsm8k,hellaswag \
--num_fewshot 5 \
--output_path results/gpt4.json
# Evaluate open Llama 2 70B
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-70b-hf,dtype=bfloat16 \
--tasks mmlu,gsm8k,hellaswag \
--num_fewshot 5 \
--output_path results/llama2-70b.json
# Compare results
python scripts/compare_results.py \
results/gpt4.json \
results/llama2-70b.json
Typical Comparisons
| Model | MMLU | GSM8K | HumanEval | Cost |
|---|---|---|---|---|
| GPT-4 Turbo | 86.4% | 92.0% | 67.0% | |
| Claude 3 Opus | 86.8% | 95.0% | 84.9% | |
| GPT-3.5 Turbo | 70.0% | 57.1% | 48.1% | $$ |
| Llama 2 70B | 68.9% | 56.8% | 29.9% | Free (self-host) |
| Mixtral 8x7B | 70.6% | 58.4% | 40.2% | Free (self-host) |
Best Practices
Rate Limiting
Respect API rate limits:
lm_eval --model openai-chat-completions \
--model_args \
model=gpt-4-turbo,\
num_concurrent=3,\
timeout=120 \
--tasks mmlu
Lowering num_concurrent and raising timeout keeps requests under the provider's rate limits.
Reproducibility
Set temperature to 0 for deterministic results:
lm_eval --model openai-chat-completions \
--model_args model=gpt-4-turbo \
--tasks mmlu \
--gen_kwargs temperature=0.0
Or use seed for sampling:
lm_eval --model anthropic-chat \
--model_args model=claude-3-5-sonnet-20241022 \
--tasks gsm8k \
--gen_kwargs temperature=0.7,seed=42
Caching
API models automatically cache responses to avoid redundant calls:
# First run: makes API calls
lm_eval --model openai-chat-completions \
--model_args model=gpt-4-turbo \
--tasks mmlu \
--limit 100
# Second run: uses cache (instant, free)
lm_eval --model openai-chat-completions \
--model_args model=gpt-4-turbo \
--tasks mmlu \
--limit 100
Cache location: ~/.cache/lm_eval/
Error Handling
APIs can fail. Use retries:
lm_eval --model openai-chat-completions \
--model_args \
model=gpt-4-turbo,\
max_retries=5,\
timeout=120 \
--tasks mmlu
Troubleshooting
"Authentication failed"
Check API key:
echo $OPENAI_API_KEY # Should print sk-...
echo $ANTHROPIC_API_KEY # Should print sk-ant-...
"Rate limit exceeded"
Reduce concurrency:
--model_args num_concurrent=1
Or add delays between requests.
"Timeout error"
Increase timeout:
--model_args timeout=180
"Model not found"
For local APIs, verify server is running:
curl http://localhost:8000/v1/models
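The same check can be scripted in Python, which is handy before kicking off a long run. The /v1/models response shape ({"data": [{"id": ...}]}) follows the OpenAI API convention; a nonstandard server may differ, and check_server is a hypothetical helper name:

```python
# Sketch: confirm a local OpenAI-compatible server is up and list the
# model IDs it serves, parsing the OpenAI-style /v1/models payload.
import json
from urllib.request import urlopen

def model_ids(payload):
    """Extract model IDs from an OpenAI-style /v1/models payload."""
    return [entry["id"] for entry in payload.get("data", [])]

def check_server(base_url="http://localhost:8000/v1"):
    """Fetch /models from a running server and return its model IDs."""
    with urlopen(f"{base_url}/models", timeout=5) as resp:
        return model_ids(json.load(resp))

# Offline example of the parsing step alone:
sample = {"object": "list", "data": [{"id": "meta-llama/Llama-2-7b-hf"}]}
print(model_ids(sample))  # → ['meta-llama/Llama-2-7b-hf']
```

If the model name you pass to --model_args doesn't appear in this list, the server will reject requests with "Model not found".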
Cost Runaway
Use --limit for testing:
lm_eval --model openai-chat-completions \
--model_args model=gpt-4-turbo \
--tasks mmlu \
--limit 50 # Only 50 samples
Advanced Features
Custom Headers
lm_eval --model local-completions \
--model_args \
base_url=http://api.example.com/v1,\
header="Authorization: Bearer token,X-Custom: value"
Disable SSL Verification (Development Only)
lm_eval --model local-completions \
--model_args \
base_url=https://localhost:8000/v1,\
verify_certificate=false
Custom Tokenizer
lm_eval --model openai-chat-completions \
--model_args \
model=gpt-4-turbo,\
tokenizer=gpt2,\
tokenizer_backend=huggingface
References
- OpenAI API: https://platform.openai.com/docs/api-reference
- Anthropic API: https://docs.anthropic.com/claude/reference
- TemplateAPI: lm_eval/models/api_models.py
- OpenAI models: lm_eval/models/openai_completions.py
- Anthropic models: lm_eval/models/anthropic_llms.py