API Evaluation

Guide to evaluating OpenAI, Anthropic, and other API-based language models.

Overview

The lm-evaluation-harness supports evaluating API-based models through a unified TemplateAPI interface. This allows benchmarking of:

  • OpenAI models (GPT-4, GPT-3.5, etc.)
  • Anthropic models (Claude 3, Claude 2, etc.)
  • Local OpenAI-compatible APIs
  • Custom API endpoints

Why evaluate API models:

  • Benchmark closed-source models
  • Compare API models to open models
  • Validate API performance
  • Track model updates over time

Supported API Models

Provider                    Model Type                 Request Types         Logprobs
OpenAI (completions)        openai-completions         All                   Yes
OpenAI (chat)               openai-chat-completions    generate_until only   No
Anthropic (completions)     anthropic-completions      generate_until only   No
Anthropic (chat)            anthropic-chat             generate_until only   No
Local (OpenAI-compatible)   local-completions          Depends on server     Varies

Note: Models without logprobs can only be evaluated on generation tasks, not perplexity or loglikelihood tasks.

OpenAI Models

Setup

export OPENAI_API_KEY=sk-...

Completion Models (Legacy)

Available models: davinci-002, babbage-002

lm_eval --model openai-completions \
  --model_args model=davinci-002 \
  --tasks lambada_openai,hellaswag \
  --batch_size auto

Supports:

  • generate_until
  • loglikelihood
  • loglikelihood_rolling

Chat Models

Available models: gpt-4, gpt-4-turbo, gpt-3.5-turbo

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu,gsm8k,humaneval \
  --num_fewshot 5 \
  --batch_size auto

Supports:

  • generate_until
  • loglikelihood: not supported (no logprobs)
  • loglikelihood_rolling: not supported (no logprobs)

Important: Chat models don't provide logprobs, so they can only be used with generation tasks (MMLU, GSM8K, HumanEval), not perplexity tasks.

Configuration Options

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo,base_url=https://api.openai.com/v1,num_concurrent=5,max_retries=3,timeout=60 \
  --batch_size auto

Parameters:

  • model: Model identifier (required)
  • base_url: API endpoint (default: OpenAI)
  • num_concurrent: Concurrent requests (default: 5)
  • max_retries: Retry failed requests (default: 3)
  • timeout: Request timeout in seconds (default: 60)
  • tokenizer: Tokenizer to use (default: matches model)
  • tokenizer_backend: "tiktoken" or "huggingface"
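
The same options can be passed from Python. A minimal sketch using the evaluator API (the same comma-separated model_args string as the CLI; assumes OPENAI_API_KEY is set in the environment):

from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="openai-chat-completions",
    model_args="model=gpt-4-turbo,num_concurrent=5,max_retries=3,timeout=60",
    tasks=["mmlu"],
    num_fewshot=5,
)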

Cost Management

OpenAI charges per token. Estimate costs before running:

# Rough estimate
num_samples = 1000
avg_tokens_per_sample = 500  # input + output
cost_per_1k_tokens = 0.01  # example rate in $ per 1K tokens; check current pricing

total_cost = (num_samples * avg_tokens_per_sample / 1000) * cost_per_1k_tokens
print(f"Estimated cost: ${total_cost:.2f}")

Cost-saving tips:

  • Use --limit N for testing
  • Start with gpt-3.5-turbo before gpt-4
  • Set max_gen_toks to minimum needed
  • Use num_fewshot=0 for zero-shot when possible

Anthropic Models

Setup

export ANTHROPIC_API_KEY=sk-ant-...

Completion Models (Legacy)

lm_eval --model anthropic-completions \
  --model_args model=claude-2.1 \
  --tasks lambada_openai,hellaswag \
  --batch_size auto

Chat Models

Available models: claude-3-5-sonnet-20241022, claude-3-opus-20240229, claude-3-sonnet-20240229, claude-3-haiku-20240307

lm_eval --model anthropic-chat \
  --model_args model=claude-3-5-sonnet-20241022 \
  --tasks mmlu,gsm8k,humaneval \
  --num_fewshot 5 \
  --batch_size auto

Aliases: anthropic-chat-completions (same as anthropic-chat)

Configuration Options

lm_eval --model anthropic-chat \
  --model_args model=claude-3-5-sonnet-20241022,base_url=https://api.anthropic.com,num_concurrent=5,max_retries=3,timeout=60

Cost Management

Anthropic pricing (as of 2024):

  • Claude 3.5 Sonnet: $3.00 / 1M input, $15.00 / 1M output
  • Claude 3 Opus: $15.00 / 1M input, $75.00 / 1M output
  • Claude 3 Haiku: $0.25 / 1M input, $1.25 / 1M output
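
The same back-of-the-envelope arithmetic as above applies, with input and output tokens priced separately. A sketch using the Claude 3.5 Sonnet rates and assumed token counts (adjust both to your task):

num_samples = 1000
input_tokens_per_sample = 400   # assumed prompt length
output_tokens_per_sample = 100  # assumed completion length

input_rate, output_rate = 3.00, 15.00  # $ per 1M tokens (Claude 3.5 Sonnet)

cost = (
    (num_samples * input_tokens_per_sample / 1e6) * input_rate
    + (num_samples * output_tokens_per_sample / 1e6) * output_rate
)
print(f"Estimated cost: ${cost:.2f}")  # $2.70 for these assumptions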

Budget-friendly strategy:

# Test on small sample first
lm_eval --model anthropic-chat \
  --model_args model=claude-3-haiku-20240307 \
  --tasks mmlu \
  --limit 100

# Then run full eval on best model
lm_eval --model anthropic-chat \
  --model_args model=claude-3-5-sonnet-20241022 \
  --tasks mmlu \
  --num_fewshot 5

Local OpenAI-Compatible APIs

Many local inference servers expose OpenAI-compatible APIs (vLLM, Text Generation Inference, llama.cpp, Ollama).

vLLM Local Server

Start server:

vllm serve meta-llama/Llama-2-7b-hf \
  --host 0.0.0.0 \
  --port 8000

Evaluate:

lm_eval --model local-completions \
  --model_args model=meta-llama/Llama-2-7b-hf,base_url=http://localhost:8000/v1,num_concurrent=1 \
  --tasks mmlu,gsm8k \
  --batch_size auto

Text Generation Inference (TGI)

Start server:

docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-7b-hf

Evaluate:

lm_eval --model local-completions \
  --model_args model=meta-llama/Llama-2-7b-hf,base_url=http://localhost:8080/v1 \
  --tasks hellaswag,arc_challenge

Ollama

Start server:

ollama serve
ollama pull llama2:7b

Evaluate:

lm_eval --model local-completions \
  --model_args model=llama2:7b,base_url=http://localhost:11434/v1 \
  --tasks mmlu

llama.cpp Server

Start server:

./server -m models/llama-2-7b.gguf --host 0.0.0.0 --port 8080

Evaluate:

lm_eval --model local-completions \
  --model_args model=llama2,base_url=http://localhost:8080/v1 \
  --tasks gsm8k

Custom API Implementation

For custom API endpoints, subclass TemplateAPI:

Create my_api.py

from lm_eval.models.api_models import TemplateAPI

class MyCustomAPI(TemplateAPI):
    """Custom API model."""

    def __init__(self, base_url, api_key, **kwargs):
        super().__init__(base_url=base_url, **kwargs)
        self.api_key = api_key

    def _create_payload(self, messages, gen_kwargs):
        """Build the JSON payload for one API request."""
        # Note: many APIs expect the key in an Authorization header
        # rather than the payload; adjust to match your endpoint.
        return {
            "messages": messages,
            "api_key": self.api_key,
            **gen_kwargs,
        }

    def parse_generations(self, response):
        """Extract the generated text from the API response."""
        return response.json()["choices"][0]["text"]

    def parse_logprobs(self, response):
        """Extract token logprobs; return None if the API has none."""
        logprobs = response.json().get("logprobs")
        if logprobs:
            return logprobs["token_logprobs"]
        return None

Register and Use

from lm_eval import evaluator
from my_api import MyCustomAPI

model = MyCustomAPI(
    base_url="https://api.example.com/v1",
    api_key="your-key"
)

results = evaluator.simple_evaluate(
    model=model,
    tasks=["mmlu", "gsm8k"],
    num_fewshot=5,
    batch_size="auto"
)
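
simple_evaluate returns a dictionary with per-task metrics under the "results" key. Metric names vary by task, so the simplest inspection is to print whatever each task reports:

for task_name, metrics in results["results"].items():
    print(task_name)
    for metric, value in metrics.items():
        print(f"  {metric}: {value}")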

Comparing API and Open Models

Side-by-Side Evaluation

# Evaluate OpenAI GPT-4
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu,gsm8k,hellaswag \
  --num_fewshot 5 \
  --output_path results/gpt4.json

# Evaluate open Llama 2 70B
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-70b-hf,dtype=bfloat16 \
  --tasks mmlu,gsm8k,hellaswag \
  --num_fewshot 5 \
  --output_path results/llama2-70b.json

# Compare results
python scripts/compare_results.py \
  results/gpt4.json \
  results/llama2-70b.json
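
If your checkout doesn't include scripts/compare_results.py, a minimal stand-in is easy to write: the output JSON stores per-task metrics under the "results" key, so comparing two runs is a dictionary walk.

import json
import sys

# Usage: python compare_results.py results/a.json results/b.json
path_a, path_b = sys.argv[1], sys.argv[2]
with open(path_a) as fa, open(path_b) as fb:
    results_a = json.load(fa)["results"]
    results_b = json.load(fb)["results"]

# Compare only the tasks and metrics present in both runs
for task in sorted(set(results_a) & set(results_b)):
    print(task)
    for metric in sorted(set(results_a[task]) & set(results_b[task])):
        print(f"  {metric}: {results_a[task][metric]}  vs  {results_b[task][metric]}")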

Typical Comparisons

Model           MMLU    GSM8K   HumanEval   Cost
GPT-4 Turbo     86.4%   92.0%   67.0%       $$$
Claude 3 Opus   86.8%   95.0%   84.9%       $$$
GPT-3.5 Turbo   70.0%   57.1%   48.1%       $$
Llama 2 70B     68.9%   56.8%   29.9%       Free (self-host)
Mixtral 8x7B    70.6%   58.4%   40.2%       Free (self-host)

Best Practices

Rate Limiting

Respect API rate limits:

# Lower concurrency (num_concurrent=3) and a longer timeout (120 s)
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo,num_concurrent=3,timeout=120 \
  --tasks mmlu

Reproducibility

Set temperature to 0 for near-deterministic results (API backends can still show small run-to-run variation):

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu \
  --gen_kwargs temperature=0.0

Or pass a sampling seed where the provider supports one (OpenAI does; the Anthropic API has no seed parameter):

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks gsm8k \
  --gen_kwargs temperature=0.7,seed=42

Caching

Responses can be cached to avoid redundant (and costly) API calls: pass --use_cache with a path to a SQLite database file:

# First run: makes API calls and populates the cache
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu \
  --limit 100 \
  --use_cache ~/.cache/lm_eval/gpt4

# Second run with identical arguments: served from the cache (fast, free)
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu \
  --limit 100 \
  --use_cache ~/.cache/lm_eval/gpt4

The cache is written to the path you pass to --use_cache.

Error Handling

APIs can fail. Use retries:

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo,max_retries=5,timeout=120 \
  --tasks mmlu
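
Retries with backoff are also worth building into any custom TemplateAPI subclass. A generic sketch (the URL and payload here are placeholders, not part of the harness):

import time
import requests

def post_with_retries(url, payload, max_retries=5, timeout=120):
    """POST with exponential backoff on transient failures."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(url, json=payload, timeout=timeout)
            if resp.status_code == 429:  # rate limited: back off, then retry
                raise requests.HTTPError("rate limited", response=resp)
            resp.raise_for_status()
            return resp
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError):
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...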

Troubleshooting

"Authentication failed"

Check API key:

echo $OPENAI_API_KEY  # Should print sk-...
echo $ANTHROPIC_API_KEY  # Should print sk-ant-...

"Rate limit exceeded"

Reduce concurrency:

--model_args num_concurrent=1

Or add delays between requests.

"Timeout error"

Increase timeout:

--model_args timeout=180

"Model not found"

For local APIs, verify server is running:

curl http://localhost:8000/v1/models

Cost Runaway

Use --limit for testing:

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu \
  --limit 50  # Only 50 samples

Advanced Features

Custom Headers

lm_eval --model local-completions \
  --model_args base_url=http://api.example.com/v1,header="Authorization: Bearer token,X-Custom: value"

Disable SSL Verification (Development Only)

lm_eval --model local-completions \
  --model_args base_url=https://localhost:8000/v1,verify_certificate=false

Custom Tokenizer

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo,tokenizer=gpt2,tokenizer_backend=huggingface
