API Evaluation

Guide to evaluating OpenAI, Anthropic, and other API-based language models.

Overview

The lm-evaluation-harness supports evaluating API-based models through a unified TemplateAPI interface. This allows benchmarking of:

  • OpenAI models (GPT-4, GPT-3.5, etc.)
  • Anthropic models (Claude 3, Claude 2, etc.)
  • Local OpenAI-compatible APIs
  • Custom API endpoints

Why evaluate API models:

  • Benchmark closed-source models
  • Compare API models to open models
  • Validate API performance
  • Track model updates over time

Supported API Models

Provider                    Model Type                 Request Types         Logprobs
OpenAI (completions)        openai-completions         All                   Yes
OpenAI (chat)               openai-chat-completions    generate_until only   No
Anthropic (completions)     anthropic-completions      generate_until only   No
Anthropic (chat)            anthropic-chat             generate_until only   No
Local (OpenAI-compatible)   local-completions          Depends on server     Varies

Note: Models without logprobs can only be evaluated on generation tasks, not perplexity or loglikelihood tasks.

OpenAI Models

Setup

export OPENAI_API_KEY=sk-...

Completion Models (Legacy)

Available models: davinci-002, babbage-002

lm_eval --model openai-completions \
  --model_args model=davinci-002 \
  --tasks lambada_openai,hellaswag \
  --batch_size auto

Supports:

  • generate_until
  • loglikelihood
  • loglikelihood_rolling

Chat Models

Available models: gpt-4, gpt-4-turbo, gpt-3.5-turbo

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu,gsm8k,humaneval \
  --num_fewshot 5 \
  --batch_size auto

Supports:

  • generate_until
  • loglikelihood: not supported (no logprobs)
  • loglikelihood_rolling: not supported (no logprobs)

Important: Chat models don't provide logprobs, so they can only be used with generation tasks (MMLU, GSM8K, HumanEval), not perplexity tasks.

Configuration Options

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo,base_url=https://api.openai.com/v1,num_concurrent=5,max_retries=3,timeout=60 \
  --batch_size auto

Parameters:

  • model: Model identifier (required)
  • base_url: API endpoint (default: OpenAI)
  • num_concurrent: Concurrent requests (default: 5)
  • max_retries: Retry failed requests (default: 3)
  • timeout: Request timeout in seconds (default: 60)
  • tokenizer: Tokenizer to use (default: matches model)
  • tokenizer_backend: "tiktoken" or "huggingface"
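
The same options can be passed from Python. A minimal sketch using the evaluator API (the same comma-separated model_args string as the CLI; assumes OPENAI_API_KEY is set in the environment):

from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="openai-chat-completions",
    model_args="model=gpt-4-turbo,num_concurrent=5,max_retries=3,timeout=60",
    tasks=["mmlu"],
    num_fewshot=5,
)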

Cost Management

OpenAI charges per token. Estimate costs before running:

# Rough estimate
num_samples = 1000
avg_tokens_per_sample = 500  # input + output
cost_per_1k_tokens = 0.01  # example rate in $ per 1K tokens; check current pricing

total_cost = (num_samples * avg_tokens_per_sample / 1000) * cost_per_1k_tokens
print(f"Estimated cost: ${total_cost:.2f}")

Cost-saving tips:

  • Use --limit N for testing
  • Start with gpt-3.5-turbo before gpt-4
  • Set max_gen_toks to minimum needed
  • Use num_fewshot=0 for zero-shot when possible

Anthropic Models

Setup

export ANTHROPIC_API_KEY=sk-ant-...

Completion Models (Legacy)

lm_eval --model anthropic-completions \
  --model_args model=claude-2.1 \
  --tasks lambada_openai,hellaswag \
  --batch_size auto

Chat Models

Available models: claude-3-5-sonnet-20241022, claude-3-opus-20240229, claude-3-sonnet-20240229, claude-3-haiku-20240307

lm_eval --model anthropic-chat \
  --model_args model=claude-3-5-sonnet-20241022 \
  --tasks mmlu,gsm8k,humaneval \
  --num_fewshot 5 \
  --batch_size auto

Aliases: anthropic-chat-completions (same as anthropic-chat)

Configuration Options

lm_eval --model anthropic-chat \
  --model_args model=claude-3-5-sonnet-20241022,base_url=https://api.anthropic.com,num_concurrent=5,max_retries=3,timeout=60

Cost Management

Anthropic pricing (as of 2024):

  • Claude 3.5 Sonnet: $3.00 / 1M input, $15.00 / 1M output
  • Claude 3 Opus: $15.00 / 1M input, $75.00 / 1M output
  • Claude 3 Haiku: $0.25 / 1M input, $1.25 / 1M output
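
The same back-of-the-envelope arithmetic as above applies, with input and output tokens priced separately. A sketch using the Claude 3.5 Sonnet rates and assumed token counts (adjust both to your task):

num_samples = 1000
input_tokens_per_sample = 400   # assumed prompt length
output_tokens_per_sample = 100  # assumed completion length

input_rate, output_rate = 3.00, 15.00  # $ per 1M tokens (Claude 3.5 Sonnet)

cost = (
    (num_samples * input_tokens_per_sample / 1e6) * input_rate
    + (num_samples * output_tokens_per_sample / 1e6) * output_rate
)
print(f"Estimated cost: ${cost:.2f}")  # $2.70 for these assumptions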

Budget-friendly strategy:

# Test on small sample first
lm_eval --model anthropic-chat \
  --model_args model=claude-3-haiku-20240307 \
  --tasks mmlu \
  --limit 100

# Then run full eval on best model
lm_eval --model anthropic-chat \
  --model_args model=claude-3-5-sonnet-20241022 \
  --tasks mmlu \
  --num_fewshot 5

Local OpenAI-Compatible APIs

Many local inference servers expose OpenAI-compatible APIs (vLLM, Text Generation Inference, llama.cpp, Ollama).

vLLM Local Server

Start server:

vllm serve meta-llama/Llama-2-7b-hf \
  --host 0.0.0.0 \
  --port 8000

Evaluate:

lm_eval --model local-completions \
  --model_args model=meta-llama/Llama-2-7b-hf,base_url=http://localhost:8000/v1,num_concurrent=1 \
  --tasks mmlu,gsm8k \
  --batch_size auto

Text Generation Inference (TGI)

Start server:

docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-7b-hf

Evaluate:

lm_eval --model local-completions \
  --model_args model=meta-llama/Llama-2-7b-hf,base_url=http://localhost:8080/v1 \
  --tasks hellaswag,arc_challenge

Ollama

Start server:

ollama serve
ollama pull llama2:7b

Evaluate:

lm_eval --model local-completions \
  --model_args model=llama2:7b,base_url=http://localhost:11434/v1 \
  --tasks mmlu

llama.cpp Server

Start server:

./server -m models/llama-2-7b.gguf --host 0.0.0.0 --port 8080

Evaluate:

lm_eval --model local-completions \
  --model_args model=llama2,base_url=http://localhost:8080/v1 \
  --tasks gsm8k

Custom API Implementation

For custom API endpoints, subclass TemplateAPI:

Create my_api.py

from lm_eval.models.api_models import TemplateAPI

class MyCustomAPI(TemplateAPI):
    """Custom API model."""

    def __init__(self, base_url, api_key, **kwargs):
        super().__init__(base_url=base_url, **kwargs)
        self.api_key = api_key

    def _create_payload(self, messages, gen_kwargs):
        """Build the JSON payload for one API request."""
        # Note: many APIs expect the key in an Authorization header
        # rather than the payload; adjust to match your endpoint.
        return {
            "messages": messages,
            "api_key": self.api_key,
            **gen_kwargs,
        }

    def parse_generations(self, response):
        """Extract the generated text from the API response."""
        return response.json()["choices"][0]["text"]

    def parse_logprobs(self, response):
        """Extract token logprobs; return None if the API has none."""
        logprobs = response.json().get("logprobs")
        if logprobs:
            return logprobs["token_logprobs"]
        return None

Register and Use

from lm_eval import evaluator
from my_api import MyCustomAPI

model = MyCustomAPI(
    base_url="https://api.example.com/v1",
    api_key="your-key"
)

results = evaluator.simple_evaluate(
    model=model,
    tasks=["mmlu", "gsm8k"],
    num_fewshot=5,
    batch_size="auto"
)
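
simple_evaluate returns a dictionary with per-task metrics under the "results" key. Metric names vary by task, so the simplest inspection is to print whatever each task reports:

for task_name, metrics in results["results"].items():
    print(task_name)
    for metric, value in metrics.items():
        print(f"  {metric}: {value}")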

Comparing API and Open Models

Side-by-Side Evaluation

# Evaluate OpenAI GPT-4
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu,gsm8k,hellaswag \
  --num_fewshot 5 \
  --output_path results/gpt4.json

# Evaluate open Llama 2 70B
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-70b-hf,dtype=bfloat16 \
  --tasks mmlu,gsm8k,hellaswag \
  --num_fewshot 5 \
  --output_path results/llama2-70b.json

# Compare results
python scripts/compare_results.py \
  results/gpt4.json \
  results/llama2-70b.json
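
If your checkout doesn't include scripts/compare_results.py, a minimal stand-in is easy to write: the output JSON stores per-task metrics under the "results" key, so comparing two runs is a dictionary walk.

import json
import sys

# Usage: python compare_results.py results/a.json results/b.json
path_a, path_b = sys.argv[1], sys.argv[2]
with open(path_a) as fa, open(path_b) as fb:
    results_a = json.load(fa)["results"]
    results_b = json.load(fb)["results"]

# Compare only the tasks and metrics present in both runs
for task in sorted(set(results_a) & set(results_b)):
    print(task)
    for metric in sorted(set(results_a[task]) & set(results_b[task])):
        print(f"  {metric}: {results_a[task][metric]}  vs  {results_b[task][metric]}")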

Typical Comparisons

Model           MMLU    GSM8K   HumanEval   Cost
GPT-4 Turbo     86.4%   92.0%   67.0%       $$$
Claude 3 Opus   86.8%   95.0%   84.9%       $$$
GPT-3.5 Turbo   70.0%   57.1%   48.1%       $$
Llama 2 70B     68.9%   56.8%   29.9%       Free (self-host)
Mixtral 8x7B    70.6%   58.4%   40.2%       Free (self-host)

Best Practices

Rate Limiting

Respect API rate limits:

# Lower concurrency (num_concurrent=3) and a longer timeout (120 s)
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo,num_concurrent=3,timeout=120 \
  --tasks mmlu

Reproducibility

Set temperature to 0 for near-deterministic results (API backends can still show small run-to-run variation):

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu \
  --gen_kwargs temperature=0.0

Or pass a sampling seed where the provider supports one (OpenAI does; the Anthropic API has no seed parameter):

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks gsm8k \
  --gen_kwargs temperature=0.7,seed=42

Caching

Responses can be cached to avoid redundant (and costly) API calls: pass --use_cache with a path to a SQLite database file:

# First run: makes API calls and populates the cache
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu \
  --limit 100 \
  --use_cache ~/.cache/lm_eval/gpt4

# Second run with identical arguments: served from the cache (fast, free)
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu \
  --limit 100 \
  --use_cache ~/.cache/lm_eval/gpt4

The cache is written to the path you pass to --use_cache.

Error Handling

APIs can fail. Use retries:

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo,max_retries=5,timeout=120 \
  --tasks mmlu
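
Retries with backoff are also worth building into any custom TemplateAPI subclass. A generic sketch (the URL and payload here are placeholders, not part of the harness):

import time
import requests

def post_with_retries(url, payload, max_retries=5, timeout=120):
    """POST with exponential backoff on transient failures."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(url, json=payload, timeout=timeout)
            if resp.status_code == 429:  # rate limited: back off, then retry
                raise requests.HTTPError("rate limited", response=resp)
            resp.raise_for_status()
            return resp
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError):
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...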

Troubleshooting

"Authentication failed"

Check API key:

echo $OPENAI_API_KEY  # Should print sk-...
echo $ANTHROPIC_API_KEY  # Should print sk-ant-...

"Rate limit exceeded"

Reduce concurrency:

--model_args num_concurrent=1

Or add delays between requests.

"Timeout error"

Increase timeout:

--model_args timeout=180

"Model not found"

For local APIs, verify server is running:

curl http://localhost:8000/v1/models

Cost Runaway

Use --limit for testing:

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu \
  --limit 50  # Only 50 samples

Advanced Features

Custom Headers

lm_eval --model local-completions \
  --model_args base_url=http://api.example.com/v1,header="Authorization: Bearer token,X-Custom: value"

Disable SSL Verification (Development Only)

lm_eval --model local-completions \
  --model_args base_url=https://localhost:8000/v1,verify_certificate=false

Custom Tokenizer

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo,tokenizer=gpt2,tokenizer_backend=huggingface
