# Reward Modeling

Guide to training reward models with TRL for RLHF pipelines.

## Overview

A reward model assigns a scalar score to a completion, trained so that completions humans prefer score higher. Reward models are used in:

- PPO training (as the RL feedback signal)
- GRPO online RL
- Completion ranking

## Basic Training

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

# Load model (num_labels=1 for a single scalar reward score)
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    num_labels=1,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
# Sequence classification needs a pad token ID to handle batches of padded inputs
model.config.pad_token_id = tokenizer.pad_token_id

# Load preference dataset (chosen/rejected pairs)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Configure
config = RewardConfig(
    output_dir="Qwen2.5-Reward",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-5,
)

# Train
trainer = RewardTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()

# Save the final model and tokenizer to output_dir so it can be reloaded by path
trainer.save_model(config.output_dir)
```

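## Dataset Format

Each training example is a preference pair: a prompt with a preferred (`chosen`) and a dispreferred (`rejected`) response:

```json
{
  "prompt": "Question or instruction",
  "chosen": "Better response",
  "rejected": "Worse response"
}
```

Some preference datasets, including `trl-lib/ultrafeedback_binarized`, omit the separate `prompt` field and instead include the prompt at the start of both `chosen` and `rejected`.
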
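Malformed pairs are easier to catch before training than to debug afterwards. The following is a minimal sketch of a sanity check for the explicit-prompt format with plain string fields. `validate_pair` and `REQUIRED_FIELDS` are hypothetical names introduced here for illustration and are not part of TRL.

```python
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")  # hypothetical constant, not a TRL name


def validate_pair(record: dict) -> list[str]:
    """Return a list of problems found in one preference record (empty if valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # A pair whose responses are identical carries no preference signal
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems


# Made-up records for illustration
records = [
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
    {"prompt": "Name a prime.", "chosen": "7", "rejected": "7"},
    {"prompt": "Say hi.", "chosen": "Hello!"},
]

for i, record in enumerate(records):
    print(i, validate_pair(record) or "ok")
```

Conversational datasets, where `chosen` and `rejected` are lists of messages, would need a different check.
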
## Bradley-Terry Loss

The default loss is the Bradley-Terry pairwise loss:

```
loss = -log(sigmoid(reward_chosen - reward_rejected))
```

Minimizing it trains the model to score the chosen response higher than the rejected one.

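To make the formula concrete, here is a small pure-Python sketch that evaluates the loss for a few reward pairs. The function name `bradley_terry_loss` and the example reward values are assumptions made for illustration. This is not how TRL computes the loss internally, which operates on batched tensors.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp()
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative reward pairs: correct ordering, tie, wrong ordering
for chosen, rejected in [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]:
    loss = bradley_terry_loss(chosen, rejected)
    print(f"margin={chosen - rejected:+.1f}  loss={loss:.4f}")
```

The loss depends only on the margin between the two rewards. It equals log 2 (about 0.693) when the rewards are tied, falls toward zero as the chosen reward pulls ahead, and grows roughly linearly when the ordering is wrong.
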
## Using Reward Models

### Inference

```python
from transformers import pipeline

# Load the trained reward model.
# function_to_apply="none" returns the raw reward logit; by default the
# pipeline applies a sigmoid when the model has a single label.
reward_pipe = pipeline(
    "text-classification",
    model="Qwen2.5-Reward",
    function_to_apply="none",
)

# Score completions
texts = ["Good answer", "Bad answer"]
scores = reward_pipe(texts)
print(scores)  # List of {"label", "score"} dicts; higher score = better
```

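For the completion-ranking use case listed in the overview, the scores can be used to sort candidates from best to worst. The sketch below assumes the list-of-dicts output shape of a `text-classification` pipeline. `rank_by_reward` is a hypothetical helper, and the scores are made up so the example runs without a trained model.

```python
def rank_by_reward(completions, scores):
    """Pair each completion with its score and sort from best to worst."""
    if len(completions) != len(scores):
        raise ValueError("completions and scores must have the same length")
    paired = [(text, result["score"]) for text, result in zip(completions, scores)]
    return sorted(paired, key=lambda item: item[1], reverse=True)


# Made-up scores in the shape returned by a text-classification pipeline
completions = ["Answer A", "Answer B", "Answer C"]
scores = [
    {"label": "LABEL_0", "score": 0.3},
    {"label": "LABEL_0", "score": 1.7},
    {"label": "LABEL_0", "score": -0.4},
]

ranked = rank_by_reward(completions, scores)
best_text, best_score = ranked[0]
print(best_text, best_score)
```

In practice the scored text should contain the prompt as well as the completion, formatted the same way as the training data.
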
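### In PPO

```python
from transformers import AutoModelForSequenceClassification
from trl import PPOConfig, PPOTrainer

config = PPOConfig(
    reward_model_path="Qwen2.5-Reward"  # Path to the trained reward model
)

# Recent TRL releases expect the reward model to be loaded and passed explicitly;
# argument names vary by version, so check the PPOTrainer docs for yours.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    config.reward_model_path, num_labels=1
)

trainer = PPOTrainer(
    args=config,
    model=policy_model,
    reward_model=reward_model,
    # Plus the tokenizer, reference model, value model, and dataset
)
```
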
## Hyperparameters

Suggested starting points by model size:

| Model Size | Learning Rate | Batch Size | Epochs |
|------------|---------------|------------|--------|
| <1B | 2e-5 | 4-8 | 1-2 |
| 1-7B | 1e-5 | 2-4 | 1 |
| 7-13B | 5e-6 | 1-2 | 1 |

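The table can be encoded as a small lookup when configs are generated programmatically. This is only a sketch: `suggest_hyperparameters` is a hypothetical helper, ranges are returned as `(min, max)` tuples, and a model of exactly 7B is assumed to fall in the 7-13B row because the table lists 7B in both ranges.

```python
def suggest_hyperparameters(num_params_billions: float) -> dict:
    """Look up starting hyperparameters from the table above by model size."""
    if num_params_billions <= 0:
        raise ValueError("model size must be positive")
    if num_params_billions < 1:
        return {"learning_rate": 2e-5, "batch_size": (4, 8), "epochs": (1, 2)}
    if num_params_billions < 7:
        return {"learning_rate": 1e-5, "batch_size": (2, 4), "epochs": (1, 1)}
    if num_params_billions <= 13:
        # Assumption: exactly 7B is treated as part of the 7-13B row
        return {"learning_rate": 5e-6, "batch_size": (1, 2), "epochs": (1, 1)}
    raise ValueError("the table does not cover models larger than 13B")


# Example: a 0.5B model such as the one used in Basic Training
print(suggest_hyperparameters(0.5))
```
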
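## Evaluation

Check pairwise accuracy, that is, how often the chosen response scores higher than the rejected one on held-out pairs:

```python
import torch

# Chosen should score higher than rejected.
# chosen_inputs / rejected_inputs are tokenized batches of paired examples.
model.eval()
with torch.no_grad():
    chosen_rewards = model(**chosen_inputs).logits.squeeze(-1)
    rejected_rewards = model(**rejected_inputs).logits.squeeze(-1)

accuracy = (chosen_rewards > rejected_rewards).float().mean().item()
print(f"Accuracy: {accuracy:.2%}")  # Target: >80%
```
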
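Accuracy alone does not show how far apart the two scores are. As a rough sketch, the mean margin between chosen and rejected rewards can be reported next to it. `pairwise_metrics` is a hypothetical helper that works on plain lists of floats, and the reward values are made up for illustration.

```python
def pairwise_metrics(chosen_rewards, rejected_rewards):
    """Return (accuracy, mean_margin) over paired reward scores."""
    if len(chosen_rewards) != len(rejected_rewards) or not chosen_rewards:
        raise ValueError("need two non-empty lists of equal length")
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    # Accuracy: fraction of pairs where chosen scores strictly higher
    accuracy = sum(m > 0 for m in margins) / len(margins)
    # Mean margin: average reward gap (chosen minus rejected)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin


# Made-up reward scores for four preference pairs
chosen = [1.5, 0.5, -0.25, 2.0]
rejected = [0.5, 1.0, -1.25, 0.0]

accuracy, mean_margin = pairwise_metrics(chosen, rejected)
print(f"Accuracy: {accuracy:.2%}, mean margin: {mean_margin:.3f}")
```

Tensor outputs such as `logits` can be converted with `.tolist()` before being passed in.
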
## References

- InstructGPT paper: https://arxiv.org/abs/2203.02155
- TRL docs: https://huggingface.co/docs/trl/reward_trainer