Sync all skills and memories 2026-04-14 07:27

2026-04-14 07:27:20 +09:00
parent 516bb44fe6
commit 1eba2bca95
386 changed files with 167655 additions and 0 deletions
--- a/skills/mlops/training/trl-fine-tuning/references/reward-modeling.md
+++ b/skills/mlops/training/trl-fine-tuning/references/reward-modeling.md
@@ -0,0 +1,122 @@
+# Reward Modeling
+
+Guide to training reward models with TRL for RLHF pipelines.
+
+## Overview
+
+Reward models score completions based on human preferences. Used in:
+- PPO training (RL feedback)
+- GRPO online RL
+- Completion ranking
+
+## Basic Training
+
+```python
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+from trl import RewardTrainer, RewardConfig
+from datasets import load_dataset
+
+# Load model (num_labels=1 for single reward score)
+model = AutoModelForSequenceClassification.from_pretrained(
+    "Qwen/Qwen2.5-0.5B-Instruct",
+    num_labels=1
+)
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
+
+# Load preference dataset (chosen/rejected pairs)
+dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
+
+# Configure
+config = RewardConfig(
+    output_dir="Qwen2.5-Reward",
+    per_device_train_batch_size=2,
+    num_train_epochs=1,
+    learning_rate=1e-5
+)
+
+# Train
+trainer = RewardTrainer(
+    model=model,
+    args=config,
+    processing_class=tokenizer,
+    train_dataset=dataset
+)
+trainer.train()
+```
+
+## Dataset Format
+
+Required fields:
+```json
+{
+  "prompt": "Question or instruction",
+  "chosen": "Better response",
+  "rejected": "Worse response"
+}
+```
+
+## Bradley-Terry Loss
+
+Default loss function:
+```
+loss = -log(sigmoid(reward_chosen - reward_rejected))
+```
+
+Learns to score chosen > rejected.
+
+## Using Reward Models
+
+### Inference
+
+```python
+from transformers import pipeline
+
+# Load trained reward model
+reward_pipe = pipeline("text-classification", model="Qwen2.5-Reward")
+
+# Score completions
+texts = ["Good answer", "Bad answer"]
+scores = reward_pipe(texts)
+print(scores)  # Higher score = better
+```
+
+### In PPO
+
+```python
+from trl import PPOTrainer, PPOConfig
+
+config = PPOConfig(
+    reward_model_path="Qwen2.5-Reward"  # Use trained reward model
+)
+
+trainer = PPOTrainer(
+    model=policy_model,
+    config=config,
+    # Reward model loaded automatically
+)
+```
+
+## Hyperparameters
+
+| Model Size | Learning Rate | Batch Size | Epochs |
+|------------|---------------|------------|--------|
+| <1B | 2e-5 | 4-8 | 1-2 |
+| 1-7B | 1e-5 | 2-4 | 1 |
+| 7-13B | 5e-6 | 1-2 | 1 |
+
+## Evaluation
+
+Check reward separation:
+```python
+# Chosen should score higher than rejected
+chosen_rewards = model(**chosen_inputs).logits
+rejected_rewards = model(**rejected_inputs).logits
+
+accuracy = (chosen_rewards > rejected_rewards).float().mean()
+print(f"Accuracy: {accuracy:.2%}")  # Target: >80%
+```
+
+## References
+
+- InstructGPT paper: https://arxiv.org/abs/2203.02155
+- TRL docs: https://huggingface.co/docs/trl/reward_trainer