Sync all skills and memories 2026-04-14 07:27

2026-04-14 07:27:20 +09:00
parent 516bb44fe6
commit 1eba2bca95
386 changed files with 167655 additions and 0 deletions
--- a/skills/mlops/training/peft/SKILL.md
+++ b/skills/mlops/training/peft/SKILL.md
@@ -0,0 +1,434 @@
+---
+name: peft-fine-tuning
+description: Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. HuggingFace's official library integrated with transformers ecosystem.
+version: 1.0.0
+author: Orchestra Research
+license: MIT
+dependencies: [peft>=0.13.0, transformers>=4.45.0, torch>=2.0.0, bitsandbytes>=0.43.0]
+metadata:
+  hermes:
+    tags: [Fine-Tuning, PEFT, LoRA, QLoRA, Parameter-Efficient, Adapters, Low-Rank, Memory Optimization, Multi-Adapter]
+
+---
+
+# PEFT (Parameter-Efficient Fine-Tuning)
+
+Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.
+
+## When to use PEFT
+
+**Use PEFT/LoRA when:**
+- Fine-tuning 7B-70B models on limited hardware (a consumer RTX 4090, a single A100)
+- Need to train <1% of parameters (adapters of tens of MB vs a multi-GB full model)
+- Want fast iteration with multiple task-specific adapters
+- Deploying multiple fine-tuned variants from one base model
+
+**Use QLoRA (PEFT + quantization) when:**
+- Fine-tuning 70B models on a single 48GB-80GB GPU (or models up to ~30B on a 24GB GPU)
+- Memory is the primary constraint
+- Can accept ~5% quality trade-off vs full fine-tuning
+
+**Use full fine-tuning instead when:**
+- Training small models (<1B parameters)
+- Need maximum quality and have compute budget
+- Significant domain shift requires updating all weights
+
+## Quick start
+
+### Installation
+
+```bash
+# Basic installation
+pip install peft
+
+# With quantization support (recommended)
+pip install peft bitsandbytes
+
+# Full stack
+pip install peft transformers accelerate bitsandbytes datasets
+```
+
+### LoRA fine-tuning (standard)
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
+from peft import get_peft_model, LoraConfig, TaskType
+from datasets import load_dataset
+
+# Load base model
+model_name = "meta-llama/Llama-3.1-8B"
+model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+tokenizer.pad_token = tokenizer.eos_token
+
+# LoRA configuration
+lora_config = LoraConfig(
+    task_type=TaskType.CAUSAL_LM,
+    r=16,                          # Rank (8-64, higher = more capacity)
+    lora_alpha=32,                 # Scaling factor (typically 2*r)
+    lora_dropout=0.05,             # Dropout for regularization
+    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
+    bias="none"                    # Don't train biases
+)
+
+# Apply LoRA
+model = get_peft_model(model, lora_config)
+model.print_trainable_parameters()
+# Output: trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
+
+# Prepare dataset
+dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
+
+def tokenize(example):
+    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
+    return tokenizer(text, truncation=True, max_length=512, padding="max_length")
+
+tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)
+
+# Training
+training_args = TrainingArguments(
+    output_dir="./lora-llama",
+    num_train_epochs=3,
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=4,
+    learning_rate=2e-4,
+    bf16=True,                     # Llama 3.1 weights are bf16; use fp16=True on GPUs without bf16 support
+    logging_steps=10,
+    save_strategy="epoch"
+)
+
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=tokenized,
+    # The tokenized dataset stores plain Python lists, so build tensors here
+    data_collator=lambda data: {"input_ids": torch.tensor([f["input_ids"] for f in data]),
+                                 "attention_mask": torch.tensor([f["attention_mask"] for f in data]),
+                                 "labels": torch.tensor([f["input_ids"] for f in data])}
+)
+
+trainer.train()
+
+# Save adapter only (tens of MB vs ~16GB for the full bf16 model)
+model.save_pretrained("./lora-llama-adapter")
+```
+
+### QLoRA fine-tuning (memory-efficient)
+
+```python
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training
+
+# 4-bit quantization config
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",           # NormalFloat4 (best for LLMs)
+    bnb_4bit_compute_dtype="bfloat16",   # Compute in bf16
+    bnb_4bit_use_double_quant=True       # Nested quantization
+)
+
+# Load quantized model
+model = AutoModelForCausalLM.from_pretrained(
+    "meta-llama/Llama-3.1-70B",
+    quantization_config=bnb_config,
+    device_map="auto"
+)
+
+# Prepare for training (enables gradient checkpointing)
+model = prepare_model_for_kbit_training(model)
+
+# LoRA config for QLoRA
+lora_config = LoraConfig(
+    r=64,                              # Higher rank for 70B
+    lora_alpha=128,
+    lora_dropout=0.1,
+    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
+    bias="none",
+    task_type="CAUSAL_LM"
+)
+
+model = get_peft_model(model, lora_config)
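+# The 4-bit weights of a 70B model take roughly 35-40 GB, so this needs a
+# 48GB-80GB GPU (or several 24GB GPUs sharded via device_map="auto")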
+```
+
+## LoRA parameter selection
+
+### Rank (r) - capacity vs efficiency
+
+| Rank | Trainable Params (Llama 3.1 8B, attention only) | Memory | Quality | Use Case |
+|------|-----------------|--------|---------|----------|
+| 4 | ~3M | Minimal | Lower | Simple tasks, prototyping |
+| **8** | ~7M | Low | Good | **Recommended starting point** |
+| **16** | ~14M | Medium | Better | **General fine-tuning** |
+| 32 | ~27M | Higher | High | Complex tasks |
+| 64 | ~54M | High | Highest | Domain adaptation, 70B models |
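+
+The parameter counts in the table can be reproduced with a little arithmetic. The sketch below assumes Llama 3.1 8B layer shapes (hidden size 4096, key/value projection size 1024, 32 layers) and LoRA applied only to the four attention projections; `ATTN_SHAPES`, `NUM_LAYERS`, and `lora_param_count` are illustrative names, not part of PEFT.
+
+```python
+# Assumed Llama 3.1 8B shapes: (in_features, out_features) per attention projection
+ATTN_SHAPES = {
+    "q_proj": (4096, 4096),
+    "k_proj": (4096, 1024),  # grouped-query attention: smaller K/V projections
+    "v_proj": (4096, 1024),
+    "o_proj": (4096, 4096),
+}
+NUM_LAYERS = 32
+
+def lora_param_count(rank, shapes=ATTN_SHAPES, num_layers=NUM_LAYERS):
+    """LoRA adds A (rank x in) and B (out x rank) to every targeted layer."""
+    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
+    return per_layer * num_layers
+
+for r in (4, 8, 16, 32, 64):
+    print(f"r={r:>2}: {lora_param_count(r):,} trainable parameters")
+```
+
+The count grows linearly with rank, so doubling `r` doubles adapter size; adding the MLP projections to `target_modules` raises it further.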
+
+### Alpha (lora_alpha) - scaling factor
+
+```python
+# Rule of thumb: alpha = 2 * rank (the LoRA update is scaled by alpha / r)
+LoraConfig(r=16, lora_alpha=32)  # Standard (scaling = 2)
+LoraConfig(r=16, lora_alpha=16)  # Conservative (scaling = 1, smaller effective update)
+LoraConfig(r=16, lora_alpha=64)  # Aggressive (scaling = 4, larger effective update)
+```
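+
+In standard LoRA the low-rank update is multiplied by `lora_alpha / r` before it is added to the frozen weight, which is why changing alpha behaves much like changing the learning rate of the adapter. A minimal sketch of that relationship (`lora_scaling` is a hypothetical helper, not a PEFT function):
+
+```python
+def lora_scaling(r, lora_alpha):
+    """Scaling applied to the update: W_effective = W + (lora_alpha / r) * B @ A."""
+    return lora_alpha / r
+
+for r, alpha in [(16, 32), (16, 16), (16, 64), (64, 128)]:
+    print(f"r={r}, lora_alpha={alpha} -> scaling={lora_scaling(r, alpha)}")
+```
+
+Keeping `lora_alpha = 2 * r` holds the scaling at 2 as the rank changes, so the learning rate usually needs less retuning between runs.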
+
+### Target modules by architecture
+
+```python
+# Llama / Mistral / Qwen
+target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
+
+# GPT-2 / GPT-Neo
+target_modules = ["c_attn", "c_proj", "c_fc"]
+
+# Falcon
+target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
+
+# BLOOM
+target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
+
+# Auto-detect all linear layers
+target_modules = "all-linear"  # PEFT 0.8.0+
+```
+
+## Loading and merging adapters
+
+### Load trained adapter
+
+```python
+from peft import PeftModel, AutoPeftModelForCausalLM
+from transformers import AutoModelForCausalLM
+
+# Option 1: Load with PeftModel
+base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
+model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")
+
+# Option 2: Load directly (recommended)
+model = AutoPeftModelForCausalLM.from_pretrained(
+    "./lora-llama-adapter",
+    device_map="auto"
+)
+```
+
+### Merge adapter into base model
+
+```python
+# Merge for deployment (no adapter overhead)
+merged_model = model.merge_and_unload()
+
+# Save merged model
+merged_model.save_pretrained("./llama-merged")
+tokenizer.save_pretrained("./llama-merged")
+
+# Push to Hub
+merged_model.push_to_hub("username/llama-finetuned")
+```
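+
+Merging removes adapter overhead because the low-rank update is folded into the base weight once, instead of being recomputed on every forward pass. The toy example below checks this equivalence with plain Python lists; the matrices, the `matmul` helper, and the scaling value are made up for illustration and do not come from a real model.
+
+```python
+def matmul(a, b):
+    """Multiply two matrices stored as lists of rows."""
+    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
+
+W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (2 x 2)
+B = [[0.5], [-1.0]]           # LoRA B matrix (out x r), with r = 1
+A = [[2.0, 1.0]]              # LoRA A matrix (r x in)
+scaling = 2.0                 # lora_alpha / r
+x = [[1.0], [1.0]]            # input as a column vector
+
+# Adapter path: base output plus the scaled low-rank correction
+base_out = matmul(W, x)
+lora_out = matmul(B, matmul(A, x))
+adapter_out = [[b[0] + scaling * l[0]] for b, l in zip(base_out, lora_out)]
+
+# Merged path: fold scaling * B @ A into W once, then do a single matmul
+delta = matmul(B, A)
+W_merged = [[w + scaling * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
+merged_out = matmul(W_merged, x)
+
+print(adapter_out, merged_out)
+```
+
+Both paths give the same output, but the merged weight can no longer be switched off or swapped, which is why it helps to keep the adapter files until the merged model has been evaluated.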
+
+### Multi-adapter serving
+
+```python
+from peft import AutoPeftModelForCausalLM
+
+# Load base with first adapter, naming it explicitly (the default name is "default")
+model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1", adapter_name="task1")
+
+# Load additional adapters
+model.load_adapter("./adapter-task2", adapter_name="task2")
+model.load_adapter("./adapter-task3", adapter_name="task3")
+
+# Switch between adapters at runtime
+model.set_adapter("task1")  # Use task1 adapter
+output1 = model.generate(**inputs)
+
+model.set_adapter("task2")  # Switch to task2
+output2 = model.generate(**inputs)
+
+# Disable adapters (use base model)
+with model.disable_adapter():
+    base_output = model.generate(**inputs)
+```
+
+## PEFT methods comparison
+
+| Method | Trainable % | Memory | Speed | Best For |
+|--------|------------|--------|-------|----------|
+| **LoRA** | 0.1-1% | Low | Fast | General fine-tuning |
+| **QLoRA** | 0.1-1% | Very Low | Medium | Memory-constrained |
+| AdaLoRA | 0.1-1% | Low | Medium | Automatic rank selection |
+| IA3 | 0.01% | Minimal | Fastest | Few-shot adaptation |
+| Prefix Tuning | 0.1% | Low | Medium | Generation control |
+| Prompt Tuning | 0.001% | Minimal | Fast | Simple task adaptation |
+| P-Tuning v2 | 0.1% | Low | Medium | NLU tasks |
+
+### IA3 (minimal parameters)
+
+```python
+from peft import IA3Config, get_peft_model
+
+ia3_config = IA3Config(
+    task_type="CAUSAL_LM",
+    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
+    feedforward_modules=["down_proj"]  # Must be a subset of target_modules
+)
+model = get_peft_model(model, ia3_config)
+# Trains only 0.01% of parameters!
+```
+
+### Prefix Tuning
+
+```python
+from peft import PrefixTuningConfig, get_peft_model
+
+prefix_config = PrefixTuningConfig(
+    task_type="CAUSAL_LM",
+    num_virtual_tokens=20,      # Prepended tokens
+    prefix_projection=True       # Use MLP projection
+)
+model = get_peft_model(model, prefix_config)
+```
+
+## Integration patterns
+
+### With TRL (SFTTrainer)
+
+```python
+from trl import SFTTrainer, SFTConfig
+from peft import LoraConfig
+
+lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")
+
+trainer = SFTTrainer(
+    model=model,
+    args=SFTConfig(output_dir="./output", max_seq_length=512),  # renamed to max_length in recent TRL releases
+    train_dataset=dataset,
+    peft_config=lora_config,  # Pass LoRA config directly
+)
+trainer.train()
+```
+
+### With Axolotl (YAML config)
+
+```yaml
+# axolotl config.yaml
+adapter: lora
+lora_r: 16
+lora_alpha: 32
+lora_dropout: 0.05
+lora_target_modules:
+  - q_proj
+  - v_proj
+  - k_proj
+  - o_proj
+lora_target_linear: true  # Target all linear layers
+```
+
+### With vLLM (inference)
+
+```python
+from vllm import LLM
+from vllm.lora.request import LoRARequest
+
+# Load base model with LoRA support
+llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)
+
+# Serve with adapter
+outputs = llm.generate(
+    prompts,
+    lora_request=LoRARequest("adapter1", 1, "./lora-adapter")
+)
+```
+
+## Performance benchmarks
+
+### Memory usage (Llama 3.1 8B)
+
+| Method | GPU Memory | Trainable Params |
+|--------|-----------|------------------|
+| Full fine-tuning | 60+ GB | 8B (100%) |
+| LoRA r=16 | 18 GB | 14M (0.17%) |
+| QLoRA r=16 | 6 GB | 14M (0.17%) |
+| IA3 | 16 GB | 800K (0.01%) |
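+
+Most of the gap between the LoRA and QLoRA rows comes from the precision of the frozen base weights. As a rough sketch (assuming about 8 billion parameters and counting weights only, with no activations, gradients, optimizer state, or quantization constants), the weight footprint can be estimated like this; `weight_memory_gb` is an illustrative helper:
+
+```python
+def weight_memory_gb(num_params, bits_per_param):
+    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
+    return num_params * bits_per_param / 8 / 1e9
+
+NUM_PARAMS = 8_000_000_000  # approximate size of an 8B model
+
+for label, bits in [("fp32", 32), ("bf16/fp16", 16), ("int8", 8), ("nf4 (4-bit)", 4)]:
+    print(f"{label:>12}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
+```
+
+Actual usage is higher than these figures because activations, adapter gradients, and optimizer state are stored on top of the weights, and it varies with batch size and sequence length.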
+
+### Training speed (A100 80GB)
+
+| Method | Tokens/sec | vs Full FT |
+|--------|-----------|------------|
+| Full FT | 2,500 | 1x |
+| LoRA | 3,200 | 1.3x |
+| QLoRA | 2,100 | 0.84x |
+
+### Quality (MMLU benchmark)
+
+| Model | Full FT | LoRA | QLoRA |
+|-------|---------|------|-------|
+| Llama 2-7B | 45.3 | 44.8 | 44.1 |
+| Llama 2-13B | 54.8 | 54.2 | 53.5 |
+
+## Common issues
+
+### CUDA OOM during training
+
+```python
+# Solution 1: Enable gradient checkpointing
+model.gradient_checkpointing_enable()
+
+# Solution 2: Reduce batch size + increase accumulation
+TrainingArguments(
+    per_device_train_batch_size=1,
+    gradient_accumulation_steps=16
+)
+
+# Solution 3: Use QLoRA
+from transformers import BitsAndBytesConfig
+bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
+```
+
+### Adapter not applying
+
+```python
+# Verify adapter is active
+print(model.active_adapters)  # Should show adapter name
+
+# Check trainable parameters
+model.print_trainable_parameters()
+
+# Ensure model in training mode
+model.train()
+```
+
+### Quality degradation
+
+```python
+# Increase rank
+LoraConfig(r=32, lora_alpha=64)
+
+# Target more modules
+target_modules = "all-linear"
+
+# Use more training data and epochs
+TrainingArguments(num_train_epochs=5)
+
+# Lower learning rate
+TrainingArguments(learning_rate=1e-4)
+```
+
+## Best practices
+
+1. **Start with r=8-16**, increase if quality insufficient
+2. **Use alpha = 2 * rank** as starting point
+3. **Target attention + MLP layers** for best quality/efficiency
+4. **Enable gradient checkpointing** for memory savings
+5. **Save adapters frequently** (small files, easy rollback)
+6. **Evaluate on held-out data** before merging
+7. **Use QLoRA for 70B+ models** when GPU memory is limited (plan for a 48GB+ GPU at 70B)
+
+## References
+
+- **[Advanced Usage](references/advanced-usage.md)** - DoRA, LoftQ, rank stabilization, custom modules
+- **[Troubleshooting](references/troubleshooting.md)** - Common errors, debugging, optimization
+
+## Resources
+
+- **GitHub**: https://github.com/huggingface/peft
+- **Docs**: https://huggingface.co/docs/peft
+- **LoRA Paper**: arXiv:2106.09685
+- **QLoRA Paper**: arXiv:2305.14314
+- **Models**: https://huggingface.co/models?library=peft