# Comprehensive Hyperparameter Sweeps Guide

A complete guide to hyperparameter optimization with W&B Sweeps.

## Table of Contents

- Sweep Configuration
- Search Strategies
- Parameter Distributions
- Early Termination
- Training Function
- Parallel Execution
- Advanced Patterns
- Real-World Examples
- Best Practices
- Resources

## Sweep Configuration

### Basic Sweep Config

```python
import wandb

sweep_config = {
    'method': 'bayes',  # Search strategy
    'metric': {
        'name': 'val/accuracy',
        'goal': 'maximize'  # or 'minimize'
    },
    'parameters': {
        'learning_rate': {
            # 'log_uniform_values' takes the actual bounds;
            # plain 'log_uniform' expects bounds in natural-log space
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'batch_size': {
            'values': [16, 32, 64, 128]
        }
    }
}

# Initialize sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")
```

### Complete Config Example

```python
sweep_config = {
    # Required: search method
    'method': 'bayes',

    # Required for bayes: optimization metric
    'metric': {
        'name': 'val/f1_score',
        'goal': 'maximize'
    },

    # Required: parameters to search
    'parameters': {
        # Continuous parameter on a log scale
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },

        # Discrete values
        'batch_size': {
            'values': [16, 32, 64, 128]
        },

        # Categorical
        'optimizer': {
            'values': ['adam', 'sgd', 'rmsprop', 'adamw']
        },

        # Uniform distribution
        'dropout': {
            'distribution': 'uniform',
            'min': 0.1,
            'max': 0.5
        },

        # Integer range
        'num_layers': {
            'distribution': 'int_uniform',
            'min': 2,
            'max': 10
        },

        # Fixed value (constant across runs)
        'epochs': {
            'value': 50
        }
    },

    # Optional: early termination (specify min_iter or max_iter, not both)
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 5,
        'eta': 3
    }
}
```
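One practical detail the configs above don't show: if you pass defaults to `wandb.init`, values sampled by the sweep override matching keys, so the same script runs both standalone and under an agent. A minimal sketch (the default values here are illustrative):

```python
import wandb

# Illustrative defaults; a sweep agent overrides any matching keys
defaults = {'learning_rate': 1e-3, 'batch_size': 32}

def train():
    # Standalone run: the defaults are used as-is.
    # Under a sweep agent: sweep-sampled values take precedence.
    run = wandb.init(config=defaults)
    config = wandb.config
    print(config.learning_rate, config.batch_size)
    wandb.finish()
```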
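Before diving into search strategies, here is the full loop in one place: a sweep needs a training function that reads `wandb.config` and logs the metric named in `sweep_config['metric']`. A minimal end-to-end sketch, where `evaluate_somehow` is a hypothetical stand-in for real training:

```python
import wandb

def train():
    # The agent injects this run's sampled hyperparameters via wandb.config
    run = wandb.init()
    config = wandb.config

    for epoch in range(5):
        # Hypothetical stand-in; replace with a real training epoch
        val_accuracy = evaluate_somehow(config.learning_rate, config.batch_size)
        # Log the metric named in sweep_config['metric']['name']
        wandb.log({'val/accuracy': val_accuracy, 'epoch': epoch})

    wandb.finish()

sweep_id = wandb.sweep(sweep_config, project="my-project")
wandb.agent(sweep_id, function=train, count=10)
```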
## Search Strategies

### 1. Grid Search

Exhaustively search every combination.

```python
sweep_config = {
    'method': 'grid',
    'parameters': {
        'learning_rate': {
            'values': [0.001, 0.01, 0.1]
        },
        'batch_size': {
            'values': [16, 32, 64]
        },
        'optimizer': {
            'values': ['adam', 'sgd']
        }
    }
}

# Total runs: 3 × 3 × 2 = 18
```

**Pros:**

- Comprehensive search
- Reproducible results
- No randomness

**Cons:**

- Run count grows exponentially with the number of parameters
- Inefficient for continuous parameters
- Not scalable beyond 3-4 parameters

**When to use:**

- Few parameters (< 4)
- All values discrete
- Complete coverage needed

### 2. Random Search

Randomly sample parameter combinations.

```python
sweep_config = {
    'method': 'random',
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'batch_size': {
            'values': [16, 32, 64, 128, 256]
        },
        'dropout': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.5
        },
        'num_layers': {
            'distribution': 'int_uniform',
            'min': 2,
            'max': 8
        }
    }
}

# Run 100 random trials
wandb.agent(sweep_id, function=train, count=100)
```

**Pros:**

- Scales to many parameters
- Can run indefinitely
- Often finds good solutions quickly

**Cons:**

- Does not learn from previous runs
- May miss the optimal region
- Results vary with the random seed

**When to use:**

- Many parameters (> 4)
- Quick exploration
- Limited budget

### 3. Bayesian Optimization (Recommended)

Learns from previous trials to sample promising regions.

```python
sweep_config = {
    'method': 'bayes',
    'metric': {
        'name': 'val/loss',
        'goal': 'minimize'
    },
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'weight_decay': {
            'distribution': 'log_uniform_values',
            'min': 1e-6,
            'max': 1e-2
        },
        'dropout': {
            'distribution': 'uniform',
            'min': 0.1,
            'max': 0.5
        },
        'num_layers': {
            'values': [2, 3, 4, 5, 6]
        }
    }
}
```

**Pros:**

- Most sample-efficient
- Learns from past trials
- Focuses on promising regions

**Cons:**

- Needs an initial random exploration phase
- May get stuck in local optima
- Slower per iteration

**When to use:**

- Expensive training runs
- Best performance needed
- Limited compute budget

## Parameter Distributions

### Continuous Distributions

```python
# Log-uniform over actual values: good for learning rates, regularization
'learning_rate': {
    'distribution': 'log_uniform_values',
    'min': 1e-6,
    'max': 1e-1
}

# Uniform: good for dropout, momentum
'dropout': {
    'distribution': 'uniform',
    'min': 0.0,
    'max': 0.5
}

# Normal distribution
'parameter': {
    'distribution': 'normal',
    'mu': 0.5,
    'sigma': 0.1
}

# Log-normal distribution
'parameter': {
    'distribution': 'log_normal',
    'mu': 0.0,
    'sigma': 1.0
}
```

### Discrete Distributions

```python
# Fixed set of values
'batch_size': {
    'values': [16, 32, 64, 128, 256]
}

# Integer uniform
'num_layers': {
    'distribution': 'int_uniform',
    'min': 2,
    'max': 10
}

# Quantized uniform (step size q)
'layer_size': {
    'distribution': 'q_uniform',
    'min': 32,
    'max': 512,
    'q': 32  # Steps of 32: 32, 64, 96, 128, ...
}

# Quantized log-uniform over actual values
'hidden_size': {
    'distribution': 'q_log_uniform_values',
    'min': 32,
    'max': 1024,
    'q': 32
}
```

### Categorical Parameters

```python
# Optimizers
'optimizer': {
    'values': ['adam', 'sgd', 'rmsprop', 'adamw']
}

# Model architectures
'model': {
    'values': ['resnet18', 'resnet34', 'resnet50', 'efficientnet_b0']
}

# Activation functions
'activation': {
    'values': ['relu', 'gelu', 'silu', 'leaky_relu']
}
```
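A note on the two log-uniform spellings used above: with `log_uniform_values` the bounds are the actual values, while plain `log_uniform` expects bounds already in natural-log space. A NumPy sketch of what value-space bounds sample like:

```python
import numpy as np

rng = np.random.default_rng(0)

# 'log_uniform_values' with min=1e-5, max=1e-1 draws values whose
# logarithms are uniform, spreading samples evenly across magnitudes
values = np.exp(rng.uniform(np.log(1e-5), np.log(1e-1), size=5))
print(values)

# Plain 'log_uniform' would need the same bounds given as exponents:
# ln(1e-5) ≈ -11.5 and ln(1e-1) ≈ -2.3
print(np.log(1e-5), np.log(1e-1))
```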
## Early Termination

Stop underperforming runs early to save compute.

### Hyperband

```python
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
    'parameters': {...},

    # Hyperband early termination; specify max_iter (with s) or min_iter
    'early_terminate': {
        'type': 'hyperband',
        'max_iter': 27,  # Maximum iterations per run
        's': 2,          # Number of brackets
        'eta': 3         # Downsampling rate between brackets
    }
}
```

**How it works:**

- Runs trials in brackets
- Keeps the top 1/eta performers each round
- Eliminates the bottom performers early

### Custom Termination

```python
def train():
    run = wandb.init()

    best_acc = 0.0
    for epoch in range(MAX_EPOCHS):
        loss = train_epoch()
        val_acc = validate()

        wandb.log({'val/accuracy': val_acc, 'epoch': epoch})

        # Custom early stopping: clearly underperforming
        if epoch > 5 and val_acc < 0.5:
            print("Early stop: poor performance")
            break

        # Custom early stopping: no meaningful improvement on the best so far
        if epoch > 10 and val_acc < best_acc + 0.01:
            print("Early stop: no improvement")
            break

        best_acc = max(best_acc, val_acc)
```

## Training Function

### Basic Template

```python
def train():
    # Initialize the W&B run
    run = wandb.init()

    # Get this run's hyperparameters
    config = wandb.config

    # Build the model from the config
    # (build_model, create_optimizer, train_epoch, and validate are
    # user-defined helpers; sketches follow the PyTorch example below)
    model = build_model(
        hidden_size=config.hidden_size,
        num_layers=config.num_layers,
        dropout=config.dropout
    )

    # Create the optimizer
    optimizer = create_optimizer(
        model.parameters(),
        name=config.optimizer,
        lr=config.learning_rate,
        weight_decay=config.weight_decay
    )

    # Training loop
    for epoch in range(config.epochs):
        # Train
        train_loss, train_acc = train_epoch(
            model, optimizer, train_loader, config.batch_size
        )

        # Validate
        val_loss, val_acc = validate(model, val_loader)

        # Log metrics
        wandb.log({
            'train/loss': train_loss,
            'train/accuracy': train_acc,
            'val/loss': val_loss,
            'val/accuracy': val_acc,
            'epoch': epoch
        })

    # Log the final model
    torch.save(model.state_dict(), 'model.pth')
    wandb.save('model.pth')

    # Finish the run
    wandb.finish()
```

### With PyTorch

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import wandb

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def train():
    run = wandb.init()
    config = wandb.config

    # Data
    train_loader = DataLoader(
        train_dataset,
        batch_size=config.batch_size,
        shuffle=True
    )

    # Model
    model = ResNet(
        num_classes=config.num_classes,
        dropout=config.dropout
    ).to(device)

    # Optimizer
    if config.optimizer == 'adam':
        optimizer = torch.optim.Adam(
            model.parameters(),
            lr=config.learning_rate,
            weight_decay=config.weight_decay
        )
    elif config.optimizer == 'sgd':
        optimizer = torch.optim.SGD(
            model.parameters(),
            lr=config.learning_rate,
            momentum=config.momentum,
            weight_decay=config.weight_decay
        )

    # Scheduler
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=config.epochs
    )

    criterion = nn.CrossEntropyLoss()

    # Training
    for epoch in range(config.epochs):
        model.train()
        train_loss = 0.0

        for data, target in train_loader:
            data, target = data.to(device), target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        # Validation
        model.eval()
        val_loss, val_acc = validate(model, val_loader)

        # Step the scheduler
        scheduler.step()

        # Log
        wandb.log({
            'train/loss': train_loss / len(train_loader),
            'val/loss': val_loss,
            'val/accuracy': val_acc,
            'learning_rate': scheduler.get_last_lr()[0],
            'epoch': epoch
        })
```
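Both templates above lean on helpers they never define. Hedged sketches of `create_optimizer` and `validate` under the same assumptions (a classification task, `CrossEntropyLoss`, and the module-level `device` from the PyTorch example):

```python
import torch
import torch.nn as nn

def create_optimizer(params, name, lr, weight_decay=0.0, momentum=0.9):
    # Map the sweep's categorical 'optimizer' choice onto torch optimizers
    if name == 'adam':
        return torch.optim.Adam(params, lr=lr, weight_decay=weight_decay)
    if name == 'adamw':
        return torch.optim.AdamW(params, lr=lr, weight_decay=weight_decay)
    if name == 'sgd':
        return torch.optim.SGD(params, lr=lr, momentum=momentum,
                               weight_decay=weight_decay)
    if name == 'rmsprop':
        return torch.optim.RMSprop(params, lr=lr, weight_decay=weight_decay)
    raise ValueError(f"Unknown optimizer: {name}")

@torch.no_grad()
def validate(model, val_loader):
    # Mean loss and top-1 accuracy over the validation set
    model.eval()
    criterion = nn.CrossEntropyLoss()
    total_loss, correct, total = 0.0, 0, 0
    for data, target in val_loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        total_loss += criterion(output, target).item() * target.size(0)
        correct += (output.argmax(dim=1) == target).sum().item()
        total += target.size(0)
    return total_loss / total, correct / total
```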
## Parallel Execution

### Multiple Agents

Run sweep agents in parallel to speed up the search.

```python
# Initialize the sweep once
sweep_id = wandb.sweep(sweep_config, project="my-project")

# Then start one agent per process; each call below runs in its own terminal

# Agent 1 (Terminal 1)
wandb.agent(sweep_id, function=train, count=20)

# Agent 2 (Terminal 2)
wandb.agent(sweep_id, function=train, count=20)

# Agent 3 (Terminal 3)
wandb.agent(sweep_id, function=train, count=20)

# Total: 60 runs across 3 agents
```

### Multi-GPU Execution

```python
import os

def train():
    run = wandb.init()
    config = wandb.config

    # With CUDA_VISIBLE_DEVICES set per agent, the visible GPU is always
    # cuda:0 inside this process, whatever its physical index is
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = build_model(config).to(device)  # build model as in the templates above

    # ... rest of training ...

# Run one agent per GPU, each in its own terminal:
# Terminal 1:
#   CUDA_VISIBLE_DEVICES=0 wandb agent sweep_id
# Terminal 2:
#   CUDA_VISIBLE_DEVICES=1 wandb agent sweep_id
# Terminal 3:
#   CUDA_VISIBLE_DEVICES=2 wandb agent sweep_id
```

## Advanced Patterns

### Nested Parameters

```python
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
    'parameters': {
        'model': {
            'parameters': {
                'type': {
                    'values': ['resnet', 'efficientnet']
                },
                'size': {
                    'values': ['small', 'medium', 'large']
                }
            }
        },
        'optimizer': {
            'parameters': {
                'type': {
                    'values': ['adam', 'sgd']
                },
                'lr': {
                    'distribution': 'log_uniform_values',
                    'min': 1e-5,
                    'max': 1e-1
                }
            }
        }
    }
}

# Access nested config (nested parameters arrive as dictionaries)
def train():
    run = wandb.init()
    model_type = wandb.config['model']['type']
    model_size = wandb.config['model']['size']
    opt_type = wandb.config['optimizer']['type']
    lr = wandb.config['optimizer']['lr']
```

### Conditional Parameters

```python
sweep_config = {
    'method': 'bayes',
    'parameters': {
        'optimizer': {
            'values': ['adam', 'sgd']
        },
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        # Always sampled, but only used when optimizer == 'sgd'
        'momentum': {
            'distribution': 'uniform',
            'min': 0.5,
            'max': 0.99
        }
    }
}

def train():
    run = wandb.init()
    config = wandb.config

    if config.optimizer == 'adam':
        optimizer = torch.optim.Adam(
            model.parameters(),
            lr=config.learning_rate
        )
    elif config.optimizer == 'sgd':
        optimizer = torch.optim.SGD(
            model.parameters(),
            lr=config.learning_rate,
            momentum=config.momentum  # Conditional parameter
        )
```

## Real-World Examples

### Image Classification

```python
sweep_config = {
    'method': 'bayes',
    'metric': {
        'name': 'val/top1_accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        # Model
        'architecture': {
            'values': ['resnet50', 'resnet101', 'efficientnet_b0', 'efficientnet_b3']
        },
        'pretrained': {
            'values': [True, False]
        },

        # Training
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-2
        },
        'batch_size': {
            'values': [16, 32, 64, 128]
        },
        'optimizer': {
            'values': ['adam', 'sgd', 'adamw']
        },
        'weight_decay': {
            'distribution': 'log_uniform_values',
            'min': 1e-6,
            'max': 1e-2
        },

        # Regularization
        'dropout': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.5
        },
        'label_smoothing': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.2
        },

        # Data augmentation
        'mixup_alpha': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 1.0
        },
        'cutmix_alpha': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 1.0
        }
    },
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 5
    }
}
```
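The config above searches over `mixup_alpha` without showing how it is consumed. A minimal sketch of the standard mixup formulation (Zhang et al., 2018), applied per batch inside the training loop:

```python
import numpy as np
import torch

def mixup_batch(data, target, alpha):
    # Mix each batch with a shuffled copy of itself; lambda ~ Beta(alpha, alpha)
    if alpha <= 0:
        return data, target, target, 1.0
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(data.size(0), device=data.device)
    mixed = lam * data + (1 - lam) * data[index]
    return mixed, target, target[index], lam

# Inside the training loop:
# mixed, y_a, y_b, lam = mixup_batch(data, target, config.mixup_alpha)
# output = model(mixed)
# loss = lam * criterion(output, y_a) + (1 - lam) * criterion(output, y_b)
```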
### NLP Fine-Tuning

```python
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'eval/f1', 'goal': 'maximize'},
    'parameters': {
        # Model
        'model_name': {
            'values': ['bert-base-uncased', 'roberta-base', 'distilbert-base-uncased']
        },

        # Training
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-6,
            'max': 1e-4
        },
        'per_device_train_batch_size': {
            'values': [8, 16, 32]
        },
        'num_train_epochs': {
            'values': [3, 4, 5]
        },
        'warmup_ratio': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.1
        },
        'weight_decay': {
            'distribution': 'log_uniform_values',
            'min': 1e-4,
            'max': 1e-1
        },

        # Optimizer
        'adam_beta1': {
            'distribution': 'uniform',
            'min': 0.8,
            'max': 0.95
        },
        'adam_beta2': {
            'distribution': 'uniform',
            'min': 0.95,
            'max': 0.999
        }
    }
}
```

## Best Practices

### 1. Start Small

```python
# Initial exploration: random search, 20 runs
sweep_config_v1 = {
    'method': 'random',
    'parameters': {...}
}
sweep_id_v1 = wandb.sweep(sweep_config_v1, project="my-project")
wandb.agent(sweep_id_v1, train, count=20)

# Refined search: Bayesian optimization over narrowed ranges
sweep_config_v2 = {
    'method': 'bayes',
    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 5e-5,  # Narrowed from the initial 1e-6 to 1e-4 range
            'max': 1e-4
        }
    }
}
```

### 2. Use Log Scales

```python
# ✅ Good: log scale for learning rate
'learning_rate': {
    'distribution': 'log_uniform_values',
    'min': 1e-6,
    'max': 1e-2
}

# ❌ Bad: a linear scale concentrates samples at the top of the range
'learning_rate': {
    'distribution': 'uniform',
    'min': 0.000001,
    'max': 0.01
}
```

### 3. Set Reasonable Ranges

```python
# Base ranges on prior knowledge
'learning_rate': {'distribution': 'log_uniform_values', 'min': 1e-5, 'max': 1e-3},  # Typical for Adam
'batch_size': {'values': [16, 32, 64]},  # Bounded by GPU memory
'dropout': {'distribution': 'uniform', 'min': 0.1, 'max': 0.5}  # Too high hurts training
```

### 4. Monitor Resource Usage

```python
def train():
    run = wandb.init()

    # Log GPU memory alongside training metrics
    wandb.log({
        'system/gpu_memory_allocated': torch.cuda.memory_allocated(),
        'system/gpu_memory_reserved': torch.cuda.memory_reserved()
    })
```

### 5. Save Best Models

```python
def train():
    run = wandb.init()
    config = wandb.config

    best_acc = 0.0
    for epoch in range(config.epochs):
        val_acc = validate(model)

        if val_acc > best_acc:
            best_acc = val_acc
            # Save the best checkpoint
            torch.save(model.state_dict(), 'best_model.pth')
            wandb.save('best_model.pth')
```

## Resources

- **Sweeps Documentation**: https://docs.wandb.ai/guides/sweeps
- **Configuration Reference**: https://docs.wandb.ai/guides/sweeps/configuration
- **Examples**: https://github.com/wandb/examples/tree/master/examples/wandb-sweeps