Sync all skills and memories 2026-04-14 07:27

2026-04-14 07:27:20 +09:00
parent 516bb44fe6
commit 1eba2bca95
386 changed files with 167655 additions and 0 deletions
--- a/skills/mlops/inference/vllm/references/troubleshooting.md
+++ b/skills/mlops/inference/vllm/references/troubleshooting.md
@@ -0,0 +1,447 @@
+# Troubleshooting Guide
+
+## Contents
+- Out of memory (OOM) errors
+- Performance issues
+- Model loading errors
+- Network and connection issues
+- Quantization problems
+- Distributed serving issues
+- Debugging tools and commands
+
+## Out of memory (OOM) errors
+
+### Symptom: `torch.cuda.OutOfMemoryError` during model loading
+
+**Cause**: Model weights plus KV cache exceed available VRAM
+
+**Solutions (try in order)**:
+
+1. **Reduce GPU memory utilization**:
+```bash
+vllm serve MODEL --gpu-memory-utilization 0.7  # Default is 0.9; try 0.8, 0.75, 0.7
+```
+
+2. **Reduce max sequence length**:
+```bash
+vllm serve MODEL --max-model-len 4096  # Instead of 8192
+```
+
+3. **Enable quantization**:
+```bash
+vllm serve MODEL --quantization awq  # Needs an AWQ checkpoint; weights ~4x smaller than FP16
+```
+
+4. **Use tensor parallelism** (multiple GPUs):
+```bash
+vllm serve MODEL --tensor-parallel-size 2  # Split across 2 GPUs
+```
+
+5. **Reduce max concurrent sequences**:
+```bash
+vllm serve MODEL --max-num-seqs 128  # Default is 256
+```
+
+### Symptom: OOM during inference (not model loading)
+
+**Cause**: KV cache fills up during generation
+
+**Solutions**:
+
+```bash
+# Reduce KV cache allocation
+vllm serve MODEL --gpu-memory-utilization 0.85
+
+# Reduce batch size
+vllm serve MODEL --max-num-seqs 64
+
+# Reduce max tokens per request
+# Set in client request: max_tokens=512
+```
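+
+A rough estimate of KV cache demand can help when choosing `--max-num-seqs` and `--max-model-len`. The sketch below uses the standard per-token formula (one key and one value vector per layer). The model shape is an illustrative assumption; read the real values from the model's `config.json`. vLLM allocates the cache in blocks, so treat the result as an approximation.
+
+```python
+def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
+    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
+    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
+
+
+def kv_cache_gib(num_tokens, **model_shape):
+    """Total KV cache in GiB for num_tokens tokens held in the cache at once."""
+    return num_tokens * kv_cache_bytes_per_token(**model_shape) / 2**30
+
+
+# Assumed shape for illustration: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
+shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
+per_token = kv_cache_bytes_per_token(**shape)  # 131072 bytes (128 KiB)
+worst_case = kv_cache_gib(64 * 4096, **shape)  # 64 sequences x 4096 tokens
+print(per_token, worst_case)  # 131072 32.0
+```
+
+With this assumed shape, 64 full-length sequences would need about 32 GiB of cache, which shows why lowering `--max-num-seqs` or `--max-model-len` relieves memory pressure.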
+
+### Symptom: OOM with quantized model
+
+**Cause**: Quantization overhead or incorrect configuration
+
+**Solution**:
+```bash
+# Make sure the quantization flag matches the checkpoint
+vllm serve TheBloke/Llama-2-70B-AWQ --quantization awq  # Usually auto-detected; set explicitly if not
+
+# AWQ kernels expect half precision
+vllm serve MODEL --quantization awq --dtype float16
+```
+
+## Performance issues
+
+### Symptom: Low throughput (e.g., <50 req/sec when >100 is expected)
+
+**Diagnostic steps**:
+
+1. **Check GPU utilization**:
+```bash
+watch -n 1 nvidia-smi
+# GPU utilization should be >80%
+```
+
+If <80%, increase concurrent requests:
+```bash
+vllm serve MODEL --max-num-seqs 512  # Increase from 256
+```
+
+2. **Check if memory-bound**:
+```bash
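+# vLLM preallocates GPU memory, so nvidia-smi always shows it nearly full.
+# If the KV cache itself is full but GPU utilization is <80%, reduce sequence length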
+vllm serve MODEL --max-model-len 4096
+```
+
+3. **Enable optimizations**:
+```bash
+vllm serve MODEL \
+  --enable-prefix-caching \
+  --enable-chunked-prefill \
+  --max-num-seqs 512
+```
+
+4. **Check tensor parallelism settings**:
+```bash
+# TP size must evenly divide the model's attention head count; powers of 2 are the usual choice
+vllm serve MODEL --tensor-parallel-size 4  # e.g., not 3 or 5 for a 32-head model
+```
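+
+A quick way to shortlist tensor-parallel sizes is to check which ones divide the model's attention head count evenly, which is the condition vLLM generally requires. The helper below is a hypothetical sketch, not part of vLLM; take the head count from `num_attention_heads` in the model's `config.json`.
+
+```python
+def valid_tp_sizes(num_attention_heads, num_gpus):
+    """Return tensor-parallel sizes up to num_gpus that evenly divide the head count."""
+    return [tp for tp in range(1, num_gpus + 1) if num_attention_heads % tp == 0]
+
+
+print(valid_tp_sizes(32, 8))  # [1, 2, 4, 8]
+print(valid_tp_sizes(48, 8))  # [1, 2, 3, 4, 6, 8]
+```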
+
+### Symptom: High TTFT (time to first token >1 second)
+
+**Causes and solutions**:
+
+**Long prompts**:
+```bash
+vllm serve MODEL --enable-chunked-prefill
+```
+
+**No prefix caching**:
+```bash
+vllm serve MODEL --enable-prefix-caching  # For repeated prompts
+```
+
+**Too many concurrent requests**:
+```bash
+vllm serve MODEL --max-num-seqs 64  # Reduce to prioritize latency
+```
+
+**Model too large for single GPU**:
+```bash
+vllm serve MODEL --tensor-parallel-size 2  # Parallelize prefill
+```
+
+### Symptom: Slow token generation (low tokens/sec)
+
+**Diagnostic**:
+```bash
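+# Confirm the expected model and dtype were loaded (printed in the startup logs)
+vllm serve MODEL
+
+# Try speculative decoding with a small draft model
+# (flags apply to older releases; newer ones take --speculative-config with a JSON value)
+vllm serve MODEL --speculative-model DRAFT_MODEL --num-speculative-tokens 5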
+```
+
+**For GPUs with FP8 support (e.g., H100)**, enable FP8:
+```bash
+vllm serve MODEL --quantization fp8
+```
+
+## Model loading errors
+
+### Symptom: `OSError: MODEL not found`
+
+**Causes**:
+
+1. **Model name typo**:
+```bash
+# Check the exact repository name on Hugging Face (names are case-sensitive)
+vllm serve meta-llama/Meta-Llama-3-8B-Instruct
+```
+
+2. **Private/gated model**:
+```bash
+# Log in to Hugging Face first (gated models also require accepting the license on the model page)
+huggingface-cli login
+# Then run vLLM
+vllm serve meta-llama/Meta-Llama-3-70B-Instruct
+```
+
+3. **Custom model needs trust flag**:
+```bash
+vllm serve MODEL --trust-remote-code
+```
+
+### Symptom: `ValueError: Tokenizer not found`
+
+**Solution**:
+```bash
+# Download the tokenizer manually first
+python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('MODEL')"
+
+# Then launch vLLM
+vllm serve MODEL
+```
+
+### Symptom: `ImportError: No module named 'flash_attn'`
+
+**Solution**:
+```bash
+# Install flash attention
+pip install flash-attn --no-build-isolation
+
+# Or select a different attention backend (available names depend on the vLLM version)
+VLLM_ATTENTION_BACKEND=XFORMERS vllm serve MODEL
+```
+
+## Network and connection issues
+
+### Symptom: `Connection refused` when querying server
+
+**Diagnostic**:
+
+1. **Check server is running**:
+```bash
+curl http://localhost:8000/health
+```
+
+2. **Check port binding**:
+```bash
+# Bind to all interfaces for remote access
+vllm serve MODEL --host 0.0.0.0 --port 8000
+
+# Check if port is in use
+lsof -i :8000
+```
+
+3. **Check firewall**:
+```bash
+# Allow port through firewall
+sudo ufw allow 8000
+```
+
+### Symptom: Slow response times over network
+
+**Solutions**:
+
+1. **Increase timeout**:
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="EMPTY",
+    timeout=300.0  # 5 minute timeout
+)
+```
+
+2. **Check network latency**:
+```bash
+ping SERVER_IP  # Should be <10ms for local network
+```
+
+3. **Use connection pooling**:
+```python
+import requests
+from requests.adapters import HTTPAdapter
+from urllib3.util.retry import Retry
+
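+# A Session reuses TCP connections; the adapter sets pool size and retry policy
+session = requests.Session()
+retries = Retry(total=3, backoff_factor=1)
+adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=retries)
+session.mount('http://', adapter)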
+```
+
+## Quantization problems
+
+### Symptom: `RuntimeError: Quantization format not supported`
+
+**Solution**:
+```bash
+# Ensure correct quantization method
+vllm serve MODEL --quantization awq  # For AWQ models
+vllm serve MODEL --quantization gptq  # For GPTQ models
+
+# Check model card for quantization type
+```
+
+### Symptom: Poor quality outputs after quantization
+
+**Diagnostic**:
+
+1. **Verify model is correctly quantized**:
+```bash
+# Check the model's config.json for a quantization_config entry
+cat ~/.cache/huggingface/hub/models--ORG--NAME/snapshots/*/config.json
+```
+
+2. **Try different quantization method**:
+```bash
+# If AWQ quality is poor, try FP8 (needs FP8-capable hardware such as H100)
+vllm serve MODEL --quantization fp8
+
+# Or serve the unquantized checkpoint as a quality baseline
+vllm serve MODEL  # No quantization
+```
+
+3. **Adjust sampling parameters** (a higher temperature gives more diverse output):
+```python
+from vllm import SamplingParams
+
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+```
+
+## Distributed serving issues
+
+### Symptom: `RuntimeError: Distributed init failed`
+
+**Diagnostic**:
+
+1. **Check environment variables**:
+```bash
+# On all nodes
+echo $MASTER_ADDR  # Should be the same on every node
+echo $MASTER_PORT  # Should be the same on every node
+echo $RANK  # Should be unique per process (0, 1, 2, ...)
+echo $WORLD_SIZE  # Should be the same everywhere (total number of processes)
+```
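+
+Once the values from each node are collected, the consistency rules above can be checked mechanically. The function below is a hypothetical helper, not a vLLM or PyTorch API. It assumes one dictionary of environment variables per process.
+
+```python
+def check_dist_env(envs):
+    """Return a list of problems found in per-process env dicts (empty if consistent)."""
+    problems = []
+    # These variables must be set and identical everywhere.
+    for key in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE"):
+        values = {env.get(key) for env in envs}
+        if len(values) != 1 or None in values:
+            problems.append(f"{key} differs or is unset: {sorted(map(str, values))}")
+    # Ranks must be exactly 0..N-1 with no duplicates or gaps.
+    ranks = sorted(int(env["RANK"]) for env in envs if "RANK" in env)
+    if ranks != list(range(len(envs))):
+        problems.append(f"RANK values are not 0..{len(envs) - 1}: {ranks}")
+    return problems
+
+
+# Illustrative values only; 10.0.0.1 is a placeholder address.
+nodes = [
+    {"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500", "RANK": "0", "WORLD_SIZE": "2"},
+    {"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500", "RANK": "0", "WORLD_SIZE": "2"},
+]
+print(check_dist_env(nodes))  # ['RANK values are not 0..1: [0, 0]']
+```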
+
+2. **Check network connectivity**:
+```bash
+# From node 1 to node 2
+ping NODE2_IP
+nc -zv NODE2_IP 29500  # Check port accessibility
+```
+
+3. **Check NCCL settings**:
+```bash
+export NCCL_DEBUG=INFO
+export NCCL_SOCKET_IFNAME=eth0  # Or your network interface
+vllm serve MODEL --tensor-parallel-size 8
+```
+
+### Symptom: `NCCL error: unhandled cuda error`
+
+**Solutions**:
+
+```bash
+# Set NCCL to use correct network interface
+export NCCL_SOCKET_IFNAME=eth0  # Replace with your interface
+
+# Increase timeout
+export NCCL_TIMEOUT=1800  # 30 minutes
+
+# Disable P2P transport to rule it out while debugging
+export NCCL_P2P_DISABLE=1
+```
+
+## Debugging tools and commands
+
+### Enable debug logging
+
+```bash
+export VLLM_LOGGING_LEVEL=DEBUG
+vllm serve MODEL
+```
+
+### Monitor GPU usage
+
+```bash
+# Real-time GPU monitoring
+watch -n 1 nvidia-smi
+
+# Memory breakdown
+nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1
+```
+
+### Profile performance
+
+```bash
+# Built-in benchmarking
+vllm bench throughput \
+  --model MODEL \
+  --input-len 128 \
+  --output-len 256 \
+  --num-prompts 100
+
+vllm bench latency \
+  --model MODEL \
+  --input-len 128 \
+  --output-len 256 \
+  --batch-size 8
+```
+
+### Check metrics
+
+```bash
+# Prometheus metrics (served by the vLLM API server on its own port)
+curl http://localhost:8000/metrics
+
+# Filter for specific metrics
+curl -s http://localhost:8000/metrics | grep vllm:time_to_first_token
+
+# Key metrics to monitor:
+# - vllm:time_to_first_token_seconds
+# - vllm:time_per_output_token_seconds
+# - vllm:num_requests_running
+# - vllm:gpu_cache_usage_perc
+# - vllm:request_success_total
+```
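+
+For scripted checks, the `/metrics` output can be parsed directly, since it uses the Prometheus text format. This is a minimal sketch assuming samples carry no trailing timestamp; the sample text, including its label values, is made up for illustration.
+
+```python
+def parse_metrics(text, prefix="vllm"):
+    """Parse Prometheus text-format samples into {name_with_labels: value}."""
+    metrics = {}
+    for line in text.splitlines():
+        line = line.strip()
+        # Skip blanks, HELP/TYPE comments, and metrics from other exporters.
+        if not line or line.startswith("#") or not line.startswith(prefix):
+            continue
+        name, _, value = line.rpartition(" ")
+        try:
+            metrics[name] = float(value)
+        except ValueError:
+            continue  # Ignore malformed lines rather than failing.
+    return metrics
+
+
+sample = """# TYPE vllm:num_requests_running gauge
+vllm:num_requests_running{model_name="MODEL"} 3.0
+vllm:gpu_cache_usage_perc{model_name="MODEL"} 0.42
+"""
+for name, value in parse_metrics(sample).items():
+    print(name, value)
+```
+
+In practice the text would come from the server's `/metrics` endpoint, for example fetched with `urllib.request.urlopen`.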
+
+### Test server health
+
+```bash
+# Health check
+curl http://localhost:8000/health
+
+# Model info
+curl http://localhost:8000/v1/models
+
+# Test completion
+curl http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "MODEL",
+    "prompt": "Hello",
+    "max_tokens": 10
+  }'
+```
+
+### Common environment variables
+
+```bash
+# CUDA settings
+export CUDA_VISIBLE_DEVICES=0,1,2,3  # Limit to specific GPUs
+
+# vLLM settings
+export VLLM_LOGGING_LEVEL=DEBUG
+export VLLM_TRACE_FUNCTION=1  # Trace function calls (very verbose; debugging only)
+export VLLM_USE_V1=1  # Select the V1 engine (already the default in recent releases)
+
+# NCCL settings (distributed)
+export NCCL_DEBUG=INFO
+export NCCL_SOCKET_IFNAME=eth0
+export NCCL_IB_DISABLE=0  # Keep InfiniBand enabled (1 disables it)
+```
+
+### Collect diagnostic info for bug reports
+
+```bash
+# System info
+nvidia-smi
+python --version
+pip show vllm
+
+# vLLM version and config
+vllm --version
+python -c "import vllm; print(vllm.__version__)"
+
+# Run with debug logging
+export VLLM_LOGGING_LEVEL=DEBUG
+vllm serve MODEL 2>&1 | tee vllm_debug.log
+
+# Include in bug report:
+# - vllm_debug.log
+# - nvidia-smi output
+# - Full command used
+# - Expected vs actual behavior
+```