Sync all skills and memories 2026-04-14 07:27

skills/mlops/inference/vllm/SKILL.md (new file, 367 lines)
@@ -0,0 +1,367 @@
---
name: serving-llms-vllm
description: Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [vllm, torch, transformers]
metadata:
  hermes:
    tags: [vLLM, Inference Serving, PagedAttention, Continuous Batching, High Throughput, Production, OpenAI API, Quantization, Tensor Parallelism]
---
# vLLM - High-Performance LLM Serving

## Quick start

vLLM achieves up to 24x higher throughput than the standard Hugging Face Transformers library through PagedAttention (a block-based KV cache) and continuous batching (mixing prefill and decode requests in the same batch).

**Installation**:
```bash
pip install vllm
```

**Basic offline inference**:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
```

**OpenAI-compatible server**:
```bash
vllm serve meta-llama/Llama-3-8B-Instruct

# Query with OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
```

## Common workflows

### Workflow 1: Production API deployment

Copy this checklist and track progress:

```
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics
```

**Step 1: Configure server settings**

Choose a configuration based on your model size:

```bash
# For 7B-13B models on a single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --port 8000

# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --quantization awq \
  --port 8000

# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --enable-metrics \
  --metrics-port 9090 \
  --port 8000 \
  --host 0.0.0.0
```

**Step 2: Test with limited traffic**

Run a load test before going to production:

```bash
# Install load testing tool
pip install locust

# Create test_load.py with sample requests
# Run: locust -f test_load.py --host http://localhost:8000
```

Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
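A minimal `test_load.py` sketch for the locust run above (assumptions: the OpenAI-compatible chat endpoint and the model name used in this guide; replace the payload with prompts representative of your traffic):

```python
from locust import HttpUser, between, task

class VLLMUser(HttpUser):
    wait_time = between(0.1, 0.5)  # simulated client think time

    @task
    def chat_completion(self):
        # Small fixed request; swap in representative prompts for real tests
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "meta-llama/Llama-3-8B-Instruct",
                "messages": [{"role": "user", "content": "Ping"}],
                "max_tokens": 32,
            },
        )
```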
**Step 3: Enable monitoring**

vLLM exposes Prometheus metrics on port 9090:

```bash
curl http://localhost:9090/metrics | grep vllm
```

Key metrics to monitor:
- `vllm:time_to_first_token_seconds` - Latency
- `vllm:num_requests_running` - Active requests
- `vllm:gpu_cache_usage_perc` - KV cache utilization

**Step 4: Deploy to production**

Use Docker for consistent deployment:

```bash
# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching
```

**Step 5: Verify performance metrics**

Check that the deployment meets its targets (a scripted check follows the list):
- TTFT < 500ms (for short prompts)
- Throughput > target req/sec
- GPU utilization > 80%
- No OOM errors in logs
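A quick scripted check, as a sketch; the metric names and the 9090 port follow the configuration above, so adjust if yours differ:

```python
import urllib.request

# Scrape the Prometheus endpoint exposed in Step 3
text = urllib.request.urlopen("http://localhost:9090/metrics").read().decode()

for needle in ("time_to_first_token", "num_requests_running", "gpu_cache_usage"):
    samples = [line for line in text.splitlines()
               if needle in line and not line.startswith("#")]
    print(needle, "->", samples[0] if samples else "MISSING")
```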
### Workflow 2: Offline batch inference

For processing large datasets without server overhead.

Copy this checklist:

```
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results
```

**Step 1: Prepare input data**

```python
# Load prompts from file
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f]

print(f"Loaded {len(prompts)} prompts")
```

**Step 2: Configure LLM engine**

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # Use 2 GPUs
    gpu_memory_utilization=0.9,
    max_model_len=4096
)

sampling = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "\n\n"]
)
```

**Step 3: Run batch inference**

vLLM automatically batches requests for efficiency:

```python
# Process all prompts in one call
outputs = llm.generate(prompts, sampling)

# vLLM handles batching internally
# No need to manually chunk prompts
```

**Step 4: Process results**

```python
import json

# Extract generated text
results = []
for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    results.append({
        "prompt": prompt,
        "generated": generated,
        "tokens": len(output.outputs[0].token_ids)
    })

# Save to file
with open("results.jsonl", "w") as f:
    for result in results:
        f.write(json.dumps(result) + "\n")

print(f"Processed {len(results)} prompts")
```

### Workflow 3: Quantized model serving

Fit large models in limited GPU memory.

```
Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy
```

**Step 1: Choose quantization method**

- **AWQ**: Best for 70B models, minimal accuracy loss
- **GPTQ**: Wide model support, good compression
- **FP8**: Fastest on H100 GPUs

**Step 2: Find or create quantized model**

Use pre-quantized models from HuggingFace:

```bash
# Search for AWQ models
# Example: TheBloke/Llama-2-70B-AWQ
```

**Step 3: Launch with quantization flag**

```bash
# Using a pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95

# Result: 70B model in ~40GB VRAM
```

**Step 4: Verify accuracy**

Test outputs match expected quality:

```python
# Compare quantized vs non-quantized responses
# Verify task-specific performance unchanged
```
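One way to do this is a side-by-side spot check against the running server; a sketch (the prompts are placeholders, and the model name follows Step 3 above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
prompts = ["Summarize the theory of relativity in two sentences.",
           "Write a Python function that reverses a string."]

for prompt in prompts:
    out = client.chat.completions.create(
        model="TheBloke/Llama-2-70B-AWQ",
        messages=[{"role": "user", "content": prompt}],
    )
    # Compare these outputs against the non-quantized model's answers
    print(f"--- {prompt}\n{out.choices[0].message.content[:200]}")
```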
## When to use vs alternatives

**Use vLLM when:**
- Deploying production LLM APIs (100+ req/sec)
- Serving OpenAI-compatible endpoints
- Limited GPU memory but need large models
- Multi-user applications (chatbots, assistants)
- Need low latency with high throughput

**Use alternatives instead:**
- **llama.cpp**: CPU/edge inference, single-user
- **HuggingFace transformers**: Research, prototyping, one-off generation
- **TensorRT-LLM**: NVIDIA-only, need absolute maximum performance
- **Text-Generation-Inference**: Already in HuggingFace ecosystem

## Common issues

**Issue: Out of memory during model loading**

Reduce memory usage:
```bash
vllm serve MODEL \
  --gpu-memory-utilization 0.7 \
  --max-model-len 4096
```

Or use quantization:
```bash
vllm serve MODEL --quantization awq
```

**Issue: Slow first token (TTFT > 1 second)**

Enable prefix caching for repeated prompts:
```bash
vllm serve MODEL --enable-prefix-caching
```

For long prompts, enable chunked prefill:
```bash
vllm serve MODEL --enable-chunked-prefill
```

**Issue: Model not found error**

Use `--trust-remote-code` for custom models:
```bash
vllm serve MODEL --trust-remote-code
```

**Issue: Low throughput (<50 req/sec)**

Increase concurrent sequences:
```bash
vllm serve MODEL --max-num-seqs 512
```

Check GPU utilization with `nvidia-smi`; it should be >80%.

**Issue: Inference slower than expected**

Verify tensor parallelism uses a power-of-two GPU count:
```bash
vllm serve MODEL --tensor-parallel-size 4  # Not 3
```

Enable speculative decoding for faster generation:
```bash
vllm serve MODEL --speculative-model DRAFT_MODEL
```

## Advanced topics

**Server deployment patterns**: See [references/server-deployment.md](references/server-deployment.md) for Docker, Kubernetes, and load balancing configurations.

**Performance optimization**: See [references/optimization.md](references/optimization.md) for PagedAttention tuning, continuous batching details, and benchmark results.

**Quantization guide**: See [references/quantization.md](references/quantization.md) for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.

**Troubleshooting**: See [references/troubleshooting.md](references/troubleshooting.md) for detailed error messages, debugging steps, and performance diagnostics.

## Hardware requirements

- **Small models (7B-13B)**: 1x A10 (24GB) or A100 (40GB)
- **Medium models (30B-40B)**: 2x A100 (40GB) with tensor parallelism
- **Large models (70B+)**: 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ

Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs

## Resources

- Official docs: https://docs.vllm.ai
- GitHub: https://github.com/vllm-project/vllm
- Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
- Community: https://discuss.vllm.ai
skills/mlops/inference/vllm/references/optimization.md (new file, 226 lines)
@@ -0,0 +1,226 @@
# Performance Optimization

## Contents
- PagedAttention explained
- Continuous batching mechanics
- Prefix caching strategies
- Speculative decoding setup
- Benchmark results and comparisons
- Performance tuning guide

## PagedAttention explained

**Traditional attention problem**:
- KV cache stored in contiguous memory
- Wastes ~50% of GPU memory due to fragmentation
- Cannot dynamically reallocate for varying sequence lengths

**PagedAttention solution**:
- Divides the KV cache into fixed-size blocks (like OS virtual memory)
- Dynamic allocation from a free block queue
- Shares blocks across sequences (for prefix caching)

**Memory savings example**:
```
Traditional:     70B model needs 160GB KV cache → OOM on 8x A100
PagedAttention:  70B model needs 80GB KV cache  → Fits on 4x A100
```
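A toy allocator in the spirit of PagedAttention (illustrative only, not vLLM's implementation): blocks come from a shared free list, and a per-sequence block table maps logical token positions to physical blocks, so no sequence reserves contiguous memory it may never use.

```python
BLOCK_SIZE = 16                       # tokens per block (vLLM default)
free_blocks = list(range(64))         # physical block pool
block_tables = {}                     # sequence id -> physical block ids

def append_token(seq_id: int, seq_len: int) -> None:
    """Allocate a new block only when a sequence crosses a block boundary."""
    if seq_len % BLOCK_SIZE == 0:
        block_tables.setdefault(seq_id, []).append(free_blocks.pop())

for t in range(40):                   # grow sequence 0 to 40 tokens
    append_token(0, t)

# 40 tokens -> ceil(40/16) = 3 blocks; unused pool blocks stay free for others
print(block_tables[0], f"{len(free_blocks)} blocks free")
```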
**Configuration**:
```bash
# Block size (default: 16 tokens)
vllm serve MODEL --block-size 16

# Number of GPU blocks (auto-calculated)
# Controlled by --gpu-memory-utilization
vllm serve MODEL --gpu-memory-utilization 0.9
```

## Continuous batching mechanics

**Traditional batching**:
- Wait for all sequences in the batch to finish
- GPU idle while waiting for the longest sequence
- Low GPU utilization (~40-60%)

**Continuous batching**:
- Add new requests as slots become available
- Mix prefill (new requests) and decode (ongoing) in the same batch
- High GPU utilization (>90%)

**Throughput improvement**:
```
Traditional batching: 50 req/sec @ 50% GPU util
Continuous batching:  200 req/sec @ 90% GPU util
= 4x throughput improvement
```
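A toy scheduler loop illustrating the idea (not vLLM's scheduler): finished sequences free their slot immediately, and waiting requests join the batch on the next step instead of waiting for a batch-wide barrier.

```python
from collections import deque

waiting = deque(range(10))            # queued request ids
running = {}                          # request id -> tokens left to decode
MAX_SLOTS = 4                         # concurrent sequences in the batch

steps = 0
while waiting or running:
    # Admit new requests into any free slots (their prefill joins the batch)
    while waiting and len(running) < MAX_SLOTS:
        rid = waiting.popleft()
        running[rid] = 3 + rid % 4    # fake output length per request
    # One decode step for every running sequence
    for rid in list(running):
        running[rid] -= 1
        if running[rid] == 0:
            del running[rid]          # slot freed with no batch barrier
    steps += 1

print(f"all requests finished in {steps} decode steps")
```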
**Tuning parameters**:
```bash
# Max concurrent sequences (higher = more batching)
vllm serve MODEL --max-num-seqs 256

# Prefill/decode schedule (auto-balanced by default)
# No manual tuning needed
```

## Prefix caching strategies

Reuse computed KV cache for common prompt prefixes.

**Use cases**:
- System prompts repeated across requests
- Few-shot examples in every prompt
- RAG contexts with overlapping chunks

**Example savings**:
```
Prompt: [System: 500 tokens] + [User: 100 tokens]

Without caching: Compute 600 tokens every request
With caching:    Compute 500 tokens once, then 100 tokens/request
= 83% faster TTFT
```

**Enable prefix caching**:
```bash
vllm serve MODEL --enable-prefix-caching
```

**Automatic prefix detection**:
- vLLM detects common prefixes automatically
- No code changes required
- Works with OpenAI-compatible API
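For example, a client that reuses the same system prompt across calls benefits automatically: on later requests only the user turn needs fresh prefill (a sketch; the system prompt content is a placeholder):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
system = {"role": "system",
          "content": "You are a concise support agent for ExampleCo."}

for question in ["How do I reset my password?", "How do I cancel my order?"]:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3-8B-Instruct",
        messages=[system, {"role": "user", "content": question}],
    )
    # The shared system prefix is served from cached KV blocks
    print(resp.choices[0].message.content[:120])
```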
**Cache hit rate monitoring**:
```bash
curl http://localhost:9090/metrics | grep cache_hit
# vllm_cache_hit_rate: 0.75 (75% hit rate)
```

## Speculative decoding setup

Use a smaller "draft" model to propose tokens and the larger target model to verify them.

**Speed improvement**:
```
Standard:    Generate 1 token per forward pass
Speculative: Generate 3-5 tokens per forward pass
= 2-3x faster generation
```

**How it works**:
1. Draft model proposes K tokens (fast)
2. Target model verifies all K tokens in parallel (one pass)
3. Accept verified tokens, restart from the first rejection
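The accept-until-first-rejection step in miniature (a toy sketch; real speculative decoding verifies proposals against the target model's token distribution, not the coin flips used here):

```python
import random

def draft_propose(k: int) -> list[str]:
    """Stand-in for the small draft model proposing k tokens."""
    return [random.choice(["the", "a", "cat"]) for _ in range(k)]

def target_verify(tokens: list[str]) -> list[bool]:
    """Stand-in for the target model checking all proposals in one pass."""
    return [random.random() < 0.7 for _ in tokens]

proposed = draft_propose(5)
accepted = []
for token, ok in zip(proposed, target_verify(proposed)):
    if not ok:
        break                         # restart generation from the rejection
    accepted.append(token)

print(f"proposed={proposed} accepted={accepted}")
```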
**Setup with separate draft model**:
```bash
vllm serve meta-llama/Llama-3-70B-Instruct \
  --speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --num-speculative-tokens 5
```

**Setup with n-gram draft** (no separate model):
```bash
vllm serve MODEL \
  --speculative-method ngram \
  --num-speculative-tokens 3
```

**When to use**:
- Output length > 100 tokens
- Draft model 5-10x smaller than target
- Acceptable 2-3% accuracy trade-off

## Benchmark results

**vLLM vs HuggingFace Transformers** (Llama 3 8B, A100):
```
Metric                   | HF Transformers | vLLM   | Improvement
-------------------------|-----------------|--------|------------
Throughput (req/sec)     | 12              | 280    | 23x
TTFT (ms)                | 850             | 120    | 7x
Tokens/sec               | 45              | 2,100  | 47x
GPU Memory (GB)          | 28              | 16     | 1.75x less
```

**vLLM vs TensorRT-LLM** (Llama 2 70B, 4x A100):
```
Metric                   | TensorRT-LLM | vLLM         | Notes
-------------------------|--------------|--------------|------------------
Throughput (req/sec)     | 320          | 285          | TRT 12% faster
Setup complexity         | High         | Low          | vLLM much easier
NVIDIA-only              | Yes          | No           | vLLM multi-platform
Quantization support     | FP8, INT8    | AWQ/GPTQ/FP8 | vLLM more options
```
## Performance tuning guide

**Step 1: Measure baseline**

```bash
# Install benchmarking tool
pip install locust

# Run baseline benchmark
vllm bench throughput \
  --model MODEL \
  --input-tokens 128 \
  --output-tokens 256 \
  --num-prompts 1000

# Record: throughput, TTFT, tokens/sec
```

**Step 2: Tune memory utilization**

```bash
# Try different values: 0.7, 0.85, 0.9, 0.95
vllm serve MODEL --gpu-memory-utilization 0.9
```

Higher = more batch capacity = higher throughput, but risks OOM.

**Step 3: Tune concurrency**

```bash
# Try values: 128, 256, 512, 1024
vllm serve MODEL --max-num-seqs 256
```

Higher = more batching opportunity, but may increase latency.

**Step 4: Enable optimizations**

```bash
# Prefix caching helps repeated prompts; chunked prefill helps long prompts
vllm serve MODEL \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 512
```

**Step 5: Re-benchmark and compare**

Target improvements:
- Throughput: +30-100%
- TTFT: -20-50%
- GPU utilization: >85%

**Common performance issues**:

**Low throughput (<50 req/sec)**:
- Increase `--max-num-seqs`
- Enable `--enable-prefix-caching`
- Check GPU utilization (should be >80%)

**High TTFT (>1 second)**:
- Enable `--enable-chunked-prefill`
- Reduce `--max-model-len` if possible
- Check if the model is too large for the GPU

**OOM errors**:
- Reduce `--gpu-memory-utilization` to 0.7
- Reduce `--max-model-len`
- Use quantization (`--quantization awq`)
skills/mlops/inference/vllm/references/quantization.md (new file, 284 lines)
@@ -0,0 +1,284 @@
# Quantization Guide

## Contents
- Quantization methods comparison
- AWQ setup and usage
- GPTQ setup and usage
- FP8 quantization (H100)
- Model preparation
- Accuracy vs compression trade-offs

## Quantization methods comparison

| Method | Compression | Accuracy Loss | Speed | Best For |
|--------|-------------|---------------|-------|----------|
| **AWQ** | 4-bit (75%) | <1% | Fast | 70B models, production |
| **GPTQ** | 4-bit (75%) | 1-2% | Fast | Wide model support |
| **FP8** | 8-bit (50%) | <0.5% | Fastest | H100 GPUs only |
| **SqueezeLLM** | 3-4 bit (75-80%) | 2-3% | Medium | Extreme compression |

**Recommendation**:
- **Production**: Use AWQ for 70B models
- **H100 GPUs**: Use FP8 for best speed
- **Maximum compatibility**: Use GPTQ
- **Extreme compression**: Use SqueezeLLM

## AWQ setup and usage

**AWQ** (Activation-aware Weight Quantization) achieves the best accuracy at 4-bit.

**Step 1: Find pre-quantized model**

Search HuggingFace for AWQ models:
```bash
# Example: TheBloke/Llama-2-70B-AWQ
# Example: TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
```

**Step 2: Launch with AWQ**

```bash
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95
```

**Memory savings**:
```
Llama 2 70B fp16: 140GB VRAM (4x A100 needed)
Llama 2 70B AWQ:  35GB VRAM (1x A100 40GB)
= 4x memory reduction
```

**Step 3: Verify performance**

Test that outputs are acceptable:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Test complex reasoning
response = client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Explain quantum entanglement"}]
)

print(response.choices[0].message.content)
# Verify quality matches your requirements
```

**Quantize your own model** (requires GPU with 80GB+ VRAM):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-70b-hf"
quant_path = "llama-2-70b-awq"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
## GPTQ setup and usage

**GPTQ** has the widest model support and good compression.

**Step 1: Find GPTQ model**

```bash
# Example: TheBloke/Llama-2-13B-GPTQ
# Example: TheBloke/CodeLlama-34B-GPTQ
```

**Step 2: Launch with GPTQ**

```bash
vllm serve TheBloke/Llama-2-13B-GPTQ \
  --quantization gptq \
  --dtype float16
```

**GPTQ configuration options**:
```bash
# Specify GPTQ parameters if needed
# (--gptq-act-order enables activation ordering)
vllm serve MODEL \
  --quantization gptq \
  --gptq-act-order \
  --dtype float16
```

**Quantize your own model**:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"
quantized_name = "llama-2-13b-gptq"

# Quantization settings (must be defined before loading the model)
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True
)

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

# Prepare calibration data
calib_data = [...]  # List of sample texts from your target domain

# Quantize
model.quantize(calib_data)

# Save
model.save_quantized(quantized_name)
```
## FP8 quantization (H100)

**FP8** (8-bit floating point) offers the best speed on H100 GPUs with minimal accuracy loss.

**Requirements**:
- H100 or H800 GPU
- CUDA 12.3+ (12.8 recommended)
- Hopper architecture support

**Step 1: Enable FP8**

```bash
vllm serve meta-llama/Llama-3-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 2
```

**Performance gains on H100**:
```
fp16: 180 tokens/sec
FP8:  320 tokens/sec
= 1.8x speedup
```

**Step 2: Verify accuracy**

FP8 typically has <0.5% accuracy degradation:
```python
# Run evaluation suite
# Compare FP8 vs FP16 on your tasks
# Verify acceptable accuracy
```

**Dynamic FP8 quantization** (no pre-quantized model needed):

```bash
# vLLM automatically quantizes at runtime
vllm serve MODEL --quantization fp8
# No model preparation required
```

## Model preparation

**Pre-quantized models (easiest)**:

1. Search HuggingFace: `[model name] AWQ` or `[model name] GPTQ`
2. Download or use directly: `TheBloke/[Model]-AWQ`
3. Launch with the appropriate `--quantization` flag

**Quantize your own model**:

**AWQ**:
```bash
# Install AutoAWQ
pip install autoawq

# Run quantization script
python quantize_awq.py --model MODEL --output OUTPUT
```

**GPTQ**:
```bash
# Install AutoGPTQ
pip install auto-gptq

# Run quantization script
python quantize_gptq.py --model MODEL --output OUTPUT
```

**Calibration data**:
- Use 128-512 diverse examples from the target domain
- Representative of production inputs
- Higher quality calibration = better accuracy
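A sketch of assembling such a set (the file name, JSON field, and sample count are assumptions; use whatever your production logs provide):

```python
import json
import random

# Sample representative texts from production traffic
with open("production_samples.jsonl") as f:
    texts = [json.loads(line)["text"] for line in f]

random.seed(0)                        # reproducible calibration set
calib_data = random.sample(texts, k=min(256, len(texts)))
print(f"calibration set: {len(calib_data)} examples")
```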
## Accuracy vs compression trade-offs

**Empirical results** (Llama 2 70B on MMLU benchmark):

| Quantization | Accuracy | Memory | Speed | Production-Ready |
|--------------|----------|--------|-------|------------------|
| FP16 (baseline) | 100% | 140GB | 1.0x | ✅ (if memory available) |
| FP8 | 99.5% | 70GB | 1.8x | ✅ (H100 only) |
| AWQ 4-bit | 99.0% | 35GB | 1.5x | ✅ (best for 70B) |
| GPTQ 4-bit | 98.5% | 35GB | 1.5x | ✅ (good compatibility) |
| SqueezeLLM 3-bit | 96.0% | 26GB | 1.3x | ⚠️ (check accuracy) |

**When to use each**:

**No quantization (FP16)**:
- Have sufficient GPU memory
- Need absolute best accuracy
- Model <13B parameters

**FP8**:
- Using H100/H800 GPUs
- Need best speed with minimal accuracy loss
- Production deployment

**AWQ 4-bit**:
- Need to fit a 70B model in a 40GB GPU
- Production deployment
- <1% accuracy loss acceptable

**GPTQ 4-bit**:
- Wide model support needed
- Not on H100 (use FP8 instead)
- 1-2% accuracy loss acceptable

**Testing strategy**:

1. **Baseline**: Measure FP16 accuracy on your evaluation set
2. **Quantize**: Create the quantized version
3. **Evaluate**: Compare quantized vs baseline on the same tasks
4. **Decide**: Accept if degradation < threshold (typically 1-2%)

**Example evaluation**:
```python
# Pseudocode: `evaluate` and `eval_suite` stand in for your own
# evaluation harness; only the comparison logic is shown here.

# Run on FP16 baseline
baseline_score = evaluate(model_fp16, eval_suite)

# Run on quantized
quant_score = evaluate(model_awq, eval_suite)

# Compare
degradation = (baseline_score - quant_score) / baseline_score * 100
print(f"Accuracy degradation: {degradation:.2f}%")

# Decision
if degradation < 1.0:
    print("✅ Quantization acceptable for production")
else:
    print("⚠️ Review accuracy loss")
```
skills/mlops/inference/vllm/references/server-deployment.md (new file, 255 lines)
@@ -0,0 +1,255 @@
# Server Deployment Patterns

## Contents
- Docker deployment
- Kubernetes deployment
- Load balancing with Nginx
- Multi-node distributed serving
- Production configuration examples
- Health checks and monitoring

## Docker deployment

**Basic Dockerfile**:
```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3-pip
RUN pip install vllm

EXPOSE 8000

CMD ["vllm", "serve", "meta-llama/Llama-3-8B-Instruct", \
     "--host", "0.0.0.0", "--port", "8000", \
     "--gpu-memory-utilization", "0.9"]
```

**Build and run**:
```bash
docker build -t vllm-server .
docker run --gpus all -p 8000:8000 vllm-server
```

**Docker Compose** (with metrics):
```yaml
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model meta-llama/Llama-3-8B-Instruct
      --gpu-memory-utilization 0.9
      --enable-metrics
      --metrics-port 9090
    ports:
      - "8000:8000"
      - "9090:9090"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

## Kubernetes deployment

**Deployment manifest**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model=meta-llama/Llama-3-8B-Instruct"
        - "--gpu-memory-utilization=0.9"
        - "--enable-prefix-caching"
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 9090
          name: metrics
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
    name: http
  - port: 9090
    targetPort: 9090
    name: metrics
  type: LoadBalancer
```
## Load balancing with Nginx

**Nginx configuration**:
```nginx
upstream vllm_backend {
    least_conn;  # Route to least-loaded server
    server localhost:8001;
    server localhost:8002;
    server localhost:8003;
}

server {
    listen 80;

    location / {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Timeouts for long-running inference
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;
    }

    # Metrics endpoint
    location /metrics {
        proxy_pass http://localhost:9090/metrics;
    }
}
```

**Start multiple vLLM instances**:
```bash
# Terminal 1
vllm serve MODEL --port 8001 --tensor-parallel-size 1

# Terminal 2
vllm serve MODEL --port 8002 --tensor-parallel-size 1

# Terminal 3
vllm serve MODEL --port 8003 --tensor-parallel-size 1

# Start Nginx
nginx -c /path/to/nginx.conf
```

## Multi-node distributed serving

For models too large for a single node:

**Node 1** (master):
```bash
export MASTER_ADDR=192.168.1.10
export MASTER_PORT=29500
export RANK=0
export WORLD_SIZE=2

vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
```

**Node 2** (worker):
```bash
export MASTER_ADDR=192.168.1.10
export MASTER_PORT=29500
export RANK=1
export WORLD_SIZE=2

vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
```
## Production configuration examples

**High throughput** (batch-heavy workload):
```bash
vllm serve MODEL \
  --max-num-seqs 512 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching \
  --trust-remote-code
```

**Low latency** (interactive workload):
```bash
vllm serve MODEL \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill
```

**Memory-constrained** (40GB GPU for a 70B model):
```bash
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 4096
```

## Health checks and monitoring

**Health check endpoint**:
```bash
curl http://localhost:8000/health
# Returns: {"status": "ok"}
```

**Readiness check** (wait for model to load):
```bash
#!/bin/bash
until curl -f http://localhost:8000/health; do
  echo "Waiting for vLLM to be ready..."
  sleep 5
done
echo "vLLM is ready!"
```

**Prometheus scraping**:
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'
    scrape_interval: 15s
```

**Grafana dashboard** (key metrics):
- Requests per second: `rate(vllm_request_success_total[5m])`
- TTFT p50: `histogram_quantile(0.5, vllm_time_to_first_token_seconds_bucket)`
- TTFT p99: `histogram_quantile(0.99, vllm_time_to_first_token_seconds_bucket)`
- GPU cache usage: `vllm_gpu_cache_usage_perc`
- Active requests: `vllm_num_requests_running`
skills/mlops/inference/vllm/references/troubleshooting.md (new file, 447 lines)
@@ -0,0 +1,447 @@
# Troubleshooting Guide

## Contents
- Out of memory (OOM) errors
- Performance issues
- Model loading errors
- Network and connection issues
- Quantization problems
- Distributed serving issues
- Debugging tools and commands

## Out of memory (OOM) errors

### Symptom: `torch.cuda.OutOfMemoryError` during model loading

**Cause**: Model + KV cache exceeds available VRAM

**Solutions (try in order)**:

1. **Reduce GPU memory utilization**:
   ```bash
   vllm serve MODEL --gpu-memory-utilization 0.7  # Try 0.7, 0.75, 0.8
   ```

2. **Reduce max sequence length**:
   ```bash
   vllm serve MODEL --max-model-len 4096  # Instead of 8192
   ```

3. **Enable quantization**:
   ```bash
   vllm serve MODEL --quantization awq  # 4x memory reduction
   ```

4. **Use tensor parallelism** (multiple GPUs):
   ```bash
   vllm serve MODEL --tensor-parallel-size 2  # Split across 2 GPUs
   ```

5. **Reduce max concurrent sequences**:
   ```bash
   vllm serve MODEL --max-num-seqs 128  # Default is 256
   ```

### Symptom: OOM during inference (not model loading)

**Cause**: KV cache fills up during generation

**Solutions**:

```bash
# Reduce KV cache allocation
vllm serve MODEL --gpu-memory-utilization 0.85

# Reduce batch size
vllm serve MODEL --max-num-seqs 64

# Reduce max tokens per request
# Set in client request: max_tokens=512
```

### Symptom: OOM with quantized model

**Cause**: Quantization overhead or incorrect configuration

**Solution**:
```bash
# Ensure quantization flag matches the model
vllm serve TheBloke/Llama-2-70B-AWQ --quantization awq  # Must specify

# Try a different dtype
vllm serve MODEL --quantization awq --dtype float16
```
## Performance issues

### Symptom: Low throughput (<50 req/sec when >100 is expected)

**Diagnostic steps**:

1. **Check GPU utilization**:
   ```bash
   watch -n 1 nvidia-smi
   # GPU utilization should be >80%
   ```

   If <80%, increase concurrent requests:
   ```bash
   vllm serve MODEL --max-num-seqs 512  # Increase from 256
   ```

2. **Check if memory-bound**:
   ```bash
   # If memory is at 100% but GPU <80%, reduce sequence length
   vllm serve MODEL --max-model-len 4096
   ```

3. **Enable optimizations**:
   ```bash
   vllm serve MODEL \
     --enable-prefix-caching \
     --enable-chunked-prefill \
     --max-num-seqs 512
   ```

4. **Check tensor parallelism settings**:
   ```bash
   # Must use a power-of-two GPU count
   vllm serve MODEL --tensor-parallel-size 4  # Not 3 or 5
   ```

### Symptom: High TTFT (time to first token >1 second)

**Causes and solutions**:

**Long prompts**:
```bash
vllm serve MODEL --enable-chunked-prefill
```

**No prefix caching**:
```bash
vllm serve MODEL --enable-prefix-caching  # For repeated prompts
```

**Too many concurrent requests**:
```bash
vllm serve MODEL --max-num-seqs 64  # Reduce to prioritize latency
```

**Model too large for a single GPU**:
```bash
vllm serve MODEL --tensor-parallel-size 2  # Parallelize prefill
```

### Symptom: Slow token generation (low tokens/sec)

**Diagnostic**:
```bash
# Confirm the expected model size appears in the startup logs
vllm serve MODEL

# Try speculative decoding
vllm serve MODEL --speculative-model DRAFT_MODEL
```

**For H100 GPUs**, enable FP8:
```bash
vllm serve MODEL --quantization fp8
```
## Model loading errors

### Symptom: `OSError: MODEL not found`

**Causes**:

1. **Model name typo**:
   ```bash
   # Check the exact model name on HuggingFace
   vllm serve meta-llama/Llama-3-8B-Instruct  # Correct capitalization
   ```

2. **Private/gated model**:
   ```bash
   # Login to HuggingFace first
   huggingface-cli login
   # Then run vLLM
   vllm serve meta-llama/Llama-3-70B-Instruct
   ```

3. **Custom model needs trust flag**:
   ```bash
   vllm serve MODEL --trust-remote-code
   ```

### Symptom: `ValueError: Tokenizer not found`

**Solution**:
```bash
# Download the model manually first
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('MODEL')"

# Then launch vLLM
vllm serve MODEL
```

### Symptom: `ImportError: No module named 'flash_attn'`

**Solution**:
```bash
# Install flash attention
pip install flash-attn --no-build-isolation

# Or disable flash attention
vllm serve MODEL --disable-flash-attn
```
## Network and connection issues

### Symptom: `Connection refused` when querying the server

**Diagnostic**:

1. **Check the server is running**:
   ```bash
   curl http://localhost:8000/health
   ```

2. **Check port binding**:
   ```bash
   # Bind to all interfaces for remote access
   vllm serve MODEL --host 0.0.0.0 --port 8000

   # Check if the port is in use
   lsof -i :8000
   ```

3. **Check the firewall**:
   ```bash
   # Allow the port through the firewall
   sudo ufw allow 8000
   ```

### Symptom: Slow response times over the network

**Solutions**:

1. **Increase the timeout**:
   ```python
   from openai import OpenAI

   client = OpenAI(
       base_url="http://localhost:8000/v1",
       api_key="EMPTY",
       timeout=300.0  # 5 minute timeout
   )
   ```

2. **Check network latency**:
   ```bash
   ping SERVER_IP  # Should be <10ms for a local network
   ```

3. **Use connection pooling**:
   ```python
   import requests
   from requests.adapters import HTTPAdapter
   from urllib3.util.retry import Retry

   session = requests.Session()
   retries = Retry(total=3, backoff_factor=1)
   session.mount('http://', HTTPAdapter(max_retries=retries))
   # Reuse `session` for all subsequent requests to keep connections alive
   ```
## Quantization problems

### Symptom: `RuntimeError: Quantization format not supported`

**Solution**:
```bash
# Ensure the correct quantization method
vllm serve MODEL --quantization awq   # For AWQ models
vllm serve MODEL --quantization gptq  # For GPTQ models

# Check the model card for the quantization type
```

### Symptom: Poor quality outputs after quantization

**Diagnostic**:

1. **Verify the model is correctly quantized**:
   ```bash
   # Check the model's config.json for quantization_config
   cat ~/.cache/huggingface/hub/models--MODEL/config.json
   ```

2. **Try a different quantization method**:
   ```bash
   # If AWQ quality is an issue, try FP8 (H100 only)
   vllm serve MODEL --quantization fp8

   # Or use less aggressive quantization
   vllm serve MODEL  # No quantization
   ```

3. **Increase temperature for better diversity**:
   ```python
   sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
   ```
## Distributed serving issues

### Symptom: `RuntimeError: Distributed init failed`

**Diagnostic**:

1. **Check environment variables**:
   ```bash
   # On all nodes
   echo $MASTER_ADDR  # Should be the same
   echo $MASTER_PORT  # Should be the same
   echo $RANK         # Should be unique per node (0, 1, 2, ...)
   echo $WORLD_SIZE   # Should be the same (total nodes)
   ```

2. **Check network connectivity**:
   ```bash
   # From node 1 to node 2
   ping NODE2_IP
   nc -zv NODE2_IP 29500  # Check port accessibility
   ```

3. **Check NCCL settings**:
   ```bash
   export NCCL_DEBUG=INFO
   export NCCL_SOCKET_IFNAME=eth0  # Or your network interface
   vllm serve MODEL --tensor-parallel-size 8
   ```

### Symptom: `NCCL error: unhandled cuda error`

**Solutions**:

```bash
# Set NCCL to use the correct network interface
export NCCL_SOCKET_IFNAME=eth0  # Replace with your interface

# Increase the timeout
export NCCL_TIMEOUT=1800  # 30 minutes

# Disable P2P for debugging
export NCCL_P2P_DISABLE=1
```
## Debugging tools and commands

### Enable debug logging

```bash
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL
```

### Monitor GPU usage

```bash
# Real-time GPU monitoring
watch -n 1 nvidia-smi

# Memory breakdown
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1
```

### Profile performance

```bash
# Built-in benchmarking
vllm bench throughput \
  --model MODEL \
  --input-tokens 128 \
  --output-tokens 256 \
  --num-prompts 100

vllm bench latency \
  --model MODEL \
  --input-tokens 128 \
  --output-tokens 256 \
  --batch-size 8
```

### Check metrics

```bash
# Prometheus metrics
curl http://localhost:9090/metrics

# Filter for specific metrics
curl http://localhost:9090/metrics | grep vllm_time_to_first_token

# Key metrics to monitor:
# - vllm_time_to_first_token_seconds
# - vllm_time_per_output_token_seconds
# - vllm_num_requests_running
# - vllm_gpu_cache_usage_perc
# - vllm_request_success_total
```

### Test server health

```bash
# Health check
curl http://localhost:8000/health

# Model info
curl http://localhost:8000/v1/models

# Test completion
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MODEL",
    "prompt": "Hello",
    "max_tokens": 10
  }'
```

### Common environment variables

```bash
# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1,2,3  # Limit to specific GPUs

# vLLM settings
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_TRACE_FUNCTION=1  # Profile functions
export VLLM_USE_V1=1          # Use the V1 engine (faster)

# NCCL settings (distributed)
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0  # Enable InfiniBand
```

### Collect diagnostic info for bug reports

```bash
# System info
nvidia-smi
python --version
pip show vllm

# vLLM version and config
vllm --version
python -c "import vllm; print(vllm.__version__)"

# Run with debug logging
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL 2>&1 | tee vllm_debug.log

# Include in the bug report:
# - vllm_debug.log
# - nvidia-smi output
# - The full command used
# - Expected vs actual behavior
```