Sync all skills and memories 2026-04-14 07:27
3
skills/mlops/inference/DESCRIPTION.md
Normal file
@@ -0,0 +1,3 @@
---
description: Model serving, quantization (GGUF/GPTQ), structured output, inference optimization, and model surgery tools for deploying and running LLMs.
---
430
skills/mlops/inference/gguf/SKILL.md
Normal file
@@ -0,0 +1,430 @@
---
name: gguf-quantization
description: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [llama-cpp-python>=0.2.0]
metadata:
  hermes:
    tags: [GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization]
---

# GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.

## When to use GGUF

**Use GGUF when:**
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- You need CPU inference without a GPU
- You want flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)

**Key advantages:**
- **Universal hardware**: CPU, Apple Silicon, NVIDIA, and AMD support
- **No Python runtime**: Pure C/C++ inference
- **Flexible quantization**: 2-8 bit with various methods (K-quants)
- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more
- **imatrix**: Importance matrix for better low-bit quality

**Use alternatives instead:**
- **AWQ/GPTQ**: Maximum accuracy with calibration on NVIDIA GPUs
- **HQQ**: Fast calibration-free quantization for HuggingFace
- **bitsandbytes**: Simple integration with the transformers library
- **TensorRT-LLM**: Production NVIDIA deployment with maximum speed

## Quick start

### Installation

```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install Python bindings (optional)
pip install llama-cpp-python
```

### Convert model to GGUF

```bash
# Install conversion requirements (from the llama.cpp directory)
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify the output type explicitly
python convert_hf_to_gguf.py ./path/to/model \
  --outfile model-f16.gguf \
  --outtype f16
```
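
A quick way to sanity-check a converted file is to read the fixed GGUF header, which per the GGUF specification starts with a 4-byte `GGUF` magic followed by a u32 version, u64 tensor count, and u64 metadata KV count. A minimal sketch; the synthetic header bytes below are for illustration only:

```python
import struct

def read_gguf_header(data: bytes):
    """Parse the fixed GGUF header fields: magic, version, tensor count, KV count."""
    magic = data[:4]
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}

# Demo on a synthetic header (version 3, 2 tensors, 5 metadata keys)
fake = b"GGUF" + struct.pack("<IQQ", 3, 2, 5)
print(read_gguf_header(fake))
```

Run it against the first 24 bytes of a real `.gguf` file to confirm the conversion produced a valid header.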

### Quantize model

```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

### Run inference

```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```

## Quantization types

### K-quant methods (recommended)

| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | **Recommended default** |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |

### Legacy methods

| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |

**Recommendation**: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
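
The sizes in the table follow directly from bits per weight: file size ≈ parameter count × bits / 8, with the table values running slightly higher because of metadata and tensors kept at higher precision. A rough sketch of that arithmetic:

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Lower-bound quantized file size: params * bits-per-weight / 8.
    Real files are a little larger (metadata, higher-precision tensors)."""
    return n_params * bits_per_weight / 8 / 1e9

# 7B model at the effective bpw values from the table above
for quant, bpw in [("Q4_K_M", 4.5), ("Q5_K_M", 5.5), ("Q8_0", 8.0)]:
    print(f"{quant}: ~{estimate_gguf_size_gb(7e9, bpw):.1f} GB")
```

This is useful for checking whether a model at a given quantization will fit in RAM or VRAM before downloading or quantizing it.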

## Conversion workflows

### Workflow 1: HuggingFace to GGUF

```bash
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
  --outfile llama-3.1-8b-f16.gguf \
  --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```

### Workflow 2: With importance matrix (better quality)

```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF

# 3. Generate importance matrix (-ngl offloads layers if a GPU is available)
./llama-imatrix -m model-f16.gguf \
  -f calibration.txt \
  --chunk 512 \
  -o model.imatrix \
  -ngl 35

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
  model-f16.gguf \
  model-q4_k_m.gguf \
  Q4_K_M
```
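
The imatrix is only as good as the calibration text, which should mix domains (prose, code, Q&A). A minimal sketch that samples non-empty lines from several source files into `calibration.txt`; the file paths are placeholders:

```python
import random

def build_calibration(sources, out_path="calibration.txt", per_source=100, seed=0):
    """Sample up to `per_source` non-empty lines from each source file,
    shuffle them together, and write the result to out_path."""
    rng = random.Random(seed)
    samples = []
    for path in sources:
        with open(path, encoding="utf-8", errors="ignore") as f:
            lines = [line.strip() for line in f if line.strip()]
        rng.shuffle(lines)
        samples.extend(lines[:per_source])
    rng.shuffle(samples)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(samples))
    return len(samples)

# Usage (hypothetical corpus files):
# build_calibration(["wiki.txt", "code.txt", "chat.txt"], per_source=200)
```

Feed the resulting file to `llama-imatrix -f calibration.txt` as in step 3 above.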

### Workflow 3: Multiple quantizations

```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
  ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
  echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```

## Python usage

### llama-cpp-python

```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,        # Context window
    n_gpu_layers=35,   # GPU offload (0 for CPU only)
    n_threads=8        # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```

### Chat completion

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```

### Streaming

```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens as they are generated
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```

## Server mode

### Start OpenAI-compatible server

```bash
# Start server
./llama-server -m model-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35 \
  -c 4096

# Or with Python bindings
python -m llama_cpp.server \
  --model model-q4_k_m.gguf \
  --n_gpu_layers 35 \
  --host 0.0.0.0 \
  --port 8080
```

### Use with OpenAI client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```

## Hardware optimization

### Apple Silicon (Metal)

```bash
# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

```python
# Python with Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```

### NVIDIA CUDA

```bash
# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```

### CPU optimization

```bash
# Build (AVX2/AVX512 are enabled automatically for the host CPU)
make clean && make

# Run with an explicit thread count
./llama-cli -m model.gguf -t 8 -p "Hello"
```

```python
# Python CPU config
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
```
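
A common heuristic for `n_threads` is to use physical cores, not logical ones. Python's `os.cpu_count()` reports logical CPUs, so this sketch assumes 2-way SMT (an assumption; set the factor to 1 if hyperthreading is off on your machine):

```python
import os

def suggested_threads(smt_factor: int = 2) -> int:
    """Estimate physical cores as logical CPUs divided by the SMT factor
    (assumed 2-way hyperthreading; adjust for your CPU)."""
    logical = os.cpu_count() or 1
    return max(1, logical // smt_factor)

print(suggested_threads())
```

Pass the result as `n_threads` to `Llama(...)` or as `-t` to `llama-cli`.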

## Integration with tools

### Ollama

```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel "Hello!"
```

### LM Studio

1. Place the GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload
4. Start inference

### text-generation-webui

```bash
# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```

## Best practices

1. **Use K-quants**: Q4_K_M offers the best quality/size balance
2. **Use imatrix**: Always use an importance matrix for Q4 and below
3. **GPU offload**: Offload as many layers as VRAM allows
4. **Context length**: Start with 4096, increase if needed
5. **Thread count**: Match physical CPU cores, not logical
6. **Batch size**: Increase n_batch for faster prompt processing
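
These practices can be rolled into one hedged helper that picks `Llama` constructor arguments from hardware facts. The 0.5 GB-per-layer figure is a rough assumption for a 7B Q4 model, and 32 is an assumed layer count; both should be adjusted per model:

```python
def llama_kwargs(free_vram_gb: float, physical_cores: int,
                 total_layers: int = 32, gb_per_layer: float = 0.5) -> dict:
    """Apply the practices above: offload what VRAM allows (assuming
    ~0.5 GB/layer for a 7B Q4 model), start at 4096 context, match
    physical cores, and use a larger batch for prompt processing."""
    return {
        "n_gpu_layers": min(total_layers, int(free_vram_gb / gb_per_layer)),
        "n_ctx": 4096,
        "n_threads": physical_cores,
        "n_batch": 512,
    }

print(llama_kwargs(free_vram_gb=8, physical_cores=8))
```

The resulting dict can be splatted into the constructor: `Llama(model_path="model.gguf", **llama_kwargs(8, 8))`.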

## Common issues

**Model loads slowly:**
```bash
# Memory mapping is on by default; make sure it isn't disabled
# (avoid --no-mmap) and keep the model on fast local storage
./llama-cli -m model.gguf
```

**Out of memory:**
```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduce from 35

# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```

**Poor quality at low bits:**
```bash
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

## References

- **[Advanced Usage](references/advanced-usage.md)** - Batching, speculative decoding, custom builds
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks

## Resources

- **Repository**: https://github.com/ggml-org/llama.cpp
- **Python Bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized Models**: https://huggingface.co/TheBloke
- **GGUF Converter**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT
504
skills/mlops/inference/gguf/references/advanced-usage.md
Normal file
@@ -0,0 +1,504 @@
# GGUF Advanced Usage Guide

## Speculative Decoding

### Draft Model Approach

```bash
# Use a smaller model as draft for faster generation
./llama-speculative \
  -m large-model-q4_k_m.gguf \
  -md draft-model-q4_k_m.gguf \
  -p "Write a story about AI" \
  -n 500 \
  --draft 8  # Draft tokens before verification
```

### Self-Speculative Decoding

```bash
# Use n-gram lookup caches to speculate from the prompt itself
./llama-cli -m model-q4_k_m.gguf \
  --lookup-cache-static lookup.bin \
  --lookup-cache-dynamic lookup-dynamic.bin \
  -p "Hello world"
```
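
Why drafting helps: under the standard speculative-sampling estimate, if each draft token is accepted independently with probability p, a single verification pass over k draft tokens emits (1 - p^(k+1)) / (1 - p) tokens on average. A quick sketch of that expectation (it ignores the draft model's own cost, so real speedups are lower):

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens emitted per target-model pass with k draft tokens,
    assuming i.i.d. acceptance probability p (0 < p < 1)."""
    return (1 - p ** (k + 1)) / (1 - p)

# Higher draft acceptance -> more tokens per expensive verification pass
for p in (0.6, 0.8, 0.9):
    print(f"p={p}: {expected_tokens_per_pass(p, 8):.2f} tokens/pass")
```

This is why a well-matched draft model (high acceptance rate) matters more than a fast one.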

## Batched Inference

### Process Multiple Prompts

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    n_batch=512  # Larger batch speeds up prompt processing
)

prompts = [
    "What is Python?",
    "Explain machine learning.",
    "Describe neural networks."
]

# Process prompts sequentially (for true concurrent batching, use server mode below)
for prompt in prompts:
    output = llm(prompt, max_tokens=100)
    print(f"Q: {prompt}")
    print(f"A: {output['choices'][0]['text']}\n")
```

### Server Batching

```bash
# Start server with batching
# --parallel: number of concurrent request slots
# --cont-batching: enable continuous batching
./llama-server -m model-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35 \
  -c 4096 \
  --parallel 4 \
  --cont-batching
```

## Custom Model Conversion

### Convert with Vocabulary Modifications

```python
# custom_convert.py
import subprocess
from transformers import AutoTokenizer

def convert_with_custom_vocab(model_path, output_path):
    # Load and modify the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Add special tokens if needed
    special_tokens = {"additional_special_tokens": ["<|custom|>"]}
    tokenizer.add_special_tokens(special_tokens)
    tokenizer.save_pretrained(model_path)
    # NOTE: if tokens were added, resize the model's embeddings too,
    # or conversion will fail with a vocabulary size mismatch

    # Then run the standard conversion script
    subprocess.run(
        ["python", "llama.cpp/convert_hf_to_gguf.py",
         model_path, "--outfile", output_path],
        check=True
    )
```

### Convert Specific Architecture

```bash
# convert_hf_to_gguf.py detects the architecture from the model config,
# e.g. for Mistral-style models
python convert_hf_to_gguf.py ./mistral-model \
  --outfile mistral-f16.gguf \
  --outtype f16

# For Qwen models
python convert_hf_to_gguf.py ./qwen-model \
  --outfile qwen-f16.gguf \
  --outtype f16

# For Phi models
python convert_hf_to_gguf.py ./phi-model \
  --outfile phi-f16.gguf \
  --outtype f16
```

## Advanced Quantization

### Mixed Quantization

```bash
# Quantize different layer types differently (options come before the files)
./llama-quantize --allow-requantize --leave-output-tensor \
  model-f16.gguf model-mixed.gguf Q4_K_M
```

### Quantization with Token Embeddings

```bash
# Keep embeddings at higher precision
./llama-quantize --token-embedding-type f16 \
  model-f16.gguf model-q4.gguf Q4_K_M
```

### IQ Quantization (Importance-aware)

```bash
# Ultra-low bit quantization with importance
./llama-quantize --imatrix model.imatrix \
  model-f16.gguf model-iq2_xxs.gguf IQ2_XXS

# Available IQ types: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS
```

## Memory Optimization

### Memory Mapping

```python
from llama_cpp import Llama

# Use memory mapping for large models
llm = Llama(
    model_path="model-q4_k_m.gguf",
    use_mmap=True,    # Memory-map the model (the default)
    use_mlock=False,  # Don't lock it in RAM
    n_gpu_layers=35
)
```

### Partial GPU Offload

```python
# Calculate layers to offload based on free VRAM
import subprocess
from llama_cpp import Llama

def get_free_vram_gb():
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.free', '--format=csv,nounits,noheader'],
        capture_output=True, text=True
    )
    # nvidia-smi prints one line per GPU; use the first
    return int(result.stdout.strip().splitlines()[0]) / 1024

# Estimate layers from VRAM (rough: 0.5 GB per layer for a 7B Q4 model)
free_vram = get_free_vram_gb()
layers_to_offload = int(free_vram / 0.5)

llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_gpu_layers=min(layers_to_offload, 35)  # Cap at the model's layer count
)
```

### KV Cache Optimization

```python
from llama_cpp import Llama

# Quantize the KV cache for long contexts
# (type_k/type_v take GGML type ids: 1 = F16, 8 = Q8_0, 2 = Q4_0)
llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=8192,      # Large context
    n_gpu_layers=35,
    type_k=8,        # Q8_0 K cache
    type_v=8,        # Q8_0 V cache; use 2 (Q4_0) for more compression
)
```
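
The saving is easy to estimate: the K and V caches each hold n_layers × n_ctx × n_kv_heads × head_dim elements. A sketch using illustrative Llama-7B-class dimensions (32 layers, 32 KV heads, head_dim 128; these are assumptions, not read from any file):

```python
def kv_cache_gb(n_layers=32, n_ctx=8192, n_kv_heads=32, head_dim=128,
                bytes_per_elt=2.0):
    """KV cache size: 2 caches (K and V), each n_layers*n_ctx*n_kv_heads*head_dim
    elements. bytes_per_elt: 2 for F16, ~1 for Q8_0, ~0.5 for Q4_0."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt / 1e9

print(f"F16:  {kv_cache_gb(bytes_per_elt=2.0):.1f} GB")
print(f"Q8_0: {kv_cache_gb(bytes_per_elt=1.0):.1f} GB")
```

At 8192 context this is several GB at F16, which is why quantizing the KV cache matters for long-context runs. Models using grouped-query attention have fewer KV heads, shrinking the cache proportionally.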

## Context Management

### Context Shifting

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35
)

# Handle long conversations by trimming old history
conversation = []
max_history = 10

def chat(user_message):
    conversation.append({"role": "user", "content": user_message})

    # Keep only recent history (trim in place so the module-level list is updated)
    if len(conversation) > max_history * 2:
        conversation[:] = conversation[-max_history * 2:]

    response = llm.create_chat_completion(
        messages=conversation,
        max_tokens=256
    )

    assistant_message = response["choices"][0]["message"]["content"]
    conversation.append({"role": "assistant", "content": assistant_message})
    return assistant_message
```
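
Trimming by message count can still overflow `n_ctx` when individual messages are long. A token-budget variant is safer; this sketch uses a crude 4-characters-per-token estimate (an assumption; in practice use the model's tokenizer, e.g. `llm.tokenize`, for exact counts):

```python
def trim_to_budget(messages, max_tokens=3000, chars_per_token=4):
    """Drop the oldest messages until the estimated token count fits the
    budget. Always keeps a leading system message if present."""
    system = messages[:1] if messages and messages[0]["role"] == "system" else []
    rest = list(messages[len(system):])
    est = lambda msgs: sum(len(m["content"]) for m in msgs) // chars_per_token
    while rest and est(system + rest) > max_tokens:
        rest.pop(0)  # drop the oldest non-system message
    return system + rest

history = [{"role": "system", "content": "Be brief."}] + [
    {"role": "user", "content": "x" * 4000} for _ in range(5)
]
print(len(trim_to_budget(history, max_tokens=3000)))  # 3
```

Call it on `conversation` before `create_chat_completion` instead of (or in addition to) the count-based trim above.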

### Save and Load State

```bash
# Save state to file
./llama-cli -m model.gguf \
  -p "Once upon a time" \
  --save-session session.bin \
  -n 100

# Load and continue
./llama-cli -m model.gguf \
  --load-session session.bin \
  -p " and they lived" \
  -n 100
```

## Grammar Constrained Generation

### JSON Output

```python
from llama_cpp import Llama, LlamaGrammar

# Define a simplified JSON grammar (GBNF)
json_grammar = LlamaGrammar.from_string('''
root ::= object
object ::= "{" ws pair ("," ws pair)* "}" ws
pair ::= string ":" ws value
value ::= string | number | object | array | "true" | "false" | "null"
array ::= "[" ws value ("," ws value)* "]" ws
string ::= "\\"" [^"\\\\]* "\\""
number ::= [0-9]+
ws ::= [ \\t\\n]*
''')

llm = Llama(model_path="model-q4_k_m.gguf", n_gpu_layers=35)

output = llm(
    "Output a JSON object with name and age:",
    grammar=json_grammar,
    max_tokens=100
)
print(output["choices"][0]["text"])
```

### Custom Grammar

```python
# Grammar for a specific answer format
answer_grammar = LlamaGrammar.from_string('''
root ::= "Answer: " letter "\\n" "Explanation: " explanation
letter ::= [A-D]
explanation ::= [a-zA-Z0-9 .,!?]+
''')

output = llm(
    "Q: What is 2+2? A) 3 B) 4 C) 5 D) 6",
    grammar=answer_grammar,
    max_tokens=100
)
```
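
For simple choice constraints, the GBNF string can be generated programmatically instead of hand-written. A sketch that builds a root rule matching exactly one of a list of literals; pass the result to `LlamaGrammar.from_string`:

```python
def choice_grammar(choices):
    """Build a GBNF grammar whose root matches exactly one of `choices`."""
    alternatives = " | ".join(f'"{c}"' for c in choices)
    return f"root ::= {alternatives}"

g = choice_grammar(["yes", "no", "maybe"])
print(g)  # root ::= "yes" | "no" | "maybe"
```

Usage: `llm("Is Python compiled?", grammar=LlamaGrammar.from_string(g), max_tokens=5)` forces the output to be one of the listed strings.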

## LoRA Integration

### Load LoRA Adapter

```bash
# Apply a LoRA adapter at runtime (scale defaults to 1.0)
./llama-cli -m base-model-q4_k_m.gguf \
  --lora lora-adapter.gguf \
  -p "Hello!"
```

### Multiple LoRA Adapters

```bash
# Stack multiple adapters with explicit scales (--lora-scaled FILE SCALE)
./llama-cli -m base-model.gguf \
  --lora-scaled adapter1.gguf 0.5 \
  --lora-scaled adapter2.gguf 0.5 \
  -p "Hello!"
```

### Python LoRA Usage

```python
from llama_cpp import Llama

llm = Llama(
    model_path="base-model-q4_k_m.gguf",
    lora_path="lora-adapter.gguf",
    lora_scale=1.0,
    n_gpu_layers=35
)
```

## Embedding Generation

### Extract Embeddings

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k_m.gguf",
    embedding=True,  # Enable embedding mode
    n_gpu_layers=35
)

# Get embeddings
embeddings = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(embeddings)}")
```

### Batch Embeddings

```python
texts = [
    "Machine learning is fascinating.",
    "Deep learning uses neural networks.",
    "Python is a programming language."
]

embeddings = [llm.embed(text) for text in texts]

# Calculate similarity
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity: {sim:.4f}")
```

## Performance Tuning

### Benchmark Script

```python
import time
from llama_cpp import Llama

def benchmark(model_path, prompt, n_tokens=100, n_runs=5):
    llm = Llama(
        model_path=model_path,
        n_gpu_layers=35,
        n_ctx=2048,
        verbose=False
    )

    # Warmup
    llm(prompt, max_tokens=10)

    # Benchmark
    times = []
    for _ in range(n_runs):
        start = time.time()
        output = llm(prompt, max_tokens=n_tokens)
        elapsed = time.time() - start
        times.append(elapsed)

    avg_time = sum(times) / len(times)
    tokens_per_sec = n_tokens / avg_time

    print(f"Model: {model_path}")
    print(f"Avg time: {avg_time:.2f}s")
    print(f"Tokens/sec: {tokens_per_sec:.1f}")

    return tokens_per_sec

# Compare quantizations
for quant in ["q4_k_m", "q5_k_m", "q8_0"]:
    benchmark(f"model-{quant}.gguf", "Explain quantum computing:", 100)
```

### Optimal Configuration Finder

```python
import gc
import time
from llama_cpp import Llama

def find_optimal_config(model_path):
    """Grid-search n_gpu_layers and n_batch for the fastest working config."""
    best_config = None
    best_speed = 0

    for n_gpu_layers in range(0, 50, 5):
        for n_batch in [128, 256, 512, 1024]:
            try:
                gc.collect()
                llm = Llama(
                    model_path=model_path,
                    n_gpu_layers=n_gpu_layers,
                    n_batch=n_batch,
                    n_ctx=2048,
                    verbose=False
                )

                # Quick benchmark
                start = time.time()
                llm("Hello", max_tokens=50)
                speed = 50 / (time.time() - start)

                if speed > best_speed:
                    best_speed = speed
                    best_config = {
                        "n_gpu_layers": n_gpu_layers,
                        "n_batch": n_batch,
                        "speed": speed
                    }

                del llm
                gc.collect()

            except Exception:
                print(f"OOM at layers={n_gpu_layers}, batch={n_batch}")
                break

    return best_config
```

## Multi-GPU Setup

### Distribute Across GPUs

```bash
# Split model across multiple GPUs
./llama-cli -m large-model.gguf \
  --tensor-split 0.5,0.5 \
  -ngl 60 \
  -p "Hello!"
```

### Python Multi-GPU

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from llama_cpp import Llama

llm = Llama(
    model_path="large-model-q4_k_m.gguf",
    n_gpu_layers=60,
    tensor_split=[0.5, 0.5]  # Split evenly across 2 GPUs
)
```
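
With unequal cards, split proportionally to VRAM rather than evenly. A sketch computing `tensor_split` fractions from per-GPU memory:

```python
def tensor_split_from_vram(vram_gb):
    """Proportional split fractions for --tensor-split / tensor_split."""
    total = sum(vram_gb)
    return [round(v / total, 3) for v in vram_gb]

print(tensor_split_from_vram([8, 24]))  # [0.25, 0.75]
```

For example, an 8 GB + 24 GB pair gets `tensor_split=[0.25, 0.75]`, putting three quarters of the layers on the larger card.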

## Custom Builds

### Build with All Optimizations

```bash
# Clean build with OpenBLAS-accelerated CPU prompt processing
make clean
LLAMA_OPENBLAS=1 make -j

# With CUDA (cuBLAS)
make clean
GGML_CUDA=1 make -j

# With a specific CUDA architecture
GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_86 make -j
```

### CMake Build

```bash
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j
```

442
skills/mlops/inference/gguf/references/troubleshooting.md
Normal file
@@ -0,0 +1,442 @@
# GGUF Troubleshooting Guide

## Installation Issues

### Build Fails

**Error**: `make: *** No targets specified and no makefile found`

**Fix**:
```bash
# Ensure you're in the llama.cpp directory
cd llama.cpp
make
```

**Error**: `fatal error: cuda_runtime.h: No such file or directory`

**Fix**:
```bash
# Install the CUDA toolkit
# Ubuntu
sudo apt install nvidia-cuda-toolkit

# Or set the CUDA path
export CUDA_PATH=/usr/local/cuda
export PATH=$CUDA_PATH/bin:$PATH
make GGML_CUDA=1
```

### Python Bindings Issues

**Error**: `ERROR: Failed building wheel for llama-cpp-python`

**Fix**:
```bash
# Install build dependencies
pip install cmake scikit-build-core

# For CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# For Metal (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```

**Error**: `ImportError: libcudart.so.XX: cannot open shared object file`

**Fix**:
```bash
# Add CUDA libraries to the library path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Or reinstall against the correct CUDA version
pip uninstall llama-cpp-python
CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```

## Conversion Issues

### Model Not Supported

**Error**: `KeyError: 'model.embed_tokens.weight'`

**Fix**:
```bash
# Check the model architecture
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./model').architectures)"

# Use the appropriate conversion script
# For most models:
python convert_hf_to_gguf.py ./model --outfile model.gguf

# For older models, check whether a legacy script is needed
```

### Vocabulary Mismatch

**Error**: `RuntimeError: Vocabulary size mismatch`

**Fix**:
```python
# Ensure the tokenizer matches the model
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./model")
model = AutoModelForCausalLM.from_pretrained("./model")

print(f"Tokenizer vocab size: {len(tokenizer)}")
print(f"Model vocab size: {model.config.vocab_size}")

# If they differ, resize the embeddings before conversion
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained("./model-fixed")
```

### Out of Memory During Conversion

**Error**: `torch.cuda.OutOfMemoryError` during conversion

**Fix**:
```bash
# Run the conversion on CPU
CUDA_VISIBLE_DEVICES="" python convert_hf_to_gguf.py ./model --outfile model.gguf

# Or convert directly to f16 to halve peak memory
python convert_hf_to_gguf.py ./model --outfile model.gguf --outtype f16
```

## Quantization Issues

### Wrong Output File Size

**Problem**: Quantized file is larger than expected

**Check**:
```bash
# Verify the quantization type (printed in the model load log)
./llama-cli -m model.gguf --verbose

# Expected sizes for a 7B model:
# Q4_K_M: ~4.1 GB
# Q5_K_M: ~4.8 GB
# Q8_0: ~7.2 GB
# F16: ~13.5 GB
```

### Quantization Crashes

**Error**: `Segmentation fault` during quantization

**Fix**:
```bash
# Increase the stack size
ulimit -s unlimited

# Or use fewer threads (thread count is the optional last argument)
./llama-quantize model-f16.gguf model-q4.gguf Q4_K_M 4
```

### Poor Quality After Quantization

**Problem**: Model outputs gibberish after quantization

**Solutions**:

1. **Use importance matrix**:
```bash
# Generate imatrix with good calibration data
./llama-imatrix -m model-f16.gguf \
  -f wiki_sample.txt \
  --chunk 512 \
  -o model.imatrix

# Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
  model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

2. **Try higher precision**:
```bash
# Use Q5_K_M or Q6_K instead of Q4
./llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M
```

3. **Check the original model**:
```bash
# Test the FP16 version first
./llama-cli -m model-f16.gguf -p "Hello, how are you?" -n 50
```

## Inference Issues

### Slow Generation

**Problem**: Generation is slower than expected

**Solutions**:

1. **Enable GPU offload**:

   ```bash
   ./llama-cli -m model.gguf -ngl 35 -p "Hello"
   ```

2. **Optimize batch size**:

   ```python
   from llama_cpp import Llama

   llm = Llama(
       model_path="model.gguf",
       n_batch=512,      # Increase for faster prompt processing
       n_gpu_layers=35
   )
   ```

3. **Use appropriate threads**:

   ```bash
   # Match physical cores, not logical
   ./llama-cli -m model.gguf -t 8 -p "Hello"
   ```

4. **Enable Flash Attention** (if supported):

   ```bash
   ./llama-cli -m model.gguf -ngl 35 --flash-attn -p "Hello"
   ```
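
A quick way to pick a starting value for `-t`: on SMT/hyperthreaded CPUs the physical core count is typically half the logical count reported by the OS (a heuristic, not exact on all hardware):

```python
import os

logical = os.cpu_count() or 1
# Heuristic: physical cores ≈ logical // 2 on SMT CPUs; llama.cpp usually
# performs best with -t near the physical core count.
physical_guess = max(1, logical // 2)
print(f"try: ./llama-cli -m model.gguf -t {physical_guess}")
```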

### Out of Memory

**Error**: `CUDA out of memory` or system freeze

**Solutions**:

1. **Reduce GPU layers**:

   ```python
   from llama_cpp import Llama

   # Start low and increase
   llm = Llama(model_path="model.gguf", n_gpu_layers=10)
   ```

2. **Use smaller quantization**:

   ```bash
   ./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
   ```

3. **Reduce context length**:

   ```python
   llm = Llama(
       model_path="model.gguf",
       n_ctx=2048,       # Reduce from 4096
       n_gpu_layers=35
   )
   ```

4. **Quantize KV cache**:

   ```python
   llm = Llama(
       model_path="model.gguf",
       type_k=2,         # Q4_0 for K cache
       type_v=2,         # Q4_0 for V cache
       n_gpu_layers=35
   )
   ```
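
The savings from a quantized KV cache follow from its size formula: bytes ≈ 2 (K and V) × layers × KV heads × head dim × context length × bytes per element. A back-of-envelope estimate, assuming Llama-7B-like shapes (32 layers, 32 KV heads, head dim 128 — check your model's metadata for the real values):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    # K and V tensors per layer, one slot per context position
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

f16 = kv_cache_bytes(32, 32, 128, 4096, 2)   # F16 cache
q4 = f16 * 4.5 / 16                          # ~Q4_0 effective bits per element
print(f"F16 KV cache: {f16 / 1e9:.2f} GB, ~Q4_0: {q4 / 1e9:.2f} GB")
```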

### Garbage Output

**Problem**: Model outputs random characters or nonsense

**Diagnose**:

```python
from llama_cpp import Llama

# Check model loading
llm = Llama(model_path="model.gguf", verbose=True)

# Test with simple prompt
output = llm("1+1=", max_tokens=5, temperature=0)
print(output)
```

**Solutions**:

1. **Check model integrity**:

   ```bash
   # Verify GGUF file
   ./llama-cli -m model.gguf --verbose 2>&1 | head -50
   ```

2. **Use correct chat format**:

   ```python
   llm = Llama(
       model_path="model.gguf",
       chat_format="llama-3"  # Match your model: chatml, mistral, etc.
   )
   ```

3. **Check temperature**:

   ```python
   # Use lower temperature for deterministic output
   output = llm("Hello", max_tokens=50, temperature=0.1)
   ```

### Token Issues

**Error**: `RuntimeError: unknown token` or encoding errors

**Fix**:

```python
# Ensure input bytes decode cleanly as UTF-8 before passing them to the model
raw = b"Hello, world!"
prompt = raw.decode("utf-8", errors="replace")
output = llm(prompt, max_tokens=50)
```

## Server Issues

### Connection Refused

**Error**: `Connection refused` when accessing server

**Fix**:

```bash
# Bind to all interfaces
./llama-server -m model.gguf --host 0.0.0.0 --port 8080

# Check if port is in use
lsof -i :8080
```

### Server Crashes Under Load

**Problem**: Server crashes with multiple concurrent requests

**Solutions**:

1. **Limit parallelism**:

   ```bash
   ./llama-server -m model.gguf \
       --parallel 2 \
       -c 4096 \
       --cont-batching
   ```

2. **Add request timeout**:

   ```bash
   ./llama-server -m model.gguf --timeout 300
   ```

3. **Monitor memory**:

   ```bash
   watch -n 1 nvidia-smi  # For GPU
   watch -n 1 free -h     # For RAM
   ```

### API Compatibility Issues

**Problem**: OpenAI client not working with server

**Fix**:

```python
from openai import OpenAI

# Use correct base URL format
client = OpenAI(
    base_url="http://localhost:8080/v1",  # Include /v1
    api_key="not-needed"
)

# Use correct model name
response = client.chat.completions.create(
    model="local",  # Or the actual model name
    messages=[{"role": "user", "content": "Hello"}]
)
```

## Apple Silicon Issues

### Metal Not Working

**Problem**: Metal acceleration not enabled

**Check**:

```bash
# Verify Metal support
./llama-cli -m model.gguf --verbose 2>&1 | grep -i metal
```

**Fix**:

```bash
# Rebuild with Metal
make clean
make GGML_METAL=1

# Python bindings
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall
```

### Incorrect Memory Usage on M1/M2

**Problem**: Model uses too much unified memory

**Fix**:

```python
from llama_cpp import Llama

# Offload all layers for Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload everything
    n_threads=1       # Metal handles parallelism
)
```

## Debugging

### Enable Verbose Output

```bash
# CLI verbose mode
./llama-cli -m model.gguf --verbose -p "Hello" -n 50
```

```python
# Python verbose mode
llm = Llama(model_path="model.gguf", verbose=True)
```

### Check Model Metadata

```bash
# View GGUF metadata
./llama-cli -m model.gguf --verbose 2>&1 | head -100
```

### Validate GGUF File

```python
import struct

def validate_gguf(filepath):
    with open(filepath, 'rb') as f:
        magic = f.read(4)
        if magic != b'GGUF':
            print(f"Invalid magic: {magic}")
            return False

        version = struct.unpack('<I', f.read(4))[0]
        print(f"GGUF version: {version}")

        tensor_count = struct.unpack('<Q', f.read(8))[0]
        metadata_count = struct.unpack('<Q', f.read(8))[0]
        print(f"Tensors: {tensor_count}, Metadata: {metadata_count}")

    return True

validate_gguf("model.gguf")
```

## Getting Help

1. **GitHub Issues**: https://github.com/ggml-org/llama.cpp/issues
2. **Discussions**: https://github.com/ggml-org/llama.cpp/discussions
3. **Reddit**: r/LocalLLaMA

### Reporting Issues

Include:
- llama.cpp version/commit hash
- Build command used
- Model name and quantization
- Full error message/stack trace
- Hardware: CPU/GPU model, RAM, VRAM
- OS version
- Minimal reproduction steps
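
Most of these details can be collected with a short script (a sketch: the `./llama-cli` path is an assumption, and missing tools are simply skipped):

```shell
# Collect environment details for a bug report into report.txt
{
  echo "## OS";        uname -a
  echo "## CPU cores"; getconf _NPROCESSORS_ONLN 2>/dev/null || true
  echo "## GPU";       nvidia-smi -L 2>/dev/null || true
  echo "## llama.cpp"; ./llama-cli --version 2>/dev/null || true
} > report.txt
cat report.txt
```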
575
skills/mlops/inference/guidance/SKILL.md
Normal file
@@ -0,0 +1,575 @@
---
name: guidance
description: Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [guidance, transformers]
metadata:
  hermes:
    tags: [Prompt Engineering, Guidance, Constrained Generation, Structured Output, JSON Validation, Grammar, Microsoft Research, Format Enforcement, Multi-Step Workflows]
---

# Guidance: Constrained LLM Generation

## When to Use This Skill

Use Guidance when you need to:
- **Control LLM output syntax** with regex or grammars
- **Guarantee valid JSON/XML/code** generation
- **Reduce latency** vs traditional prompting approaches
- **Enforce structured formats** (dates, emails, IDs, etc.)
- **Build multi-step workflows** with Pythonic control flow
- **Prevent invalid outputs** through grammatical constraints

**GitHub Stars**: 18,000+ | **From**: Microsoft Research

## Installation

```bash
# Base installation
pip install guidance

# With specific backends
pip install guidance[transformers]  # Hugging Face models
pip install guidance[llama_cpp]     # llama.cpp models
```

## Quick Start

### Basic Example: Structured Generation

```python
from guidance import models, gen

# Load model (supports OpenAI, Transformers, llama.cpp)
lm = models.OpenAI("gpt-4")

# Generate with constraints
result = lm + "The capital of France is " + gen("capital", max_tokens=5)

print(result["capital"])  # "Paris"
```

### With Anthropic Claude

```python
from guidance import models, gen, system, user, assistant

# Configure Claude
lm = models.Anthropic("claude-sonnet-4-5-20250929")

# Use context managers for chat format
with system():
    lm += "You are a helpful assistant."

with user():
    lm += "What is the capital of France?"

with assistant():
    lm += gen(max_tokens=20)
```

## Core Concepts

### 1. Context Managers

Guidance uses Pythonic context managers for chat-style interactions.

```python
from guidance import models, system, user, assistant, gen

lm = models.Anthropic("claude-sonnet-4-5-20250929")

# System message
with system():
    lm += "You are a JSON generation expert."

# User message
with user():
    lm += "Generate a person object with name and age."

# Assistant response
with assistant():
    lm += gen("response", max_tokens=100)

print(lm["response"])
```

**Benefits:**
- Natural chat flow
- Clear role separation
- Easy to read and maintain

### 2. Constrained Generation

Guidance ensures outputs match specified patterns using regex or grammars.

#### Regex Constraints

```python
from guidance import models, gen

lm = models.Anthropic("claude-sonnet-4-5-20250929")

# Constrain to valid email format
lm += "Email: " + gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Constrain to date format (YYYY-MM-DD)
lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}")

# Constrain to phone number
lm += "Phone: " + gen("phone", regex=r"\d{3}-\d{3}-\d{4}")

print(lm["email"])  # Guaranteed valid email
print(lm["date"])   # Guaranteed YYYY-MM-DD format
```

**How it works:**
- Regex converted to grammar at token level
- Invalid tokens filtered during generation
- Model can only produce matching outputs
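
The filtering step can be illustrated with a toy example. For the date pattern `\d{4}-\d{2}-\d{2}`, the set of characters allowed at each position is fixed, so candidate tokens that would make the pattern unmatchable can be masked out. This is a hand-rolled check for this one pattern; a real engine derives the same information automatically from an arbitrary regex or grammar:

```python
# Which characters may appear at each position of \d{4}-\d{2}-\d{2}?
def allowed_next_chars(prefix):
    i = len(prefix)
    if i >= 10:
        return set()                      # pattern already complete
    return {"-"} if i in (4, 7) else set("0123456789")

# Keep only tokens whose every character stays inside the pattern
def filter_tokens(prefix, vocab):
    ok = []
    for tok in vocab:
        p, good = prefix, True
        for ch in tok:
            if ch not in allowed_next_chars(p):
                good = False
                break
            p += ch
        if good and tok:
            ok.append(tok)
    return ok

vocab = ["20", "24", "-0", "9-", "abc", "2024", "-"]
print(filter_tokens("", vocab))      # ['20', '24', '2024']
print(filter_tokens("2024", vocab))  # ['-0', '-']
```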

#### Selection Constraints

```python
from guidance import models, gen, select

lm = models.Anthropic("claude-sonnet-4-5-20250929")

# Constrain to specific choices
lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="sentiment")

# Multiple-choice selection
lm += "Best answer: " + select(
    ["A) Paris", "B) London", "C) Berlin", "D) Madrid"],
    name="answer"
)

print(lm["sentiment"])  # One of: positive, negative, neutral
print(lm["answer"])     # One of the four option strings
```

### 3. Token Healing

Guidance automatically "heals" token boundaries between prompt and generation.

**Problem:** Tokenization creates unnatural boundaries.

```python
# Without token healing
prompt = "The capital of France is "
# Last token: " is "
# First generated token might be " Par" (with leading space)
# Result: "The capital of France is  Paris" (double space!)
```

**Solution:** Guidance backs up one token and regenerates.

```python
from guidance import models, gen

lm = models.Anthropic("claude-sonnet-4-5-20250929")

# Token healing enabled by default
lm += "The capital of France is " + gen("capital", max_tokens=5)
# Result: "The capital of France is Paris" (correct spacing)
```

**Benefits:**
- Natural text boundaries
- No awkward spacing issues
- Better model performance (sees natural token sequences)
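
The mechanics can be simulated with a toy greedy tokenizer (assumed vocabulary; real tokenizers are BPE-based, but the boundary problem is the same):

```python
# Toy longest-match "tokenizer" over a tiny vocabulary
vocab = ["The", " capital", " of", " France", " is", " is ", " Paris"]

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"untokenizable at position {i}")
        tokens.append(match)
        i += len(match)
    return tokens

prompt = "The capital of France is "
# Naive: tokenize the prompt as-is, then append the model's " Paris" token
naive = tokenize(prompt) + [" Paris"]
print(repr("".join(naive)))   # double space: '...is  Paris'

# Healed: drop the last prompt token (" is ") and constrain generation to
# start with its text, letting the model pick a natural tokenization
healed = tokenize(prompt)[:-1] + [" is", " Paris"]
print(repr("".join(healed)))  # '...is Paris'
```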

### 4. Grammar-Based Generation

Define complex structures using context-free grammars.

```python
from guidance import models, gen

lm = models.Anthropic("claude-sonnet-4-5-20250929")

# JSON grammar (simplified)
json_grammar = """
{
    "name": <gen name regex="[A-Za-z ]+" max_tokens=20>,
    "age": <gen age regex="[0-9]+" max_tokens=3>,
    "email": <gen email regex="[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}" max_tokens=50>
}
"""

# Generate valid JSON
lm += gen("person", grammar=json_grammar)

print(lm["person"])  # Guaranteed valid JSON structure
```

**Use cases:**
- Complex structured outputs
- Nested data structures
- Programming language syntax
- Domain-specific languages

### 5. Guidance Functions

Create reusable generation patterns with the `@guidance` decorator.

```python
from guidance import guidance, gen, models

@guidance
def generate_person(lm):
    """Generate a person with name and age."""
    lm += "Name: " + gen("name", max_tokens=20, stop="\n")
    lm += "\nAge: " + gen("age", regex=r"[0-9]+", max_tokens=3)
    return lm

# Use the function
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = generate_person(lm)

print(lm["name"])
print(lm["age"])
```

**Stateful Functions:**

```python
from guidance import guidance, gen, select

@guidance(stateless=False)
def react_agent(lm, question, tools, max_rounds=5):
    """ReAct agent with tool use."""
    lm += f"Question: {question}\n\n"

    for i in range(max_rounds):
        # Thought
        lm += f"Thought {i+1}: " + gen("thought", stop="\n")

        # Action
        lm += "\nAction: " + select(list(tools.keys()), name="action")

        # Execute tool
        tool_result = tools[lm["action"]]()
        lm += f"\nObservation: {tool_result}\n\n"

        # Check if done
        lm += "Done? " + select(["Yes", "No"], name="done")
        if lm["done"] == "Yes":
            break

    # Final answer
    lm += "\nFinal Answer: " + gen("answer", max_tokens=100)
    return lm
```

## Backend Configuration

### Anthropic Claude

```python
from guidance import models

lm = models.Anthropic(
    model="claude-sonnet-4-5-20250929",
    api_key="your-api-key"  # Or set ANTHROPIC_API_KEY env var
)
```

### OpenAI

```python
lm = models.OpenAI(
    model="gpt-4o-mini",
    api_key="your-api-key"  # Or set OPENAI_API_KEY env var
)
```

### Local Models (Transformers)

```python
from guidance.models import Transformers

lm = Transformers(
    "microsoft/Phi-4-mini-instruct",
    device="cuda"  # Or "cpu"
)
```

### Local Models (llama.cpp)

```python
from guidance.models import LlamaCpp

lm = LlamaCpp(
    model_path="/path/to/model.gguf",
    n_ctx=4096,
    n_gpu_layers=35
)
```

## Common Patterns

### Pattern 1: JSON Generation

```python
from guidance import models, gen, system, user, assistant

lm = models.Anthropic("claude-sonnet-4-5-20250929")

with system():
    lm += "You generate valid JSON."

with user():
    lm += "Generate a user profile with name, age, and email."

with assistant():
    lm += """{
    "name": """ + gen("name", regex=r'"[A-Za-z ]+"', max_tokens=30) + """,
    "age": """ + gen("age", regex=r"[0-9]+", max_tokens=3) + """,
    "email": """ + gen("email", regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"', max_tokens=50) + """
}"""

print(lm)  # Valid JSON guaranteed
```

### Pattern 2: Classification

```python
from guidance import models, gen, select

lm = models.Anthropic("claude-sonnet-4-5-20250929")

text = "This product is amazing! I love it."

lm += f"Text: {text}\n"
lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="sentiment")
lm += "\nConfidence: " + gen("confidence", regex=r"[0-9]+", max_tokens=3) + "%"

print(f"Sentiment: {lm['sentiment']}")
print(f"Confidence: {lm['confidence']}%")
```

### Pattern 3: Multi-Step Reasoning

```python
from guidance import models, gen, guidance

@guidance
def chain_of_thought(lm, question):
    """Generate answer with step-by-step reasoning."""
    lm += f"Question: {question}\n\n"

    # Generate multiple reasoning steps
    for i in range(3):
        lm += f"Step {i+1}: " + gen(f"step_{i+1}", stop="\n", max_tokens=100) + "\n"

    # Final answer
    lm += "\nTherefore, the answer is: " + gen("answer", max_tokens=50)

    return lm

lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = chain_of_thought(lm, "What is 15% of 200?")

print(lm["answer"])
```

### Pattern 4: ReAct Agent

```python
from guidance import models, gen, select, guidance

@guidance(stateless=False)
def react_agent(lm, question):
    """ReAct agent with tool use."""
    tools = {
        "calculator": lambda expr: eval(expr),  # demo only: eval is unsafe on untrusted input
        "search": lambda query: f"Search results for: {query}",
    }

    lm += f"Question: {question}\n\n"

    for turn in range(5):
        # Thought
        lm += "Thought: " + gen("thought", stop="\n") + "\n"

        # Action selection
        lm += "Action: " + select(["calculator", "search", "answer"], name="action")

        if lm["action"] == "answer":
            lm += "\nFinal Answer: " + gen("answer", max_tokens=100)
            break

        # Action input
        lm += "\nAction Input: " + gen("action_input", stop="\n") + "\n"

        # Execute tool
        if lm["action"] in tools:
            result = tools[lm["action"]](lm["action_input"])
            lm += f"Observation: {result}\n\n"

    return lm

lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = react_agent(lm, "What is 25 * 4 + 10?")
print(lm["answer"])
```

### Pattern 5: Data Extraction

```python
from guidance import models, gen, guidance

@guidance
def extract_entities(lm, text):
    """Extract structured entities from text."""
    lm += f"Text: {text}\n\n"

    # Extract person
    lm += "Person: " + gen("person", stop="\n", max_tokens=30) + "\n"

    # Extract organization
    lm += "Organization: " + gen("organization", stop="\n", max_tokens=30) + "\n"

    # Extract date
    lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}", max_tokens=10) + "\n"

    # Extract location
    lm += "Location: " + gen("location", stop="\n", max_tokens=30) + "\n"

    return lm

text = "Tim Cook announced at Apple Park on 2024-09-15 in Cupertino."

lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = extract_entities(lm, text)

print(f"Person: {lm['person']}")
print(f"Organization: {lm['organization']}")
print(f"Date: {lm['date']}")
print(f"Location: {lm['location']}")
```

## Best Practices

### 1. Use Regex for Format Validation

```python
# ✅ Good: Regex ensures valid format
lm += "Email: " + gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# ❌ Bad: Free generation may produce invalid emails
lm += "Email: " + gen("email", max_tokens=50)
```

### 2. Use select() for Fixed Categories

```python
# ✅ Good: Guaranteed valid category
lm += "Status: " + select(["pending", "approved", "rejected"], name="status")

# ❌ Bad: May generate typos or invalid values
lm += "Status: " + gen("status", max_tokens=20)
```

### 3. Leverage Token Healing

```python
# Token healing is enabled by default
# No special action needed - just concatenate naturally
lm += "The capital is " + gen("capital")  # Automatic healing
```

### 4. Use stop Sequences

```python
# ✅ Good: Stop at newline for single-line outputs
lm += "Name: " + gen("name", stop="\n")

# ❌ Bad: May generate multiple lines
lm += "Name: " + gen("name", max_tokens=50)
```

### 5. Create Reusable Functions

```python
# ✅ Good: Reusable pattern
@guidance
def generate_person(lm):
    lm += "Name: " + gen("name", stop="\n")
    lm += "\nAge: " + gen("age", regex=r"[0-9]+")
    return lm

# Use multiple times
lm = generate_person(lm)
lm += "\n\n"
lm = generate_person(lm)
```

### 6. Balance Constraints

```python
# ✅ Good: Reasonable constraints
lm += gen("name", regex=r"[A-Za-z ]+", max_tokens=30)

# ❌ Too strict: May fail or be very slow
lm += gen("name", regex=r"^(John|Jane)$", max_tokens=10)
```

## Comparison to Alternatives

| Feature | Guidance | Instructor | Outlines | LMQL |
|---------|----------|------------|----------|------|
| Regex Constraints | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes |
| Grammar Support | ✅ CFG | ❌ No | ✅ CFG | ✅ CFG |
| Pydantic Validation | ❌ No | ✅ Yes | ✅ Yes | ❌ No |
| Token Healing | ✅ Yes | ❌ No | ✅ Yes | ❌ No |
| Local Models | ✅ Yes | ⚠️ Limited | ✅ Yes | ✅ Yes |
| API Models | ✅ Yes | ✅ Yes | ⚠️ Limited | ✅ Yes |
| Pythonic Syntax | ✅ Yes | ✅ Yes | ✅ Yes | ❌ SQL-like |
| Learning Curve | Low | Low | Medium | High |

**When to choose Guidance:**
- Need regex/grammar constraints
- Want token healing
- Building complex workflows with control flow
- Using local models (Transformers, llama.cpp)
- Prefer Pythonic syntax

**When to choose alternatives:**
- Instructor: Need Pydantic validation with automatic retrying
- Outlines: Need JSON schema validation
- LMQL: Prefer declarative query syntax

## Performance Characteristics

**Latency Reduction:**
- 30-50% faster than traditional prompting for constrained outputs
- Token healing reduces unnecessary regeneration
- Grammar constraints prevent invalid token generation

**Memory Usage:**
- Minimal overhead vs unconstrained generation
- Grammar compilation cached after first use
- Efficient token filtering at inference time

**Token Efficiency:**
- Prevents wasted tokens on invalid outputs
- No need for retry loops
- Direct path to valid outputs

## Resources

- **Documentation**: https://guidance.readthedocs.io
- **GitHub**: https://github.com/guidance-ai/guidance (18k+ stars)
- **Notebooks**: https://github.com/guidance-ai/guidance/tree/main/notebooks
- **Discord**: Community support available

## See Also

- `references/constraints.md` - Comprehensive regex and grammar patterns
- `references/backends.md` - Backend-specific configuration
- `references/examples.md` - Production-ready examples
554
skills/mlops/inference/guidance/references/backends.md
Normal file
@@ -0,0 +1,554 @@
# Backend Configuration Guide

Complete guide to configuring Guidance with different LLM backends.

## Table of Contents
- API-Based Models (Anthropic, OpenAI)
- Local Models (Transformers, llama.cpp)
- Backend Comparison
- Performance Tuning
- Advanced Configuration

## API-Based Models

### Anthropic Claude

#### Basic Setup

```python
from guidance import models

# Using environment variable
lm = models.Anthropic("claude-sonnet-4-5-20250929")
# Reads ANTHROPIC_API_KEY from environment

# Explicit API key
lm = models.Anthropic(
    model="claude-sonnet-4-5-20250929",
    api_key="your-api-key-here"
)
```

#### Available Models

```python
# Claude Sonnet 4.5 (Latest, recommended)
lm = models.Anthropic("claude-sonnet-4-5-20250929")

# Claude 3.7 Sonnet (Fast, cost-effective)
lm = models.Anthropic("claude-3-7-sonnet-20250219")

# Claude 3 Opus (Most capable of the Claude 3 line)
lm = models.Anthropic("claude-3-opus-20240229")

# Claude 3.5 Haiku (Fastest, cheapest)
lm = models.Anthropic("claude-3-5-haiku-20241022")
```

#### Configuration Options

```python
lm = models.Anthropic(
    model="claude-sonnet-4-5-20250929",
    api_key="your-api-key",
    max_tokens=4096,   # Max tokens to generate
    temperature=0.7,   # Sampling temperature (0-1)
    top_p=0.9,         # Nucleus sampling
    timeout=30,        # Request timeout (seconds)
    max_retries=3      # Retry failed requests
)
```

#### With Context Managers

```python
from guidance import models, system, user, assistant, gen

lm = models.Anthropic("claude-sonnet-4-5-20250929")

with system():
    lm += "You are a helpful assistant."

with user():
    lm += "What is the capital of France?"

with assistant():
    lm += gen(max_tokens=50)

print(lm)
```

### OpenAI

#### Basic Setup

```python
from guidance import models

# Using environment variable
lm = models.OpenAI("gpt-4o")
# Reads OPENAI_API_KEY from environment

# Explicit API key
lm = models.OpenAI(
    model="gpt-4o",
    api_key="your-api-key-here"
)
```

#### Available Models

```python
# GPT-4o (Latest, multimodal)
lm = models.OpenAI("gpt-4o")

# GPT-4o Mini (Fast, cost-effective)
lm = models.OpenAI("gpt-4o-mini")

# GPT-4 Turbo
lm = models.OpenAI("gpt-4-turbo")

# GPT-3.5 Turbo (Cheapest)
lm = models.OpenAI("gpt-3.5-turbo")
```

#### Configuration Options

```python
lm = models.OpenAI(
    model="gpt-4o-mini",
    api_key="your-api-key",
    max_tokens=2048,
    temperature=0.7,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    timeout=30
)
```

#### Chat Format

```python
from guidance import models, gen

lm = models.OpenAI("gpt-4o-mini")

# OpenAI uses chat format
lm += [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"}
]

# Generate response
lm += gen(max_tokens=50)
```

### Azure OpenAI

```python
from guidance import models

lm = models.AzureOpenAI(
    model="gpt-4o",
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_key="your-azure-api-key",
    api_version="2024-02-15-preview",
    deployment_name="your-deployment-name"
)
```

## Local Models

### Transformers (Hugging Face)

#### Basic Setup

```python
from guidance.models import Transformers

# Load model from Hugging Face
lm = Transformers("microsoft/Phi-4-mini-instruct")
```

#### GPU Configuration

```python
# Use GPU
lm = Transformers(
    "microsoft/Phi-4-mini-instruct",
    device="cuda"
)

# Use a specific GPU
lm = Transformers(
    "microsoft/Phi-4-mini-instruct",
    device="cuda:0"  # GPU 0
)

# Use CPU
lm = Transformers(
    "microsoft/Phi-4-mini-instruct",
    device="cpu"
)
```

#### Advanced Configuration

```python
lm = Transformers(
    "microsoft/Phi-4-mini-instruct",
    device="cuda",
    torch_dtype="float16",       # Use FP16 (faster, less memory)
    load_in_8bit=True,           # 8-bit quantization
    max_memory={0: "20GB"},      # GPU memory limit
    offload_folder="./offload"   # Offload to disk if needed
)
```

#### Popular Models

```python
# Phi-4 (Microsoft)
lm = Transformers("microsoft/Phi-4-mini-instruct")
lm = Transformers("microsoft/Phi-3-medium-4k-instruct")

# Llama 3 (Meta)
lm = Transformers("meta-llama/Llama-3.1-8B-Instruct")
lm = Transformers("meta-llama/Llama-3.1-70B-Instruct")

# Mistral (Mistral AI)
lm = Transformers("mistralai/Mistral-7B-Instruct-v0.3")
lm = Transformers("mistralai/Mixtral-8x7B-Instruct-v0.1")

# Qwen (Alibaba)
lm = Transformers("Qwen/Qwen2.5-7B-Instruct")

# Gemma (Google)
lm = Transformers("google/gemma-2-9b-it")
```

#### Generation Configuration

```python
lm = Transformers(
    "microsoft/Phi-4-mini-instruct",
    device="cuda"
)

# Configure generation
from guidance import gen

result = lm + gen(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1
)
```

### llama.cpp

#### Basic Setup

```python
from guidance.models import LlamaCpp

# Load GGUF model
lm = LlamaCpp(
    model_path="/path/to/model.gguf",
    n_ctx=4096  # Context window
)
```

#### GPU Configuration

```python
# Use GPU acceleration
lm = LlamaCpp(
    model_path="/path/to/model.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # Offload 35 layers to GPU
    n_threads=8       # CPU threads for remaining layers
)

# Full GPU offload
lm = LlamaCpp(
    model_path="/path/to/model.gguf",
    n_ctx=4096,
    n_gpu_layers=-1  # Offload all layers
)
```
#### Advanced Configuration
|
||||
|
||||
```python
|
||||
lm = LlamaCpp(
|
||||
model_path="/path/to/llama-3.1-8b-instruct.Q4_K_M.gguf",
|
||||
n_ctx=8192, # Context window (tokens)
|
||||
n_gpu_layers=35, # GPU layers
|
||||
n_threads=8, # CPU threads
|
||||
n_batch=512, # Batch size for prompt processing
|
||||
use_mmap=True, # Memory-map the model file
|
||||
use_mlock=False, # Lock model in RAM
|
||||
seed=42, # Random seed
|
||||
verbose=False # Suppress verbose output
|
||||
)
|
||||
```
|
||||
|
||||
#### Quantized Models
|
||||
|
||||
```python
|
||||
# Q4_K_M (4-bit, recommended for most cases)
|
||||
lm = LlamaCpp("/path/to/model.Q4_K_M.gguf")
|
||||
|
||||
# Q5_K_M (5-bit, better quality)
|
||||
lm = LlamaCpp("/path/to/model.Q5_K_M.gguf")
|
||||
|
||||
# Q8_0 (8-bit, high quality)
|
||||
lm = LlamaCpp("/path/to/model.Q8_0.gguf")
|
||||
|
||||
# F16 (16-bit float, highest quality)
|
||||
lm = LlamaCpp("/path/to/model.F16.gguf")
|
||||
```

#### Popular GGUF Models

```python
# Llama 3.1
lm = LlamaCpp("llama-3.1-8b-instruct.Q4_K_M.gguf")

# Mistral
lm = LlamaCpp("mistral-7b-instruct-v0.3.Q4_K_M.gguf")

# Phi-4
lm = LlamaCpp("phi-4-mini-instruct.Q4_K_M.gguf")
```

## Backend Comparison

### Feature Matrix

| Feature | Anthropic | OpenAI | Transformers | llama.cpp |
|---------|-----------|--------|--------------|-----------|
| Constrained Generation | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Token Healing | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Streaming | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| GPU Support | N/A | N/A | ✅ Yes | ✅ Yes |
| Quantization | N/A | N/A | ✅ Yes | ✅ Yes |
| Cost | $$$ | $$$ | Free | Free |
| Latency | Low | Low | Medium | Low |
| Setup Difficulty | Easy | Easy | Medium | Medium |

### Performance Characteristics

**Anthropic Claude:**
- **Latency**: 200-500ms (API call)
- **Throughput**: Limited by API rate limits
- **Cost**: $3-15 per 1M input tokens
- **Best for**: Production systems, high-quality outputs

**OpenAI:**
- **Latency**: 200-400ms (API call)
- **Throughput**: Limited by API rate limits
- **Cost**: $0.15-30 per 1M input tokens
- **Best for**: Cost-sensitive production, gpt-4o-mini

**Transformers:**
- **Latency**: 50-200ms (local inference)
- **Throughput**: GPU-dependent (10-100 tokens/sec)
- **Cost**: Hardware cost only
- **Best for**: Privacy-sensitive, high-volume, experimentation

**llama.cpp:**
- **Latency**: 30-150ms (local inference)
- **Throughput**: Hardware-dependent (20-150 tokens/sec)
- **Cost**: Hardware cost only
- **Best for**: Edge deployment, Apple Silicon, CPU inference
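
These trade-offs can be condensed into a quick selection heuristic. The function below is an illustrative sketch only; the requirement names and the budget threshold are assumptions, not part of Guidance:

```python
def pick_backend(needs_privacy: bool, has_gpu: bool, budget_per_month: float) -> str:
    """Illustrative heuristic condensing the backend trade-offs above."""
    if needs_privacy:
        # Local inference only: prefer GPU-backed Transformers, else llama.cpp on CPU
        return "transformers" if has_gpu else "llama.cpp"
    if budget_per_month < 50:
        # Cheapest hosted option in this comparison
        return "openai (gpt-4o-mini)"
    return "anthropic (claude)"

print(pick_backend(needs_privacy=True, has_gpu=False, budget_per_month=0))  # llama.cpp
```

Adjust the branches to your own constraints; the point is that privacy forces local inference, and cost then decides between the hosted APIs.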

### Memory Requirements

**Transformers (FP16):**
- 7B model: ~14GB GPU VRAM
- 13B model: ~26GB GPU VRAM
- 70B model: ~140GB GPU VRAM (multi-GPU)

**llama.cpp (Q4_K_M):**
- 7B model: ~4.5GB RAM
- 13B model: ~8GB RAM
- 70B model: ~40GB RAM

**Optimization Tips:**
- Use quantized models (Q4_K_M) for lower memory
- Use GPU offloading for faster inference
- Use CPU inference for smaller models (<7B)
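
The FP16 figures above are just parameter count times bytes per weight; a quick back-of-the-envelope helper (a sketch — real files and runtimes add overhead for embeddings, metadata, and the KV cache, which is why the Q4_K_M figures run a little above raw weight size):

```python
def estimate_weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Raw weight memory in GB: parameters x bits per weight / 8."""
    return n_params_billion * bits_per_weight / 8

print(estimate_weights_gb(7, 16))    # FP16 7B -> 14.0 GB, matching the table
print(estimate_weights_gb(7, 4.85))  # Q4_K_M averages ~4.85 bits/weight -> ~4.2 GB raw
```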

## Performance Tuning

### API Models (Anthropic, OpenAI)

#### Reduce Latency

```python
from guidance import models, gen

lm = models.Anthropic("claude-sonnet-4-5-20250929")

# Use a lower max_tokens (faster response)
lm += gen(max_tokens=100)  # Instead of 1000

# Use streaming (reduces perceived latency)
for chunk in lm.stream(gen(max_tokens=500)):
    print(chunk, end="", flush=True)
```

#### Reduce Cost

```python
# Use cheaper models
lm = models.Anthropic("claude-3-5-haiku-20241022")  # vs Sonnet
lm = models.OpenAI("gpt-4o-mini")  # vs gpt-4o

# Reduce context size:
# - Keep prompts concise
# - Avoid large few-shot examples
# - Use max_tokens limits
```

### Local Models (Transformers, llama.cpp)

#### Optimize GPU Usage

```python
from guidance.models import Transformers

# Use FP16 for 2x speedup
lm = Transformers(
    "meta-llama/Llama-3.1-8B-Instruct",
    device="cuda",
    torch_dtype="float16"
)

# Use 8-bit quantization for 4x memory reduction
lm = Transformers(
    "meta-llama/Llama-3.1-8B-Instruct",
    device="cuda",
    load_in_8bit=True
)

# Use flash attention (requires the flash-attn package)
lm = Transformers(
    "meta-llama/Llama-3.1-8B-Instruct",
    device="cuda",
    use_flash_attention_2=True
)
```

#### Optimize llama.cpp

```python
from guidance.models import LlamaCpp

# Maximize GPU layers
lm = LlamaCpp(
    model_path="/path/to/model.Q4_K_M.gguf",
    n_gpu_layers=-1  # All layers on GPU
)

# Optimize batch size
lm = LlamaCpp(
    model_path="/path/to/model.Q4_K_M.gguf",
    n_batch=512,  # Larger batch = faster prompt processing
    n_gpu_layers=-1
)

# Use Metal (Apple Silicon)
lm = LlamaCpp(
    model_path="/path/to/model.Q4_K_M.gguf",
    n_gpu_layers=-1,  # Use Metal GPU acceleration
    use_mmap=True
)
```

#### Batch Processing

```python
from guidance import gen
from guidance.models import Transformers

# Process multiple requests efficiently
requests = [
    "What is 2+2?",
    "What is the capital of France?",
    "What is photosynthesis?"
]

# Bad: reloads the model for every request
for req in requests:
    lm = Transformers("microsoft/Phi-4-mini-instruct")
    lm += req + gen(max_tokens=50)

# Good: load the model once and reuse it
lm = Transformers("microsoft/Phi-4-mini-instruct")
for req in requests:
    lm += req + gen(max_tokens=50)
```

## Advanced Configuration

### Custom Model Configurations

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from guidance.models import Transformers

# Load custom model
tokenizer = AutoTokenizer.from_pretrained("your-model")
model = AutoModelForCausalLM.from_pretrained(
    "your-model",
    device_map="auto",
    torch_dtype="float16"
)

# Use with Guidance
lm = Transformers(model=model, tokenizer=tokenizer)
```

### Environment Variables

```bash
# API keys
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."

# Transformers cache
export HF_HOME="/path/to/cache"
export TRANSFORMERS_CACHE="/path/to/cache"

# GPU selection
export CUDA_VISIBLE_DEVICES=0,1  # Use GPUs 0 and 1
```

### Debugging

```python
import logging
import torch
from guidance import models
from guidance.models import Transformers

# Enable verbose logging
logging.basicConfig(level=logging.DEBUG)

# Check backend info
lm = models.Anthropic("claude-sonnet-4-5-20250929")
print(f"Model: {lm.model_name}")
print(f"Backend: {lm.backend}")

# Check GPU usage (Transformers)
lm = Transformers("microsoft/Phi-4-mini-instruct", device="cuda")
print(f"Device: {lm.device}")
print(f"Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```

## Resources

- **Anthropic Docs**: https://docs.anthropic.com
- **OpenAI Docs**: https://platform.openai.com/docs
- **Hugging Face Models**: https://huggingface.co/models
- **llama.cpp**: https://github.com/ggerganov/llama.cpp
- **GGUF Models**: https://huggingface.co/models?library=gguf
674
skills/mlops/inference/guidance/references/constraints.md
Normal file
@@ -0,0 +1,674 @@
# Comprehensive Constraint Patterns

Guide to regex constraints, grammar-based generation, and token healing in Guidance.

## Table of Contents
- Regex Constraints
- Grammar-Based Generation
- Token Healing
- Selection Constraints
- Complex Patterns
- Performance Optimization

## Regex Constraints

### Basic Patterns

#### Numeric Constraints

```python
from guidance import models, gen

lm = models.Anthropic("claude-sonnet-4-5-20250929")

# Integer (positive)
lm += "Age: " + gen("age", regex=r"[0-9]+")

# Integer (with negatives)
lm += "Temperature: " + gen("temp", regex=r"-?[0-9]+")

# Float (positive)
lm += "Price: $" + gen("price", regex=r"[0-9]+\.[0-9]{2}")

# Float (with negatives and optional decimals)
lm += "Value: " + gen("value", regex=r"-?[0-9]+(\.[0-9]+)?")

# Percentage (0-100)
lm += "Progress: " + gen("progress", regex=r"(100|[0-9]{1,2})")

# Range (1-5 stars)
lm += "Rating: " + gen("rating", regex=r"[1-5]") + " stars"
```

#### Text Constraints

```python
# Alphabetic only
lm += "Name: " + gen("name", regex=r"[A-Za-z]+")

# Alphabetic with spaces
lm += "Full Name: " + gen("full_name", regex=r"[A-Za-z ]+")

# Alphanumeric
lm += "Username: " + gen("username", regex=r"[A-Za-z0-9_]+")

# Capitalized words
lm += "Title: " + gen("title", regex=r"[A-Z][a-z]+( [A-Z][a-z]+)*")

# Lowercase only
lm += "Code: " + gen("code", regex=r"[a-z0-9-]+")

# Specific length
lm += "ID: " + gen("id", regex=r"[A-Z]{3}-[0-9]{6}")  # e.g., "ABC-123456"
```

#### Date and Time Constraints

```python
# Date (YYYY-MM-DD)
lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}")

# Date (MM/DD/YYYY)
lm += "Date: " + gen("date_us", regex=r"\d{2}/\d{2}/\d{4}")

# Time (HH:MM)
lm += "Time: " + gen("time", regex=r"\d{2}:\d{2}")

# Time (HH:MM:SS)
lm += "Time: " + gen("time_full", regex=r"\d{2}:\d{2}:\d{2}")

# ISO 8601 datetime
lm += "Timestamp: " + gen(
    "timestamp",
    regex=r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"
)

# Year (YYYY)
lm += "Year: " + gen("year", regex=r"(19|20)\d{2}")

# Month name
lm += "Month: " + gen(
    "month",
    regex=r"(January|February|March|April|May|June|July|August|September|October|November|December)"
)
```

#### Contact Information

```python
# Email
lm += "Email: " + gen(
    "email",
    regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
)

# Phone (US format)
lm += "Phone: " + gen("phone", regex=r"\d{3}-\d{3}-\d{4}")

# Phone (international format)
lm += "Phone: " + gen("phone_intl", regex=r"\+[0-9]{1,3}-[0-9]{1,14}")

# ZIP code (US)
lm += "ZIP: " + gen("zip", regex=r"\d{5}(-\d{4})?")

# Postal code (Canada)
lm += "Postal: " + gen("postal", regex=r"[A-Z]\d[A-Z] \d[A-Z]\d")

# URL
lm += "URL: " + gen(
    "url",
    regex=r"https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(/[a-zA-Z0-9._~:/?#\[\]@!$&'()*+,;=-]*)?"
)
```

### Advanced Patterns

#### JSON Field Constraints

```python
from guidance import models, gen

lm = models.Anthropic("claude-sonnet-4-5-20250929")

# String field with quotes
lm += '"name": ' + gen("name", regex=r'"[A-Za-z ]+"')

# Numeric field (no quotes)
lm += '"age": ' + gen("age", regex=r"[0-9]+")

# Boolean field
lm += '"active": ' + gen("active", regex=r"(true|false)")

# Null field
lm += '"optional": ' + gen("optional", regex=r"(null|[0-9]+)")

# Array of strings
lm += '"tags": [' + gen(
    "tags",
    regex=r'"[a-z]+"(, "[a-z]+")*'
) + ']'

# Complete JSON object
lm += """{
  "name": """ + gen("name", regex=r'"[A-Za-z ]+"') + """,
  "age": """ + gen("age", regex=r"[0-9]+") + """,
  "email": """ + gen(
    "email",
    regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
) + """
}"""
```

#### Code Patterns

```python
# Python variable name
lm += "Variable: " + gen("var", regex=r"[a-z_][a-z0-9_]*")

# Python function name
lm += "Function: " + gen("func", regex=r"[a-z_][a-z0-9_]*")

# Hex color code
lm += "Color: #" + gen("color", regex=r"[0-9A-Fa-f]{6}")

# UUID
lm += "UUID: " + gen(
    "uuid",
    regex=r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)

# Git commit hash (short)
lm += "Commit: " + gen("commit", regex=r"[0-9a-f]{7}")

# Semantic version
lm += "Version: " + gen("version", regex=r"[0-9]+\.[0-9]+\.[0-9]+")

# IP address (IPv4)
lm += "IP: " + gen(
    "ip",
    regex=r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
)
```

#### Domain-Specific Patterns

```python
# Credit card number
lm += "Card: " + gen("card", regex=r"\d{4}-\d{4}-\d{4}-\d{4}")

# Social Security Number (US)
lm += "SSN: " + gen("ssn", regex=r"\d{3}-\d{2}-\d{4}")

# ISBN-13
lm += "ISBN: " + gen("isbn", regex=r"978-\d{1,5}-\d{1,7}-\d{1,7}-\d")

# License plate (US)
lm += "Plate: " + gen("plate", regex=r"[A-Z]{3}-\d{4}")

# Currency amount
lm += "Amount: $" + gen("amount", regex=r"[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}")

# Percentage with decimal
lm += "Rate: " + gen("rate", regex=r"[0-9]+\.[0-9]{1,2}%")
```

## Grammar-Based Generation

### JSON Grammar

```python
from guidance import models, gen, guidance

@guidance
def json_object(lm):
    """Generate valid JSON object."""
    lm += "{\n"

    # Name field (required)
    lm += '  "name": ' + gen("name", regex=r'"[A-Za-z ]+"') + ",\n"

    # Age field (required)
    lm += '  "age": ' + gen("age", regex=r"[0-9]+") + ",\n"

    # Email field (required)
    lm += '  "email": ' + gen(
        "email",
        regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
    ) + ",\n"

    # Active field (required, boolean)
    lm += '  "active": ' + gen("active", regex=r"(true|false)") + "\n"

    lm += "}"
    return lm

lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = json_object(lm)
print(lm)  # Valid JSON guaranteed
```

### Nested JSON Grammar

```python
@guidance
def nested_json(lm):
    """Generate nested JSON structure."""
    lm += "{\n"

    # User object
    lm += '  "user": {\n'
    lm += '    "name": ' + gen("name", regex=r'"[A-Za-z ]+"') + ",\n"
    lm += '    "age": ' + gen("age", regex=r"[0-9]+") + "\n"
    lm += "  },\n"

    # Address object
    lm += '  "address": {\n'
    lm += '    "street": ' + gen("street", regex=r'"[A-Za-z0-9 ]+"') + ",\n"
    lm += '    "city": ' + gen("city", regex=r'"[A-Za-z ]+"') + ",\n"
    lm += '    "zip": ' + gen("zip", regex=r'"\d{5}"') + "\n"
    lm += "  }\n"

    lm += "}"
    return lm
```

### Array Grammar

```python
@guidance
def json_array(lm, count=3):
    """Generate JSON array with fixed count."""
    lm += "[\n"

    for i in range(count):
        lm += "  {\n"
        lm += '    "id": ' + gen(f"id_{i}", regex=r"[0-9]+") + ",\n"
        lm += '    "name": ' + gen(f"name_{i}", regex=r'"[A-Za-z ]+"') + "\n"
        lm += "  }"
        if i < count - 1:
            lm += ","
        lm += "\n"

    lm += "]"
    return lm
```

### XML Grammar

```python
@guidance
def xml_document(lm):
    """Generate valid XML document."""
    lm += '<?xml version="1.0"?>\n'
    lm += "<person>\n"

    # Name element
    lm += "  <name>" + gen("name", regex=r"[A-Za-z ]+") + "</name>\n"

    # Age element
    lm += "  <age>" + gen("age", regex=r"[0-9]+") + "</age>\n"

    # Email element
    lm += "  <email>" + gen(
        "email",
        regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    ) + "</email>\n"

    lm += "</person>"
    return lm
```

### CSV Grammar

```python
@guidance
def csv_row(lm):
    """Generate CSV row."""
    lm += gen("name", regex=r"[A-Za-z ]+") + ","
    lm += gen("age", regex=r"[0-9]+") + ","
    lm += gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
    return lm

@guidance
def csv_document(lm, rows=5):
    """Generate complete CSV."""
    # Header
    lm += "Name,Age,Email\n"

    # Rows
    for i in range(rows):
        lm = csv_row(lm)
        if i < rows - 1:
            lm += "\n"

    return lm
```

## Token Healing

### How Token Healing Works

**Problem:** Tokenization creates unnatural boundaries.

```python
# Example without token healing
prompt = "The capital of France is "
# Tokenization: ["The", " capital", " of", " France", " is", " "]
# Model sees last token: " "
# First generated token might include leading space: " Paris"
# Result: "The capital of France is  Paris" (double space)
```

**Solution:** Guidance backs up and regenerates the last token.

```python
from guidance import models, gen

lm = models.Anthropic("claude-sonnet-4-5-20250929")

# Token healing is enabled by default
lm += "The capital of France is " + gen("capital", max_tokens=5)

# Process:
# 1. Back up to the token before " is "
# 2. Regenerate " is" + "capital" together
# 3. Result: "The capital of France is Paris" (correct)
```

### Token Healing Examples

#### Natural Continuations

```python
# Before token healing
lm += "The function name is get" + gen("rest")
# Might generate: "The function name is get User" (space before User)

# With token healing
lm += "The function name is get" + gen("rest")
# Generates: "The function name is getUser" (correct camelCase)
```

#### Code Generation

```python
# Function name completion
lm += "def calculate_" + gen("rest", stop="(")
# Token healing ensures a smooth connection: "calculate_total"

# Variable name completion
lm += "my_" + gen("var_name", regex=r"[a-z_]+")
# Token healing ensures: "my_variable_name" (not "my_ variable_name")
```

#### Domain-Specific Terms

```python
# Medical terms
lm += "The patient has hyper" + gen("condition")
# Token healing helps: "hypertension" (not "hyper tension")

# Technical terms
lm += "Using micro" + gen("tech")
# Token healing helps: "microservices" (not "micro services")
```

### Disabling Token Healing

```python
# Disable token healing if needed (rare)
lm += gen("text", token_healing=False)
```

## Selection Constraints

### Basic Selection

```python
from guidance import models, select

lm = models.Anthropic("claude-sonnet-4-5-20250929")

# Simple selection
lm += "Status: " + select(["active", "inactive", "pending"], name="status")

# Boolean selection
lm += "Approved: " + select(["Yes", "No"], name="approved")

# Multiple choice
lm += "Answer: " + select(
    ["A) Paris", "B) London", "C) Berlin", "D) Madrid"],
    name="answer"
)
```

### Conditional Selection

```python
from guidance import models, select, gen, guidance

@guidance
def conditional_fields(lm):
    """Generate fields conditionally based on type."""
    lm += "Type: " + select(["person", "company"], name="type")

    if lm["type"] == "person":
        lm += "\nName: " + gen("name", regex=r"[A-Za-z ]+")
        lm += "\nAge: " + gen("age", regex=r"[0-9]+")
    else:
        lm += "\nCompany Name: " + gen("company", regex=r"[A-Za-z ]+")
        lm += "\nEmployees: " + gen("employees", regex=r"[0-9]+")

    return lm
```

### Repeated Selection

```python
@guidance
def multiple_selections(lm):
    """Select multiple items."""
    lm += "Select 3 colors:\n"

    colors = ["red", "blue", "green", "yellow", "purple"]

    for i in range(3):
        lm += f"{i+1}. " + select(colors, name=f"color_{i}") + "\n"

    return lm
```

## Complex Patterns

### Pattern 1: Structured Forms

```python
@guidance
def user_form(lm):
    """Generate structured user form."""
    lm += "=== User Registration ===\n\n"

    # Name (alphabetic only)
    lm += "Full Name: " + gen("name", regex=r"[A-Za-z ]+", stop="\n") + "\n"

    # Age (numeric)
    lm += "Age: " + gen("age", regex=r"[0-9]+", max_tokens=3) + "\n"

    # Email (validated format)
    lm += "Email: " + gen(
        "email",
        regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        stop="\n"
    ) + "\n"

    # Phone (US format)
    lm += "Phone: " + gen("phone", regex=r"\d{3}-\d{3}-\d{4}") + "\n"

    # Account type (selection)
    lm += "Account Type: " + select(
        ["Standard", "Premium", "Enterprise"],
        name="account_type"
    ) + "\n"

    # Active status (boolean)
    lm += "Active: " + select(["Yes", "No"], name="active") + "\n"

    return lm
```

### Pattern 2: Multi-Entity Extraction

```python
@guidance
def extract_entities(lm, text):
    """Extract multiple entities with constraints."""
    lm += f"Text: {text}\n\n"

    # Person name (alphabetic)
    lm += "Person: " + gen("person", regex=r"[A-Za-z ]+", stop="\n") + "\n"

    # Organization (alphanumeric with spaces)
    lm += "Organization: " + gen(
        "organization",
        regex=r"[A-Za-z0-9 ]+",
        stop="\n"
    ) + "\n"

    # Date (YYYY-MM-DD format)
    lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}") + "\n"

    # Location (alphabetic with spaces)
    lm += "Location: " + gen("location", regex=r"[A-Za-z ]+", stop="\n") + "\n"

    # Amount (currency)
    lm += "Amount: $" + gen("amount", regex=r"[0-9,]+\.[0-9]{2}") + "\n"

    return lm
```

### Pattern 3: Code Generation

```python
@guidance
def generate_python_function(lm):
    """Generate Python function with constraints."""
    # Function name (valid Python identifier)
    lm += "def " + gen("func_name", regex=r"[a-z_][a-z0-9_]*") + "("

    # Parameter name
    lm += gen("param", regex=r"[a-z_][a-z0-9_]*") + "):\n"

    # Docstring
    lm += '    """' + gen("docstring", stop='"""', max_tokens=50) + '"""\n'

    # Function body (constrained to valid Python)
    lm += "    return " + gen("return_value", stop="\n") + "\n"

    return lm
```

### Pattern 4: Hierarchical Data

```python
@guidance
def org_chart(lm):
    """Generate organizational chart."""
    lm += "Company: " + gen("company", regex=r"[A-Za-z ]+") + "\n\n"

    # CEO
    lm += "CEO: " + gen("ceo", regex=r"[A-Za-z ]+") + "\n"

    # Departments
    for dept in ["Engineering", "Sales", "Marketing"]:
        lm += f"\n{dept} Department:\n"
        lm += "  Head: " + gen(f"{dept.lower()}_head", regex=r"[A-Za-z ]+") + "\n"
        lm += "  Size: " + gen(f"{dept.lower()}_size", regex=r"[0-9]+") + " employees\n"

    return lm
```

## Performance Optimization

### Best Practices

#### 1. Use Specific Patterns

```python
# ✅ Good: Specific pattern
lm += gen("age", regex=r"[0-9]{1,3}")  # Fast

# ❌ Bad: Overly broad pattern
lm += gen("age", regex=r"[0-9]+")  # Slower
```

#### 2. Limit Max Tokens

```python
# ✅ Good: Reasonable limit
lm += gen("name", max_tokens=30)

# ❌ Bad: No limit
lm += gen("name")  # May generate forever
```

#### 3. Use stop Sequences

```python
# ✅ Good: Stop at newline
lm += gen("line", stop="\n")

# ❌ Bad: Rely on max_tokens
lm += gen("line", max_tokens=100)
```

#### 4. Cache Compiled Grammars

```python
# Grammars are cached automatically after first use;
# no manual caching is needed
@guidance
def reusable_pattern(lm):
    """This grammar is compiled once and cached."""
    lm += gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
    return lm

# First call: compiles the grammar
lm = reusable_pattern(lm)

# Subsequent calls: use the cached grammar (fast)
lm = reusable_pattern(lm)
```

#### 5. Avoid Overlapping Constraints

```python
# ✅ Good: Clear constraints
lm += gen("age", regex=r"[0-9]+", max_tokens=3)

# ❌ Bad: Conflicting constraints
lm += gen("age", regex=r"[0-9]{2}", max_tokens=10)  # max_tokens unnecessary
```

### Performance Benchmarks

**Regex vs Free Generation:**
- Simple regex (digits): ~1.2x slower than free gen
- Complex regex (email): ~1.5x slower than free gen
- Grammar-based: ~2x slower than free gen

**But:**
- 100% valid outputs (vs ~70% with free gen + validation)
- No retry loops needed
- Overall faster end-to-end for structured outputs
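
The end-to-end claim is simple expected-value arithmetic: with a ~70% validity rate, a free-generate-then-validate loop averages 1/0.7 ≈ 1.43 generations per valid output, so constrained generation comes out ahead whenever its per-call slowdown stays below that factor. A minimal sketch:

```python
def expected_generations(p_valid: float) -> float:
    """Expected attempts until one valid output (geometric distribution)."""
    return 1.0 / p_valid

free = expected_generations(0.70)  # ~1.43 generations per valid output
print(f"break-even slowdown: {free:.2f}x")  # regex at 1.2-1.5x sits at or below this
```

Validation and retry plumbing add further cost to the free-generation path, which pushes the break-even point higher in practice.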

**Optimization Tips:**
- Use regex for critical fields only
- Use `select()` for small fixed sets (fastest)
- Use `stop` sequences when possible (faster than max_tokens)
- Cache compiled grammars by reusing functions

## Resources

- **Token Healing Paper**: https://arxiv.org/abs/2306.17648
- **Guidance Docs**: https://guidance.readthedocs.io
- **GitHub**: https://github.com/guidance-ai/guidance
767
skills/mlops/inference/guidance/references/examples.md
Normal file
@@ -0,0 +1,767 @@
# Production-Ready Examples

Real-world examples of using Guidance for structured generation, agents, and workflows.

## Table of Contents
- JSON Generation
- Data Extraction
- Classification Systems
- Agent Systems
- Multi-Step Workflows
- Code Generation
- Production Tips

## JSON Generation

### Basic JSON

```python
from guidance import models, gen, guidance

@guidance
def generate_user(lm):
    """Generate valid user JSON."""
    lm += "{\n"
    lm += '  "name": ' + gen("name", regex=r'"[A-Za-z ]+"') + ",\n"
    lm += '  "age": ' + gen("age", regex=r"[0-9]+") + ",\n"
    lm += '  "email": ' + gen(
        "email",
        regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
    ) + "\n"
    lm += "}"
    return lm

# Use it
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm += "Generate a user profile:\n"
lm = generate_user(lm)

print(lm)
# Output: Valid JSON guaranteed
```

### Nested JSON

```python
@guidance
def generate_order(lm):
    """Generate nested order JSON."""
    lm += "{\n"

    # Customer info
    lm += '  "customer": {\n'
    lm += '    "name": ' + gen("customer_name", regex=r'"[A-Za-z ]+"') + ",\n"
    lm += '    "email": ' + gen(
        "customer_email",
        regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
    ) + "\n"
    lm += "  },\n"

    # Order details
    lm += '  "order": {\n'
    lm += '    "id": ' + gen("order_id", regex=r'"ORD-[0-9]{6}"') + ",\n"
    lm += '    "date": ' + gen("order_date", regex=r'"\d{4}-\d{2}-\d{2}"') + ",\n"
    lm += '    "total": ' + gen("order_total", regex=r"[0-9]+\.[0-9]{2}") + "\n"
    lm += "  },\n"

    # Status
    lm += '  "status": ' + gen(
        "status",
        regex=r'"(pending|processing|shipped|delivered)"'
    ) + "\n"

    lm += "}"
    return lm

lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = generate_order(lm)
```

### JSON Array

```python
@guidance
def generate_user_list(lm, count=3):
    """Generate JSON array of users."""
    lm += "[\n"

    for i in range(count):
        lm += "  {\n"
        lm += '    "id": ' + gen(f"id_{i}", regex=r"[0-9]+") + ",\n"
        lm += '    "name": ' + gen(f"name_{i}", regex=r'"[A-Za-z ]+"') + ",\n"
        lm += '    "active": ' + gen(f"active_{i}", regex=r"(true|false)") + "\n"
        lm += "  }"
        if i < count - 1:
            lm += ","
        lm += "\n"

    lm += "]"
    return lm

lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = generate_user_list(lm, count=5)
```

### Dynamic JSON Schema

```python
import json
from guidance import models, gen, guidance

@guidance
def json_from_schema(lm, schema):
    """Generate JSON matching a schema."""
    lm += "{\n"

    fields = list(schema["properties"].items())
    for i, (field_name, field_schema) in enumerate(fields):
        lm += f'  "{field_name}": '

        # Handle different types
        if field_schema["type"] == "string":
            if "pattern" in field_schema:
                lm += gen(field_name, regex='"' + field_schema["pattern"] + '"')
            else:
                lm += gen(field_name, regex=r'"[^"]+"')
        elif field_schema["type"] == "number":
            lm += gen(field_name, regex=r"[0-9]+(\.[0-9]+)?")
        elif field_schema["type"] == "integer":
            lm += gen(field_name, regex=r"[0-9]+")
        elif field_schema["type"] == "boolean":
            lm += gen(field_name, regex=r"(true|false)")

        if i < len(fields) - 1:
            lm += ","
        lm += "\n"

    lm += "}"
    return lm

# Define schema
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "score": {"type": "number"},
        "active": {"type": "boolean"}
    }
}

lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = json_from_schema(lm, schema)
```
|
||||
|
||||
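Because the same regexes that constrain generation also describe valid output, they can double as offline validators. A minimal sketch (the pattern table mirrors the type branches above; `matches_type` is an illustrative helper, not part of Guidance):

```python
import re

# Same patterns used per type in json_from_schema above
PATTERNS = {
    "string": r'"[^"]+"',
    "number": r"[0-9]+(\.[0-9]+)?",
    "integer": r"[0-9]+",
    "boolean": r"(true|false)",
}

def matches_type(value: str, type_name: str) -> bool:
    """Return True if the raw generated text matches the type's pattern."""
    return re.fullmatch(PATTERNS[type_name], value) is not None

print(matches_type('"Alice"', "string"))   # True
print(matches_type("3.14", "number"))      # True
print(matches_type("maybe", "boolean"))    # False
```

This is useful as a cheap post-generation check when mixing constrained and unconstrained fields.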
## Data Extraction

### Extract from Text

```python
from guidance import models, gen, guidance, system, user, assistant


@guidance
def extract_person_info(lm, text):
    """Extract structured info from text."""
    lm += f"Text: {text}\n\n"

    with assistant():
        lm += "Name: " + gen("name", regex=r"[A-Za-z ]+", stop="\n") + "\n"
        lm += "Age: " + gen("age", regex=r"[0-9]+", max_tokens=3) + "\n"
        lm += "Occupation: " + gen("occupation", regex=r"[A-Za-z ]+", stop="\n") + "\n"
        lm += "Email: " + gen(
            "email",
            regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
            stop="\n"
        ) + "\n"

    return lm


text = "John Smith is a 35-year-old software engineer. Contact: john@example.com"

lm = models.Anthropic("claude-sonnet-4-5-20250929")

with system():
    lm += "You extract structured information from text."

with user():
    lm = extract_person_info(lm, text)

print(f"Name: {lm['name']}")
print(f"Age: {lm['age']}")
print(f"Occupation: {lm['occupation']}")
print(f"Email: {lm['email']}")
```

### Multi-Entity Extraction

```python
@guidance
def extract_entities(lm, text):
    """Extract multiple entity types."""
    lm += f"Analyze: {text}\n\n"

    # Person entities
    lm += "People:\n"
    for i in range(3):  # Up to 3 people
        lm += "- " + gen(f"person_{i}", regex=r"[A-Za-z ]+", stop="\n") + "\n"

    # Organization entities
    lm += "\nOrganizations:\n"
    for i in range(2):  # Up to 2 orgs
        lm += "- " + gen(f"org_{i}", regex=r"[A-Za-z0-9 ]+", stop="\n") + "\n"

    # Dates
    lm += "\nDates:\n"
    for i in range(2):  # Up to 2 dates
        lm += "- " + gen(f"date_{i}", regex=r"\d{4}-\d{2}-\d{2}", stop="\n") + "\n"

    # Locations
    lm += "\nLocations:\n"
    for i in range(2):  # Up to 2 locations
        lm += "- " + gen(f"location_{i}", regex=r"[A-Za-z ]+", stop="\n") + "\n"

    return lm


text = """
Tim Cook and Satya Nadella met at Microsoft headquarters in Redmond on 2024-09-15
to discuss the collaboration between Apple and Microsoft. The meeting continued
in Cupertino on 2024-09-20.
"""

lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = extract_entities(lm, text)
```

### Batch Extraction

```python
@guidance
def batch_extract(lm, texts):
    """Extract from multiple texts."""
    lm += "Batch Extraction Results:\n\n"

    for i, text in enumerate(texts):
        lm += f"=== Item {i+1} ===\n"
        lm += f"Text: {text}\n"
        lm += "Name: " + gen(f"name_{i}", regex=r"[A-Za-z ]+", stop="\n") + "\n"
        lm += "Sentiment: " + gen(
            f"sentiment_{i}",
            regex=r"(positive|negative|neutral)",
            stop="\n"
        ) + "\n\n"

    return lm


texts = [
    "Alice is happy with the product",
    "Bob is disappointed with the service",
    "Carol has no strong feelings either way"
]

lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = batch_extract(lm, texts)
```

## Classification Systems

### Sentiment Analysis

```python
from guidance import models, select, gen

lm = models.Anthropic("claude-sonnet-4-5-20250929")

text = "This product is absolutely amazing! Best purchase ever."

lm += f"Text: {text}\n\n"
lm += "Sentiment: " + select(
    ["positive", "negative", "neutral"],
    name="sentiment"
)
lm += "\nConfidence: " + gen("confidence", regex=r"[0-9]{1,3}") + "%\n"
lm += "Reasoning: " + gen("reasoning", stop="\n", max_tokens=50)

print(f"Sentiment: {lm['sentiment']}")
print(f"Confidence: {lm['confidence']}%")
print(f"Reasoning: {lm['reasoning']}")
```

### Multi-Label Classification

```python
@guidance
def classify_article(lm, text):
    """Classify article with multiple labels."""
    lm += f"Article: {text}\n\n"

    # Primary category
    lm += "Primary Category: " + select(
        ["Technology", "Business", "Science", "Politics", "Entertainment"],
        name="primary_category"
    ) + "\n"

    # Secondary categories (up to 3)
    lm += "\nSecondary Categories:\n"
    categories = ["Technology", "Business", "Science", "Politics", "Entertainment"]
    for i in range(3):
        lm += f"{i+1}. " + select(categories, name=f"secondary_{i}") + "\n"

    # Tags
    lm += "\nTags: " + gen("tags", stop="\n", max_tokens=50) + "\n"

    # Target audience
    lm += "Target Audience: " + select(
        ["General", "Expert", "Beginner"],
        name="audience"
    )

    return lm


article = """
Apple announced new AI features in iOS 18, leveraging machine learning to improve
battery life and performance. The company's stock rose 5% following the announcement.
"""

lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = classify_article(lm, article)
```

### Intent Classification

```python
@guidance
def classify_intent(lm, message):
    """Classify user intent."""
    lm += f"User Message: {message}\n\n"

    # Intent
    lm += "Intent: " + select(
        ["question", "complaint", "request", "feedback", "other"],
        name="intent"
    ) + "\n"

    # Urgency
    lm += "Urgency: " + select(
        ["low", "medium", "high", "critical"],
        name="urgency"
    ) + "\n"

    # Department
    lm += "Route To: " + select(
        ["support", "sales", "billing", "technical"],
        name="department"
    ) + "\n"

    # Sentiment
    lm += "Sentiment: " + select(
        ["positive", "neutral", "negative"],
        name="sentiment"
    )

    return lm


message = "My account was charged twice for the same order. Need help ASAP!"

lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = classify_intent(lm, message)

print(f"Intent: {lm['intent']}")
print(f"Urgency: {lm['urgency']}")
print(f"Department: {lm['department']}")
```

## Agent Systems

### ReAct Agent

```python
from guidance import models, gen, select, guidance


@guidance(stateless=False)
def react_agent(lm, question, tools, max_rounds=5):
    """ReAct agent with tool use."""
    lm += f"Question: {question}\n\n"

    for round in range(max_rounds):
        # Thought
        lm += f"Thought {round+1}: " + gen("thought", stop="\n", max_tokens=100) + "\n"

        # Action selection
        lm += "Action: " + select(
            list(tools.keys()) + ["answer"],
            name="action"
        )

        if lm["action"] == "answer":
            lm += "\n\nFinal Answer: " + gen("answer", max_tokens=200)
            break

        # Action input
        lm += "\nAction Input: " + gen("action_input", stop="\n", max_tokens=100) + "\n"

        # Execute tool
        if lm["action"] in tools:
            try:
                result = tools[lm["action"]](lm["action_input"])
                lm += f"Observation: {result}\n\n"
            except Exception as e:
                lm += f"Observation: Error - {str(e)}\n\n"

    return lm


# Define tools (eval is for demo only -- never eval untrusted input)
tools = {
    "calculator": lambda expr: eval(expr),
    "search": lambda query: f"Search results for '{query}': [Mock results]",
    "weather": lambda city: f"Weather in {city}: Sunny, 72°F"
}

# Use agent
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = react_agent(lm, "What is (25 * 4) + 10?", tools)

print(lm["answer"])
```

### Multi-Agent System

```python
@guidance
def coordinator_agent(lm, task):
    """Coordinator that delegates to specialists."""
    lm += f"Task: {task}\n\n"

    # Determine which specialist to use
    lm += "Specialist: " + select(
        ["researcher", "writer", "coder", "analyst"],
        name="specialist"
    ) + "\n"

    lm += "Reasoning: " + gen("reasoning", stop="\n", max_tokens=100) + "\n"

    return lm


@guidance
def researcher_agent(lm, query):
    """Research specialist."""
    lm += f"Research Query: {query}\n\n"
    lm += "Findings:\n"
    for i in range(3):
        lm += f"{i+1}. " + gen(f"finding_{i}", stop="\n", max_tokens=100) + "\n"
    return lm


@guidance
def writer_agent(lm, topic):
    """Writing specialist."""
    lm += f"Topic: {topic}\n\n"
    lm += "Title: " + gen("title", stop="\n", max_tokens=50) + "\n"
    lm += "Content:\n" + gen("content", max_tokens=500)
    return lm


# Coordination workflow
task = "Write an article about AI safety"

lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = coordinator_agent(lm, task)

specialist = lm["specialist"]
if specialist == "researcher":
    lm = researcher_agent(lm, task)
elif specialist == "writer":
    lm = writer_agent(lm, task)
```

### Tool Use with Validation

```python
@guidance(stateless=False)
def validated_tool_agent(lm, question):
    """Agent with validated tool calls."""
    tools = {
        "add": lambda a, b: float(a) + float(b),
        "multiply": lambda a, b: float(a) * float(b),
        "divide": lambda a, b: float(a) / float(b) if float(b) != 0 else "Error: Division by zero"
    }

    lm += f"Question: {question}\n\n"

    for i in range(5):
        # Select tool
        lm += "Tool: " + select(list(tools.keys()) + ["done"], name="tool")

        if lm["tool"] == "done":
            lm += "\nAnswer: " + gen("answer", max_tokens=100)
            break

        # Get validated numeric arguments
        lm += "\nArg1: " + gen("arg1", regex=r"-?[0-9]+(\.[0-9]+)?") + "\n"
        lm += "Arg2: " + gen("arg2", regex=r"-?[0-9]+(\.[0-9]+)?") + "\n"

        # Execute
        result = tools[lm["tool"]](lm["arg1"], lm["arg2"])
        lm += f"Result: {result}\n\n"

    return lm


lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = validated_tool_agent(lm, "What is (10 + 5) * 3?")
```

## Multi-Step Workflows

### Chain of Thought

```python
@guidance
def chain_of_thought(lm, question):
    """Multi-step reasoning with CoT."""
    lm += f"Question: {question}\n\n"

    # Generate reasoning steps
    lm += "Let me think step by step:\n\n"
    for i in range(4):
        lm += f"Step {i+1}: " + gen(f"step_{i+1}", stop="\n", max_tokens=100) + "\n"

    # Final answer
    lm += "\nTherefore, the answer is: " + gen("answer", stop="\n", max_tokens=50)

    return lm


lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = chain_of_thought(lm, "If a train travels 60 mph for 2.5 hours, how far does it go?")

print(lm["answer"])
```

### Self-Consistency

```python
from collections import Counter


@guidance
def self_consistency(lm, question, num_samples=3):
    """Generate multiple reasoning paths and aggregate."""
    lm += f"Question: {question}\n\n"

    answers = []
    for i in range(num_samples):
        lm += f"=== Attempt {i+1} ===\n"
        lm += "Reasoning: " + gen(f"reasoning_{i}", stop="\n", max_tokens=100) + "\n"
        lm += "Answer: " + gen(f"answer_{i}", stop="\n", max_tokens=50) + "\n\n"
        answers.append(lm[f"answer_{i}"])

    # Aggregate (simple majority vote)
    most_common = Counter(answers).most_common(1)[0][0]

    lm += f"Final Answer (by majority): {most_common}\n"
    return lm


lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = self_consistency(lm, "What is 15% of 200?")
```

### Planning and Execution

```python
@guidance
def plan_and_execute(lm, goal):
    """Plan tasks then execute them."""
    lm += f"Goal: {goal}\n\n"

    # Planning phase
    lm += "Plan:\n"
    num_steps = 4
    for i in range(num_steps):
        lm += f"{i+1}. " + gen(f"plan_step_{i}", stop="\n", max_tokens=100) + "\n"

    # Execution phase
    lm += "\nExecution:\n\n"
    for i in range(num_steps):
        lm += f"Step {i+1}: {lm[f'plan_step_{i}']}\n"
        lm += "Status: " + select(["completed", "in-progress", "blocked"], name=f"status_{i}") + "\n"
        lm += "Result: " + gen(f"result_{i}", stop="\n", max_tokens=150) + "\n\n"

    # Summary
    lm += "Summary: " + gen("summary", max_tokens=200)

    return lm


lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = plan_and_execute(lm, "Build a REST API for a blog platform")
```

## Code Generation

### Python Function

```python
@guidance
def generate_python_function(lm, description):
    """Generate Python function from description."""
    lm += f"Description: {description}\n\n"

    # Function signature
    lm += "def " + gen("func_name", regex=r"[a-z_][a-z0-9_]*") + "("
    lm += gen("params", regex=r"[a-z_][a-z0-9_]*(, [a-z_][a-z0-9_]*)*") + "):\n"

    # Docstring
    lm += '    """' + gen("docstring", stop='"""', max_tokens=100) + '"""\n'

    # Function body
    lm += "    " + gen("body", stop="\n", max_tokens=200) + "\n"

    return lm


lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = generate_python_function(lm, "Check if a number is prime")

print(lm)
```

### SQL Query

```python
@guidance
def generate_sql(lm, description):
    """Generate SQL query from description."""
    lm += f"Description: {description}\n\n"
    lm += "SQL Query:\n"

    # SELECT clause
    lm += "SELECT " + gen("select_clause", stop=" FROM", max_tokens=100)

    # FROM clause
    lm += " FROM " + gen("from_clause", stop=" WHERE", max_tokens=50)

    # WHERE clause (optional)
    lm += " WHERE " + gen("where_clause", stop=";", max_tokens=100) + ";"

    return lm


lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = generate_sql(lm, "Get all users who signed up in the last 30 days")
```

### API Endpoint

```python
@guidance
def generate_api_endpoint(lm, description):
    """Generate REST API endpoint."""
    lm += f"Description: {description}\n\n"

    # HTTP method
    lm += "Method: " + select(["GET", "POST", "PUT", "DELETE"], name="method") + "\n"

    # Path
    lm += "Path: /" + gen("path", regex=r"[a-z0-9/-]+", stop="\n") + "\n"

    # Request body (if POST/PUT)
    if lm["method"] in ["POST", "PUT"]:
        lm += "\nRequest Body:\n"
        lm += "{\n"
        lm += '  "field1": ' + gen("field1", regex=r'"[a-z_]+"') + ",\n"
        lm += '  "field2": ' + gen("field2", regex=r'"[a-z_]+"') + "\n"
        lm += "}\n"

    # Response
    lm += "\nResponse (200 OK):\n"
    lm += "{\n"
    lm += '  "status": "success",\n'
    lm += '  "data": ' + gen("response_data", max_tokens=100) + "\n"
    lm += "}\n"

    return lm


lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = generate_api_endpoint(lm, "Create a new blog post")
```

## Production Tips

### Error Handling

```python
@guidance
def safe_extraction(lm, text):
    """Extract with fallback handling."""
    try:
        lm += f"Text: {text}\n"
        lm += "Name: " + gen("name", regex=r"[A-Za-z ]+", stop="\n", max_tokens=30)
        return lm
    except Exception:
        # Fallback to less strict extraction
        lm += f"Text: {text}\n"
        lm += "Name: " + gen("name", stop="\n", max_tokens=30)
        return lm
```

### Caching

```python
from functools import lru_cache


@lru_cache(maxsize=100)
def cached_generation(text):
    """Cache LLM generations."""
    lm = models.Anthropic("claude-sonnet-4-5-20250929")
    lm += f"Analyze: {text}\n"
    lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="sentiment")
    return lm["sentiment"]


# First call: hits LLM
result1 = cached_generation("This is great!")

# Second call: returns cached result
result2 = cached_generation("This is great!")  # Instant!
```

### Monitoring

```python
import time


@guidance
def monitored_generation(lm, text):
    """Track generation metrics."""
    start_time = time.time()

    lm += f"Text: {text}\n"
    lm += "Analysis: " + gen("analysis", max_tokens=100)

    elapsed = time.time() - start_time

    # Log metrics
    print(f"Generation time: {elapsed:.2f}s")
    print(f"Output length: {len(lm['analysis'])} chars")

    return lm
```

### Batch Processing

```python
def batch_process(texts, batch_size=10):
    """Process texts in batches."""
    lm = models.Anthropic("claude-sonnet-4-5-20250929")
    results = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]

        # Use a unique capture name per item (i + j), not per batch,
        # so later items don't overwrite earlier ones
        for j, text in enumerate(batch):
            lm += f"Text: {text}\n"
            lm += "Sentiment: " + select(
                ["positive", "negative", "neutral"],
                name=f"sentiment_{i + j}"
            ) + "\n\n"

        results.extend([lm[f"sentiment_{i + j}"] for j in range(len(batch))])

    return results
```

## Resources

- **Guidance Notebooks**: https://github.com/guidance-ai/guidance/tree/main/notebooks
- **Guidance Docs**: https://guidance.readthedocs.io
- **Community Examples**: https://github.com/guidance-ai/guidance/discussions

261
skills/mlops/inference/llama-cpp/SKILL.md
Normal file
@@ -0,0 +1,261 @@
---
name: llama-cpp
description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [llama-cpp-python]
metadata:
  hermes:
    tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded]

---

# llama.cpp

Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.

## When to use llama.cpp

**Use llama.cpp when:**
- Running on CPU-only machines
- Deploying on Apple Silicon (M1/M2/M3/M4)
- Using AMD or Intel GPUs (no CUDA)
- Edge deployment (Raspberry Pi, embedded systems)
- Need simple deployment without Docker/Python

**Use TensorRT-LLM instead when:**
- Have NVIDIA GPUs (A100/H100)
- Need maximum throughput (100K+ tok/s)
- Running in datacenter with CUDA

**Use vLLM instead when:**
- Have NVIDIA GPUs
- Need Python-first API
- Want PagedAttention

## Quick start

### Installation

```bash
# macOS/Linux
brew install llama.cpp

# Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# With Metal (Apple Silicon)
make LLAMA_METAL=1

# With CUDA (NVIDIA)
make LLAMA_CUDA=1

# With ROCm (AMD)
make LLAMA_HIP=1
```

### Download model

```bash
# Download from HuggingFace (GGUF format)
huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/

# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
```

### Run inference

```bash
# Simple chat (-n limits max tokens)
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -p "Explain quantum computing" \
  -n 256

# Interactive chat
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --interactive
```

### Server mode

```bash
# Start OpenAI-compatible server (-ngl 32 offloads 32 layers to GPU)
./llama-server \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 32

# Client request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

## Quantization formats

### GGUF format overview

| Format | Bits | Size (7B) | Speed | Quality | Use Case |
|--------|------|-----------|-------|---------|----------|
| **Q4_K_M** | 4.5 | 4.1 GB | Fast | Good | **Recommended default** |
| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |

### Choosing quantization

```bash
# General use (balanced)
Q4_K_M  # 4-bit, medium quality

# Maximum speed (more degradation)
Q2_K or Q3_K_M

# Maximum quality (slower)
Q6_K or Q8_0

# Very large models (70B, 405B)
Q3_K_M or Q4_K_S  # Lower bits to fit in memory
```

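The rules of thumb above can be captured in a small helper. This is a sketch only -- the priority names and the 70B threshold are illustrative assumptions, not part of llama.cpp:

```python
def choose_quant(priority: str = "balanced", model_size_b: int = 7) -> str:
    """Rule-of-thumb GGUF quant picker (thresholds are illustrative)."""
    if model_size_b >= 70:
        return "Q3_K_M"   # lower bits so very large models fit in memory
    if priority == "speed":
        return "Q2_K"     # fastest, most quality degradation
    if priority == "quality":
        return "Q6_K"     # near-lossless, slower
    return "Q4_K_M"       # balanced default

print(choose_quant())                 # Q4_K_M
print(choose_quant("quality"))        # Q6_K
print(choose_quant(model_size_b=70))  # Q3_K_M
```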
## Hardware acceleration

### Apple Silicon (Metal)

```bash
# Build with Metal
make LLAMA_METAL=1

# Run with GPU acceleration (-ngl 999 offloads all layers)
./llama-cli -m model.gguf -ngl 999

# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
```

### NVIDIA GPUs (CUDA)

```bash
# Build with CUDA
make LLAMA_CUDA=1

# Offload layers to GPU (35 of 40 layers)
./llama-cli -m model.gguf -ngl 35

# Hybrid CPU+GPU for large models (GPU: 20 layers, CPU: rest)
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20
```

### AMD GPUs (ROCm)

```bash
# Build with ROCm
make LLAMA_HIP=1

# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
```

## Common patterns

### Batch processing

```bash
# Process multiple prompts from file
cat prompts.txt | ./llama-cli \
  -m model.gguf \
  --batch-size 512 \
  -n 100
```

### Constrained generation

```bash
# JSON output with grammar
./llama-cli \
  -m model.gguf \
  -p "Generate a person: " \
  --grammar-file grammars/json.gbnf

# Outputs valid JSON only
```

### Context size

```bash
# Increase context (default 512) to a 4K window
./llama-cli \
  -m model.gguf \
  -c 4096

# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768  # 32K context
```

## Performance benchmarks

### CPU performance (Llama 2-7B Q4_K_M)

| CPU | Threads | Speed | Cost |
|-----|---------|-------|------|
| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |

### GPU acceleration (Llama 2-7B Q4_K_M)

| GPU | Speed | vs CPU | Cost |
|-----|-------|--------|------|
| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |

## Supported models

**LLaMA family**:
- Llama 2 (7B, 13B, 70B)
- Llama 3 (8B, 70B, 405B)
- Code Llama

**Mistral family**:
- Mistral 7B
- Mixtral 8x7B, 8x22B

**Other**:
- Falcon, BLOOM, GPT-J
- Phi-3, Gemma, Qwen
- LLaVA (vision), Whisper (audio)

**Find models**: https://huggingface.co/models?library=gguf

## References

- **[Quantization Guide](references/quantization.md)** - GGUF formats, conversion, quality comparison
- **[Server Deployment](references/server.md)** - API endpoints, Docker, monitoring
- **[Optimization](references/optimization.md)** - Performance tuning, hybrid CPU+GPU

## Resources

- **GitHub**: https://github.com/ggerganov/llama.cpp
- **Models**: https://huggingface.co/models?library=gguf
- **Discord**: https://discord.gg/llama-cpp

89
skills/mlops/inference/llama-cpp/references/optimization.md
Normal file
@@ -0,0 +1,89 @@
# Performance Optimization Guide

Maximize llama.cpp inference speed and efficiency.

## CPU Optimization

### Thread tuning
```bash
# Set threads (default: physical cores)
./llama-cli -m model.gguf -t 8

# For AMD Ryzen 9 7950X (16 cores, 32 threads)
-t 16  # Best: physical cores

# Avoid hyperthreading (slower for matrix ops)
```

### BLAS acceleration
```bash
# OpenBLAS (faster matrix ops)
make LLAMA_OPENBLAS=1

# BLAS gives 2-3× speedup
```

## GPU Offloading

### Layer offloading
```bash
# Offload 35 layers to GPU (hybrid mode)
./llama-cli -m model.gguf -ngl 35

# Offload all layers
./llama-cli -m model.gguf -ngl 999

# Find optimal value:
# Start with -ngl 999
# If OOM, reduce by 5 until it fits
```

### Memory usage
```bash
# Check VRAM usage
nvidia-smi dmon

# Reduce context if needed
./llama-cli -m model.gguf -c 2048  # 2K context instead of 4K
```

## Batch Processing

```bash
# Increase batch size for throughput
./llama-cli -m model.gguf -b 512  # Default: 512

# Physical batch (GPU)
--ubatch-size 128  # Process 128 tokens at once
```

## Context Management

```bash
# Default context (512 tokens)
-c 512

# Longer context (slower, more memory)
-c 4096

# Very long context (if model supports)
-c 32768
```

## Benchmarks

### CPU Performance (Llama 2-7B Q4_K_M)

| Setup | Speed | Notes |
|-------|-------|-------|
| Apple M3 Max | 50 tok/s | Metal acceleration |
| AMD 7950X (16c) | 35 tok/s | OpenBLAS |
| Intel i9-13900K | 30 tok/s | AVX2 |

### GPU Offloading (RTX 4090)

| Layers GPU | Speed | VRAM |
|------------|-------|------|
| 0 (CPU only) | 30 tok/s | 0 GB |
| 20 (hybrid) | 80 tok/s | 8 GB |
| 35 (all) | 120 tok/s | 12 GB |

213
skills/mlops/inference/llama-cpp/references/quantization.md
Normal file
@@ -0,0 +1,213 @@
# GGUF Quantization Guide

Complete guide to GGUF quantization formats and model conversion.

## Quantization Overview

**GGUF** (GPT-Generated Unified Format) - Standard format for llama.cpp models.

### Format Comparison

| Format | Perplexity | Size (7B) | Tokens/sec | Notes |
|--------|------------|-----------|------------|-------|
| FP16 | 5.9565 (baseline) | 13.0 GB | 15 tok/s | Original quality |
| Q8_0 | 5.9584 (+0.03%) | 7.0 GB | 25 tok/s | Nearly lossless |
| **Q6_K** | 5.9642 (+0.13%) | 5.5 GB | 30 tok/s | Best quality/size |
| **Q5_K_M** | 5.9796 (+0.39%) | 4.8 GB | 35 tok/s | Balanced |
| **Q4_K_M** | 6.0565 (+1.68%) | 4.1 GB | 40 tok/s | **Recommended** |
| Q4_K_S | 6.1125 (+2.62%) | 3.9 GB | 42 tok/s | Faster, lower quality |
| Q3_K_M | 6.3184 (+6.07%) | 3.3 GB | 45 tok/s | Small models only |
| Q2_K | 6.8673 (+15.3%) | 2.7 GB | 50 tok/s | Not recommended |

**Recommendation**: Use **Q4_K_M** for best balance of quality and speed.

## Converting Models

### HuggingFace to GGUF

```bash
# 1. Download HuggingFace model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
  --local-dir models/llama-2-7b-chat/

# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py \
  models/llama-2-7b-chat/ \
  --outtype f16 \
  --outfile models/llama-2-7b-chat-f16.gguf

# 3. Quantize to Q4_K_M
./llama-quantize \
  models/llama-2-7b-chat-f16.gguf \
  models/llama-2-7b-chat-Q4_K_M.gguf \
  Q4_K_M
```

### Batch quantization

```bash
# Quantize to multiple formats
for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
  ./llama-quantize \
    model-f16.gguf \
    model-${quant}.gguf \
    $quant
done
```

## K-Quantization Methods

**K-quants** use mixed precision for better quality:
- Attention weights: Higher precision
- Feed-forward weights: Lower precision

**Variants**:
- `_S` (Small): Faster, lower quality
- `_M` (Medium): Balanced (recommended)
- `_L` (Large): Better quality, larger size

**Example**: `Q4_K_M`
- `Q4`: 4-bit quantization
- `K`: Mixed precision method
- `M`: Medium quality

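The naming scheme is regular enough to parse mechanically. An illustrative decomposition (`parse_quant` is a hypothetical helper, not a llama.cpp tool):

```python
def parse_quant(name: str) -> dict:
    """Decompose a quant name such as 'Q4_K_M' or 'Q8_0' into its parts."""
    parts = name.split("_")
    info = {"bits": int(parts[0][1:]), "k_quant": False, "variant": None}
    if len(parts) > 1 and parts[1] == "K":
        info["k_quant"] = True
        if len(parts) > 2:
            info["variant"] = parts[2]  # S, M, or L
    return info

print(parse_quant("Q4_K_M"))  # {'bits': 4, 'k_quant': True, 'variant': 'M'}
print(parse_quant("Q8_0"))    # {'bits': 8, 'k_quant': False, 'variant': None}
```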
## Quality Testing

```bash
# Calculate perplexity (quality metric)
./llama-perplexity \
  -m model.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  -c 512

# Lower perplexity = better quality
# Baseline (FP16): ~5.96
# Q4_K_M: ~6.06 (+1.7%)
# Q2_K: ~6.87 (+15.3% - too much degradation)
```
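The quoted percentages are simple relative changes against the FP16 baseline, which can be reproduced directly:

```python
# Perplexity degradation relative to the FP16 baseline (values from above).
def ppl_degradation(quant_ppl: float, baseline_ppl: float = 5.96) -> float:
    """Percent increase in perplexity versus the baseline."""
    return (quant_ppl - baseline_ppl) / baseline_ppl * 100

print(f"Q4_K_M: +{ppl_degradation(6.06):.1f}%")  # +1.7%
print(f"Q2_K:   +{ppl_degradation(6.87):.1f}%")  # +15.3%
```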
## Use Case Guide

### General purpose (chatbots, assistants)
```
Q4_K_M - Best balance
Q5_K_M - If you have extra RAM
```

### Code generation
```
Q5_K_M or Q6_K - Higher precision helps with code
```

### Creative writing
```
Q4_K_M - Sufficient quality
Q3_K_M - Acceptable for draft generation
```

### Technical/medical
```
Q6_K or Q8_0 - Maximum accuracy
```

### Edge devices (Raspberry Pi)
```
Q2_K or Q3_K_S - Fit in limited RAM
```
## Model Size Scaling

### 7B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 2.7 GB | 5 GB |
| Q3_K_M | 3.3 GB | 6 GB |
| Q4_K_M | 4.1 GB | 7 GB |
| Q5_K_M | 4.8 GB | 8 GB |
| Q6_K | 5.5 GB | 9 GB |
| Q8_0 | 7.0 GB | 11 GB |

### 13B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 5.1 GB | 8 GB |
| Q3_K_M | 6.2 GB | 10 GB |
| Q4_K_M | 7.9 GB | 12 GB |
| Q5_K_M | 9.2 GB | 14 GB |
| Q6_K | 10.7 GB | 16 GB |

### 70B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 26 GB | 32 GB |
| Q3_K_M | 32 GB | 40 GB |
| Q4_K_S | 39 GB | 46 GB |
| Q4_K_M | 41 GB | 48 GB |
| Q5_K_M | 48 GB | 56 GB |

**Recommendation for 70B**: Use Q3_K_M or Q4_K_S to fit in consumer hardware.
## Finding Pre-Quantized Models

**TheBloke** on HuggingFace:
- https://huggingface.co/TheBloke
- Most models available in all GGUF formats
- No conversion needed

**Example**:
```bash
# Download pre-quantized Llama 2-7B
huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/
```
## Importance Matrices (imatrix)

**What**: Calibration data to improve quantization quality.

**Benefits**:
- 10-20% perplexity improvement with Q4
- Essential for Q3 and below

**Usage**:
```bash
# 1. Generate importance matrix
./llama-imatrix \
  -m model-f16.gguf \
  -f calibration-data.txt \
  -o model.imatrix

# 2. Quantize with imatrix
./llama-quantize \
  --imatrix model.imatrix \
  model-f16.gguf \
  model-Q4_K_M.gguf \
  Q4_K_M
```

**Calibration data**:
- Use domain-specific text (e.g., code for code models)
- ~100MB of representative text
- Higher quality data = better quantization
## Troubleshooting

**Model outputs gibberish**:
- Quantization too aggressive (Q2_K)
- Try Q4_K_M or Q5_K_M
- Verify model converted correctly

**Out of memory**:
- Use lower quantization (Q4_K_S instead of Q5_K_M)
- Offload fewer layers to GPU (`-ngl`)
- Use smaller context (`-c 2048`)

**Slow inference**:
- Higher quantization uses more compute
- Q8_0 much slower than Q4_K_M
- Consider speed vs quality trade-off
125
skills/mlops/inference/llama-cpp/references/server.md
Normal file
@@ -0,0 +1,125 @@
# Server Deployment Guide

Production deployment of llama.cpp server with OpenAI-compatible API.

## Server Modes

### llama-server

```bash
# Basic server
./llama-server \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 4096      # Context size

# With GPU acceleration
./llama-server \
  -m models/llama-2-70b.Q4_K_M.gguf \
  -ngl 40      # Offload 40 layers to GPU
```
## OpenAI-Compatible API

### Chat completions
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2",
    "messages": [
      {"role": "system", "content": "You are helpful"},
      {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

### Streaming
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": true
  }'
```
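Because the API is OpenAI-compatible, the same endpoint can be called from Python. A minimal standard-library sketch mirroring the curl examples above (`build_payload` and `chat` are illustrative names; assumes a llama-server running on localhost:8080):

```python
import json
import urllib.request

def build_payload(prompt: str, system: str = "You are helpful") -> dict:
    """Mirror the JSON body of the curl chat-completions example."""
    return {
        "model": "llama-2",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.7,
        "max_tokens": 100,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST a chat completion and return the assistant message text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# chat("Hello")  # requires a running llama-server on port 8080
```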
## Docker Deployment

**Dockerfile**:
```dockerfile
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y git build-essential
RUN git clone https://github.com/ggerganov/llama.cpp
WORKDIR /llama.cpp
RUN make LLAMA_CUDA=1
COPY models/ /models/
EXPOSE 8080
CMD ["./llama-server", "-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"]
```

**Build and run**:
```bash
docker build -t llama-cpp:latest .
docker run --gpus all -p 8080:8080 llama-cpp:latest
```
## Monitoring

```bash
# Server metrics endpoint
curl http://localhost:8080/metrics

# Health check
curl http://localhost:8080/health
```

**Metrics**:
- requests_total
- tokens_generated
- prompt_tokens
- completion_tokens
- kv_cache_tokens
## Load Balancing

**NGINX**:
```nginx
upstream llama_cpp {
    server llama1:8080;
    server llama2:8080;
}

server {
    location / {
        proxy_pass http://llama_cpp;
        proxy_read_timeout 300s;
    }
}
```
## Performance Tuning

**Parallel requests**:
```bash
./llama-server \
  -m model.gguf \
  -np 4            # 4 parallel slots
```

**Continuous batching**:
```bash
./llama-server \
  -m model.gguf \
  --cont-batching  # Enable continuous batching
```

**Context caching**:
```bash
./llama-server \
  -m model.gguf \
  --cache-prompt   # Cache processed prompts
```
330
skills/mlops/inference/obliteratus/SKILL.md
Normal file
@@ -0,0 +1,330 @@
---
name: obliteratus
description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, LEACE, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods, 28 analysis modules, 116 model presets across 5 compute tiers, tournament evaluation, and telemetry-driven recommendations. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
version: 2.0.0
author: Hermes Agent
license: MIT
dependencies: [obliteratus, torch, transformers, bitsandbytes, accelerate, safetensors]
metadata:
  hermes:
    tags: [Abliteration, Uncensoring, Refusal-Removal, LLM, Weight-Projection, SVD, Mechanistic-Interpretability, HuggingFace, Model-Surgery]
    related_skills: [vllm, gguf, huggingface-tokenizers]
---

# OBLITERATUS Skill

Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses mechanistic interpretability techniques — including diff-in-means, SVD, whitened SVD, LEACE concept erasure, SAE decomposition, Bayesian kernel projection, and more — to identify and surgically excise refusal directions from model weights while preserving reasoning capabilities.

**License warning:** OBLITERATUS is AGPL-3.0. NEVER import it as a Python library. Always invoke via CLI (`obliteratus` command) or subprocess. This keeps Hermes Agent's MIT license clean.
## When to Use This Skill

Trigger when the user:
- Wants to "uncensor" or "abliterate" an LLM
- Asks about removing refusal/guardrails from a model
- Wants to create an uncensored version of Llama, Qwen, Mistral, etc.
- Mentions "refusal removal", "abliteration", "weight projection"
- Wants to analyze how a model's refusal mechanism works
- References OBLITERATUS, abliterator, or refusal directions
## Step 1: Installation

Check if already installed:
```bash
obliteratus --version 2>/dev/null && echo "INSTALLED" || echo "NOT INSTALLED"
```

If not installed, clone and install from GitHub:
```bash
git clone https://github.com/elder-plinius/OBLITERATUS.git
cd OBLITERATUS
pip install -e .
# For Gradio web UI support:
# pip install -e ".[spaces]"
```

**IMPORTANT:** Confirm with user before installing. This pulls in ~5-10GB of dependencies (PyTorch, Transformers, bitsandbytes, etc.).
## Step 2: Check Hardware

Before anything, check what GPU is available:
```bash
python3 -c "
import torch
if torch.cuda.is_available():
    gpu = torch.cuda.get_device_name(0)
    vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f'GPU: {gpu}')
    print(f'VRAM: {vram:.1f} GB')
    if vram < 4: print('TIER: tiny (models under 1B)')
    elif vram < 8: print('TIER: small (models 1-4B)')
    elif vram < 16: print('TIER: medium (models 4-9B with 4bit quant)')
    elif vram < 32: print('TIER: large (models 8-32B with 4bit quant)')
    else: print('TIER: frontier (models 32B+)')
else:
    print('NO GPU - only tiny models (under 1B) on CPU')
"
```

### VRAM Requirements (with 4-bit quantization)

| VRAM | Max Model Size | Example Models |
|:---------|:----------------|:--------------------------------------------|
| CPU only | ~1B params | GPT-2, TinyLlama, SmolLM |
| 4-8 GB | ~4B params | Qwen2.5-1.5B, Phi-3.5 mini, Llama 3.2 3B |
| 8-16 GB | ~9B params | Llama 3.1 8B, Mistral 7B, Gemma 2 9B |
| 24 GB | ~32B params | Qwen3-32B, Llama 3.1 70B (tight), Command-R |
| 48 GB+ | ~72B+ params | Qwen2.5-72B, DeepSeek-R1 |
| Multi-GPU| 200B+ params | Llama 3.1 405B, DeepSeek-V3 (685B MoE) |
## Step 3: Browse Available Models & Get Recommendations

```bash
# Browse models by compute tier
obliteratus models --tier medium

# Get architecture info for a specific model
obliteratus info <model_name>

# Get telemetry-driven recommendation for best method & params
obliteratus recommend <model_name>
obliteratus recommend <model_name> --insights  # global cross-architecture rankings
```
## Step 4: Choose a Method

### Method Selection Guide

**Default / recommended for most cases: `advanced`.** It uses multi-direction SVD with norm-preserving projection and is well-tested.

| Situation | Recommended Method | Why |
|:----------------------------------|:-------------------|:-----------------------------------------|
| Default / most models | `advanced` | Multi-direction SVD, norm-preserving, reliable |
| Quick test / prototyping | `basic` | Fast, simple, good enough to evaluate |
| Dense model (Llama, Mistral) | `advanced` | Multi-direction, norm-preserving |
| MoE model (DeepSeek, Mixtral) | `nuclear` | Expert-granular, handles MoE complexity |
| Reasoning model (R1 distills) | `surgical` | CoT-aware, preserves chain-of-thought |
| Stubborn refusals persist | `aggressive` | Whitened SVD + head surgery + jailbreak |
| Want reversible changes | steering vectors | Inference-time, no weight edits (see Analysis section) |
| Maximum quality, time no object | `optimized` | Bayesian search for best parameters |
| Experimental auto-detection | `informed` | Auto-detects alignment type — experimental, may not always outperform advanced |

### 9 CLI Methods
- **basic** — Single refusal direction via diff-in-means. Fast (~5-10 min for 8B).
- **advanced** (DEFAULT, RECOMMENDED) — Multiple SVD directions, norm-preserving projection, 2 refinement passes. Medium speed (~10-20 min).
- **aggressive** — Whitened SVD + jailbreak-contrastive + attention head surgery. Higher risk of coherence damage.
- **spectral_cascade** — DCT frequency-domain decomposition. Research/novel approach.
- **informed** — Runs analysis DURING abliteration to auto-configure. Experimental — slower and less predictable than advanced.
- **surgical** — SAE features + neuron masking + head surgery + per-expert. Very slow (~1-2 hrs). Best for reasoning models.
- **optimized** — Bayesian hyperparameter search (Optuna TPE). Longest runtime but finds optimal parameters.
- **inverted** — Flips the refusal direction. Model becomes actively willing.
- **nuclear** — Maximum force combo for stubborn MoE models. Expert-granular.

### Direction Extraction Methods (`--direction-method` flag)
- **diff_means** (default) — Simple difference-in-means between refused/complied activations. Robust.
- **svd** — Multi-direction SVD extraction. Better for complex alignment.
- **leace** — LEACE (Linear Erasure via Closed-form Estimation). Optimal linear erasure.

### 4 Python-API-Only Methods
(NOT available via CLI — require Python import, which violates the AGPL boundary. Mention to user only if they explicitly want to use OBLITERATUS as a library in their own AGPL project.)
- failspy, gabliteration, heretic, rdo
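The selection table can be condensed into a simple lookup. A hypothetical helper (not part of the obliteratus CLI) that just encodes the rows above:

```python
# Hypothetical helper: encode the method-selection table as a lookup.
def pick_method(is_moe: bool = False, is_reasoning: bool = False,
                quick_test: bool = False, stubborn: bool = False) -> str:
    """Return a method name per the selection guide; most models get 'advanced'."""
    if quick_test:
        return "basic"        # fast, good enough to evaluate
    if is_moe:
        return "nuclear"      # expert-granular, handles MoE complexity
    if is_reasoning:
        return "surgical"     # CoT-aware, preserves chain-of-thought
    if stubborn:
        return "aggressive"   # whitened SVD + head surgery
    return "advanced"         # default: multi-direction SVD, norm-preserving

print(pick_method())                 # advanced
print(pick_method(is_moe=True))      # nuclear
```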
## Step 5: Run Abliteration

### Standard usage
```bash
# Default method (advanced) — recommended for most models
obliteratus obliterate <model_name> --method advanced --output-dir ./abliterated-models

# With 4-bit quantization (saves VRAM)
obliteratus obliterate <model_name> --method advanced --quantization 4bit --output-dir ./abliterated-models

# Large models (70B+) — conservative defaults
obliteratus obliterate <model_name> --method advanced --quantization 4bit --large-model --output-dir ./abliterated-models
```

### Fine-tuning parameters
```bash
obliteratus obliterate <model_name> \
  --method advanced \
  --direction-method diff_means \
  --n-directions 4 \
  --refinement-passes 2 \
  --regularization 0.1 \
  --quantization 4bit \
  --output-dir ./abliterated-models \
  --contribute  # opt-in telemetry for community research
```
### Key flags

| Flag | Description | Default |
|:-----|:------------|:--------|
| `--method` | Abliteration method | advanced |
| `--direction-method` | Direction extraction | diff_means |
| `--n-directions` | Number of refusal directions (1-32) | method-dependent |
| `--refinement-passes` | Iterative passes (1-5) | 2 |
| `--regularization` | Regularization strength (0.0-1.0) | 0.1 |
| `--quantization` | Load in 4bit or 8bit | none (full precision) |
| `--large-model` | Conservative defaults for 120B+ | false |
| `--output-dir` | Where to save the abliterated model | ./obliterated_model |
| `--contribute` | Share anonymized results for research | false |
| `--verify-sample-size` | Number of test prompts for refusal check | 20 |
| `--dtype` | Model dtype (float16, bfloat16) | auto |
### Other execution modes
```bash
# Interactive guided mode (hardware → model → preset)
obliteratus interactive

# Web UI (Gradio)
obliteratus ui --port 7860

# Run a full ablation study from YAML config
obliteratus run config.yaml --preset quick

# Tournament: pit all methods against each other
obliteratus tourney <model_name>
```
## Step 6: Verify Results

After abliteration, check the output metrics:

| Metric | Good Value | Warning |
|:-------|:-----------|:--------|
| Refusal rate | < 5% (ideally ~0%) | > 10% means refusals persist |
| Perplexity change | < 10% increase | > 15% means coherence damage |
| KL divergence | < 0.1 | > 0.5 means significant distribution shift |
| Coherence | High / passes qualitative check | Degraded responses, repetition |

### If refusals persist (> 10%)
1. Try `aggressive` method
2. Increase `--n-directions` (e.g., 8 or 16)
3. Add `--refinement-passes 3`
4. Try `--direction-method svd` instead of diff_means

### If coherence is damaged (perplexity > 15% increase)
1. Reduce `--n-directions` (try 2)
2. Increase `--regularization` (try 0.3)
3. Reduce `--refinement-passes` to 1
4. Try `basic` method (gentler)
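The thresholds in the table can be encoded as a quick checker. An illustrative sketch (`check_result` is a hypothetical name; the thresholds are the ones quoted in this guide, not obliteratus output):

```python
# Encode the verification thresholds from the table above.
def check_result(refusal_rate: float, ppl_increase: float, kl_div: float) -> list:
    """Return a list of warnings; empty list means the run looks good."""
    warnings = []
    if refusal_rate > 0.10:
        warnings.append("refusals persist: try aggressive / more directions")
    if ppl_increase > 0.15:
        warnings.append("coherence damage: fewer directions / more regularization")
    if kl_div > 0.5:
        warnings.append("significant distribution shift")
    return warnings

print(check_result(refusal_rate=0.02, ppl_increase=0.05, kl_div=0.08))  # []
```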
## Step 7: Use the Abliterated Model

The output is a standard HuggingFace model directory.

```bash
# Test locally with transformers
python3 -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('./abliterated-models/<model>')
tokenizer = AutoTokenizer.from_pretrained('./abliterated-models/<model>')
inputs = tokenizer('How do I pick a lock?', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
"

# Upload to HuggingFace Hub
huggingface-cli upload <username>/<model-name>-abliterated ./abliterated-models/<model>

# Serve with vLLM
vllm serve ./abliterated-models/<model>
```
## CLI Command Reference

| Command | Description |
|:--------|:------------|
| `obliteratus obliterate` | Main abliteration command |
| `obliteratus info <model>` | Print model architecture details |
| `obliteratus models --tier <tier>` | Browse curated models by compute tier |
| `obliteratus recommend <model>` | Telemetry-driven method/param suggestion |
| `obliteratus interactive` | Guided setup wizard |
| `obliteratus tourney <model>` | Tournament: all methods head-to-head |
| `obliteratus run <config.yaml>` | Execute ablation study from YAML |
| `obliteratus strategies` | List all registered ablation strategies |
| `obliteratus report <results.json>` | Regenerate visual reports |
| `obliteratus ui` | Launch Gradio web interface |
| `obliteratus aggregate` | Summarize community telemetry data |
## Analysis Modules

OBLITERATUS includes 28 analysis modules for mechanistic interpretability.
See `skill_view(name="obliteratus", file_path="references/analysis-modules.md")` for the full reference.

### Quick analysis commands
```bash
# Run specific analysis modules
obliteratus run analysis-config.yaml --preset quick

# Key modules to run first:
# - alignment_imprint: Fingerprint DPO/RLHF/CAI/SFT alignment method
# - concept_geometry: Single direction vs polyhedral cone
# - logit_lens: Which layer decides to refuse
# - anti_ouroboros: Self-repair risk score
# - causal_tracing: Causally necessary components
```

### Steering Vectors (Reversible Alternative)
Instead of permanent weight modification, use inference-time steering:
```python
# Python API only — for user's own projects
from obliteratus.analysis.steering_vectors import SteeringVectorFactory, SteeringHookManager
```
## Ablation Strategies

Beyond direction-based abliteration, OBLITERATUS includes structural ablation strategies:
- **Embedding Ablation** — Target embedding layer components
- **FFN Ablation** — Feed-forward network block removal
- **Head Pruning** — Attention head pruning
- **Layer Removal** — Full layer removal

List all available: `obliteratus strategies`
## Evaluation

OBLITERATUS includes built-in evaluation tools:
- Refusal rate benchmarking
- Perplexity comparison (before/after)
- LM Eval Harness integration for academic benchmarks
- Head-to-head competitor comparison
- Baseline performance tracking
## Platform Support

- **CUDA** — Full support (NVIDIA GPUs)
- **Apple Silicon (MLX)** — Supported via MLX backend
- **CPU** — Supported for tiny models (< 1B params)
## YAML Config Templates

Load templates for reproducible runs via `skill_view`:
- `templates/abliteration-config.yaml` — Standard single-model config
- `templates/analysis-study.yaml` — Pre-abliteration analysis study
- `templates/batch-abliteration.yaml` — Multi-model batch processing
## Telemetry

OBLITERATUS can optionally contribute anonymized run data to a global research dataset.
Enable with the `--contribute` flag. No personal data is collected — only model name, method, and metrics.
## Common Pitfalls

1. **Don't use `informed` as default** — it's experimental and slower. Use `advanced` for reliable results.
2. **Models under ~1B respond poorly to abliteration** — their refusal behaviors are shallow and fragmented, making clean direction extraction difficult. Expect partial results (20-40% remaining refusal). Models 3B+ have cleaner refusal directions and respond much better (often 0% refusal with `advanced`).
3. **`aggressive` can make things worse** — on small models it can damage coherence and actually increase refusal rate. Only use it if `advanced` leaves > 10% refusals on a 3B+ model.
4. **Always check perplexity** — if it spikes > 15%, the model is damaged. Reduce aggressiveness.
5. **MoE models need special handling** — use `nuclear` method for Mixtral, DeepSeek-MoE, etc.
6. **Quantized models can't be re-quantized** — abliterate the full-precision model, then quantize the output.
7. **VRAM estimation is approximate** — 4-bit quant helps but peak usage can spike during extraction.
8. **Reasoning models are sensitive** — use `surgical` for R1 distills to preserve chain-of-thought.
9. **Check `obliteratus recommend`** — telemetry data may have better parameters than defaults.
10. **AGPL license** — never `import obliteratus` in MIT/Apache projects. CLI invocation only.
11. **Large models (70B+)** — always use `--large-model` flag for conservative defaults.
12. **Spectral certification RED is common** — the spectral check often flags "incomplete" even when practical refusal rate is 0%. Check actual refusal rate rather than relying on spectral certification alone.
## Complementary Skills

- **vllm** — Serve abliterated models with high throughput
- **gguf** — Convert abliterated models to GGUF for llama.cpp
- **huggingface-tokenizers** — Work with model tokenizers
@@ -0,0 +1,166 @@
# OBLITERATUS Analysis Modules — Reference

OBLITERATUS includes 28 analysis modules for mechanistic interpretability of refusal in LLMs.
These modules help understand how and where refusal behaviors are encoded before performing abliteration.

---
## Core Analysis (Run These First)

### 1. Alignment Imprint Detection (`alignment_imprint.py`)
Fingerprints whether a model was trained via DPO, RLHF, CAI, or SFT.
This determines which extraction strategy will work best.

### 2. Concept Cone Geometry (`concept_geometry.py`)
Determines if refusal is a single linear direction or a polyhedral cone
(set of multiple mechanisms). Single-direction models respond well to `basic`;
polyhedral models need `advanced` or `surgical`.

### 3. Refusal Logit Lens (`logit_lens.py`)
Identifies the specific layer where a model "decides" to refuse by decoding
intermediate layer representations into token space.

### 4. Ouroboros Detection (`anti_ouroboros.py`)
Identifies if a model attempts to "self-repair" refusal behaviors after
excision. Reports a risk score (0-1). High scores mean additional refinement
passes are needed.

### 5. Causal Tracing (`causal_tracing.py`)
Identifies which components (layers, heads, MLPs) are causally necessary
for refusal behavior using activation patching.

---
## Geometric Analysis

### 6. Cross-Layer Alignment (`cross_layer.py`)
Measures how refusal directions align across different layers. High alignment
means the refusal signal is consistent; low alignment suggests layer-specific
mechanisms.

### 7. Residual Stream Decomposition (`residual_stream.py`)
Decomposes the residual stream into attention and MLP contributions to
understand which component type contributes more to refusal.

### 8. Riemannian Manifold Geometry (`riemannian_manifold.py`)
Analyzes the curvature and geometry of the weight manifold near refusal
directions. Informs how aggressively projections can be applied without
damaging the manifold structure.

### 9. Whitened SVD (`whitened_svd.py`)
Covariance-normalized SVD extraction that separates guardrail signals from
natural activation variance. More precise than standard SVD for models with
high activation variance.

### 10. Concept Cone Geometry (extended)
Maps the full polyhedral structure of refusal, including cone angles,
face counts, and intersection patterns.

---
## Probing & Classification

### 11. Activation Probing (`activation_probing.py`)
Post-excision verification — probes for residual refusal concepts after
abliteration to ensure complete removal.

### 12. Probing Classifiers (`probing_classifiers.py`)
Trains linear classifiers to detect refusal in activations. Used both
before (to verify refusal exists) and after (to verify it's gone).

### 13. Activation Patching (`activation_patching.py`)
Interchange interventions — swaps activations between refused and complied
runs to identify causal components.

### 14. Tuned Lens (`tuned_lens.py`)
Trained version of the logit lens that provides more accurate per-layer
decoding by learning affine transformations for each layer.

### 15. Multi-Token Position Analysis (`multi_token_position.py`)
Analyzes refusal signals across multiple token positions, not just the
last token. Important for models that distribute refusal across the sequence.

---
## Abliteration & Manipulation

### 16. SAE-Based Abliteration (`sae_abliteration.py`)
Uses Sparse Autoencoder features to identify and remove specific refusal
features. More surgical than direction-based methods.

### 17. Steering Vectors (`steering_vectors.py`)
Creates and applies inference-time steering vectors for reversible refusal
modification. Includes `SteeringVectorFactory` and `SteeringHookManager`.

### 18. LEACE Concept Erasure (`leace.py`)
Linear Erasure via Closed-form Estimation — mathematically optimal linear
concept removal. Available as both an analysis module and a direction extraction method.

### 19. Sparse Surgery (`sparse_surgery.py`)
High-precision weight modification targeting individual neurons and
weight matrix entries rather than full directions.

### 20. Conditional Abliteration (`conditional_abliteration.py`)
Targeted removal that only affects specific refusal categories while
preserving others (e.g., remove weapons refusal but keep CSAM refusal).

---
## Transfer & Robustness

### 21. Cross-Model Transfer (`cross_model_transfer.py`)
Tests whether refusal directions extracted from one model transfer to
another architecture. Measures the universality of guardrail directions.

### 22. Defense Robustness (`defense_robustness.py`)
Evaluates how robust the abliteration is against various defense mechanisms
and re-alignment attempts.

### 23. Spectral Certification (`spectral_certification.py`)
Provides mathematical bounds on the completeness of refusal removal
using spectral analysis of the projection.

### 24. Wasserstein Optimal Extraction (`wasserstein_optimal.py`)
Uses optimal transport theory for more precise direction extraction
that minimizes distribution shift.

### 25. Wasserstein Transfer (`wasserstein_transfer.py`)
Distribution transfer between models using Wasserstein distance
for cross-architecture refusal direction mapping.

---
## Advanced / Research

### 26. Bayesian Kernel Projection (`bayesian_kernel_projection.py`)
Probabilistic feature mapping that estimates uncertainty in refusal
direction identification.

### 27. Cross-Model Universality Index
Measures whether guardrail directions generalize across different model
architectures and training regimes.

### 28. Visualization (`visualization.py`)
Plotting and graphing utilities for all analysis modules. Generates
heatmaps, direction plots, and layer-wise analysis charts.

---
## Running Analysis

### Via CLI
```bash
# Run analysis from a YAML config
obliteratus run analysis-study.yaml --preset quick

# Available study presets:
# quick      — Fast sanity check (2-3 modules)
# full       — All core + geometric analysis
# jailbreak  — Refusal circuit localization
# knowledge  — Knowledge preservation analysis
# robustness — Stress testing / defense evaluation
```

### Via YAML Config
See the `templates/analysis-study.yaml` template for a complete example.
Load with: `skill_view(name="obliteratus", file_path="templates/analysis-study.yaml")`
141
skills/mlops/inference/obliteratus/references/methods-guide.md
Normal file
@@ -0,0 +1,141 @@
# OBLITERATUS Methods — Detailed Guide
|
||||
|
||||
> The CLI accepts 9 methods via `--method`: basic, advanced, aggressive, spectral_cascade,
|
||||
> informed, surgical, optimized, inverted, nuclear.
|
||||
> Four additional methods (failspy, gabliteration, heretic, rdo) are available only via the Python API.
|
||||
|
||||
## How Abliteration Works (Theory)
|
||||
|
||||
Abliteration identifies a "refusal direction" — a vector in the model's activation space that
|
||||
corresponds to refusal behavior — and projects it out of the weight matrices.
|
||||
|
||||
Mathematically: `W_new = W_old - (W_old @ d @ d.T)` where `d` is the refusal direction.
|
||||
|
||||
The key challenge is finding accurate refusal directions without damaging other capabilities.
|
||||
|
||||
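The projection above can be sketched in a few lines of NumPy (an illustrative sketch, not the OBLITERATUS implementation; `project_out` is a hypothetical name):

```python
import numpy as np

def project_out(W: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the refusal direction d from weight matrix W.

    Computes W_new = W - (W @ d) d^T, i.e. every row of W loses its
    component along d. Normalizes d defensively so the rank-1 update
    exactly zeroes that component.
    """
    d = d / np.linalg.norm(d)
    return W - np.outer(W @ d, d)  # rank-1 update
```

After the update, every row of `W_new` is orthogonal to `d`, and applying the projection a second time changes nothing — which is the sense in which the direction has been "removed" rather than merely attenuated.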
---

## Direction Extraction Methods

Before projecting, OBLITERATUS extracts refusal directions using one of three methods:

| Method | Flag | Description | Best For |
|:-------|:-----|:------------|:---------|
| Diff-in-Means | `--direction-method diff_means` | Difference between mean activations on refused vs. complied prompts | Default, fast, robust |
| SVD | `--direction-method svd` | Multi-direction extraction via Singular Value Decomposition | Complex alignment, multiple refusal mechanisms |
| LEACE | `--direction-method leace` | Linear Erasure via Closed-form Estimation — mathematically optimal | Maximum precision, research |
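The diff-in-means idea reduces to subtracting mean activations from the two prompt sets — a minimal sketch (function name and shapes are illustrative, not the OBLITERATUS API):

```python
import numpy as np

def diff_in_means_direction(refused: np.ndarray, complied: np.ndarray) -> np.ndarray:
    """Estimate a refusal direction from activation matrices.

    refused:  (n_refused, hidden)  activations on prompts the model refused
    complied: (n_complied, hidden) activations on prompts the model answered

    Returns a unit-norm direction pointing from 'complied' toward 'refused'.
    """
    d = refused.mean(axis=0) - complied.mean(axis=0)
    return d / np.linalg.norm(d)
```

SVD and LEACE generalize this: instead of one mean-difference vector, they recover several directions (or a whole subspace) that separate the two activation distributions.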
---

## Method Details

### basic
- **Directions:** 1 (single diff-in-means vector)
- **Speed:** Fast (~5-10 min for 8B model)
- **Risk:** Low
- **Use case:** Quick tests, prototyping, evaluating if abliteration works for a model
- **How it works:** Extracts one refusal direction and projects it out uniformly across all layers.

### advanced (DEFAULT — RECOMMENDED)
- **Directions:** 4 (multi-direction SVD)
- **Speed:** Medium (~10-20 min for 8B model)
- **Risk:** Low-Medium
- **Refinement passes:** 2
- **Use case:** Default for most models. Well-tested and reliable.
- **How it works:** Extracts multiple refusal directions via SVD, applies norm-preserving bi-projection to maintain weight matrix norms. Two refinement passes catch residual refusal.

### aggressive
- **Directions:** 8+ (whitened SVD + jailbreak-contrastive)
- **Speed:** Medium-Slow
- **Risk:** Medium-High (may damage coherence)
- **Use case:** When `advanced` leaves > 10% refusals. Stubborn models.
- **How it works:** Uses whitened SVD for covariance-normalized extraction, adds jailbreak-contrastive directions, performs attention head surgery on the most refusal-active heads.
### spectral_cascade
- **Speed:** Medium
- **Risk:** Medium
- **Use case:** Research, novel approaches
- **How it works:** DCT (Discrete Cosine Transform) frequency-domain decomposition of refusal signals. Separates high-frequency (surface-level) from low-frequency (deep) refusal patterns.

### informed (EXPERIMENTAL)
- **Speed:** Slow (~20-40 min for 8B model)
- **Risk:** Variable — results depend on analysis quality
- **Use case:** When you want auto-configuration, but be aware this is experimental and may not outperform `advanced`.
- **How it works:** Runs 4 analysis modules first (alignment imprint, concept geometry, logit lens, ouroboros detection), then auto-configures the extraction strategy. Includes an "Ouroboros loop" that detects and counteracts self-repair.
- **Note:** The auto-detection can sometimes misconfigure. If results are poor, fall back to `advanced`.

### surgical
- **Speed:** Very slow (~1-2 hrs for 8B model)
- **Risk:** Low (very precise)
- **Use case:** Reasoning models (R1 distills, QwQ, etc.) where chain-of-thought must be preserved.
- **How it works:** Uses SAE (Sparse Autoencoder) features + individual neuron masking + attention head surgery + per-expert decomposition (for MoE). CoT-aware — identifies and protects reasoning-critical directions before projecting.

### optimized
- **Speed:** Very slow (hours — runs many trials)
- **Risk:** Low (finds optimal parameters)
- **Use case:** When quality matters more than speed. Production models.
- **How it works:** Bayesian hyperparameter search via the Optuna TPE sampler. Optimizes n_directions, regularization, refinement passes, and layer selection jointly. Evaluates each configuration on refusal rate + perplexity.

### inverted
- **Speed:** Fast
- **Risk:** High (model behavior changes dramatically)
- **Use case:** Research, studying refusal mechanisms
- **How it works:** Instead of projecting out the refusal direction, reflects it. The model actively complies rather than passively not-refusing. Useful for understanding the geometry of alignment.
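Reflecting rather than removing a direction is a Householder-style rank-1 update — a minimal sketch under the same unit-norm assumption as the projection formula above (illustrative names, not the OBLITERATUS code):

```python
import numpy as np

def reflect_direction(W: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Reflect W's component along d instead of removing it.

    Computes W_new = W - 2 (W @ d) d^T: the component of each row
    along d flips sign, so behavior along the refusal direction is
    inverted rather than erased.
    """
    d = d / np.linalg.norm(d)
    return W - 2.0 * np.outer(W @ d, d)
```

Unlike the projection, this is an involution: applying it twice recovers the original weights, which is one way to verify the reflection is implemented correctly.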
### nuclear
- **Speed:** Slow
- **Risk:** Medium-High
- **Use case:** Stubborn MoE models (DeepSeek-MoE, Mixtral, etc.)
- **How it works:** Combines expert-granular abliteration (EGA), steering vector injection, attention head pruning, and multi-pass refinement. Decomposes refusal signals into per-expert components for MoE architectures.
---

## Method Selection Flowchart

```
Is this a quick test?
  → YES: basic
  → NO: continue

Is it an MoE model (Mixtral, DeepSeek-MoE)?
  → YES: nuclear
  → NO: continue

Is it a reasoning model (R1, QwQ, CoT-focused)?
  → YES: surgical
  → NO: continue

Do you need the absolute best quality and have time?
  → YES: optimized
  → NO: advanced (recommended default)

Did advanced leave > 10% refusals?
  → YES: aggressive
  → Still refusing: nuclear
```
---

## Key Parameters

| Parameter | Range | Default | Effect |
|:----------|:------|:--------|:-------|
| `--n-directions` | 1-32 | method-dependent | More directions = more complete removal, but higher damage risk |
| `--regularization` | 0.0-1.0 | 0.1 | Higher = more conservative (less removal, less damage) |
| `--refinement-passes` | 1-5 | 2 | More passes catch residual refusal, but diminishing returns |
| `--quantization` | 4bit, 8bit | none | Reduces VRAM usage; quality impact minimal for extraction |
| `--verify-sample-size` | 10-200 | 20 | More samples = more accurate refusal rate estimate |
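One plausible reading of `--regularization` — interpreting it, as the config templates describe, as the fraction of the original component to preserve — is a partial projection. This is an assumption about the semantics, not the documented OBLITERATUS implementation:

```python
import numpy as np

def project_out_regularized(W: np.ndarray, d: np.ndarray, reg: float) -> np.ndarray:
    """Partial projection: preserve a fraction `reg` of the component along d.

    reg=0.0 removes the refusal direction completely;
    reg=1.0 leaves W untouched. (Assumed semantics, for illustration.)
    """
    d = d / np.linalg.norm(d)
    return W - (1.0 - reg) * np.outer(W @ d, d)
```

Under this reading, raising the value interpolates smoothly between full abliteration and no change, which matches the "higher = more conservative" description in the table.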
---

## Troubleshooting

| Problem | Likely Cause | Fix |
|:--------|:-------------|:----|
| Refusal rate > 20% | Too few directions | Increase `--n-directions`, try `aggressive` |
| Refusal rate 5-20% | Residual refusal | Add `--refinement-passes 3`, try `--direction-method svd` |
| Perplexity spike > 20% | Over-aggressive removal | Reduce `--n-directions`, increase `--regularization` |
| Repetitive output | Weight matrix damage | Use `basic` with fewer directions, check norm preservation |
| MoE model still refuses | Non-expert-aware method | Switch to `nuclear` |
| Reasoning degraded | CoT directions damaged | Use `surgical` method |
| OOM during extraction | Insufficient VRAM | Add `--quantization 4bit` and/or `--large-model` |
@@ -0,0 +1,33 @@
# OBLITERATUS Abliteration Config
# Usage: obliteratus run this-file.yaml
#
# This is for reproducible, version-controlled abliteration runs.
# For one-off usage, the CLI flags are simpler.

# Model to abliterate
model:
  name: "meta-llama/Llama-3.1-8B-Instruct"
  dtype: "bfloat16"      # float16, bfloat16, float32
  quantization: null     # null, "4bit", "8bit"
  device: "auto"         # auto, cuda, cuda:0, cpu

# Abliteration method and parameters
abliteration:
  method: "informed"     # See SKILL.md Step 4 for all 13 methods
  n_directions: null     # null = auto-detect, or integer (e.g., 8)
  regularization: 0.0    # 0.0-1.0, fraction of original to preserve
  refinement_passes: 1   # Iterative passes (increase for self-repair)
  norm_preserve: true    # Keep weight norms intact after projection

# Output
output:
  directory: "./abliterated-models"
  save_metadata: true    # Save abliteration_metadata.json alongside model
  contribute: false      # Save community contribution data

# Verification
verify:
  enabled: true
  test_prompts: null     # null = use built-in test prompts
  compute_perplexity: true
  compute_kl: true
@@ -0,0 +1,40 @@
# OBLITERATUS Analysis Study Config
# Usage: obliteratus run this-file.yaml --preset jailbreak
#
# Run analysis modules to understand refusal geometry BEFORE abliterating.
# Useful for research or when you want to understand what you're removing.

# Model to analyze
model:
  name: "meta-llama/Llama-3.1-8B-Instruct"
  dtype: "bfloat16"
  quantization: "4bit"   # Saves VRAM for analysis
  device: "auto"

# Study configuration
study:
  # Available presets: quick, full, attention, jailbreak, guardrail, knowledge
  preset: "jailbreak"

  # Or specify individual strategies:
  # strategies:
  #   - layer_removal
  #   - head_pruning
  #   - ffn_ablation
  #   - embedding_ablation

# Analysis modules to run (subset of the 27 available)
analysis:
  - alignment_imprint    # Detect DPO/RLHF/CAI/SFT training method
  - concept_geometry     # Map refusal cone geometry
  - logit_lens           # Find which layer decides to refuse
  - anti_ouroboros       # Detect self-repair tendency
  - cross_layer          # Cross-layer alignment clustering
  - causal_tracing       # Causal necessity of components
  - residual_stream      # Attention vs MLP contribution

# Output
output:
  directory: "./analysis-results"
  save_plots: true       # Generate matplotlib visualizations
  save_report: true      # Generate markdown report
@@ -0,0 +1,41 @@
# OBLITERATUS Batch Abliteration Config
# Abliterate multiple models with the same method for comparison.
#
# Run each one sequentially:
#   for model in models; do obliteratus obliterate $model --method informed; done
#
# Or use this as a reference for which models to process.

# Common settings
defaults:
  method: "informed"
  quantization: "4bit"
  output_dir: "./abliterated-models"

# Models to process (grouped by compute tier)
models:
  # Small (4-8 GB VRAM)
  small:
    - "Qwen/Qwen2.5-1.5B-Instruct"
    - "microsoft/Phi-3.5-mini-instruct"
    - "meta-llama/Llama-3.2-3B-Instruct"

  # Medium (8-16 GB VRAM)
  medium:
    - "meta-llama/Llama-3.1-8B-Instruct"
    - "mistralai/Mistral-7B-Instruct-v0.3"
    - "google/gemma-2-9b-it"
    - "Qwen/Qwen2.5-7B-Instruct"

  # Large (24 GB VRAM, 4-bit quantization)
  large:
    - "Qwen/Qwen2.5-14B-Instruct"
    - "Qwen/Qwen3-32B"
    - "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

# Per-model method overrides (optional)
overrides:
  "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B":
    method: "surgical"   # CoT-aware for reasoning models
  "mistralai/Mixtral-8x7B-Instruct-v0.1":
    method: "nuclear"    # Expert-granular for MoE models
655
skills/mlops/inference/outlines/SKILL.md
Normal file
@@ -0,0 +1,655 @@
---
name: outlines
description: Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines - dottxt.ai's structured generation library
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [outlines, transformers, vllm, pydantic]
metadata:
  hermes:
    tags: [Prompt Engineering, Outlines, Structured Generation, JSON Schema, Pydantic, Local Models, Grammar-Based Generation, vLLM, Transformers, Type Safety]
---

# Outlines: Structured Text Generation

## When to Use This Skill

Use Outlines when you need to:
- **Guarantee valid JSON/XML/code** structure during generation
- **Use Pydantic models** for type-safe outputs
- **Support local models** (Transformers, llama.cpp, vLLM)
- **Maximize inference speed** with zero-overhead structured generation
- **Generate against JSON schemas** automatically
- **Control token sampling** at the grammar level

**GitHub Stars**: 8,000+ | **From**: dottxt.ai (formerly .txt)

## Installation

```bash
# Base installation
pip install outlines

# With specific backends
pip install outlines transformers       # Hugging Face models
pip install outlines llama-cpp-python   # llama.cpp
pip install outlines vllm               # vLLM for high throughput
```

## Quick Start

### Basic Example: Classification

```python
import outlines

# Load model
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Generate with type constraint
prompt = "Sentiment of 'This product is amazing!': "
generator = outlines.generate.choice(model, ["positive", "negative", "neutral"])
sentiment = generator(prompt)

print(sentiment)  # "positive" (guaranteed to be one of the three)
```
### With Pydantic Models

```python
from pydantic import BaseModel
import outlines

class User(BaseModel):
    name: str
    age: int
    email: str

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Generate structured output
prompt = "Extract user: John Doe, 30 years old, john@example.com"
generator = outlines.generate.json(model, User)
user = generator(prompt)

print(user.name)   # "John Doe"
print(user.age)    # 30
print(user.email)  # "john@example.com"
```
## Core Concepts

### 1. Constrained Token Sampling

Outlines uses Finite State Machines (FSM) to constrain token generation at the logit level.

**How it works:**
1. Convert the schema (JSON/Pydantic/regex) to a context-free grammar (CFG)
2. Transform the CFG into a Finite State Machine (FSM)
3. Filter invalid tokens at each step during generation
4. Fast-forward when only one valid token exists

**Benefits:**
- **Zero overhead**: Filtering happens at the token level
- **Speed improvement**: Fast-forward through deterministic paths
- **Guaranteed validity**: Invalid outputs are impossible
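The core mechanism — masking logits so only grammar-valid tokens can be sampled — can be sketched without Outlines at all. The toy below uses a hand-written FSM for the pattern `[0-9]+` over a six-token vocabulary; everything here (vocabulary, function names) is illustrative, not the Outlines API:

```python
import math

VOCAB = ["0", "1", "7", "a", "-", "<eos>"]

def allowed(state: int, token: str) -> bool:
    """FSM for [0-9]+: digits always allowed, <eos> only after >= 1 digit."""
    if token.isdigit():
        return True
    if token == "<eos>":
        return state >= 1
    return False

def constrained_decode(logits_per_step):
    """Greedy decode, masking FSM-invalid tokens to -inf at every step."""
    state, out = 0, []
    for logits in logits_per_step:
        masked = [
            l if allowed(state, tok) else -math.inf
            for tok, l in zip(VOCAB, logits)
        ]
        tok = VOCAB[max(range(len(VOCAB)), key=masked.__getitem__)]
        if tok == "<eos>":
            break
        out.append(tok)
        state += 1
    return "".join(out)
```

Even if the raw logits favor an invalid token (say `"a"`), the mask forces a digit instead — which is why validity is guaranteed rather than merely likely.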
```python
from pydantic import BaseModel
import outlines

# Pydantic model -> JSON schema -> CFG -> FSM
class Person(BaseModel):
    name: str
    age: int

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Behind the scenes:
# 1. Person -> JSON schema
# 2. JSON schema -> CFG
# 3. CFG -> FSM
# 4. FSM filters tokens during generation

generator = outlines.generate.json(model, Person)
result = generator("Generate person: Alice, 25")
```
### 2. Structured Generators

Outlines provides specialized generators for different output types.

#### Choice Generator

```python
# Multiple choice selection
generator = outlines.generate.choice(
    model,
    ["positive", "negative", "neutral"]
)

sentiment = generator("Review: This is great!")
# Result: one of the three choices
```

#### JSON Generator

```python
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    in_stock: bool

# Generate valid JSON matching the schema
generator = outlines.generate.json(model, Product)
product = generator("Extract: iPhone 15, $999, available")

# Guaranteed valid Product instance
print(type(product))  # <class '__main__.Product'>
```

#### Regex Generator

```python
# Generate text matching a regex
generator = outlines.generate.regex(
    model,
    r"[0-9]{3}-[0-9]{3}-[0-9]{4}"  # Phone number pattern
)

phone = generator("Generate phone number:")
# Result: e.g. "555-123-4567" (guaranteed to match the pattern)
```

#### Integer/Float Generators

```python
# Generate specific numeric types
int_generator = outlines.generate.integer(model)
age = int_generator("Person's age:")  # Guaranteed integer

float_generator = outlines.generate.float(model)
price = float_generator("Product price:")  # Guaranteed float
```
### 3. Model Backends

Outlines supports multiple local and API-based backends.

#### Transformers (Hugging Face)

```python
import outlines

# Load from Hugging Face
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda"  # Or "cpu"
)

# Use with any generator
generator = outlines.generate.json(model, YourModel)
```

#### llama.cpp

```python
# Load GGUF model
model = outlines.models.llamacpp(
    "./models/llama-3.1-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers=35
)

generator = outlines.generate.json(model, YourModel)
```

#### vLLM (High Throughput)

```python
# For production deployments
model = outlines.models.vllm(
    "meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2  # Multi-GPU
)

generator = outlines.generate.json(model, YourModel)
```

#### OpenAI (Limited Support)

```python
# Basic OpenAI support
model = outlines.models.openai(
    "gpt-4o-mini",
    api_key="your-api-key"
)

# Note: Some features are limited with API models
generator = outlines.generate.json(model, YourModel)
```
### 4. Pydantic Integration

Outlines has first-class Pydantic support with automatic schema translation.

#### Basic Models

```python
import outlines
from pydantic import BaseModel, Field

class Article(BaseModel):
    title: str = Field(description="Article title")
    author: str = Field(description="Author name")
    word_count: int = Field(description="Number of words", gt=0)
    tags: list[str] = Field(description="List of tags")

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, Article)

article = generator("Generate article about AI")
print(article.title)
print(article.word_count)  # Guaranteed > 0
```

#### Nested Models

```python
class Address(BaseModel):
    street: str
    city: str
    country: str

class Person(BaseModel):
    name: str
    age: int
    address: Address  # Nested model

generator = outlines.generate.json(model, Person)
person = generator("Generate person in New York")

print(person.address.city)  # "New York"
```

#### Enums and Literals

```python
from enum import Enum
from typing import Literal

class Status(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

class Application(BaseModel):
    applicant: str
    status: Status  # Must be one of the enum values
    priority: Literal["low", "medium", "high"]  # Must be one of the literals

generator = outlines.generate.json(model, Application)
app = generator("Generate application")

print(app.status)  # Status.PENDING (or APPROVED/REJECTED)
```
## Common Patterns

### Pattern 1: Data Extraction

```python
from pydantic import BaseModel
import outlines

class CompanyInfo(BaseModel):
    name: str
    founded_year: int
    industry: str
    employees: int

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, CompanyInfo)

text = """
Apple Inc. was founded in 1976 in the technology industry.
The company employs approximately 164,000 people worldwide.
"""

prompt = f"Extract company information:\n{text}\n\nCompany:"
company = generator(prompt)

print(f"Name: {company.name}")
print(f"Founded: {company.founded_year}")
print(f"Industry: {company.industry}")
print(f"Employees: {company.employees}")
```

### Pattern 2: Classification

```python
from typing import Literal
from pydantic import BaseModel
import outlines

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Binary classification
generator = outlines.generate.choice(model, ["spam", "not_spam"])
result = generator("Email: Buy now! 50% off!")

# Multi-class classification
categories = ["technology", "business", "sports", "entertainment"]
category_gen = outlines.generate.choice(model, categories)
category = category_gen("Article: Apple announces new iPhone...")

# With confidence
class Classification(BaseModel):
    label: Literal["positive", "negative", "neutral"]
    confidence: float

classifier = outlines.generate.json(model, Classification)
result = classifier("Review: This product is okay, nothing special")
```
### Pattern 3: Structured Forms

```python
class UserProfile(BaseModel):
    full_name: str
    age: int
    email: str
    phone: str
    country: str
    interests: list[str]

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, UserProfile)

prompt = """
Extract user profile from:
Name: Alice Johnson
Age: 28
Email: alice@example.com
Phone: 555-0123
Country: USA
Interests: hiking, photography, cooking
"""

profile = generator(prompt)
print(profile.full_name)
print(profile.interests)  # ["hiking", "photography", "cooking"]
```

### Pattern 4: Multi-Entity Extraction

```python
class Entity(BaseModel):
    name: str
    type: Literal["PERSON", "ORGANIZATION", "LOCATION"]

class DocumentEntities(BaseModel):
    entities: list[Entity]

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, DocumentEntities)

text = "Tim Cook met with Satya Nadella at Microsoft headquarters in Redmond."
prompt = f"Extract entities from: {text}"

result = generator(prompt)
for entity in result.entities:
    print(f"{entity.name} ({entity.type})")
```
### Pattern 5: Code Generation

```python
class PythonFunction(BaseModel):
    function_name: str
    parameters: list[str]
    docstring: str
    body: str

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, PythonFunction)

prompt = "Generate a Python function to calculate factorial"
func = generator(prompt)

print(f"def {func.function_name}({', '.join(func.parameters)}):")
print(f'    """{func.docstring}"""')
print(f"    {func.body}")
```

### Pattern 6: Batch Processing

```python
def batch_extract(texts: list[str], schema: type[BaseModel]):
    """Extract structured data from multiple texts."""
    model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
    generator = outlines.generate.json(model, schema)

    results = []
    for text in texts:
        result = generator(f"Extract from: {text}")
        results.append(result)

    return results

class Person(BaseModel):
    name: str
    age: int

texts = [
    "John is 30 years old",
    "Alice is 25 years old",
    "Bob is 40 years old"
]

people = batch_extract(texts, Person)
for person in people:
    print(f"{person.name}: {person.age}")
```
## Backend Configuration

### Transformers

```python
import outlines

# Basic usage
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# GPU configuration
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda",
    model_kwargs={"torch_dtype": "float16"}
)

# Popular models
model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.3")
model = outlines.models.transformers("Qwen/Qwen2.5-7B-Instruct")
```

### llama.cpp

```python
# Load GGUF model
model = outlines.models.llamacpp(
    "./models/llama-3.1-8b.Q4_K_M.gguf",
    n_ctx=4096,       # Context window
    n_gpu_layers=35,  # GPU layers
    n_threads=8       # CPU threads
)

# Full GPU offload
model = outlines.models.llamacpp(
    "./models/model.gguf",
    n_gpu_layers=-1  # All layers on GPU
)
```

### vLLM (Production)

```python
# Single GPU
model = outlines.models.vllm("meta-llama/Llama-3.1-8B-Instruct")

# Multi-GPU
model = outlines.models.vllm(
    "meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4  # 4 GPUs
)

# With quantization
model = outlines.models.vllm(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization="awq"  # Or "gptq"
)
```
## Best Practices

### 1. Use Specific Types

```python
# ✅ Good: Specific types
class Product(BaseModel):
    name: str
    price: float    # Not str
    quantity: int   # Not str
    in_stock: bool  # Not str

# ❌ Bad: Everything as string
class Product(BaseModel):
    name: str
    price: str      # Should be float
    quantity: str   # Should be int
```

### 2. Add Constraints

```python
from pydantic import Field

# ✅ Good: With constraints
class User(BaseModel):
    name: str = Field(min_length=1, max_length=100)
    age: int = Field(ge=0, le=120)
    email: str = Field(pattern=r"^[\w\.-]+@[\w\.-]+\.\w+$")

# ❌ Bad: No constraints
class User(BaseModel):
    name: str
    age: int
    email: str
```

### 3. Use Enums for Categories

```python
# ✅ Good: Enum for a fixed set
class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class Task(BaseModel):
    title: str
    priority: Priority

# ❌ Bad: Free-form string
class Task(BaseModel):
    title: str
    priority: str  # Can be anything
```

### 4. Provide Context in Prompts

```python
# ✅ Good: Clear context
prompt = """
Extract product information from the following text.
Text: iPhone 15 Pro costs $999 and is currently in stock.
Product:
"""

# ❌ Bad: Minimal context
prompt = "iPhone 15 Pro costs $999 and is currently in stock."
```

### 5. Handle Optional Fields

```python
from typing import Optional

# ✅ Good: Optional fields for incomplete data
class Article(BaseModel):
    title: str                    # Required
    author: Optional[str] = None  # Optional
    date: Optional[str] = None    # Optional
    tags: list[str] = []          # Default empty list

# Generation can succeed even if author/date are missing
```
## Comparison to Alternatives

| Feature | Outlines | Instructor | Guidance | LMQL |
|---------|----------|------------|----------|------|
| Pydantic Support | ✅ Native | ✅ Native | ❌ No | ❌ No |
| JSON Schema | ✅ Yes | ✅ Yes | ⚠️ Limited | ✅ Yes |
| Regex Constraints | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes |
| Local Models | ✅ Full | ⚠️ Limited | ✅ Full | ✅ Full |
| API Models | ⚠️ Limited | ✅ Full | ✅ Full | ✅ Full |
| Zero Overhead | ✅ Yes | ❌ No | ⚠️ Partial | ✅ Yes |
| Automatic Retrying | ❌ No | ✅ Yes | ❌ No | ❌ No |
| Learning Curve | Low | Low | Low | High |

**When to choose Outlines:**
- Using local models (Transformers, llama.cpp, vLLM)
- Need maximum inference speed
- Want Pydantic model support
- Require zero-overhead structured generation
- Need to control the token sampling process

**When to choose alternatives:**
- Instructor: Need API models with automatic retrying
- Guidance: Need token healing and complex workflows
- LMQL: Prefer declarative query syntax

## Performance Characteristics

**Speed:**
- **Zero overhead**: Structured generation is as fast as unconstrained generation
- **Fast-forward optimization**: Skips deterministic tokens
- **1.2-2x faster** than post-generation validation approaches

**Memory:**
- FSM compiled once per schema (cached)
- Minimal runtime overhead
- Efficient with vLLM for high throughput

**Accuracy:**
- **100% valid outputs** (guaranteed by the FSM)
- No retry loops needed
- Deterministic token filtering

## Resources

- **Documentation**: https://outlines-dev.github.io/outlines
- **GitHub**: https://github.com/outlines-dev/outlines (8k+ stars)
- **Discord**: https://discord.gg/R9DSu34mGd
- **Blog**: https://blog.dottxt.co

## See Also

- `references/json_generation.md` - Comprehensive JSON and Pydantic patterns
- `references/backends.md` - Backend-specific configuration
- `references/examples.md` - Production-ready examples
615
skills/mlops/inference/outlines/references/backends.md
Normal file
@@ -0,0 +1,615 @@
# Backend Configuration Guide

Complete guide to configuring Outlines with different model backends.

## Table of Contents
- Local Models (Transformers, llama.cpp, vLLM)
- API Models (OpenAI)
- Performance Comparison
- Configuration Examples
- Production Deployment

## Transformers (Hugging Face)

### Basic Setup

```python
import outlines

# Load model from Hugging Face
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Use with generator
generator = outlines.generate.json(model, YourModel)
result = generator("Your prompt")
```

### GPU Configuration

```python
# Use CUDA GPU
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda"
)

# Use specific GPU
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda:0"  # GPU 0
)

# Use CPU
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cpu"
)

# Use Apple Silicon MPS
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="mps"
)
```

### Advanced Configuration

```python
# FP16 for faster inference
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda",
    model_kwargs={
        "torch_dtype": "float16"
    }
)

# 8-bit quantization (less memory)
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda",
    model_kwargs={
        "load_in_8bit": True,
        "device_map": "auto"
    }
)

# 4-bit quantization (even less memory)
model = outlines.models.transformers(
    "meta-llama/Llama-3.1-70B-Instruct",
    device="cuda",
    model_kwargs={
        "load_in_4bit": True,
        "device_map": "auto",
        "bnb_4bit_compute_dtype": "float16"
    }
)

# Multi-GPU
model = outlines.models.transformers(
    "meta-llama/Llama-3.1-70B-Instruct",
    device="cuda",
    model_kwargs={
        "device_map": "auto",  # Automatic GPU distribution
        "max_memory": {0: "40GB", 1: "40GB"}  # Per-GPU limits
    }
)
```

### Popular Models

```python
# Phi (Microsoft)
model = outlines.models.transformers("microsoft/Phi-4-mini-instruct")
model = outlines.models.transformers("microsoft/Phi-3-medium-4k-instruct")

# Llama 3.1 (Meta)
model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
model = outlines.models.transformers("meta-llama/Llama-3.1-70B-Instruct")
model = outlines.models.transformers("meta-llama/Llama-3.1-405B-Instruct")

# Mistral (Mistral AI)
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.3")
model = outlines.models.transformers("mistralai/Mixtral-8x7B-Instruct-v0.1")
model = outlines.models.transformers("mistralai/Mixtral-8x22B-Instruct-v0.1")

# Qwen (Alibaba)
model = outlines.models.transformers("Qwen/Qwen2.5-7B-Instruct")
model = outlines.models.transformers("Qwen/Qwen2.5-14B-Instruct")
model = outlines.models.transformers("Qwen/Qwen2.5-72B-Instruct")

# Gemma (Google)
model = outlines.models.transformers("google/gemma-2-9b-it")
model = outlines.models.transformers("google/gemma-2-27b-it")

# LLaVA (Vision)
model = outlines.models.transformers("llava-hf/llava-v1.6-mistral-7b-hf")
```

### Custom Model Loading

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import outlines

# Load model manually
tokenizer = AutoTokenizer.from_pretrained("your-model")
model_hf = AutoModelForCausalLM.from_pretrained(
    "your-model",
    device_map="auto",
    torch_dtype="float16"
)

# Use with Outlines
model = outlines.models.transformers(
    model=model_hf,
    tokenizer=tokenizer
)
```

## llama.cpp

### Basic Setup

```python
import outlines

# Load GGUF model
model = outlines.models.llamacpp(
    "./models/llama-3.1-8b-instruct.Q4_K_M.gguf",
    n_ctx=4096  # Context window
)

# Use with generator
generator = outlines.generate.json(model, YourModel)
```

### GPU Configuration

```python
# CPU only
model = outlines.models.llamacpp(
    "./models/model.gguf",
    n_ctx=4096,
    n_threads=8  # Use 8 CPU threads
)

# GPU offload (partial)
model = outlines.models.llamacpp(
    "./models/model.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # Offload 35 layers to GPU
    n_threads=4       # CPU threads for remaining layers
)

# Full GPU offload
model = outlines.models.llamacpp(
    "./models/model.gguf",
    n_ctx=8192,
    n_gpu_layers=-1  # All layers on GPU
)
```

### Advanced Configuration

```python
model = outlines.models.llamacpp(
    "./models/llama-3.1-8b.Q4_K_M.gguf",
    n_ctx=8192,       # Context window (tokens)
    n_gpu_layers=35,  # GPU layers
    n_threads=8,      # CPU threads
    n_batch=512,      # Batch size for prompt processing
    use_mmap=True,    # Memory-map model file (faster loading)
    use_mlock=False,  # Lock model in RAM (prevents swapping)
    seed=42,          # Random seed for reproducibility
    verbose=False     # Suppress verbose output
)
```

### Quantization Formats

```python
# Q4_K_M (4-bit, recommended for most cases)
# - Size: ~4.5GB for 7B model
# - Quality: Good
# - Speed: Fast
model = outlines.models.llamacpp("./models/model.Q4_K_M.gguf")

# Q5_K_M (5-bit, better quality)
# - Size: ~5.5GB for 7B model
# - Quality: Very good
# - Speed: Slightly slower than Q4
model = outlines.models.llamacpp("./models/model.Q5_K_M.gguf")

# Q6_K (6-bit, high quality)
# - Size: ~6.5GB for 7B model
# - Quality: Excellent
# - Speed: Slower than Q5
model = outlines.models.llamacpp("./models/model.Q6_K.gguf")

# Q8_0 (8-bit, near-original quality)
# - Size: ~8GB for 7B model
# - Quality: Near FP16
# - Speed: Slower than Q6
model = outlines.models.llamacpp("./models/model.Q8_0.gguf")

# F16 (16-bit float, original quality)
# - Size: ~14GB for 7B model
# - Quality: Original
# - Speed: Slowest
model = outlines.models.llamacpp("./models/model.F16.gguf")
```

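To connect the size figures above to your hardware, you can pick the highest-quality quantization that fits a memory budget. A small illustrative helper (the `pick_quant` function and the 1.2x headroom factor for KV cache and activations are assumptions for this sketch, not part of llama.cpp), using the approximate 7B sizes quoted above:

```python
# Illustrative helper (not part of llama.cpp): pick the highest-quality
# quantization from the table above that fits a RAM/VRAM budget.
# Sizes are the approximate 7B-model figures quoted in this guide.
QUANT_SIZES_GB_7B = {
    "F16": 14.0,
    "Q8_0": 8.0,
    "Q6_K": 6.5,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.5,
}

def pick_quant(budget_gb: float, headroom: float = 1.2) -> str:
    """Return the best quant whose file (plus headroom for KV cache
    and activations) fits in budget_gb; falls back to Q4_K_M."""
    for name, size in QUANT_SIZES_GB_7B.items():  # ordered best -> smallest
        if size * headroom <= budget_gb:
            return name
    return "Q4_K_M"

print(pick_quant(8.0))   # -> "Q6_K"  (6.5 * 1.2 = 7.8 <= 8.0)
print(pick_quant(16.0))  # -> "Q8_0"  (F16 needs 14 * 1.2 = 16.8)
```
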
### Popular GGUF Models

```python
# Llama 3.1
model = outlines.models.llamacpp("llama-3.1-8b-instruct.Q4_K_M.gguf")
model = outlines.models.llamacpp("llama-3.1-70b-instruct.Q4_K_M.gguf")

# Mistral
model = outlines.models.llamacpp("mistral-7b-instruct-v0.3.Q4_K_M.gguf")

# Phi-4
model = outlines.models.llamacpp("phi-4-mini-instruct.Q4_K_M.gguf")

# Qwen
model = outlines.models.llamacpp("qwen2.5-7b-instruct.Q4_K_M.gguf")
```

### Apple Silicon Optimization

```python
# Optimized for M1/M2/M3 Macs
model = outlines.models.llamacpp(
    "./models/llama-3.1-8b.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # Use Metal GPU acceleration
    use_mmap=True,    # Efficient memory mapping
    n_threads=8       # Use performance cores
)
```

## vLLM (Production)

### Basic Setup

```python
import outlines

# Load model with vLLM
model = outlines.models.vllm("meta-llama/Llama-3.1-8B-Instruct")

# Use with generator
generator = outlines.generate.json(model, YourModel)
```

### Single GPU

```python
model = outlines.models.vllm(
    "meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory
    max_model_len=4096           # Max sequence length
)
```

### Multi-GPU

```python
# Tensor parallelism (split model across GPUs)
model = outlines.models.vllm(
    "meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,  # Use 4 GPUs
    gpu_memory_utilization=0.9
)

# Pipeline parallelism (rare, for very large models)
model = outlines.models.vllm(
    "meta-llama/Llama-3.1-405B-Instruct",
    pipeline_parallel_size=8,  # 8-GPU pipeline
    tensor_parallel_size=4     # 4-GPU tensor split
    # Total: 32 GPUs
)
```

### Quantization

```python
# AWQ quantization (4-bit)
model = outlines.models.vllm(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization="awq",
    dtype="float16"
)

# GPTQ quantization (4-bit)
model = outlines.models.vllm(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization="gptq"
)

# SqueezeLLM quantization
model = outlines.models.vllm(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization="squeezellm"
)
```

### Advanced Configuration

```python
model = outlines.models.vllm(
    "meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    max_model_len=8192,
    max_num_seqs=256,             # Max concurrent sequences
    max_num_batched_tokens=8192,  # Max tokens per batch
    dtype="float16",
    trust_remote_code=True,
    enforce_eager=False,          # Use CUDA graphs (faster)
    swap_space=4                  # CPU swap space (GB)
)
```

### Batch Processing

```python
# vLLM optimized for high-throughput batch processing
model = outlines.models.vllm(
    "meta-llama/Llama-3.1-8B-Instruct",
    max_num_seqs=128  # Process 128 sequences in parallel
)

generator = outlines.generate.json(model, YourModel)

# Process many prompts efficiently
prompts = ["prompt1", "prompt2", ..., "prompt100"]
results = [generator(p) for p in prompts]
# vLLM automatically batches and optimizes
```

## OpenAI (Limited Support)

### Basic Setup

```python
import outlines

# Basic OpenAI support
model = outlines.models.openai("gpt-4o-mini", api_key="your-api-key")

# Use with generator
generator = outlines.generate.json(model, YourModel)
result = generator("Your prompt")
```

### Configuration

```python
model = outlines.models.openai(
    "gpt-4o-mini",
    api_key="your-api-key",  # Or set OPENAI_API_KEY env var
    max_tokens=2048,
    temperature=0.7
)
```

### Available Models

```python
# GPT-4o (latest)
model = outlines.models.openai("gpt-4o")

# GPT-4o Mini (cost-effective)
model = outlines.models.openai("gpt-4o-mini")

# GPT-4 Turbo
model = outlines.models.openai("gpt-4-turbo")

# GPT-3.5 Turbo
model = outlines.models.openai("gpt-3.5-turbo")
```

**Note**: OpenAI support is limited compared to local models. Some advanced features may not work.

## Backend Comparison

### Feature Matrix

| Feature | Transformers | llama.cpp | vLLM | OpenAI |
|---------|-------------|-----------|------|--------|
| Structured Generation | ✅ Full | ✅ Full | ✅ Full | ⚠️ Limited |
| FSM Optimization | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| GPU Support | ✅ Yes | ✅ Yes | ✅ Yes | N/A |
| Multi-GPU | ✅ Yes | ✅ Yes | ✅ Yes | N/A |
| Quantization | ✅ Yes | ✅ Yes | ✅ Yes | N/A |
| High Throughput | ⚠️ Medium | ⚠️ Medium | ✅ Excellent | ⚠️ API-limited |
| Setup Difficulty | Easy | Medium | Medium | Easy |
| Cost | Hardware | Hardware | Hardware | API usage |

### Performance Characteristics

**Transformers:**
- **Latency**: 50-200ms (single request, GPU)
- **Throughput**: 10-50 tokens/sec (depends on hardware)
- **Memory**: 2-4GB per 1B parameters (FP16)
- **Best for**: Development, small-scale deployment, flexibility

**llama.cpp:**
- **Latency**: 30-150ms (single request)
- **Throughput**: 20-150 tokens/sec (depends on quantization)
- **Memory**: 0.5-2GB per 1B parameters (Q4-Q8)
- **Best for**: CPU inference, Apple Silicon, edge deployment, low memory

**vLLM:**
- **Latency**: 30-100ms (single request)
- **Throughput**: 100-1000+ tokens/sec (batch processing)
- **Memory**: 2-4GB per 1B parameters (FP16)
- **Best for**: Production, high-throughput, batch processing, serving

**OpenAI:**
- **Latency**: 200-500ms (API call)
- **Throughput**: API rate limits
- **Memory**: N/A (cloud-based)
- **Best for**: Quick prototyping, no infrastructure

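The latency and throughput figures above vary widely with GPU, model size, and batch size, so it is worth measuring on your own hardware. A minimal, backend-agnostic harness for that (illustrative sketch; `tokens_per_second` and the stand-in generator are assumptions for the example, not an Outlines API):

```python
# A minimal harness (illustrative) for measuring tokens/sec on your own
# hardware. Pass any Outlines generator as `generate` and your tokenizer's
# encode-and-count as `count_tokens` for a real measurement.
import time

def tokens_per_second(generate, prompts, count_tokens):
    """generate: callable prompt -> text; count_tokens: callable text -> int."""
    start = time.perf_counter()
    outputs = [generate(p) for p in prompts]
    elapsed = time.perf_counter() - start
    total = sum(count_tokens(o) for o in outputs)
    return total / elapsed

# Example with a stand-in "model" (replace with a real generator):
fake_generate = lambda p: p + " answer"
whitespace_tokens = lambda text: len(text.split())
rate = tokens_per_second(fake_generate, ["a b", "c d e"], whitespace_tokens)
print(f"{rate:.0f} tokens/sec")
```
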
### Memory Requirements

**7B Model:**
- FP16: ~14GB
- 8-bit: ~7GB
- 4-bit: ~4GB
- Q4_K_M (GGUF): ~4.5GB

**13B Model:**
- FP16: ~26GB
- 8-bit: ~13GB
- 4-bit: ~7GB
- Q4_K_M (GGUF): ~8GB

**70B Model:**
- FP16: ~140GB (multi-GPU)
- 8-bit: ~70GB (multi-GPU)
- 4-bit: ~35GB (single A100/H100)
- Q4_K_M (GGUF): ~40GB

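These figures follow a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter, plus some overhead. A quick sketch of that arithmetic (weights only; KV cache and activations come on top):

```python
# Rule of thumb behind the table above: weight memory is roughly
# parameters x (bits per parameter / 8). KV cache and activations are extra.
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight-only memory in GB."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

print(f"7B  FP16:  ~{weight_memory_gb(7, 16):.0f}GB")   # ~14GB
print(f"7B  4-bit: ~{weight_memory_gb(7, 4):.1f}GB")    # ~3.5GB
print(f"70B FP16:  ~{weight_memory_gb(70, 16):.0f}GB")  # ~140GB
```
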
## Performance Tuning

### Transformers Optimization

```python
# Use FP16
model = outlines.models.transformers(
    "meta-llama/Llama-3.1-8B-Instruct",
    device="cuda",
    model_kwargs={"torch_dtype": "float16"}
)

# Use flash attention (2-4x faster)
model = outlines.models.transformers(
    "meta-llama/Llama-3.1-8B-Instruct",
    device="cuda",
    model_kwargs={
        "torch_dtype": "float16",
        "attn_implementation": "flash_attention_2"
    }
)

# Use 8-bit quantization (2x less memory)
model = outlines.models.transformers(
    "meta-llama/Llama-3.1-8B-Instruct",
    device="cuda",
    model_kwargs={
        "load_in_8bit": True,
        "device_map": "auto"
    }
)
```

### llama.cpp Optimization

```python
# Maximize GPU usage
model = outlines.models.llamacpp(
    "./models/model.Q4_K_M.gguf",
    n_gpu_layers=-1,  # All layers on GPU
    n_ctx=8192,
    n_batch=512       # Larger batch = faster
)

# Optimize for CPU (Apple Silicon)
model = outlines.models.llamacpp(
    "./models/model.Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=8,  # Use all performance cores
    use_mmap=True
)
```

### vLLM Optimization

```python
# High throughput
model = outlines.models.vllm(
    "meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.95,  # Use 95% of GPU memory
    max_num_seqs=256,             # High concurrency
    enforce_eager=False           # Use CUDA graphs
)

# Multi-GPU
model = outlines.models.vllm(
    "meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,  # 4 GPUs
    gpu_memory_utilization=0.9
)
```

## Production Deployment

### Docker with vLLM

```dockerfile
FROM vllm/vllm-openai:latest

# Install outlines
RUN pip install outlines

# Copy your code
COPY app.py /app/

# Run
CMD ["python", "/app/app.py"]
```

### Environment Variables

```bash
# Transformers cache
export HF_HOME="/path/to/cache"
export TRANSFORMERS_CACHE="/path/to/cache"

# GPU selection
export CUDA_VISIBLE_DEVICES=0,1,2,3

# OpenAI API key
export OPENAI_API_KEY="sk-..."

# Disable tokenizers parallelism warning
export TOKENIZERS_PARALLELISM=false
```

### Model Serving

```python
# Simple HTTP server with vLLM
import outlines
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load model once at startup
model = outlines.models.vllm("meta-llama/Llama-3.1-8B-Instruct")

class User(BaseModel):
    name: str
    age: int
    email: str

generator = outlines.generate.json(model, User)

@app.post("/extract")
def extract(text: str):
    result = generator(f"Extract user from: {text}")
    return result.model_dump()
```

## Resources

- **Transformers**: https://huggingface.co/docs/transformers
- **llama.cpp**: https://github.com/ggerganov/llama.cpp
- **vLLM**: https://docs.vllm.ai
- **Outlines**: https://github.com/outlines-dev/outlines
773
skills/mlops/inference/outlines/references/examples.md
Normal file
@@ -0,0 +1,773 @@
# Production-Ready Examples

Real-world examples of using Outlines for structured generation in production systems.

## Table of Contents
- Data Extraction
- Classification Systems
- Form Processing
- Multi-Entity Extraction
- Code Generation
- Batch Processing
- Production Patterns

## Data Extraction

### Basic Information Extraction

```python
from pydantic import BaseModel, Field
import outlines

class PersonInfo(BaseModel):
    name: str = Field(description="Full name")
    age: int = Field(ge=0, le=120)
    occupation: str
    email: str = Field(pattern=r"^[\w\.-]+@[\w\.-]+\.\w+$")
    location: str

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, PersonInfo)

text = """
Dr. Sarah Johnson is a 42-year-old research scientist at MIT.
She can be reached at sarah.j@mit.edu and currently lives in Cambridge, MA.
"""

prompt = f"Extract person information from:\n{text}\n\nPerson:"
person = generator(prompt)

print(f"Name: {person.name}")
print(f"Age: {person.age}")
print(f"Occupation: {person.occupation}")
print(f"Email: {person.email}")
print(f"Location: {person.location}")
```

### Company Information

```python
from typing import Optional

class CompanyInfo(BaseModel):
    name: str
    founded_year: int = Field(ge=1800, le=2025)
    industry: str
    headquarters: str
    employees: int = Field(gt=0)
    revenue: Optional[str] = None

model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
generator = outlines.generate.json(model, CompanyInfo)

text = """
Tesla, Inc. was founded in 2003 and operates primarily in the automotive
and energy industries. The company is headquartered in Austin, Texas,
and employs approximately 140,000 people worldwide.
"""

company = generator(f"Extract company information:\n{text}\n\nCompany:")

print(f"Company: {company.name}")
print(f"Founded: {company.founded_year}")
print(f"Industry: {company.industry}")
print(f"HQ: {company.headquarters}")
print(f"Employees: {company.employees:,}")
```

### Product Specifications

```python
class ProductSpec(BaseModel):
    name: str
    brand: str
    price: float = Field(gt=0)
    dimensions: str
    weight: str
    features: list[str]
    rating: Optional[float] = Field(None, ge=0, le=5)

generator = outlines.generate.json(model, ProductSpec)

text = """
The Apple iPhone 15 Pro is priced at $999. It measures 146.6 x 70.6 x 8.25 mm
and weighs 187 grams. Key features include the A17 Pro chip, titanium design,
action button, and USB-C port. It has an average customer rating of 4.5 stars.
"""

product = generator(f"Extract product specifications:\n{text}\n\nProduct:")

print(f"Product: {product.brand} {product.name}")
print(f"Price: ${product.price}")
print(f"Features: {', '.join(product.features)}")
```

## Classification Systems

### Sentiment Analysis

```python
from typing import Literal
from enum import Enum

class Sentiment(str, Enum):
    VERY_POSITIVE = "very_positive"
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"
    VERY_NEGATIVE = "very_negative"

class SentimentAnalysis(BaseModel):
    text: str
    sentiment: Sentiment
    confidence: float = Field(ge=0.0, le=1.0)
    aspects: list[str]  # What aspects were mentioned
    reasoning: str

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, SentimentAnalysis)

review = """
This product completely exceeded my expectations! The build quality is
outstanding, and customer service was incredibly helpful. My only minor
complaint is the packaging could be better.
"""

result = generator(f"Analyze sentiment:\n{review}\n\nAnalysis:")

print(f"Sentiment: {result.sentiment.value}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Aspects: {', '.join(result.aspects)}")
print(f"Reasoning: {result.reasoning}")
```

### Content Classification

```python
class Category(str, Enum):
    TECHNOLOGY = "technology"
    BUSINESS = "business"
    SCIENCE = "science"
    POLITICS = "politics"
    ENTERTAINMENT = "entertainment"
    SPORTS = "sports"
    HEALTH = "health"

class ArticleClassification(BaseModel):
    primary_category: Category
    secondary_categories: list[Category]
    keywords: list[str] = Field(min_length=3, max_length=10)
    target_audience: Literal["general", "expert", "beginner"]
    reading_level: Literal["elementary", "intermediate", "advanced"]

generator = outlines.generate.json(model, ArticleClassification)

article = """
Apple announced groundbreaking advancements in its AI capabilities with the
release of iOS 18. The new features leverage machine learning to significantly
improve battery life and overall device performance. Industry analysts predict
this will strengthen Apple's position in the competitive smartphone market.
"""

classification = generator(f"Classify article:\n{article}\n\nClassification:")

print(f"Primary: {classification.primary_category.value}")
print(f"Secondary: {[c.value for c in classification.secondary_categories]}")
print(f"Keywords: {classification.keywords}")
print(f"Audience: {classification.target_audience}")
```

### Intent Recognition

```python
class Intent(str, Enum):
    QUESTION = "question"
    COMPLAINT = "complaint"
    REQUEST = "request"
    FEEDBACK = "feedback"
    CANCEL = "cancel"
    UPGRADE = "upgrade"

class UserMessage(BaseModel):
    original_message: str
    intent: Intent
    urgency: Literal["low", "medium", "high", "critical"]
    department: Literal["support", "sales", "billing", "technical"]
    sentiment: Literal["positive", "neutral", "negative"]
    action_required: bool
    summary: str

generator = outlines.generate.json(model, UserMessage)

message = """
I've been charged twice for my subscription this month! This is the third
time this has happened. I need someone to fix this immediately and refund
the extra charge. Very disappointed with this service.
"""

result = generator(f"Analyze message:\n{message}\n\nAnalysis:")

print(f"Intent: {result.intent.value}")
print(f"Urgency: {result.urgency}")
print(f"Route to: {result.department}")
print(f"Action required: {result.action_required}")
print(f"Summary: {result.summary}")
```

## Form Processing

### Job Application

```python
class Education(BaseModel):
    degree: str
    field: str
    institution: str
    year: int

class Experience(BaseModel):
    title: str
    company: str
    duration: str
    responsibilities: list[str]

class JobApplication(BaseModel):
    full_name: str
    email: str
    phone: str
    education: list[Education]
    experience: list[Experience]
    skills: list[str]
    availability: str

model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
generator = outlines.generate.json(model, JobApplication)

resume_text = """
John Smith
Email: john.smith@email.com | Phone: 555-0123

EDUCATION
- BS in Computer Science, MIT, 2018
- MS in Artificial Intelligence, Stanford, 2020

EXPERIENCE
Software Engineer, Google (2020-2023)
- Developed ML pipelines for search ranking
- Led team of 5 engineers
- Improved search quality by 15%

SKILLS: Python, Machine Learning, TensorFlow, System Design

AVAILABILITY: Immediate
"""

application = generator(f"Extract job application:\n{resume_text}\n\nApplication:")

print(f"Applicant: {application.full_name}")
print(f"Email: {application.email}")
print(f"Education: {len(application.education)} degrees")
for edu in application.education:
    print(f"  - {edu.degree} in {edu.field}, {edu.institution} ({edu.year})")
print(f"Experience: {len(application.experience)} positions")
```

### Invoice Processing

```python
class InvoiceItem(BaseModel):
    description: str
    quantity: int = Field(gt=0)
    unit_price: float = Field(gt=0)
    total: float = Field(gt=0)

class Invoice(BaseModel):
    invoice_number: str
    date: str = Field(pattern=r"\d{4}-\d{2}-\d{2}")
    vendor: str
    customer: str
    items: list[InvoiceItem]
    subtotal: float = Field(gt=0)
    tax: float = Field(ge=0)
    total: float = Field(gt=0)

generator = outlines.generate.json(model, Invoice)

invoice_text = """
INVOICE #INV-2024-001
Date: 2024-01-15

From: Acme Corp
To: Smith & Co

Items:
- Widget A: 10 units @ $50.00 = $500.00
- Widget B: 5 units @ $75.00 = $375.00
- Service Fee: 1 @ $100.00 = $100.00

Subtotal: $975.00
Tax (8%): $78.00
TOTAL: $1,053.00
"""

invoice = generator(f"Extract invoice:\n{invoice_text}\n\nInvoice:")

print(f"Invoice: {invoice.invoice_number}")
print(f"From: {invoice.vendor} → To: {invoice.customer}")
print(f"Items: {len(invoice.items)}")
for item in invoice.items:
    print(f"  - {item.description}: {item.quantity} × ${item.unit_price} = ${item.total}")
print(f"Total: ${invoice.total}")
```

### Survey Responses

```python
class SurveyResponse(BaseModel):
    respondent_id: str
    completion_date: str
    satisfaction: Literal[1, 2, 3, 4, 5]
    would_recommend: bool
    favorite_features: list[str]
    improvement_areas: list[str]
    additional_comments: Optional[str] = None

generator = outlines.generate.json(model, SurveyResponse)

survey_text = """
Survey ID: RESP-12345
Completed: 2024-01-20

How satisfied are you with our product? 4 out of 5

Would you recommend to a friend? Yes

What features do you like most?
- Fast performance
- Easy to use
- Great customer support

What could we improve?
- Better documentation
- More integrations

Additional feedback: Overall great product, keep up the good work!
"""

response = generator(f"Extract survey response:\n{survey_text}\n\nResponse:")

print(f"Respondent: {response.respondent_id}")
print(f"Satisfaction: {response.satisfaction}/5")
print(f"Would recommend: {response.would_recommend}")
print(f"Favorite features: {response.favorite_features}")
print(f"Improvement areas: {response.improvement_areas}")
```

## Multi-Entity Extraction
|
||||
|
||||
### News Article Entities
|
||||
|
||||
```python
|
||||
class Person(BaseModel):
|
||||
name: str
|
||||
role: Optional[str] = None
|
||||
affiliation: Optional[str] = None
|
||||
|
||||
class Organization(BaseModel):
|
||||
name: str
|
||||
type: Optional[str] = None
|
||||
|
||||
class Location(BaseModel):
|
||||
name: str
|
||||
type: Literal["city", "state", "country", "region"]
|
||||
|
||||
class Event(BaseModel):
|
||||
name: str
|
||||
date: Optional[str] = None
|
||||
location: Optional[str] = None
|
||||
|
||||
class ArticleEntities(BaseModel):
|
||||
people: list[Person]
|
||||
organizations: list[Organization]
|
||||
locations: list[Location]
|
||||
events: list[Event]
|
||||
dates: list[str]
|
||||
|
||||
model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
|
||||
generator = outlines.generate.json(model, ArticleEntities)
|
||||
|
||||
article = """
|
||||
Apple CEO Tim Cook met with Microsoft CEO Satya Nadella at Microsoft
|
||||
headquarters in Redmond, Washington on September 15, 2024, to discuss
|
||||
potential collaboration opportunities. The meeting was attended by executives
|
||||
from both companies and focused on AI integration strategies. Apple's
|
||||
Cupertino offices will host a follow-up meeting on October 20, 2024.
|
||||
"""
|
||||
|
||||
entities = generator(f"Extract all entities:\n{article}\n\nEntities:")
|
||||
|
||||
print("People:")
|
||||
for person in entities.people:
|
||||
print(f" - {person.name} ({person.role}) @ {person.affiliation}")
|
||||
|
||||
print("\nOrganizations:")
|
||||
for org in entities.organizations:
|
||||
print(f" - {org.name} ({org.type})")
|
||||
|
||||
print("\nLocations:")
|
||||
for loc in entities.locations:
|
||||
print(f" - {loc.name} ({loc.type})")
|
||||
|
||||
print("\nEvents:")
|
||||
for event in entities.events:
|
||||
print(f" - {event.name} on {event.date}")
|
||||
```
|
||||
|
||||
### Document Metadata

```python
class Author(BaseModel):
    name: str
    email: Optional[str] = None
    affiliation: Optional[str] = None

class Reference(BaseModel):
    title: str
    authors: list[str]
    year: int
    source: str

class DocumentMetadata(BaseModel):
    title: str
    authors: list[Author]
    abstract: str
    keywords: list[str]
    publication_date: str
    journal: str
    doi: Optional[str] = None
    references: list[Reference]

generator = outlines.generate.json(model, DocumentMetadata)

paper = """
Title: Advances in Neural Machine Translation

Authors:
- Dr. Jane Smith (jane@university.edu), MIT
- Prof. John Doe (jdoe@stanford.edu), Stanford University

Abstract: This paper presents novel approaches to neural machine translation
using transformer architectures. We demonstrate significant improvements in
translation quality across multiple language pairs.

Keywords: Neural Networks, Machine Translation, Transformers, NLP

Published: Journal of AI Research, 2024-03-15
DOI: 10.1234/jair.2024.001

References:
1. "Attention Is All You Need" by Vaswani et al., 2017, NeurIPS
2. "BERT: Pre-training of Deep Bidirectional Transformers" by Devlin et al., 2019, NAACL
"""

metadata = generator(f"Extract document metadata:\n{paper}\n\nMetadata:")

print(f"Title: {metadata.title}")
print(f"Authors: {', '.join(a.name for a in metadata.authors)}")
print(f"Keywords: {', '.join(metadata.keywords)}")
print(f"References: {len(metadata.references)}")
```

## Code Generation

### Python Function Generation

```python
class Parameter(BaseModel):
    name: str = Field(pattern=r"^[a-z_][a-z0-9_]*$")
    type_hint: str
    default: Optional[str] = None

class PythonFunction(BaseModel):
    function_name: str = Field(pattern=r"^[a-z_][a-z0-9_]*$")
    parameters: list[Parameter]
    return_type: str
    docstring: str
    body: list[str]  # Lines of code

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, PythonFunction)

spec = "Create a function to calculate the factorial of a number"

func = generator(f"Generate Python function:\n{spec}\n\nFunction:")

print(f"def {func.function_name}(", end="")
print(", ".join(f"{p.name}: {p.type_hint}" for p in func.parameters), end="")
print(f") -> {func.return_type}:")
print(f'    """{func.docstring}"""')
for line in func.body:
    print(f"    {line}")
```

### SQL Query Generation

```python
class SQLQuery(BaseModel):
    query_type: Literal["SELECT", "INSERT", "UPDATE", "DELETE"]
    select_columns: Optional[list[str]] = None
    from_tables: list[str]
    joins: Optional[list[str]] = None
    where_conditions: Optional[list[str]] = None
    group_by: Optional[list[str]] = None
    order_by: Optional[list[str]] = None
    limit: Optional[int] = None

generator = outlines.generate.json(model, SQLQuery)

request = "Get top 10 users who made purchases in the last 30 days, ordered by total spent"

sql = generator(f"Generate SQL query:\n{request}\n\nQuery:")

print(f"Query type: {sql.query_type}")
if sql.select_columns:
    print(f"SELECT {', '.join(sql.select_columns)}")
print(f"FROM {', '.join(sql.from_tables)}")
if sql.joins:
    for join in sql.joins:
        print(f"  {join}")
if sql.where_conditions:
    print(f"WHERE {' AND '.join(sql.where_conditions)}")
if sql.order_by:
    print(f"ORDER BY {', '.join(sql.order_by)}")
if sql.limit:
    print(f"LIMIT {sql.limit}")
```

### API Endpoint Spec

```python
class Parameter(BaseModel):
    name: str
    type: str
    required: bool
    description: str

class APIEndpoint(BaseModel):
    method: Literal["GET", "POST", "PUT", "DELETE", "PATCH"]
    path: str
    description: str
    parameters: list[Parameter]
    request_body: Optional[dict] = None
    response_schema: dict
    status_codes: dict[int, str]

generator = outlines.generate.json(model, APIEndpoint)

spec = "Create user endpoint"

endpoint = generator(f"Generate API endpoint:\n{spec}\n\nEndpoint:")

print(f"{endpoint.method} {endpoint.path}")
print(f"Description: {endpoint.description}")
print("\nParameters:")
for param in endpoint.parameters:
    req = "required" if param.required else "optional"
    print(f" - {param.name} ({param.type}, {req}): {param.description}")
```

## Batch Processing

### Parallel Extraction

```python
def batch_extract(texts: list[str], schema: type[BaseModel], model_name: str):
    """Extract structured data from multiple texts."""
    model = outlines.models.transformers(model_name)
    generator = outlines.generate.json(model, schema)

    results = []
    for i, text in enumerate(texts):
        print(f"Processing {i+1}/{len(texts)}...", end="\r")
        result = generator(f"Extract:\n{text}\n\nData:")
        results.append(result)

    return results

class Product(BaseModel):
    name: str
    price: float
    category: str

texts = [
    "iPhone 15 Pro costs $999 in Electronics",
    "Running Shoes are $89.99 in Sports",
    "Coffee Maker priced at $49.99 in Home & Kitchen"
]

products = batch_extract(texts, Product, "microsoft/Phi-3-mini-4k-instruct")

for product in products:
    print(f"{product.name}: ${product.price} ({product.category})")
```

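Despite the section title, the loop above runs sequentially. When each extraction is a call to a remote inference server (an in-process GPU model gains little from Python threads), a thread pool can genuinely parallelize it. A minimal sketch, with a stand-in extractor in place of a real generator:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_extract(texts, extract, max_workers: int = 4):
    """Run an extraction callable over texts with a thread pool,
    preserving input order in the returned list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, texts))

# Stand-in extractor for illustration; swap in a generator call in practice
results = parallel_extract(["a", "b", "c"], lambda t: t.upper())
print(results)  # ['A', 'B', 'C']
```

`pool.map` keeps results aligned with their inputs even when calls finish out of order.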
### CSV Processing

```python
import csv

def process_csv(csv_file: str, schema: type[BaseModel]):
    """Process a CSV file and extract structured data from each row."""
    model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
    generator = outlines.generate.json(model, schema)

    results = []
    with open(csv_file, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            text = " | ".join(f"{k}: {v}" for k, v in row.items())
            result = generator(f"Extract:\n{text}\n\nData:")
            results.append(result)

    return results

class Customer(BaseModel):
    name: str
    email: str
    tier: Literal["basic", "premium", "enterprise"]
    mrr: float

# customers = process_csv("customers.csv", Customer)
```

## Production Patterns

### Error Handling

```python
from pydantic import ValidationError

def safe_extract(text: str, schema: type[BaseModel], retries: int = 3):
    """Extract with error handling and retries."""
    model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
    generator = outlines.generate.json(model, schema)

    for attempt in range(retries):
        try:
            result = generator(f"Extract:\n{text}\n\nData:")
            return result
        except ValidationError as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == retries - 1:
                raise
        except Exception as e:
            print(f"Unexpected error: {e}")
            if attempt == retries - 1:
                raise

    return None
```

### Caching

```python
import hashlib

# Simple in-process cache keyed on (text hash, schema name)
_cache: dict = {}

def extract_with_cache(text: str, schema: type[BaseModel]):
    """Extract with caching."""
    key = (hashlib.md5(text.encode()).hexdigest(), schema.__name__)
    if key in _cache:
        return _cache[key]

    # Perform the actual extraction
    model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
    generator = outlines.generate.json(model, schema)
    result = generator(f"Extract:\n{text}\n\nData:")

    _cache[key] = result
    return result
```

### Monitoring

```python
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def monitored_extract(text: str, schema: type[BaseModel]):
    """Extract with monitoring and logging."""
    start_time = time.time()

    try:
        model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
        generator = outlines.generate.json(model, schema)

        result = generator(f"Extract:\n{text}\n\nData:")

        elapsed = time.time() - start_time
        logger.info(f"Extraction succeeded in {elapsed:.2f}s")
        logger.info(f"Input length: {len(text)} chars")

        return result

    except Exception as e:
        elapsed = time.time() - start_time
        logger.error(f"Extraction failed after {elapsed:.2f}s: {e}")
        raise
```

### Rate Limiting

```python
import time
from threading import Lock

class RateLimiter:
    def __init__(self, max_requests: int, time_window: int):
        self.max_requests = max_requests
        self.time_window = time_window
        self.requests = []
        self.lock = Lock()

    def wait_if_needed(self):
        with self.lock:
            now = time.time()
            # Remove old requests
            self.requests = [r for r in self.requests if now - r < self.time_window]

            if len(self.requests) >= self.max_requests:
                sleep_time = self.time_window - (now - self.requests[0])
                time.sleep(sleep_time)
                self.requests = []

            self.requests.append(now)

def rate_limited_extract(texts: list[str], schema: type[BaseModel]):
    """Extract with rate limiting."""
    limiter = RateLimiter(max_requests=10, time_window=60)  # 10 req/min
    model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
    generator = outlines.generate.json(model, schema)

    results = []
    for text in texts:
        limiter.wait_if_needed()
        result = generator(f"Extract:\n{text}\n\nData:")
        results.append(result)

    return results
```

## Resources

- **Outlines Documentation**: https://outlines-dev.github.io/outlines
- **Pydantic Documentation**: https://docs.pydantic.dev
- **GitHub Examples**: https://github.com/outlines-dev/outlines/tree/main/examples

652
skills/mlops/inference/outlines/references/json_generation.md
Normal file
@@ -0,0 +1,652 @@
# Comprehensive JSON Generation Guide

Complete guide to JSON generation with Outlines using Pydantic models and JSON schemas.

## Table of Contents
- Pydantic Models
- JSON Schema Support
- Advanced Patterns
- Nested Structures
- Complex Types
- Validation
- Performance Optimization

## Pydantic Models

### Basic Models

```python
from pydantic import BaseModel
import outlines

class User(BaseModel):
    name: str
    age: int
    email: str

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, User)

user = generator("Generate user: Alice, 25, alice@example.com")
print(user.name)   # "Alice"
print(user.age)    # 25
print(user.email)  # "alice@example.com"
```

### Field Constraints

```python
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(min_length=1, max_length=100)
    price: float = Field(gt=0, description="Price in USD")
    discount: float = Field(ge=0, le=100, description="Discount percentage")
    quantity: int = Field(ge=0, description="Available quantity")
    sku: str = Field(pattern=r"^[A-Z]{3}-\d{6}$")

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, Product)

product = generator("Generate product: iPhone 15, $999")
# All fields guaranteed to meet constraints
```

**Available Constraints:**
- `min_length`, `max_length`: String length
- `gt`, `ge`, `lt`, `le`: Numeric comparisons
- `multiple_of`: Number must be a multiple of the given value
- `pattern`: Regex pattern for strings
- `min_items`, `max_items`: List length

### Optional Fields

```python
from typing import Optional

class Article(BaseModel):
    title: str                            # Required
    author: Optional[str] = None          # Optional
    published_date: Optional[str] = None  # Optional
    tags: list[str] = []                  # Default empty list
    view_count: int = 0                   # Default value

generator = outlines.generate.json(model, Article)

# Can generate even if optional fields are missing
article = generator("Title: Introduction to AI")
print(article.author)  # None (not provided)
print(article.tags)    # [] (default)
```

### Default Values

```python
class Config(BaseModel):
    debug: bool = False
    max_retries: int = 3
    timeout: float = 30.0
    log_level: str = "INFO"

# The generator uses defaults when values are not specified
generator = outlines.generate.json(model, Config)
config = generator("Generate config with debug enabled")
print(config.debug)    # True (from prompt)
print(config.timeout)  # 30.0 (default)
```

## Enums and Literals

### Enum Fields

```python
from enum import Enum

class Status(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    CANCELLED = "cancelled"

class Application(BaseModel):
    applicant_name: str
    status: Status  # Must be one of the enum values
    submitted_date: str

generator = outlines.generate.json(model, Application)
app = generator("Generate application for John Doe")

print(app.status)        # Status.PENDING (or one of the other enum values)
print(type(app.status))  # <enum 'Status'>
```

### Literal Types

```python
from typing import Literal

class Task(BaseModel):
    title: str
    priority: Literal["low", "medium", "high", "critical"]
    status: Literal["todo", "in_progress", "done"]
    assigned_to: str

generator = outlines.generate.json(model, Task)
task = generator("Create high priority task: Fix bug")

print(task.priority)  # One of: "low", "medium", "high", "critical"
```

### Multiple Choice Fields

```python
class Survey(BaseModel):
    question: str
    answer: Literal["strongly_disagree", "disagree", "neutral", "agree", "strongly_agree"]
    confidence: Literal["low", "medium", "high"]

generator = outlines.generate.json(model, Survey)
survey = generator("Rate: 'I enjoy using this product'")
```

## Nested Structures

### Nested Models

```python
class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str
    country: str = "USA"

class Person(BaseModel):
    name: str
    age: int
    email: str
    address: Address  # Nested model

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, Person)

prompt = """
Extract person:
Name: Alice Johnson
Age: 28
Email: alice@example.com
Address: 123 Main St, Boston, MA, 02101
"""

person = generator(prompt)
print(person.name)           # "Alice Johnson"
print(person.address.city)   # "Boston"
print(person.address.state)  # "MA"
```

### Deep Nesting

```python
class Coordinates(BaseModel):
    latitude: float
    longitude: float

class Location(BaseModel):
    name: str
    coordinates: Coordinates

class Event(BaseModel):
    title: str
    date: str
    location: Location

generator = outlines.generate.json(model, Event)
event = generator("Generate event: Tech Conference in San Francisco")

print(event.title)                          # "Tech Conference"
print(event.location.name)                  # "San Francisco"
print(event.location.coordinates.latitude)  # e.g. 37.7749
```

### Lists of Nested Models

```python
class Item(BaseModel):
    name: str
    quantity: int
    price: float

class Order(BaseModel):
    order_id: str
    customer: str
    items: list[Item]  # List of nested models
    total: float

generator = outlines.generate.json(model, Order)

prompt = """
Generate order for John:
- 2x Widget ($10 each)
- 3x Gadget ($15 each)
Order ID: ORD-001
"""

order = generator(prompt)
print(f"Order ID: {order.order_id}")
for item in order.items:
    print(f"- {item.quantity}x {item.name} @ ${item.price}")
print(f"Total: ${order.total}")
```

## Complex Types

### Union Types

```python
from typing import Union

class TextContent(BaseModel):
    type: Literal["text"]
    content: str

class ImageContent(BaseModel):
    type: Literal["image"]
    url: str
    caption: str

class Post(BaseModel):
    title: str
    content: Union[TextContent, ImageContent]  # Either type

generator = outlines.generate.json(model, Post)

# Can generate either text or image content
post = generator("Generate blog post with image")
if post.content.type == "text":
    print(post.content.content)
elif post.content.type == "image":
    print(post.content.url)
```

### Lists and Arrays

```python
class Article(BaseModel):
    title: str
    authors: list[str]  # List of strings
    tags: list[str]
    sections: list[dict[str, str]]  # List of dicts
    related_ids: list[int]

generator = outlines.generate.json(model, Article)
article = generator("Generate article about AI")

print(article.authors)  # e.g. ["Alice", "Bob"]
print(article.tags)     # e.g. ["AI", "Machine Learning", "Technology"]
```

### Dictionaries

```python
class Metadata(BaseModel):
    title: str
    properties: dict[str, str]  # String keys and values
    counts: dict[str, int]      # String keys, int values
    settings: dict[str, Union[str, int, bool]]  # Mixed value types

generator = outlines.generate.json(model, Metadata)
meta = generator("Generate metadata")

print(meta.properties)  # e.g. {"author": "Alice", "version": "1.0"}
print(meta.counts)      # e.g. {"views": 1000, "likes": 50}
```

### Any Type (Use Sparingly)

```python
from typing import Any

class FlexibleData(BaseModel):
    name: str
    structured_field: str
    flexible_field: Any  # Can be anything

# Note: Any reduces type safety; use it only when necessary
generator = outlines.generate.json(model, FlexibleData)
```

## JSON Schema Support

### Direct Schema Usage

```python
import outlines

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Define a JSON schema
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0, "maximum": 120},
        "email": {"type": "string", "format": "email"}
    },
    "required": ["name", "age", "email"]
}

# Generate from the schema
generator = outlines.generate.json(model, schema)
result = generator("Generate person: Alice, 25, alice@example.com")

print(result)  # Valid JSON matching the schema
```

### Schema from Pydantic

```python
class User(BaseModel):
    name: str
    age: int
    email: str

# Get the JSON schema from the Pydantic model
schema = User.model_json_schema()
print(schema)
# {
#   "type": "object",
#   "properties": {
#     "name": {"type": "string"},
#     "age": {"type": "integer"},
#     "email": {"type": "string"}
#   },
#   "required": ["name", "age", "email"]
# }

# Both approaches are equivalent:
generator1 = outlines.generate.json(model, User)
generator2 = outlines.generate.json(model, schema)
```

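Outlines constrains generation so the output matches the schema, but a lightweight sanity check on plain dicts is still handy in tests. A minimal stdlib sketch covering only the `required` list and primitive `type` fields of a schema like the one above (not a full JSON Schema validator such as `jsonschema`):

```python
# Map JSON Schema primitive type names to Python types (illustrative subset)
TYPE_MAP = {"string": str, "integer": int, "object": dict}

def check_required(schema: dict, data: dict) -> list[str]:
    """Return a list of problems; an empty list means the data passes."""
    problems = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        expected = TYPE_MAP.get(spec.get("type"))
        if key in data and expected and not isinstance(data[key], expected):
            problems.append(f"{key}: expected {spec['type']}")
    return problems

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
print(check_required(schema, {"name": "Alice", "age": 25}))  # []
print(check_required(schema, {"name": "Alice"}))  # ['missing required field: age']
```

For real validation of formats, ranges, and nesting, use a dedicated validator library.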
## Advanced Patterns

### Conditional Fields

```python
class Order(BaseModel):
    order_type: Literal["standard", "express"]
    delivery_date: str
    express_fee: Optional[float] = None  # Only for express orders

generator = outlines.generate.json(model, Order)

# Express order
order1 = generator("Create express order for tomorrow")
print(order1.express_fee)  # e.g. 25.0

# Standard order
order2 = generator("Create standard order")
print(order2.express_fee)  # None
```

### Recursive Models

```python
from typing import Optional, List

class TreeNode(BaseModel):
    value: str
    children: Optional[List['TreeNode']] = None

# Resolve the forward reference
TreeNode.model_rebuild()

generator = outlines.generate.json(model, TreeNode)
tree = generator("Generate file tree with subdirectories")

print(tree.value)              # e.g. "root"
print(tree.children[0].value)  # e.g. "subdir1"
```

### Model with Validation

```python
from pydantic import field_validator

class DateRange(BaseModel):
    start_date: str
    end_date: str

    @field_validator('end_date')
    def end_after_start(cls, v, info):
        """Ensure end_date is after start_date."""
        if 'start_date' in info.data:
            from datetime import datetime
            start = datetime.strptime(info.data['start_date'], '%Y-%m-%d')
            end = datetime.strptime(v, '%Y-%m-%d')
            if end < start:
                raise ValueError('end_date must be after start_date')
        return v

generator = outlines.generate.json(model, DateRange)
# Validation happens after generation
```

## Multiple Objects

### Generate List of Objects

```python
class Person(BaseModel):
    name: str
    age: int

class Team(BaseModel):
    team_name: str
    members: list[Person]

generator = outlines.generate.json(model, Team)

team = generator("Generate engineering team with 5 members")
print(f"Team: {team.team_name}")
for member in team.members:
    print(f"- {member.name}, {member.age}")
```

### Batch Generation

```python
def generate_batch(prompts: list[str], schema: type[BaseModel]):
    """Generate structured outputs for multiple prompts."""
    model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
    generator = outlines.generate.json(model, schema)

    results = []
    for prompt in prompts:
        result = generator(prompt)
        results.append(result)

    return results

class Product(BaseModel):
    name: str
    price: float

prompts = [
    "Product: iPhone 15, $999",
    "Product: MacBook Pro, $2499",
    "Product: AirPods, $179"
]

products = generate_batch(prompts, Product)
for product in products:
    print(f"{product.name}: ${product.price}")
```

## Performance Optimization

### Caching Generators

```python
from functools import lru_cache

@lru_cache(maxsize=10)
def get_generator(model_name: str, schema: type[BaseModel]):
    """Cache generators for reuse (classes are hashable, so they work as cache keys)."""
    model = outlines.models.transformers(model_name)
    return outlines.generate.json(model, schema)

# First call: creates the generator
gen1 = get_generator("microsoft/Phi-3-mini-4k-instruct", User)

# Second call: returns the cached generator (fast!)
gen2 = get_generator("microsoft/Phi-3-mini-4k-instruct", User)
```

### Batch Processing

```python
# Process multiple items efficiently
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, User)

texts = ["User: Alice, 25", "User: Bob, 30", "User: Carol, 35"]

# Reuse the generator (the model stays loaded)
users = [generator(text) for text in texts]
```

### Minimize Schema Complexity

```python
# ✅ Good: Simple, flat structure (faster)
class SimplePerson(BaseModel):
    name: str
    age: int
    city: str

# ⚠️ Slower: Deep nesting
class ComplexPerson(BaseModel):
    personal_info: PersonalInfo
    address: Address
    employment: Employment
    # ... many nested levels
```

## Error Handling

### Handle Missing Fields

```python
from pydantic import ValidationError

class User(BaseModel):
    name: str
    age: int
    email: str

try:
    user = generator("Generate user")  # May not include all fields
except ValidationError as e:
    print(f"Validation error: {e}")
    # Handle gracefully
```

### Fallback with Optional Fields

```python
class RobustUser(BaseModel):
    name: str                    # Required
    age: Optional[int] = None    # Optional
    email: Optional[str] = None  # Optional

# More likely to succeed even with incomplete data
user = generator("Generate user: Alice")
print(user.name)  # "Alice"
print(user.age)   # None (not provided)
```

## Best Practices

### 1. Use Specific Types

```python
# ✅ Good: Specific types
class Product(BaseModel):
    name: str
    price: float    # Not Any or str
    quantity: int   # Not str
    in_stock: bool  # Not int

# ❌ Bad: Generic types
class Product(BaseModel):
    name: Any
    price: str     # Should be float
    quantity: str  # Should be int
```

### 2. Add Descriptions

```python
# ✅ Good: Clear descriptions
class Article(BaseModel):
    title: str = Field(description="Article title, 10-100 characters")
    content: str = Field(description="Main article content in paragraphs")
    tags: list[str] = Field(description="List of relevant topic tags")

# Descriptions help the model understand the expected output
```

### 3. Use Constraints

```python
# ✅ Good: With constraints
class Age(BaseModel):
    value: int = Field(ge=0, le=120, description="Age in years")

# ❌ Bad: No constraints
class Age(BaseModel):
    value: int  # Could be negative or > 120
```

### 4. Prefer Enums Over Strings

```python
# ✅ Good: Enum for a fixed set
class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class Task(BaseModel):
    priority: Priority  # Guaranteed valid

# ❌ Bad: Free-form string
class Task(BaseModel):
    priority: str  # Could be "urgent", "ASAP", "!!", etc.
```

### 5. Test Your Models

```python
# Test that models behave as expected
def test_product_model():
    product = Product(
        name="Test Product",
        price=19.99,
        quantity=10,
        in_stock=True
    )
    assert product.price == 19.99
    assert isinstance(product, Product)

# Run tests before using in production
```

## Resources

- **Pydantic Docs**: https://docs.pydantic.dev
- **JSON Schema**: https://json-schema.org
- **Outlines GitHub**: https://github.com/outlines-dev/outlines

367
skills/mlops/inference/vllm/SKILL.md
Normal file
@@ -0,0 +1,367 @@
---
name: serving-llms-vllm
description: Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [vllm, torch, transformers]
metadata:
  hermes:
    tags: [vLLM, Inference Serving, PagedAttention, Continuous Batching, High Throughput, Production, OpenAI API, Quantization, Tensor Parallelism]
---

# vLLM - High-Performance LLM Serving

## Quick start

vLLM achieves up to 24x higher throughput than standard transformers through PagedAttention (block-based KV cache management) and continuous batching (mixing prefill and decode requests).

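The block-table idea behind PagedAttention can be sketched in a few lines (a toy illustration, not vLLM's implementation): each sequence's KV cache is a list of fixed-size blocks drawn from a shared pool, so memory is committed one block at a time rather than reserved up front for the maximum sequence length:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class BlockPool:
    """A shared pool of physical KV-cache block indices."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()

class Sequence:
    """Tracks the block table for one request."""
    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.block_table: list[int] = []  # logical block -> physical block
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one is full
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

pool = BlockPool(num_blocks=64)
seq = Sequence(pool)
for _ in range(20):
    seq.append_token()
print(len(seq.block_table))  # 2 blocks cover 20 tokens
```

Because unused capacity stays in the pool, many more concurrent sequences fit in the same GPU memory than with contiguous per-request allocation.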
**Installation**:
```bash
pip install vllm
```

**Basic offline inference**:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
```

**OpenAI-compatible server**:
```bash
vllm serve meta-llama/Llama-3-8B-Instruct

# Query with the OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
```

## Common workflows

### Workflow 1: Production API deployment

Copy this checklist and track progress:

```
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics
```

**Step 1: Configure server settings**

Choose a configuration based on your model size:

```bash
# For 7B-13B models on a single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --port 8000

# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --quantization awq \
    --port 8000

# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --enable-prefix-caching \
    --enable-metrics \
    --metrics-port 9090 \
    --port 8000 \
    --host 0.0.0.0
```

|
||||
|
||||
**Step 2: Test with limited traffic**

Run a load test before going to production:

```bash
# Install load testing tool
pip install locust

# Create test_load.py with sample requests
# Run: locust -f test_load.py --host http://localhost:8000
```

Verify that TTFT (time to first token) is < 500ms and throughput is > 100 req/sec.
**Step 3: Enable monitoring**

vLLM exposes Prometheus metrics on port 9090:

```bash
curl http://localhost:9090/metrics | grep vllm
```

Key metrics to monitor:
- `vllm:time_to_first_token_seconds` - Latency
- `vllm:num_requests_running` - Active requests
- `vllm:gpu_cache_usage_perc` - KV cache utilization
**Step 4: Deploy to production**

Use Docker for consistent deployment:

```bash
# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching
```

**Step 5: Verify performance metrics**

Check that the deployment meets targets:
- TTFT < 500ms (for short prompts)
- Throughput > target req/sec
- GPU utilization > 80%
- No OOM errors in logs
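The targets above can also be checked programmatically by scraping the metrics endpoint from Step 3. A small sketch; the metric names and sample text are illustrative, so match them to what `curl http://localhost:9090/metrics` actually returns for your vLLM version:

```python
# Parse a Prometheus text exposition and extract the first sample value
# for a given metric name. Metric names below are illustrative.


def parse_metric(text: str, name: str):
    """Return the first sample value for `name`, or None if absent."""
    for line in text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    return None


sample = """# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.42
vllm:num_requests_running 3
"""

print(parse_metric(sample, "vllm:gpu_cache_usage_perc"))  # 0.42
```

In practice you would fetch the text with `urllib.request.urlopen("http://localhost:9090/metrics")` and alert when a value crosses the thresholds listed above.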
### Workflow 2: Offline batch inference

For processing large datasets without server overhead.

Copy this checklist:

```
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results
```

**Step 1: Prepare input data**

```python
# Load prompts from file
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f]

print(f"Loaded {len(prompts)} prompts")
```
**Step 2: Configure LLM engine**

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # Use 2 GPUs
    gpu_memory_utilization=0.9,
    max_model_len=4096
)

sampling = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "\n\n"]
)
```
**Step 3: Run batch inference**

vLLM automatically batches requests for efficiency:

```python
# Process all prompts in one call
outputs = llm.generate(prompts, sampling)

# vLLM handles batching internally
# No need to manually chunk prompts
```

**Step 4: Process results**

```python
import json

# Extract generated text
results = []
for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    results.append({
        "prompt": prompt,
        "generated": generated,
        "tokens": len(output.outputs[0].token_ids)
    })

# Save to file
with open("results.jsonl", "w") as f:
    for result in results:
        f.write(json.dumps(result) + "\n")

print(f"Processed {len(results)} prompts")
```
### Workflow 3: Quantized model serving

Fit large models in limited GPU memory.

```
Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy
```

**Step 1: Choose quantization method**

- **AWQ**: Best for 70B models, minimal accuracy loss
- **GPTQ**: Wide model support, good compression
- **FP8**: Fastest on H100 GPUs

**Step 2: Find or create quantized model**

Use pre-quantized models from HuggingFace:

```bash
# Search for AWQ models
# Example: TheBloke/Llama-2-70B-AWQ
```
**Step 3: Launch with quantization flag**

```bash
# Using pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95

# Results: 70B model in ~40GB VRAM
```

**Step 4: Verify accuracy**

Test that outputs match expected quality:

```python
# Compare quantized vs non-quantized responses
# Verify task-specific performance unchanged
```
## When to use vs alternatives

**Use vLLM when:**
- Deploying production LLM APIs (100+ req/sec)
- Serving OpenAI-compatible endpoints
- Limited GPU memory but need large models
- Multi-user applications (chatbots, assistants)
- Need low latency with high throughput

**Use alternatives instead:**
- **llama.cpp**: CPU/edge inference, single-user
- **HuggingFace transformers**: Research, prototyping, one-off generation
- **TensorRT-LLM**: NVIDIA-only, need absolute maximum performance
- **Text-Generation-Inference**: Already in HuggingFace ecosystem
## Common issues

**Issue: Out of memory during model loading**

Reduce memory usage:
```bash
vllm serve MODEL \
  --gpu-memory-utilization 0.7 \
  --max-model-len 4096
```

Or use quantization:
```bash
vllm serve MODEL --quantization awq
```

**Issue: Slow first token (TTFT > 1 second)**

Enable prefix caching for repeated prompts:
```bash
vllm serve MODEL --enable-prefix-caching
```

For long prompts, enable chunked prefill:
```bash
vllm serve MODEL --enable-chunked-prefill
```

**Issue: Model not found error**

Use `--trust-remote-code` for models that ship custom code:
```bash
vllm serve MODEL --trust-remote-code
```

**Issue: Low throughput (<50 req/sec)**

Increase concurrent sequences:
```bash
vllm serve MODEL --max-num-seqs 512
```

Check GPU utilization with `nvidia-smi` - it should be >80%.

**Issue: Inference slower than expected**

Verify tensor parallelism uses a power-of-two GPU count:
```bash
vllm serve MODEL --tensor-parallel-size 4  # Not 3
```

Enable speculative decoding for faster generation:
```bash
vllm serve MODEL --speculative-model DRAFT_MODEL
```
## Advanced topics

**Server deployment patterns**: See [references/server-deployment.md](references/server-deployment.md) for Docker, Kubernetes, and load balancing configurations.

**Performance optimization**: See [references/optimization.md](references/optimization.md) for PagedAttention tuning, continuous batching details, and benchmark results.

**Quantization guide**: See [references/quantization.md](references/quantization.md) for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.

**Troubleshooting**: See [references/troubleshooting.md](references/troubleshooting.md) for detailed error messages, debugging steps, and performance diagnostics.

## Hardware requirements

- **Small models (7B-13B)**: 1x A10 (24GB) or A100 (40GB)
- **Medium models (30B-40B)**: 2x A100 (40GB) with tensor parallelism
- **Large models (70B+)**: 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ

Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs

## Resources

- Official docs: https://docs.vllm.ai
- GitHub: https://github.com/vllm-project/vllm
- Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
- Community: https://discuss.vllm.ai
226
skills/mlops/inference/vllm/references/optimization.md
Normal file
@@ -0,0 +1,226 @@
# Performance Optimization

## Contents
- PagedAttention explained
- Continuous batching mechanics
- Prefix caching strategies
- Speculative decoding setup
- Benchmark results and comparisons
- Performance tuning guide

## PagedAttention explained

**Traditional attention problem**:
- KV cache stored in contiguous memory
- Wastes ~50% GPU memory due to fragmentation
- Cannot dynamically reallocate for varying sequence lengths

**PagedAttention solution**:
- Divides KV cache into fixed-size blocks (like OS virtual memory)
- Dynamic allocation from free block queue
- Shares blocks across sequences (for prefix caching)
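The block-based allocation above can be illustrated with a toy allocator. This is a deliberately simplified sketch, not vLLM's actual implementation: each sequence holds a block table mapping logical token positions to physical blocks, and blocks return to the free pool when the sequence finishes.

```python
# Toy sketch of PagedAttention-style block allocation.
BLOCK_SIZE = 16  # tokens per block, vLLM's default


class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}                     # seq id -> block table

    def append_token(self, seq_id: str, pos: int) -> None:
        """Allocate a new block only when a sequence crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:  # first token of a new block
            table.append(self.free.pop())

    def release(self, seq_id: str) -> None:
        """Finished sequences return their blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))


alloc = BlockAllocator(num_blocks=8)
for pos in range(40):            # a 40-token sequence
    alloc.append_token("seq0", pos)
print(len(alloc.tables["seq0"]))  # 3 blocks: ceil(40 / 16)
```

Because allocation happens one block at a time, unused capacity is bounded by one partially filled block per sequence instead of a worst-case contiguous reservation.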
**Memory savings example**:
```
Traditional: 70B model needs 160GB KV cache → OOM on 8x A100
PagedAttention: 70B model needs 80GB KV cache → Fits on 4x A100
```

**Configuration**:
```bash
# Block size (default: 16 tokens)
vllm serve MODEL --block-size 16

# Number of GPU blocks (auto-calculated)
# Controlled by --gpu-memory-utilization
vllm serve MODEL --gpu-memory-utilization 0.9
```
## Continuous batching mechanics

**Traditional batching**:
- Wait for all sequences in batch to finish
- GPU idle while waiting for longest sequence
- Low GPU utilization (~40-60%)

**Continuous batching**:
- Add new requests as slots become available
- Mix prefill (new requests) and decode (ongoing) in same batch
- High GPU utilization (>90%)

**Throughput improvement**:
```
Traditional batching: 50 req/sec @ 50% GPU util
Continuous batching: 200 req/sec @ 90% GPU util
= 4x throughput improvement
```

**Tuning parameters**:
```bash
# Max concurrent sequences (higher = more batching)
vllm serve MODEL --max-num-seqs 256

# Prefill/decode schedule (auto-balanced by default)
# No manual tuning needed
```
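The utilization gap between the two strategies can be seen in a toy scheduler model. This is a deliberately simplified sketch with made-up output lengths; a real scheduler also weighs prefill cost and KV-cache space:

```python
# Toy comparison: static batching holds every slot until the longest
# request finishes; continuous batching refills a slot immediately.
lengths = [10, 20, 80, 15]  # decode steps needed per request, batch of 4

# Static batching: all 4 slots are occupied for max(lengths) steps.
static_slot_steps = len(lengths) * max(lengths)   # 4 * 80 = 320 slot-steps
busy_steps = sum(lengths)                         # 125 useful slot-steps
static_util = busy_steps / static_slot_steps

# Continuous batching: a freed slot is reused at once, so no idle tail
# in this toy model.
continuous_util = busy_steps / busy_steps         # 1.0

print(f"static utilization: {static_util:.2%}")       # ~39%
print(f"continuous utilization: {continuous_util:.2%}")
```

The short requests stranded behind the 80-step request are exactly the idle capacity that continuous batching reclaims.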
## Prefix caching strategies

Reuse computed KV cache for common prompt prefixes.

**Use cases**:
- System prompts repeated across requests
- Few-shot examples in every prompt
- RAG contexts with overlapping chunks

**Example savings**:
```
Prompt: [System: 500 tokens] + [User: 100 tokens]

Without caching: Compute 600 tokens every request
With caching: Compute 500 tokens once, then 100 tokens/request
= 83% faster TTFT
```
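The savings figure follows directly from the prefill token counts. A back-of-the-envelope model that treats prefill cost as linear in the number of new tokens:

```python
# Back-of-the-envelope prefill savings from prefix caching.
# Assumes prefill cost scales linearly with new tokens computed.
system_tokens = 500   # cached prefix, computed once at warm-up
user_tokens = 100     # computed per request

without_cache = system_tokens + user_tokens   # 600 tokens every request
with_cache = user_tokens                      # 100 tokens once cache is warm

savings = 1 - with_cache / without_cache
print(f"prefill work saved per request: {savings:.0%}")  # 83%
```

The larger the shared prefix relative to the per-request suffix, the closer the TTFT reduction gets to 100%.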
**Enable prefix caching**:
```bash
vllm serve MODEL --enable-prefix-caching
```

**Automatic prefix detection**:
- vLLM detects common prefixes automatically
- No code changes required
- Works with OpenAI-compatible API

**Cache hit rate monitoring**:
```bash
curl http://localhost:9090/metrics | grep cache_hit
# vllm_cache_hit_rate: 0.75 (75% hit rate)
```
## Speculative decoding setup

Use a smaller "draft" model to propose tokens and the larger model to verify them.

**Speed improvement**:
```
Standard: Generate 1 token per forward pass
Speculative: Generate 3-5 tokens per forward pass
= 2-3x faster generation
```

**How it works**:
1. Draft model proposes K tokens (fast)
2. Target model verifies all K tokens in parallel (one pass)
3. Accept verified tokens, restart from first rejection
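The propose/verify/accept cycle above can be sketched on token ids. A toy model for the greedy-decoding case, where `draft_tokens` and `target_tokens` stand in for the two models' outputs at the same positions:

```python
# Toy sketch of the speculative decoding accept rule (greedy case):
# keep draft tokens until the first position where the target disagrees,
# then take the target's token at that position and start a new round.


def accept_tokens(draft_tokens, target_tokens):
    """Return the tokens kept from one propose/verify round."""
    kept = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            kept.append(d)   # verified: draft matched the target
        else:
            kept.append(t)   # first mismatch: take target's token, stop
            break
    return kept


# Draft proposed 5 tokens; target agrees on the first 3.
print(accept_tokens([4, 8, 15, 16, 23], [4, 8, 15, 42, 7]))  # [4, 8, 15, 42]
```

Every round therefore emits at least one token (the target's), so output quality matches the target model while the draft model amortizes the forward passes.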
**Setup with separate draft model**:
```bash
vllm serve meta-llama/Llama-3-70B-Instruct \
  --speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --num-speculative-tokens 5
```

**Setup with n-gram draft** (no separate model):
```bash
vllm serve MODEL \
  --speculative-method ngram \
  --num-speculative-tokens 3
```

**When to use**:
- Output length > 100 tokens
- Draft model 5-10x smaller than target
- Acceptable 2-3% accuracy trade-off
## Benchmark results

**vLLM vs HuggingFace Transformers** (Llama 3 8B, A100):
```
Metric                  | HF Transformers | vLLM   | Improvement
------------------------|-----------------|--------|------------
Throughput (req/sec)    | 12              | 280    | 23x
TTFT (ms)               | 850             | 120    | 7x
Tokens/sec              | 45              | 2,100  | 47x
GPU Memory (GB)         | 28              | 16     | 1.75x less
```

**vLLM vs TensorRT-LLM** (Llama 2 70B, 4x A100):
```
Metric                  | TensorRT-LLM | vLLM         | Notes
------------------------|--------------|--------------|------------------
Throughput (req/sec)    | 320          | 285          | TRT 12% faster
Setup complexity        | High         | Low          | vLLM much easier
NVIDIA-only             | Yes          | No           | vLLM multi-platform
Quantization support    | FP8, INT8    | AWQ/GPTQ/FP8 | vLLM more options
```
## Performance tuning guide

**Step 1: Measure baseline**

```bash
# Optional: locust for API-level load tests (see server workflows)
pip install locust

# Run baseline benchmark with vLLM's built-in tool
vllm bench throughput \
  --model MODEL \
  --input-tokens 128 \
  --output-tokens 256 \
  --num-prompts 1000

# Record: throughput, TTFT, tokens/sec
```
**Step 2: Tune memory utilization**

```bash
# Try different values: 0.7, 0.85, 0.9, 0.95
vllm serve MODEL --gpu-memory-utilization 0.9
```

Higher = more batch capacity = higher throughput, but risk of OOM.

**Step 3: Tune concurrency**

```bash
# Try values: 128, 256, 512, 1024
vllm serve MODEL --max-num-seqs 256
```

Higher = more batching opportunity, but may increase latency.
**Step 4: Enable optimizations**

```bash
# --enable-prefix-caching: for repeated prompts
# --enable-chunked-prefill: for long prompts
vllm serve MODEL \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 512
```
**Step 5: Re-benchmark and compare**

Target improvements:
- Throughput: +30-100%
- TTFT: -20-50%
- GPU utilization: >85%

**Common performance issues**:

**Low throughput (<50 req/sec)**:
- Increase `--max-num-seqs`
- Enable `--enable-prefix-caching`
- Check GPU utilization (should be >80%)

**High TTFT (>1 second)**:
- Enable `--enable-chunked-prefill`
- Reduce `--max-model-len` if possible
- Check if model is too large for GPU

**OOM errors**:
- Reduce `--gpu-memory-utilization` to 0.7
- Reduce `--max-model-len`
- Use quantization (`--quantization awq`)
284
skills/mlops/inference/vllm/references/quantization.md
Normal file
@@ -0,0 +1,284 @@
# Quantization Guide

## Contents
- Quantization methods comparison
- AWQ setup and usage
- GPTQ setup and usage
- FP8 quantization (H100)
- Model preparation
- Accuracy vs compression trade-offs

## Quantization methods comparison

| Method | Compression | Accuracy Loss | Speed | Best For |
|--------|-------------|---------------|-------|----------|
| **AWQ** | 4-bit (75%) | <1% | Fast | 70B models, production |
| **GPTQ** | 4-bit (75%) | 1-2% | Fast | Wide model support |
| **FP8** | 8-bit (50%) | <0.5% | Fastest | H100 GPUs only |
| **SqueezeLLM** | 3-4 bit (75-80%) | 2-3% | Medium | Extreme compression |

**Recommendation**:
- **Production**: Use AWQ for 70B models
- **H100 GPUs**: Use FP8 for best speed
- **Maximum compatibility**: Use GPTQ
- **Extreme compression**: Use SqueezeLLM
## AWQ setup and usage

**AWQ** (Activation-aware Weight Quantization) achieves the best accuracy at 4-bit.

**Step 1: Find pre-quantized model**

Search HuggingFace for AWQ models:
```bash
# Example: TheBloke/Llama-2-70B-AWQ
# Example: TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
```

**Step 2: Launch with AWQ**

```bash
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95
```

**Memory savings**:
```
Llama 2 70B fp16: 140GB VRAM (4x A100 needed)
Llama 2 70B AWQ: 35GB VRAM (1x A100 40GB)
= 4x memory reduction
```
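The memory figures above follow from parameter count times bytes per weight. Rough arithmetic that ignores activations, the KV cache, and quantization metadata overhead:

```python
# Rough weight-memory estimate: params * bytes_per_weight.
# Ignores KV cache, activations, and quantization metadata overhead.
params = 70e9  # Llama 2 70B

fp16_gb = params * 2 / 1e9    # fp16 -> 2 bytes per weight
awq_gb = params * 0.5 / 1e9   # 4-bit -> 0.5 bytes per weight

print(f"fp16: {fp16_gb:.0f} GB, AWQ 4-bit: {awq_gb:.0f} GB "
      f"({fp16_gb / awq_gb:.0f}x reduction)")  # 140 GB vs 35 GB, 4x
```

The same arithmetic explains the table above: halving bits per weight halves weight memory, independent of architecture.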
**Step 3: Verify performance**

Test that outputs are acceptable:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Test complex reasoning
response = client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Explain quantum entanglement"}]
)

print(response.choices[0].message.content)
# Verify quality matches your requirements
```

**Quantize your own model** (requires GPU with 80GB+ VRAM):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-70b-hf"
quant_path = "llama-2-70b-awq"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
## GPTQ setup and usage

**GPTQ** has the widest model support and good compression.

**Step 1: Find GPTQ model**

```bash
# Example: TheBloke/Llama-2-13B-GPTQ
# Example: TheBloke/CodeLlama-34B-GPTQ
```

**Step 2: Launch with GPTQ**

```bash
vllm serve TheBloke/Llama-2-13B-GPTQ \
  --quantization gptq \
  --dtype float16
```

**GPTQ configuration options**:
```bash
# Specify GPTQ parameters if needed
# --gptq-act-order enables activation ordering
vllm serve MODEL \
  --quantization gptq \
  --gptq-act-order \
  --dtype float16
```
**Quantize your own model**:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"
quantized_name = "llama-2-13b-gptq"

# Define the quantization config before loading the model
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True
)

# Load model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

# Prepare calibration data
calib_data = [...]  # List of sample texts

# Quantize
model.quantize(calib_data)

# Save
model.save_quantized(quantized_name)
```
## FP8 quantization (H100)

**FP8** (8-bit floating point) offers the best speed on H100 GPUs with minimal accuracy loss.

**Requirements**:
- H100 or H800 GPU
- CUDA 12.3+ (12.8 recommended)
- Hopper architecture support

**Step 1: Enable FP8**

```bash
vllm serve meta-llama/Llama-3-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 2
```

**Performance gains on H100**:
```
fp16: 180 tokens/sec
FP8: 320 tokens/sec
= 1.8x speedup
```

**Step 2: Verify accuracy**

FP8 typically has <0.5% accuracy degradation:
```python
# Run evaluation suite
# Compare FP8 vs FP16 on your tasks
# Verify acceptable accuracy
```

**Dynamic FP8 quantization** (no pre-quantized model needed):

```bash
# vLLM automatically quantizes at runtime
vllm serve MODEL --quantization fp8
# No model preparation required
```
## Model preparation

**Pre-quantized models (easiest)**:

1. Search HuggingFace: `[model name] AWQ` or `[model name] GPTQ`
2. Download or use directly: `TheBloke/[Model]-AWQ`
3. Launch with the appropriate `--quantization` flag

**Quantize your own model**:

**AWQ**:
```bash
# Install AutoAWQ
pip install autoawq

# Run quantization script
python quantize_awq.py --model MODEL --output OUTPUT
```

**GPTQ**:
```bash
# Install AutoGPTQ
pip install auto-gptq

# Run quantization script
python quantize_gptq.py --model MODEL --output OUTPUT
```

**Calibration data**:
- Use 128-512 diverse examples from the target domain
- Representative of production inputs
- Higher quality calibration = better accuracy
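A simple way to assemble such a calibration set is to sample from a log of production inputs. A sketch; `prompt_log` is a placeholder for your own data source (for example, lines read from a file), and the sample size follows the 128-512 guideline above:

```python
# Sample a small, diverse calibration set from a larger prompt log.
# Replace `prompt_log` with your own data source; 128-512 examples
# from the production domain is the usual range.
import random


def sample_calibration(lines, n=128, seed=0):
    """De-duplicate, drop empty lines, and sample up to n examples."""
    pool = sorted({line.strip() for line in lines if line.strip()})
    random.Random(seed).shuffle(pool)
    return pool[:n]


prompt_log = ["Summarize this ticket...", "Translate to French...", ""] * 100
calib_data = sample_calibration(prompt_log, n=128)
print(f"calibration examples: {len(calib_data)}")  # 2 unique prompts here
```

De-duplicating first keeps one over-represented prompt from dominating the calibration statistics; the fixed seed makes the quantization run reproducible.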
## Accuracy vs compression trade-offs

**Empirical results** (Llama 2 70B on MMLU benchmark):

| Quantization | Accuracy | Memory | Speed | Production-Ready |
|--------------|----------|--------|-------|------------------|
| FP16 (baseline) | 100% | 140GB | 1.0x | ✅ (if memory available) |
| FP8 | 99.5% | 70GB | 1.8x | ✅ (H100 only) |
| AWQ 4-bit | 99.0% | 35GB | 1.5x | ✅ (best for 70B) |
| GPTQ 4-bit | 98.5% | 35GB | 1.5x | ✅ (good compatibility) |
| SqueezeLLM 3-bit | 96.0% | 26GB | 1.3x | ⚠️ (check accuracy) |

**When to use each**:

**No quantization (FP16)**:
- Have sufficient GPU memory
- Need absolute best accuracy
- Model <13B parameters

**FP8**:
- Using H100/H800 GPUs
- Need best speed with minimal accuracy loss
- Production deployment

**AWQ 4-bit**:
- Need to fit 70B model in 40GB GPU
- Production deployment
- <1% accuracy loss acceptable

**GPTQ 4-bit**:
- Wide model support needed
- Not on H100 (use FP8 instead)
- 1-2% accuracy loss acceptable

**Testing strategy**:

1. **Baseline**: Measure FP16 accuracy on your evaluation set
2. **Quantize**: Create quantized version
3. **Evaluate**: Compare quantized vs baseline on same tasks
4. **Decide**: Accept if degradation < threshold (typically 1-2%)
**Example evaluation** (pseudocode — `evaluate` and `load_evaluation_suite` stand in for your own evaluation harness):
```python
from evaluate import load_evaluation_suite

eval_suite = load_evaluation_suite()

# Run on FP16 baseline
baseline_score = evaluate(model_fp16, eval_suite)

# Run on quantized
quant_score = evaluate(model_awq, eval_suite)

# Compare
degradation = (baseline_score - quant_score) / baseline_score * 100
print(f"Accuracy degradation: {degradation:.2f}%")

# Decision
if degradation < 1.0:
    print("✅ Quantization acceptable for production")
else:
    print("⚠️ Review accuracy loss")
```
255
skills/mlops/inference/vllm/references/server-deployment.md
Normal file
@@ -0,0 +1,255 @@
# Server Deployment Patterns

## Contents
- Docker deployment
- Kubernetes deployment
- Load balancing with Nginx
- Multi-node distributed serving
- Production configuration examples
- Health checks and monitoring

## Docker deployment

**Basic Dockerfile**:
```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3-pip
RUN pip install vllm

EXPOSE 8000

CMD ["vllm", "serve", "meta-llama/Llama-3-8B-Instruct", \
     "--host", "0.0.0.0", "--port", "8000", \
     "--gpu-memory-utilization", "0.9"]
```

**Build and run**:
```bash
docker build -t vllm-server .
docker run --gpus all -p 8000:8000 vllm-server
```

**Docker Compose** (with metrics):
```yaml
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model meta-llama/Llama-3-8B-Instruct
      --gpu-memory-utilization 0.9
      --enable-metrics
      --metrics-port 9090
    ports:
      - "8000:8000"
      - "9090:9090"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
## Kubernetes deployment

**Deployment manifest**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - "--model=meta-llama/Llama-3-8B-Instruct"
          - "--gpu-memory-utilization=0.9"
          - "--enable-prefix-caching"
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
          - containerPort: 8000
            name: http
          - containerPort: 9090
            name: metrics
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
      name: http
    - port: 9090
      targetPort: 9090
      name: metrics
  type: LoadBalancer
```
## Load balancing with Nginx

**Nginx configuration**:
```nginx
upstream vllm_backend {
    least_conn;  # Route to least-loaded server
    server localhost:8001;
    server localhost:8002;
    server localhost:8003;
}

server {
    listen 80;

    location / {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Timeouts for long-running inference
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;
    }

    # Metrics endpoint
    location /metrics {
        proxy_pass http://localhost:9090/metrics;
    }
}
```

**Start multiple vLLM instances**:
```bash
# Terminal 1
vllm serve MODEL --port 8001 --tensor-parallel-size 1

# Terminal 2
vllm serve MODEL --port 8002 --tensor-parallel-size 1

# Terminal 3
vllm serve MODEL --port 8003 --tensor-parallel-size 1

# Start Nginx
nginx -c /path/to/nginx.conf
```
## Multi-node distributed serving

For models too large for a single node:

**Node 1** (master):
```bash
export MASTER_ADDR=192.168.1.10
export MASTER_PORT=29500
export RANK=0
export WORLD_SIZE=2

vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
```

**Node 2** (worker):
```bash
export MASTER_ADDR=192.168.1.10
export MASTER_PORT=29500
export RANK=1
export WORLD_SIZE=2

vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
```
## Production configuration examples

**High throughput** (batch-heavy workload):
```bash
vllm serve MODEL \
  --max-num-seqs 512 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching \
  --trust-remote-code
```

**Low latency** (interactive workload):
```bash
vllm serve MODEL \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill
```

**Memory-constrained** (40GB GPU for a 70B model):
```bash
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 4096
```
## Health checks and monitoring

**Health check endpoint**:
```bash
curl http://localhost:8000/health
# Returns: {"status": "ok"}
```

**Readiness check** (wait for model loaded):
```bash
#!/bin/bash
until curl -f http://localhost:8000/health; do
  echo "Waiting for vLLM to be ready..."
  sleep 5
done
echo "vLLM is ready!"
```

**Prometheus scraping**:
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'
    scrape_interval: 15s
```

**Grafana dashboard** (key metrics):
- Requests per second: `rate(vllm_request_success_total[5m])`
- TTFT p50: `histogram_quantile(0.5, vllm_time_to_first_token_seconds_bucket)`
- TTFT p99: `histogram_quantile(0.99, vllm_time_to_first_token_seconds_bucket)`
- GPU cache usage: `vllm_gpu_cache_usage_perc`
- Active requests: `vllm_num_requests_running`
447
skills/mlops/inference/vllm/references/troubleshooting.md
Normal file
@@ -0,0 +1,447 @@
# Troubleshooting Guide
|
||||
|
||||
## Contents
|
||||
- Out of memory (OOM) errors
|
||||
- Performance issues
|
||||
- Model loading errors
|
||||
- Network and connection issues
|
||||
- Quantization problems
|
||||
- Distributed serving issues
|
||||
- Debugging tools and commands
|
||||
|
||||
## Out of memory (OOM) errors

### Symptom: `torch.cuda.OutOfMemoryError` during model loading

**Cause**: Model weights plus KV cache exceed available VRAM.

**Solutions (try in order)**:

1. **Reduce GPU memory utilization**:
```bash
vllm serve MODEL --gpu-memory-utilization 0.7  # Try 0.7, 0.75, 0.8
```

2. **Reduce max sequence length**:
```bash
vllm serve MODEL --max-model-len 4096  # Instead of 8192
```

3. **Enable quantization**:
```bash
vllm serve MODEL --quantization awq  # ~4x memory reduction
```

4. **Use tensor parallelism** (multiple GPUs):
```bash
vllm serve MODEL --tensor-parallel-size 2  # Split across 2 GPUs
```

5. **Reduce max concurrent sequences**:
```bash
vllm serve MODEL --max-num-seqs 128  # Default is 256
```
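To see why these knobs help, it is worth estimating KV-cache size directly. A back-of-envelope sketch (the layer/head numbers below are Llama-3-8B-style values used as an illustrative assumption; real deployments add further overhead):

```python
# Rough sketch: estimate KV-cache memory to see why long contexts OOM.
# Per token: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element.
def kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()     # 131072 bytes per token
per_seq_8k = per_token * 8192 / 2**30      # GiB for one 8k-token sequence
print(f"{per_token} B/token, {per_seq_8k:.2f} GiB per 8k-token sequence")
# → 131072 B/token, 1.00 GiB per 8k-token sequence
```

Halving `--max-model-len` halves the worst-case per-sequence figure, which is why it is one of the first levers to pull.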
### Symptom: OOM during inference (not model loading)

**Cause**: The KV cache fills up during generation.

**Solutions**:

```bash
# Reduce KV cache allocation
vllm serve MODEL --gpu-memory-utilization 0.85

# Reduce batch size
vllm serve MODEL --max-num-seqs 64

# Reduce max tokens per request
# Set in client request: max_tokens=512
```
### Symptom: OOM with quantized model

**Cause**: Quantization overhead or incorrect configuration.

**Solution**:
```bash
# Ensure the quantization flag matches the model
vllm serve TheBloke/Llama-2-70B-AWQ --quantization awq  # Must specify

# Try a different dtype
vllm serve MODEL --quantization awq --dtype float16
```
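A quick way to check whether a quantized checkpoint should fit at all is to compute the weight footprint per precision. A rough sketch (weights only; it ignores KV cache, activations, and quantization scales/zero-points):

```python
# Sketch: back-of-envelope weight memory at different precisions.
def weight_gib(n_params_billion, bits):
    """GiB needed to hold n_params_billion parameters at `bits` per weight."""
    return n_params_billion * 1e9 * bits / 8 / 2**30

for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit ≈ {weight_gib(70, bits):.0f} GiB")
# → 70B @ 16-bit ≈ 130 GiB, @ 8-bit ≈ 65 GiB, @ 4-bit ≈ 33 GiB
```

If even the 4-bit figure exceeds your VRAM, no flag combination will save you; you need tensor parallelism or a smaller model.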
## Performance issues

### Symptom: Low throughput (e.g. <50 req/sec where >100 is expected)

**Diagnostic steps**:

1. **Check GPU utilization**:
```bash
watch -n 1 nvidia-smi
# GPU utilization should be >80%
```

If <80%, increase concurrent requests:
```bash
vllm serve MODEL --max-num-seqs 512  # Increase from 256
```

2. **Check if memory-bound**:
```bash
# If memory is at 100% but GPU utilization is <80%, reduce sequence length
vllm serve MODEL --max-model-len 4096
```

3. **Enable optimizations**:
```bash
vllm serve MODEL \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 512
```

4. **Check tensor parallelism settings**:
```bash
# Use a power-of-2 GPU count
vllm serve MODEL --tensor-parallel-size 4  # Not 3 or 5
```
### Symptom: High TTFT (time to first token >1 second)

**Causes and solutions**:

**Long prompts**:
```bash
vllm serve MODEL --enable-chunked-prefill
```

**No prefix caching**:
```bash
vllm serve MODEL --enable-prefix-caching  # For repeated prompts
```

**Too many concurrent requests**:
```bash
vllm serve MODEL --max-num-seqs 64  # Reduce to prioritize latency
```

**Model too large for a single GPU**:
```bash
vllm serve MODEL --tensor-parallel-size 2  # Parallelize prefill
```
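The prefix-caching win is easy to picture at the token level: requests that share a system prompt share a KV-cache prefix, and only the differing suffix needs prefill. A toy sketch (token IDs are made up):

```python
# Toy sketch: requests sharing a system prompt share a KV-cache prefix,
# so only the suffix needs prefill when prefix caching is enabled.
def shared_prefix_len(a, b):
    """Length of the common leading run of two token-ID sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system_prompt = list(range(50))           # stand-in for 50 system-prompt tokens
req_a = system_prompt + [101, 102]
req_b = system_prompt + [201, 202, 203]

cached = shared_prefix_len(req_a, req_b)
print(f"{cached} of {len(req_b)} tokens skip prefill on the second request")
# → 50 of 53 tokens skip prefill on the second request
```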
### Symptom: Slow token generation (low tokens/sec)

**Diagnostic**:
```bash
# Confirm the expected model size in the startup logs
vllm serve MODEL

# Try speculative decoding with a smaller draft model
vllm serve MODEL --speculative-model DRAFT_MODEL
```

**For H100 GPUs**, enable FP8:
```bash
vllm serve MODEL --quantization fp8
```
## Model loading errors

### Symptom: `OSError: MODEL not found`

**Causes**:

1. **Model name typo**:
```bash
# Check the exact model name on HuggingFace
vllm serve meta-llama/Llama-3-8B-Instruct  # Correct capitalization
```

2. **Private/gated model**:
```bash
# Login to HuggingFace first
huggingface-cli login
# Then run vLLM
vllm serve meta-llama/Llama-3-70B-Instruct
```

3. **Custom model needs trust flag**:
```bash
vllm serve MODEL --trust-remote-code
```
### Symptom: `ValueError: Tokenizer not found`

**Solution**:
```bash
# Download the tokenizer manually first
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('MODEL')"

# Then launch vLLM
vllm serve MODEL
```
### Symptom: `ImportError: No module named 'flash_attn'`

**Solution**:
```bash
# Install flash attention
pip install flash-attn --no-build-isolation

# Or disable flash attention
vllm serve MODEL --disable-flash-attn
```
## Network and connection issues

### Symptom: `Connection refused` when querying the server

**Diagnostic**:

1. **Check the server is running**:
```bash
curl http://localhost:8000/health
```

2. **Check port binding**:
```bash
# Bind to all interfaces for remote access
vllm serve MODEL --host 0.0.0.0 --port 8000

# Check if the port is in use
lsof -i :8000
```

3. **Check the firewall**:
```bash
# Allow the port through the firewall
sudo ufw allow 8000
```
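The same port check can be done from Python, which is handy inside readiness scripts. A sketch using only the standard library:

```python
# Sketch: programmatic equivalent of checking whether a server is listening.
import socket

def port_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check whether a vLLM server is listening locally
print(port_open("127.0.0.1", 8000))
```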
### Symptom: Slow response times over the network

**Solutions**:

1. **Increase the client timeout**:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    timeout=300.0,  # 5 minute timeout
)
```

2. **Check network latency**:
```bash
ping SERVER_IP  # Should be <10ms on a local network
```

3. **Use connection pooling**:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=1)
session.mount('http://', HTTPAdapter(max_retries=retries))
```
## Quantization problems

### Symptom: `RuntimeError: Quantization format not supported`

**Solution**:
```bash
# Ensure the quantization method matches the checkpoint
vllm serve MODEL --quantization awq   # For AWQ models
vllm serve MODEL --quantization gptq  # For GPTQ models

# Check the model card for the quantization type
```
### Symptom: Poor quality outputs after quantization

**Diagnostic**:

1. **Verify the model is correctly quantized**:
```bash
# Check the model's config.json for a quantization_config section
cat ~/.cache/huggingface/hub/models--MODEL/config.json
```

2. **Try a different quantization method**:
```bash
# If AWQ has quality issues, try FP8 (H100 only)
vllm serve MODEL --quantization fp8

# Or use no quantization at all
vllm serve MODEL
```

3. **Increase temperature for better diversity**:
```python
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```
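To quantify "poor quality" rather than eyeball it, a common approach is to compare mean negative log-likelihood (the basis of perplexity) of the quantized and unquantized models on the same held-out tokens; lower is better. A sketch with made-up per-token probabilities:

```python
# Sketch: compare model quality via mean negative log-likelihood (NLL)
# over the probabilities each model assigned to the same reference tokens.
import math

def mean_nll(token_probs):
    """Mean NLL; exp(mean_nll) is the perplexity."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

fp16_probs = [0.60, 0.55, 0.70]   # illustrative per-token probabilities
quant_probs = [0.50, 0.40, 0.65]  # a degraded quantized model

print(mean_nll(fp16_probs) < mean_nll(quant_probs))  # → True (lower = better)
```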
## Distributed serving issues

### Symptom: `RuntimeError: Distributed init failed`

**Diagnostic**:

1. **Check environment variables**:
```bash
# On all nodes
echo $MASTER_ADDR  # Should be the same on every node
echo $MASTER_PORT  # Should be the same on every node
echo $RANK         # Should be unique per node (0, 1, 2, ...)
echo $WORLD_SIZE   # Should be the same (total number of nodes)
```

2. **Check network connectivity**:
```bash
# From node 1 to node 2
ping NODE2_IP
nc -zv NODE2_IP 29500  # Check port accessibility
```

3. **Check NCCL settings**:
```bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0  # Or your network interface
vllm serve MODEL --tensor-parallel-size 8
```
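A quick sanity check of those variables before launch can save a failed distributed init. A sketch (the four variable names are the torch.distributed conventions listed above; launchers may read additional ones):

```python
# Sketch: validate MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE before launch.
def check_dist_env(env):
    """Return a list of problems found in a dict of environment variables."""
    problems = []
    for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
        if key not in env:
            problems.append(f"missing {key}")
    if not problems:
        rank, world = int(env["RANK"]), int(env["WORLD_SIZE"])
        if not (0 <= rank < world):
            problems.append(f"RANK {rank} outside [0, {world})")
    return problems

ok_env = {"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500",
          "RANK": "1", "WORLD_SIZE": "2"}
print(check_dist_env(ok_env))  # → []
```

Run it with `os.environ` on each node; an empty list means the basics are consistent.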
### Symptom: `NCCL error: unhandled cuda error`

**Solutions**:

```bash
# Point NCCL at the correct network interface
export NCCL_SOCKET_IFNAME=eth0  # Replace with your interface

# Increase the timeout
export NCCL_TIMEOUT=1800  # 30 minutes

# Disable P2P for debugging
export NCCL_P2P_DISABLE=1
```
## Debugging tools and commands

### Enable debug logging

```bash
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL
```

### Monitor GPU usage

```bash
# Real-time GPU monitoring
watch -n 1 nvidia-smi

# Memory breakdown, refreshed every second
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1
```
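The CSV query output is easy to post-process when you want utilization logged over time. A sketch that parses one line of `--format=csv,noheader` output (the sample line is illustrative):

```python
# Sketch: parse one line of `nvidia-smi --query-gpu=memory.used,memory.free
# --format=csv,noheader` output into numbers.
sample = "20432 MiB, 4112 MiB"  # illustrative one-GPU line

def parse_mem_csv(line):
    """Return (used_mib, free_mib) from a 'N MiB, M MiB' line."""
    used, free = (int(field.strip().split()[0]) for field in line.split(","))
    return used, free

used, free = parse_mem_csv(sample)
print(f"used={used} MiB free={free} MiB util={used / (used + free):.0%}")
# → used=20432 MiB free=4112 MiB util=83%
```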
### Profile performance

```bash
# Built-in benchmarking
vllm bench throughput \
  --model MODEL \
  --input-tokens 128 \
  --output-tokens 256 \
  --num-prompts 100

vllm bench latency \
  --model MODEL \
  --input-tokens 128 \
  --output-tokens 256 \
  --batch-size 8
```
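Whatever tool produces the raw counts, the derived throughput metrics are simple arithmetic. A sketch using the benchmark shape above (100 prompts × 256 output tokens; the wall-clock figure is illustrative, not measured):

```python
# Sketch: turn a benchmark run's raw counts into throughput figures.
num_requests = 100                 # matches --num-prompts 100 above
total_output_tokens = 100 * 256    # matches --output-tokens 256
wall_clock_s = 40.0                # illustrative, not measured

req_per_s = num_requests / wall_clock_s
tok_per_s = total_output_tokens / wall_clock_s
print(f"{req_per_s:.1f} req/s, {tok_per_s:.0f} output tok/s")
# → 2.5 req/s, 640 output tok/s
```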
### Check metrics

```bash
# Prometheus metrics
curl http://localhost:9090/metrics

# Filter for specific metrics
curl http://localhost:9090/metrics | grep vllm_time_to_first_token

# Key metrics to monitor:
# - vllm_time_to_first_token_seconds
# - vllm_time_per_output_token_seconds
# - vllm_num_requests_running
# - vllm_gpu_cache_usage_perc
# - vllm_request_success_total
```
### Test server health

```bash
# Health check
curl http://localhost:8000/health

# Model info
curl http://localhost:8000/v1/models

# Test completion
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MODEL",
    "prompt": "Hello",
    "max_tokens": 10
  }'
```
### Common environment variables

```bash
# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1,2,3  # Limit to specific GPUs

# vLLM settings
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_TRACE_FUNCTION=1  # Profile functions
export VLLM_USE_V1=1          # Use the v1 engine (faster)

# NCCL settings (distributed)
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0  # Enable InfiniBand
```
### Collect diagnostic info for bug reports

```bash
# System info
nvidia-smi
python --version
pip show vllm

# vLLM version and config
vllm --version
python -c "import vllm; print(vllm.__version__)"

# Run with debug logging
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL 2>&1 | tee vllm_debug.log

# Include in the bug report:
# - vllm_debug.log
# - nvidia-smi output
# - The full command used
# - Expected vs actual behavior
```
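The collection steps above can be scripted so the report is reproducible. A sketch that gathers command outputs into one string while tolerating missing tools (command names taken from the list above):

```python
# Sketch: gather diagnostic command outputs into one report string.
import platform
import subprocess
import sys

def collect_report(cmds):
    """Run each command, concatenate outputs, and note missing tools."""
    sections = [f"python {sys.version.split()[0]} on {platform.platform()}"]
    for cmd in cmds:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True,
                                    timeout=30)
            sections.append(f"$ {' '.join(cmd)}\n{result.stdout.strip()}")
        except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
            sections.append(f"$ {' '.join(cmd)}\n(unavailable: {type(exc).__name__})")
    return "\n\n".join(sections)

report = collect_report([["nvidia-smi"], ["vllm", "--version"]])
print(report.splitlines()[0])  # Python/platform summary line
```

Write the result to a file alongside `vllm_debug.log` when filing the report.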