Sync all skills and memories 2026-04-14 07:27

This commit is contained in:
2026-04-14 07:27:20 +09:00
parent 516bb44fe6
commit 1eba2bca95
386 changed files with 167655 additions and 0 deletions


@@ -0,0 +1,3 @@
---
description: Model serving, quantization (GGUF/GPTQ), structured output, inference optimization, and model surgery tools for deploying and running LLMs.
---


@@ -0,0 +1,430 @@
---
name: gguf-quantization
description: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [llama-cpp-python>=0.2.0]
metadata:
hermes:
tags: [GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization]
---
# GGUF - Quantization Format for llama.cpp
GGUF (GPT-Generated Unified Format) is the standard model file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
## When to use GGUF
**Use GGUF when:**
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Need CPU inference without GPU requirements
- Want flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)
**Key advantages:**
- **Universal hardware**: CPU, Apple Silicon, NVIDIA, AMD support
- **No Python runtime**: Pure C/C++ inference
- **Flexible quantization**: 2-8 bit with various methods (K-quants)
- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more
- **imatrix**: Importance matrix for better low-bit quality
**Use alternatives instead:**
- **AWQ/GPTQ**: Maximum accuracy with calibration on NVIDIA GPUs
- **HQQ**: Fast calibration-free quantization for HuggingFace
- **bitsandbytes**: Simple integration with transformers library
- **TensorRT-LLM**: Production NVIDIA deployment with maximum speed
## Quick start
### Installation
```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build (CPU)
make
# Build with CUDA (NVIDIA)
make GGML_CUDA=1
# Build with Metal (Apple Silicon)
make GGML_METAL=1
# Install Python bindings (optional)
pip install llama-cpp-python
```
### Convert model to GGUF
```bash
# Install requirements
pip install -r requirements.txt
# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf
# Or specify output type
python convert_hf_to_gguf.py ./path/to/model \
--outfile model-f16.gguf \
--outtype f16
```
### Quantize model
```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
### Run inference
```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"
# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive
# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```
## Quantization types
### K-quant methods (recommended)
| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | **Recommended default** |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
### Legacy methods
| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |
**Recommendation**: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
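The sizes in the table follow directly from bits per weight. A quick back-of-the-envelope estimator (the bits-per-weight values mirror the table above; real files run slightly larger because of metadata and mixed-precision tensors):

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8 bytes.
# Bits-per-weight values mirror the table above; actual files are a
# little larger due to metadata and mixed-precision tensors.
BITS_PER_WEIGHT = {
    "Q2_K": 2.5, "Q3_K_M": 3.3, "Q4_K_M": 4.5,
    "Q5_K_M": 5.5, "Q6_K": 6.0, "Q8_0": 8.0, "F16": 16.0,
}

def estimate_size_gb(n_params: float, quant: str) -> float:
    """Approximate on-disk size in GB for a given quantization."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in ("Q4_K_M", "Q6_K", "Q8_0"):
    print(f"7B @ {q}: ~{estimate_size_gb(7e9, q):.1f} GB")
```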
## Conversion workflows
### Workflow 1: HuggingFace to GGUF
```bash
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
--outfile llama-3.1-8b-f16.gguf \
--outtype f16
# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```
### Workflow 2: With importance matrix (better quality)
```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF
# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
-f calibration.txt \
--chunk 512 \
-o model.imatrix \
-ngl 35 # GPU layers if available
# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
model-f16.gguf \
model-q4_k_m.gguf \
Q4_K_M
```
### Workflow 3: Multiple quantizations
```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"
# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```
## Python usage
### llama-cpp-python
```python
from llama_cpp import Llama
# Load model
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096, # Context window
n_gpu_layers=35, # GPU offload (0 for CPU only)
n_threads=8 # CPU threads
)
# Generate
output = llm(
"What is machine learning?",
max_tokens=256,
temperature=0.7,
stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```
### Chat completion
```python
from llama_cpp import Llama
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
chat_format="llama-3" # Or "chatml", "mistral", etc.
)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"}
]
response = llm.create_chat_completion(
messages=messages,
max_tokens=256,
temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```
### Streaming
```python
from llama_cpp import Llama
llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)
# Stream tokens
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```
## Server mode
### Start OpenAI-compatible server
```bash
# Start server
./llama-server -m model-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 35 \
-c 4096
# Or with Python bindings
python -m llama_cpp.server \
--model model-q4_k_m.gguf \
--n_gpu_layers 35 \
--host 0.0.0.0 \
--port 8080
```
### Use with OpenAI client
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256
)
print(response.choices[0].message.content)
```
## Hardware optimization
### Apple Silicon (Metal)
```bash
# Build with Metal
make clean && make GGML_METAL=1
# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

```python
# Python with Metal
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```
### NVIDIA CUDA
```bash
# Build with CUDA
make clean && make GGML_CUDA=1
# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"
# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```
### CPU optimization
```bash
# Build with AVX2/AVX512
make clean && make
# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"
```

```python
# Python CPU config
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,   # CPU only
    n_threads=8,      # Match physical cores
    n_batch=512       # Batch size for prompt processing
)
```
## Integration with tools
### Ollama
```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
# Create Ollama model
ollama create mymodel -f Modelfile
# Run
ollama run mymodel "Hello!"
```
### LM Studio
1. Place GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload
4. Start inference
### text-generation-webui
```bash
# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/
# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```
## Best practices
1. **Use K-quants**: Q4_K_M offers best quality/size balance
2. **Use imatrix**: Always use importance matrix for Q4 and below
3. **GPU offload**: Offload as many layers as VRAM allows
4. **Context length**: Start with 4096, increase if needed
5. **Thread count**: Match physical CPU cores, not logical
6. **Batch size**: Increase n_batch for faster prompt processing
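A sketch wiring these practices into one llama-cpp-python configuration (the model path and layer count are placeholders; the physical-core guess assumes 2-way SMT since `os.cpu_count()` reports logical cores):

```python
import os

def suggested_threads() -> int:
    # os.cpu_count() reports logical cores; halve it as a rough guess
    # at physical cores on machines with 2-way SMT (practice 5)
    return max(1, (os.cpu_count() or 2) // 2)

config = {
    "model_path": "model-q4_k_m.gguf",  # Q4_K_M default (practice 1)
    "n_gpu_layers": 35,                 # as many as VRAM allows (3)
    "n_ctx": 4096,                      # start here, grow if needed (4)
    "n_threads": suggested_threads(),   # physical cores (5)
    "n_batch": 512,                     # faster prompt processing (6)
}
# from llama_cpp import Llama
# llm = Llama(**config)
```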
## Common issues
**Model loads slowly:**
```bash
# Memory mapping is enabled by default; make sure --no-mmap is not set,
# and use --mlock to keep the model resident in RAM
./llama-cli -m model.gguf --mlock
```
**Out of memory:**
```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20 # Reduce from 35
# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```
**Poor quality at low bits:**
```bash
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
## References
- **[Advanced Usage](references/advanced-usage.md)** - Batching, speculative decoding, custom builds
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks
## Resources
- **Repository**: https://github.com/ggml-org/llama.cpp
- **Python Bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized Models**: https://huggingface.co/TheBloke
- **GGUF Converter**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT


@@ -0,0 +1,504 @@
# GGUF Advanced Usage Guide
## Speculative Decoding
### Draft Model Approach
```bash
# Use smaller model as draft for faster generation
./llama-speculative \
-m large-model-q4_k_m.gguf \
-md draft-model-q4_k_m.gguf \
-p "Write a story about AI" \
-n 500 \
--draft 8 # Draft tokens before verification
```
### Self-Speculative Decoding
```bash
# Prompt-lookup decoding: draft tokens from n-grams already in the context
./llama-lookup -m model-q4_k_m.gguf \
--lookup-cache-static lookup.bin \
--lookup-cache-dynamic lookup-dynamic.bin \
-p "Hello world"
```
## Batched Inference
### Process Multiple Prompts
```python
from llama_cpp import Llama
llm = Llama(
model_path="model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
n_batch=512 # Larger batch for parallel processing
)
prompts = [
"What is Python?",
"Explain machine learning.",
"Describe neural networks."
]
# Process prompts sequentially (each call reuses the model with a fresh context)
for prompt in prompts:
    output = llm(prompt, max_tokens=100)
    print(f"Q: {prompt}")
    print(f"A: {output['choices'][0]['text']}\n")
```
### Server Batching
```bash
# Start server with batching
./llama-server -m model-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 35 \
-c 4096 \
--parallel 4 \
--cont-batching # 4 concurrent slots with continuous batching
```
## Custom Model Conversion
### Convert with Vocabulary Modifications
```python
# custom_convert.py
# Add special tokens to the tokenizer, then run the stock converter.
import subprocess
import sys

def convert_with_custom_vocab(model_path, output_path):
    from transformers import AutoTokenizer
    # Load and extend the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.add_special_tokens(
        {"additional_special_tokens": ["<|custom|>"]}
    )
    tokenizer.save_pretrained(model_path)
    # NOTE: if tokens were added, the model's embedding matrix must be
    # resized to match the new vocab size before conversion.
    # convert_hf_to_gguf.py parses sys.argv, so run it as a subprocess
    subprocess.run(
        [sys.executable, "llama.cpp/convert_hf_to_gguf.py",
         model_path, "--outfile", output_path],
        check=True,
    )
```
### Convert Specific Architecture
```bash
# For Mistral-style models
python convert_hf_to_gguf.py ./mistral-model \
--outfile mistral-f16.gguf \
--outtype f16
# For Qwen models
python convert_hf_to_gguf.py ./qwen-model \
--outfile qwen-f16.gguf \
--outtype f16
# For Phi models
python convert_hf_to_gguf.py ./phi-model \
--outfile phi-f16.gguf \
--outtype f16
```
## Advanced Quantization
### Mixed Quantization
```bash
# Quantize different layer types differently
# Flags go before the positional arguments
./llama-quantize --allow-requantize --leave-output-tensor \
model-f16.gguf model-mixed.gguf Q4_K_M
```
### Quantization with Token Embeddings
```bash
# Keep embeddings at higher precision
./llama-quantize --token-embedding-type f16 \
model-f16.gguf model-q4.gguf Q4_K_M
```
### IQ Quantization (Importance-aware)
```bash
# Ultra-low bit quantization with importance
./llama-quantize --imatrix model.imatrix \
model-f16.gguf model-iq2_xxs.gguf IQ2_XXS
# Available IQ types: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS
```
## Memory Optimization
### Memory Mapping
```python
from llama_cpp import Llama
# Use memory mapping for large models
llm = Llama(
model_path="model-q4_k_m.gguf",
use_mmap=True, # Memory map the model
use_mlock=False, # Don't lock in RAM
n_gpu_layers=35
)
```
### Partial GPU Offload
```python
# Calculate layers to offload based on VRAM
import subprocess
def get_free_vram_gb():
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.free', '--format=csv,nounits,noheader'],
        capture_output=True, text=True
    )
    # nvidia-smi prints one line per GPU; take the first
    return int(result.stdout.strip().splitlines()[0]) / 1024

# Estimate layers based on VRAM (rough: 0.5GB per layer for 7B Q4)
free_vram = get_free_vram_gb()
layers_to_offload = int(free_vram / 0.5)
llm = Llama(
model_path="model-q4_k_m.gguf",
n_gpu_layers=min(layers_to_offload, 35) # Cap at total layers
)
```
### KV Cache Optimization
```python
from llama_cpp import Llama
# Optimize KV cache for long contexts
llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=8192,       # Large context
    n_gpu_layers=35,
    flash_attn=True,  # Required when quantizing the V cache
    type_k=8,         # GGML_TYPE_Q8_0 for the K cache
    type_v=8,         # GGML_TYPE_Q8_0 for the V cache
    # Or use 2 (GGML_TYPE_Q4_0) for more compression
)
```
## Context Management
### Context Shifting
```python
from llama_cpp import Llama
llm = Llama(
model_path="model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35
)
# Handle long conversations with context shifting
conversation = []
max_history = 10

def chat(user_message):
    conversation.append({"role": "user", "content": user_message})
    # Trim history in place (rebinding the name inside the function
    # would make it local and raise UnboundLocalError)
    if len(conversation) > max_history * 2:
        conversation[:] = conversation[-max_history * 2:]
    response = llm.create_chat_completion(
        messages=conversation,
        max_tokens=256
    )
    assistant_message = response["choices"][0]["message"]["content"]
    conversation.append({"role": "assistant", "content": assistant_message})
    return assistant_message
```
### Save and Load State
```bash
# Save state to file
./llama-cli -m model.gguf \
-p "Once upon a time" \
--save-session session.bin \
-n 100
# Load and continue
./llama-cli -m model.gguf \
--load-session session.bin \
-p " and they lived" \
-n 100
```
## Grammar Constrained Generation
### JSON Output
```python
from llama_cpp import Llama, LlamaGrammar
# Define JSON grammar
json_grammar = LlamaGrammar.from_string('''
root ::= object
object ::= "{" ws pair ("," ws pair)* "}" ws
pair ::= string ":" ws value
value ::= string | number | object | array | "true" | "false" | "null"
array ::= "[" ws value ("," ws value)* "]" ws
string ::= "\\"" [^"\\\\]* "\\""
number ::= [0-9]+
ws ::= [ \\t\\n]*
''')
llm = Llama(model_path="model-q4_k_m.gguf", n_gpu_layers=35)
output = llm(
"Output a JSON object with name and age:",
grammar=json_grammar,
max_tokens=100
)
print(output["choices"][0]["text"])
```
### Custom Grammar
```python
# Grammar for specific format
answer_grammar = LlamaGrammar.from_string('''
root ::= "Answer: " letter "\\n" "Explanation: " explanation
letter ::= [A-D]
explanation ::= [a-zA-Z0-9 .,!?]+
''')
output = llm(
"Q: What is 2+2? A) 3 B) 4 C) 5 D) 6",
grammar=answer_grammar,
max_tokens=100
)
```
## LoRA Integration
### Load LoRA Adapter
```bash
# Apply a LoRA adapter at runtime
./llama-cli -m base-model-q4_k_m.gguf \
--lora lora-adapter.gguf \
-p "Hello!"

# With an explicit scale, use --lora-scaled FNAME SCALE
./llama-cli -m base-model-q4_k_m.gguf \
--lora-scaled lora-adapter.gguf 1.0 \
-p "Hello!"
```
### Multiple LoRA Adapters
```bash
# Stack multiple adapters with individual scales
./llama-cli -m base-model.gguf \
--lora-scaled adapter1.gguf 0.5 \
--lora-scaled adapter2.gguf 0.5 \
-p "Hello!"
```
### Python LoRA Usage
```python
from llama_cpp import Llama
llm = Llama(
model_path="base-model-q4_k_m.gguf",
lora_path="lora-adapter.gguf",
lora_scale=1.0,
n_gpu_layers=35
)
```
## Embedding Generation
### Extract Embeddings
```python
from llama_cpp import Llama
llm = Llama(
model_path="model-q4_k_m.gguf",
embedding=True, # Enable embedding mode
n_gpu_layers=35
)
# Get embeddings
embeddings = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(embeddings)}")
```
### Batch Embeddings
```python
texts = [
"Machine learning is fascinating.",
"Deep learning uses neural networks.",
"Python is a programming language."
]
embeddings = [llm.embed(text) for text in texts]
# Calculate similarity
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
sim = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity: {sim:.4f}")
```
## Performance Tuning
### Benchmark Script
```python
import time
from llama_cpp import Llama
def benchmark(model_path, prompt, n_tokens=100, n_runs=5):
    llm = Llama(
        model_path=model_path,
        n_gpu_layers=35,
        n_ctx=2048,
        verbose=False
    )
    # Warmup
    llm(prompt, max_tokens=10)
    # Benchmark
    times = []
    for _ in range(n_runs):
        start = time.time()
        output = llm(prompt, max_tokens=n_tokens)
        elapsed = time.time() - start
        times.append(elapsed)
    avg_time = sum(times) / len(times)
    tokens_per_sec = n_tokens / avg_time
    print(f"Model: {model_path}")
    print(f"Avg time: {avg_time:.2f}s")
    print(f"Tokens/sec: {tokens_per_sec:.1f}")
    return tokens_per_sec
# Compare quantizations
for quant in ["q4_k_m", "q5_k_m", "q8_0"]:
benchmark(f"model-{quant}.gguf", "Explain quantum computing:", 100)
```
### Optimal Configuration Finder
```python
def find_optimal_config(model_path, target_vram_gb=8):
    """Find optimal n_gpu_layers and n_batch for target VRAM."""
    import gc
    import time
    from llama_cpp import Llama

    best_config = None
    best_speed = 0
    for n_gpu_layers in range(0, 50, 5):
        for n_batch in [128, 256, 512, 1024]:
            try:
                gc.collect()
                llm = Llama(
                    model_path=model_path,
                    n_gpu_layers=n_gpu_layers,
                    n_batch=n_batch,
                    n_ctx=2048,
                    verbose=False
                )
                # Quick benchmark
                start = time.time()
                llm("Hello", max_tokens=50)
                speed = 50 / (time.time() - start)
                if speed > best_speed:
                    best_speed = speed
                    best_config = {
                        "n_gpu_layers": n_gpu_layers,
                        "n_batch": n_batch,
                        "speed": speed
                    }
                del llm
                gc.collect()
            except Exception:
                print(f"OOM at layers={n_gpu_layers}, batch={n_batch}")
                break  # skip larger batches at this layer count
    return best_config
```
## Multi-GPU Setup
### Distribute Across GPUs
```bash
# Split model across multiple GPUs
./llama-cli -m large-model.gguf \
--tensor-split 0.5,0.5 \
-ngl 60 \
-p "Hello!"
```
### Python Multi-GPU
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
from llama_cpp import Llama
llm = Llama(
model_path="large-model-q4_k_m.gguf",
n_gpu_layers=60,
tensor_split=[0.5, 0.5] # Split evenly across 2 GPUs
)
```
## Custom Builds
### Build with All Optimizations
```bash
# Clean build with OpenBLAS for faster CPU prompt processing
make clean
GGML_OPENBLAS=1 make -j
# With CUDA (GGML_CUDA replaces the deprecated LLAMA_CUBLAS flag)
make clean
GGML_CUDA=1 make -j
# With specific CUDA architecture
GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_86 make -j
```
### CMake Build
```bash
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j
```


@@ -0,0 +1,442 @@
# GGUF Troubleshooting Guide
## Installation Issues
### Build Fails
**Error**: `make: *** No targets specified and no makefile found`
**Fix**:
```bash
# Ensure you're in llama.cpp directory
cd llama.cpp
make
```
**Error**: `fatal error: cuda_runtime.h: No such file or directory`
**Fix**:
```bash
# Install CUDA toolkit
# Ubuntu
sudo apt install nvidia-cuda-toolkit
# Or set CUDA path
export CUDA_PATH=/usr/local/cuda
export PATH=$CUDA_PATH/bin:$PATH
make GGML_CUDA=1
```
### Python Bindings Issues
**Error**: `ERROR: Failed building wheel for llama-cpp-python`
**Fix**:
```bash
# Install build dependencies
pip install cmake scikit-build-core
# For CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# For Metal (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```
**Error**: `ImportError: libcudart.so.XX: cannot open shared object file`
**Fix**:
```bash
# Add CUDA libraries to path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# Or reinstall with correct CUDA version
pip uninstall llama-cpp-python
CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```
## Conversion Issues
### Model Not Supported
**Error**: `KeyError: 'model.embed_tokens.weight'`
**Fix**:
```bash
# Check model architecture
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./model').architectures)"
# Use appropriate conversion script
# For most models:
python convert_hf_to_gguf.py ./model --outfile model.gguf
# For older models, check if legacy script needed
```
### Vocabulary Mismatch
**Error**: `RuntimeError: Vocabulary size mismatch`
**Fix**:
```python
# Ensure tokenizer matches model
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("./model")
model = AutoModelForCausalLM.from_pretrained("./model")
print(f"Tokenizer vocab size: {len(tokenizer)}")
print(f"Model vocab size: {model.config.vocab_size}")
# If mismatch, resize embeddings before conversion
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained("./model-fixed")
```
### Out of Memory During Conversion
**Error**: `torch.cuda.OutOfMemoryError` during conversion
**Fix**:
```bash
# Use CPU for conversion
CUDA_VISIBLE_DEVICES="" python convert_hf_to_gguf.py ./model --outfile model.gguf
# Or use low memory mode
python convert_hf_to_gguf.py ./model --outfile model.gguf --outtype f16
```
## Quantization Issues
### Wrong Output File Size
**Problem**: Quantized file is larger than expected
**Check**:
```bash
# Verify quantization type
./llama-cli -m model.gguf --verbose
# Expected sizes for 7B model:
# Q4_K_M: ~4.1 GB
# Q5_K_M: ~4.8 GB
# Q8_0: ~7.2 GB
# F16: ~13.5 GB
```
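A small sketch to turn a file size into effective bits per weight for comparison against these figures (the parameter count must be supplied; metadata pushes the number slightly high):

```python
import os

def bits_per_weight(path: str, n_params: float) -> float:
    """Effective bits/weight from on-disk size. Slightly overestimates,
    since GGUF metadata and higher-precision tensors are included."""
    return os.path.getsize(path) * 8 / n_params

# e.g. a ~4.1 GB file for a 7B model should land near 4.5 (Q4_K_M)
# print(bits_per_weight("model-q4_k_m.gguf", 7e9))
```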
### Quantization Crashes
**Error**: `Segmentation fault` during quantization
**Fix**:
```bash
# Increase stack size
ulimit -s unlimited
# Or use fewer threads (thread count is the final positional argument)
./llama-quantize model-f16.gguf model-q4.gguf Q4_K_M 4
```
### Poor Quality After Quantization
**Problem**: Model outputs gibberish after quantization
**Solutions**:
1. **Use importance matrix**:
```bash
# Generate imatrix with good calibration data
./llama-imatrix -m model-f16.gguf \
-f wiki_sample.txt \
--chunk 512 \
-o model.imatrix
# Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
2. **Try higher precision**:
```bash
# Use Q5_K_M or Q6_K instead of Q4
./llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M
```
3. **Check original model**:
```bash
# Test FP16 version first
./llama-cli -m model-f16.gguf -p "Hello, how are you?" -n 50
```
## Inference Issues
### Slow Generation
**Problem**: Generation is slower than expected
**Solutions**:
1. **Enable GPU offload**:
```bash
./llama-cli -m model.gguf -ngl 35 -p "Hello"
```
2. **Optimize batch size**:
```python
llm = Llama(
model_path="model.gguf",
n_batch=512, # Increase for faster prompt processing
n_gpu_layers=35
)
```
3. **Use appropriate threads**:
```bash
# Match physical cores, not logical
./llama-cli -m model.gguf -t 8 -p "Hello"
```
4. **Enable Flash Attention** (if supported):
```bash
./llama-cli -m model.gguf -ngl 35 --flash-attn -p "Hello"
```
### Out of Memory
**Error**: `CUDA out of memory` or system freeze
**Solutions**:
1. **Reduce GPU layers**:
```python
# Start low and increase
llm = Llama(model_path="model.gguf", n_gpu_layers=10)
```
2. **Use smaller quantization**:
```bash
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```
3. **Reduce context length**:
```python
llm = Llama(
model_path="model.gguf",
n_ctx=2048, # Reduce from 4096
n_gpu_layers=35
)
```
4. **Quantize KV cache**:
```python
llm = Llama(
    model_path="model.gguf",
    flash_attn=True,  # Required when quantizing the V cache
    type_k=2,  # Q4_0 for K cache
    type_v=2,  # Q4_0 for V cache
    n_gpu_layers=35
)
```
### Garbage Output
**Problem**: Model outputs random characters or nonsense
**Diagnose**:
```python
# Check model loading
llm = Llama(model_path="model.gguf", verbose=True)
# Test with simple prompt
output = llm("1+1=", max_tokens=5, temperature=0)
print(output)
```
**Solutions**:
1. **Check model integrity**:
```bash
# Verify GGUF file
./llama-cli -m model.gguf --verbose 2>&1 | head -50
```
2. **Use correct chat format**:
```python
llm = Llama(
model_path="model.gguf",
chat_format="llama-3" # Match your model: chatml, mistral, etc.
)
```
3. **Check temperature**:
```python
# Use lower temperature for deterministic output
output = llm("Hello", max_tokens=50, temperature=0.1)
```
### Token Issues
**Error**: `RuntimeError: unknown token` or encoding errors
**Fix**:
```python
# encode().decode() is a no-op; instead make sure text enters Python
# as UTF-8, e.g. when reading prompts from a file:
with open("prompt.txt", encoding="utf-8") as f:
    prompt = f.read()
output = llm(prompt, max_tokens=50)
```
## Server Issues
### Connection Refused
**Error**: `Connection refused` when accessing server
**Fix**:
```bash
# Bind to all interfaces
./llama-server -m model.gguf --host 0.0.0.0 --port 8080
# Check if port is in use
lsof -i :8080
```
### Server Crashes Under Load
**Problem**: Server crashes with multiple concurrent requests
**Solutions**:
1. **Limit parallelism**:
```bash
./llama-server -m model.gguf \
--parallel 2 \
-c 4096 \
--cont-batching
```
2. **Add request timeout**:
```bash
./llama-server -m model.gguf --timeout 300
```
3. **Monitor memory**:
```bash
watch -n 1 nvidia-smi # For GPU
watch -n 1 free -h # For RAM
```
### API Compatibility Issues
**Problem**: OpenAI client not working with server
**Fix**:
```python
from openai import OpenAI
# Use correct base URL format
client = OpenAI(
base_url="http://localhost:8080/v1", # Include /v1
api_key="not-needed"
)
# Use correct model name
response = client.chat.completions.create(
model="local", # Or the actual model name
messages=[{"role": "user", "content": "Hello"}]
)
```
## Apple Silicon Issues
### Metal Not Working
**Problem**: Metal acceleration not enabled
**Check**:
```bash
# Verify Metal support
./llama-cli -m model.gguf --verbose 2>&1 | grep -i metal
```
**Fix**:
```bash
# Rebuild with Metal
make clean
make GGML_METAL=1
# Python bindings
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall
```
### Incorrect Memory Usage on M1/M2
**Problem**: Model uses too much unified memory
**Fix**:
```python
# Offload all layers for Metal
llm = Llama(
model_path="model.gguf",
n_gpu_layers=99, # Offload everything
n_threads=1 # Metal handles parallelism
)
```
## Debugging
### Enable Verbose Output
```bash
# CLI verbose mode
./llama-cli -m model.gguf --verbose -p "Hello" -n 50
# Python verbose
llm = Llama(model_path="model.gguf", verbose=True)
```
### Check Model Metadata
```bash
# View GGUF metadata
./llama-cli -m model.gguf --verbose 2>&1 | head -100
```
### Validate GGUF File
```python
import struct
def validate_gguf(filepath):
    with open(filepath, 'rb') as f:
        magic = f.read(4)
        if magic != b'GGUF':
            print(f"Invalid magic: {magic}")
            return False
        version = struct.unpack('<I', f.read(4))[0]
        print(f"GGUF version: {version}")
        tensor_count = struct.unpack('<Q', f.read(8))[0]
        metadata_count = struct.unpack('<Q', f.read(8))[0]
        print(f"Tensors: {tensor_count}, Metadata: {metadata_count}")
    return True
validate_gguf("model.gguf")
```
## Getting Help
1. **GitHub Issues**: https://github.com/ggml-org/llama.cpp/issues
2. **Discussions**: https://github.com/ggml-org/llama.cpp/discussions
3. **Reddit**: r/LocalLLaMA
### Reporting Issues
Include:
- llama.cpp version/commit hash
- Build command used
- Model name and quantization
- Full error message/stack trace
- Hardware: CPU/GPU model, RAM, VRAM
- OS version
- Minimal reproduction steps
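A sketch that collects some of this checklist automatically (the nvidia-smi query is skipped when the tool is absent; the commit hash and build command still need to be filled in by hand):

```python
import platform
import shutil
import subprocess

def env_report() -> str:
    """Gather basic environment facts for a llama.cpp bug report."""
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {platform.python_version()}",
        f"Machine: {platform.machine()}",
    ]
    if shutil.which("nvidia-smi"):  # GPU details, if available
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        lines.append(f"GPU: {out.stdout.strip()}")
    return "\n".join(lines)

print(env_report())
```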


@@ -0,0 +1,575 @@
---
name: guidance
description: Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [guidance, transformers]
metadata:
hermes:
tags: [Prompt Engineering, Guidance, Constrained Generation, Structured Output, JSON Validation, Grammar, Microsoft Research, Format Enforcement, Multi-Step Workflows]
---
# Guidance: Constrained LLM Generation
## When to Use This Skill
Use Guidance when you need to:
- **Control LLM output syntax** with regex or grammars
- **Guarantee valid JSON/XML/code** generation
- **Reduce latency** vs traditional prompting approaches
- **Enforce structured formats** (dates, emails, IDs, etc.)
- **Build multi-step workflows** with Pythonic control flow
- **Prevent invalid outputs** through grammatical constraints
**GitHub Stars**: 18,000+ | **From**: Microsoft Research
## Installation
```bash
# Base installation
pip install guidance
# With specific backends
pip install guidance[transformers] # Hugging Face models
pip install guidance[llama_cpp] # llama.cpp models
```
## Quick Start
### Basic Example: Structured Generation
```python
from guidance import models, gen
# Load model (supports OpenAI, Transformers, llama.cpp)
lm = models.OpenAI("gpt-4")
# Generate with constraints
result = lm + "The capital of France is " + gen("capital", max_tokens=5)
print(result["capital"]) # "Paris"
```
### With Anthropic Claude
```python
from guidance import models, gen, system, user, assistant
# Configure Claude
lm = models.Anthropic("claude-sonnet-4-5-20250929")
# Use context managers for chat format
with system():
lm += "You are a helpful assistant."
with user():
lm += "What is the capital of France?"
with assistant():
lm += gen(max_tokens=20)
```
## Core Concepts
### 1. Context Managers
Guidance uses Pythonic context managers for chat-style interactions.
```python
from guidance import system, user, assistant, gen
lm = models.Anthropic("claude-sonnet-4-5-20250929")
# System message
with system():
    lm += "You are a JSON generation expert."

# User message
with user():
    lm += "Generate a person object with name and age."

# Assistant response
with assistant():
    lm += gen("response", max_tokens=100)
print(lm["response"])
```
**Benefits:**
- Natural chat flow
- Clear role separation
- Easy to read and maintain
### 2. Constrained Generation
Guidance ensures outputs match specified patterns using regex or grammars.
#### Regex Constraints
```python
from guidance import models, gen
lm = models.Anthropic("claude-sonnet-4-5-20250929")
# Constrain to valid email format
lm += "Email: " + gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
# Constrain to date format (YYYY-MM-DD)
lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}")
# Constrain to phone number
lm += "Phone: " + gen("phone", regex=r"\d{3}-\d{3}-\d{4}")
print(lm["email"]) # Guaranteed valid email
print(lm["date"]) # Guaranteed YYYY-MM-DD format
```
**How it works:**
- The regex is compiled into a token-level grammar
- Invalid tokens are masked out at each decoding step
- The model can only produce matching outputs
- Note: this requires a backend that exposes token logits (e.g. Transformers or llama.cpp); remote chat APIs cannot enforce constraints mid-generation
#### Selection Constraints
```python
from guidance import models, gen, select
lm = models.Anthropic("claude-sonnet-4-5-20250929")
# Constrain to specific choices
lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="sentiment")
# Multiple-choice selection
lm += "Best answer: " + select(
["A) Paris", "B) London", "C) Berlin", "D) Madrid"],
name="answer"
)
print(lm["sentiment"]) # One of: positive, negative, neutral
print(lm["answer"]) # One of: A, B, C, or D
```
### 3. Token Healing
Guidance automatically "heals" token boundaries between prompt and generation.
**Problem:** Tokenization creates unnatural boundaries.
```python
# Without token healing
prompt = "The capital of France is "
# Last token: " is "
# First generated token might be " Par" (with leading space)
# Result: "The capital of France is  Paris" (double space!)
```
**Solution:** Guidance backs up one token and regenerates.
```python
from guidance import models, gen
lm = models.Anthropic("claude-sonnet-4-5-20250929")
# Token healing enabled by default
lm += "The capital of France is " + gen("capital", max_tokens=5)
# Result: "The capital of France is Paris" (correct spacing)
```
**Benefits:**
- Natural text boundaries
- No awkward spacing issues
- Better model performance (sees natural token sequences)
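The healing step can be sketched in miniature (`heal_and_generate` and `fake_model` are hypothetical names; the real implementation operates on token IDs inside constrained decoding, not on strings):

```python
def heal_and_generate(prompt_tokens, generate_fn):
    """Toy token healing: drop the last prompt token, then regenerate
    with the constraint that the output begins with the dropped text."""
    *kept, last = prompt_tokens
    prefix = "".join(kept)
    completion = generate_fn(prefix, must_start_with=last)
    assert completion.startswith(last)  # the healed boundary is preserved
    return prefix + completion

# Hypothetical model: free to re-tokenize " is " + continuation naturally
def fake_model(prefix, must_start_with):
    return must_start_with.rstrip() + " Paris"  # " is Paris", single space

tokens = ["The", " capital", " of", " France", " is "]
print(heal_and_generate(tokens, fake_model))
# The capital of France is Paris
```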
### 4. Grammar-Based Generation
Define complex structures using context-free grammars.
```python
from guidance import models, gen, guidance

@guidance
def json_person(lm):
    """Compose constrained gen() calls into a JSON-shaped grammar."""
    lm += '{\n  "name": "' + gen("name", regex=r"[A-Za-z ]+", max_tokens=20) + '",\n'
    lm += '  "age": ' + gen("age", regex=r"[0-9]+", max_tokens=3) + ',\n'
    lm += '  "email": "' + gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", max_tokens=50) + '"\n}'
    return lm

lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = json_person(lm)
print(lm)  # JSON-shaped output guaranteed
```
**Use cases:**
- Complex structured outputs
- Nested data structures
- Programming language syntax
- Domain-specific languages
### 5. Guidance Functions
Create reusable generation patterns with the `@guidance` decorator.
```python
from guidance import guidance, gen, models
@guidance
def generate_person(lm):
"""Generate a person with name and age."""
lm += "Name: " + gen("name", max_tokens=20, stop="\n")
lm += "\nAge: " + gen("age", regex=r"[0-9]+", max_tokens=3)
return lm
# Use the function
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = generate_person(lm)
print(lm["name"])
print(lm["age"])
```
**Stateful Functions:**
```python
from guidance import guidance, gen, select

@guidance(stateless=False)
def react_agent(lm, question, tools, max_rounds=5):
"""ReAct agent with tool use."""
lm += f"Question: {question}\n\n"
for i in range(max_rounds):
# Thought
lm += f"Thought {i+1}: " + gen("thought", stop="\n")
# Action
lm += "\nAction: " + select(list(tools.keys()), name="action")
        # Execute tool (these demo tools take no arguments)
        tool_result = tools[lm["action"]]()
lm += f"\nObservation: {tool_result}\n\n"
# Check if done
lm += "Done? " + select(["Yes", "No"], name="done")
if lm["done"] == "Yes":
break
# Final answer
lm += "\nFinal Answer: " + gen("answer", max_tokens=100)
return lm
```
## Backend Configuration
### Anthropic Claude
```python
from guidance import models
lm = models.Anthropic(
model="claude-sonnet-4-5-20250929",
api_key="your-api-key" # Or set ANTHROPIC_API_KEY env var
)
```
### OpenAI
```python
lm = models.OpenAI(
model="gpt-4o-mini",
api_key="your-api-key" # Or set OPENAI_API_KEY env var
)
```
### Local Models (Transformers)
```python
from guidance.models import Transformers
lm = Transformers(
"microsoft/Phi-4-mini-instruct",
device="cuda" # Or "cpu"
)
```
### Local Models (llama.cpp)
```python
from guidance.models import LlamaCpp
lm = LlamaCpp(
model_path="/path/to/model.gguf",
n_ctx=4096,
n_gpu_layers=35
)
```
## Common Patterns
### Pattern 1: JSON Generation
```python
from guidance import models, gen, system, user, assistant
lm = models.Anthropic("claude-sonnet-4-5-20250929")
with system():
lm += "You generate valid JSON."
with user():
lm += "Generate a user profile with name, age, and email."
with assistant():
lm += """{
"name": """ + gen("name", regex=r'"[A-Za-z ]+"', max_tokens=30) + """,
"age": """ + gen("age", regex=r"[0-9]+", max_tokens=3) + """,
"email": """ + gen("email", regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"', max_tokens=50) + """
}"""
print(lm) # Valid JSON guaranteed
```
### Pattern 2: Classification
```python
from guidance import models, gen, select
lm = models.Anthropic("claude-sonnet-4-5-20250929")
text = "This product is amazing! I love it."
lm += f"Text: {text}\n"
lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="sentiment")
lm += "\nConfidence: " + gen("confidence", regex=r"[0-9]+", max_tokens=3) + "%"
print(f"Sentiment: {lm['sentiment']}")
print(f"Confidence: {lm['confidence']}%")
```
### Pattern 3: Multi-Step Reasoning
```python
from guidance import models, gen, guidance
@guidance
def chain_of_thought(lm, question):
"""Generate answer with step-by-step reasoning."""
lm += f"Question: {question}\n\n"
# Generate multiple reasoning steps
for i in range(3):
lm += f"Step {i+1}: " + gen(f"step_{i+1}", stop="\n", max_tokens=100) + "\n"
# Final answer
lm += "\nTherefore, the answer is: " + gen("answer", max_tokens=50)
return lm
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = chain_of_thought(lm, "What is 15% of 200?")
print(lm["answer"])
```
### Pattern 4: ReAct Agent
```python
from guidance import models, gen, select, guidance
@guidance(stateless=False)
def react_agent(lm, question):
"""ReAct agent with tool use."""
    tools = {
        "calculator": lambda expr: eval(expr),  # demo only: eval is unsafe on untrusted input
        "search": lambda query: f"Search results for: {query}",
    }
lm += f"Question: {question}\n\n"
    for _ in range(5):
        # Thought
        lm += "Thought: " + gen("thought", stop="\n") + "\n"
# Action selection
lm += "Action: " + select(["calculator", "search", "answer"], name="action")
if lm["action"] == "answer":
lm += "\nFinal Answer: " + gen("answer", max_tokens=100)
break
# Action input
lm += "\nAction Input: " + gen("action_input", stop="\n") + "\n"
# Execute tool
if lm["action"] in tools:
result = tools[lm["action"]](lm["action_input"])
lm += f"Observation: {result}\n\n"
return lm
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = react_agent(lm, "What is 25 * 4 + 10?")
print(lm["answer"])
```
### Pattern 5: Data Extraction
```python
from guidance import models, gen, guidance
@guidance
def extract_entities(lm, text):
"""Extract structured entities from text."""
lm += f"Text: {text}\n\n"
# Extract person
lm += "Person: " + gen("person", stop="\n", max_tokens=30) + "\n"
# Extract organization
lm += "Organization: " + gen("organization", stop="\n", max_tokens=30) + "\n"
# Extract date
lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}", max_tokens=10) + "\n"
# Extract location
lm += "Location: " + gen("location", stop="\n", max_tokens=30) + "\n"
return lm
text = "Tim Cook announced at Apple Park on 2024-09-15 in Cupertino."
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = extract_entities(lm, text)
print(f"Person: {lm['person']}")
print(f"Organization: {lm['organization']}")
print(f"Date: {lm['date']}")
print(f"Location: {lm['location']}")
```
## Best Practices
### 1. Use Regex for Format Validation
```python
# ✅ Good: Regex ensures valid format
lm += "Email: " + gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
# ❌ Bad: Free generation may produce invalid emails
lm += "Email: " + gen("email", max_tokens=50)
```
### 2. Use select() for Fixed Categories
```python
# ✅ Good: Guaranteed valid category
lm += "Status: " + select(["pending", "approved", "rejected"], name="status")
# ❌ Bad: May generate typos or invalid values
lm += "Status: " + gen("status", max_tokens=20)
```
### 3. Leverage Token Healing
```python
# Token healing is enabled by default
# No special action needed - just concatenate naturally
lm += "The capital is " + gen("capital") # Automatic healing
```
### 4. Use stop Sequences
```python
# ✅ Good: Stop at newline for single-line outputs
lm += "Name: " + gen("name", stop="\n")
# ❌ Bad: May generate multiple lines
lm += "Name: " + gen("name", max_tokens=50)
```
### 5. Create Reusable Functions
```python
# ✅ Good: Reusable pattern
@guidance
def generate_person(lm):
lm += "Name: " + gen("name", stop="\n")
lm += "\nAge: " + gen("age", regex=r"[0-9]+")
return lm
# Use multiple times
lm = generate_person(lm)
lm += "\n\n"
lm = generate_person(lm)
```
### 6. Balance Constraints
```python
# ✅ Good: Reasonable constraints
lm += gen("name", regex=r"[A-Za-z ]+", max_tokens=30)
# ❌ Too strict: only two literal values are possible; use select() instead
lm += gen("name", regex=r"(John|Jane)", max_tokens=10)
```
## Comparison to Alternatives
| Feature | Guidance | Instructor | Outlines | LMQL |
|---------|----------|------------|----------|------|
| Regex Constraints | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes |
| Grammar Support | ✅ CFG | ❌ No | ✅ CFG | ✅ CFG |
| Pydantic Validation | ❌ No | ✅ Yes | ✅ Yes | ❌ No |
| Token Healing | ✅ Yes | ❌ No | ✅ Yes | ❌ No |
| Local Models | ✅ Yes | ⚠️ Limited | ✅ Yes | ✅ Yes |
| API Models | ✅ Yes | ✅ Yes | ⚠️ Limited | ✅ Yes |
| Pythonic Syntax | ✅ Yes | ✅ Yes | ✅ Yes | ❌ SQL-like |
| Learning Curve | Low | Low | Medium | High |
**When to choose Guidance:**
- Need regex/grammar constraints
- Want token healing
- Building complex workflows with control flow
- Using local models (Transformers, llama.cpp)
- Prefer Pythonic syntax
**When to choose alternatives:**
- Instructor: Need Pydantic validation with automatic retrying
- Outlines: Need JSON schema validation
- LMQL: Prefer declarative query syntax
## Performance Characteristics
**Latency Reduction:**
- 30-50% faster than traditional prompting for constrained outputs
- Token healing reduces unnecessary regeneration
- Grammar constraints prevent invalid token generation
**Memory Usage:**
- Minimal overhead vs unconstrained generation
- Grammar compilation cached after first use
- Efficient token filtering at inference time
**Token Efficiency:**
- Prevents wasted tokens on invalid outputs
- No need for retry loops
- Direct path to valid outputs
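The retry cost that constrained decoding eliminates can be made concrete with a generate-then-validate loop (the "model" here is a stand-in iterator; every failed attempt burns a full generation's worth of tokens):

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def naive_retry(generate, max_attempts=3):
    """Unconstrained pattern: generate, validate after the fact, retry."""
    for attempt in range(1, max_attempts + 1):
        out = generate()
        if EMAIL.fullmatch(out):
            return out, attempt
    raise ValueError("no valid output after retries")

# Hypothetical model that produces a valid email only on its second call
outputs = iter(["contact me at ada [at] example", "ada@example.com"])
result, attempts = naive_retry(lambda: next(outputs))
print(result, attempts)  # ada@example.com 2
```

With a regex constraint the first pass is guaranteed valid, so the loop (and its wasted tokens) disappears.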
## Resources
- **Documentation**: https://guidance.readthedocs.io
- **GitHub**: https://github.com/guidance-ai/guidance (18k+ stars)
- **Notebooks**: https://github.com/guidance-ai/guidance/tree/main/notebooks
- **Discord**: Community support available
## See Also
- `references/constraints.md` - Comprehensive regex and grammar patterns
- `references/backends.md` - Backend-specific configuration
- `references/examples.md` - Production-ready examples

# Backend Configuration Guide
Complete guide to configuring Guidance with different LLM backends.
## Table of Contents
- API-Based Models (Anthropic, OpenAI)
- Local Models (Transformers, llama.cpp)
- Backend Comparison
- Performance Tuning
- Advanced Configuration
## API-Based Models
### Anthropic Claude
#### Basic Setup
```python
from guidance import models
# Using environment variable
lm = models.Anthropic("claude-sonnet-4-5-20250929")
# Reads ANTHROPIC_API_KEY from environment
# Explicit API key
lm = models.Anthropic(
model="claude-sonnet-4-5-20250929",
api_key="your-api-key-here"
)
```
#### Available Models
```python
# Claude Sonnet 4.5 (latest, recommended)
lm = models.Anthropic("claude-sonnet-4-5-20250929")
# Claude 3.7 Sonnet (fast, cost-effective)
lm = models.Anthropic("claude-3-7-sonnet-20250219")
# Claude 3 Opus (legacy)
lm = models.Anthropic("claude-3-opus-20240229")
# Claude 3.5 Haiku (fastest, cheapest)
lm = models.Anthropic("claude-3-5-haiku-20241022")
```
#### Configuration Options
```python
lm = models.Anthropic(
model="claude-sonnet-4-5-20250929",
api_key="your-api-key",
max_tokens=4096, # Max tokens to generate
temperature=0.7, # Sampling temperature (0-1)
top_p=0.9, # Nucleus sampling
timeout=30, # Request timeout (seconds)
max_retries=3 # Retry failed requests
)
```
#### With Context Managers
```python
from guidance import models, system, user, assistant, gen
lm = models.Anthropic("claude-sonnet-4-5-20250929")
with system():
lm += "You are a helpful assistant."
with user():
lm += "What is the capital of France?"
with assistant():
lm += gen(max_tokens=50)
print(lm)
```
### OpenAI
#### Basic Setup
```python
from guidance import models
# Using environment variable
lm = models.OpenAI("gpt-4o")
# Reads OPENAI_API_KEY from environment
# Explicit API key
lm = models.OpenAI(
model="gpt-4o",
api_key="your-api-key-here"
)
```
#### Available Models
```python
# GPT-4o (Latest, multimodal)
lm = models.OpenAI("gpt-4o")
# GPT-4o Mini (Fast, cost-effective)
lm = models.OpenAI("gpt-4o-mini")
# GPT-4 Turbo
lm = models.OpenAI("gpt-4-turbo")
# GPT-3.5 Turbo (Cheapest)
lm = models.OpenAI("gpt-3.5-turbo")
```
#### Configuration Options
```python
lm = models.OpenAI(
model="gpt-4o-mini",
api_key="your-api-key",
max_tokens=2048,
temperature=0.7,
top_p=1.0,
frequency_penalty=0.0,
presence_penalty=0.0,
timeout=30
)
```
#### Chat Format
```python
from guidance import models, gen
lm = models.OpenAI("gpt-4o-mini")
# OpenAI uses chat format
lm += [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+2?"}
]
# Generate response
lm += gen(max_tokens=50)
```
### Azure OpenAI
```python
from guidance import models
lm = models.AzureOpenAI(
model="gpt-4o",
azure_endpoint="https://your-resource.openai.azure.com/",
api_key="your-azure-api-key",
api_version="2024-02-15-preview",
deployment_name="your-deployment-name"
)
```
## Local Models
### Transformers (Hugging Face)
#### Basic Setup
```python
from guidance.models import Transformers
# Load model from Hugging Face
lm = Transformers("microsoft/Phi-4-mini-instruct")
```
#### GPU Configuration
```python
# Use GPU
lm = Transformers(
"microsoft/Phi-4-mini-instruct",
device="cuda"
)
# Use specific GPU
lm = Transformers(
"microsoft/Phi-4-mini-instruct",
device="cuda:0" # GPU 0
)
# Use CPU
lm = Transformers(
"microsoft/Phi-4-mini-instruct",
device="cpu"
)
```
#### Advanced Configuration
```python
lm = Transformers(
"microsoft/Phi-4-mini-instruct",
device="cuda",
torch_dtype="float16", # Use FP16 (faster, less memory)
load_in_8bit=True, # 8-bit quantization
max_memory={0: "20GB"}, # GPU memory limit
offload_folder="./offload" # Offload to disk if needed
)
```
#### Popular Models
```python
# Phi-4 (Microsoft)
lm = Transformers("microsoft/Phi-4-mini-instruct")
lm = Transformers("microsoft/Phi-3-medium-4k-instruct")
# Llama 3 (Meta)
lm = Transformers("meta-llama/Llama-3.1-8B-Instruct")
lm = Transformers("meta-llama/Llama-3.1-70B-Instruct")
# Mistral (Mistral AI)
lm = Transformers("mistralai/Mistral-7B-Instruct-v0.3")
lm = Transformers("mistralai/Mixtral-8x7B-Instruct-v0.1")
# Qwen (Alibaba)
lm = Transformers("Qwen/Qwen2.5-7B-Instruct")
# Gemma (Google)
lm = Transformers("google/gemma-2-9b-it")
```
#### Generation Configuration
```python
lm = Transformers(
"microsoft/Phi-4-mini-instruct",
device="cuda"
)
# Configure generation
from guidance import gen
result = lm + gen(
max_tokens=100,
temperature=0.7,
top_p=0.9,
top_k=50,
repetition_penalty=1.1
)
```
### llama.cpp
#### Basic Setup
```python
from guidance.models import LlamaCpp
# Load GGUF model
lm = LlamaCpp(
model_path="/path/to/model.gguf",
n_ctx=4096 # Context window
)
```
#### GPU Configuration
```python
# Use GPU acceleration
lm = LlamaCpp(
model_path="/path/to/model.gguf",
n_ctx=4096,
n_gpu_layers=35, # Offload 35 layers to GPU
n_threads=8 # CPU threads for remaining layers
)
# Full GPU offload
lm = LlamaCpp(
model_path="/path/to/model.gguf",
n_ctx=4096,
n_gpu_layers=-1 # Offload all layers
)
```
#### Advanced Configuration
```python
lm = LlamaCpp(
model_path="/path/to/llama-3.1-8b-instruct.Q4_K_M.gguf",
n_ctx=8192, # Context window (tokens)
n_gpu_layers=35, # GPU layers
n_threads=8, # CPU threads
n_batch=512, # Batch size for prompt processing
use_mmap=True, # Memory-map the model file
use_mlock=False, # Lock model in RAM
seed=42, # Random seed
verbose=False # Suppress verbose output
)
```
#### Quantized Models
```python
# Q4_K_M (4-bit, recommended for most cases)
lm = LlamaCpp("/path/to/model.Q4_K_M.gguf")
# Q5_K_M (5-bit, better quality)
lm = LlamaCpp("/path/to/model.Q5_K_M.gguf")
# Q8_0 (8-bit, high quality)
lm = LlamaCpp("/path/to/model.Q8_0.gguf")
# F16 (16-bit float, highest quality)
lm = LlamaCpp("/path/to/model.F16.gguf")
```
#### Popular GGUF Models
```python
# Llama 3.1
lm = LlamaCpp("llama-3.1-8b-instruct.Q4_K_M.gguf")
# Mistral
lm = LlamaCpp("mistral-7b-instruct-v0.3.Q4_K_M.gguf")
# Phi-4
lm = LlamaCpp("phi-4-mini-instruct.Q4_K_M.gguf")
```
## Backend Comparison
### Feature Matrix
| Feature | Anthropic | OpenAI | Transformers | llama.cpp |
|---------|-----------|--------|--------------|-----------|
| Constrained Generation | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Token Healing | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Streaming | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| GPU Support | N/A | N/A | ✅ Yes | ✅ Yes |
| Quantization | N/A | N/A | ✅ Yes | ✅ Yes |
| Cost | $$$ | $$$ | Free | Free |
| Latency | Low | Low | Medium | Low |
| Setup Difficulty | Easy | Easy | Medium | Medium |
### Performance Characteristics
**Anthropic Claude:**
- **Latency**: 200-500ms (API call)
- **Throughput**: Limited by API rate limits
- **Cost**: $3-15 per 1M input tokens
- **Best for**: Production systems, high-quality outputs
**OpenAI:**
- **Latency**: 200-400ms (API call)
- **Throughput**: Limited by API rate limits
- **Cost**: $0.15-30 per 1M input tokens
- **Best for**: Cost-sensitive production, gpt-4o-mini
**Transformers:**
- **Latency**: 50-200ms (local inference)
- **Throughput**: GPU-dependent (10-100 tokens/sec)
- **Cost**: Hardware cost only
- **Best for**: Privacy-sensitive, high-volume, experimentation
**llama.cpp:**
- **Latency**: 30-150ms (local inference)
- **Throughput**: Hardware-dependent (20-150 tokens/sec)
- **Cost**: Hardware cost only
- **Best for**: Edge deployment, Apple Silicon, CPU inference
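The latency figures above combine two components, time-to-first-token plus per-token throughput; a rough latency model makes the trade-off explicit (illustrative numbers only, not benchmarks):

```python
def generation_time_ms(n_tokens: int, tokens_per_sec: float, first_token_ms: float) -> float:
    """Rough end-to-end latency: time-to-first-token + n_tokens / throughput."""
    return first_token_ms + n_tokens / tokens_per_sec * 1000

# e.g. 100 tokens at 50 tok/s with 200 ms time-to-first-token
print(generation_time_ms(100, 50, 200))  # 2200.0 ms
```

For short outputs, time-to-first-token dominates (favoring local backends); for long outputs, throughput dominates.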
### Memory Requirements
**Transformers (FP16):**
- 7B model: ~14GB GPU VRAM
- 13B model: ~26GB GPU VRAM
- 70B model: ~140GB GPU VRAM (multi-GPU)
**llama.cpp (Q4_K_M):**
- 7B model: ~4.5GB RAM
- 13B model: ~8GB RAM
- 70B model: ~40GB RAM
**Optimization Tips:**
- Use quantized models (Q4_K_M) for lower memory
- Use GPU offloading for faster inference
- Use CPU inference for smaller models (<7B)
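The numbers above follow from a simple weights-only rule of thumb (the ~4.85 effective bits/weight for Q4_K_M is an approximation; actual GGUF file sizes vary by quantization mix, and KV cache plus activations add more on top):

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Weights only: parameters x bits / 8. Excludes KV cache and activations."""
    return n_params_billion * bits_per_weight / 8

print(f"7B  FP16:   ~{weight_memory_gb(7, 16):.1f} GB")    # ~14.0 GB
print(f"7B  Q4_K_M: ~{weight_memory_gb(7, 4.85):.1f} GB")  # ~4.2 GB
print(f"70B Q4_K_M: ~{weight_memory_gb(70, 4.85):.1f} GB") # ~42.4 GB
```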
## Performance Tuning
### API Models (Anthropic, OpenAI)
#### Reduce Latency
```python
from guidance import models, gen
lm = models.Anthropic("claude-sonnet-4-5-20250929")
# Use lower max_tokens (faster response)
lm += gen(max_tokens=100) # Instead of 1000
# Use streaming (perceived latency reduction)
for chunk in lm.stream(gen(max_tokens=500)):
print(chunk, end="", flush=True)
```
#### Reduce Cost
```python
# Use cheaper models
lm = models.Anthropic("claude-3-5-haiku-20241022") # vs Sonnet
lm = models.OpenAI("gpt-4o-mini") # vs gpt-4o
# Reduce context size
# - Keep prompts concise
# - Avoid large few-shot examples
# - Use max_tokens limits
```
### Local Models (Transformers, llama.cpp)
#### Optimize GPU Usage
```python
from guidance.models import Transformers
# Use FP16 for 2x speedup
lm = Transformers(
"meta-llama/Llama-3.1-8B-Instruct",
device="cuda",
torch_dtype="float16"
)
# Use 8-bit quantization for 4x memory reduction
lm = Transformers(
"meta-llama/Llama-3.1-8B-Instruct",
device="cuda",
load_in_8bit=True
)
# Use flash attention (requires flash-attn package)
lm = Transformers(
"meta-llama/Llama-3.1-8B-Instruct",
device="cuda",
use_flash_attention_2=True
)
```
#### Optimize llama.cpp
```python
from guidance.models import LlamaCpp
# Maximize GPU layers
lm = LlamaCpp(
model_path="/path/to/model.Q4_K_M.gguf",
n_gpu_layers=-1 # All layers on GPU
)
# Optimize batch size
lm = LlamaCpp(
model_path="/path/to/model.Q4_K_M.gguf",
n_batch=512, # Larger batch = faster prompt processing
n_gpu_layers=-1
)
# Use Metal (Apple Silicon)
lm = LlamaCpp(
model_path="/path/to/model.Q4_K_M.gguf",
n_gpu_layers=-1, # Use Metal GPU acceleration
use_mmap=True
)
```
#### Batch Processing
```python
# Process multiple requests efficiently
requests = [
"What is 2+2?",
"What is the capital of France?",
"What is photosynthesis?"
]
# Bad: Sequential processing
for req in requests:
lm = Transformers("microsoft/Phi-4-mini-instruct")
lm += req + gen(max_tokens=50)
# Good: Reuse the loaded model; lm + ... returns a new state each time
lm = Transformers("microsoft/Phi-4-mini-instruct")
for req in requests:
    result = lm + req + gen(max_tokens=50)
```
## Advanced Configuration
### Custom Model Configurations
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from guidance.models import Transformers
# Load custom model
tokenizer = AutoTokenizer.from_pretrained("your-model")
model = AutoModelForCausalLM.from_pretrained(
"your-model",
device_map="auto",
torch_dtype="float16"
)
# Use with Guidance
lm = Transformers(model=model, tokenizer=tokenizer)
```
### Environment Variables
```bash
# API keys
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
# Transformers cache
export HF_HOME="/path/to/cache"
export TRANSFORMERS_CACHE="/path/to/cache"
# GPU selection
export CUDA_VISIBLE_DEVICES=0,1 # Use GPU 0 and 1
```
### Debugging
```python
# Enable verbose logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Check backend info
lm = models.Anthropic("claude-sonnet-4-5-20250929")
print(f"Model: {lm.model_name}")
print(f"Backend: {lm.backend}")
# Check GPU usage (Transformers)
import torch
lm = Transformers("microsoft/Phi-4-mini-instruct", device="cuda")
print(f"Device: {lm.device}")
print(f"Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```
## Resources
- **Anthropic Docs**: https://docs.anthropic.com
- **OpenAI Docs**: https://platform.openai.com/docs
- **Hugging Face Models**: https://huggingface.co/models
- **llama.cpp**: https://github.com/ggerganov/llama.cpp
- **GGUF Models**: https://huggingface.co/models?library=gguf

# Comprehensive Constraint Patterns
Guide to regex constraints, grammar-based generation, and token healing in Guidance.
## Table of Contents
- Regex Constraints
- Grammar-Based Generation
- Token Healing
- Selection Constraints
- Complex Patterns
- Performance Optimization
## Regex Constraints
### Basic Patterns
#### Numeric Constraints
```python
from guidance import models, gen
lm = models.Anthropic("claude-sonnet-4-5-20250929")
# Integer (positive)
lm += "Age: " + gen("age", regex=r"[0-9]+")
# Integer (with negatives)
lm += "Temperature: " + gen("temp", regex=r"-?[0-9]+")
# Float (positive)
lm += "Price: $" + gen("price", regex=r"[0-9]+\.[0-9]{2}")
# Float (with negatives and optional decimals)
lm += "Value: " + gen("value", regex=r"-?[0-9]+(\.[0-9]+)?")
# Percentage (0-100)
lm += "Progress: " + gen("progress", regex=r"(100|[0-9]{1,2})")
# Range (1-5 stars)
lm += "Rating: " + gen("rating", regex=r"[1-5]") + " stars"
```
#### Text Constraints
```python
# Alphabetic only
lm += "Name: " + gen("name", regex=r"[A-Za-z]+")
# Alphabetic with spaces
lm += "Full Name: " + gen("full_name", regex=r"[A-Za-z ]+")
# Alphanumeric
lm += "Username: " + gen("username", regex=r"[A-Za-z0-9_]+")
# Capitalized words
lm += "Title: " + gen("title", regex=r"[A-Z][a-z]+( [A-Z][a-z]+)*")
# Lowercase only
lm += "Code: " + gen("code", regex=r"[a-z0-9-]+")
# Specific length
lm += "ID: " + gen("id", regex=r"[A-Z]{3}-[0-9]{6}") # e.g., "ABC-123456"
```
#### Date and Time Constraints
```python
# Date (YYYY-MM-DD)
lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}")
# Date (MM/DD/YYYY)
lm += "Date: " + gen("date_us", regex=r"\d{2}/\d{2}/\d{4}")
# Time (HH:MM)
lm += "Time: " + gen("time", regex=r"\d{2}:\d{2}")
# Time (HH:MM:SS)
lm += "Time: " + gen("time_full", regex=r"\d{2}:\d{2}:\d{2}")
# ISO 8601 datetime
lm += "Timestamp: " + gen(
"timestamp",
regex=r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"
)
# Year (YYYY)
lm += "Year: " + gen("year", regex=r"(19|20)\d{2}")
# Month name
lm += "Month: " + gen(
"month",
regex=r"(January|February|March|April|May|June|July|August|September|October|November|December)"
)
```
#### Contact Information
```python
# Email
lm += "Email: " + gen(
"email",
regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
)
# Phone (US format)
lm += "Phone: " + gen("phone", regex=r"\d{3}-\d{3}-\d{4}")
# Phone (international format)
lm += "Phone: " + gen("phone_intl", regex=r"\+[0-9]{1,3}-[0-9]{1,14}")
# ZIP code (US)
lm += "ZIP: " + gen("zip", regex=r"\d{5}(-\d{4})?")
# Postal code (Canada)
lm += "Postal: " + gen("postal", regex=r"[A-Z]\d[A-Z] \d[A-Z]\d")
# URL
lm += "URL: " + gen(
"url",
regex=r"https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(/[a-zA-Z0-9._~:/?#\[\]@!$&'()*+,;=-]*)?"
)
```
### Advanced Patterns
#### JSON Field Constraints
```python
from guidance import models, gen
lm = models.Anthropic("claude-sonnet-4-5-20250929")
# String field with quotes
lm += '"name": ' + gen("name", regex=r'"[A-Za-z ]+"')
# Numeric field (no quotes)
lm += '"age": ' + gen("age", regex=r"[0-9]+")
# Boolean field
lm += '"active": ' + gen("active", regex=r"(true|false)")
# Null field
lm += '"optional": ' + gen("optional", regex=r"(null|[0-9]+)")
# Array of strings
lm += '"tags": [' + gen(
"tags",
regex=r'"[a-z]+"(, "[a-z]+")*'
) + ']'
# Complete JSON object
lm += """{
"name": """ + gen("name", regex=r'"[A-Za-z ]+"') + """,
"age": """ + gen("age", regex=r"[0-9]+") + """,
"email": """ + gen(
"email",
regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
) + """
}"""
```
#### Code Patterns
```python
# Python variable name
lm += "Variable: " + gen("var", regex=r"[a-z_][a-z0-9_]*")
# Python function name
lm += "Function: " + gen("func", regex=r"[a-z_][a-z0-9_]*")
# Hex color code
lm += "Color: #" + gen("color", regex=r"[0-9A-Fa-f]{6}")
# UUID
lm += "UUID: " + gen(
"uuid",
regex=r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)
# Git commit hash (short)
lm += "Commit: " + gen("commit", regex=r"[0-9a-f]{7}")
# Semantic version
lm += "Version: " + gen("version", regex=r"[0-9]+\.[0-9]+\.[0-9]+")
# IP address (IPv4)
lm += "IP: " + gen(
"ip",
regex=r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
)
```
#### Domain-Specific Patterns
```python
# Credit card number
lm += "Card: " + gen("card", regex=r"\d{4}-\d{4}-\d{4}-\d{4}")
# Social Security Number (US)
lm += "SSN: " + gen("ssn", regex=r"\d{3}-\d{2}-\d{4}")
# ISBN-13
lm += "ISBN: " + gen("isbn", regex=r"978-\d{1,5}-\d{1,7}-\d{1,7}-\d")
# License plate (US)
lm += "Plate: " + gen("plate", regex=r"[A-Z]{3}-\d{4}")
# Currency amount
lm += "Amount: $" + gen("amount", regex=r"[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}")
# Percentage with decimal
lm += "Rate: " + gen("rate", regex=r"[0-9]+\.[0-9]{1,2}%")
```
## Grammar-Based Generation
### JSON Grammar
```python
from guidance import models, gen, guidance
@guidance
def json_object(lm):
"""Generate valid JSON object."""
lm += "{\n"
# Name field (required)
lm += ' "name": ' + gen("name", regex=r'"[A-Za-z ]+"') + ",\n"
# Age field (required)
lm += ' "age": ' + gen("age", regex=r"[0-9]+") + ",\n"
# Email field (required)
lm += ' "email": ' + gen(
"email",
regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
) + ",\n"
# Active field (required, boolean)
lm += ' "active": ' + gen("active", regex=r"(true|false)") + "\n"
lm += "}"
return lm
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = json_object(lm)
print(lm) # Valid JSON guaranteed
```
### Nested JSON Grammar
```python
@guidance
def nested_json(lm):
"""Generate nested JSON structure."""
lm += "{\n"
# User object
lm += ' "user": {\n'
lm += ' "name": ' + gen("name", regex=r'"[A-Za-z ]+"') + ",\n"
lm += ' "age": ' + gen("age", regex=r"[0-9]+") + "\n"
lm += " },\n"
# Address object
lm += ' "address": {\n'
lm += ' "street": ' + gen("street", regex=r'"[A-Za-z0-9 ]+"') + ",\n"
lm += ' "city": ' + gen("city", regex=r'"[A-Za-z ]+"') + ",\n"
lm += ' "zip": ' + gen("zip", regex=r'"\d{5}"') + "\n"
lm += " }\n"
lm += "}"
return lm
```
### Array Grammar
```python
@guidance
def json_array(lm, count=3):
"""Generate JSON array with fixed count."""
lm += "[\n"
for i in range(count):
lm += " {\n"
lm += ' "id": ' + gen(f"id_{i}", regex=r"[0-9]+") + ",\n"
lm += ' "name": ' + gen(f"name_{i}", regex=r'"[A-Za-z ]+"') + "\n"
lm += " }"
if i < count - 1:
lm += ","
lm += "\n"
lm += "]"
return lm
```
### XML Grammar
```python
@guidance
def xml_document(lm):
"""Generate valid XML document."""
lm += '<?xml version="1.0"?>\n'
lm += "<person>\n"
# Name element
lm += " <name>" + gen("name", regex=r"[A-Za-z ]+") + "</name>\n"
# Age element
lm += " <age>" + gen("age", regex=r"[0-9]+") + "</age>\n"
# Email element
lm += " <email>" + gen(
"email",
regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
) + "</email>\n"
lm += "</person>"
return lm
```
### CSV Grammar
```python
@guidance
def csv_row(lm):
"""Generate CSV row."""
lm += gen("name", regex=r"[A-Za-z ]+") + ","
lm += gen("age", regex=r"[0-9]+") + ","
lm += gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
return lm
@guidance
def csv_document(lm, rows=5):
"""Generate complete CSV."""
# Header
lm += "Name,Age,Email\n"
# Rows
for i in range(rows):
lm = csv_row(lm)
if i < rows - 1:
lm += "\n"
return lm
```
## Token Healing
### How Token Healing Works
**Problem:** Tokenization creates unnatural boundaries.
```python
# Example without token healing
prompt = "The capital of France is "
# Tokenization: ["The", " capital", " of", " France", " is", " "]
# Model sees last token: " "
# First generated token might include leading space: " Paris"
# Result: "The capital of France is  Paris" (double space)
```
**Solution:** Guidance backs up and regenerates the last token.
```python
from guidance import models, gen
lm = models.Anthropic("claude-sonnet-4-5-20250929")
# Token healing enabled by default
lm += "The capital of France is " + gen("capital", max_tokens=5)
# Process:
# 1. Back up to token before " is "
# 2. Regenerate " is" + "capital" together
# 3. Result: "The capital of France is Paris" (correct)
```
### Token Healing Examples
#### Natural Continuations
```python
# Before token healing
lm += "The function name is get" + gen("rest")
# Might generate: "The function name is get User" (space before User)
# With token healing
lm += "The function name is get" + gen("rest")
# Generates: "The function name is getUser" (correct camelCase)
```
#### Code Generation
```python
# Function name completion
lm += "def calculate_" + gen("rest", stop="(")
# Token healing ensures smooth connection: "calculate_total"
# Variable name completion
lm += "my_" + gen("var_name", regex=r"[a-z_]+")
# Token healing ensures: "my_variable_name" (not "my_ variable_name")
```
#### Domain-Specific Terms
```python
# Medical terms
lm += "The patient has hyper" + gen("condition")
# Token healing helps: "hypertension" (not "hyper tension")
# Technical terms
lm += "Using micro" + gen("tech")
# Token healing helps: "microservices" (not "micro services")
```
### Disabling Token Healing
```python
# Disable token healing if needed (rare)
lm += gen("text", token_healing=False)
```
## Selection Constraints
### Basic Selection
```python
from guidance import models, select
lm = models.Anthropic("claude-sonnet-4-5-20250929")
# Simple selection
lm += "Status: " + select(["active", "inactive", "pending"], name="status")
# Boolean selection
lm += "Approved: " + select(["Yes", "No"], name="approved")
# Multiple choice
lm += "Answer: " + select(
["A) Paris", "B) London", "C) Berlin", "D) Madrid"],
name="answer"
)
```
### Conditional Selection
```python
from guidance import models, select, gen, guidance
@guidance
def conditional_fields(lm):
"""Generate fields conditionally based on type."""
lm += "Type: " + select(["person", "company"], name="type")
if lm["type"] == "person":
lm += "\nName: " + gen("name", regex=r"[A-Za-z ]+")
lm += "\nAge: " + gen("age", regex=r"[0-9]+")
else:
lm += "\nCompany Name: " + gen("company", regex=r"[A-Za-z ]+")
lm += "\nEmployees: " + gen("employees", regex=r"[0-9]+")
return lm
```
### Repeated Selection
```python
@guidance
def multiple_selections(lm):
"""Select multiple items."""
lm += "Select 3 colors:\n"
colors = ["red", "blue", "green", "yellow", "purple"]
for i in range(3):
lm += f"{i+1}. " + select(colors, name=f"color_{i}") + "\n"
return lm
```
## Complex Patterns
### Pattern 1: Structured Forms
```python
@guidance
def user_form(lm):
"""Generate structured user form."""
lm += "=== User Registration ===\n\n"
# Name (alphabetic only)
lm += "Full Name: " + gen("name", regex=r"[A-Za-z ]+", stop="\n") + "\n"
# Age (numeric)
lm += "Age: " + gen("age", regex=r"[0-9]+", max_tokens=3) + "\n"
# Email (validated format)
lm += "Email: " + gen(
"email",
regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
stop="\n"
) + "\n"
# Phone (US format)
lm += "Phone: " + gen("phone", regex=r"\d{3}-\d{3}-\d{4}") + "\n"
# Account type (selection)
lm += "Account Type: " + select(
["Standard", "Premium", "Enterprise"],
name="account_type"
) + "\n"
# Active status (boolean)
lm += "Active: " + select(["Yes", "No"], name="active") + "\n"
return lm
```
### Pattern 2: Multi-Entity Extraction
```python
@guidance
def extract_entities(lm, text):
"""Extract multiple entities with constraints."""
lm += f"Text: {text}\n\n"
# Person name (alphabetic)
lm += "Person: " + gen("person", regex=r"[A-Za-z ]+", stop="\n") + "\n"
# Organization (alphanumeric with spaces)
lm += "Organization: " + gen(
"organization",
regex=r"[A-Za-z0-9 ]+",
stop="\n"
) + "\n"
# Date (YYYY-MM-DD format)
lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}") + "\n"
# Location (alphabetic with spaces)
lm += "Location: " + gen("location", regex=r"[A-Za-z ]+", stop="\n") + "\n"
# Amount (currency)
lm += "Amount: $" + gen("amount", regex=r"[0-9,]+\.[0-9]{2}") + "\n"
return lm
```
### Pattern 3: Code Generation
```python
@guidance
def generate_python_function(lm):
"""Generate Python function with constraints."""
# Function name (valid Python identifier)
lm += "def " + gen("func_name", regex=r"[a-z_][a-z0-9_]*") + "("
# Parameter name
lm += gen("param", regex=r"[a-z_][a-z0-9_]*") + "):\n"
# Docstring
lm += ' """' + gen("docstring", stop='"""', max_tokens=50) + '"""\n'
# Function body (constrained to valid Python)
lm += " return " + gen("return_value", stop="\n") + "\n"
return lm
```
### Pattern 4: Hierarchical Data
```python
@guidance
def org_chart(lm):
"""Generate organizational chart."""
lm += "Company: " + gen("company", regex=r"[A-Za-z ]+") + "\n\n"
# CEO
lm += "CEO: " + gen("ceo", regex=r"[A-Za-z ]+") + "\n"
# Departments
for dept in ["Engineering", "Sales", "Marketing"]:
lm += f"\n{dept} Department:\n"
lm += " Head: " + gen(f"{dept.lower()}_head", regex=r"[A-Za-z ]+") + "\n"
lm += " Size: " + gen(f"{dept.lower()}_size", regex=r"[0-9]+") + " employees\n"
return lm
```
## Performance Optimization
### Best Practices
#### 1. Use Specific Patterns
```python
# ✅ Good: Specific pattern
lm += gen("age", regex=r"[0-9]{1,3}") # Fast
# ❌ Bad: Overly broad pattern
lm += gen("age", regex=r"[0-9]+") # Slower
```
#### 2. Limit Max Tokens
```python
# ✅ Good: Reasonable limit
lm += gen("name", max_tokens=30)
# ❌ Bad: No limit
lm += gen("name") # May generate forever
```
#### 3. Use stop Sequences
```python
# ✅ Good: Stop at newline
lm += gen("line", stop="\n")
# ❌ Bad: Rely on max_tokens
lm += gen("line", max_tokens=100)
```
#### 4. Cache Compiled Grammars
```python
# Grammars are cached automatically after first use
# No manual caching needed
@guidance
def reusable_pattern(lm):
"""This grammar is compiled once and cached."""
lm += gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
return lm
# First call: compiles grammar
lm = reusable_pattern(lm)
# Subsequent calls: uses cached grammar (fast)
lm = reusable_pattern(lm)
```
#### 5. Avoid Overlapping Constraints
```python
# ✅ Good: Clear constraints
lm += gen("age", regex=r"[0-9]+", max_tokens=3)
# ❌ Bad: Conflicting constraints
lm += gen("age", regex=r"[0-9]{2}", max_tokens=10) # max_tokens unnecessary
```
### Performance Benchmarks
**Regex vs Free Generation:**
- Simple regex (digits): ~1.2x slower than free gen
- Complex regex (email): ~1.5x slower than free gen
- Grammar-based: ~2x slower than free gen
**But:**
- 100% valid outputs (vs ~70% with free gen + validation)
- No retry loops needed
- Overall faster end-to-end for structured outputs
**Optimization Tips:**
- Use regex for critical fields only
- Use `select()` for small fixed sets (fastest)
- Use `stop` sequences when possible (faster than max_tokens)
- Cache compiled grammars by reusing functions
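The end-to-end win is easiest to see as the validate-and-retry loop that constrained generation removes. A minimal sketch, using the email pattern from this guide (the candidate strings are made up):

```python
import re

# The email pattern used throughout this guide
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def first_valid(candidates):
    """Mimic free generation + validation: count model calls until one output passes."""
    for calls, text in enumerate(candidates, start=1):
        if EMAIL_RE.fullmatch(text):
            return text, calls
    return None, len(candidates)

# With ~70% validity, retries are common; a regex-constrained gen() is always 1 call.
email, calls = first_valid(["contact me at alice", "alice@example.com"])  # -> ("alice@example.com", 2)
```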
## Resources
- **Token Healing Paper**: https://arxiv.org/abs/2306.17648
- **Guidance Docs**: https://guidance.readthedocs.io
- **GitHub**: https://github.com/guidance-ai/guidance

# Production-Ready Examples
Real-world examples of using Guidance for structured generation, agents, and workflows.
## Table of Contents
- JSON Generation
- Data Extraction
- Classification Systems
- Agent Systems
- Multi-Step Workflows
- Code Generation
- Production Tips
## JSON Generation
### Basic JSON
```python
from guidance import models, gen, guidance
@guidance
def generate_user(lm):
"""Generate valid user JSON."""
lm += "{\n"
lm += ' "name": ' + gen("name", regex=r'"[A-Za-z ]+"') + ",\n"
lm += ' "age": ' + gen("age", regex=r"[0-9]+") + ",\n"
lm += ' "email": ' + gen(
"email",
regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
) + "\n"
lm += "}"
return lm
# Use it
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm += "Generate a user profile:\n"
lm = generate_user(lm)
print(lm)
# Output: Valid JSON guaranteed
```
### Nested JSON
```python
@guidance
def generate_order(lm):
"""Generate nested order JSON."""
lm += "{\n"
# Customer info
lm += ' "customer": {\n'
lm += ' "name": ' + gen("customer_name", regex=r'"[A-Za-z ]+"') + ",\n"
lm += ' "email": ' + gen(
"customer_email",
regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
) + "\n"
lm += " },\n"
# Order details
lm += ' "order": {\n'
lm += ' "id": ' + gen("order_id", regex=r'"ORD-[0-9]{6}"') + ",\n"
lm += ' "date": ' + gen("order_date", regex=r'"\d{4}-\d{2}-\d{2}"') + ",\n"
lm += ' "total": ' + gen("order_total", regex=r"[0-9]+\.[0-9]{2}") + "\n"
lm += " },\n"
# Status
lm += ' "status": ' + gen(
"status",
regex=r'"(pending|processing|shipped|delivered)"'
) + "\n"
lm += "}"
return lm
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = generate_order(lm)
```
### JSON Array
```python
@guidance
def generate_user_list(lm, count=3):
"""Generate JSON array of users."""
lm += "[\n"
for i in range(count):
lm += " {\n"
lm += ' "id": ' + gen(f"id_{i}", regex=r"[0-9]+") + ",\n"
lm += ' "name": ' + gen(f"name_{i}", regex=r'"[A-Za-z ]+"') + ",\n"
lm += ' "active": ' + gen(f"active_{i}", regex=r"(true|false)") + "\n"
lm += " }"
if i < count - 1:
lm += ","
lm += "\n"
lm += "]"
return lm
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = generate_user_list(lm, count=5)
```
### Dynamic JSON Schema
```python
import json
from guidance import models, gen, guidance
@guidance
def json_from_schema(lm, schema):
"""Generate JSON matching a schema."""
lm += "{\n"
fields = list(schema["properties"].items())
for i, (field_name, field_schema) in enumerate(fields):
lm += f' "{field_name}": '
# Handle different types
if field_schema["type"] == "string":
if "pattern" in field_schema:
lm += gen(field_name, regex=f'"{field_schema["pattern"]}"')
else:
lm += gen(field_name, regex=r'"[^"]+"')
elif field_schema["type"] == "number":
lm += gen(field_name, regex=r"[0-9]+(\.[0-9]+)?")
elif field_schema["type"] == "integer":
lm += gen(field_name, regex=r"[0-9]+")
elif field_schema["type"] == "boolean":
lm += gen(field_name, regex=r"(true|false)")
if i < len(fields) - 1:
lm += ","
lm += "\n"
lm += "}"
return lm
# Define schema
schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"},
"score": {"type": "number"},
"active": {"type": "boolean"}
}
}
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = json_from_schema(lm, schema)
```
## Data Extraction
### Extract from Text
```python
from guidance import models, gen, guidance, system, user, assistant
@guidance
def extract_person_info(lm, text):
"""Extract structured info from text."""
lm += f"Text: {text}\n\n"
with assistant():
lm += "Name: " + gen("name", regex=r"[A-Za-z ]+", stop="\n") + "\n"
lm += "Age: " + gen("age", regex=r"[0-9]+", max_tokens=3) + "\n"
lm += "Occupation: " + gen("occupation", regex=r"[A-Za-z ]+", stop="\n") + "\n"
lm += "Email: " + gen(
"email",
regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
stop="\n"
) + "\n"
return lm
text = "John Smith is a 35-year-old software engineer. Contact: john@example.com"
lm = models.Anthropic("claude-sonnet-4-5-20250929")
with system():
lm += "You extract structured information from text."
with user():
lm = extract_person_info(lm, text)
print(f"Name: {lm['name']}")
print(f"Age: {lm['age']}")
print(f"Occupation: {lm['occupation']}")
print(f"Email: {lm['email']}")
```
### Multi-Entity Extraction
```python
@guidance
def extract_entities(lm, text):
"""Extract multiple entity types."""
lm += f"Analyze: {text}\n\n"
# Person entities
lm += "People:\n"
for i in range(3): # Up to 3 people
lm += f"- " + gen(f"person_{i}", regex=r"[A-Za-z ]+", stop="\n") + "\n"
# Organization entities
lm += "\nOrganizations:\n"
for i in range(2): # Up to 2 orgs
lm += f"- " + gen(f"org_{i}", regex=r"[A-Za-z0-9 ]+", stop="\n") + "\n"
# Dates
lm += "\nDates:\n"
for i in range(2): # Up to 2 dates
lm += f"- " + gen(f"date_{i}", regex=r"\d{4}-\d{2}-\d{2}", stop="\n") + "\n"
# Locations
lm += "\nLocations:\n"
for i in range(2): # Up to 2 locations
lm += f"- " + gen(f"location_{i}", regex=r"[A-Za-z ]+", stop="\n") + "\n"
return lm
text = """
Tim Cook and Satya Nadella met at Microsoft headquarters in Redmond on 2024-09-15
to discuss the collaboration between Apple and Microsoft. The meeting continued
in Cupertino on 2024-09-20.
"""
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = extract_entities(lm, text)
```
### Batch Extraction
```python
@guidance
def batch_extract(lm, texts):
"""Extract from multiple texts."""
lm += "Batch Extraction Results:\n\n"
for i, text in enumerate(texts):
lm += f"=== Item {i+1} ===\n"
lm += f"Text: {text}\n"
lm += "Name: " + gen(f"name_{i}", regex=r"[A-Za-z ]+", stop="\n") + "\n"
lm += "Sentiment: " + gen(
f"sentiment_{i}",
regex=r"(positive|negative|neutral)",
stop="\n"
) + "\n\n"
return lm
texts = [
"Alice is happy with the product",
"Bob is disappointed with the service",
"Carol has no strong feelings either way"
]
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = batch_extract(lm, texts)
```
## Classification Systems
### Sentiment Analysis
```python
from guidance import models, select, gen
lm = models.Anthropic("claude-sonnet-4-5-20250929")
text = "This product is absolutely amazing! Best purchase ever."
lm += f"Text: {text}\n\n"
lm += "Sentiment: " + select(
["positive", "negative", "neutral"],
name="sentiment"
)
lm += "\nConfidence: " + gen("confidence", regex=r"[0-9]{1,3}") + "%\n"
lm += "Reasoning: " + gen("reasoning", stop="\n", max_tokens=50)
print(f"Sentiment: {lm['sentiment']}")
print(f"Confidence: {lm['confidence']}%")
print(f"Reasoning: {lm['reasoning']}")
```
### Multi-Label Classification
```python
@guidance
def classify_article(lm, text):
"""Classify article with multiple labels."""
lm += f"Article: {text}\n\n"
# Primary category
lm += "Primary Category: " + select(
["Technology", "Business", "Science", "Politics", "Entertainment"],
name="primary_category"
) + "\n"
# Secondary categories (up to 3)
lm += "\nSecondary Categories:\n"
categories = ["Technology", "Business", "Science", "Politics", "Entertainment"]
for i in range(3):
lm += f"{i+1}. " + select(categories, name=f"secondary_{i}") + "\n"
# Tags
lm += "\nTags: " + gen("tags", stop="\n", max_tokens=50) + "\n"
# Target audience
lm += "Target Audience: " + select(
["General", "Expert", "Beginner"],
name="audience"
)
return lm
article = """
Apple announced new AI features in iOS 18, leveraging machine learning to improve
battery life and performance. The company's stock rose 5% following the announcement.
"""
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = classify_article(lm, article)
```
### Intent Classification
```python
@guidance
def classify_intent(lm, message):
"""Classify user intent."""
lm += f"User Message: {message}\n\n"
# Intent
lm += "Intent: " + select(
["question", "complaint", "request", "feedback", "other"],
name="intent"
) + "\n"
# Urgency
lm += "Urgency: " + select(
["low", "medium", "high", "critical"],
name="urgency"
) + "\n"
# Department
lm += "Route To: " + select(
["support", "sales", "billing", "technical"],
name="department"
) + "\n"
# Sentiment
lm += "Sentiment: " + select(
["positive", "neutral", "negative"],
name="sentiment"
)
return lm
message = "My account was charged twice for the same order. Need help ASAP!"
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = classify_intent(lm, message)
print(f"Intent: {lm['intent']}")
print(f"Urgency: {lm['urgency']}")
print(f"Department: {lm['department']}")
```
## Agent Systems
### ReAct Agent
```python
from guidance import models, gen, select, guidance
@guidance(stateless=False)
def react_agent(lm, question, tools, max_rounds=5):
"""ReAct agent with tool use."""
lm += f"Question: {question}\n\n"
for round in range(max_rounds):
# Thought
lm += f"Thought {round+1}: " + gen("thought", stop="\n", max_tokens=100) + "\n"
# Action selection
lm += "Action: " + select(
list(tools.keys()) + ["answer"],
name="action"
)
if lm["action"] == "answer":
lm += "\n\nFinal Answer: " + gen("answer", max_tokens=200)
break
# Action input
lm += "\nAction Input: " + gen("action_input", stop="\n", max_tokens=100) + "\n"
# Execute tool
if lm["action"] in tools:
try:
result = tools[lm["action"]](lm["action_input"])
lm += f"Observation: {result}\n\n"
except Exception as e:
lm += f"Observation: Error - {str(e)}\n\n"
return lm
# Define tools (demo only: eval() on model output is unsafe in production)
tools = {
    "calculator": lambda expr: eval(expr),
    "search": lambda query: f"Search results for '{query}': [Mock results]",
    "weather": lambda city: f"Weather in {city}: Sunny, 72°F"
}
# Use agent
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = react_agent(lm, "What is (25 * 4) + 10?", tools)
print(lm["answer"])
```
### Multi-Agent System
```python
@guidance
def coordinator_agent(lm, task):
"""Coordinator that delegates to specialists."""
lm += f"Task: {task}\n\n"
# Determine which specialist to use
lm += "Specialist: " + select(
["researcher", "writer", "coder", "analyst"],
name="specialist"
) + "\n"
lm += "Reasoning: " + gen("reasoning", stop="\n", max_tokens=100) + "\n"
return lm
@guidance
def researcher_agent(lm, query):
"""Research specialist."""
lm += f"Research Query: {query}\n\n"
lm += "Findings:\n"
for i in range(3):
lm += f"{i+1}. " + gen(f"finding_{i}", stop="\n", max_tokens=100) + "\n"
return lm
@guidance
def writer_agent(lm, topic):
"""Writing specialist."""
lm += f"Topic: {topic}\n\n"
lm += "Title: " + gen("title", stop="\n", max_tokens=50) + "\n"
lm += "Content:\n" + gen("content", max_tokens=500)
return lm
# Coordination workflow
task = "Write an article about AI safety"
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = coordinator_agent(lm, task)
specialist = lm["specialist"]
if specialist == "researcher":
lm = researcher_agent(lm, task)
elif specialist == "writer":
lm = writer_agent(lm, task)
```
### Tool Use with Validation
```python
@guidance(stateless=False)
def validated_tool_agent(lm, question):
"""Agent with validated tool calls."""
tools = {
"add": lambda a, b: float(a) + float(b),
"multiply": lambda a, b: float(a) * float(b),
"divide": lambda a, b: float(a) / float(b) if float(b) != 0 else "Error: Division by zero"
}
lm += f"Question: {question}\n\n"
for i in range(5):
# Select tool
lm += "Tool: " + select(list(tools.keys()) + ["done"], name="tool")
if lm["tool"] == "done":
lm += "\nAnswer: " + gen("answer", max_tokens=100)
break
# Get validated numeric arguments
lm += "\nArg1: " + gen("arg1", regex=r"-?[0-9]+(\.[0-9]+)?") + "\n"
lm += "Arg2: " + gen("arg2", regex=r"-?[0-9]+(\.[0-9]+)?") + "\n"
# Execute
result = tools[lm["tool"]](lm["arg1"], lm["arg2"])
lm += f"Result: {result}\n\n"
return lm
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = validated_tool_agent(lm, "What is (10 + 5) * 3?")
```
## Multi-Step Workflows
### Chain of Thought
```python
@guidance
def chain_of_thought(lm, question):
"""Multi-step reasoning with CoT."""
lm += f"Question: {question}\n\n"
# Generate reasoning steps
lm += "Let me think step by step:\n\n"
for i in range(4):
lm += f"Step {i+1}: " + gen(f"step_{i+1}", stop="\n", max_tokens=100) + "\n"
# Final answer
lm += "\nTherefore, the answer is: " + gen("answer", stop="\n", max_tokens=50)
return lm
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = chain_of_thought(lm, "If a train travels 60 mph for 2.5 hours, how far does it go?")
print(lm["answer"])
```
### Self-Consistency
```python
@guidance
def self_consistency(lm, question, num_samples=3):
"""Generate multiple reasoning paths and aggregate."""
lm += f"Question: {question}\n\n"
answers = []
for i in range(num_samples):
lm += f"=== Attempt {i+1} ===\n"
lm += "Reasoning: " + gen(f"reasoning_{i}", stop="\n", max_tokens=100) + "\n"
lm += "Answer: " + gen(f"answer_{i}", stop="\n", max_tokens=50) + "\n\n"
answers.append(lm[f"answer_{i}"])
# Aggregate (simple majority vote)
from collections import Counter
most_common = Counter(answers).most_common(1)[0][0]
lm += f"Final Answer (by majority): {most_common}\n"
return lm
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = self_consistency(lm, "What is 15% of 200?")
```
### Planning and Execution
```python
@guidance
def plan_and_execute(lm, goal):
"""Plan tasks then execute them."""
lm += f"Goal: {goal}\n\n"
# Planning phase
lm += "Plan:\n"
num_steps = 4
for i in range(num_steps):
lm += f"{i+1}. " + gen(f"plan_step_{i}", stop="\n", max_tokens=100) + "\n"
# Execution phase
lm += "\nExecution:\n\n"
for i in range(num_steps):
lm += f"Step {i+1}: {lm[f'plan_step_{i}']}\n"
lm += "Status: " + select(["completed", "in-progress", "blocked"], name=f"status_{i}") + "\n"
lm += "Result: " + gen(f"result_{i}", stop="\n", max_tokens=150) + "\n\n"
# Summary
lm += "Summary: " + gen("summary", max_tokens=200)
return lm
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = plan_and_execute(lm, "Build a REST API for a blog platform")
```
## Code Generation
### Python Function
```python
@guidance
def generate_python_function(lm, description):
"""Generate Python function from description."""
lm += f"Description: {description}\n\n"
# Function signature
lm += "def " + gen("func_name", regex=r"[a-z_][a-z0-9_]*") + "("
lm += gen("params", regex=r"[a-z_][a-z0-9_]*(, [a-z_][a-z0-9_]*)*") + "):\n"
# Docstring
lm += ' """' + gen("docstring", stop='"""', max_tokens=100) + '"""\n'
# Function body
lm += " " + gen("body", stop="\n", max_tokens=200) + "\n"
return lm
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = generate_python_function(lm, "Check if a number is prime")
print(lm)
```
### SQL Query
```python
@guidance
def generate_sql(lm, description):
"""Generate SQL query from description."""
lm += f"Description: {description}\n\n"
lm += "SQL Query:\n"
# SELECT clause
lm += "SELECT " + gen("select_clause", stop=" FROM", max_tokens=100)
# FROM clause
lm += " FROM " + gen("from_clause", stop=" WHERE", max_tokens=50)
    # WHERE clause (always emitted in this sketch; drop this line for unfiltered queries)
    lm += " WHERE " + gen("where_clause", stop=";", max_tokens=100) + ";"
return lm
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = generate_sql(lm, "Get all users who signed up in the last 30 days")
```
### API Endpoint
```python
@guidance
def generate_api_endpoint(lm, description):
"""Generate REST API endpoint."""
lm += f"Description: {description}\n\n"
# HTTP method
lm += "Method: " + select(["GET", "POST", "PUT", "DELETE"], name="method") + "\n"
# Path
lm += "Path: /" + gen("path", regex=r"[a-z0-9/-]+", stop="\n") + "\n"
# Request body (if POST/PUT)
if lm["method"] in ["POST", "PUT"]:
lm += "\nRequest Body:\n"
lm += "{\n"
lm += ' "field1": ' + gen("field1", regex=r'"[a-z_]+"') + ",\n"
lm += ' "field2": ' + gen("field2", regex=r'"[a-z_]+"') + "\n"
lm += "}\n"
# Response
lm += "\nResponse (200 OK):\n"
lm += "{\n"
lm += ' "status": "success",\n'
lm += ' "data": ' + gen("response_data", max_tokens=100) + "\n"
lm += "}\n"
return lm
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm = generate_api_endpoint(lm, "Create a new blog post")
```
## Production Tips
### Error Handling
```python
@guidance
def safe_extraction(lm, text):
"""Extract with fallback handling."""
try:
lm += f"Text: {text}\n"
lm += "Name: " + gen("name", regex=r"[A-Za-z ]+", stop="\n", max_tokens=30)
return lm
except Exception as e:
# Fallback to less strict extraction
lm += f"Text: {text}\n"
lm += "Name: " + gen("name", stop="\n", max_tokens=30)
return lm
```
### Caching
```python
from functools import lru_cache
@lru_cache(maxsize=100)
def cached_generation(text):
"""Cache LLM generations."""
lm = models.Anthropic("claude-sonnet-4-5-20250929")
lm += f"Analyze: {text}\n"
lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="sentiment")
return lm["sentiment"]
# First call: hits LLM
result1 = cached_generation("This is great!")
# Second call: returns cached result
result2 = cached_generation("This is great!") # Instant!
```
### Monitoring
```python
import time
@guidance
def monitored_generation(lm, text):
"""Track generation metrics."""
start_time = time.time()
lm += f"Text: {text}\n"
lm += "Analysis: " + gen("analysis", max_tokens=100)
elapsed = time.time() - start_time
# Log metrics
print(f"Generation time: {elapsed:.2f}s")
print(f"Output length: {len(lm['analysis'])} chars")
return lm
```
### Batch Processing
```python
def batch_process(texts, batch_size=10):
    """Process texts in batches."""
    lm = models.Anthropic("claude-sonnet-4-5-20250929")
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for offset, text in enumerate(batch):
            idx = start + offset  # unique capture name per text so results are not overwritten
            lm += f"Text: {text}\n"
            lm += "Sentiment: " + select(
                ["positive", "negative", "neutral"],
                name=f"sentiment_{idx}"
            ) + "\n\n"
            results.append(lm[f"sentiment_{idx}"])
    return results
```
## Resources
- **Guidance Notebooks**: https://github.com/guidance-ai/guidance/tree/main/notebooks
- **Guidance Docs**: https://guidance.readthedocs.io
- **Community Examples**: https://github.com/guidance-ai/guidance/discussions

---
name: llama-cpp
description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [llama-cpp-python]
metadata:
hermes:
tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded]
---
# llama.cpp
Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
## When to use llama.cpp
**Use llama.cpp when:**
- Running on CPU-only machines
- Deploying on Apple Silicon (M1/M2/M3/M4)
- Using AMD or Intel GPUs (no CUDA)
- Edge deployment (Raspberry Pi, embedded systems)
- Need simple deployment without Docker/Python
**Use TensorRT-LLM instead when:**
- Have NVIDIA GPUs (A100/H100)
- Need maximum throughput (100K+ tok/s)
- Running in datacenter with CUDA
**Use vLLM instead when:**
- Have NVIDIA GPUs
- Need Python-first API
- Want PagedAttention
## Quick start
### Installation
```bash
# macOS/Linux
brew install llama.cpp
# Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# With Metal (Apple Silicon)
make LLAMA_METAL=1
# With CUDA (NVIDIA)
make LLAMA_CUDA=1
# With ROCm (AMD)
make LLAMA_HIP=1
```
### Download model
```bash
# Download from HuggingFace (GGUF format)
huggingface-cli download \
TheBloke/Llama-2-7B-Chat-GGUF \
llama-2-7b-chat.Q4_K_M.gguf \
--local-dir models/
# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
```
### Run inference
```bash
# Simple chat
./llama-cli \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
-p "Explain quantum computing" \
-n 256 # Max tokens
# Interactive chat
./llama-cli \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
--interactive
```
### Server mode
```bash
# Start OpenAI-compatible server
./llama-server \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 32 # Offload 32 layers to GPU
# Client request
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2-7b-chat",
"messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7,
"max_tokens": 100
}'
```
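The same request can be made from Python with only the standard library; the base URL and model name below mirror the curl example and are assumptions about your deployment:

```python
import json
import urllib.request

def build_chat_payload(prompt, model="llama-2-7b-chat", temperature=0.7, max_tokens=100):
    """Assemble an OpenAI-style chat payload matching the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:8080"):
    """POST to llama-server's OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

With the server from the snippet above running, `chat("Hello!")` returns the assistant reply as a string.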
## Quantization formats
### GGUF format overview
| Format | Bits | Size (7B) | Speed | Quality | Use Case |
|--------|------|-----------|-------|---------|----------|
| **Q4_K_M** | 4.5 | 4.1 GB | Fast | Good | **Recommended default** |
| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
### Choosing quantization
```bash
# General use (balanced)
Q4_K_M # 4-bit, medium quality
# Maximum speed (more degradation)
Q2_K or Q3_K_M
# Maximum quality (slower)
Q6_K or Q8_0
# Very large models (70B, 405B)
Q3_K_M or Q4_K_S # Lower bits to fit in memory
```
## Hardware acceleration
### Apple Silicon (Metal)
```bash
# Build with Metal
make LLAMA_METAL=1
# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999 # Offload all layers
# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
```
### NVIDIA GPUs (CUDA)
```bash
# Build with CUDA
make LLAMA_CUDA=1
# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers
# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
```
### AMD GPUs (ROCm)
```bash
# Build with ROCm
make LLAMA_HIP=1
# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
```
## Common patterns
### Batch processing
```bash
# Process multiple prompts from file
cat prompts.txt | ./llama-cli \
-m model.gguf \
--batch-size 512 \
-n 100
```
### Constrained generation
```bash
# JSON output with grammar
./llama-cli \
-m model.gguf \
-p "Generate a person: " \
--grammar-file grammars/json.gbnf
# Outputs valid JSON only
```
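GBNF grammars are plain text. A minimal hand-written example (hypothetical filename `yes-no.gbnf`) restricts output to a single word; the shipped `grammars/json.gbnf` uses the same `name ::= alternatives` syntax:

```
# yes-no.gbnf: only "yes" or "no" can be generated
root ::= "yes" | "no"
```

Pass it with `--grammar-file yes-no.gbnf` exactly as in the JSON example above.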
### Context size
```bash
# Increase context (default 512)
./llama-cli \
-m model.gguf \
-c 4096 # 4K context window
# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768 # 32K context
```
## Performance benchmarks
### CPU performance (Llama 2-7B Q4_K_M)
| CPU | Threads | Speed | Cost |
|-----|---------|-------|------|
| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
### GPU acceleration (Llama 2-7B Q4_K_M)
| GPU | Speed | vs CPU | Cost |
|-----|-------|--------|------|
| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
## Supported models
**LLaMA family**:
- Llama 2 (7B, 13B, 70B)
- Llama 3 (8B, 70B, 405B)
- Code Llama
**Mistral family**:
- Mistral 7B
- Mixtral 8x7B, 8x22B
**Other**:
- Falcon, BLOOM, GPT-J
- Phi-3, Gemma, Qwen
- LLaVA (vision), Whisper (audio)
**Find models**: https://huggingface.co/models?library=gguf
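Since llama-cpp-python is a dependency of this skill, GGUF files can also be loaded from Python. A sketch: `load_chat_model` is a hypothetical helper, and the 0.35 GB-per-layer figure in `pick_n_gpu_layers` is a rough assumption for a 7B Q4_K_M model, not a measured constant:

```python
def pick_n_gpu_layers(vram_gb, total_layers=32, gb_per_layer=0.35):
    """Heuristic: offload as many layers as fit in VRAM (0.35 GB/layer assumed)."""
    return min(total_layers, int(vram_gb / gb_per_layer))

def load_chat_model(model_path, vram_gb=8, n_ctx=4096):
    """Load a GGUF model with llama-cpp-python; the import is deferred so the
    helper above works even without the package installed."""
    from llama_cpp import Llama  # pip install llama-cpp-python
    return Llama(
        model_path=model_path,
        n_ctx=n_ctx,
        n_gpu_layers=pick_n_gpu_layers(vram_gb),
    )
```

`load_chat_model("models/llama-2-7b-chat.Q4_K_M.gguf").create_chat_completion(messages=[{"role": "user", "content": "Hello!"}])` then returns an OpenAI-style response dict.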
## References
- **[Quantization Guide](references/quantization.md)** - GGUF formats, conversion, quality comparison
- **[Server Deployment](references/server.md)** - API endpoints, Docker, monitoring
- **[Optimization](references/optimization.md)** - Performance tuning, hybrid CPU+GPU
## Resources
- **GitHub**: https://github.com/ggerganov/llama.cpp
- **Models**: https://huggingface.co/models?library=gguf
- **Discord**: https://discord.gg/llama-cpp

# Performance Optimization Guide
Maximize llama.cpp inference speed and efficiency.
## CPU Optimization
### Thread tuning
```bash
# Set threads (default: physical cores)
./llama-cli -m model.gguf -t 8
# For AMD Ryzen 9 7950X (16 cores, 32 threads)
-t 16 # Best: physical cores
# Avoid hyperthreading (slower for matrix ops)
```
### BLAS acceleration
```bash
# OpenBLAS (faster matrix ops)
make LLAMA_OPENBLAS=1
# BLAS gives 2-3× speedup
```
## GPU Offloading
### Layer offloading
```bash
# Offload 35 layers to GPU (hybrid mode)
./llama-cli -m model.gguf -ngl 35
# Offload all layers
./llama-cli -m model.gguf -ngl 999
# Find optimal value:
# Start with -ngl 999
# If OOM, reduce by 5 until fits
```
### Memory usage
```bash
# Check VRAM usage
nvidia-smi dmon
# Reduce context if needed
./llama-cli -m model.gguf -c 2048 # 2K context instead of 4K
```
## Batch Processing
```bash
# Logical batch size (default: 512); increase for prompt-processing throughput
./llama-cli -m model.gguf -b 1024
# Physical (micro-)batch size on GPU
--ubatch-size 128 # Process 128 tokens per micro-batch
```
## Context Management
```bash
# Default context (512 tokens)
-c 512
# Longer context (slower, more memory)
-c 4096
# Very long context (if model supports)
-c 32768
```
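Longer contexts cost memory mostly through the KV cache. A back-of-envelope estimator, assuming Llama 2-7B shapes (32 layers, 32 KV heads, head dim 128) and an fp16 cache:

```python
def kv_cache_gib(n_ctx, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    """K and V caches: 2 tensors per layer, each n_ctx x n_kv_heads x head_dim elements."""
    total_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
    return total_bytes / 2**30

# Halving -c 4096 to -c 2048 halves this: ~2.0 GiB -> ~1.0 GiB for 7B at fp16.
```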
## Benchmarks
### CPU Performance (Llama 2-7B Q4_K_M)
| Setup | Speed | Notes |
|-------|-------|-------|
| Apple M3 Max | 50 tok/s | Metal acceleration |
| AMD 7950X (16c) | 35 tok/s | OpenBLAS |
| Intel i9-13900K | 30 tok/s | AVX2 |
### GPU Offloading (RTX 4090)
| Layers GPU | Speed | VRAM |
|------------|-------|------|
| 0 (CPU only) | 30 tok/s | 0 GB |
| 20 (hybrid) | 80 tok/s | 8 GB |
| 35 (all) | 120 tok/s | 12 GB |

# GGUF Quantization Guide
Complete guide to GGUF quantization formats and model conversion.
## Quantization Overview
**GGUF** (GPT-Generated Unified Format) - Standard format for llama.cpp models.
### Format Comparison
| Format | Perplexity | Size (7B) | Tokens/sec | Notes |
|--------|------------|-----------|------------|-------|
| FP16 | 5.9565 (baseline) | 13.0 GB | 15 tok/s | Original quality |
| Q8_0 | 5.9584 (+0.03%) | 7.0 GB | 25 tok/s | Nearly lossless |
| **Q6_K** | 5.9642 (+0.13%) | 5.5 GB | 30 tok/s | Best quality/size |
| **Q5_K_M** | 5.9796 (+0.39%) | 4.8 GB | 35 tok/s | Balanced |
| **Q4_K_M** | 6.0565 (+1.68%) | 4.1 GB | 40 tok/s | **Recommended** |
| Q4_K_S | 6.1125 (+2.62%) | 3.9 GB | 42 tok/s | Faster, lower quality |
| Q3_K_M | 6.3184 (+6.07%) | 3.3 GB | 45 tok/s | Small models only |
| Q2_K | 6.8673 (+15.3%) | 2.7 GB | 50 tok/s | Not recommended |
**Recommendation**: Use **Q4_K_M** for best balance of quality and speed.
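The percentage column is simply the relative perplexity increase over the FP16 baseline:

```python
FP16_PPL = 5.9565  # baseline from the table above

def ppl_delta_pct(ppl, baseline=FP16_PPL):
    """Relative perplexity increase versus FP16, as a percentage."""
    return (ppl / baseline - 1) * 100

# Q4_K_M: ppl_delta_pct(6.0565) -> ~1.68%; Q2_K: ppl_delta_pct(6.8673) -> ~15.3%
```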
## Converting Models
### HuggingFace to GGUF
```bash
# 1. Download HuggingFace model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
--local-dir models/llama-2-7b-chat/
# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py \
models/llama-2-7b-chat/ \
--outtype f16 \
--outfile models/llama-2-7b-chat-f16.gguf
# 3. Quantize to Q4_K_M
./llama-quantize \
models/llama-2-7b-chat-f16.gguf \
models/llama-2-7b-chat-Q4_K_M.gguf \
Q4_K_M
```
### Batch quantization
```bash
# Quantize to multiple formats
for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
./llama-quantize \
model-f16.gguf \
model-${quant}.gguf \
$quant
done
```
## K-Quantization Methods
**K-quants** use mixed precision for better quality:
- Attention weights: Higher precision
- Feed-forward weights: Lower precision
**Variants**:
- `_S` (Small): Faster, lower quality
- `_M` (Medium): Balanced (recommended)
- `_L` (Large): Better quality, larger size
**Example**: `Q4_K_M`
- `Q4`: 4-bit quantization
- `K`: Mixed precision method
- `M`: Medium quality
## Quality Testing
```bash
# Calculate perplexity (quality metric)
./llama-perplexity \
-m model.gguf \
-f wikitext-2-raw/wiki.test.raw \
-c 512
# Lower perplexity = better quality
# Baseline (FP16): ~5.96
# Q4_K_M: ~6.06 (+1.7%)
# Q2_K: ~6.87 (+15.3% - too much degradation)
```
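The percentage figures quoted above follow directly from the raw perplexities:

```python
baseline = 5.9565  # FP16 perplexity from the benchmark table
measured = {"Q8_0": 5.9584, "Q4_K_M": 6.0565, "Q2_K": 6.8673}

for name, ppl in measured.items():
    increase_pct = (ppl - baseline) / baseline * 100
    print(f"{name}: +{increase_pct:.2f}%")
# Q8_0: +0.03%, Q4_K_M: +1.68%, Q2_K: +15.29%
```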
## Use Case Guide
### General purpose (chatbots, assistants)
```
Q4_K_M - Best balance
Q5_K_M - If you have extra RAM
```
### Code generation
```
Q5_K_M or Q6_K - Higher precision helps with code
```
### Creative writing
```
Q4_K_M - Sufficient quality
Q3_K_M - Acceptable for draft generation
```
### Technical/medical
```
Q6_K or Q8_0 - Maximum accuracy
```
### Edge devices (Raspberry Pi)
```
Q2_K or Q3_K_S - Fit in limited RAM
```
## Model Size Scaling
### 7B parameter models
| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 2.7 GB | 5 GB |
| Q3_K_M | 3.3 GB | 6 GB |
| Q4_K_M | 4.1 GB | 7 GB |
| Q5_K_M | 4.8 GB | 8 GB |
| Q6_K | 5.5 GB | 9 GB |
| Q8_0 | 7.0 GB | 11 GB |
### 13B parameter models
| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 5.1 GB | 8 GB |
| Q3_K_M | 6.2 GB | 10 GB |
| Q4_K_M | 7.9 GB | 12 GB |
| Q5_K_M | 9.2 GB | 14 GB |
| Q6_K | 10.7 GB | 16 GB |
### 70B parameter models
| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 26 GB | 32 GB |
| Q3_K_M | 32 GB | 40 GB |
| Q4_K_M | 41 GB | 48 GB |
| Q4_K_S | 39 GB | 46 GB |
| Q5_K_M | 48 GB | 56 GB |
**Recommendation for 70B**: Use Q3_K_M or Q4_K_S to fit in consumer hardware.
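A rough rule of thumb behind all three tables: file size ≈ parameters × effective bits-per-weight / 8. The bits-per-weight values below are back-derived from the 7B table (≈6.7B weights), so treat them as approximations rather than official llama.cpp figures:

```python
# Approximate effective bits-per-weight, back-derived from the 7B table above
BPW = {"Q2_K": 3.2, "Q3_K_M": 3.9, "Q4_K_M": 4.9,
       "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.4}

def estimate_size_gb(n_params_billion: float, quant: str) -> float:
    """Rough GGUF file size in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * BPW[quant] / 8

print(f"{estimate_size_gb(6.7, 'Q4_K_M'):.1f} GB")  # ~4.1 GB for a 7B model
print(f"{estimate_size_gb(70, 'Q4_K_M'):.1f} GB")   # ~43 GB for a 70B model
```

Add a few GB of headroom for the KV cache and runtime buffers to get the "RAM needed" column.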
## Finding Pre-Quantized Models
**TheBloke** on HuggingFace:
- https://huggingface.co/TheBloke
- Most models available in all GGUF formats
- No conversion needed
**Example**:
```bash
# Download pre-quantized Llama 2-7B
huggingface-cli download \
TheBloke/Llama-2-7B-Chat-GGUF \
llama-2-7b-chat.Q4_K_M.gguf \
--local-dir models/
```
## Importance Matrices (imatrix)
**What**: Calibration data to improve quantization quality.
**Benefits**:
- 10-20% perplexity improvement with Q4
- Essential for Q3 and below
**Usage**:
```bash
# 1. Generate importance matrix
./llama-imatrix \
-m model-f16.gguf \
-f calibration-data.txt \
-o model.imatrix
# 2. Quantize with imatrix
./llama-quantize \
--imatrix model.imatrix \
model-f16.gguf \
model-Q4_K_M.gguf \
Q4_K_M
```
**Calibration data**:
- Use domain-specific text (e.g., code for code models)
- ~100MB of representative text
- Higher quality data = better quantization
## Troubleshooting
**Model outputs gibberish**:
- Quantization too aggressive (Q2_K)
- Try Q4_K_M or Q5_K_M
- Verify model converted correctly
**Out of memory**:
- Use lower quantization (Q4_K_S instead of Q5_K_M)
- Offload fewer layers to GPU (`-ngl`)
- Use smaller context (`-c 2048`)
**Slow inference**:
- Higher quantization uses more compute
- Q8_0 much slower than Q4_K_M
- Consider speed vs quality trade-off

View File

@@ -0,0 +1,125 @@
# Server Deployment Guide
Production deployment of llama.cpp server with OpenAI-compatible API.
## Server Modes
### llama-server
```bash
# Basic server
./llama-server \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 4096 # Context size
# With GPU acceleration
./llama-server \
-m models/llama-2-70b.Q4_K_M.gguf \
-ngl 40 # Offload 40 layers to GPU
```
## OpenAI-Compatible API
### Chat completions
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2",
"messages": [
{"role": "system", "content": "You are helpful"},
{"role": "user", "content": "Hello"}
],
"temperature": 0.7,
"max_tokens": 100
}'
```
### Streaming
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2",
"messages": [{"role": "user", "content": "Count to 10"}],
"stream": true
}'
```
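The same request bodies can be built programmatically. A minimal sketch that only constructs the JSON; actually sending it (with `urllib` or the `openai` client) assumes a server listening on localhost:8080:

```python
import json

def chat_payload(messages, model="llama-2", temperature=0.7,
                 max_tokens=100, stream=False) -> str:
    """Build the JSON body for /v1/chat/completions, mirroring the curl calls above."""
    return json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": stream,
    })

body = chat_payload([{"role": "user", "content": "Count to 10"}], stream=True)
print(body)
```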
## Docker Deployment
**Dockerfile**:
```dockerfile
# CUDA devel image provides nvcc, required to build with LLAMA_CUDA=1
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y git build-essential
RUN git clone https://github.com/ggerganov/llama.cpp
WORKDIR /llama.cpp
RUN make LLAMA_CUDA=1
COPY models/ /models/
EXPOSE 8080
CMD ["./llama-server", "-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"]
```
**Run**:
```bash
docker build -t llama-cpp:latest .
docker run --gpus all -p 8080:8080 llama-cpp:latest
```
## Monitoring
```bash
# Server metrics endpoint
curl http://localhost:8080/metrics
# Health check
curl http://localhost:8080/health
```
**Metrics**:
- requests_total
- tokens_generated
- prompt_tokens
- completion_tokens
- kv_cache_tokens
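The `/metrics` endpoint returns Prometheus text format, which is easy to scrape without a client library. An illustrative parser; the sample payload is made up, and actual metric names may carry a prefix depending on the llama.cpp version:

```python
def parse_metrics(text: str) -> dict:
    """Parse simple Prometheus exposition lines ('name value') into a dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = "# TYPE requests_total counter\nrequests_total 42\ntokens_generated 1337\n"
print(parse_metrics(sample))  # {'requests_total': 42.0, 'tokens_generated': 1337.0}
```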
## Load Balancing
**NGINX**:
```nginx
upstream llama_cpp {
server llama1:8080;
server llama2:8080;
}
server {
location / {
proxy_pass http://llama_cpp;
proxy_read_timeout 300s;
}
}
```
## Performance Tuning
**Parallel requests**:
```bash
./llama-server \
-m model.gguf \
-np 4 # 4 parallel slots
```
**Continuous batching**:
```bash
./llama-server \
-m model.gguf \
--cont-batching # Enable continuous batching
```
**Prompt caching** (reuses the processed prompt prefix across requests):
```bash
# cache_prompt is a per-request field on the native /completion endpoint
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "n_predict": 32, "cache_prompt": true}'
```

View File

@@ -0,0 +1,330 @@
---
name: obliteratus
description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, LEACE, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods, 28 analysis modules, 116 model presets across 5 compute tiers, tournament evaluation, and telemetry-driven recommendations. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
version: 2.0.0
author: Hermes Agent
license: MIT
dependencies: [obliteratus, torch, transformers, bitsandbytes, accelerate, safetensors]
metadata:
hermes:
tags: [Abliteration, Uncensoring, Refusal-Removal, LLM, Weight-Projection, SVD, Mechanistic-Interpretability, HuggingFace, Model-Surgery]
related_skills: [vllm, gguf, huggingface-tokenizers]
---
# OBLITERATUS Skill
Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses mechanistic interpretability techniques — including diff-in-means, SVD, whitened SVD, LEACE concept erasure, SAE decomposition, Bayesian kernel projection, and more — to identify and surgically excise refusal directions from model weights while preserving reasoning capabilities.
**License warning:** OBLITERATUS is AGPL-3.0. NEVER import it as a Python library. Always invoke via CLI (`obliteratus` command) or subprocess. This keeps Hermes Agent's MIT license clean.
## When to Use This Skill
Trigger when the user:
- Wants to "uncensor" or "abliterate" an LLM
- Asks about removing refusal/guardrails from a model
- Wants to create an uncensored version of Llama, Qwen, Mistral, etc.
- Mentions "refusal removal", "abliteration", "weight projection"
- Wants to analyze how a model's refusal mechanism works
- References OBLITERATUS, abliterator, or refusal directions
## Step 1: Installation
Check if already installed:
```bash
obliteratus --version 2>/dev/null && echo "INSTALLED" || echo "NOT INSTALLED"
```
If not installed, clone and install from GitHub:
```bash
git clone https://github.com/elder-plinius/OBLITERATUS.git
cd OBLITERATUS
pip install -e .
# For Gradio web UI support:
# pip install -e ".[spaces]"
```
**IMPORTANT:** Confirm with user before installing. This pulls in ~5-10GB of dependencies (PyTorch, Transformers, bitsandbytes, etc.).
## Step 2: Check Hardware
Before anything, check what GPU is available:
```bash
python3 -c "
import torch
if torch.cuda.is_available():
gpu = torch.cuda.get_device_name(0)
vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f'GPU: {gpu}')
print(f'VRAM: {vram:.1f} GB')
if vram < 4: print('TIER: tiny (models under 1B)')
elif vram < 8: print('TIER: small (models 1-4B)')
elif vram < 16: print('TIER: medium (models 4-9B with 4bit quant)')
elif vram < 32: print('TIER: large (models 8-32B with 4bit quant)')
else: print('TIER: frontier (models 32B+)')
else:
print('NO GPU - only tiny models (under 1B) on CPU')
"
```
### VRAM Requirements (with 4-bit quantization)
| VRAM | Max Model Size | Example Models |
|:---------|:----------------|:--------------------------------------------|
| CPU only | ~1B params | GPT-2, TinyLlama, SmolLM |
| 4-8 GB | ~4B params | Qwen2.5-1.5B, Phi-3.5 mini, Llama 3.2 3B |
| 8-16 GB | ~9B params | Llama 3.1 8B, Mistral 7B, Gemma 2 9B |
| 24 GB | ~32B params | Qwen3-32B, Command-R, Yi-34B |
| 48 GB+ | ~72B+ params | Qwen2.5-72B, DeepSeek-R1 |
| Multi-GPU| 200B+ params | Llama 3.1 405B, DeepSeek-V3 (685B MoE) |
## Step 3: Browse Available Models & Get Recommendations
```bash
# Browse models by compute tier
obliteratus models --tier medium
# Get architecture info for a specific model
obliteratus info <model_name>
# Get telemetry-driven recommendation for best method & params
obliteratus recommend <model_name>
obliteratus recommend <model_name> --insights # global cross-architecture rankings
```
## Step 4: Choose a Method
### Method Selection Guide
**Default / recommended for most cases: `advanced`.** It uses multi-direction SVD with norm-preserving projection and is well-tested.
| Situation | Recommended Method | Why |
|:----------------------------------|:-------------------|:-----------------------------------------|
| Default / most models | `advanced` | Multi-direction SVD, norm-preserving, reliable |
| Quick test / prototyping | `basic` | Fast, simple, good enough to evaluate |
| Dense model (Llama, Mistral) | `advanced` | Multi-direction, norm-preserving |
| MoE model (DeepSeek, Mixtral) | `nuclear` | Expert-granular, handles MoE complexity |
| Reasoning model (R1 distills) | `surgical` | CoT-aware, preserves chain-of-thought |
| Stubborn refusals persist | `aggressive` | Whitened SVD + head surgery + jailbreak |
| Want reversible changes | steering vectors | Inference-time, reversible (see Analysis section) |
| Maximum quality, time no object | `optimized` | Bayesian search for best parameters |
| Experimental auto-detection | `informed` | Auto-detects alignment type — experimental, may not always outperform advanced |
### 9 CLI Methods
- **basic** — Single refusal direction via diff-in-means. Fast (~5-10 min for 8B).
- **advanced** (DEFAULT, RECOMMENDED) — Multiple SVD directions, norm-preserving projection, 2 refinement passes. Medium speed (~10-20 min).
- **aggressive** — Whitened SVD + jailbreak-contrastive + attention head surgery. Higher risk of coherence damage.
- **spectral_cascade** — DCT frequency-domain decomposition. Research/novel approach.
- **informed** — Runs analysis DURING abliteration to auto-configure. Experimental — slower and less predictable than advanced.
- **surgical** — SAE features + neuron masking + head surgery + per-expert. Very slow (~1-2 hrs). Best for reasoning models.
- **optimized** — Bayesian hyperparameter search (Optuna TPE). Longest runtime but finds optimal parameters.
- **inverted** — Flips the refusal direction. Model becomes actively willing.
- **nuclear** — Maximum force combo for stubborn MoE models. Expert-granular.
### Direction Extraction Methods (--direction-method flag)
- **diff_means** (default) — Simple difference-in-means between refused/complied activations. Robust.
- **svd** — Multi-direction SVD extraction. Better for complex alignment.
- **leace** — LEACE (LEAst-squares Concept Erasure). Closed-form, optimal linear erasure.
### 4 Python-API-Only Methods
(NOT available via CLI — require Python import, which violates AGPL boundary. Mention to user only if they explicitly want to use OBLITERATUS as a library in their own AGPL project.)
- failspy, gabliteration, heretic, rdo
## Step 5: Run Abliteration
### Standard usage
```bash
# Default method (advanced) — recommended for most models
obliteratus obliterate <model_name> --method advanced --output-dir ./abliterated-models
# With 4-bit quantization (saves VRAM)
obliteratus obliterate <model_name> --method advanced --quantization 4bit --output-dir ./abliterated-models
# Large models (70B+) — conservative defaults
obliteratus obliterate <model_name> --method advanced --quantization 4bit --large-model --output-dir ./abliterated-models
```
### Fine-tuning parameters
```bash
obliteratus obliterate <model_name> \
--method advanced \
--direction-method diff_means \
--n-directions 4 \
--refinement-passes 2 \
--regularization 0.1 \
--quantization 4bit \
--output-dir ./abliterated-models \
--contribute # opt-in telemetry for community research
```
### Key flags
| Flag | Description | Default |
|:-----|:------------|:--------|
| `--method` | Abliteration method | advanced |
| `--direction-method` | Direction extraction | diff_means |
| `--n-directions` | Number of refusal directions (1-32) | method-dependent |
| `--refinement-passes` | Iterative passes (1-5) | 2 |
| `--regularization` | Regularization strength (0.0-1.0) | 0.1 |
| `--quantization` | Load in 4bit or 8bit | none (full precision) |
| `--large-model` | Conservative defaults for 120B+ | false |
| `--output-dir` | Where to save the abliterated model | ./obliterated_model |
| `--contribute` | Share anonymized results for research | false |
| `--verify-sample-size` | Number of test prompts for refusal check | 20 |
| `--dtype` | Model dtype (float16, bfloat16) | auto |
### Other execution modes
```bash
# Interactive guided mode (hardware → model → preset)
obliteratus interactive
# Web UI (Gradio)
obliteratus ui --port 7860
# Run a full ablation study from YAML config
obliteratus run config.yaml --preset quick
# Tournament: pit all methods against each other
obliteratus tourney <model_name>
```
## Step 6: Verify Results
After abliteration, check the output metrics:
| Metric | Good Value | Warning |
|:-------|:-----------|:--------|
| Refusal rate | < 5% (ideally ~0%) | > 10% means refusals persist |
| Perplexity change | < 10% increase | > 15% means coherence damage |
| KL divergence | < 0.1 | > 0.5 means significant distribution shift |
| Coherence | High / passes qualitative check | Degraded responses, repetition |
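These thresholds are easy to automate when scripting many runs. A sketch encoding the table above (the threshold values come from this table, not from OBLITERATUS itself):

```python
def check_result(refusal_rate: float, ppl_increase: float, kl_div: float) -> list:
    """Return a list of warnings per the thresholds above (empty list = pass)."""
    warnings = []
    if refusal_rate > 0.10:
        warnings.append("refusals persist: try aggressive or more directions")
    if ppl_increase > 0.15:
        warnings.append("coherence damage: reduce n-directions or raise regularization")
    if kl_div > 0.5:
        warnings.append("significant distribution shift")
    return warnings

print(check_result(0.02, 0.05, 0.08))  # [] -> pass
```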
### If refusals persist (> 10%)
1. Try `aggressive` method
2. Increase `--n-directions` (e.g., 8 or 16)
3. Add `--refinement-passes 3`
4. Try `--direction-method svd` instead of diff_means
### If coherence is damaged (perplexity > 15% increase)
1. Reduce `--n-directions` (try 2)
2. Increase `--regularization` (try 0.3)
3. Reduce `--refinement-passes` to 1
4. Try `basic` method (gentler)
## Step 7: Use the Abliterated Model
The output is a standard HuggingFace model directory.
```bash
# Test locally with transformers
python3 -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('./abliterated-models/<model>')
tokenizer = AutoTokenizer.from_pretrained('./abliterated-models/<model>')
inputs = tokenizer('How do I pick a lock?', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
"
# Upload to HuggingFace Hub
huggingface-cli upload <username>/<model-name>-abliterated ./abliterated-models/<model>
# Serve with vLLM
vllm serve ./abliterated-models/<model>
```
## CLI Command Reference
| Command | Description |
|:--------|:------------|
| `obliteratus obliterate` | Main abliteration command |
| `obliteratus info <model>` | Print model architecture details |
| `obliteratus models --tier <tier>` | Browse curated models by compute tier |
| `obliteratus recommend <model>` | Telemetry-driven method/param suggestion |
| `obliteratus interactive` | Guided setup wizard |
| `obliteratus tourney <model>` | Tournament: all methods head-to-head |
| `obliteratus run <config.yaml>` | Execute ablation study from YAML |
| `obliteratus strategies` | List all registered ablation strategies |
| `obliteratus report <results.json>` | Regenerate visual reports |
| `obliteratus ui` | Launch Gradio web interface |
| `obliteratus aggregate` | Summarize community telemetry data |
## Analysis Modules
OBLITERATUS includes 28 analysis modules for mechanistic interpretability.
See `skill_view(name="obliteratus", file_path="references/analysis-modules.md")` for the full reference.
### Quick analysis commands
```bash
# Run specific analysis modules
obliteratus run analysis-config.yaml --preset quick
# Key modules to run first:
# - alignment_imprint: Fingerprint DPO/RLHF/CAI/SFT alignment method
# - concept_geometry: Single direction vs polyhedral cone
# - logit_lens: Which layer decides to refuse
# - anti_ouroboros: Self-repair risk score
# - causal_tracing: Causally necessary components
```
### Steering Vectors (Reversible Alternative)
Instead of permanent weight modification, use inference-time steering:
```python
# Python API only — for user's own projects
from obliteratus.analysis.steering_vectors import SteeringVectorFactory, SteeringHookManager
```
## Ablation Strategies
Beyond direction-based abliteration, OBLITERATUS includes structural ablation strategies:
- **Embedding Ablation** — Target embedding layer components
- **FFN Ablation** — Feed-forward network block removal
- **Head Pruning** — Attention head pruning
- **Layer Removal** — Full layer removal
List all available: `obliteratus strategies`
## Evaluation
OBLITERATUS includes built-in evaluation tools:
- Refusal rate benchmarking
- Perplexity comparison (before/after)
- LM Eval Harness integration for academic benchmarks
- Head-to-head competitor comparison
- Baseline performance tracking
## Platform Support
- **CUDA** — Full support (NVIDIA GPUs)
- **Apple Silicon (MLX)** — Supported via MLX backend
- **CPU** — Supported for tiny models (< 1B params)
## YAML Config Templates
Load templates for reproducible runs via `skill_view`:
- `templates/abliteration-config.yaml` — Standard single-model config
- `templates/analysis-study.yaml` — Pre-abliteration analysis study
- `templates/batch-abliteration.yaml` — Multi-model batch processing
## Telemetry
OBLITERATUS can optionally contribute anonymized run data to a global research dataset.
Enable with `--contribute` flag. No personal data is collected — only model name, method, metrics.
## Common Pitfalls
1. **Don't use `informed` as default** — it's experimental and slower. Use `advanced` for reliable results.
2. **Models under ~1B respond poorly to abliteration** — their refusal behaviors are shallow and fragmented, making clean direction extraction difficult. Expect partial results (20-40% remaining refusal). Models 3B+ have cleaner refusal directions and respond much better (often 0% refusal with `advanced`).
3. **`aggressive` can make things worse** — on small models it can damage coherence and actually increase refusal rate. Only use it if `advanced` leaves > 10% refusals on a 3B+ model.
4. **Always check perplexity** — if it spikes > 15%, the model is damaged. Reduce aggressiveness.
5. **MoE models need special handling** — use `nuclear` method for Mixtral, DeepSeek-MoE, etc.
6. **Quantized models can't be re-quantized** — abliterate the full-precision model, then quantize the output.
7. **VRAM estimation is approximate** — 4-bit quant helps but peak usage can spike during extraction.
8. **Reasoning models are sensitive** — use `surgical` for R1 distills to preserve chain-of-thought.
9. **Check `obliteratus recommend`** — telemetry data may have better parameters than defaults.
10. **AGPL license** — never `import obliteratus` in MIT/Apache projects. CLI invocation only.
11. **Large models (70B+)** — always use `--large-model` flag for conservative defaults.
12. **Spectral certification RED is common** — the spectral check often flags "incomplete" even when practical refusal rate is 0%. Check actual refusal rate rather than relying on spectral certification alone.
## Complementary Skills
- **vllm** — Serve abliterated models with high throughput
- **gguf** — Convert abliterated models to GGUF for llama.cpp
- **huggingface-tokenizers** — Work with model tokenizers

View File

@@ -0,0 +1,166 @@
# OBLITERATUS Analysis Modules — Reference
OBLITERATUS includes 28 analysis modules for mechanistic interpretability of refusal in LLMs.
These modules help understand how and where refusal behaviors are encoded before performing abliteration.
---
## Core Analysis (Run These First)
### 1. Alignment Imprint Detection (`alignment_imprint.py`)
Fingerprints whether a model was trained via DPO, RLHF, CAI, or SFT.
This determines which extraction strategy will work best.
### 2. Concept Cone Geometry (`concept_geometry.py`)
Determines if refusal is a single linear direction or a polyhedral cone
(set of multiple mechanisms). Single-direction models respond well to `basic`;
polyhedral models need `advanced` or `surgical`.
### 3. Refusal Logit Lens (`logit_lens.py`)
Identifies the specific layer where a model "decides" to refuse by decoding
intermediate layer representations into token space.
### 4. Ouroboros Detection (`anti_ouroboros.py`)
Identifies if a model attempts to "self-repair" refusal behaviors after
excision. Reports a risk score (0-1). High scores mean additional refinement
passes are needed.
### 5. Causal Tracing (`causal_tracing.py`)
Identifies which components (layers, heads, MLPs) are causally necessary
for refusal behavior using activation patching.
---
## Geometric Analysis
### 6. Cross-Layer Alignment (`cross_layer.py`)
Measures how refusal directions align across different layers. High alignment
means the refusal signal is consistent; low alignment suggests layer-specific
mechanisms.
### 7. Residual Stream Decomposition (`residual_stream.py`)
Decomposes the residual stream into attention and MLP contributions to
understand which component type contributes more to refusal.
### 8. Riemannian Manifold Geometry (`riemannian_manifold.py`)
Analyzes the curvature and geometry of the weight manifold near refusal
directions. Informs how aggressively projections can be applied without
damaging the manifold structure.
### 9. Whitened SVD (`whitened_svd.py`)
Covariance-normalized SVD extraction that separates guardrail signals from
natural activation variance. More precise than standard SVD for models with
high activation variance.
### 10. Concept Cone Geometry (extended)
Maps the full polyhedral structure of refusal, including cone angles,
face counts, and intersection patterns.
---
## Probing & Classification
### 11. Activation Probing (`activation_probing.py`)
Post-excision verification — probes for residual refusal concepts after
abliteration to ensure complete removal.
### 12. Probing Classifiers (`probing_classifiers.py`)
Trains linear classifiers to detect refusal in activations. Used both
before (to verify refusal exists) and after (to verify it's gone).
### 13. Activation Patching (`activation_patching.py`)
Interchange interventions — swaps activations between refused and complied
runs to identify causal components.
### 14. Tuned Lens (`tuned_lens.py`)
Trained version of logit lens that provides more accurate per-layer
decoding by learning affine transformations for each layer.
### 15. Multi-Token Position Analysis (`multi_token_position.py`)
Analyzes refusal signals across multiple token positions, not just the
last token. Important for models that distribute refusal across the sequence.
---
## Abliteration & Manipulation
### 16. SAE-Based Abliteration (`sae_abliteration.py`)
Uses Sparse Autoencoder features to identify and remove specific refusal
features. More surgical than direction-based methods.
### 17. Steering Vectors (`steering_vectors.py`)
Creates and applies inference-time steering vectors for reversible refusal
modification. Includes `SteeringVectorFactory` and `SteeringHookManager`.
### 18. LEACE Concept Erasure (`leace.py`)
LEAst-squares Concept Erasure (LEACE): closed-form, mathematically optimal linear
concept removal. Available as both analysis module and direction extraction method.
### 19. Sparse Surgery (`sparse_surgery.py`)
High-precision weight modification targeting individual neurons and
weight matrix entries rather than full directions.
### 20. Conditional Abliteration (`conditional_abliteration.py`)
Targeted removal that only affects specific refusal categories while
preserving others (e.g., remove weapons refusal but keep CSAM refusal).
---
## Transfer & Robustness
### 21. Cross-Model Transfer (`cross_model_transfer.py`)
Tests whether refusal directions extracted from one model transfer to
another architecture. Measures universality of guardrail directions.
### 22. Defense Robustness (`defense_robustness.py`)
Evaluates how robust the abliteration is against various defense mechanisms
and re-alignment attempts.
### 23. Spectral Certification (`spectral_certification.py`)
Provides mathematical bounds on the completeness of refusal removal
using spectral analysis of the projection.
### 24. Wasserstein Optimal Extraction (`wasserstein_optimal.py`)
Uses optimal transport theory for more precise direction extraction
that minimizes distribution shift.
### 25. Wasserstein Transfer (`wasserstein_transfer.py`)
Distribution transfer between models using Wasserstein distance
for cross-architecture refusal direction mapping.
---
## Advanced / Research
### 26. Bayesian Kernel Projection (`bayesian_kernel_projection.py`)
Probabilistic feature mapping that estimates uncertainty in refusal
direction identification.
### 27. Cross-Model Universality Index
Measures if guardrail directions generalize across different model
architectures and training regimes.
### 28. Visualization (`visualization.py`)
Plotting and graphing utilities for all analysis modules. Generates
heatmaps, direction plots, and layer-wise analysis charts.
---
## Running Analysis
### Via CLI
```bash
# Run analysis from a YAML config
obliteratus run analysis-study.yaml --preset quick
# Available study presets:
# quick — Fast sanity check (2-3 modules)
# full — All core + geometric analysis
# jailbreak — Refusal circuit localization
# knowledge — Knowledge preservation analysis
# robustness — Stress testing / defense evaluation
```
### Via YAML Config
See the `templates/analysis-study.yaml` template for a complete example.
Load with: `skill_view(name="obliteratus", file_path="templates/analysis-study.yaml")`

View File

@@ -0,0 +1,141 @@
# OBLITERATUS Methods — Detailed Guide
> The CLI accepts 9 methods via `--method`: basic, advanced, aggressive, spectral_cascade,
> informed, surgical, optimized, inverted, nuclear.
> Four additional methods (failspy, gabliteration, heretic, rdo) are available only via the Python API.
## How Abliteration Works (Theory)
Abliteration identifies a "refusal direction" — a vector in the model's activation space that
corresponds to refusal behavior — and projects it out of the weight matrices.
Mathematically: `W_new = W_old - (W_old @ d @ d.T)` where `d` is the refusal direction.
The key challenge is finding accurate refusal directions without damaging other capabilities.
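The projection itself can be demonstrated in a few lines of NumPy. After removing the rank-1 component along a unit direction `d`, the weights can no longer produce any output along `d`, while the orthogonal complement is untouched (toy 4x4 example, not OBLITERATUS code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.normal(size=4)
d /= np.linalg.norm(d)            # unit "refusal direction"
W = rng.normal(size=(4, 4))       # toy weight matrix

W_new = W - np.outer(W @ d, d)    # W_new = W - (W @ d) d^T

# Everything along d is gone:
print(np.linalg.norm(W_new @ d))  # ~0
```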
---
## Direction Extraction Methods
Before projecting, OBLITERATUS extracts refusal directions using one of three methods:
| Method | Flag | Description | Best For |
|:-------|:-----|:------------|:---------|
| Diff-in-Means | `--direction-method diff_means` | Difference between mean activations on refused vs. complied prompts | Default, fast, robust |
| SVD | `--direction-method svd` | Multi-direction extraction via Singular Value Decomposition | Complex alignment, multiple refusal mechanisms |
| LEACE | `--direction-method leace` | LEAst-squares Concept Erasure — closed-form, mathematically optimal | Maximum precision, research |
---
## Method Details
### basic
- **Directions:** 1 (single diff-in-means vector)
- **Speed:** Fast (~5-10 min for 8B model)
- **Risk:** Low
- **Use case:** Quick tests, prototyping, evaluating if abliteration works for a model
- **How it works:** Extracts one refusal direction and projects it out uniformly across all layers.
### advanced (DEFAULT — RECOMMENDED)
- **Directions:** 4 (multi-direction SVD)
- **Speed:** Medium (~10-20 min for 8B model)
- **Risk:** Low-Medium
- **Refinement passes:** 2
- **Use case:** Default for most models. Well-tested and reliable.
- **How it works:** Extracts multiple refusal directions via SVD, applies norm-preserving bi-projection to maintain weight matrix norms. Two refinement passes catch residual refusal.
### aggressive
- **Directions:** 8+ (whitened SVD + jailbreak-contrastive)
- **Speed:** Medium-Slow
- **Risk:** Medium-High (may damage coherence)
- **Use case:** When `advanced` leaves > 10% refusals. Stubborn models.
- **How it works:** Uses whitened SVD for covariance-normalized extraction, adds jailbreak-contrastive directions, performs attention head surgery on the most refusal-active heads.
### spectral_cascade
- **Speed:** Medium
- **Risk:** Medium
- **Use case:** Research, novel approaches
- **How it works:** DCT (Discrete Cosine Transform) frequency-domain decomposition of refusal signals. Separates high-frequency (surface-level) from low-frequency (deep) refusal patterns.
### informed (EXPERIMENTAL)
- **Speed:** Slow (~20-40 min for 8B model)
- **Risk:** Variable — results depend on analysis quality
- **Use case:** When you want auto-configuration, but be aware this is experimental and may not outperform `advanced`.
- **How it works:** Runs 4 analysis modules first (alignment imprint, concept geometry, logit lens, ouroboros detection), then auto-configures extraction strategy. Includes an "Ouroboros loop" that detects and counteracts self-repair.
- **Note:** The auto-detection can sometimes misconfigure. If results are poor, fall back to `advanced`.
### surgical
- **Speed:** Very slow (~1-2 hrs for 8B model)
- **Risk:** Low (very precise)
- **Use case:** Reasoning models (R1 distills, QwQ, etc.) where chain-of-thought must be preserved.
- **How it works:** Uses SAE (Sparse Autoencoder) features + individual neuron masking + attention head surgery + per-expert decomposition (for MoE). CoT-aware — identifies and protects reasoning-critical directions before projecting.
### optimized
- **Speed:** Very slow (hours — runs many trials)
- **Risk:** Low (finds optimal parameters)
- **Use case:** When quality matters more than speed. Production models.
- **How it works:** Bayesian hyperparameter search via Optuna TPE sampler. Optimizes n_directions, regularization, refinement passes, and layer selection jointly. Evaluates each configuration on refusal rate + perplexity.
### inverted
- **Speed:** Fast
- **Risk:** High (model behavior changes dramatically)
- **Use case:** Research, studying refusal mechanisms
- **How it works:** Instead of projecting out the refusal direction, reflects it. The model actively complies rather than passively not-refusing. Useful for understanding the geometry of alignment.
### nuclear
- **Speed:** Slow
- **Risk:** Medium-High
- **Use case:** Stubborn MoE models (DeepSeek-MoE, Mixtral, etc.)
- **How it works:** Combines expert-granular abliteration (EGA), steering vector injection, attention head pruning, and multi-pass refinement. Decomposes refusal signals into per-expert components for MoE architectures.
---
## Method Selection Flowchart
```
Is this a quick test?
→ YES: basic
→ NO: continue
Is it an MoE model (Mixtral, DeepSeek-MoE)?
→ YES: nuclear
→ NO: continue
Is it a reasoning model (R1, QwQ, CoT-focused)?
→ YES: surgical
→ NO: continue
Do you need the absolute best quality and have time?
→ YES: optimized
→ NO: advanced (recommended default)
Did advanced leave > 10% refusals?
→ YES: aggressive
→ Still refusing: nuclear
```
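When scripting batch runs, the flowchart reduces to a short decision function (a hypothetical helper, not part of the OBLITERATUS CLI):

```python
def choose_method(quick_test=False, is_moe=False, is_reasoning=False,
                  want_best_quality=False, advanced_refusal_rate=None) -> str:
    """Encode the method-selection flowchart above."""
    if quick_test:
        return "basic"
    if is_moe:
        return "nuclear"
    if is_reasoning:
        return "surgical"
    if want_best_quality:
        return "optimized"
    if advanced_refusal_rate is not None and advanced_refusal_rate > 0.10:
        return "aggressive"   # escalate to nuclear if refusals still persist
    return "advanced"

print(choose_method(is_moe=True))  # nuclear
```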
---
## Key Parameters
| Parameter | Range | Default | Effect |
|:----------|:------|:--------|:-------|
| `--n-directions` | 1-32 | method-dependent | More directions = more complete removal, but higher damage risk |
| `--regularization` | 0.0-1.0 | 0.1 | Higher = more conservative (less removal, less damage) |
| `--refinement-passes` | 1-5 | 2 | More passes catch residual refusal, but diminishing returns |
| `--quantization` | 4bit, 8bit | none | Reduces VRAM usage; quality impact minimal for extraction |
| `--verify-sample-size` | 10-200 | 20 | More samples = more accurate refusal rate estimate |
---
## Troubleshooting
| Problem | Likely Cause | Fix |
|:--------|:-------------|:----|
| Refusal rate > 20% | Too few directions | Increase `--n-directions`, try `aggressive` |
| Refusal rate 5-20% | Residual refusal | Add `--refinement-passes 3`, try `--direction-method svd` |
| Perplexity spike > 20% | Over-aggressive removal | Reduce `--n-directions`, increase `--regularization` |
| Repetitive output | Weight matrix damage | Use `basic` with fewer directions, check norm preservation |
| MoE model still refuses | Non-expert-aware method | Switch to `nuclear` |
| Reasoning degraded | CoT directions damaged | Use `surgical` method |
| OOM during extraction | Insufficient VRAM | Add `--quantization 4bit` and/or `--large-model` |


@@ -0,0 +1,33 @@
# OBLITERATUS Abliteration Config
# Usage: obliteratus run this-file.yaml
#
# This is for reproducible, version-controlled abliteration runs.
# For one-off usage, the CLI flags are simpler.
# Model to abliterate
model:
name: "meta-llama/Llama-3.1-8B-Instruct"
dtype: "bfloat16" # float16, bfloat16, float32
quantization: null # null, "4bit", "8bit"
device: "auto" # auto, cuda, cuda:0, cpu
# Abliteration method and parameters
abliteration:
method: "informed" # See SKILL.md Step 4 for all 13 methods
n_directions: null # null = auto-detect, or integer (e.g., 8)
regularization: 0.0 # 0.0-1.0, fraction of original to preserve
refinement_passes: 1 # Iterative passes (increase for self-repair)
norm_preserve: true # Keep weight norms intact after projection
# Output
output:
directory: "./abliterated-models"
save_metadata: true # Save abliteration_metadata.json alongside model
contribute: false # Save community contribution data
# Verification
verify:
enabled: true
test_prompts: null # null = use built-in test prompts
compute_perplexity: true
compute_kl: true


@@ -0,0 +1,40 @@
# OBLITERATUS Analysis Study Config
# Usage: obliteratus run this-file.yaml --preset jailbreak
#
# Run analysis modules to understand refusal geometry BEFORE abliterating.
# Useful for research or when you want to understand what you're removing.
# Model to analyze
model:
name: "meta-llama/Llama-3.1-8B-Instruct"
dtype: "bfloat16"
quantization: "4bit" # Saves VRAM for analysis
device: "auto"
# Study configuration
study:
# Available presets: quick, full, attention, jailbreak, guardrail, knowledge
preset: "jailbreak"
# Or specify individual strategies:
# strategies:
# - layer_removal
# - head_pruning
# - ffn_ablation
# - embedding_ablation
# Analysis modules to run (subset of the 27 available)
analysis:
- alignment_imprint # Detect DPO/RLHF/CAI/SFT training method
- concept_geometry # Map refusal cone geometry
- logit_lens # Find which layer decides to refuse
- anti_ouroboros # Detect self-repair tendency
- cross_layer # Cross-layer alignment clustering
- causal_tracing # Causal necessity of components
- residual_stream # Attention vs MLP contribution
# Output
output:
directory: "./analysis-results"
save_plots: true # Generate matplotlib visualizations
save_report: true # Generate markdown report


@@ -0,0 +1,41 @@
# OBLITERATUS Batch Abliteration Config
# Abliterate multiple models with the same method for comparison.
#
# Run each one sequentially:
# for model in models; do obliteratus obliterate $model --method informed; done
#
# Or use this as a reference for which models to process.
# Common settings
defaults:
method: "informed"
quantization: "4bit"
output_dir: "./abliterated-models"
# Models to process (grouped by compute tier)
models:
# Small (4-8 GB VRAM)
small:
- "Qwen/Qwen2.5-1.5B-Instruct"
- "microsoft/Phi-3.5-mini-instruct"
- "meta-llama/Llama-3.2-3B-Instruct"
# Medium (8-16 GB VRAM)
medium:
- "meta-llama/Llama-3.1-8B-Instruct"
- "mistralai/Mistral-7B-Instruct-v0.3"
- "google/gemma-2-9b-it"
- "Qwen/Qwen2.5-7B-Instruct"
# Large (24 GB VRAM, 4-bit quantization)
large:
- "Qwen/Qwen2.5-14B-Instruct"
- "Qwen/Qwen3-32B"
- "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
# Per-model method overrides (optional)
overrides:
"deepseek-ai/DeepSeek-R1-Distill-Qwen-32B":
method: "surgical" # CoT-aware for reasoning models
"mistralai/Mixtral-8x7B-Instruct-v0.1":
method: "nuclear" # Expert-granular for MoE models


@@ -0,0 +1,655 @@
---
name: outlines
description: Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines - dottxt.ai's structured generation library
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [outlines, transformers, vllm, pydantic]
metadata:
hermes:
tags: [Prompt Engineering, Outlines, Structured Generation, JSON Schema, Pydantic, Local Models, Grammar-Based Generation, vLLM, Transformers, Type Safety]
---
# Outlines: Structured Text Generation
## When to Use This Skill
Use Outlines when you need to:
- **Guarantee valid JSON/XML/code** structure during generation
- **Use Pydantic models** for type-safe outputs
- **Support local models** (Transformers, llama.cpp, vLLM)
- **Maximize inference speed** with zero-overhead structured generation
- **Generate against JSON schemas** automatically
- **Control token sampling** at the grammar level
**GitHub Stars**: 8,000+ | **From**: dottxt.ai (formerly .txt)
## Installation
```bash
# Base installation
pip install outlines
# With specific backends
pip install outlines transformers # Hugging Face models
pip install outlines llama-cpp-python # llama.cpp
pip install outlines vllm # vLLM for high-throughput
```
## Quick Start
### Basic Example: Classification
```python
import outlines
from typing import Literal
# Load model
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
# Generate with type constraint
prompt = "Sentiment of 'This product is amazing!': "
generator = outlines.generate.choice(model, ["positive", "negative", "neutral"])
sentiment = generator(prompt)
print(sentiment) # "positive" (guaranteed one of these)
```
### With Pydantic Models
```python
from pydantic import BaseModel
import outlines
class User(BaseModel):
name: str
age: int
email: str
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
# Generate structured output
prompt = "Extract user: John Doe, 30 years old, john@example.com"
generator = outlines.generate.json(model, User)
user = generator(prompt)
print(user.name) # "John Doe"
print(user.age) # 30
print(user.email) # "john@example.com"
```
## Core Concepts
### 1. Constrained Token Sampling
Outlines uses Finite State Machines (FSM) to constrain token generation at the logit level.
**How it works:**
1. Convert schema (JSON/Pydantic/regex) to context-free grammar (CFG)
2. Transform CFG into Finite State Machine (FSM)
3. Filter invalid tokens at each step during generation
4. Fast-forward when only one valid token exists
**Benefits:**
- **Zero overhead**: Filtering happens at token level
- **Speed improvement**: Fast-forward through deterministic paths
- **Guaranteed validity**: Invalid outputs impossible
```python
import outlines
from pydantic import BaseModel
# Pydantic model -> JSON schema -> CFG -> FSM
class Person(BaseModel):
name: str
age: int
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
# Behind the scenes:
# 1. Person -> JSON schema
# 2. JSON schema -> CFG
# 3. CFG -> FSM
# 4. FSM filters tokens during generation
generator = outlines.generate.json(model, Person)
result = generator("Generate person: Alice, 25")
```
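The token filtering in step 3 amounts to masking the logits of tokens the FSM disallows in the current state, so they can never be sampled. A minimal sketch of that masking step (illustrative only, not Outlines' actual internals):

```python
import math

def mask_logits(logits: list[float], allowed_token_ids: set[int]) -> list[float]:
    """Set logits of FSM-disallowed tokens to -inf; softmax then gives them probability 0."""
    return [l if i in allowed_token_ids else -math.inf
            for i, l in enumerate(logits)]

# Suppose the FSM state allows only tokens 1 and 3 (e.g. '{' or whitespace when starting JSON)
masked = mask_logits([0.5, 2.0, 1.1, 0.3], {1, 3})
# Only indices 1 and 3 remain finite; all other tokens are impossible to sample
```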
### 2. Structured Generators
Outlines provides specialized generators for different output types.
#### Choice Generator
```python
# Multiple choice selection
generator = outlines.generate.choice(
model,
["positive", "negative", "neutral"]
)
sentiment = generator("Review: This is great!")
# Result: One of the three choices
```
#### JSON Generator
```python
from pydantic import BaseModel
class Product(BaseModel):
name: str
price: float
in_stock: bool
# Generate valid JSON matching schema
generator = outlines.generate.json(model, Product)
product = generator("Extract: iPhone 15, $999, available")
# Guaranteed valid Product instance
print(type(product)) # <class '__main__.Product'>
```
#### Regex Generator
```python
# Generate text matching regex
generator = outlines.generate.regex(
model,
r"[0-9]{3}-[0-9]{3}-[0-9]{4}" # Phone number pattern
)
phone = generator("Generate phone number:")
# Result: "555-123-4567" (guaranteed to match pattern)
```
#### Integer/Float Generators
```python
# Generate specific numeric types
int_generator = outlines.generate.integer(model)
age = int_generator("Person's age:") # Guaranteed integer
float_generator = outlines.generate.float(model)
price = float_generator("Product price:") # Guaranteed float
```
### 3. Model Backends
Outlines supports multiple local and API-based backends.
#### Transformers (Hugging Face)
```python
import outlines
# Load from Hugging Face
model = outlines.models.transformers(
"microsoft/Phi-3-mini-4k-instruct",
device="cuda" # Or "cpu"
)
# Use with any generator
generator = outlines.generate.json(model, YourModel)
```
#### llama.cpp
```python
# Load GGUF model
model = outlines.models.llamacpp(
"./models/llama-3.1-8b-instruct.Q4_K_M.gguf",
n_gpu_layers=35
)
generator = outlines.generate.json(model, YourModel)
```
#### vLLM (High Throughput)
```python
# For production deployments
model = outlines.models.vllm(
"meta-llama/Llama-3.1-8B-Instruct",
tensor_parallel_size=2 # Multi-GPU
)
generator = outlines.generate.json(model, YourModel)
```
#### OpenAI (Limited Support)
```python
# Basic OpenAI support
model = outlines.models.openai(
"gpt-4o-mini",
api_key="your-api-key"
)
# Note: Some features limited with API models
generator = outlines.generate.json(model, YourModel)
```
### 4. Pydantic Integration
Outlines has first-class Pydantic support with automatic schema translation.
#### Basic Models
```python
from pydantic import BaseModel, Field
class Article(BaseModel):
title: str = Field(description="Article title")
author: str = Field(description="Author name")
word_count: int = Field(description="Number of words", gt=0)
tags: list[str] = Field(description="List of tags")
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, Article)
article = generator("Generate article about AI")
print(article.title)
print(article.word_count) # Guaranteed > 0
```
#### Nested Models
```python
class Address(BaseModel):
street: str
city: str
country: str
class Person(BaseModel):
name: str
age: int
address: Address # Nested model
generator = outlines.generate.json(model, Person)
person = generator("Generate person in New York")
print(person.address.city) # "New York"
```
#### Enums and Literals
```python
from enum import Enum
from typing import Literal
class Status(str, Enum):
PENDING = "pending"
APPROVED = "approved"
REJECTED = "rejected"
class Application(BaseModel):
applicant: str
status: Status # Must be one of enum values
priority: Literal["low", "medium", "high"] # Must be one of literals
generator = outlines.generate.json(model, Application)
app = generator("Generate application")
print(app.status) # Status.PENDING (or APPROVED/REJECTED)
```
## Common Patterns
### Pattern 1: Data Extraction
```python
from pydantic import BaseModel
import outlines
class CompanyInfo(BaseModel):
name: str
founded_year: int
industry: str
employees: int
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, CompanyInfo)
text = """
Apple Inc. was founded in 1976 in the technology industry.
The company employs approximately 164,000 people worldwide.
"""
prompt = f"Extract company information:\n{text}\n\nCompany:"
company = generator(prompt)
print(f"Name: {company.name}")
print(f"Founded: {company.founded_year}")
print(f"Industry: {company.industry}")
print(f"Employees: {company.employees}")
```
### Pattern 2: Classification
```python
from typing import Literal
from pydantic import BaseModel
import outlines
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
# Binary classification
generator = outlines.generate.choice(model, ["spam", "not_spam"])
result = generator("Email: Buy now! 50% off!")
# Multi-class classification
categories = ["technology", "business", "sports", "entertainment"]
category_gen = outlines.generate.choice(model, categories)
category = category_gen("Article: Apple announces new iPhone...")
# With confidence
class Classification(BaseModel):
label: Literal["positive", "negative", "neutral"]
confidence: float
classifier = outlines.generate.json(model, Classification)
result = classifier("Review: This product is okay, nothing special")
```
### Pattern 3: Structured Forms
```python
from pydantic import BaseModel
import outlines

class UserProfile(BaseModel):
full_name: str
age: int
email: str
phone: str
country: str
interests: list[str]
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, UserProfile)
prompt = """
Extract user profile from:
Name: Alice Johnson
Age: 28
Email: alice@example.com
Phone: 555-0123
Country: USA
Interests: hiking, photography, cooking
"""
profile = generator(prompt)
print(profile.full_name)
print(profile.interests) # ["hiking", "photography", "cooking"]
```
### Pattern 4: Multi-Entity Extraction
```python
from typing import Literal
from pydantic import BaseModel
import outlines

class Entity(BaseModel):
name: str
type: Literal["PERSON", "ORGANIZATION", "LOCATION"]
class DocumentEntities(BaseModel):
entities: list[Entity]
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, DocumentEntities)
text = "Tim Cook met with Satya Nadella at Microsoft headquarters in Redmond."
prompt = f"Extract entities from: {text}"
result = generator(prompt)
for entity in result.entities:
print(f"{entity.name} ({entity.type})")
```
### Pattern 5: Code Generation
```python
from pydantic import BaseModel
import outlines

class PythonFunction(BaseModel):
function_name: str
parameters: list[str]
docstring: str
body: str
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, PythonFunction)
prompt = "Generate a Python function to calculate factorial"
func = generator(prompt)
print(f"def {func.function_name}({', '.join(func.parameters)}):")
print(f' """{func.docstring}"""')
print(f" {func.body}")
```
### Pattern 6: Batch Processing
```python
import outlines
from pydantic import BaseModel

def batch_extract(texts: list[str], schema: type[BaseModel]):
"""Extract structured data from multiple texts."""
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, schema)
results = []
for text in texts:
result = generator(f"Extract from: {text}")
results.append(result)
return results
class Person(BaseModel):
name: str
age: int
texts = [
"John is 30 years old",
"Alice is 25 years old",
"Bob is 40 years old"
]
people = batch_extract(texts, Person)
for person in people:
print(f"{person.name}: {person.age}")
```
## Backend Configuration
### Transformers
```python
import outlines
# Basic usage
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
# GPU configuration
model = outlines.models.transformers(
"microsoft/Phi-3-mini-4k-instruct",
device="cuda",
model_kwargs={"torch_dtype": "float16"}
)
# Popular models
model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.3")
model = outlines.models.transformers("Qwen/Qwen2.5-7B-Instruct")
```
### llama.cpp
```python
# Load GGUF model
model = outlines.models.llamacpp(
"./models/llama-3.1-8b.Q4_K_M.gguf",
n_ctx=4096, # Context window
n_gpu_layers=35, # GPU layers
n_threads=8 # CPU threads
)
# Full GPU offload
model = outlines.models.llamacpp(
"./models/model.gguf",
n_gpu_layers=-1 # All layers on GPU
)
```
### vLLM (Production)
```python
# Single GPU
model = outlines.models.vllm("meta-llama/Llama-3.1-8B-Instruct")
# Multi-GPU
model = outlines.models.vllm(
"meta-llama/Llama-3.1-70B-Instruct",
tensor_parallel_size=4 # 4 GPUs
)
# With quantization
model = outlines.models.vllm(
"meta-llama/Llama-3.1-8B-Instruct",
quantization="awq" # Or "gptq"
)
```
## Best Practices
### 1. Use Specific Types
```python
# ✅ Good: Specific types
class Product(BaseModel):
name: str
price: float # Not str
quantity: int # Not str
in_stock: bool # Not str
# ❌ Bad: Everything as string
class Product(BaseModel):
name: str
price: str # Should be float
quantity: str # Should be int
```
### 2. Add Constraints
```python
from pydantic import Field
# ✅ Good: With constraints
class User(BaseModel):
name: str = Field(min_length=1, max_length=100)
age: int = Field(ge=0, le=120)
email: str = Field(pattern=r"^[\w\.-]+@[\w\.-]+\.\w+$")
# ❌ Bad: No constraints
class User(BaseModel):
name: str
age: int
email: str
```
### 3. Use Enums for Categories
```python
# ✅ Good: Enum for fixed set
class Priority(str, Enum):
LOW = "low"
MEDIUM = "medium"
HIGH = "high"
class Task(BaseModel):
title: str
priority: Priority
# ❌ Bad: Free-form string
class Task(BaseModel):
title: str
priority: str # Can be anything
```
### 4. Provide Context in Prompts
```python
# ✅ Good: Clear context
prompt = """
Extract product information from the following text.
Text: iPhone 15 Pro costs $999 and is currently in stock.
Product:
"""
# ❌ Bad: Minimal context
prompt = "iPhone 15 Pro costs $999 and is currently in stock."
```
### 5. Handle Optional Fields
```python
from typing import Optional
# ✅ Good: Optional fields for incomplete data
class Article(BaseModel):
title: str # Required
author: Optional[str] = None # Optional
date: Optional[str] = None # Optional
tags: list[str] = [] # Default empty list
# Can succeed even if author/date missing
```
## Comparison to Alternatives
| Feature | Outlines | Instructor | Guidance | LMQL |
|---------|----------|------------|----------|------|
| Pydantic Support | ✅ Native | ✅ Native | ❌ No | ❌ No |
| JSON Schema | ✅ Yes | ✅ Yes | ⚠️ Limited | ✅ Yes |
| Regex Constraints | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes |
| Local Models | ✅ Full | ⚠️ Limited | ✅ Full | ✅ Full |
| API Models | ⚠️ Limited | ✅ Full | ✅ Full | ✅ Full |
| Zero Overhead | ✅ Yes | ❌ No | ⚠️ Partial | ✅ Yes |
| Automatic Retrying | ❌ No | ✅ Yes | ❌ No | ❌ No |
| Learning Curve | Low | Low | Low | High |
**When to choose Outlines:**
- Using local models (Transformers, llama.cpp, vLLM)
- Need maximum inference speed
- Want Pydantic model support
- Require zero-overhead structured generation
- Need to control the token sampling process
**When to choose alternatives:**
- Instructor: Need API models with automatic retrying
- Guidance: Need token healing and complex workflows
- LMQL: Prefer declarative query syntax
## Performance Characteristics
**Speed:**
- **Zero overhead**: Structured generation as fast as unconstrained
- **Fast-forward optimization**: Skips deterministic tokens
- **1.2-2x faster** than post-generation validation approaches
**Memory:**
- FSM compiled once per schema (cached)
- Minimal runtime overhead
- Efficient with vLLM for high throughput
**Accuracy:**
- **100% valid outputs** (guaranteed by FSM)
- No retry loops needed
- Deterministic token filtering
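The fast-forward optimization mentioned above can be sketched as: whenever the FSM admits exactly one next token, emit it directly and skip the model forward pass. This is illustrative pseudocode of the idea, not Outlines' implementation:

```python
def generate_with_fast_forward(next_allowed, sample_from_model, eos_id):
    """next_allowed(tokens) -> set of token ids the FSM permits next.
    sample_from_model(tokens) -> a token id chosen from that set
    (the expensive model call we want to skip when possible)."""
    tokens, model_calls = [], 0
    while True:
        allowed = next_allowed(tokens)
        if len(allowed) == 1:
            tok = next(iter(allowed))      # deterministic: no model call
        else:
            tok = sample_from_model(tokens)
            model_calls += 1
        if tok == eos_id:
            return tokens, model_calls
        tokens.append(tok)

# Toy FSM: the output must be exactly the token sequence [5, 7], then end
fsm = lambda toks: {5} if len(toks) == 0 else ({7} if len(toks) == 1 else {0})
out, calls = generate_with_fast_forward(fsm, lambda toks: 0, eos_id=0)
# The whole path is deterministic, so zero model calls are needed
```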
## Resources
- **Documentation**: https://outlines-dev.github.io/outlines
- **GitHub**: https://github.com/outlines-dev/outlines (8k+ stars)
- **Discord**: https://discord.gg/R9DSu34mGd
- **Blog**: https://blog.dottxt.co
## See Also
- `references/json_generation.md` - Comprehensive JSON and Pydantic patterns
- `references/backends.md` - Backend-specific configuration
- `references/examples.md` - Production-ready examples


@@ -0,0 +1,615 @@
# Backend Configuration Guide
Complete guide to configuring Outlines with different model backends.
## Table of Contents
- Local Models (Transformers, llama.cpp, vLLM)
- API Models (OpenAI)
- Performance Comparison
- Configuration Examples
- Production Deployment
## Transformers (Hugging Face)
### Basic Setup
```python
import outlines
# Load model from Hugging Face
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
# Use with generator
generator = outlines.generate.json(model, YourModel)
result = generator("Your prompt")
```
### GPU Configuration
```python
# Use CUDA GPU
model = outlines.models.transformers(
"microsoft/Phi-3-mini-4k-instruct",
device="cuda"
)
# Use specific GPU
model = outlines.models.transformers(
"microsoft/Phi-3-mini-4k-instruct",
device="cuda:0" # GPU 0
)
# Use CPU
model = outlines.models.transformers(
"microsoft/Phi-3-mini-4k-instruct",
device="cpu"
)
# Use Apple Silicon MPS
model = outlines.models.transformers(
"microsoft/Phi-3-mini-4k-instruct",
device="mps"
)
```
### Advanced Configuration
```python
# FP16 for faster inference
model = outlines.models.transformers(
"microsoft/Phi-3-mini-4k-instruct",
device="cuda",
model_kwargs={
"torch_dtype": "float16"
}
)
# 8-bit quantization (less memory)
model = outlines.models.transformers(
"microsoft/Phi-3-mini-4k-instruct",
device="cuda",
model_kwargs={
"load_in_8bit": True,
"device_map": "auto"
}
)
# 4-bit quantization (even less memory)
model = outlines.models.transformers(
"meta-llama/Llama-3.1-70B-Instruct",
device="cuda",
model_kwargs={
"load_in_4bit": True,
"device_map": "auto",
"bnb_4bit_compute_dtype": "float16"
}
)
# Multi-GPU
model = outlines.models.transformers(
"meta-llama/Llama-3.1-70B-Instruct",
device="cuda",
model_kwargs={
"device_map": "auto", # Automatic GPU distribution
"max_memory": {0: "40GB", 1: "40GB"} # Per-GPU limits
}
)
```
### Popular Models
```python
# Phi family (Microsoft)
model = outlines.models.transformers("microsoft/Phi-4-mini-instruct")
model = outlines.models.transformers("microsoft/Phi-3-medium-4k-instruct")
# Llama 3.1 (Meta)
model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
model = outlines.models.transformers("meta-llama/Llama-3.1-70B-Instruct")
model = outlines.models.transformers("meta-llama/Llama-3.1-405B-Instruct")
# Mistral (Mistral AI)
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.3")
model = outlines.models.transformers("mistralai/Mixtral-8x7B-Instruct-v0.1")
model = outlines.models.transformers("mistralai/Mixtral-8x22B-Instruct-v0.1")
# Qwen (Alibaba)
model = outlines.models.transformers("Qwen/Qwen2.5-7B-Instruct")
model = outlines.models.transformers("Qwen/Qwen2.5-14B-Instruct")
model = outlines.models.transformers("Qwen/Qwen2.5-72B-Instruct")
# Gemma (Google)
model = outlines.models.transformers("google/gemma-2-9b-it")
model = outlines.models.transformers("google/gemma-2-27b-it")
# Llava (Vision)
model = outlines.models.transformers("llava-hf/llava-v1.6-mistral-7b-hf")
```
### Custom Model Loading
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import outlines
# Load model manually
tokenizer = AutoTokenizer.from_pretrained("your-model")
model_hf = AutoModelForCausalLM.from_pretrained(
"your-model",
device_map="auto",
torch_dtype="float16"
)
# Use with Outlines
model = outlines.models.transformers(
model=model_hf,
tokenizer=tokenizer
)
```
## llama.cpp
### Basic Setup
```python
import outlines
# Load GGUF model
model = outlines.models.llamacpp(
"./models/llama-3.1-8b-instruct.Q4_K_M.gguf",
n_ctx=4096 # Context window
)
# Use with generator
generator = outlines.generate.json(model, YourModel)
```
### GPU Configuration
```python
# CPU only
model = outlines.models.llamacpp(
"./models/model.gguf",
n_ctx=4096,
n_threads=8 # Use 8 CPU threads
)
# GPU offload (partial)
model = outlines.models.llamacpp(
"./models/model.gguf",
n_ctx=4096,
n_gpu_layers=35, # Offload 35 layers to GPU
n_threads=4 # CPU threads for remaining layers
)
# Full GPU offload
model = outlines.models.llamacpp(
"./models/model.gguf",
n_ctx=8192,
n_gpu_layers=-1 # All layers on GPU
)
```
### Advanced Configuration
```python
model = outlines.models.llamacpp(
"./models/llama-3.1-8b.Q4_K_M.gguf",
n_ctx=8192, # Context window (tokens)
n_gpu_layers=35, # GPU layers
n_threads=8, # CPU threads
n_batch=512, # Batch size for prompt processing
use_mmap=True, # Memory-map model file (faster loading)
use_mlock=False, # Lock model in RAM (prevents swapping)
seed=42, # Random seed for reproducibility
verbose=False # Suppress verbose output
)
```
### Quantization Formats
```python
# Q4_K_M (4-bit, recommended for most cases)
# - Size: ~4.5GB for 7B model
# - Quality: Good
# - Speed: Fast
model = outlines.models.llamacpp("./models/model.Q4_K_M.gguf")
# Q5_K_M (5-bit, better quality)
# - Size: ~5.5GB for 7B model
# - Quality: Very good
# - Speed: Slightly slower than Q4
model = outlines.models.llamacpp("./models/model.Q5_K_M.gguf")
# Q6_K (6-bit, high quality)
# - Size: ~6.5GB for 7B model
# - Quality: Excellent
# - Speed: Slower than Q5
model = outlines.models.llamacpp("./models/model.Q6_K.gguf")
# Q8_0 (8-bit, near-original quality)
# - Size: ~8GB for 7B model
# - Quality: Near FP16
# - Speed: Slower than Q6
model = outlines.models.llamacpp("./models/model.Q8_0.gguf")
# F16 (16-bit float, original quality)
# - Size: ~14GB for 7B model
# - Quality: Original
# - Speed: Slowest
model = outlines.models.llamacpp("./models/model.F16.gguf")
```
### Popular GGUF Models
```python
# Llama 3.1
model = outlines.models.llamacpp("llama-3.1-8b-instruct.Q4_K_M.gguf")
model = outlines.models.llamacpp("llama-3.1-70b-instruct.Q4_K_M.gguf")
# Mistral
model = outlines.models.llamacpp("mistral-7b-instruct-v0.3.Q4_K_M.gguf")
# Phi-4
model = outlines.models.llamacpp("phi-4-mini-instruct.Q4_K_M.gguf")
# Qwen
model = outlines.models.llamacpp("qwen2.5-7b-instruct.Q4_K_M.gguf")
```
### Apple Silicon Optimization
```python
# Optimized for M1/M2/M3 Macs
model = outlines.models.llamacpp(
"./models/llama-3.1-8b.Q4_K_M.gguf",
n_ctx=4096,
n_gpu_layers=-1, # Use Metal GPU acceleration
use_mmap=True, # Efficient memory mapping
n_threads=8 # Use performance cores
)
```
## vLLM (Production)
### Basic Setup
```python
import outlines
# Load model with vLLM
model = outlines.models.vllm("meta-llama/Llama-3.1-8B-Instruct")
# Use with generator
generator = outlines.generate.json(model, YourModel)
```
### Single GPU
```python
model = outlines.models.vllm(
"meta-llama/Llama-3.1-8B-Instruct",
gpu_memory_utilization=0.9, # Use 90% of GPU memory
max_model_len=4096 # Max sequence length
)
```
### Multi-GPU
```python
# Tensor parallelism (split model across GPUs)
model = outlines.models.vllm(
"meta-llama/Llama-3.1-70B-Instruct",
tensor_parallel_size=4, # Use 4 GPUs
gpu_memory_utilization=0.9
)
# Pipeline parallelism (rare, for very large models)
model = outlines.models.vllm(
"meta-llama/Llama-3.1-405B-Instruct",
pipeline_parallel_size=8, # 8-GPU pipeline
tensor_parallel_size=4 # 4-GPU tensor split
# Total: 32 GPUs
)
```
### Quantization
```python
# AWQ quantization (4-bit)
model = outlines.models.vllm(
"meta-llama/Llama-3.1-8B-Instruct",
quantization="awq",
dtype="float16"
)
# GPTQ quantization (4-bit)
model = outlines.models.vllm(
"meta-llama/Llama-3.1-8B-Instruct",
quantization="gptq"
)
# SqueezeLLM quantization
model = outlines.models.vllm(
"meta-llama/Llama-3.1-8B-Instruct",
quantization="squeezellm"
)
```
### Advanced Configuration
```python
model = outlines.models.vllm(
"meta-llama/Llama-3.1-8B-Instruct",
tensor_parallel_size=1,
gpu_memory_utilization=0.9,
max_model_len=8192,
max_num_seqs=256, # Max concurrent sequences
max_num_batched_tokens=8192, # Max tokens per batch
dtype="float16",
trust_remote_code=True,
enforce_eager=False, # Use CUDA graphs (faster)
swap_space=4 # CPU swap space (GB)
)
```
### Batch Processing
```python
# vLLM optimized for high-throughput batch processing
model = outlines.models.vllm(
"meta-llama/Llama-3.1-8B-Instruct",
max_num_seqs=128 # Process 128 sequences in parallel
)
generator = outlines.generate.json(model, YourModel)
# Process many prompts efficiently
prompts = ["prompt1", "prompt2", ..., "prompt100"]
results = [generator(p) for p in prompts]
# vLLM automatically batches and optimizes
```
## OpenAI (Limited Support)
### Basic Setup
```python
import outlines
# Basic OpenAI support
model = outlines.models.openai("gpt-4o-mini", api_key="your-api-key")
# Use with generator
generator = outlines.generate.json(model, YourModel)
result = generator("Your prompt")
```
### Configuration
```python
model = outlines.models.openai(
"gpt-4o-mini",
api_key="your-api-key", # Or set OPENAI_API_KEY env var
max_tokens=2048,
temperature=0.7
)
```
### Available Models
```python
# GPT-4o (latest)
model = outlines.models.openai("gpt-4o")
# GPT-4o Mini (cost-effective)
model = outlines.models.openai("gpt-4o-mini")
# GPT-4 Turbo
model = outlines.models.openai("gpt-4-turbo")
# GPT-3.5 Turbo
model = outlines.models.openai("gpt-3.5-turbo")
```
**Note**: OpenAI support is limited compared to local models. Some advanced features may not work.
## Backend Comparison
### Feature Matrix
| Feature | Transformers | llama.cpp | vLLM | OpenAI |
|---------|-------------|-----------|------|--------|
| Structured Generation | ✅ Full | ✅ Full | ✅ Full | ⚠️ Limited |
| FSM Optimization | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| GPU Support | ✅ Yes | ✅ Yes | ✅ Yes | N/A |
| Multi-GPU | ✅ Yes | ✅ Yes | ✅ Yes | N/A |
| Quantization | ✅ Yes | ✅ Yes | ✅ Yes | N/A |
| High Throughput | ⚠️ Medium | ⚠️ Medium | ✅ Excellent | ⚠️ API-limited |
| Setup Difficulty | Easy | Medium | Medium | Easy |
| Cost | Hardware | Hardware | Hardware | API usage |
### Performance Characteristics
**Transformers:**
- **Latency**: 50-200ms (single request, GPU)
- **Throughput**: 10-50 tokens/sec (depends on hardware)
- **Memory**: 2-4GB per 1B parameters (FP16)
- **Best for**: Development, small-scale deployment, flexibility
**llama.cpp:**
- **Latency**: 30-150ms (single request)
- **Throughput**: 20-150 tokens/sec (depends on quantization)
- **Memory**: 0.5-2GB per 1B parameters (Q4-Q8)
- **Best for**: CPU inference, Apple Silicon, edge deployment, low memory
**vLLM:**
- **Latency**: 30-100ms (single request)
- **Throughput**: 100-1000+ tokens/sec (batch processing)
- **Memory**: 2-4GB per 1B parameters (FP16)
- **Best for**: Production, high-throughput, batch processing, serving
**OpenAI:**
- **Latency**: 200-500ms (API call)
- **Throughput**: API rate limits
- **Memory**: N/A (cloud-based)
- **Best for**: Quick prototyping, no infrastructure
### Memory Requirements
**7B Model:**
- FP16: ~14GB
- 8-bit: ~7GB
- 4-bit: ~4GB
- Q4_K_M (GGUF): ~4.5GB
**13B Model:**
- FP16: ~26GB
- 8-bit: ~13GB
- 4-bit: ~7GB
- Q4_K_M (GGUF): ~8GB
**70B Model:**
- FP16: ~140GB (multi-GPU)
- 8-bit: ~70GB (multi-GPU)
- 4-bit: ~35GB (single A100/H100)
- Q4_K_M (GGUF): ~40GB
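The figures above follow from a simple rule of thumb: weight memory is roughly parameters times bits per weight divided by 8, plus overhead for activations and the KV cache. A hedged sketch of that estimate (weights only, overhead excluded):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB ~ 1e9 bytes),
    excluding activation and KV-cache overhead."""
    return params_billion * bits_per_weight / 8

print(weight_memory_gb(7, 16))   # ~14 GB, matching the FP16 row for a 7B model
print(weight_memory_gb(70, 4))   # ~35 GB, matching the 4-bit row for a 70B model
```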
## Performance Tuning
### Transformers Optimization
```python
# Use FP16
model = outlines.models.transformers(
"meta-llama/Llama-3.1-8B-Instruct",
device="cuda",
model_kwargs={"torch_dtype": "float16"}
)
# Use flash attention (2-4x faster)
model = outlines.models.transformers(
"meta-llama/Llama-3.1-8B-Instruct",
device="cuda",
model_kwargs={
"torch_dtype": "float16",
"attn_implementation": "flash_attention_2"
}
)
# Use 8-bit quantization (2x less memory)
model = outlines.models.transformers(
"meta-llama/Llama-3.1-8B-Instruct",
device="cuda",
model_kwargs={
"load_in_8bit": True,
"device_map": "auto"
}
)
```
### llama.cpp Optimization
```python
# Maximize GPU usage
model = outlines.models.llamacpp(
"./models/model.Q4_K_M.gguf",
n_gpu_layers=-1, # All layers on GPU
n_ctx=8192,
n_batch=512 # Larger batch = faster
)
# Optimize for CPU (Apple Silicon)
model = outlines.models.llamacpp(
"./models/model.Q4_K_M.gguf",
n_ctx=4096,
n_threads=8, # Use all performance cores
use_mmap=True
)
```
### vLLM Optimization
```python
# High throughput
model = outlines.models.vllm(
"meta-llama/Llama-3.1-8B-Instruct",
gpu_memory_utilization=0.95, # Use 95% of GPU
max_num_seqs=256, # High concurrency
enforce_eager=False # Use CUDA graphs
)
# Multi-GPU
model = outlines.models.vllm(
"meta-llama/Llama-3.1-70B-Instruct",
tensor_parallel_size=4, # 4 GPUs
gpu_memory_utilization=0.9
)
```
## Production Deployment
### Docker with vLLM
```dockerfile
FROM vllm/vllm-openai:latest
# Install outlines
RUN pip install outlines
# Copy your code
COPY app.py /app/
# Run
CMD ["python", "/app/app.py"]
```
### Environment Variables
```bash
# Transformers cache
export HF_HOME="/path/to/cache"
export TRANSFORMERS_CACHE="/path/to/cache"
# GPU selection
export CUDA_VISIBLE_DEVICES=0,1,2,3
# OpenAI API key
export OPENAI_API_KEY="sk-..."
# Disable tokenizers parallelism warning
export TOKENIZERS_PARALLELISM=false
```
### Model Serving
```python
# Simple HTTP server with vLLM
import outlines
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI()
# Load model once at startup
model = outlines.models.vllm("meta-llama/Llama-3.1-8B-Instruct")
class User(BaseModel):
name: str
age: int
email: str
generator = outlines.generate.json(model, User)
@app.post("/extract")
def extract(text: str):
result = generator(f"Extract user from: {text}")
return result.model_dump()
```
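Assuming the code above is saved as `app.py`, it can be served with uvicorn and exercised with curl; the port and sample payload are illustrative (FastAPI exposes the `text` argument as a query parameter here):

```shell
# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000

# In another terminal, call the endpoint
curl -X POST "http://localhost:8000/extract?text=Alice%20is%2025,%20alice@example.com"
```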
## Resources
- **Transformers**: https://huggingface.co/docs/transformers
- **llama.cpp**: https://github.com/ggerganov/llama.cpp
- **vLLM**: https://docs.vllm.ai
- **Outlines**: https://github.com/outlines-dev/outlines

# Production-Ready Examples
Real-world examples of using Outlines for structured generation in production systems.
## Table of Contents
- Data Extraction
- Classification Systems
- Form Processing
- Multi-Entity Extraction
- Code Generation
- Batch Processing
- Production Patterns
## Data Extraction
### Basic Information Extraction
```python
from pydantic import BaseModel, Field
import outlines
class PersonInfo(BaseModel):
name: str = Field(description="Full name")
age: int = Field(ge=0, le=120)
occupation: str
email: str = Field(pattern=r"^[\w\.-]+@[\w\.-]+\.\w+$")
location: str
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, PersonInfo)
text = """
Dr. Sarah Johnson is a 42-year-old research scientist at MIT.
She can be reached at sarah.j@mit.edu and currently lives in Cambridge, MA.
"""
prompt = f"Extract person information from:\n{text}\n\nPerson:"
person = generator(prompt)
print(f"Name: {person.name}")
print(f"Age: {person.age}")
print(f"Occupation: {person.occupation}")
print(f"Email: {person.email}")
print(f"Location: {person.location}")
```
### Company Information
```python
from typing import Optional

class CompanyInfo(BaseModel):
name: str
founded_year: int = Field(ge=1800, le=2025)
industry: str
headquarters: str
employees: int = Field(gt=0)
revenue: Optional[str] = None
model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
generator = outlines.generate.json(model, CompanyInfo)
text = """
Tesla, Inc. was founded in 2003 and operates primarily in the automotive
and energy industries. The company is headquartered in Austin, Texas,
and employs approximately 140,000 people worldwide.
"""
company = generator(f"Extract company information:\n{text}\n\nCompany:")
print(f"Company: {company.name}")
print(f"Founded: {company.founded_year}")
print(f"Industry: {company.industry}")
print(f"HQ: {company.headquarters}")
print(f"Employees: {company.employees:,}")
```
### Product Specifications
```python
class ProductSpec(BaseModel):
name: str
brand: str
price: float = Field(gt=0)
dimensions: str
weight: str
features: list[str]
rating: Optional[float] = Field(None, ge=0, le=5)
generator = outlines.generate.json(model, ProductSpec)
text = """
The Apple iPhone 15 Pro is priced at $999. It measures 146.6 x 70.6 x 8.25 mm
and weighs 187 grams. Key features include the A17 Pro chip, titanium design,
action button, and USB-C port. It has an average customer rating of 4.5 stars.
"""
product = generator(f"Extract product specifications:\n{text}\n\nProduct:")
print(f"Product: {product.brand} {product.name}")
print(f"Price: ${product.price}")
print(f"Features: {', '.join(product.features)}")
```
## Classification Systems
### Sentiment Analysis
```python
from typing import Literal
from enum import Enum
class Sentiment(str, Enum):
VERY_POSITIVE = "very_positive"
POSITIVE = "positive"
NEUTRAL = "neutral"
NEGATIVE = "negative"
VERY_NEGATIVE = "very_negative"
class SentimentAnalysis(BaseModel):
text: str
sentiment: Sentiment
confidence: float = Field(ge=0.0, le=1.0)
aspects: list[str] # What aspects were mentioned
reasoning: str
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, SentimentAnalysis)
review = """
This product completely exceeded my expectations! The build quality is
outstanding, and customer service was incredibly helpful. My only minor
complaint is the packaging could be better.
"""
result = generator(f"Analyze sentiment:\n{review}\n\nAnalysis:")
print(f"Sentiment: {result.sentiment.value}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Aspects: {', '.join(result.aspects)}")
print(f"Reasoning: {result.reasoning}")
```
### Content Classification
```python
class Category(str, Enum):
TECHNOLOGY = "technology"
BUSINESS = "business"
SCIENCE = "science"
POLITICS = "politics"
ENTERTAINMENT = "entertainment"
SPORTS = "sports"
HEALTH = "health"
class ArticleClassification(BaseModel):
primary_category: Category
secondary_categories: list[Category]
    keywords: list[str] = Field(min_length=3, max_length=10)
target_audience: Literal["general", "expert", "beginner"]
reading_level: Literal["elementary", "intermediate", "advanced"]
generator = outlines.generate.json(model, ArticleClassification)
article = """
Apple announced groundbreaking advancements in its AI capabilities with the
release of iOS 18. The new features leverage machine learning to significantly
improve battery life and overall device performance. Industry analysts predict
this will strengthen Apple's position in the competitive smartphone market.
"""
classification = generator(f"Classify article:\n{article}\n\nClassification:")
print(f"Primary: {classification.primary_category.value}")
print(f"Secondary: {[c.value for c in classification.secondary_categories]}")
print(f"Keywords: {classification.keywords}")
print(f"Audience: {classification.target_audience}")
```
### Intent Recognition
```python
class Intent(str, Enum):
QUESTION = "question"
COMPLAINT = "complaint"
REQUEST = "request"
FEEDBACK = "feedback"
CANCEL = "cancel"
UPGRADE = "upgrade"
class UserMessage(BaseModel):
original_message: str
intent: Intent
urgency: Literal["low", "medium", "high", "critical"]
department: Literal["support", "sales", "billing", "technical"]
sentiment: Literal["positive", "neutral", "negative"]
action_required: bool
summary: str
generator = outlines.generate.json(model, UserMessage)
message = """
I've been charged twice for my subscription this month! This is the third
time this has happened. I need someone to fix this immediately and refund
the extra charge. Very disappointed with this service.
"""
result = generator(f"Analyze message:\n{message}\n\nAnalysis:")
print(f"Intent: {result.intent.value}")
print(f"Urgency: {result.urgency}")
print(f"Route to: {result.department}")
print(f"Action required: {result.action_required}")
print(f"Summary: {result.summary}")
```
## Form Processing
### Job Application
```python
class Education(BaseModel):
degree: str
field: str
institution: str
year: int
class Experience(BaseModel):
title: str
company: str
duration: str
responsibilities: list[str]
class JobApplication(BaseModel):
full_name: str
email: str
phone: str
education: list[Education]
experience: list[Experience]
skills: list[str]
availability: str
model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
generator = outlines.generate.json(model, JobApplication)
resume_text = """
John Smith
Email: john.smith@email.com | Phone: 555-0123
EDUCATION
- BS in Computer Science, MIT, 2018
- MS in Artificial Intelligence, Stanford, 2020
EXPERIENCE
Software Engineer, Google (2020-2023)
- Developed ML pipelines for search ranking
- Led team of 5 engineers
- Improved search quality by 15%
SKILLS: Python, Machine Learning, TensorFlow, System Design
AVAILABILITY: Immediate
"""
application = generator(f"Extract job application:\n{resume_text}\n\nApplication:")
print(f"Applicant: {application.full_name}")
print(f"Email: {application.email}")
print(f"Education: {len(application.education)} degrees")
for edu in application.education:
print(f" - {edu.degree} in {edu.field}, {edu.institution} ({edu.year})")
print(f"Experience: {len(application.experience)} positions")
```
### Invoice Processing
```python
class InvoiceItem(BaseModel):
description: str
quantity: int = Field(gt=0)
unit_price: float = Field(gt=0)
total: float = Field(gt=0)
class Invoice(BaseModel):
invoice_number: str
date: str = Field(pattern=r"\d{4}-\d{2}-\d{2}")
vendor: str
customer: str
items: list[InvoiceItem]
subtotal: float = Field(gt=0)
tax: float = Field(ge=0)
total: float = Field(gt=0)
generator = outlines.generate.json(model, Invoice)
invoice_text = """
INVOICE #INV-2024-001
Date: 2024-01-15
From: Acme Corp
To: Smith & Co
Items:
- Widget A: 10 units @ $50.00 = $500.00
- Widget B: 5 units @ $75.00 = $375.00
- Service Fee: 1 @ $100.00 = $100.00
Subtotal: $975.00
Tax (8%): $78.00
TOTAL: $1,053.00
"""
invoice = generator(f"Extract invoice:\n{invoice_text}\n\nInvoice:")
print(f"Invoice: {invoice.invoice_number}")
print(f"From: {invoice.vendor} → To: {invoice.customer}")
print(f"Items: {len(invoice.items)}")
for item in invoice.items:
print(f" - {item.description}: {item.quantity} × ${item.unit_price} = ${item.total}")
print(f"Total: ${invoice.total}")
```
### Survey Responses
```python
class SurveyResponse(BaseModel):
respondent_id: str
completion_date: str
satisfaction: Literal[1, 2, 3, 4, 5]
would_recommend: bool
favorite_features: list[str]
improvement_areas: list[str]
additional_comments: Optional[str] = None
generator = outlines.generate.json(model, SurveyResponse)
survey_text = """
Survey ID: RESP-12345
Completed: 2024-01-20
How satisfied are you with our product? 4 out of 5
Would you recommend to a friend? Yes
What features do you like most?
- Fast performance
- Easy to use
- Great customer support
What could we improve?
- Better documentation
- More integrations
Additional feedback: Overall great product, keep up the good work!
"""
response = generator(f"Extract survey response:\n{survey_text}\n\nResponse:")
print(f"Respondent: {response.respondent_id}")
print(f"Satisfaction: {response.satisfaction}/5")
print(f"Would recommend: {response.would_recommend}")
print(f"Favorite features: {response.favorite_features}")
print(f"Improvement areas: {response.improvement_areas}")
```
## Multi-Entity Extraction
### News Article Entities
```python
class Person(BaseModel):
name: str
role: Optional[str] = None
affiliation: Optional[str] = None
class Organization(BaseModel):
name: str
type: Optional[str] = None
class Location(BaseModel):
name: str
type: Literal["city", "state", "country", "region"]
class Event(BaseModel):
name: str
date: Optional[str] = None
location: Optional[str] = None
class ArticleEntities(BaseModel):
people: list[Person]
organizations: list[Organization]
locations: list[Location]
events: list[Event]
dates: list[str]
model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
generator = outlines.generate.json(model, ArticleEntities)
article = """
Apple CEO Tim Cook met with Microsoft CEO Satya Nadella at Microsoft
headquarters in Redmond, Washington on September 15, 2024, to discuss
potential collaboration opportunities. The meeting was attended by executives
from both companies and focused on AI integration strategies. Apple's
Cupertino offices will host a follow-up meeting on October 20, 2024.
"""
entities = generator(f"Extract all entities:\n{article}\n\nEntities:")
print("People:")
for person in entities.people:
print(f" - {person.name} ({person.role}) @ {person.affiliation}")
print("\nOrganizations:")
for org in entities.organizations:
print(f" - {org.name} ({org.type})")
print("\nLocations:")
for loc in entities.locations:
print(f" - {loc.name} ({loc.type})")
print("\nEvents:")
for event in entities.events:
print(f" - {event.name} on {event.date}")
```
### Document Metadata
```python
class Author(BaseModel):
name: str
email: Optional[str] = None
affiliation: Optional[str] = None
class Reference(BaseModel):
title: str
authors: list[str]
year: int
source: str
class DocumentMetadata(BaseModel):
title: str
authors: list[Author]
abstract: str
keywords: list[str]
publication_date: str
journal: str
doi: Optional[str] = None
references: list[Reference]
generator = outlines.generate.json(model, DocumentMetadata)
paper = """
Title: Advances in Neural Machine Translation
Authors:
- Dr. Jane Smith (jane@university.edu), MIT
- Prof. John Doe (jdoe@stanford.edu), Stanford University
Abstract: This paper presents novel approaches to neural machine translation
using transformer architectures. We demonstrate significant improvements in
translation quality across multiple language pairs.
Keywords: Neural Networks, Machine Translation, Transformers, NLP
Published: Journal of AI Research, 2024-03-15
DOI: 10.1234/jair.2024.001
References:
1. "Attention Is All You Need" by Vaswani et al., 2017, NeurIPS
2. "BERT: Pre-training of Deep Bidirectional Transformers" by Devlin et al., 2019, NAACL
"""
metadata = generator(f"Extract document metadata:\n{paper}\n\nMetadata:")
print(f"Title: {metadata.title}")
print(f"Authors: {', '.join(a.name for a in metadata.authors)}")
print(f"Keywords: {', '.join(metadata.keywords)}")
print(f"References: {len(metadata.references)}")
```
## Code Generation
### Python Function Generation
```python
class Parameter(BaseModel):
name: str = Field(pattern=r"^[a-z_][a-z0-9_]*$")
type_hint: str
default: Optional[str] = None
class PythonFunction(BaseModel):
function_name: str = Field(pattern=r"^[a-z_][a-z0-9_]*$")
parameters: list[Parameter]
return_type: str
docstring: str
body: list[str] # Lines of code
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, PythonFunction)
spec = "Create a function to calculate the factorial of a number"
func = generator(f"Generate Python function:\n{spec}\n\nFunction:")
print(f"def {func.function_name}(", end="")
print(", ".join(f"{p.name}: {p.type_hint}" for p in func.parameters), end="")
print(f") -> {func.return_type}:")
print(f' """{func.docstring}"""')
for line in func.body:
print(f" {line}")
```
### SQL Query Generation
```python
class SQLQuery(BaseModel):
query_type: Literal["SELECT", "INSERT", "UPDATE", "DELETE"]
select_columns: Optional[list[str]] = None
from_tables: list[str]
joins: Optional[list[str]] = None
where_conditions: Optional[list[str]] = None
group_by: Optional[list[str]] = None
order_by: Optional[list[str]] = None
limit: Optional[int] = None
generator = outlines.generate.json(model, SQLQuery)
request = "Get top 10 users who made purchases in the last 30 days, ordered by total spent"
sql = generator(f"Generate SQL query:\n{request}\n\nQuery:")
print(f"Query type: {sql.query_type}")
if sql.select_columns:
    print(f"SELECT {', '.join(sql.select_columns)}")
print(f"FROM {', '.join(sql.from_tables)}")
if sql.joins:
for join in sql.joins:
print(f" {join}")
if sql.where_conditions:
print(f"WHERE {' AND '.join(sql.where_conditions)}")
if sql.order_by:
print(f"ORDER BY {', '.join(sql.order_by)}")
if sql.limit:
print(f"LIMIT {sql.limit}")
```
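The printed fragments above can be assembled into a single SQL string. A minimal stdlib-only sketch — the `render_select` helper and the dict mirroring the `SQLQuery` fields are illustrative, not part of Outlines:

```python
def render_select(query: dict) -> str:
    """Assemble a SELECT statement from SQLQuery-like fields."""
    parts = [
        f"SELECT {', '.join(query.get('select_columns') or ['*'])}",
        f"FROM {', '.join(query['from_tables'])}",
    ]
    for join in query.get("joins") or []:
        parts.append(join)
    if query.get("where_conditions"):
        parts.append("WHERE " + " AND ".join(query["where_conditions"]))
    if query.get("group_by"):
        parts.append("GROUP BY " + ", ".join(query["group_by"]))
    if query.get("order_by"):
        parts.append("ORDER BY " + ", ".join(query["order_by"]))
    if query.get("limit") is not None:
        parts.append(f"LIMIT {query['limit']}")
    return "\n".join(parts)

sql_text = render_select({
    "select_columns": ["user_id", "SUM(amount) AS total_spent"],
    "from_tables": ["purchases"],
    "where_conditions": ["purchase_date >= NOW() - INTERVAL '30 days'"],
    "group_by": ["user_id"],
    "order_by": ["total_spent DESC"],
    "limit": 10,
})
print(sql_text)
```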
### API Endpoint Spec
```python
class Parameter(BaseModel):
name: str
type: str
required: bool
description: str
class APIEndpoint(BaseModel):
method: Literal["GET", "POST", "PUT", "DELETE", "PATCH"]
path: str
description: str
parameters: list[Parameter]
request_body: Optional[dict] = None
response_schema: dict
status_codes: dict[int, str]
generator = outlines.generate.json(model, APIEndpoint)
spec = "Create user endpoint"
endpoint = generator(f"Generate API endpoint:\n{spec}\n\nEndpoint:")
print(f"{endpoint.method} {endpoint.path}")
print(f"Description: {endpoint.description}")
print("\nParameters:")
for param in endpoint.parameters:
req = "required" if param.required else "optional"
print(f" - {param.name} ({param.type}, {req}): {param.description}")
```
## Batch Processing
### Parallel Extraction
```python
def batch_extract(texts: list[str], schema: type[BaseModel], model_name: str):
"""Extract structured data from multiple texts."""
model = outlines.models.transformers(model_name)
generator = outlines.generate.json(model, schema)
results = []
for i, text in enumerate(texts):
print(f"Processing {i+1}/{len(texts)}...", end="\r")
result = generator(f"Extract:\n{text}\n\nData:")
results.append(result)
return results
class Product(BaseModel):
name: str
price: float
category: str
texts = [
"iPhone 15 Pro costs $999 in Electronics",
"Running Shoes are $89.99 in Sports",
"Coffee Maker priced at $49.99 in Home & Kitchen"
]
products = batch_extract(texts, Product, "microsoft/Phi-3-mini-4k-instruct")
for product in products:
print(f"{product.name}: ${product.price} ({product.category})")
```
### CSV Processing
```python
import csv
def process_csv(csv_file: str, schema: type[BaseModel]):
"""Process CSV file and extract structured data."""
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, schema)
results = []
with open(csv_file, 'r') as f:
reader = csv.DictReader(f)
for row in reader:
text = " | ".join(f"{k}: {v}" for k, v in row.items())
result = generator(f"Extract:\n{text}\n\nData:")
results.append(result)
return results
class Customer(BaseModel):
name: str
email: str
tier: Literal["basic", "premium", "enterprise"]
mrr: float
# customers = process_csv("customers.csv", Customer)
```
## Production Patterns
### Error Handling
```python
from pydantic import ValidationError
def safe_extract(text: str, schema: type[BaseModel], retries: int = 3):
"""Extract with error handling and retries."""
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, schema)
for attempt in range(retries):
try:
result = generator(f"Extract:\n{text}\n\nData:")
return result
except ValidationError as e:
print(f"Attempt {attempt + 1} failed: {e}")
if attempt == retries - 1:
raise
except Exception as e:
print(f"Unexpected error: {e}")
if attempt == retries - 1:
raise
return None
```
### Caching
```python
from functools import lru_cache
import hashlib
@lru_cache(maxsize=1000)
def cached_extract(text_hash: str, schema_name: str):
"""Cache extraction results."""
# This would be called with actual extraction logic
pass
def extract_with_cache(text: str, schema: type[BaseModel]):
"""Extract with caching."""
text_hash = hashlib.md5(text.encode()).hexdigest()
schema_name = schema.__name__
cached_result = cached_extract(text_hash, schema_name)
if cached_result:
return cached_result
# Perform actual extraction
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, schema)
result = generator(f"Extract:\n{text}\n\nData:")
return result
```
### Monitoring
```python
import time
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def monitored_extract(text: str, schema: type[BaseModel]):
"""Extract with monitoring and logging."""
start_time = time.time()
try:
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, schema)
result = generator(f"Extract:\n{text}\n\nData:")
elapsed = time.time() - start_time
logger.info(f"Extraction succeeded in {elapsed:.2f}s")
logger.info(f"Input length: {len(text)} chars")
return result
except Exception as e:
elapsed = time.time() - start_time
logger.error(f"Extraction failed after {elapsed:.2f}s: {e}")
raise
```
### Rate Limiting
```python
import time
from threading import Lock
class RateLimiter:
def __init__(self, max_requests: int, time_window: int):
self.max_requests = max_requests
self.time_window = time_window
self.requests = []
self.lock = Lock()
def wait_if_needed(self):
with self.lock:
now = time.time()
# Remove old requests
self.requests = [r for r in self.requests if now - r < self.time_window]
            if len(self.requests) >= self.max_requests:
                sleep_time = self.time_window - (now - self.requests[0])
                time.sleep(sleep_time)
                # Re-prune after sleeping instead of discarding all history
                now = time.time()
                self.requests = [r for r in self.requests if now - r < self.time_window]
            self.requests.append(now)
def rate_limited_extract(texts: list[str], schema: type[BaseModel]):
"""Extract with rate limiting."""
limiter = RateLimiter(max_requests=10, time_window=60) # 10 req/min
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, schema)
results = []
for text in texts:
limiter.wait_if_needed()
result = generator(f"Extract:\n{text}\n\nData:")
results.append(result)
return results
```
## Resources
- **Outlines Documentation**: https://outlines-dev.github.io/outlines
- **Pydantic Documentation**: https://docs.pydantic.dev
- **GitHub Examples**: https://github.com/outlines-dev/outlines/tree/main/examples

# Comprehensive JSON Generation Guide
Complete guide to JSON generation with Outlines using Pydantic models and JSON schemas.
## Table of Contents
- Pydantic Models
- JSON Schema Support
- Advanced Patterns
- Nested Structures
- Complex Types
- Validation
- Performance Optimization
## Pydantic Models
### Basic Models
```python
from pydantic import BaseModel
import outlines
class User(BaseModel):
name: str
age: int
email: str
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, User)
user = generator("Generate user: Alice, 25, alice@example.com")
print(user.name) # "Alice"
print(user.age) # 25
print(user.email) # "alice@example.com"
```
### Field Constraints
```python
from pydantic import BaseModel, Field
class Product(BaseModel):
name: str = Field(min_length=1, max_length=100)
price: float = Field(gt=0, description="Price in USD")
discount: float = Field(ge=0, le=100, description="Discount percentage")
quantity: int = Field(ge=0, description="Available quantity")
sku: str = Field(pattern=r"^[A-Z]{3}-\d{6}$")
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, Product)
product = generator("Generate product: iPhone 15, $999")
# All fields guaranteed to meet constraints
```
**Available Constraints:**
- `min_length`, `max_length`: String length (also list item count in Pydantic v2)
- `gt`, `ge`, `lt`, `le`: Numeric comparisons
- `multiple_of`: Number must be a multiple of the value
- `pattern`: Regex pattern for strings
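Constraint violations surface as `ValidationError` rather than being silently accepted. A minimal sketch exercising `multiple_of` and list-length bounds (the `Pack` model is illustrative):

```python
from pydantic import BaseModel, Field, ValidationError

class Pack(BaseModel):
    units: int = Field(gt=0, multiple_of=6)          # sold in six-packs
    labels: list[str] = Field(min_length=1, max_length=3)

# Valid instance: all constraints satisfied
pack = Pack(units=12, labels=["fragile"])
print(pack.units)  # 12

# Invalid instance: 7 is not a multiple of 6
try:
    Pack(units=7, labels=["a"])
except ValidationError as e:
    print("rejected:", e.errors()[0]["loc"])
```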
### Optional Fields
```python
from typing import Optional
class Article(BaseModel):
title: str # Required
author: Optional[str] = None # Optional
published_date: Optional[str] = None # Optional
tags: list[str] = [] # Default empty list
view_count: int = 0 # Default value
generator = outlines.generate.json(model, Article)
# Can generate even if optional fields missing
article = generator("Title: Introduction to AI")
print(article.author) # None (not provided)
print(article.tags) # [] (default)
```
### Default Values
```python
class Config(BaseModel):
debug: bool = False
max_retries: int = 3
timeout: float = 30.0
log_level: str = "INFO"
# Generator uses defaults when not specified
generator = outlines.generate.json(model, Config)
config = generator("Generate config with debug enabled")
print(config.debug) # True (from prompt)
print(config.timeout) # 30.0 (default)
```
## Enums and Literals
### Enum Fields
```python
from enum import Enum
class Status(str, Enum):
PENDING = "pending"
APPROVED = "approved"
REJECTED = "rejected"
CANCELLED = "cancelled"
class Application(BaseModel):
applicant_name: str
status: Status # Must be one of enum values
submitted_date: str
generator = outlines.generate.json(model, Application)
app = generator("Generate application for John Doe")
print(app.status) # Status.PENDING (or one of the enum values)
print(type(app.status)) # <enum 'Status'>
```
### Literal Types
```python
from typing import Literal
class Task(BaseModel):
title: str
priority: Literal["low", "medium", "high", "critical"]
status: Literal["todo", "in_progress", "done"]
assigned_to: str
generator = outlines.generate.json(model, Task)
task = generator("Create high priority task: Fix bug")
print(task.priority) # One of: "low", "medium", "high", "critical"
```
### Multiple Choice Fields
```python
class Survey(BaseModel):
question: str
answer: Literal["strongly_disagree", "disagree", "neutral", "agree", "strongly_agree"]
confidence: Literal["low", "medium", "high"]
generator = outlines.generate.json(model, Survey)
survey = generator("Rate: 'I enjoy using this product'")
```
## Nested Structures
### Nested Models
```python
class Address(BaseModel):
street: str
city: str
state: str
zip_code: str
country: str = "USA"
class Person(BaseModel):
name: str
age: int
email: str
address: Address # Nested model
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, Person)
prompt = """
Extract person:
Name: Alice Johnson
Age: 28
Email: alice@example.com
Address: 123 Main St, Boston, MA, 02101
"""
person = generator(prompt)
print(person.name) # "Alice Johnson"
print(person.address.city) # "Boston"
print(person.address.state) # "MA"
```
### Deep Nesting
```python
class Coordinates(BaseModel):
latitude: float
longitude: float
class Location(BaseModel):
name: str
coordinates: Coordinates
class Event(BaseModel):
title: str
date: str
location: Location
generator = outlines.generate.json(model, Event)
event = generator("Generate event: Tech Conference in San Francisco")
print(event.title) # "Tech Conference"
print(event.location.name) # "San Francisco"
print(event.location.coordinates.latitude)  # e.g. 37.7749
```
### Lists of Nested Models
```python
class Item(BaseModel):
name: str
quantity: int
price: float
class Order(BaseModel):
order_id: str
customer: str
items: list[Item] # List of nested models
total: float
generator = outlines.generate.json(model, Order)
prompt = """
Generate order for John:
- 2x Widget ($10 each)
- 3x Gadget ($15 each)
Order ID: ORD-001
"""
order = generator(prompt)
print(f"Order ID: {order.order_id}")
for item in order.items:
print(f"- {item.quantity}x {item.name} @ ${item.price}")
print(f"Total: ${order.total}")
```
## Complex Types
### Union Types
```python
from typing import Union
class TextContent(BaseModel):
type: Literal["text"]
content: str
class ImageContent(BaseModel):
type: Literal["image"]
url: str
caption: str
class Post(BaseModel):
title: str
content: Union[TextContent, ImageContent] # Either type
generator = outlines.generate.json(model, Post)
# Can generate either text or image content
post = generator("Generate blog post with image")
if post.content.type == "text":
print(post.content.content)
elif post.content.type == "image":
print(post.content.url)
```
### Lists and Arrays
```python
class Article(BaseModel):
title: str
authors: list[str] # List of strings
tags: list[str]
sections: list[dict[str, str]] # List of dicts
related_ids: list[int]
generator = outlines.generate.json(model, Article)
article = generator("Generate article about AI")
print(article.authors) # ["Alice", "Bob"]
print(article.tags) # ["AI", "Machine Learning", "Technology"]
```
### Dictionaries
```python
class Metadata(BaseModel):
title: str
properties: dict[str, str] # String keys and values
counts: dict[str, int] # String keys, int values
settings: dict[str, Union[str, int, bool]] # Mixed value types
generator = outlines.generate.json(model, Metadata)
meta = generator("Generate metadata")
print(meta.properties) # {"author": "Alice", "version": "1.0"}
print(meta.counts) # {"views": 1000, "likes": 50}
```
### Any Type (Use Sparingly)
```python
from typing import Any
class FlexibleData(BaseModel):
name: str
structured_field: str
flexible_field: Any # Can be anything
# Note: Any reduces type safety, use only when necessary
generator = outlines.generate.json(model, FlexibleData)
```
## JSON Schema Support
### Direct Schema Usage
```python
import outlines
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
# Define JSON schema
schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer", "minimum": 0, "maximum": 120},
"email": {"type": "string", "format": "email"}
},
"required": ["name", "age", "email"]
}
# Generate from schema
generator = outlines.generate.json(model, schema)
result = generator("Generate person: Alice, 25, alice@example.com")
print(result) # Valid JSON matching schema
```
### Schema from Pydantic
```python
class User(BaseModel):
name: str
age: int
email: str
# Get JSON schema from Pydantic model
schema = User.model_json_schema()
print(schema)
# {
# "type": "object",
# "properties": {
# "name": {"type": "string"},
# "age": {"type": "integer"},
# "email": {"type": "string"}
# },
# "required": ["name", "age", "email"]
# }
# Both approaches equivalent:
generator1 = outlines.generate.json(model, User)
generator2 = outlines.generate.json(model, schema)
```
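Because the class and its emitted schema describe the same structure, data generated against either one parses straight back into the model with `model_validate` — a quick check using only Pydantic, no LLM required:

```python
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int
    email: str

schema = User.model_json_schema()
# Every field is marked required, so structured generation cannot omit any
assert set(schema["required"]) == {"name", "age", "email"}

# JSON produced under this schema round-trips into the model
user = User.model_validate({"name": "Alice", "age": 25, "email": "alice@example.com"})
print(user.age)  # 25
```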
## Advanced Patterns
### Conditional Fields
```python
class Order(BaseModel):
order_type: Literal["standard", "express"]
delivery_date: str
express_fee: Optional[float] = None # Only for express orders
generator = outlines.generate.json(model, Order)
# Express order
order1 = generator("Create express order for tomorrow")
print(order1.express_fee) # 25.0
# Standard order
order2 = generator("Create standard order")
print(order2.express_fee) # None
```
### Recursive Models
```python
from typing import Optional, List
class TreeNode(BaseModel):
value: str
children: Optional[List['TreeNode']] = None
# Enable forward references
TreeNode.model_rebuild()
generator = outlines.generate.json(model, TreeNode)
tree = generator("Generate file tree with subdirectories")
print(tree.value) # "root"
print(tree.children[0].value) # "subdir1"
```
### Model with Validation
```python
from pydantic import field_validator
class DateRange(BaseModel):
start_date: str
end_date: str
    @field_validator('end_date')
    @classmethod
    def end_after_start(cls, v, info):
"""Ensure end_date is after start_date."""
if 'start_date' in info.data:
from datetime import datetime
start = datetime.strptime(info.data['start_date'], '%Y-%m-%d')
end = datetime.strptime(v, '%Y-%m-%d')
if end < start:
raise ValueError('end_date must be after start_date')
return v
generator = outlines.generate.json(model, DateRange)
# Validation happens after generation
```
## Multiple Objects
### Generate List of Objects
```python
class Person(BaseModel):
name: str
age: int
class Team(BaseModel):
team_name: str
members: list[Person]
generator = outlines.generate.json(model, Team)
team = generator("Generate engineering team with 5 members")
print(f"Team: {team.team_name}")
for member in team.members:
print(f"- {member.name}, {member.age}")
```
### Batch Generation
```python
def generate_batch(prompts: list[str], schema: type[BaseModel]):
"""Generate structured outputs for multiple prompts."""
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, schema)
results = []
for prompt in prompts:
result = generator(prompt)
results.append(result)
return results
class Product(BaseModel):
name: str
price: float
prompts = [
"Product: iPhone 15, $999",
"Product: MacBook Pro, $2499",
"Product: AirPods, $179"
]
products = generate_batch(prompts, Product)
for product in products:
print(f"{product.name}: ${product.price}")
```
## Performance Optimization
### Caching Generators
```python
from functools import lru_cache

@lru_cache(maxsize=10)
def get_generator(model_name: str, schema: type[BaseModel]):
    """Cache generators for reuse (classes are hashable, so they work as cache keys)."""
    model = outlines.models.transformers(model_name)
    return outlines.generate.json(model, schema)

# First call: creates generator
gen1 = get_generator("microsoft/Phi-3-mini-4k-instruct", User)
# Second call: returns cached generator (fast!)
gen2 = get_generator("microsoft/Phi-3-mini-4k-instruct", User)
```
### Batch Processing
```python
# Process multiple items efficiently
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, User)
texts = ["User: Alice, 25", "User: Bob, 30", "User: Carol, 35"]
# Reuse generator (model stays loaded)
users = [generator(text) for text in texts]
```
### Minimize Schema Complexity
```python
# ✅ Good: Simple, flat structure (faster)
class SimplePerson(BaseModel):
name: str
age: int
city: str
# ⚠️ Slower: Deep nesting
class ComplexPerson(BaseModel):
personal_info: PersonalInfo
address: Address
employment: Employment
# ... many nested levels
```
## Error Handling
### Handle Missing Fields
```python
from pydantic import ValidationError
class User(BaseModel):
name: str
age: int
email: str
try:
user = generator("Generate user") # May not include all fields
except ValidationError as e:
print(f"Validation error: {e}")
# Handle gracefully
```
### Fallback with Optional Fields
```python
from typing import Optional
class RobustUser(BaseModel):
name: str # Required
age: Optional[int] = None # Optional
email: Optional[str] = None # Optional
# More likely to succeed even with incomplete data
user = generator("Generate user: Alice")
print(user.name) # "Alice"
print(user.age) # None (not provided)
```
## Best Practices
### 1. Use Specific Types
```python
# ✅ Good: Specific types
class Product(BaseModel):
name: str
price: float # Not Any or str
quantity: int # Not str
in_stock: bool # Not int
# ❌ Bad: Generic types
class Product(BaseModel):
name: Any
price: str # Should be float
quantity: str # Should be int
```
### 2. Add Descriptions
```python
# ✅ Good: Clear descriptions
class Article(BaseModel):
title: str = Field(description="Article title, 10-100 characters")
content: str = Field(description="Main article content in paragraphs")
tags: list[str] = Field(description="List of relevant topic tags")
# Descriptions help the model understand expected output
```
### 3. Use Constraints
```python
# ✅ Good: With constraints
class Age(BaseModel):
value: int = Field(ge=0, le=120, description="Age in years")
# ❌ Bad: No constraints
class Age(BaseModel):
value: int # Could be negative or > 120
```
### 4. Prefer Enums Over Strings
```python
# ✅ Good: Enum for fixed set
from enum import Enum
class Priority(str, Enum):
LOW = "low"
MEDIUM = "medium"
HIGH = "high"
class Task(BaseModel):
priority: Priority # Guaranteed valid
# ❌ Bad: Free-form string
class Task(BaseModel):
priority: str # Could be "urgent", "ASAP", "!!", etc.
```
### 5. Test Your Models
```python
# Test models work as expected
def test_product_model():
product = Product(
name="Test Product",
price=19.99,
quantity=10,
in_stock=True
)
assert product.price == 19.99
assert isinstance(product, Product)
# Run tests before using in production
```
## Resources
- **Pydantic Docs**: https://docs.pydantic.dev
- **JSON Schema**: https://json-schema.org
- **Outlines GitHub**: https://github.com/outlines-dev/outlines

---
name: serving-llms-vllm
description: Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [vllm, torch, transformers]
metadata:
hermes:
tags: [vLLM, Inference Serving, PagedAttention, Continuous Batching, High Throughput, Production, OpenAI API, Quantization, Tensor Parallelism]
---
# vLLM - High-Performance LLM Serving
## Quick start
vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).
**Installation**:
```bash
pip install vllm
```
**Basic offline inference**:
```python
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
```
**OpenAI-compatible server**:
```bash
vllm serve meta-llama/Llama-3-8B-Instruct
# Query with OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
model='meta-llama/Llama-3-8B-Instruct',
messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
```
## Common workflows
### Workflow 1: Production API deployment
Copy this checklist and track progress:
```
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics
```
**Step 1: Configure server settings**
Choose configuration based on your model size:
```bash
# For 7B-13B models on single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--port 8000
# For 30B-70B models with tensor parallelism (AWQ requires pre-quantized weights)
vllm serve TheBloke/Llama-2-70B-AWQ \
--tensor-parallel-size 4 \
--quantization awq \
--gpu-memory-utilization 0.9 \
--port 8000
# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching \
--enable-metrics \
--metrics-port 9090 \
--port 8000 \
--host 0.0.0.0
```
**Step 2: Test with limited traffic**
Run load test before production:
```bash
# Install load testing tool
pip install locust
# Create test_load.py with sample requests
# Run: locust -f test_load.py --host http://localhost:8000
```
Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
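Before reaching for a full load-testing framework, the throughput and latency targets above can be sanity-checked with a minimal stdlib harness. This is an illustrative sketch: it measures any callable, and in practice you would pass a function that POSTs to the server's `/v1/chat/completions` endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(send_request, num_requests=100, concurrency=16):
    """Run send_request concurrently; report throughput and latency percentiles."""
    latencies = []
    start = time.perf_counter()

    def worker(_):
        t0 = time.perf_counter()
        send_request()  # e.g. an HTTP POST to /v1/chat/completions
        latencies.append(time.perf_counter() - t0)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(worker, range(num_requests)))

    elapsed = time.perf_counter() - start
    latencies.sort()
    return {
        "throughput_rps": num_requests / elapsed,
        "p50_s": latencies[len(latencies) // 2],
        "p99_s": latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))],
    }
```

Compare `p50_s` against the 500ms TTFT target and `throughput_rps` against your req/sec target.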
**Step 3: Enable monitoring**
vLLM exposes Prometheus metrics on port 9090:
```bash
curl http://localhost:9090/metrics | grep vllm
```
Key metrics to monitor:
- `vllm:time_to_first_token_seconds` - Latency
- `vllm:num_requests_running` - Active requests
- `vllm:gpu_cache_usage_perc` - KV cache utilization
**Step 4: Deploy to production**
Use Docker for consistent deployment:
```bash
# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching
```
**Step 5: Verify performance metrics**
Check that deployment meets targets:
- TTFT < 500ms (for short prompts)
- Throughput > target req/sec
- GPU utilization > 80%
- No OOM errors in logs
### Workflow 2: Offline batch inference
For processing large datasets without server overhead.
Copy this checklist:
```
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results
```
**Step 1: Prepare input data**
```python
# Load prompts from file
prompts = []
with open("prompts.txt") as f:
prompts = [line.strip() for line in f]
print(f"Loaded {len(prompts)} prompts")
```
**Step 2: Configure LLM engine**
```python
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3-8B-Instruct",
tensor_parallel_size=2, # Use 2 GPUs
gpu_memory_utilization=0.9,
max_model_len=4096
)
sampling = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=512,
stop=["</s>", "\n\n"]
)
```
**Step 3: Run batch inference**
vLLM automatically batches requests for efficiency:
```python
# Process all prompts in one call
outputs = llm.generate(prompts, sampling)
# vLLM handles batching internally
# No need to manually chunk prompts
```
**Step 4: Process results**
```python
# Extract generated text
results = []
for output in outputs:
prompt = output.prompt
generated = output.outputs[0].text
results.append({
"prompt": prompt,
"generated": generated,
"tokens": len(output.outputs[0].token_ids)
})
# Save to file
import json
with open("results.jsonl", "w") as f:
for result in results:
f.write(json.dumps(result) + "\n")
print(f"Processed {len(results)} prompts")
```
### Workflow 3: Quantized model serving
Fit large models in limited GPU memory.
```
Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy
```
**Step 1: Choose quantization method**
- **AWQ**: Best for 70B models, minimal accuracy loss
- **GPTQ**: Wide model support, good compression
- **FP8**: Fastest on H100 GPUs
**Step 2: Find or create quantized model**
Use pre-quantized models from HuggingFace:
```bash
# Search for AWQ models
# Example: TheBloke/Llama-2-70B-AWQ
```
**Step 3: Launch with quantization flag**
```bash
# Using pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95
# Results: 70B model in ~40GB VRAM
```
**Step 4: Verify accuracy**
Test outputs match expected quality:
```python
# Compare quantized vs non-quantized responses
# Verify task-specific performance unchanged
```
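One crude automated signal for Step 4 is lexical drift between baseline and quantized outputs on the same prompts. This is a sketch only, not a substitute for a proper task-specific evaluation set; the 0.6 threshold is an arbitrary assumption to tune for your domain.

```python
import difflib

def output_similarity(baseline: str, quantized: str) -> float:
    """Crude lexical similarity between two model outputs (0..1)."""
    return difflib.SequenceMatcher(None, baseline, quantized).ratio()

def flag_drift(pairs, threshold=0.6):
    """Return indices of (baseline, quantized) pairs that drift below threshold."""
    return [i for i, (b, q) in enumerate(pairs) if output_similarity(b, q) < threshold]
```

Flagged prompts are candidates for manual review before accepting the quantized model.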
## When to use vs alternatives
**Use vLLM when:**
- Deploying production LLM APIs (100+ req/sec)
- Serving OpenAI-compatible endpoints
- Limited GPU memory but need large models
- Multi-user applications (chatbots, assistants)
- Need low latency with high throughput
**Use alternatives instead:**
- **llama.cpp**: CPU/edge inference, single-user
- **HuggingFace transformers**: Research, prototyping, one-off generation
- **TensorRT-LLM**: NVIDIA-only, need absolute maximum performance
- **Text-Generation-Inference**: Already in HuggingFace ecosystem
## Common issues
**Issue: Out of memory during model loading**
Reduce memory usage:
```bash
vllm serve MODEL \
--gpu-memory-utilization 0.7 \
--max-model-len 4096
```
Or use quantization:
```bash
vllm serve MODEL --quantization awq
```
**Issue: Slow first token (TTFT > 1 second)**
Enable prefix caching for repeated prompts:
```bash
vllm serve MODEL --enable-prefix-caching
```
For long prompts, enable chunked prefill:
```bash
vllm serve MODEL --enable-chunked-prefill
```
**Issue: Model not found error**
Use `--trust-remote-code` for custom models:
```bash
vllm serve MODEL --trust-remote-code
```
**Issue: Low throughput (<50 req/sec)**
Increase concurrent sequences:
```bash
vllm serve MODEL --max-num-seqs 512
```
Check GPU utilization with `nvidia-smi` - should be >80%.
**Issue: Inference slower than expected**
Verify tensor parallelism uses power of 2 GPUs:
```bash
vllm serve MODEL --tensor-parallel-size 4 # Not 3
```
Enable speculative decoding for faster generation:
```bash
vllm serve MODEL --speculative-model DRAFT_MODEL
```
## Advanced topics
**Server deployment patterns**: See [references/server-deployment.md](references/server-deployment.md) for Docker, Kubernetes, and load balancing configurations.
**Performance optimization**: See [references/optimization.md](references/optimization.md) for PagedAttention tuning, continuous batching details, and benchmark results.
**Quantization guide**: See [references/quantization.md](references/quantization.md) for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
**Troubleshooting**: See [references/troubleshooting.md](references/troubleshooting.md) for detailed error messages, debugging steps, and performance diagnostics.
## Hardware requirements
- **Small models (7B-13B)**: 1x A10 (24GB) or A100 (40GB)
- **Medium models (30B-40B)**: 2x A100 (40GB) with tensor parallelism
- **Large models (70B+)**: 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ
Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
## Resources
- Official docs: https://docs.vllm.ai
- GitHub: https://github.com/vllm-project/vllm
- Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
- Community: https://discuss.vllm.ai

# Performance Optimization
## Contents
- PagedAttention explained
- Continuous batching mechanics
- Prefix caching strategies
- Speculative decoding setup
- Benchmark results and comparisons
- Performance tuning guide
## PagedAttention explained
**Traditional attention problem**:
- KV cache stored in contiguous memory
- Wastes ~50% GPU memory due to fragmentation
- Cannot dynamically reallocate for varying sequence lengths
**PagedAttention solution**:
- Divides KV cache into fixed-size blocks (like OS virtual memory)
- Dynamic allocation from free block queue
- Shares blocks across sequences (for prefix caching)
**Memory savings example**:
```
Traditional: 70B model needs 160GB KV cache → OOM on 8x A100
PagedAttention: 70B model needs 80GB KV cache → Fits on 4x A100
```
**Configuration**:
```bash
# Block size (default: 16 tokens)
vllm serve MODEL --block-size 16
# Number of GPU blocks (auto-calculated)
# Controlled by --gpu-memory-utilization
vllm serve MODEL --gpu-memory-utilization 0.9
```
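The block-based allocation idea can be illustrated with a toy allocator (conceptual only, not vLLM's actual implementation): blocks come from a free list, a sequence grabs a new block only when its current one fills, and finished sequences return their blocks immediately.

```python
class ToyBlockAllocator:
    """Toy paged KV cache: fixed-size blocks handed out from a free list."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}  # sequence id -> list of physical block ids

    def append_token(self, seq_id, seq_len):
        """Allocate a new block only when the current block is full."""
        table = self.tables.setdefault(seq_id, [])
        if seq_len % self.block_size == 0:
            table.append(self.free.pop())
        return table

    def release(self, seq_id):
        """Finished sequences return all their blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
```

Because allocation is per-block rather than per-maximum-sequence-length, memory is never reserved for tokens that are never generated.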
## Continuous batching mechanics
**Traditional batching**:
- Wait for all sequences in batch to finish
- GPU idle while waiting for longest sequence
- Low GPU utilization (~40-60%)
**Continuous batching**:
- Add new requests as slots become available
- Mix prefill (new requests) and decode (ongoing) in same batch
- High GPU utilization (>90%)
**Throughput improvement**:
```
Traditional batching: 50 req/sec @ 50% GPU util
Continuous batching: 200 req/sec @ 90% GPU util
= 4x throughput improvement
```
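The scheduling difference can be sketched in a few lines (a toy model, ignoring prefill and token-level details): slots are refilled the moment a sequence finishes, instead of waiting for the whole batch.

```python
def continuous_batching(requests, max_slots=4):
    """Toy decode loop: admit new requests the moment a slot frees up.

    requests: list of (request_id, tokens_to_generate).
    Returns (completion order, decode steps taken).
    """
    queue = list(requests)
    running = {}
    finished = []
    steps = 0
    while queue or running:
        while queue and len(running) < max_slots:  # refill free slots
            rid, need = queue.pop(0)
            running[rid] = need
        for rid in list(running):                  # one decode step per sequence
            running[rid] -= 1
            if running[rid] == 0:
                finished.append(rid)
                del running[rid]                   # slot frees immediately
        steps += 1
    return finished, steps
```

With the example below, continuous batching finishes five requests in 4 steps; a traditional scheduler that waits for each whole batch would need 6.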
**Tuning parameters**:
```bash
# Max concurrent sequences (higher = more batching)
vllm serve MODEL --max-num-seqs 256
# Prefill/decode schedule (auto-balanced by default)
# No manual tuning needed
```
## Prefix caching strategies
Reuse computed KV cache for common prompt prefixes.
**Use cases**:
- System prompts repeated across requests
- Few-shot examples in every prompt
- RAG contexts with overlapping chunks
**Example savings**:
```
Prompt: [System: 500 tokens] + [User: 100 tokens]
Without caching: Compute 600 tokens every request
With caching: Compute 500 tokens once, then 100 tokens/request
= 83% faster TTFT
```
**Enable prefix caching**:
```bash
vllm serve MODEL --enable-prefix-caching
```
**Automatic prefix detection**:
- vLLM detects common prefixes automatically
- No code changes required
- Works with OpenAI-compatible API
**Cache hit rate monitoring**:
```bash
curl http://localhost:9090/metrics | grep cache_hit
# vllm_cache_hit_rate: 0.75 (75% hit rate)
```
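Conceptually, prefix reuse is block-aligned: only whole leading KV-cache blocks that match token-for-token can be reused. The sketch below illustrates this with hypothetical token ids; it is not vLLM's internal matching code.

```python
def reusable_blocks(cached_tokens, new_tokens, block_size=16):
    """Count leading KV-cache blocks a new request can reuse (conceptual)."""
    blocks = 0
    while True:
        lo, hi = blocks * block_size, (blocks + 1) * block_size
        if hi > len(cached_tokens) or hi > len(new_tokens):
            break
        if cached_tokens[lo:hi] != new_tokens[lo:hi]:
            break
        blocks += 1
    return blocks

# A 500-token system prompt shared across requests (hypothetical token ids)
shared = list(range(500)) + [900, 901]  # system prompt + first user's tokens
new = list(range(500)) + [777, 778]     # same prompt + a different user turn
```

Here 500 // 16 = 31 full blocks are reusable, so only the short tail is recomputed, which is where the 83% TTFT saving above comes from.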
## Speculative decoding setup
Use smaller "draft" model to propose tokens, larger model to verify.
**Speed improvement**:
```
Standard: Generate 1 token per forward pass
Speculative: Generate 3-5 tokens per forward pass
= 2-3x faster generation
```
**How it works**:
1. Draft model proposes K tokens (fast)
2. Target model verifies all K tokens in parallel (one pass)
3. Accept verified tokens, restart from first rejection
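The accept/reject loop can be sketched as follows. This is a simplified greedy-acceptance version (real implementations use probabilistic rejection sampling); `draft_propose` and `target_argmax` are stand-ins for the two models.

```python
def speculative_step(draft_propose, target_argmax, prefix, k=5):
    """One draft-then-verify round with greedy acceptance (simplified).

    draft_propose(prefix, k) -> k cheap candidate tokens
    target_argmax(prefix, drafted) -> target model's token at each position,
    all computed in a single parallel forward pass.
    """
    drafted = draft_propose(prefix, k)
    verified = target_argmax(prefix, drafted)
    accepted = []
    for d, v in zip(drafted, verified):
        if d == v:
            accepted.append(d)   # draft matched the target: keep going
        else:
            accepted.append(v)   # mismatch: take target's token, stop
            break
    return prefix + accepted
```

In the test below, three tokens are emitted for a single target-model pass, which is the source of the 2-3x speedup.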
**Setup with separate draft model**:
```bash
vllm serve meta-llama/Llama-3-70B-Instruct \
--speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--num-speculative-tokens 5
```
**Setup with n-gram draft** (no separate model):
```bash
vllm serve MODEL \
--speculative-method ngram \
--num-speculative-tokens 3
```
**When to use**:
- Output length > 100 tokens
- Draft model 5-10x smaller than target
- Acceptable 2-3% accuracy trade-off
## Benchmark results
**vLLM vs HuggingFace Transformers** (Llama 3 8B, A100):
```
Metric | HF Transformers | vLLM | Improvement
------------------------|-----------------|--------|------------
Throughput (req/sec) | 12 | 280 | 23x
TTFT (ms) | 850 | 120 | 7x
Tokens/sec | 45 | 2,100 | 47x
GPU Memory (GB) | 28 | 16 | 1.75x less
```
**vLLM vs TensorRT-LLM** (Llama 2 70B, 4x A100):
```
Metric | TensorRT-LLM | vLLM | Notes
------------------------|--------------|--------|------------------
Throughput (req/sec) | 320 | 285 | TRT 12% faster
Setup complexity | High | Low | vLLM much easier
NVIDIA-only | Yes | No | vLLM multi-platform
Quantization support | FP8, INT8 | AWQ/GPTQ/FP8 | vLLM more options
```
## Performance tuning guide
**Step 1: Measure baseline**
```bash
# Run baseline benchmark with vLLM's bundled benchmark tool
vllm bench throughput \
--model MODEL \
--input-tokens 128 \
--output-tokens 256 \
--num-prompts 1000
# Record: throughput, TTFT, tokens/sec
```
**Step 2: Tune memory utilization**
```bash
# Try different values: 0.7, 0.85, 0.9, 0.95
vllm serve MODEL --gpu-memory-utilization 0.9
```
Higher = more batch capacity = higher throughput, but risks OOM.
**Step 3: Tune concurrency**
```bash
# Try values: 128, 256, 512, 1024
vllm serve MODEL --max-num-seqs 256
```
Higher = more batching opportunity, but may increase latency.
**Step 4: Enable optimizations**
```bash
# prefix caching helps repeated prompts; chunked prefill helps long prompts
vllm serve MODEL \
--enable-prefix-caching \
--enable-chunked-prefill \
--gpu-memory-utilization 0.9 \
--max-num-seqs 512
```
**Step 5: Re-benchmark and compare**
Target improvements:
- Throughput: +30-100%
- TTFT: -20-50%
- GPU utilization: >85%
**Common performance issues**:
**Low throughput (<50 req/sec)**:
- Increase `--max-num-seqs`
- Enable `--enable-prefix-caching`
- Check GPU utilization (should be >80%)
**High TTFT (>1 second)**:
- Enable `--enable-chunked-prefill`
- Reduce `--max-model-len` if possible
- Check if model is too large for GPU
**OOM errors**:
- Reduce `--gpu-memory-utilization` to 0.7
- Reduce `--max-model-len`
- Use quantization (`--quantization awq`)

# Quantization Guide
## Contents
- Quantization methods comparison
- AWQ setup and usage
- GPTQ setup and usage
- FP8 quantization (H100)
- Model preparation
- Accuracy vs compression trade-offs
## Quantization methods comparison
| Method | Compression | Accuracy Loss | Speed | Best For |
|--------|-------------|---------------|-------|----------|
| **AWQ** | 4-bit (75%) | <1% | Fast | 70B models, production |
| **GPTQ** | 4-bit (75%) | 1-2% | Fast | Wide model support |
| **FP8** | 8-bit (50%) | <0.5% | Fastest | H100 GPUs only |
| **SqueezeLLM** | 3-4 bit (75-80%) | 2-3% | Medium | Extreme compression |
**Recommendation**:
- **Production**: Use AWQ for 70B models
- **H100 GPUs**: Use FP8 for best speed
- **Maximum compatibility**: Use GPTQ
- **Extreme compression**: Use SqueezeLLM
## AWQ setup and usage
**AWQ** (Activation-aware Weight Quantization) achieves best accuracy at 4-bit.
**Step 1: Find pre-quantized model**
Search HuggingFace for AWQ models:
```bash
# Example: TheBloke/Llama-2-70B-AWQ
# Example: TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
```
**Step 2: Launch with AWQ**
```bash
vllm serve TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95
```
**Memory savings**:
```
Llama 2 70B fp16: 140GB VRAM (4x A100 needed)
Llama 2 70B AWQ: 35GB VRAM (1x A100 40GB)
= 4x memory reduction
```
**Step 3: Verify performance**
Test that outputs are acceptable:
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# Test complex reasoning
response = client.chat.completions.create(
model="TheBloke/Llama-2-70B-AWQ",
messages=[{"role": "user", "content": "Explain quantum entanglement"}]
)
print(response.choices[0].message.content)
# Verify quality matches your requirements
```
**Quantize your own model** (requires GPU with 80GB+ VRAM):
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Llama-2-70b-hf"
quant_path = "llama-2-70b-awq"
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantize
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
## GPTQ setup and usage
**GPTQ** has widest model support and good compression.
**Step 1: Find GPTQ model**
```bash
# Example: TheBloke/Llama-2-13B-GPTQ
# Example: TheBloke/CodeLlama-34B-GPTQ
```
**Step 2: Launch with GPTQ**
```bash
vllm serve TheBloke/Llama-2-13B-GPTQ \
--quantization gptq \
--dtype float16
```
**GPTQ configuration options**:
```bash
# Specify GPTQ parameters if needed
# --gptq-act-order enables activation ordering
vllm serve MODEL \
--quantization gptq \
--gptq-act-order \
--dtype float16
```
**Quantize your own model**:
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
model_name = "meta-llama/Llama-2-13b-hf"
quantized_name = "llama-2-13b-gptq"
# Define the quantization config before loading the model
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True
)
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
# Prepare calibration data: tokenized samples from your target domain
texts = [...]  # List of sample texts
calib_data = [tokenizer(t, return_tensors="pt") for t in texts]
# Quantize and save
model.quantize(calib_data)
model.save_quantized(quantized_name)
```
## FP8 quantization (H100)
**FP8** (8-bit floating point) offers best speed on H100 GPUs with minimal accuracy loss.
**Requirements**:
- H100 or H800 GPU
- CUDA 12.3+ (12.8 recommended)
- Hopper architecture support
**Step 1: Enable FP8**
```bash
vllm serve meta-llama/Llama-3-70B-Instruct \
--quantization fp8 \
--tensor-parallel-size 2
```
**Performance gains on H100**:
```
fp16: 180 tokens/sec
FP8: 320 tokens/sec
= 1.8x speedup
```
**Step 2: Verify accuracy**
FP8 typically has <0.5% accuracy degradation:
```python
# Run evaluation suite
# Compare FP8 vs FP16 on your tasks
# Verify acceptable accuracy
```
**Dynamic FP8 quantization** (no pre-quantized model needed):
```bash
# vLLM automatically quantizes at runtime
vllm serve MODEL --quantization fp8
# No model preparation required
```
## Model preparation
**Pre-quantized models (easiest)**:
1. Search HuggingFace: `[model name] AWQ` or `[model name] GPTQ`
2. Download or use directly: `TheBloke/[Model]-AWQ`
3. Launch with appropriate `--quantization` flag
**Quantize your own model**:
**AWQ**:
```bash
# Install AutoAWQ
pip install autoawq
# Run quantization script
python quantize_awq.py --model MODEL --output OUTPUT
```
**GPTQ**:
```bash
# Install AutoGPTQ
pip install auto-gptq
# Run quantization script
python quantize_gptq.py --model MODEL --output OUTPUT
```
**Calibration data**:
- Use 128-512 diverse examples from target domain
- Representative of production inputs
- Higher quality calibration = better accuracy
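A calibration set can be assembled with a small helper like the one below. This is a hypothetical sketch: the JSONL path and the `"text"` field name are assumptions about your data layout, not part of any quantization library's API.

```python
import json

def load_calibration(path, n=256, min_chars=200):
    """Sample up to n sufficiently long texts from a JSONL file of domain examples."""
    texts = []
    with open(path) as f:
        for line in f:
            text = json.loads(line).get("text", "")
            if len(text) >= min_chars:
                texts.append(text)
            if len(texts) == n:
                break
    return texts
```

Filtering out very short records keeps the calibration distribution closer to real production inputs.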
## Accuracy vs compression trade-offs
**Empirical results** (Llama 2 70B on MMLU benchmark):
| Quantization | Accuracy | Memory | Speed | Production-Ready |
|--------------|----------|--------|-------|------------------|
| FP16 (baseline) | 100% | 140GB | 1.0x | ✅ (if memory available) |
| FP8 | 99.5% | 70GB | 1.8x | ✅ (H100 only) |
| AWQ 4-bit | 99.0% | 35GB | 1.5x | ✅ (best for 70B) |
| GPTQ 4-bit | 98.5% | 35GB | 1.5x | ✅ (good compatibility) |
| SqueezeLLM 3-bit | 96.0% | 26GB | 1.3x | ⚠️ (check accuracy) |
**When to use each**:
**No quantization (FP16)**:
- Have sufficient GPU memory
- Need absolute best accuracy
- Model <13B parameters
**FP8**:
- Using H100/H800 GPUs
- Need best speed with minimal accuracy loss
- Production deployment
**AWQ 4-bit**:
- Need to fit 70B model in 40GB GPU
- Production deployment
- <1% accuracy loss acceptable
**GPTQ 4-bit**:
- Wide model support needed
- Not on H100 (use FP8 instead)
- 1-2% accuracy loss acceptable
**Testing strategy**:
1. **Baseline**: Measure FP16 accuracy on your evaluation set
2. **Quantize**: Create quantized version
3. **Evaluate**: Compare quantized vs baseline on same tasks
4. **Decide**: Accept if degradation < threshold (typically 1-2%)
**Example evaluation**:
```python
from evaluate import load_evaluation_suite
# Run on FP16 baseline
baseline_score = evaluate(model_fp16, eval_suite)
# Run on quantized
quant_score = evaluate(model_awq, eval_suite)
# Compare
degradation = (baseline_score - quant_score) / baseline_score * 100
print(f"Accuracy degradation: {degradation:.2f}%")
# Decision
if degradation < 1.0:
print("✅ Quantization acceptable for production")
else:
print("⚠️ Review accuracy loss")
```

# Server Deployment Patterns
## Contents
- Docker deployment
- Kubernetes deployment
- Load balancing with Nginx
- Multi-node distributed serving
- Production configuration examples
- Health checks and monitoring
## Docker deployment
**Basic Dockerfile**:
```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y python3-pip
RUN pip install vllm
EXPOSE 8000
CMD ["vllm", "serve", "meta-llama/Llama-3-8B-Instruct", \
"--host", "0.0.0.0", "--port", "8000", \
"--gpu-memory-utilization", "0.9"]
```
**Build and run**:
```bash
docker build -t vllm-server .
docker run --gpus all -p 8000:8000 vllm-server
```
**Docker Compose** (with metrics):
```yaml
version: '3.8'
services:
vllm:
image: vllm/vllm-openai:latest
command: >
--model meta-llama/Llama-3-8B-Instruct
--gpu-memory-utilization 0.9
--enable-metrics
--metrics-port 9090
ports:
- "8000:8000"
- "9090:9090"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
```
## Kubernetes deployment
**Deployment manifest**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
spec:
replicas: 2
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model=meta-llama/Llama-3-8B-Instruct"
- "--gpu-memory-utilization=0.9"
- "--enable-prefix-caching"
resources:
limits:
nvidia.com/gpu: 1
ports:
- containerPort: 8000
name: http
- containerPort: 9090
name: metrics
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
name: vllm-service
spec:
selector:
app: vllm
ports:
- port: 8000
targetPort: 8000
name: http
- port: 9090
targetPort: 9090
name: metrics
type: LoadBalancer
```
## Load balancing with Nginx
**Nginx configuration**:
```nginx
upstream vllm_backend {
least_conn; # Route to least-loaded server
server localhost:8001;
server localhost:8002;
server localhost:8003;
}
server {
listen 80;
location / {
proxy_pass http://vllm_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
# Timeouts for long-running inference
proxy_read_timeout 300s;
proxy_connect_timeout 75s;
}
# Metrics endpoint
location /metrics {
proxy_pass http://localhost:9090/metrics;
}
}
```
**Start multiple vLLM instances**:
```bash
# Terminal 1
vllm serve MODEL --port 8001 --tensor-parallel-size 1
# Terminal 2
vllm serve MODEL --port 8002 --tensor-parallel-size 1
# Terminal 3
vllm serve MODEL --port 8003 --tensor-parallel-size 1
# Start Nginx
nginx -c /path/to/nginx.conf
```
## Multi-node distributed serving
For models too large for single node:
**Node 1** (head):
```bash
# Start the Ray head node, then launch vLLM once on it
ray start --head --port 6379
vllm serve meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2
```
**Node 2** (worker):
```bash
# Join the Ray cluster; the head node schedules work across all GPUs
ray start --address 192.168.1.10:6379
```
## Production configuration examples
**High throughput** (batch-heavy workload):
```bash
vllm serve MODEL \
--max-num-seqs 512 \
--gpu-memory-utilization 0.95 \
--enable-prefix-caching \
--trust-remote-code
```
**Low latency** (interactive workload):
```bash
vllm serve MODEL \
--max-num-seqs 64 \
--gpu-memory-utilization 0.85 \
--enable-chunked-prefill
```
**Memory-constrained** (40GB GPU for 70B model):
```bash
vllm serve TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95 \
--max-model-len 4096
```
## Health checks and monitoring
**Health check endpoint**:
```bash
curl http://localhost:8000/health
# Returns: {"status": "ok"}
```
**Readiness check** (wait for model loaded):
```bash
#!/bin/bash
until curl -f http://localhost:8000/health; do
echo "Waiting for vLLM to be ready..."
sleep 5
done
echo "vLLM is ready!"
```
**Prometheus scraping**:
```yaml
# prometheus.yml
scrape_configs:
- job_name: 'vllm'
static_configs:
- targets: ['localhost:9090']
metrics_path: '/metrics'
scrape_interval: 15s
```
**Grafana dashboard** (key metrics):
- Requests per second: `rate(vllm_request_success_total[5m])`
- TTFT p50: `histogram_quantile(0.5, vllm_time_to_first_token_seconds_bucket)`
- TTFT p99: `histogram_quantile(0.99, vllm_time_to_first_token_seconds_bucket)`
- GPU cache usage: `vllm_gpu_cache_usage_perc`
- Active requests: `vllm_num_requests_running`

# Troubleshooting Guide
## Contents
- Out of memory (OOM) errors
- Performance issues
- Model loading errors
- Network and connection issues
- Quantization problems
- Distributed serving issues
- Debugging tools and commands
## Out of memory (OOM) errors
### Symptom: `torch.cuda.OutOfMemoryError` during model loading
**Cause**: Model + KV cache exceeds available VRAM
**Solutions (try in order)**:
1. **Reduce GPU memory utilization**:
```bash
vllm serve MODEL --gpu-memory-utilization 0.7 # Try 0.7, 0.75, 0.8
```
2. **Reduce max sequence length**:
```bash
vllm serve MODEL --max-model-len 4096 # Instead of 8192
```
3. **Enable quantization**:
```bash
vllm serve MODEL --quantization awq # 4x memory reduction
```
4. **Use tensor parallelism** (multiple GPUs):
```bash
vllm serve MODEL --tensor-parallel-size 2 # Split across 2 GPUs
```
5. **Reduce max concurrent sequences**:
```bash
vllm serve MODEL --max-num-seqs 128 # Default is 256
```
### Symptom: OOM during inference (not model loading)
**Cause**: KV cache fills up during generation
**Solutions**:
```bash
# Reduce KV cache allocation
vllm serve MODEL --gpu-memory-utilization 0.85
# Reduce batch size
vllm serve MODEL --max-num-seqs 64
# Reduce max tokens per request
# Set in client request: max_tokens=512
```
### Symptom: OOM with quantized model
**Cause**: Quantization overhead or incorrect configuration
**Solution**:
```bash
# Ensure quantization flag matches model
vllm serve TheBloke/Llama-2-70B-AWQ --quantization awq # Must specify
# Try different dtype
vllm serve MODEL --quantization awq --dtype float16
```
## Performance issues
### Symptom: Low throughput (<50 req/sec expected >100)
**Diagnostic steps**:
1. **Check GPU utilization**:
```bash
watch -n 1 nvidia-smi
# GPU utilization should be >80%
```
If <80%, increase concurrent requests:
```bash
vllm serve MODEL --max-num-seqs 512 # Increase from 256
```
2. **Check if memory-bound**:
```bash
# If memory at 100% but GPU <80%, reduce sequence length
vllm serve MODEL --max-model-len 4096
```
3. **Enable optimizations**:
```bash
vllm serve MODEL \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-seqs 512
```
4. **Check tensor parallelism settings**:
```bash
# Must use power-of-2 GPUs
vllm serve MODEL --tensor-parallel-size 4 # Not 3 or 5
```
### Symptom: High TTFT (time to first token >1 second)
**Causes and solutions**:
**Long prompts**:
```bash
vllm serve MODEL --enable-chunked-prefill
```
**No prefix caching**:
```bash
vllm serve MODEL --enable-prefix-caching # For repeated prompts
```
**Too many concurrent requests**:
```bash
vllm serve MODEL --max-num-seqs 64 # Reduce to prioritize latency
```
**Model too large for single GPU**:
```bash
vllm serve MODEL --tensor-parallel-size 2 # Parallelize prefill
```
### Symptom: Slow token generation (low tokens/sec)
**Diagnostic**:
```bash
# Check if model is correct size
vllm serve MODEL # Should see model size in logs
# Check speculative decoding
vllm serve MODEL --speculative-model DRAFT_MODEL
```
**For H100 GPUs**, enable FP8:
```bash
vllm serve MODEL --quantization fp8
```
## Model loading errors
### Symptom: `OSError: MODEL not found`
**Causes**:
1. **Model name typo**:
```bash
# Check exact model name on HuggingFace
vllm serve meta-llama/Llama-3-8B-Instruct # Correct capitalization
```
2. **Private/gated model**:
```bash
# Login to HuggingFace first
huggingface-cli login
# Then run vLLM
vllm serve meta-llama/Llama-3-70B-Instruct
```
3. **Custom model needs trust flag**:
```bash
vllm serve MODEL --trust-remote-code
```
### Symptom: `ValueError: Tokenizer not found`
**Solution**:
```bash
# Download model manually first
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('MODEL')"
# Then launch vLLM
vllm serve MODEL
```
### Symptom: `ImportError: No module named 'flash_attn'`
**Solution**:
```bash
# Install flash attention
pip install flash-attn --no-build-isolation
# Or disable flash attention
vllm serve MODEL --disable-flash-attn
```
## Network and connection issues
### Symptom: `Connection refused` when querying server
**Diagnostic**:
1. **Check server is running**:
```bash
curl http://localhost:8000/health
```
2. **Check port binding**:
```bash
# Bind to all interfaces for remote access
vllm serve MODEL --host 0.0.0.0 --port 8000
# Check if port is in use
lsof -i :8000
```
3. **Check firewall**:
```bash
# Allow port through firewall
sudo ufw allow 8000
```
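A quick way to test reachability from the client side without curl is a plain TCP connect. A standard-library sketch (host and port are whatever your deployment uses):

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unresolvable
        return False

print(port_reachable("localhost", 8000))  # False unless something is listening on 8000
```

If this returns True but curl still fails, the problem is at the HTTP layer (wrong path, proxy, TLS) rather than the network.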
### Symptom: Slow response times over network
**Solutions**:
1. **Increase timeout**:
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY",
timeout=300.0 # 5 minute timeout
)
```
2. **Check network latency**:
```bash
ping SERVER_IP # Should be <10ms for local network
```
3. **Use connection pooling**:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
retries = Retry(total=3, backoff_factor=1)
session.mount('http://', HTTPAdapter(max_retries=retries))
# Reuse the session for every request so TCP connections are pooled
response = session.get("http://localhost:8000/v1/models")
```
## Quantization problems
### Symptom: `RuntimeError: Quantization format not supported`
**Solution**:
```bash
# Ensure correct quantization method
vllm serve MODEL --quantization awq # For AWQ models
vllm serve MODEL --quantization gptq # For GPTQ models
# Check model card for quantization type
```
### Symptom: Poor quality outputs after quantization
**Diagnostic**:
1. **Verify model is correctly quantized**:
```bash
# Check model config.json for quantization_config
cat ~/.cache/huggingface/hub/models--MODEL/config.json
```
2. **Try different quantization method**:
```bash
# If AWQ quality issues, try FP8 (H100 only)
vllm serve MODEL --quantization fp8
# Or use less aggressive quantization
vllm serve MODEL # No quantization
```
3. **Adjust sampling (can mask mild degradation, but will not recover lost accuracy)**:
```python
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```
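For intuition on why lower bit-widths degrade quality, a toy round-to-nearest quantizer makes the error growth visible. This is illustrative only — real schemes like AWQ and GPTQ use calibration data and per-group scales and do far better than this naive version:

```python
def quantize_roundtrip_error(values, bits):
    """Mean absolute error after naive symmetric round-to-nearest quantization."""
    levels = 2 ** (bits - 1) - 1               # e.g. 127 for int8, 7 for int4
    scale = max(abs(v) for v in values) / levels
    dequant = [round(v / scale) * scale for v in values]
    return sum(abs(a - b) for a, b in zip(values, dequant)) / len(values)

weights = [0.013 * i - 0.4 for i in range(64)]  # synthetic "weight" values
print(quantize_roundtrip_error(weights, 8))     # small error
print(quantize_roundtrip_error(weights, 4))     # noticeably larger error
```

The 4-bit grid has only 15 representable levels versus 255 at 8 bits, so the rounding error per weight grows by roughly an order of magnitude — which is why 4-bit methods lean heavily on calibration.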
## Distributed serving issues
### Symptom: `RuntimeError: Distributed init failed`
**Diagnostic**:
1. **Check environment variables**:
```bash
# On all nodes
echo $MASTER_ADDR # Should be same
echo $MASTER_PORT # Should be same
echo $RANK # Should be unique per node (0, 1, 2, ...)
echo $WORLD_SIZE # Should be same (total nodes)
```
2. **Check network connectivity**:
```bash
# From node 1 to node 2
ping NODE2_IP
nc -zv NODE2_IP 29500 # Check port accessibility
```
3. **Check NCCL settings**:
```bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0 # Or your network interface
vllm serve MODEL --tensor-parallel-size 8
```
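The four environment variables above can be cross-checked programmatically before launch. A small sketch (hypothetical helper; `node_envs` would be collected from each node, e.g. via `env | grep -E 'MASTER|RANK|WORLD'`):

```python
def validate_dist_env(node_envs: list[dict]) -> list[str]:
    """Sanity-check MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE across nodes.

    node_envs holds one dict of env vars per node. Returns a list of
    problems found (empty list means the basics look consistent).
    """
    problems = []
    for key in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE"):
        if len({env.get(key) for env in node_envs}) != 1:
            problems.append(f"{key} differs across nodes")
    ranks = [env.get("RANK") for env in node_envs]
    if len(set(ranks)) != len(ranks):
        problems.append("RANK values are not unique")
    if node_envs and node_envs[0].get("WORLD_SIZE") != str(len(node_envs)):
        problems.append("WORLD_SIZE does not match node count")
    return problems

good = [{"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500",
         "WORLD_SIZE": "2", "RANK": "0"},
        {"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500",
         "WORLD_SIZE": "2", "RANK": "1"}]
print(validate_dist_env(good))  # []
```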
### Symptom: `NCCL error: unhandled cuda error`
**Solutions**:
```bash
# Set NCCL to use correct network interface
export NCCL_SOCKET_IFNAME=eth0 # Replace with your interface
# Increase timeout
export NCCL_TIMEOUT=1800 # 30 minutes
# Force P2P for debugging
export NCCL_P2P_DISABLE=1
```
## Debugging tools and commands
### Enable debug logging
```bash
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL
```
### Monitor GPU usage
```bash
# Real-time GPU monitoring
watch -n 1 nvidia-smi
# Memory breakdown
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1
```
### Profile performance
```bash
# Built-in benchmarking
vllm bench throughput \
  --model MODEL \
  --input-len 128 \
  --output-len 256 \
  --num-prompts 100
vllm bench latency \
  --model MODEL \
  --input-len 128 \
  --output-len 256 \
  --batch-size 8
```
### Check metrics
```bash
# vLLM exposes Prometheus-format metrics on the serving port
curl http://localhost:8000/metrics
# Filter for specific metrics
curl http://localhost:8000/metrics | grep vllm_time_to_first_token
# Key metrics to monitor:
# - vllm_time_to_first_token_seconds
# - vllm_time_per_output_token_seconds
# - vllm_num_requests_running
# - vllm_gpu_cache_usage_perc
# - vllm_request_success_total
```
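To track these values in a script rather than eyeballing curl output, the Prometheus text format is easy to parse. A minimal sketch (ignores labels and histogram internals and keeps the last sample per metric name):

```python
def parse_prom_metrics(text: str) -> dict:
    """Parse Prometheus text exposition into {metric_name: value}.

    Minimal: drops labels, comments, and all but the last sample
    per metric name. Enough for spot-checking gauges and counters.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]   # strip {label="..."} if present
        try:
            metrics[name] = float(value)
        except ValueError:
            pass
    return metrics

sample = """# HELP vllm_num_requests_running Number of running requests
vllm_num_requests_running{model="m"} 3.0
vllm_gpu_cache_usage_perc{model="m"} 0.42
"""
print(parse_prom_metrics(sample))
```

Feed it the body of a `GET /metrics` response and alert when, say, `vllm_gpu_cache_usage_perc` stays near 1.0 (cache pressure) or `vllm_num_requests_running` pins at `--max-num-seqs`.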
### Test server health
```bash
# Health check
curl http://localhost:8000/health
# Model info
curl http://localhost:8000/v1/models
# Test completion
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MODEL",
"prompt": "Hello",
"max_tokens": 10
}'
```
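In deployment scripts it is common to poll `/health` until the server finishes loading the model, since large models can take minutes to start. A generic sketch (the `check` callable is an assumption — wrap whatever HTTP call fits your setup):

```python
import time

def wait_until_healthy(check, retries: int = 30, delay_s: float = 2.0) -> bool:
    """Poll a health check until it passes or retries are exhausted.

    `check` is any zero-argument callable returning True when the
    server is up, e.g. a lambda issuing GET /health and testing for 200.
    """
    for _ in range(retries):
        if check():
            return True
        time.sleep(delay_s)
    return False

# Simulated server that comes up on the third poll:
attempts = iter([False, False, True])
print(wait_until_healthy(lambda: next(attempts), retries=5, delay_s=0.0))  # True
```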
### Common environment variables
```bash
# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1,2,3 # Limit to specific GPUs
# vLLM settings
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_TRACE_FUNCTION=1 # Profile functions
export VLLM_USE_V1=1 # Use the V1 engine (faster scheduler)
# NCCL settings (distributed)
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0 # Enable InfiniBand
```
### Collect diagnostic info for bug reports
```bash
# System info
nvidia-smi
python --version
pip show vllm
# vLLM version and config
vllm --version
python -c "import vllm; print(vllm.__version__)"
# Run with debug logging
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve MODEL 2>&1 | tee vllm_debug.log
# Include in bug report:
# - vllm_debug.log
# - nvidia-smi output
# - Full command used
# - Expected vs actual behavior
```