Sync all skills and memories 2026-04-14 07:27

2026-04-14 07:27:20 +09:00
parent 516bb44fe6
commit 1eba2bca95
386 changed files with 167655 additions and 0 deletions
--- a/skills/mlops/inference/llama-cpp/SKILL.md
+++ b/skills/mlops/inference/llama-cpp/SKILL.md
@@ -0,0 +1,261 @@
+---
+name: llama-cpp
+description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
+version: 1.0.0
+author: Orchestra Research
+license: MIT
+dependencies: [llama-cpp-python]
+metadata:
+  hermes:
+    tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded]
+
+---
+
+# llama.cpp
+
+Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
+
+## When to use llama.cpp
+
+**Use llama.cpp when:**
+- Running on CPU-only machines
+- Deploying on Apple Silicon (M1/M2/M3/M4)
+- Using AMD or Intel GPUs (no CUDA)
+- Edge deployment (Raspberry Pi, embedded systems)
+- Need simple deployment without Docker/Python
+
+**Use TensorRT-LLM instead when:**
+- Have NVIDIA GPUs (A100/H100)
+- Need maximum throughput (100K+ tok/s)
+- Running in datacenter with CUDA
+
+**Use vLLM instead when:**
+- Have NVIDIA GPUs
+- Need Python-first API
+- Want PagedAttention
+
+## Quick start
+
+### Installation
+
+```bash
+# macOS/Linux
+brew install llama.cpp
+
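+# Or build from source (recent releases build with CMake; the old Makefile targets are deprecated)
+git clone https://github.com/ggerganov/llama.cpp
+cd llama.cpp
+cmake -B build
+cmake --build build --config Release
+# Binaries (llama-cli, llama-server, ...) are written to build/bin/
+
+# With Metal (Apple Silicon): enabled by default when building on macOS
+
+# With CUDA (NVIDIA)
+cmake -B build -DGGML_CUDA=ON
+
+# With ROCm (AMD); older releases used -DGGML_HIPBLAS=ON
+cmake -B build -DGGML_HIP=ON
+# After configuring a backend, rebuild with: cmake --build build --config Release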
+```
+
+### Download model
+
+```bash
+# Download from HuggingFace (GGUF format)
+huggingface-cli download \
+    TheBloke/Llama-2-7B-Chat-GGUF \
+    llama-2-7b-chat.Q4_K_M.gguf \
+    --local-dir models/
+
+# Or convert from HuggingFace
+python convert_hf_to_gguf.py models/llama-2-7b-chat/
+```
+
+### Run inference
+
+```bash
+# Simple chat
+./llama-cli \
+    -m models/llama-2-7b-chat.Q4_K_M.gguf \
+    -p "Explain quantum computing" \
+    -n 256  # Max tokens
+
+# Interactive chat
+./llama-cli \
+    -m models/llama-2-7b-chat.Q4_K_M.gguf \
+    --interactive
+```
+
+### Server mode
+
+```bash
+# Start OpenAI-compatible server
+./llama-server \
+    -m models/llama-2-7b-chat.Q4_K_M.gguf \
+    --host 0.0.0.0 \
+    --port 8080 \
+    -ngl 32  # Offload 32 layers to GPU
+
+# Client request
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama-2-7b-chat",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "temperature": 0.7,
+    "max_tokens": 100
+  }'
+```
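+
+The same request can be issued from Python using only the standard library. This is a minimal sketch, assuming the server from the example above is listening on `localhost:8080`; the helper names `build_chat_request` and `send` are illustrative, not part of llama.cpp.
+
+```python
+import json
+import urllib.request
+
+
+def build_chat_request(prompt, base_url="http://localhost:8080", max_tokens=100):
+    """Build (but do not send) a POST request for the chat completions endpoint."""
+    payload = {
+        "model": "llama-2-7b-chat",
+        "messages": [{"role": "user", "content": prompt}],
+        "temperature": 0.7,
+        "max_tokens": max_tokens,
+    }
+    return urllib.request.Request(
+        f"{base_url}/v1/chat/completions",
+        data=json.dumps(payload).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+
+
+def send(request, timeout=60):
+    """Send the request and return the assistant's reply text."""
+    with urllib.request.urlopen(request, timeout=timeout) as response:
+        body = json.load(response)
+    return body["choices"][0]["message"]["content"]
+
+
+if __name__ == "__main__":
+    req = build_chat_request("Hello!")
+    print(req.full_url)  # call send(req) once the server is running
+```
+
+Building the request separately from sending it keeps the payload easy to inspect before a server is available.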
+
+## Quantization formats
+
+### GGUF format overview
+
+| Format | Bits/weight (approx.) | Size (7B) | Speed | Quality | Use Case |
+|--------|-----------------------|-----------|-------|---------|----------|
+| **Q4_K_M** | 4.5 | 4.1 GB | Fast | Good | **Recommended default** |
+| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
+| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
+| Q6_K | 6.5 | 5.5 GB | Slow | Very good | High quality |
+| Q8_0 | 8.0 | 7.0 GB | Slowest | Best | Minimal degradation |
+| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
+
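+The size column follows roughly from parameters × bits per weight / 8. The sketch below is back-of-the-envelope arithmetic, assuming a nominal 7 billion parameters and the rounded bit widths from the table; real files differ somewhat because some tensors are stored at other precisions and the file carries metadata. `estimate_size_gb` is an illustrative helper.
+
+```python
+def estimate_size_gb(n_params, bits_per_weight):
+    """Approximate file size in GB (10^9 bytes) for a quantized model."""
+    return n_params * bits_per_weight / 8 / 1e9
+
+
+if __name__ == "__main__":
+    n_params = 7e9  # assumption: a nominal "7B" model
+    for name, bits in [("Q2_K", 2.5), ("Q4_K_M", 4.5), ("Q6_K", 6.5), ("Q8_0", 8.0)]:
+        print(f"{name}: ~{estimate_size_gb(n_params, bits):.1f} GB")
+```
+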
+### Choosing quantization
+
+```
+# General use (balanced)
+Q4_K_M  # 4-bit, medium quality
+
+# Maximum speed (more degradation)
+Q2_K or Q3_K_M
+
+# Maximum quality (slower)
+Q6_K or Q8_0
+
+# Very large models (70B, 405B)
+Q3_K_M or Q4_K_S  # Lower bits to fit in memory
+```
+
+## Hardware acceleration
+
+### Apple Silicon (Metal)
+
+```bash
+# Build (Metal is enabled by default on macOS)
+cmake -B build && cmake --build build --config Release
+
+# Run with GPU acceleration (automatic)
+./llama-cli -m model.gguf -ngl 999  # Offload all layers
+
+# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
+```
+
+### NVIDIA GPUs (CUDA)
+
+```bash
+# Build with CUDA
+cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
+
+# Offload layers to GPU
+./llama-cli -m model.gguf -ngl 35  # Offload 35/40 layers
+
+# Hybrid CPU+GPU for large models
+./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20  # GPU: 20 layers, CPU: rest
+```
+
+### AMD GPUs (ROCm)
+
+```bash
+# Build with ROCm (HIP backend)
+cmake -B build -DGGML_HIP=ON && cmake --build build --config Release
+
+# Run with AMD GPU
+./llama-cli -m model.gguf -ngl 999
+```
+
+## Common patterns
+
+### Batch processing
+
+```bash
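+# Run each line of prompts.txt as a separate prompt
+while IFS= read -r prompt; do
+    ./llama-cli -m model.gguf -p "$prompt" -n 100 < /dev/null
+done < prompts.txt
+# Note: --batch-size sets how many prompt tokens are processed per step, not how many prompts run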
+```
+
+### Constrained generation
+
+```bash
+# JSON output with grammar
+./llama-cli \
+    -m model.gguf \
+    -p "Generate a person: " \
+    --grammar-file grammars/json.gbnf
+
+# Outputs valid JSON only
+```
+
+### Context size
+
+```bash
+# Increase context (the default is release-dependent: 512 in older builds, larger in recent ones)
+./llama-cli \
+    -m model.gguf \
+    -c 4096  # 4K context window
+
+# Very long context (if model supports)
+./llama-cli -m model.gguf -c 32768  # 32K context
+```
+
+## Performance benchmarks
+
+### CPU performance (Llama 2-7B Q4_K_M)
+
+| CPU | Threads | Speed | Cost |
+|-----|---------|-------|------|
+| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
+| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
+| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
+| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
+
+### GPU acceleration (Llama 2-7B Q4_K_M)
+
+| GPU | Speed | vs CPU | Cost |
+|-----|-------|--------|------|
+| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
+| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
+| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
+| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
+
+## Supported models
+
+**LLaMA family**:
+- Llama 2 (7B, 13B, 70B)
+- Llama 3 (8B, 70B, 405B)
+- Code Llama
+
+**Mistral family**:
+- Mistral 7B
+- Mixtral 8x7B, 8x22B
+
+**Other**:
+- Falcon, BLOOM, GPT-J
+- Phi-3, Gemma, Qwen
+- LLaVA (vision); Whisper (audio) runs on the separate whisper.cpp project
+
+**Find models**: https://huggingface.co/models?library=gguf
+
+## References
+
+- **[Quantization Guide](references/quantization.md)** - GGUF formats, conversion, quality comparison
+- **[Server Deployment](references/server.md)** - API endpoints, Docker, monitoring
+- **[Optimization](references/optimization.md)** - Performance tuning, hybrid CPU+GPU
+
+## Resources
+
+- **GitHub**: https://github.com/ggerganov/llama.cpp
+- **Models**: https://huggingface.co/models?library=gguf
+- **Discord**: https://discord.gg/llama-cpp
+
+
--- a/skills/mlops/inference/llama-cpp/references/optimization.md
+++ b/skills/mlops/inference/llama-cpp/references/optimization.md
@@ -0,0 +1,89 @@
+# Performance Optimization Guide
+
+Maximize llama.cpp inference speed and efficiency.
+
+## CPU Optimization
+
+### Thread tuning
+```bash
+# Set threads (default: physical cores)
+./llama-cli -m model.gguf -t 8
+
+# For AMD Ryzen 9 7950X (16 cores, 32 threads)
+-t 16  # Best: physical cores
+
+# Avoid hyperthreading (slower for matrix ops)
+```
+
+### BLAS acceleration
+```bash
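+# OpenBLAS (faster matrix ops)
+cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
+cmake --build build --config Release
+
+# BLAS mainly accelerates prompt processing (large batches), where it can give 2-3× speedup;
+# token-by-token generation is largely unaffected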
+```
+
+## GPU Offloading
+
+### Layer offloading
+```bash
+# Offload 35 layers to GPU (hybrid mode)
+./llama-cli -m model.gguf -ngl 35
+
+# Offload all layers
+./llama-cli -m model.gguf -ngl 999
+
+# Find optimal value:
+# Start with -ngl 999
+# If OOM, reduce by 5 until fits
+```
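+
+The search procedure in the comments above can be sketched as a small loop. This is only an illustration: `fits` is a hypothetical callback that would launch llama.cpp with a given `-ngl` value and report whether the model loaded without running out of memory, and `n_layers` (the model's layer count) is an added parameter so the loop does not retry values that are equivalent to "all layers".
+
+```python
+def find_max_ngl(fits, n_layers, step=5):
+    """Return the largest tried -ngl value for which fits(ngl) is True.
+
+    fits: hypothetical callback, True if the model loads with that -ngl.
+    n_layers: number of layers in the model (any -ngl above it means "all").
+    """
+    if fits(999):  # first try offloading everything
+        return 999
+    ngl = n_layers - step
+    while ngl > 0:
+        if fits(ngl):
+            return ngl
+        ngl -= step  # out of memory: back off and retry
+    return 0  # nothing fits on the GPU, run CPU-only
+
+
+if __name__ == "__main__":
+    # Simulated GPU on which at most 20 layers fit (made-up number).
+    print(find_max_ngl(lambda ngl: ngl <= 20, n_layers=33))
+```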
+
+### Memory usage
+```bash
+# Check VRAM usage (refreshes every second)
+nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
+
+# Reduce context if needed
+./llama-cli -m model.gguf -c 2048  # 2K context instead of 4K
+```
+
+## Batch Processing
+
+```bash
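+# Logical batch size: prompt tokens submitted per step (default is release-dependent; 2048 in recent builds)
+./llama-cli -m model.gguf -b 512
+
+# Physical batch size: tokens actually computed at once on the device
+./llama-cli -m model.gguf -b 512 -ub 128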
+```
+
+## Context Management
+
+```bash
+# Default context (512 tokens in older builds; recent releases default to a larger value)
+-c 512
+
+# Longer context (slower, more memory)
+-c 4096
+
+# Very long context (if model supports)
+-c 32768
+```
+
+## Benchmarks
+
+### CPU Performance (Llama 2-7B Q4_K_M)
+
+| Setup | Speed | Notes |
+|-------|-------|-------|
+| Apple M3 Max | 50 tok/s | Metal acceleration |
+| AMD 7950X (16c) | 35 tok/s | OpenBLAS |
+| Intel i9-13900K | 30 tok/s | AVX2 |
+
+### GPU Offloading (RTX 4090)
+
+| Layers GPU | Speed | VRAM |
+|------------|-------|------|
+| 0 (CPU only) | 30 tok/s | 0 GB |
+| 20 (hybrid) | 80 tok/s | 8 GB |
+| 35 (all) | 120 tok/s | 12 GB |
--- a/skills/mlops/inference/llama-cpp/references/quantization.md
+++ b/skills/mlops/inference/llama-cpp/references/quantization.md
@@ -0,0 +1,213 @@
+# GGUF Quantization Guide
+
+Complete guide to GGUF quantization formats and model conversion.
+
+## Quantization Overview
+
+**GGUF** (GPT-Generated Unified Format) - Standard format for llama.cpp models.
+
+### Format Comparison
+
+| Format | Perplexity | Size (7B) | Tokens/sec | Notes |
+|--------|------------|-----------|------------|-------|
+| FP16 | 5.9565 (baseline) | 13.0 GB | 15 tok/s | Original quality |
+| Q8_0 | 5.9584 (+0.03%) | 7.0 GB | 25 tok/s | Nearly lossless |
+| **Q6_K** | 5.9642 (+0.13%) | 5.5 GB | 30 tok/s | Best quality/size |
+| **Q5_K_M** | 5.9796 (+0.39%) | 4.8 GB | 35 tok/s | Balanced |
+| **Q4_K_M** | 6.0565 (+1.68%) | 4.1 GB | 40 tok/s | **Recommended** |
+| Q4_K_S | 6.1125 (+2.62%) | 3.9 GB | 42 tok/s | Faster, lower quality |
+| Q3_K_M | 6.3184 (+6.07%) | 3.3 GB | 45 tok/s | Small models only |
+| Q2_K | 6.8673 (+15.3%) | 2.7 GB | 50 tok/s | Not recommended |
+
+**Recommendation**: Use **Q4_K_M** for best balance of quality and speed.
+
+## Converting Models
+
+### HuggingFace to GGUF
+
+```bash
+# 1. Download HuggingFace model
+huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
+    --local-dir models/llama-2-7b-chat/
+
+# 2. Convert to FP16 GGUF
+python convert_hf_to_gguf.py \
+    models/llama-2-7b-chat/ \
+    --outtype f16 \
+    --outfile models/llama-2-7b-chat-f16.gguf
+
+# 3. Quantize to Q4_K_M
+./llama-quantize \
+    models/llama-2-7b-chat-f16.gguf \
+    models/llama-2-7b-chat-Q4_K_M.gguf \
+    Q4_K_M
+```
+
+### Batch quantization
+
+```bash
+# Quantize to multiple formats
+for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
+    ./llama-quantize \
+        model-f16.gguf \
+        model-${quant}.gguf \
+        $quant
+done
+```
+
+## K-Quantization Methods
+
+**K-quants** use block-wise quantization with mixed precision for better quality:
+- Quality-sensitive tensors (parts of attention and feed-forward): Higher precision
+- Remaining weights: Lower precision
+
+**Variants**:
+- `_S` (Small): Faster, lower quality
+- `_M` (Medium): Balanced (recommended)
+- `_L` (Large): Better quality, larger size
+
+**Example**: `Q4_K_M`
+- `Q4`: 4-bit quantization
+- `K`: Mixed precision method
+- `M`: Medium quality
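+
+The naming scheme can be decoded mechanically. The parser below is a sketch that covers only the K-quant names used in this guide (for example `Q4_K_M` or `Q6_K`); `parse_k_quant` is an illustrative helper, not a llama.cpp API, and it deliberately rejects non-K formats such as `Q8_0`.
+
+```python
+import re
+
+VARIANTS = {"S": "small", "M": "medium", "L": "large"}
+_PATTERN = re.compile(r"^Q(\d)_K(?:_([SML]))?$")
+
+
+def parse_k_quant(name):
+    """Split a K-quant name into its bit width and size variant."""
+    match = _PATTERN.match(name)
+    if match is None:
+        raise ValueError(f"not a K-quant name: {name!r}")
+    bits, variant = match.groups()
+    # variant is None for names without a suffix, such as Q6_K
+    return {"bits": int(bits), "k_quant": True, "variant": VARIANTS.get(variant)}
+
+
+if __name__ == "__main__":
+    print(parse_k_quant("Q4_K_M"))
+```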
+
+## Quality Testing
+
+```bash
+# Calculate perplexity (quality metric)
+./llama-perplexity \
+    -m model.gguf \
+    -f wikitext-2-raw/wiki.test.raw \
+    -c 512
+
+# Lower perplexity = better quality
+# Baseline (FP16): ~5.96
+# Q4_K_M: ~6.06 (+1.7%)
+# Q2_K: ~6.87 (+15.3% - too much degradation)
+```
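+
+The percentages quoted in the comments are relative increases over the FP16 baseline. A short sketch of that arithmetic, using perplexity values from the Format Comparison table in this guide (`perplexity_increase_pct` is an illustrative helper):
+
+```python
+def perplexity_increase_pct(baseline, quantized):
+    """Relative perplexity increase of a quantized model, in percent."""
+    return (quantized - baseline) / baseline * 100.0
+
+
+if __name__ == "__main__":
+    fp16 = 5.9565  # FP16 baseline from the Format Comparison table
+    for name, ppl in [("Q8_0", 5.9584), ("Q4_K_M", 6.0565), ("Q2_K", 6.8673)]:
+        print(f"{name}: {perplexity_increase_pct(fp16, ppl):+.2f}%")
+```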
+
+## Use Case Guide
+
+### General purpose (chatbots, assistants)
+```
+Q4_K_M - Best balance
+Q5_K_M - If you have extra RAM
+```
+
+### Code generation
+```
+Q5_K_M or Q6_K - Higher precision helps with code
+```
+
+### Creative writing
+```
+Q4_K_M - Sufficient quality
+Q3_K_M - Acceptable for draft generation
+```
+
+### Technical/medical
+```
+Q6_K or Q8_0 - Maximum accuracy
+```
+
+### Edge devices (Raspberry Pi)
+```
+Q2_K or Q3_K_S - Fit in limited RAM
+```
+
+## Model Size Scaling
+
+### 7B parameter models
+
+| Format | Size | RAM needed |
+|--------|------|------------|
+| Q2_K | 2.7 GB | 5 GB |
+| Q3_K_M | 3.3 GB | 6 GB |
+| Q4_K_M | 4.1 GB | 7 GB |
+| Q5_K_M | 4.8 GB | 8 GB |
+| Q6_K | 5.5 GB | 9 GB |
+| Q8_0 | 7.0 GB | 11 GB |
+
+### 13B parameter models
+
+| Format | Size | RAM needed |
+|--------|------|------------|
+| Q2_K | 5.1 GB | 8 GB |
+| Q3_K_M | 6.2 GB | 10 GB |
+| Q4_K_M | 7.9 GB | 12 GB |
+| Q5_K_M | 9.2 GB | 14 GB |
+| Q6_K | 10.7 GB | 16 GB |
+
+### 70B parameter models
+
+| Format | Size | RAM needed |
+|--------|------|------------|
+| Q2_K | 26 GB | 32 GB |
+| Q3_K_M | 32 GB | 40 GB |
+| Q4_K_M | 41 GB | 48 GB |
+| Q4_K_S | 39 GB | 46 GB |
+| Q5_K_M | 48 GB | 56 GB |
+
+**Recommendation for 70B**: Use Q3_K_M or Q4_K_S to fit in consumer hardware.
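+
+Picking a format from these tables amounts to choosing the highest-quality row whose RAM requirement fits the machine. A minimal sketch, assuming the 7B figures from the table above; `best_format_for_ram` is an illustrative helper, and the same function works for the 13B or 70B tables if their rows are passed in.
+
+```python
+# (format, RAM needed in GB) for 7B models, copied from the 7B table,
+# ordered from lowest to highest quality.
+RAM_NEEDED_7B = [
+    ("Q2_K", 5), ("Q3_K_M", 6), ("Q4_K_M", 7),
+    ("Q5_K_M", 8), ("Q6_K", 9), ("Q8_0", 11),
+]
+
+
+def best_format_for_ram(available_gb, table=RAM_NEEDED_7B):
+    """Return the highest-quality format whose RAM requirement fits, or None."""
+    best = None
+    for name, ram_gb in table:
+        if ram_gb <= available_gb:
+            best = name  # later rows are higher quality, so keep overwriting
+    return best
+
+
+if __name__ == "__main__":
+    print(best_format_for_ram(8))
+```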
+
+## Finding Pre-Quantized Models
+
+**TheBloke** on HuggingFace:
+- https://huggingface.co/TheBloke
+- Most models available in all GGUF formats
+- No conversion needed
+
+**Example**:
+```bash
+# Download pre-quantized Llama 2-7B
+huggingface-cli download \
+    TheBloke/Llama-2-7B-Chat-GGUF \
+    llama-2-7b-chat.Q4_K_M.gguf \
+    --local-dir models/
+```
+
+## Importance Matrices (imatrix)
+
+**What**: Calibration data to improve quantization quality.
+
+**Benefits**:
+- 10-20% perplexity improvement with Q4
+- Essential for Q3 and below
+
+**Usage**:
+```bash
+# 1. Generate importance matrix
+./llama-imatrix \
+    -m model-f16.gguf \
+    -f calibration-data.txt \
+    -o model.imatrix
+
+# 2. Quantize with imatrix
+./llama-quantize \
+    --imatrix model.imatrix \
+    model-f16.gguf \
+    model-Q4_K_M.gguf \
+    Q4_K_M
+```
+
+**Calibration data**:
+- Use domain-specific text (e.g., code for code models)
+- ~100MB of representative text
+- Higher quality data = better quantization
+
+## Troubleshooting
+
+**Model outputs gibberish**:
+- Quantization too aggressive (Q2_K)
+- Try Q4_K_M or Q5_K_M
+- Verify model converted correctly
+
+**Out of memory**:
+- Use lower quantization (Q4_K_S instead of Q5_K_M)
+- Offload fewer layers to GPU (`-ngl`)
+- Use smaller context (`-c 2048`)
+
+**Slow inference**:
+- Higher-bit formats move more data per token, so generation is slower
+- Q8_0 is noticeably slower than Q4_K_M (25 vs 40 tok/s in the Format Comparison table)
+- Consider the speed vs quality trade-off
--- a/skills/mlops/inference/llama-cpp/references/server.md
+++ b/skills/mlops/inference/llama-cpp/references/server.md
@@ -0,0 +1,125 @@
+# Server Deployment Guide
+
+Production deployment of llama.cpp server with OpenAI-compatible API.
+
+## Server Modes
+
+### llama-server
+
+```bash
+# Basic server
+./llama-server \
+    -m models/llama-2-7b-chat.Q4_K_M.gguf \
+    --host 0.0.0.0 \
+    --port 8080 \
+    -c 4096  # Context size
+
+# With GPU acceleration
+./llama-server \
+    -m models/llama-2-70b.Q4_K_M.gguf \
+    -ngl 40  # Offload 40 layers to GPU
+```
+
+## OpenAI-Compatible API
+
+### Chat completions
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama-2",
+    "messages": [
+      {"role": "system", "content": "You are helpful"},
+      {"role": "user", "content": "Hello"}
+    ],
+    "temperature": 0.7,
+    "max_tokens": 100
+  }'
+```
+
+### Streaming
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama-2",
+    "messages": [{"role": "user", "content": "Count to 10"}],
+    "stream": true
+  }'
+```
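+
+With `"stream": true` the response arrives as a sequence of server-sent event lines. The parser below is a sketch that assumes the OpenAI-style stream layout (`data: {...}` lines carrying `choices[0].delta.content`, terminated by `data: [DONE]`); the sample lines are hand-written for illustration, not captured from a real server, and `iter_stream_text` is an illustrative helper.
+
+```python
+import json
+
+
+def iter_stream_text(lines):
+    """Yield text fragments from an iterable of server-sent event lines."""
+    for raw in lines:
+        line = raw.strip()
+        if not line.startswith("data:"):
+            continue  # skip blank separator lines and other SSE fields
+        data = line[len("data:"):].strip()
+        if data == "[DONE]":
+            break  # end-of-stream marker
+        chunk = json.loads(data)
+        choices = chunk.get("choices") or []
+        if not choices:
+            continue  # e.g. a trailing usage-only chunk
+        text = choices[0].get("delta", {}).get("content")
+        if text:
+            yield text
+
+
+if __name__ == "__main__":
+    sample = [
+        'data: {"choices": [{"delta": {"role": "assistant"}}]}',
+        "",
+        'data: {"choices": [{"delta": {"content": "1, "}}]}',
+        'data: {"choices": [{"delta": {"content": "2"}}]}',
+        "data: [DONE]",
+    ]
+    print("".join(iter_stream_text(sample)))
+```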
+
+## Docker Deployment
+
+**Dockerfile**:
+```dockerfile
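+# Assumption: a CUDA devel base image, since the CUDA backend needs the toolkit at build time
+FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
+RUN apt-get update && apt-get install -y git build-essential cmake libcurl4-openssl-dev
+RUN git clone https://github.com/ggerganov/llama.cpp
+WORKDIR /llama.cpp
+RUN cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
+COPY models/ /models/
+EXPOSE 8080
+CMD ["./build/bin/llama-server", "-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"]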
+```
+
+**Run**:
+```bash
+docker build -t llama-cpp:latest .
+docker run --gpus all -p 8080:8080 llama-cpp:latest
+```
+
+## Monitoring
+
+```bash
+# Server metrics endpoint (start llama-server with --metrics to enable it)
+curl http://localhost:8080/metrics
+
+# Health check
+curl http://localhost:8080/health
+```
+
+**Metrics** (Prometheus text format; exported names carry a `llamacpp:` prefix and vary by release):
+- requests_total
+- tokens_generated
+- prompt_tokens
+- completion_tokens
+- kv_cache_tokens
+
+## Load Balancing
+
+**NGINX**:
+```nginx
+upstream llama_cpp {
+    server llama1:8080;
+    server llama2:8080;
+}
+
+server {
+    listen 80;
+    location / {
+        proxy_pass http://llama_cpp;
+        proxy_read_timeout 300s;
+        proxy_buffering off;  # pass streamed (SSE) tokens through immediately
+    }
+}
+```
+
+## Performance Tuning
+
+**Parallel requests**:
+```bash
+./llama-server \
+    -m model.gguf \
+    -np 4  # 4 parallel slots
+```
+
+**Continuous batching**:
+```bash
+./llama-server \
+    -m model.gguf \
+    --cont-batching  # Enable continuous batching (on by default in recent releases)
+```
+
+**Context caching**:
+```bash
+./llama-server \
+    -m model.gguf \
+    --cache-prompt  # Cache processed prompts
+```