Sync all skills and memories 2026-04-14 07:27

2026-04-14 07:27:20 +09:00
parent 516bb44fe6
commit 1eba2bca95
386 changed files with 167655 additions and 0 deletions
--- a/skills/mlops/inference/llama-cpp/references/quantization.md
+++ b/skills/mlops/inference/llama-cpp/references/quantization.md
@@ -0,0 +1,213 @@
+# GGUF Quantization Guide
+
+Complete guide to GGUF quantization formats and model conversion.
+
+## Quantization Overview
+
+**GGUF** (GPT-Generated Unified Format) - Standard format for llama.cpp models.
+
+### Format Comparison
+
+| Format | Perplexity | Size (7B) | Tokens/sec | Notes |
+|--------|------------|-----------|------------|-------|
+| FP16 | 5.9565 (baseline) | 13.0 GB | 15 tok/s | Original quality |
+| Q8_0 | 5.9584 (+0.03%) | 7.0 GB | 25 tok/s | Nearly lossless |
+| **Q6_K** | 5.9642 (+0.13%) | 5.5 GB | 30 tok/s | Best quality/size |
+| **Q5_K_M** | 5.9796 (+0.39%) | 4.8 GB | 35 tok/s | Balanced |
+| **Q4_K_M** | 6.0565 (+1.68%) | 4.1 GB | 40 tok/s | **Recommended** |
+| Q4_K_S | 6.1125 (+2.62%) | 3.9 GB | 42 tok/s | Faster, lower quality |
+| Q3_K_M | 6.3184 (+6.07%) | 3.3 GB | 45 tok/s | Small models only |
+| Q2_K | 6.8673 (+15.3%) | 2.7 GB | 50 tok/s | Not recommended |
+
+**Recommendation**: Use **Q4_K_M** for best balance of quality and speed.
+
+## Converting Models
+
+### HuggingFace to GGUF
+
+```bash
+# 1. Download HuggingFace model
+huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
+    --local-dir models/llama-2-7b-chat/
+
+# 2. Convert to FP16 GGUF
+python convert_hf_to_gguf.py \
+    models/llama-2-7b-chat/ \
+    --outtype f16 \
+    --outfile models/llama-2-7b-chat-f16.gguf
+
+# 3. Quantize to Q4_K_M
+./llama-quantize \
+    models/llama-2-7b-chat-f16.gguf \
+    models/llama-2-7b-chat-Q4_K_M.gguf \
+    Q4_K_M
+```
+
+### Batch quantization
+
+```bash
+# Quantize to multiple formats
+for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
+    ./llama-quantize \
+        model-f16.gguf \
+        model-${quant}.gguf \
+        $quant
+done
+```
+
+## K-Quantization Methods
+
+**K-quants** use mixed precision for better quality:
+- Attention weights: Higher precision
+- Feed-forward weights: Lower precision
+
+**Variants**:
+- `_S` (Small): Faster, lower quality
+- `_M` (Medium): Balanced (recommended)
+- `_L` (Large): Better quality, larger size
+
+**Example**: `Q4_K_M`
+- `Q4`: 4-bit quantization
+- `K`: Mixed precision method
+- `M`: Medium quality
+
+## Quality Testing
+
+```bash
+# Calculate perplexity (quality metric)
+./llama-perplexity \
+    -m model.gguf \
+    -f wikitext-2-raw/wiki.test.raw \
+    -c 512
+
+# Lower perplexity = better quality
+# Baseline (FP16): ~5.96
+# Q4_K_M: ~6.06 (+1.7%)
+# Q2_K: ~6.87 (+15.3% - too much degradation)
+```
+
+## Use Case Guide
+
+### General purpose (chatbots, assistants)
+```
+Q4_K_M - Best balance
+Q5_K_M - If you have extra RAM
+```
+
+### Code generation
+```
+Q5_K_M or Q6_K - Higher precision helps with code
+```
+
+### Creative writing
+```
+Q4_K_M - Sufficient quality
+Q3_K_M - Acceptable for draft generation
+```
+
+### Technical/medical
+```
+Q6_K or Q8_0 - Maximum accuracy
+```
+
+### Edge devices (Raspberry Pi)
+```
+Q2_K or Q3_K_S - Fit in limited RAM
+```
+
+## Model Size Scaling
+
+### 7B parameter models
+
+| Format | Size | RAM needed |
+|--------|------|------------|
+| Q2_K | 2.7 GB | 5 GB |
+| Q3_K_M | 3.3 GB | 6 GB |
+| Q4_K_M | 4.1 GB | 7 GB |
+| Q5_K_M | 4.8 GB | 8 GB |
+| Q6_K | 5.5 GB | 9 GB |
+| Q8_0 | 7.0 GB | 11 GB |
+
+### 13B parameter models
+
+| Format | Size | RAM needed |
+|--------|------|------------|
+| Q2_K | 5.1 GB | 8 GB |
+| Q3_K_M | 6.2 GB | 10 GB |
+| Q4_K_M | 7.9 GB | 12 GB |
+| Q5_K_M | 9.2 GB | 14 GB |
+| Q6_K | 10.7 GB | 16 GB |
+
+### 70B parameter models
+
+| Format | Size | RAM needed |
+|--------|------|------------|
+| Q2_K | 26 GB | 32 GB |
+| Q3_K_M | 32 GB | 40 GB |
+| Q4_K_M | 41 GB | 48 GB |
+| Q4_K_S | 39 GB | 46 GB |
+| Q5_K_M | 48 GB | 56 GB |
+
+**Recommendation for 70B**: Use Q3_K_M or Q4_K_S to fit in consumer hardware.
+
+## Finding Pre-Quantized Models
+
+**TheBloke** on HuggingFace:
+- https://huggingface.co/TheBloke
+- Most models available in all GGUF formats
+- No conversion needed
+
+**Example**:
+```bash
+# Download pre-quantized Llama 2-7B
+huggingface-cli download \
+    TheBloke/Llama-2-7B-Chat-GGUF \
+    llama-2-7b-chat.Q4_K_M.gguf \
+    --local-dir models/
+```
+
+## Importance Matrices (imatrix)
+
+**What**: Calibration data to improve quantization quality.
+
+**Benefits**:
+- 10-20% perplexity improvement with Q4
+- Essential for Q3 and below
+
+**Usage**:
+```bash
+# 1. Generate importance matrix
+./llama-imatrix \
+    -m model-f16.gguf \
+    -f calibration-data.txt \
+    -o model.imatrix
+
+# 2. Quantize with imatrix
+./llama-quantize \
+    --imatrix model.imatrix \
+    model-f16.gguf \
+    model-Q4_K_M.gguf \
+    Q4_K_M
+```
+
+**Calibration data**:
+- Use domain-specific text (e.g., code for code models)
+- ~100MB of representative text
+- Higher quality data = better quantization
+
+## Troubleshooting
+
+**Model outputs gibberish**:
+- Quantization too aggressive (Q2_K)
+- Try Q4_K_M or Q5_K_M
+- Verify model converted correctly
+
+**Out of memory**:
+- Use lower quantization (Q4_K_S instead of Q5_K_M)
+- Offload fewer layers to GPU (`-ngl`)
+- Use smaller context (`-c 2048`)
+
+**Slow inference**:
+- Higher quantization uses more compute
+- Q8_0 much slower than Q4_K_M
+- Consider speed vs quality trade-off