# GGUF Troubleshooting Guide

## Installation Issues

### Build Fails

**Error**: `make: *** No targets specified and no makefile found`

**Fix**:
```bash
# Ensure you're in the llama.cpp directory
cd llama.cpp
make
```
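Recent llama.cpp checkouts have moved from the legacy Makefile to CMake, so `make` can also fail simply because no Makefile ships anymore. A rough CMake equivalent (flag names may vary across versions; binaries land in `build/bin`):

```bash
cmake -B build
cmake --build build --config Release -j
```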
**Error**: `fatal error: cuda_runtime.h: No such file or directory`

**Fix**:
```bash
# Install the CUDA toolkit
# Ubuntu
sudo apt install nvidia-cuda-toolkit

# Or point the build at an existing CUDA install
export CUDA_PATH=/usr/local/cuda
export PATH=$CUDA_PATH/bin:$PATH
make GGML_CUDA=1
```

### Python Bindings Issues

**Error**: `ERROR: Failed building wheel for llama-cpp-python`

**Fix**:
```bash
# Install build dependencies
pip install cmake scikit-build-core

# For CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# For Metal (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```

**Error**: `ImportError: libcudart.so.XX: cannot open shared object file`

**Fix**:
```bash
# Add the CUDA libraries to the library path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Or reinstall against the correct CUDA version
pip uninstall llama-cpp-python
CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```

## Conversion Issues

### Model Not Supported

**Error**: `KeyError: 'model.embed_tokens.weight'`

**Fix**:
```bash
# Check the model architecture
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./model').architectures)"

# Use the appropriate conversion script
# For most models:
python convert_hf_to_gguf.py ./model --outfile model.gguf

# For older models, a legacy conversion script may be needed
```

### Vocabulary Mismatch

**Error**: `RuntimeError: Vocabulary size mismatch`

**Fix**:
```python
# Ensure the tokenizer matches the model
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./model")
model = AutoModelForCausalLM.from_pretrained("./model")

print(f"Tokenizer vocab size: {len(tokenizer)}")
print(f"Model vocab size: {model.config.vocab_size}")

# If they differ, resize the embeddings before conversion
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained("./model-fixed")
tokenizer.save_pretrained("./model-fixed")
```

### Out of Memory During Conversion

**Error**: `torch.cuda.OutOfMemoryError` during conversion

**Fix**:
```bash
# Use the CPU for conversion
CUDA_VISIBLE_DEVICES="" python convert_hf_to_gguf.py ./model --outfile model.gguf

# Or convert straight to f16, which keeps peak memory lower
python convert_hf_to_gguf.py ./model --outfile model.gguf --outtype f16
```
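Conversion also needs enough system RAM to hold the checkpoint's tensors plus output buffers, so it is worth comparing the model's on-disk size against available memory before starting; a quick check with standard tools:

```bash
# Checkpoint size on disk vs. available RAM
du -sh ./model
free -h
```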
## Quantization Issues

### Wrong Output File Size

**Problem**: Quantized file is larger than expected

**Check**:
```bash
# Verify the quantization type
./llama-cli -m model.gguf --verbose

# Expected sizes for a 7B model:
# Q4_K_M: ~4.1 GB
# Q5_K_M: ~4.8 GB
# Q8_0:   ~7.2 GB
# F16:    ~13.5 GB
```
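As a sanity check, the expected file size is roughly parameter count × bits per weight / 8, plus a little metadata and some higher-precision output tensors. Q4_K_M averages about 4.7 bits per weight (approximate figure), which reproduces the ~4.1 GB above:

```bash
# 7e9 params * ~4.7 bits/weight / 8 bits per byte ≈ 4.1 GB
echo "7 * 10^9 * 4.7 / 8 / 10^9" | bc -l
```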
### Quantization Crashes

**Error**: `Segmentation fault` during quantization

**Fix**:
```bash
# Increase the stack size
ulimit -s unlimited

# Or use fewer threads (the thread count is the last positional argument)
./llama-quantize model-f16.gguf model-q4.gguf Q4_K_M 4
```

### Poor Quality After Quantization

**Problem**: Model outputs gibberish after quantization

**Solutions**:

1. **Use an importance matrix**:
```bash
# Generate an imatrix from good calibration data
./llama-imatrix -m model-f16.gguf \
    -f wiki_sample.txt \
    --chunks 512 \
    -o model.imatrix

# Quantize with the imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

2. **Try higher precision**:
```bash
# Use Q5_K_M or Q6_K instead of Q4
./llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M
```

3. **Check the original model**:
```bash
# Test the FP16 version first; if it is already broken, quantization is not the culprit
./llama-cli -m model-f16.gguf -p "Hello, how are you?" -n 50
```

## Inference Issues

### Slow Generation

**Problem**: Generation is slower than expected

**Solutions**:

1. **Enable GPU offload**:
```bash
./llama-cli -m model.gguf -ngl 35 -p "Hello"
```

2. **Optimize batch size**:
```python
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",
    n_batch=512,  # Increase for faster prompt processing
    n_gpu_layers=35
)
```

3. **Use an appropriate thread count** (see the sketch after this list):
```bash
# Match physical cores, not logical
./llama-cli -m model.gguf -t 8 -p "Hello"
```

4. **Enable Flash Attention** (if supported):
```bash
./llama-cli -m model.gguf -ngl 35 --flash-attn -p "Hello"
```
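To find the physical core count referenced in step 3 (hyper-threaded logical cores usually add no throughput for llama.cpp), standard system tools suffice:

```bash
# Linux: physical cores = cores per socket x sockets
lscpu | grep -E '^(Core\(s\) per socket|Socket\(s\))'

# macOS
sysctl -n hw.physicalcpu
```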
### Out of Memory

**Error**: `CUDA out of memory` or system freeze

**Solutions**:

1. **Reduce GPU layers**:
```python
# Start low and increase
llm = Llama(model_path="model.gguf", n_gpu_layers=10)
```

2. **Use a smaller quantization**:
```bash
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```

3. **Reduce context length**:
```python
llm = Llama(
    model_path="model.gguf",
    n_ctx=2048,  # Reduce from 4096
    n_gpu_layers=35
)
```

4. **Quantize the KV cache**:
```python
llm = Llama(
    model_path="model.gguf",
    type_k=2,  # GGML_TYPE_Q4_0 for the K cache
    type_v=2,  # GGML_TYPE_Q4_0 for the V cache
    flash_attn=True,  # quantizing the V cache generally requires flash attention
    n_gpu_layers=35
)
```
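For sizing intuition: an f16 KV cache costs roughly 2 (K and V) × n_layers × n_ctx × n_embd × 2 bytes for models without grouped-query attention. A back-of-the-envelope check, assuming a typical 7B shape (32 layers, 4096-wide embeddings; figures for illustration):

```bash
# f16 KV cache at n_ctx=4096: 2 * 32 * 4096 * 4096 * 2 bytes ≈ 2.1 GB
# Q4_0 K/V quantization cuts this by roughly 4x
echo "2 * 32 * 4096 * 4096 * 2 / 10^9" | bc -l
```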
### Garbage Output

**Problem**: Model outputs random characters or nonsense

**Diagnose**:
```python
# Check model loading
llm = Llama(model_path="model.gguf", verbose=True)

# Test with a simple prompt
output = llm("1+1=", max_tokens=5, temperature=0)
print(output)
```

**Solutions**:

1. **Check model integrity**:
```bash
# Verify the GGUF file loads cleanly
./llama-cli -m model.gguf --verbose 2>&1 | head -50
```

2. **Use the correct chat format** (the "Check Model Metadata" section below shows how to find the template a model ships with):
```python
llm = Llama(
    model_path="model.gguf",
    chat_format="llama-3"  # Match your model: chatml, mistral, etc.
)
```

3. **Check temperature**:
```python
# Use a lower temperature for more deterministic output
output = llm("Hello", max_tokens=50, temperature=0.1)
```

### Token Issues

**Error**: `RuntimeError: unknown token` or encoding errors

**Fix**:
```python
# Ensure prompts reach the model as valid UTF-8, e.g. when reading
# them from a file (filename is a placeholder)
with open("prompt.txt", encoding="utf-8", errors="replace") as f:
    prompt = f.read()
output = llm(prompt, max_tokens=50)
```
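Encoding errors can also come from the terminal rather than the model, so it is worth confirming the shell locale is UTF-8 before digging deeper:

```bash
# Should print UTF-8 entries, e.g. LANG=en_US.UTF-8; empty output means a non-UTF-8 locale
locale | grep -i utf
```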
## Server Issues

### Connection Refused

**Error**: `Connection refused` when accessing server

**Fix**:
```bash
# Bind to all interfaces
./llama-server -m model.gguf --host 0.0.0.0 --port 8080

# Check if the port is already in use
lsof -i :8080
```
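If the port is reachable but clients still fail, the server's `/health` endpoint is a quick way to confirm it is up and the model has finished loading:

```bash
curl http://localhost:8080/health
```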
### Server Crashes Under Load

**Problem**: Server crashes with multiple concurrent requests

**Solutions**:

1. **Limit parallelism**:
```bash
./llama-server -m model.gguf \
    --parallel 2 \
    -c 4096 \
    --cont-batching
```

2. **Add a request timeout**:
```bash
./llama-server -m model.gguf --timeout 300
```

3. **Monitor memory**:
```bash
watch -n 1 nvidia-smi  # For GPU
watch -n 1 free -h     # For RAM
```

### API Compatibility Issues

**Problem**: OpenAI client not working with server

**Fix**:
```python
from openai import OpenAI

# Use the correct base URL format
client = OpenAI(
    base_url="http://localhost:8080/v1",  # Include /v1
    api_key="not-needed"
)

# Use the correct model name
response = client.chat.completions.create(
    model="local",  # Or the actual model name
    messages=[{"role": "user", "content": "Hello"}]
)
```

## Apple Silicon Issues

### Metal Not Working

**Problem**: Metal acceleration not enabled

**Check**:
```bash
# Verify Metal support
./llama-cli -m model.gguf --verbose 2>&1 | grep -i metal
```

**Fix**:
```bash
# Rebuild with Metal
make clean
make GGML_METAL=1

# Python bindings
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall
```

### Incorrect Memory Usage on M1/M2

**Problem**: Model uses too much unified memory

**Fix**:
```python
# Offload all layers for Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload everything
    n_threads=1  # Metal handles parallelism
)
```

## Debugging

### Enable Verbose Output

```bash
# CLI verbose mode
./llama-cli -m model.gguf --verbose -p "Hello" -n 50
```

```python
# Python verbose mode
llm = Llama(model_path="model.gguf", verbose=True)
```

### Check Model Metadata

```bash
# View GGUF metadata from the load log
./llama-cli -m model.gguf --verbose 2>&1 | head -100
```
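For a structured view instead of scraping logs, the `gguf` Python package (assumed installed from PyPI) ships a `gguf-dump` command that prints every metadata key, including the embedded chat template:

```bash
pip install gguf
gguf-dump model.gguf                       # all metadata and tensor info
gguf-dump model.gguf | grep chat_template  # which template the model ships with
```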
### Validate GGUF File

```python
import struct

def validate_gguf(filepath):
    # GGUF header: 4-byte magic, uint32 version, uint64 tensor count,
    # uint64 metadata KV count, all little-endian
    with open(filepath, 'rb') as f:
        magic = f.read(4)
        if magic != b'GGUF':
            print(f"Invalid magic: {magic}")
            return False

        version = struct.unpack('<I', f.read(4))[0]
        print(f"GGUF version: {version}")

        tensor_count = struct.unpack('<Q', f.read(8))[0]
        metadata_count = struct.unpack('<Q', f.read(8))[0]
        print(f"Tensors: {tensor_count}, Metadata: {metadata_count}")

        return True

validate_gguf("model.gguf")
```

## Getting Help

1. **GitHub Issues**: https://github.com/ggml-org/llama.cpp/issues
2. **Discussions**: https://github.com/ggml-org/llama.cpp/discussions
3. **Reddit**: r/LocalLLaMA

### Reporting Issues

Include:
- llama.cpp version/commit hash
- Build command used
- Model name and quantization
- Full error message/stack trace
- Hardware: CPU/GPU model, RAM, VRAM
- OS version
- Minimal reproduction steps
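Most of this can be gathered in one pass; a sketch using standard tools (the `nvidia-smi` line applies only to NVIDIA GPUs):

```bash
./llama-cli --version   # llama.cpp build/commit
uname -a                # OS and kernel
free -h                 # RAM
nvidia-smi --query-gpu=name,memory.total --format=csv  # GPU model and VRAM
```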