
Importance Matrix (imatrix) Files: Complete Guide

What is an IMatrix File?

An importance matrix (imatrix) file is a data structure that contains information about which weights in a neural network are most important during inference. It's generated by running the model on a calibration dataset and measuring how much each weight contributes to the output.

Key Concepts

  • Purpose: Improve quantization quality by preserving precision for important weights
  • How it works: Tracks squared activations (importance scores) for each weight during inference
  • Format: Stored as GGUF files (or legacy .dat format)
  • Usage: Passed to the quantization tool to guide which weights should be quantized more carefully

Why Use an IMatrix?

When quantizing a model, you're reducing precision from 16-bit or 32-bit floats to 3-bit, 4-bit, or other low-precision formats. This compression can cause quality loss. An imatrix helps by:

  1. Identifying Critical Weights: Shows which weights are most active/important during inference
  2. Guiding Quantization: Allows the quantizer to:
    • Preserve precision for important weights
    • Use more aggressive quantization for less important weights
    • Make smarter decisions about outlier selection (especially for Q3_K_HIFI)
  3. Improving Quality: Can significantly reduce the increase in perplexity compared to quantizing without an imatrix

Role in Q3_K_HIFI

For Q3_K_HIFI specifically, the imatrix is used to:

  • Weight the magnitude calculation when selecting outliers: mag[i] = fabsf(xb[i]) * quant_weights[i]
  • Prioritize important weights as outliers (stored in FP16)
  • Improve overall quantization quality

How to Generate an IMatrix File

Step 1: Prepare a Calibration Dataset

You need a text file with representative data that the model will process. This should be similar to the data your model will see in production.

Good sources for calibration data:

  • Wikipedia articles (e.g., wiki.train.raw)
  • Books or text corpora
  • Domain-specific text relevant to your use case
  • The model's training data (if available)

File format: Plain text, one example per line (or use --parse-special for special token parsing)
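A calibration file can be assembled from several sources with a short script. The Python sketch below (the `build_calibration_file` helper and its parameters are hypothetical, not part of llama.cpp) concatenates non-empty lines from a list of text files into one file in the plain-text, one-example-per-line format described above:

```python
# Sketch: assemble a calibration file from several domain sources.
# The helper name and max_lines_per_source cap are illustrative choices.
from pathlib import Path

def build_calibration_file(sources, out_path, max_lines_per_source=5000):
    """Concatenate non-empty lines from several text files into one
    plain-text calibration file, capping each source's contribution."""
    lines = []
    for src in sources:
        with open(src, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i >= max_lines_per_source:
                    break
                line = line.strip()
                if line:  # skip blank lines
                    lines.append(line)
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

Capping each source keeps one large corpus from dominating the mix, in line with the "include diverse examples" advice later in this guide.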

Step 2: Build the IMatrix Tool

First, make sure you've built llama-imatrix:

# On Linux/Mac
make llama-imatrix

# On Windows (MSVC)
cmake --build build --config Release --target llama-imatrix

Step 3: Generate the IMatrix

Basic usage:

./llama-imatrix \
    -m model-f16.gguf \
    -f calibration-data.txt \
    -o imatrix.gguf \
    -ngl 99

Parameters explained:

  • -m, --model: Your F16 or F32 model file (input)
  • -f, --file: Your calibration text file
  • -o, --output-file: Output imatrix filename (default: imatrix.gguf)
  • -ngl, --n-gpu-layers: Number of layers to offload to GPU (speeds up generation)

Advanced Options

./llama-imatrix \
    -m model-f16.gguf \
    -f calibration-data.txt \
    -o imatrix.gguf \
    -ngl 99 \
    --output-frequency 10 \
    --save-frequency 50 \
    --chunk 0 \
    --chunks 100 \
    --parse-special \
    --process-output

Important Options:

  • --output-frequency N: How often to save progress (default: 10 chunks)
  • --save-frequency N: Create backup snapshots (default: 0 = never)
  • --chunk N: Skip first N chunks (useful for resuming)
  • --chunks N: Maximum chunks to process (default: -1 = all)
  • --parse-special: Enable special token parsing (e.g., <|im_start|>)
  • --process-output: Include output.weight tensor (usually not recommended)
  • --no-ppl: Disable perplexity calculation (faster, less info)
  • -lv, --verbosity: Verbosity level (0=silent, 1=default, 2+=verbose)

Example: Full Workflow

# 1. Generate imatrix with GPU acceleration
./llama-imatrix \
    -m ./models/llama-3-8b-f16.gguf \
    -f ./data/wiki.train.raw \
    -o ./imatrix.gguf \
    -ngl 99 \
    --output-frequency 20 \
    --save-frequency 100

# This will:
# - Process the calibration data
# - Track activations for each tensor
# - Save progress every 20 chunks
# - Create snapshots every 100 chunks
# - Output: imatrix.gguf

How to Use an IMatrix During Quantization

Basic Usage

Once you have an imatrix file, use it during quantization:

./llama-quantize \
    --imatrix imatrix.gguf \
    input-model-f16.gguf \
    output-model-q3_k_hifi.gguf \
    Q3_K_HIFI

With Specific Tensor Types

You can target specific tensors:

# Use imatrix only for attention and feed-forward layers
./llama-quantize \
    --imatrix imatrix.gguf \
    --include-weights attn_v \
    --include-weights ffn_down \
    input-model-f16.gguf \
    output-model-q3_k_hifi.gguf \
    Q3_K_HIFI

Advanced Usage

# Quantize with imatrix, custom tensor types, and output settings
./llama-quantize \
    --imatrix imatrix.gguf \
    --output-tensor-type q5_k \
    --token-embedding-type q3_k_hifi \
    input-model-f16.gguf \
    output-model-q3_k_hifi.gguf \
    Q3_K_HIFI

IMatrix File Formats

GGUF Format (Recommended)

Modern format, stored as .gguf files:

  • More efficient
  • Better metadata support
  • Can store multiple datasets
  • Default format in recent versions

Legacy Format

Older binary format, stored as .dat files:

  • Still supported for compatibility
  • Use --output-format dat to generate

Converting Between Formats

# Convert legacy to GGUF
./llama-imatrix --in-file imatrix.dat -o imatrix.gguf

# Convert GGUF to legacy
./llama-imatrix --in-file imatrix.gguf --output-format dat -o imatrix.dat

Combining Multiple IMatrix Files

You can merge imatrix files from multiple runs or datasets:

./llama-imatrix \
    --in-file imatrix-dataset1.gguf \
    --in-file imatrix-dataset2.gguf \
    --in-file imatrix-dataset3.gguf \
    -o imatrix-combined.gguf

This is useful for:

  • Combining data from different domains
  • Merging results from multiple calibration runs
  • Creating a more comprehensive importance matrix
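Conceptually, merging behaves like a count-weighted average: since stored values are per-position averages of squared activations (Σ(act²) / n_calls, see Technical Details), each file's contribution should be weighted by how many calls it represents. A minimal Python sketch of that idea (`merge_imatrix` is a hypothetical helper illustrating the math, not the tool's implementation):

```python
# Sketch: count-weighted merge of per-tensor importance averages.
# Each entry is (averaged_scores, n_calls) for the same tensor.

def merge_imatrix(entries):
    """Merge per-position averages of squared activations, weighting
    each run by its call count. Returns (merged_avg, total_calls)."""
    total_calls = sum(n for _, n in entries)
    dim = len(entries[0][0])
    merged = [0.0] * dim
    for avg, n in entries:
        for i in range(dim):
            merged[i] += avg[i] * n  # recover each run's sum(act^2)
    merged = [s / total_calls for s in merged]
    return merged, total_calls
```

A run with 30 chunks therefore contributes three times the weight of a 10-chunk run, which is why merging short runs from several domains can approximate one long mixed-domain run.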

Analyzing IMatrix Files

View Statistics

./llama-imatrix --in-file imatrix.gguf --show-statistics

This displays:

  • Per Tensor:

    • Σ(Act²): Sum of squared activations (importance scores)
    • Min & Max: Range of importance values
    • μ & σ: Mean and standard deviation
    • % Active: Proportion of active elements
    • Entropy: Information content
    • ZD Score: Layer importance metric
    • CosSim: Cosine similarity with previous layer
  • Per Layer:

    • Weighted averages of importance metrics

Understanding the Statistics

  • High Σ(Act²): Tensor is very active during inference
  • High % Active: Many weights contribute significantly
  • High Entropy: Weights have diverse importance (good for quantization)
  • High ZD Score: Layer is important to preserve
  • High CosSim: Layer is similar to previous (may indicate redundancy)
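The exact definitions used by `--show-statistics` live in the tool itself, but two of the simpler metrics can be sketched from first principles. Assuming % Active counts positions with a nonzero importance score and Entropy is the Shannon entropy of the normalized score distribution (both are assumptions for illustration, not confirmed definitions), a Python sketch:

```python
import math

def tensor_stats(scores, active_eps=1e-12):
    """Illustrative summary stats over one tensor's importance scores
    (per-position sums of squared activations). Definitions assumed."""
    total = sum(scores)
    pct_active = 100.0 * sum(1 for s in scores if s > active_eps) / len(scores)
    # Shannon entropy (bits) of the normalized score distribution:
    # uniform scores -> maximal entropy, one dominant score -> near zero.
    entropy = 0.0
    if total > 0:
        for s in scores:
            if s > 0:
                p = s / total
                entropy -= p * math.log2(p)
    return {"sum": total, "pct_active": pct_active, "entropy": entropy}
```

Under these definitions, a tensor where every position matters equally has maximal entropy, while a tensor dominated by a few positions scores near zero, which matches the interpretation above.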

Best Practices

1. Calibration Dataset Selection

Do:

  • Use representative data similar to your use case
  • Include diverse examples
  • Use at least 1000-10000 chunks for good coverage
  • Match the domain (e.g., code for code models, text for language models)

Don't:

  • Use too small a dataset (< 100 chunks)
  • Use completely unrelated data
  • Use only one type of example

2. Processing Settings

Do:

  • Use GPU offloading (-ngl 99) for speed
  • Save frequently (--output-frequency 10)
  • Create snapshots (--save-frequency 50) for long runs
  • Process enough chunks (1000+ recommended)

Don't:

  • Process output.weight unless necessary (--process-output is usually not needed)
  • Skip validation of your calibration data

3. Quantization Usage

Do:

  • Always use imatrix for Q3_K_HIFI (it significantly improves outlier selection)
  • Use imatrix for aggressive quantizations (Q2_K, Q3_K_S)
  • Include attention and feed-forward weights
  • Test quality after quantization

Don't:

  • Use imatrix for output.weight (usually excluded by default)
  • Assume imatrix will always improve quality (test it)
  • Use an imatrix from a different model architecture

Complete Workflow Example

Here's a complete example for quantizing a model with Q3_K_HIFI using an imatrix:

# Step 1: Generate importance matrix
./llama-imatrix \
    -m ./models/llama-3-8b-f16.gguf \
    -f ./data/calibration-text.txt \
    -o ./imatrix.gguf \
    -ngl 99 \
    --output-frequency 20 \
    --chunks 1000

# Step 2: (Optional) View statistics
./llama-imatrix --in-file ./imatrix.gguf --show-statistics

# Step 3: Quantize using the imatrix
./llama-quantize \
    --imatrix ./imatrix.gguf \
    ./models/llama-3-8b-f16.gguf \
    ./models/llama-3-8b-q3_k_hifi.gguf \
    Q3_K_HIFI

# Step 4: Test the quantized model
./llama-cli \
    -m ./models/llama-3-8b-q3_k_hifi.gguf \
    -p "Hello, how are you?"

How IMatrix Works with Q3_K_HIFI

For Q3_K_HIFI specifically, the imatrix is particularly valuable:

  1. Outlier Selection: The imatrix weights the magnitude calculation:

    mag[i] = fabsf(xb[i]) * quant_weights[i]

    This means important weights (high imatrix values) are more likely to be selected as outliers.

  2. Better Quality: By preserving important weights as FP16 outliers, the model maintains better accuracy.

  3. Smart Compression: Less important weights can be more aggressively quantized to 3-bit, while critical ones stay in FP16.
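The outlier-selection step above can be sketched in a few lines of Python. `select_outliers` is a hypothetical stand-in for the C code; it applies the same `mag[i] = fabsf(xb[i]) * quant_weights[i]` weighting and returns the indices of the top-scoring positions:

```python
def select_outliers(xb, quant_weights, n_outliers):
    """Pick the n_outliers positions with the largest importance-weighted
    magnitude, mirroring mag[i] = fabsf(xb[i]) * quant_weights[i]."""
    mag = [abs(x) * w for x, w in zip(xb, quant_weights)]
    return sorted(range(len(xb)), key=lambda i: mag[i], reverse=True)[:n_outliers]

# With importance weighting, index 1 wins despite its small magnitude:
# select_outliers([0.9, -0.2, 0.5], [1.0, 10.0, 1.0], 1) -> [1]
```

This shows the key effect: a weight with modest magnitude but a high importance score can outrank a larger weight, so it is the one preserved in FP16.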

Example Impact

Without imatrix:

  • Outliers selected purely by magnitude
  • May miss important but smaller-magnitude weights
  • Quality: Baseline

With imatrix:

  • Outliers selected by importance-weighted magnitude
  • Preserves critical weights even if not the largest
  • Quality: Typically 5-15% lower perplexity than quantizing without an imatrix

Troubleshooting

Problem: IMatrix generation is slow

Solutions:

  • Use GPU offloading: -ngl 99
  • Reduce chunks: --chunks 500
  • Disable perplexity: --no-ppl

Problem: IMatrix file is very large

Solutions:

  • This is normal (can be 100MB-1GB+)
  • Use GGUF format (more efficient than legacy)
  • The file is only needed during quantization, not inference

Problem: Quantization quality didn't improve

Solutions:

  • Check that imatrix was generated on similar data
  • Verify imatrix file loaded correctly (check logs)
  • Try including/excluding specific tensors
  • Ensure calibration dataset is representative

Problem: "imatrix mapping error"

Solutions:

  • IMatrix was generated for a different model architecture
  • Tensor names don't match
  • Regenerate imatrix for your specific model

Technical Details

What Gets Stored

For each tensor, the imatrix stores:

  • Squared activations: act² for each weight position
  • Call count: How many times the tensor was accessed
  • Averaged values: Σ(act²) / n_calls for normalization
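A minimal sketch of this accumulation (a hypothetical in-memory collector, not the tool's actual data structure):

```python
class ImatrixAccumulator:
    """Running collector of squared activations for one tensor,
    matching the description above: sum(act^2) plus a call count."""
    def __init__(self, dim):
        self.sum_sq = [0.0] * dim   # per-position sum of act^2
        self.n_calls = 0            # how many times the tensor was hit

    def observe(self, activations):
        for i, a in enumerate(activations):
            self.sum_sq[i] += a * a
        self.n_calls += 1

    def averaged(self):
        # The normalized values: sum(act^2) / n_calls
        return [s / self.n_calls for s in self.sum_sq]
```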

How It's Used

During quantization:

  1. IMatrix data is loaded and mapped to tensor names
  2. For each weight block, importance scores are retrieved
  3. Quantization algorithms use these scores to:
    • Weight magnitude calculations
    • Select outliers (Q3_K_HIFI)
    • Choose quantization scales
    • Determine precision levels
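As one illustration of the "weight magnitude calculations" and "choose quantization scales" steps: importance scores can enter scale selection as per-position weights in the quantization error, choosing the scale that minimizes Σ wᵢ·(xᵢ − scale·qᵢ)². The grid-search sketch below is illustrative only (the `best_weighted_scale` helper and the search strategy are assumptions, not llama.cpp's actual algorithm), using a signed 3-bit grid:

```python
def best_weighted_scale(x, w, qmin=-4, qmax=3, n_try=64):
    """Grid-search the scale that minimizes the importance-weighted
    squared quantization error on a signed 3-bit grid [qmin, qmax]."""
    amax = max(abs(v) for v in x)
    if amax == 0.0:
        return 0.0
    best_s, best_err = amax / qmax, float("inf")
    for k in range(1, n_try + 1):
        s = amax * k / (n_try * qmax)                # candidate scale
        err = 0.0
        for xi, wi in zip(x, w):
            q = min(qmax, max(qmin, round(xi / s)))  # quantize to the grid
            err += wi * (xi - s * q) ** 2            # weighted error
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```

With equal weights this reduces to ordinary least-squares scale fitting; with imatrix weights, the scale bends toward reproducing the high-importance positions accurately at the expense of the rest.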

File Structure

GGUF format imatrix contains:

  • Metadata: chunk count, chunk size, dataset names
  • Tensor data: For each tensor, arrays of importance scores
  • Statistics: Optional computed statistics

Summary

IMatrix files are essential for high-quality quantization, especially for formats like Q3_K_HIFI that benefit from intelligent outlier selection.

Key Takeaways:

  1. Generate imatrix using representative calibration data
  2. Use GPU acceleration for faster generation
  3. Always use imatrix when quantizing to Q3_K_HIFI
  4. Combine multiple imatrix files for better coverage
  5. Analyze statistics to understand your model's weight importance

For Q3_K_HIFI specifically: The imatrix directly improves outlier selection, making it one of the most impactful uses of importance matrices in quantization.