
Importance Matrix (imatrix) Files: Complete Guide

What is an IMatrix File?

An importance matrix (imatrix) file is a data structure that contains information about which weights in a neural network are most important during inference. It's generated by running the model on a calibration dataset and measuring how much each weight contributes to the output.

Key Concepts

  • Purpose: Improve quantization quality by preserving precision for important weights
  • How it works: Tracks squared activations (importance scores) for each weight during inference
  • Format: Stored as GGUF files (or legacy .dat format)
  • Usage: Passed to the quantization tool to guide which weights should be quantized more carefully

Why Use an IMatrix?

When quantizing a model, you're reducing precision from 16-bit or 32-bit floats to 3-bit, 4-bit, or other low-precision formats. This compression can cause quality loss. An imatrix helps by:

  1. Identifying Critical Weights: Shows which weights are most active/important during inference
  2. Guiding Quantization: Allows the quantizer to:
    • Preserve precision for important weights
    • Use more aggressive quantization for less important weights
    • Make smarter decisions about outlier selection (especially for Q3_K_HIFI)
  3. Improving Quality: Can significantly reduce the increase in perplexity compared to quantizing without an imatrix

Role in Q3_K_HIFI

For Q3_K_HIFI specifically, the imatrix is used to:

  • Weight the magnitude calculation when selecting outliers: mag[i] = fabsf(xb[i]) * quant_weights[i]
  • Prioritize important weights as outliers (stored in FP16)
  • Improve overall quantization quality

How to Generate an IMatrix File

Step 1: Prepare a Calibration Dataset

You need a text file with representative data that the model will process. This should be similar to the data your model will see in production.

Good sources for calibration data:

  • Wikipedia articles (e.g., wiki.train.raw)
  • Books or text corpora
  • Domain-specific text relevant to your use case
  • The model's training data (if available)

File format: Plain text, one example per line (or use --parse-special for special token parsing)
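A calibration file can be assembled from several sources with a short script. The Python sketch below (the `build_calibration_file` helper and its parameters are hypothetical, not part of llama.cpp) concatenates non-empty lines from a list of text files into one file in the plain-text, one-example-per-line format described above:

```python
# Sketch: assemble a calibration file from several domain sources.
# The helper name and max_lines_per_source cap are illustrative choices.
from pathlib import Path

def build_calibration_file(sources, out_path, max_lines_per_source=5000):
    """Concatenate non-empty lines from several text files into one
    plain-text calibration file, capping each source's contribution."""
    lines = []
    for src in sources:
        with open(src, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i >= max_lines_per_source:
                    break
                line = line.strip()
                if line:  # skip blank lines
                    lines.append(line)
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

Capping each source keeps one large corpus from dominating the mix, in line with the "include diverse examples" advice later in this guide.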

Step 2: Build the IMatrix Tool

First, make sure you've built llama-imatrix:

# On Linux/Mac
make llama-imatrix

# On Windows (MSVC)
cmake --build build --config Release --target llama-imatrix

Step 3: Generate the IMatrix

Basic usage:

./llama-imatrix \
    -m model-f16.gguf \
    -f calibration-data.txt \
    -o imatrix.gguf \
    -ngl 99

Parameters explained:

  • -m, --model: Your F16 or F32 model file (input)
  • -f, --file: Your calibration text file
  • -o, --output-file: Output imatrix filename (default: imatrix.gguf)
  • -ngl, --n-gpu-layers: Number of layers to offload to GPU (speeds up generation)

Advanced Options

./llama-imatrix \
    -m model-f16.gguf \
    -f calibration-data.txt \
    -o imatrix.gguf \
    -ngl 99 \
    --output-frequency 10 \
    --save-frequency 50 \
    --chunk 0 \
    --chunks 100 \
    --parse-special \
    --process-output

Important Options:

  • --output-frequency N: How often to save progress (default: 10 chunks)
  • --save-frequency N: Create backup snapshots (default: 0 = never)
  • --chunk N: Skip first N chunks (useful for resuming)
  • --chunks N: Maximum chunks to process (default: -1 = all)
  • --parse-special: Enable special token parsing (e.g., <|im_start|>)
  • --process-output: Include output.weight tensor (usually not recommended)
  • --no-ppl: Disable perplexity calculation (faster, less info)
  • -lv, --verbosity: Verbosity level (0=silent, 1=default, 2+=verbose)

Example: Full Workflow

# 1. Generate imatrix with GPU acceleration
./llama-imatrix \
    -m ./models/llama-3-8b-f16.gguf \
    -f ./data/wiki.train.raw \
    -o ./imatrix.gguf \
    -ngl 99 \
    --output-frequency 20 \
    --save-frequency 100

# This will:
# - Process the calibration data
# - Track activations for each tensor
# - Save progress every 20 chunks
# - Create snapshots every 100 chunks
# - Output: imatrix.gguf

How to Use an IMatrix During Quantization

Basic Usage

Once you have an imatrix file, use it during quantization:

./llama-quantize \
    --imatrix imatrix.gguf \
    input-model-f16.gguf \
    output-model-q3_k_hifi.gguf \
    Q3_K_HIFI

With Specific Tensor Types

You can target specific tensors:

# Use imatrix only for attention and feed-forward layers
./llama-quantize \
    --imatrix imatrix.gguf \
    --include-weights attn_v \
    --include-weights ffn_down \
    input-model-f16.gguf \
    output-model-q3_k_hifi.gguf \
    Q3_K_HIFI

Advanced Usage

# Quantize with imatrix, custom tensor types, and output settings
./llama-quantize \
    --imatrix imatrix.gguf \
    --output-tensor-type q5_k \
    --token-embedding-type q3_k_hifi \
    input-model-f16.gguf \
    output-model-q3_k_hifi.gguf \
    Q3_K_HIFI

IMatrix File Formats

GGUF Format (Recommended)

Modern format, stored as .gguf files:

  • More efficient
  • Better metadata support
  • Can store multiple datasets
  • Default format in recent versions

Legacy Format

Older binary format, stored as .dat files:

  • Still supported for compatibility
  • Use --output-format dat to generate

Converting Between Formats

# Convert legacy to GGUF
./llama-imatrix --in-file imatrix.dat -o imatrix.gguf

# Convert GGUF to legacy
./llama-imatrix --in-file imatrix.gguf --output-format dat -o imatrix.dat

Combining Multiple IMatrix Files

You can merge imatrix files from multiple runs or datasets:

./llama-imatrix \
    --in-file imatrix-dataset1.gguf \
    --in-file imatrix-dataset2.gguf \
    --in-file imatrix-dataset3.gguf \
    -o imatrix-combined.gguf

This is useful for:

  • Combining data from different domains
  • Merging results from multiple calibration runs
  • Creating a more comprehensive importance matrix
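Conceptually, merging behaves like a count-weighted average: since stored values are per-position averages of squared activations (Σ(act²) / n_calls, see Technical Details), each file's contribution should be weighted by how many calls it represents. A minimal Python sketch of that idea (`merge_imatrix` is a hypothetical helper illustrating the math, not the tool's implementation):

```python
# Sketch: count-weighted merge of per-tensor importance averages.
# Each entry is (averaged_scores, n_calls) for the same tensor.

def merge_imatrix(entries):
    """Merge per-position averages of squared activations, weighting
    each run by its call count. Returns (merged_avg, total_calls)."""
    total_calls = sum(n for _, n in entries)
    dim = len(entries[0][0])
    merged = [0.0] * dim
    for avg, n in entries:
        for i in range(dim):
            merged[i] += avg[i] * n  # recover each run's sum(act^2)
    merged = [s / total_calls for s in merged]
    return merged, total_calls
```

A run with 30 chunks therefore contributes three times the weight of a 10-chunk run, which is why merging short runs from several domains can approximate one long mixed-domain run.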

Analyzing IMatrix Files

View Statistics

./llama-imatrix --in-file imatrix.gguf --show-statistics

This displays:

  • Per Tensor:

    • Σ(Act²): Sum of squared activations (importance scores)
    • Min & Max: Range of importance values
    • μ & σ: Mean and standard deviation
    • % Active: Proportion of active elements
    • Entropy: Information content
    • ZD Score: Layer importance metric
    • CosSim: Cosine similarity with previous layer
  • Per Layer:

    • Weighted averages of importance metrics

Understanding the Statistics

  • High Σ(Act²): Tensor is very active during inference
  • High % Active: Many weights contribute significantly
  • High Entropy: Weights have diverse importance (good for quantization)
  • High ZD Score: Layer is important to preserve
  • High CosSim: Layer is similar to previous (may indicate redundancy)
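The exact definitions used by `--show-statistics` live in the tool itself, but two of the simpler metrics can be sketched from first principles. Assuming % Active counts positions with a nonzero importance score and Entropy is the Shannon entropy of the normalized score distribution (both are assumptions for illustration, not confirmed definitions), a Python sketch:

```python
import math

def tensor_stats(scores, active_eps=1e-12):
    """Illustrative summary stats over one tensor's importance scores
    (per-position sums of squared activations). Definitions assumed."""
    total = sum(scores)
    pct_active = 100.0 * sum(1 for s in scores if s > active_eps) / len(scores)
    # Shannon entropy (bits) of the normalized score distribution:
    # uniform scores -> maximal entropy, one dominant score -> near zero.
    entropy = 0.0
    if total > 0:
        for s in scores:
            if s > 0:
                p = s / total
                entropy -= p * math.log2(p)
    return {"sum": total, "pct_active": pct_active, "entropy": entropy}
```

Under these definitions, a tensor where every position matters equally has maximal entropy, while a tensor dominated by a few positions scores near zero, which matches the interpretation above.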

Best Practices

1. Calibration Dataset Selection

Do:

  • Use representative data similar to your use case
  • Include diverse examples
  • Use at least 1000-10000 chunks for good coverage
  • Match the domain (e.g., code for code models, text for language models)

Don't:

  • Use too small a dataset (< 100 chunks)
  • Use completely unrelated data
  • Use only one type of example

2. Processing Settings

Do:

  • Use GPU offloading (-ngl 99) for speed
  • Save frequently (--output-frequency 10)
  • Create snapshots (--save-frequency 50) for long runs
  • Process enough chunks (1000+ recommended)

Don't:

  • Process output.weight unless necessary (--process-output is usually not needed)
  • Skip validation of your calibration data

3. Quantization Usage

Do:

  • Always use imatrix for Q3_K_HIFI (it significantly improves outlier selection)
  • Use imatrix for aggressive quantizations (Q2_K, Q3_K_S)
  • Include attention and feed-forward weights
  • Test quality after quantization

Don't:

  • Use imatrix for output.weight (usually excluded by default)
  • Assume imatrix will always improve quality (test it)
  • Use an imatrix from a different model architecture

Complete Workflow Example

Here's a complete example for quantizing a model with Q3_K_HIFI using an imatrix:

# Step 1: Generate importance matrix
./llama-imatrix \
    -m ./models/llama-3-8b-f16.gguf \
    -f ./data/calibration-text.txt \
    -o ./imatrix.gguf \
    -ngl 99 \
    --output-frequency 20 \
    --chunks 1000

# Step 2: (Optional) View statistics
./llama-imatrix --in-file ./imatrix.gguf --show-statistics

# Step 3: Quantize using the imatrix
./llama-quantize \
    --imatrix ./imatrix.gguf \
    ./models/llama-3-8b-f16.gguf \
    ./models/llama-3-8b-q3_k_hifi.gguf \
    Q3_K_HIFI

# Step 4: Test the quantized model
./llama-cli \
    -m ./models/llama-3-8b-q3_k_hifi.gguf \
    -p "Hello, how are you?"

How IMatrix Works with Q3_K_HIFI

For Q3_K_HIFI specifically, the imatrix is particularly valuable:

  1. Outlier Selection: The imatrix weights the magnitude calculation:

    mag[i] = fabsf(xb[i]) * quant_weights[i]

    This means important weights (high imatrix values) are more likely to be selected as outliers.

  2. Better Quality: By preserving important weights as FP16 outliers, the model maintains better accuracy.

  3. Smart Compression: Less important weights can be more aggressively quantized to 3-bit, while critical ones stay in FP16.
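The outlier-selection step above can be sketched in a few lines of Python. `select_outliers` is a hypothetical stand-in for the C code; it applies the same `mag[i] = fabsf(xb[i]) * quant_weights[i]` weighting and returns the indices of the top-scoring positions:

```python
def select_outliers(xb, quant_weights, n_outliers):
    """Pick the n_outliers positions with the largest importance-weighted
    magnitude, mirroring mag[i] = fabsf(xb[i]) * quant_weights[i]."""
    mag = [abs(x) * w for x, w in zip(xb, quant_weights)]
    return sorted(range(len(xb)), key=lambda i: mag[i], reverse=True)[:n_outliers]

# With importance weighting, index 1 wins despite its small magnitude:
# select_outliers([0.9, -0.2, 0.5], [1.0, 10.0, 1.0], 1) -> [1]
```

This shows the key effect: a weight with modest magnitude but a high importance score can outrank a larger weight, so it is the one preserved in FP16.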

Example Impact

Without imatrix:

  • Outliers selected purely by magnitude
  • May miss important but smaller-magnitude weights
  • Quality: Baseline

With imatrix:

  • Outliers selected by importance-weighted magnitude
  • Preserves critical weights even if not the largest
  • Quality: Typically 5-15% lower perplexity than quantizing without an imatrix

Troubleshooting

Problem: IMatrix generation is slow

Solutions:

  • Use GPU offloading: -ngl 99
  • Reduce chunks: --chunks 500
  • Disable perplexity: --no-ppl

Problem: IMatrix file is very large

Solutions:

  • This is normal (can be 100MB-1GB+)
  • Use GGUF format (more efficient than legacy)
  • The file is only needed during quantization, not inference

Problem: Quantization quality didn't improve

Solutions:

  • Check that imatrix was generated on similar data
  • Verify imatrix file loaded correctly (check logs)
  • Try including/excluding specific tensors
  • Ensure calibration dataset is representative

Problem: "imatrix mapping error"

Solutions:

  • IMatrix was generated for a different model architecture
  • Tensor names don't match
  • Regenerate imatrix for your specific model

Technical Details

What Gets Stored

For each tensor, the imatrix stores:

  • Squared activations: act² for each weight position
  • Call count: How many times the tensor was accessed
  • Averaged values: Σ(act²) / n_calls for normalization
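A minimal sketch of this accumulation (a hypothetical in-memory collector, not the tool's actual data structure):

```python
class ImatrixAccumulator:
    """Running collector of squared activations for one tensor,
    matching the description above: sum(act^2) plus a call count."""
    def __init__(self, dim):
        self.sum_sq = [0.0] * dim   # per-position sum of act^2
        self.n_calls = 0            # how many times the tensor was hit

    def observe(self, activations):
        for i, a in enumerate(activations):
            self.sum_sq[i] += a * a
        self.n_calls += 1

    def averaged(self):
        # The normalized values: sum(act^2) / n_calls
        return [s / self.n_calls for s in self.sum_sq]
```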

How It's Used

During quantization:

  1. IMatrix data is loaded and mapped to tensor names
  2. For each weight block, importance scores are retrieved
  3. Quantization algorithms use these scores to:
    • Weight magnitude calculations
    • Select outliers (Q3_K_HIFI)
    • Choose quantization scales
    • Determine precision levels
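As one illustration of the "weight magnitude calculations" and "choose quantization scales" steps: importance scores can enter scale selection as per-position weights in the quantization error, choosing the scale that minimizes Σ wᵢ·(xᵢ − scale·qᵢ)². The grid-search sketch below is illustrative only (the `best_weighted_scale` helper and the search strategy are assumptions, not llama.cpp's actual algorithm), using a signed 3-bit grid:

```python
def best_weighted_scale(x, w, qmin=-4, qmax=3, n_try=64):
    """Grid-search the scale that minimizes the importance-weighted
    squared quantization error on a signed 3-bit grid [qmin, qmax]."""
    amax = max(abs(v) for v in x)
    if amax == 0.0:
        return 0.0
    best_s, best_err = amax / qmax, float("inf")
    for k in range(1, n_try + 1):
        s = amax * k / (n_try * qmax)                # candidate scale
        err = 0.0
        for xi, wi in zip(x, w):
            q = min(qmax, max(qmin, round(xi / s)))  # quantize to the grid
            err += wi * (xi - s * q) ** 2            # weighted error
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```

With equal weights this reduces to ordinary least-squares scale fitting; with imatrix weights, the scale bends toward reproducing the high-importance positions accurately at the expense of the rest.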

File Structure

GGUF format imatrix contains:

  • Metadata: chunk count, chunk size, dataset names
  • Tensor data: For each tensor, arrays of importance scores
  • Statistics: Optional computed statistics

Summary

IMatrix files are essential for high-quality quantization, especially for formats like Q3_K_HIFI that benefit from intelligent outlier selection.

Key Takeaways:

  1. Generate imatrix using representative calibration data
  2. Use GPU acceleration for faster generation
  3. Always use imatrix when quantizing to Q3_K_HIFI
  4. Combine multiple imatrix files for better coverage
  5. Analyze statistics to understand your model's weight importance

For Q3_K_HIFI specifically: The imatrix directly improves outlier selection, making it one of the most impactful uses of importance matrices in quantization.