An importance matrix (imatrix) file is a data structure that contains information about which weights in a neural network are most important during inference. It's generated by running the model on a calibration dataset and measuring how much each weight contributes to the output.
- Purpose: Improve quantization quality by preserving precision for important weights
- How it works: Tracks squared activations (importance scores) for each weight during inference
- Format: Stored as GGUF files (or the legacy `.dat` format)
- Usage: Passed to the quantization tool to guide which weights should be quantized more carefully
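The tracking step above can be pictured as a running accumulator. Here is a toy Python sketch of the idea (not llama.cpp's actual code; the layer size, chunk count, and variable names are all made up for illustration):

```python
import numpy as np

# Toy sketch of imatrix collection: for a linear layer y = x @ W,
# the importance of each input position is tracked as the running
# sum of squared activations feeding it.
rng = np.random.default_rng(0)
n_in, n_tokens = 8, 100

sum_act2 = np.zeros(n_in)   # Σ(act²) per input position
n_calls = 0                 # how many chunks were accumulated

for _ in range(10):                            # 10 calibration "chunks"
    x = rng.standard_normal((n_tokens, n_in))  # activations entering the layer
    sum_act2 += (x ** 2).sum(axis=0)           # accumulate squared activations
    n_calls += 1

importance = sum_act2 / n_calls                # averaged importance scores
```

Positions that consistently carry large activations end up with large `importance` values, and the quantizer later treats the corresponding weight columns more carefully.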
When quantizing a model, you're reducing precision from 16-bit or 32-bit floats to 3-bit, 4-bit, or other low-precision formats. This compression can cause quality loss. An imatrix helps by:
- Identifying Critical Weights: Shows which weights are most active/important during inference
- Guiding Quantization: Allows the quantizer to:
  - Preserve precision for important weights
  - Use more aggressive quantization for less important weights
  - Make smarter decisions about outlier selection (especially for Q3_K_HIFI)
- Improving Quality: Can significantly reduce perplexity increase compared to quantization without imatrix
For Q3_K_HIFI specifically, the imatrix is used to:
- Weight the magnitude calculation when selecting outliers: `mag[i] = fabsf(xb[i]) * quant_weights[i]`
- Prioritize important weights as outliers (stored in FP16)
- Improve overall quantization quality
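A minimal sketch of that selection rule (hypothetical values; the real kernel works block by block on the quantized tensor):

```python
import numpy as np

# Hypothetical sketch of importance-weighted outlier selection.
# xb: one block of weights; quant_weights: imatrix importance scores.
xb = np.array([0.9, -0.1, 0.05, -0.8, 0.2, 0.02, -0.3, 0.6])
quant_weights = np.array([1.0, 50.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# mag[i] = fabsf(xb[i]) * quant_weights[i], as in the formula above
mag = np.abs(xb) * quant_weights

n_outliers = 2
outliers = np.argsort(mag)[-n_outliers:]   # indices kept in FP16
```

Note that index 1 is selected despite its small magnitude (0.1) because its importance score is high; magnitude-only selection would have picked indices 0 and 3 instead.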
You need a text file with representative data that the model will process. This should be similar to the data your model will see in production.
Good sources for calibration data:
- Wikipedia articles (e.g., `wiki.train.raw`)
- Books or text corpora
- Domain-specific text relevant to your use case
- The model's training data (if available)
File format: Plain text; the tool tokenizes the file and processes it in context-sized chunks (add `--parse-special` if the data contains special tokens)
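If you are assembling calibration text from several sources, something like this works (all file names are placeholders; the two tiny sample corpora are created only to make the sketch runnable):

```python
from pathlib import Path

# Hypothetical sketch: build calibration-data.txt from several corpora.
# The two tiny sample files below stand in for your real domain text.
Path("docs.txt").write_text("How to build the project.\n\nRun make.\n", encoding="utf-8")
Path("chat-logs.txt").write_text("Hello!\nHow are you?\n", encoding="utf-8")

lines = []
for name in ["docs.txt", "chat-logs.txt"]:
    for line in Path(name).read_text(encoding="utf-8").splitlines():
        if line.strip():                     # drop blank lines
            lines.append(line.strip())

Path("calibration-data.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
```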
First, make sure you've built `llama-imatrix`:

```bash
# On Linux/Mac
make llama-imatrix

# On Windows (MSVC)
cmake --build build --config Release --target llama-imatrix
```

Basic usage:

```bash
./llama-imatrix \
    -m model-f16.gguf \
    -f calibration-data.txt \
    -o imatrix.gguf \
    -ngl 99
```

Parameters explained:
- `-m, --model`: Your F16 or F32 model file (input)
- `-f, --file`: Your calibration text file
- `-o, --output-file`: Output imatrix filename (default: `imatrix.gguf`)
- `-ngl, --n-gpu-layers`: Number of layers to offload to GPU (speeds up generation)
A more advanced invocation:

```bash
./llama-imatrix \
    -m model-f16.gguf \
    -f calibration-data.txt \
    -o imatrix.gguf \
    -ngl 99 \
    --output-frequency 10 \
    --save-frequency 50 \
    --chunk 0 \
    --chunks 100 \
    --parse-special \
    --process-output
```

Important options:
- `--output-frequency N`: How often to save progress (default: 10 chunks)
- `--save-frequency N`: Create backup snapshots (default: 0 = never)
- `--chunk N`: Skip the first N chunks (useful for resuming)
- `--chunks N`: Maximum chunks to process (default: -1 = all)
- `--parse-special`: Enable special token parsing (e.g., `<|im_start|>`)
- `--process-output`: Include the `output.weight` tensor (usually not recommended)
- `--no-ppl`: Disable perplexity calculation (faster, less info)
- `-lv, --verbosity`: Verbosity level (0 = silent, 1 = default, 2+ = verbose)
```bash
# 1. Generate imatrix with GPU acceleration
./llama-imatrix \
    -m ./models/llama-3-8b-f16.gguf \
    -f ./data/wiki.train.raw \
    -o ./imatrix.gguf \
    -ngl 99 \
    --output-frequency 20 \
    --save-frequency 100

# This will:
# - Process the calibration data
# - Track activations for each tensor
# - Save progress every 20 chunks
# - Create snapshots every 100 chunks
# - Output: imatrix.gguf
```

Once you have an imatrix file, use it during quantization:
```bash
./llama-quantize \
    --imatrix imatrix.gguf \
    input-model-f16.gguf \
    output-model-q3_k_hifi.gguf \
    Q3_K_HIFI
```

You can target specific tensors:

```bash
# Use imatrix only for attention and feed-forward layers
./llama-quantize \
    --imatrix imatrix.gguf \
    --include-weights attn_v \
    --include-weights ffn_down \
    input-model-f16.gguf \
    output-model-q3_k_hifi.gguf \
    Q3_K_HIFI

# Quantize with imatrix, custom tensor types, and output settings
./llama-quantize \
    --imatrix imatrix.gguf \
    --output-tensor-type q5_k \
    --token-embedding-type q3_k_hifi \
    input-model-f16.gguf \
    output-model-q3_k_hifi.gguf \
    Q3_K_HIFI
```

Modern format, stored as `.gguf` files:
- More efficient
- Better metadata support
- Can store multiple datasets
- Default format in recent versions
Older binary format, stored as `.dat` files:
- Still supported for compatibility
- Use `--output-format dat` to generate

```bash
# Convert legacy to GGUF
./llama-imatrix --in-file imatrix.dat -o imatrix.gguf

# Convert GGUF to legacy
./llama-imatrix --in-file imatrix.gguf --output-format dat -o imatrix.dat
```

You can merge imatrix files from multiple runs or datasets:
```bash
./llama-imatrix \
    --in-file imatrix-dataset1.gguf \
    --in-file imatrix-dataset2.gguf \
    --in-file imatrix-dataset3.gguf \
    -o imatrix-combined.gguf
```

This is useful for:
- Combining data from different domains
- Merging results from multiple calibration runs
- Creating a more comprehensive importance matrix
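Conceptually, merging is a call-count-weighted average of the stored scores. Here is a simplified sketch with made-up numbers (not the tool's exact file-level logic):

```python
import numpy as np

# Each run stores averaged Σ(act²) scores plus the number of chunks
# it processed, so merging is a chunk-count-weighted average.
avg1, n1 = np.array([4.0, 1.0, 0.5]), 100   # run on dataset 1
avg2, n2 = np.array([2.0, 3.0, 0.5]), 300   # run on dataset 2

merged = (avg1 * n1 + avg2 * n2) / (n1 + n2)
```

The run that saw more calibration chunks (here, dataset 2) pulls the merged scores toward its own averages.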
```bash
./llama-imatrix --in-file imatrix.gguf --show-statistics
```

This displays:
- Per Tensor:
  - Σ(Act²): Sum of squared activations (importance scores)
  - Min & Max: Range of importance values
  - μ & σ: Mean and standard deviation
  - % Active: Proportion of active elements
  - Entropy: Information content
  - ZD Score: Layer importance metric
  - CosSim: Cosine similarity with the previous layer
- Per Layer:
  - Weighted averages of importance metrics
How to interpret the numbers:
- High Σ(Act²): Tensor is very active during inference
- High % Active: Many weights contribute significantly
- High Entropy: Weights have diverse importance (good for quantization)
- High ZD Score: Layer is important to preserve
- High CosSim: Layer is similar to previous (may indicate redundancy)
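Several of these statistics can be recomputed from a raw score vector like so (hypothetical numbers; the exact definitions, especially ZD Score, may differ from llama-imatrix's implementation):

```python
import numpy as np

# A tensor's importance scores and (hypothetically) the previous layer's.
scores = np.array([4.0, 1.0, 0.25, 0.0, 2.0, 0.75])
prev_layer = np.array([3.5, 1.2, 0.3, 0.1, 1.8, 0.9])

total = scores.sum()                       # Σ(Act²)
mu, sigma = scores.mean(), scores.std()    # μ & σ
pct_active = (scores > 0).mean() * 100     # % Active

p = scores / total                         # normalize to a distribution
p = p[p > 0]
entropy = -(p * np.log2(p)).sum()          # Entropy in bits

cos_sim = scores @ prev_layer / (np.linalg.norm(scores) * np.linalg.norm(prev_layer))
```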
✅ Do:
- Use representative data similar to your use case
- Include diverse examples
- Use at least 1000-10000 chunks for good coverage
- Match the domain (e.g., code for code models, text for language models)
❌ Don't:
- Use too small a dataset (< 100 chunks)
- Use completely unrelated data
- Use only one type of example
✅ Do:
- Use GPU offloading (`-ngl 99`) for speed
- Save frequently (`--output-frequency 10`)
- Create snapshots (`--save-frequency 50`) for long runs
- Process enough chunks (1000+ recommended)
❌ Don't:
- Process `output.weight` unless necessary (`--process-output` is usually not needed)
- Skip validation of your calibration data
✅ Do:
- Always use imatrix for Q3_K_HIFI (it significantly improves outlier selection)
- Use imatrix for aggressive quantizations (Q2_K, Q3_K_S)
- Include attention and feed-forward weights
- Test quality after quantization
❌ Don't:
- Use imatrix for `output.weight` (usually excluded by default)
- Assume imatrix will always improve quality (test it)
- Use an imatrix from a different model architecture
Here's a complete example for quantizing a model with Q3_K_HIFI using an imatrix:
```bash
# Step 1: Generate importance matrix
./llama-imatrix \
    -m ./models/llama-3-8b-f16.gguf \
    -f ./data/calibration-text.txt \
    -o ./imatrix.gguf \
    -ngl 99 \
    --output-frequency 20 \
    --chunks 1000

# Step 2: (Optional) View statistics
./llama-imatrix --in-file ./imatrix.gguf --show-statistics

# Step 3: Quantize using the imatrix
./llama-quantize \
    --imatrix ./imatrix.gguf \
    ./models/llama-3-8b-f16.gguf \
    ./models/llama-3-8b-q3_k_hifi.gguf \
    Q3_K_HIFI

# Step 4: Test the quantized model
./llama-cli \
    -m ./models/llama-3-8b-q3_k_hifi.gguf \
    -p "Hello, how are you?"
```

For Q3_K_HIFI specifically, the imatrix is particularly valuable:
- Outlier Selection: The imatrix weights the magnitude calculation `mag[i] = fabsf(xb[i]) * quant_weights[i]`, so important weights (high imatrix values) are more likely to be selected as outliers.
- Better Quality: By preserving important weights as FP16 outliers, the model maintains better accuracy.
- Smart Compression: Less important weights can be more aggressively quantized to 3-bit, while critical ones stay in FP16.
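To make the FP16-outlier idea concrete, here is a toy round-trip with hypothetical numbers: one block where the importance-weighted pick is stored in FP16 while the rest land on a coarse grid:

```python
import numpy as np

# Toy illustration (made-up numbers): keep one critical weight in FP16
# while the rest are snapped to a coarse 3-bit-style grid.
block = np.array([0.8, -0.45, 0.07, 0.3])
importance = np.array([1.0, 1.0, 30.0, 1.0])

scale = 0.25
quantized = np.round(block / scale) * scale           # everything on the grid

outlier = int(np.argmax(np.abs(block) * importance))  # importance-weighted pick
hifi = quantized.copy()
hifi[outlier] = np.float16(block[outlier])            # stored at FP16 precision

err_plain = np.abs(block - quantized)[outlier]        # grid error at the outlier
err_hifi = np.abs(block - hifi)[outlier]              # FP16 error at the outlier
```

The grid rounds the small-but-important weight at index 2 all the way to zero, while storing it in FP16 keeps its error negligible.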
Without imatrix:
- Outliers selected purely by magnitude
- May miss important but smaller-magnitude weights
- Quality: Baseline
With imatrix:
- Outliers selected by importance-weighted magnitude
- Preserves critical weights even if not the largest
- Quality: Typically 5-15% better perplexity
Solutions:
- Use GPU offloading: `-ngl 99`
- Reduce chunks: `--chunks 500`
- Disable perplexity: `--no-ppl`
Solutions:
- This is normal (can be 100MB-1GB+)
- Use GGUF format (more efficient than legacy)
- The file is only needed during quantization, not inference
Solutions:
- Check that imatrix was generated on similar data
- Verify imatrix file loaded correctly (check logs)
- Try including/excluding specific tensors
- Ensure calibration dataset is representative
Likely causes:
- The imatrix was generated for a different model architecture
- Tensor names don't match
Solution:
- Regenerate the imatrix for your specific model
For each tensor, the imatrix stores:
- Squared activations: `act²` for each weight position
- Call count: How many times the tensor was accessed
- Averaged values: `Σ(act²) / n_calls` for normalization
During quantization:
- Imatrix data is loaded and mapped to tensor names
- For each weight block, importance scores are retrieved
- Quantization algorithms use these scores to:
  - Weight magnitude calculations
  - Select outliers (Q3_K_HIFI)
  - Choose quantization scales
  - Determine precision levels
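For instance, choosing a scale can become an importance-weighted least-squares fit. Here is a sketch of the idea with made-up numbers (llama.cpp's actual kernels are more involved):

```python
import numpy as np

# Importance-weighted scale selection: find the scale minimizing
# Σ w[i] * (x[i] - scale * q[i])², so error on important weights
# counts for more.
x = np.array([0.8, -0.4, 0.1, 0.6])     # original weights
w = np.array([4.0, 1.0, 1.0, 2.0])      # importance scores
q = np.round(x / 0.25).astype(int)      # candidate integer grid points

# Closed-form weighted least-squares solution for the scale
best_scale = (w * x * q).sum() / (w * q * q).sum()
```

Without the weights `w`, every element would pull equally on the scale; with them, the fit favors accuracy on the weights the imatrix marks as important.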
GGUF format imatrix contains:
- Metadata: chunk count, chunk size, dataset names
- Tensor data: For each tensor, arrays of importance scores
- Statistics: Optional computed statistics
Imatrix files are essential for high-quality quantization, especially for formats like Q3_K_HIFI that benefit from intelligent outlier selection.
Key Takeaways:
- Generate imatrix using representative calibration data
- Use GPU acceleration for faster generation
- Always use imatrix when quantizing to Q3_K_HIFI
- Combine multiple imatrix files for better coverage
- Analyze statistics to understand your model's weight importance
For Q3_K_HIFI specifically: The imatrix directly improves outlier selection, making it one of the most impactful uses of importance matrices in quantization.