# 🚧 Configure Quantization

## Quantization Algorithm

The default quantization algorithm used in MLC-LLM is the grouping quantization method discussed in the papers *The Case for 4-bit Precision: k-bit Inference Scaling Laws* and *LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models*.
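To make the idea concrete, below is a minimal NumPy sketch of symmetric group quantization: each group of consecutive weights shares one floating-point scale, and the weights themselves are stored as low-bit integers. This is an illustration only, not the actual MLC-LLM implementation (the real kernels are generated with TVM); the group size of 32, the symmetric scheme, and the fp16 scales are assumptions made for the example.

```python
import numpy as np

def group_quantize(weights, group_size=32, bits=4):
    """Quantize each group of `group_size` consecutive weights to `bits`-bit
    unsigned integers that share one fp16 scale (symmetric scheme)."""
    max_int = (1 << bits) - 1        # 15 for 4-bit storage
    zero_point = 1 << (bits - 1)     # 8: the integer code that represents 0.0
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / zero_point
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero groups
    q = np.clip(np.rint(w / scale) + zero_point, 0, max_int).astype(np.uint8)
    return q, scale.astype(np.float16)

def group_dequantize(q, scale, bits=4):
    """Recover approximate fp32 weights from the quantized representation."""
    zero_point = 1 << (bits - 1)
    return (q.astype(np.float32) - zero_point) * scale.astype(np.float32)

# Round-trip check on random weights
w = np.random.randn(4, 64).astype(np.float32)
q, s = group_quantize(w)
w_hat = group_dequantize(q, s).reshape(w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```

The key property of the grouping approach is that each small group gets its own scale, so outlier weights only affect the precision of their own group rather than the whole tensor.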

## Quantization Mode

In MLC-LLM we use a short code that indicates the quantization mode to use.

The format of the code is `qAfB(_id)`, where `A` represents the number of bits for storing weights and `B` represents the number of bits for storing activations. The `_id` is an integer identifier to distinguish different quantization algorithms (e.g. symmetric, non-symmetric, AWQ, etc.).
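The naming scheme can be read mechanically. The sketch below parses a quantization code into its components; `parse_quantization_code` is a hypothetical helper written for this explanation, not part of the MLC-LLM API.

```python
import re

def parse_quantization_code(code: str):
    """Split a quantization mode code like 'q4f16_1' or 'q4f16_awq' into
    (weight bits, activation bits, optional algorithm identifier)."""
    m = re.fullmatch(r"q(\d+)f(\d+)(?:_(\w+))?", code)
    if m is None:
        raise ValueError(f"not a valid quantization code: {code}")
    weight_bits, activation_bits, algo_id = m.groups()
    return int(weight_bits), int(activation_bits), algo_id

print(parse_quantization_code("q4f16_1"))    # (4, 16, '1')
print(parse_quantization_code("q0f32"))      # (0, 32, None)
print(parse_quantization_code("q4f16_awq"))  # (4, 16, 'awq')
```

For example, `q4f16_1` means 4-bit weights, 16-bit (fp16) activations, and algorithm variant `1`, while `q0f16` means the weights are not quantized and everything stays in fp16.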

Currently, available options are: `q0f16`, `q0f32`, `q3f16_1`, `q4f16_1`, `q4f32_1`, and `q4f16_awq` (not stable).
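The choice of mode mainly trades model size (and memory bandwidth) against accuracy. As a rough back-of-the-envelope comparison, the sketch below estimates weight storage under a few modes, assuming one 16-bit scale per group of 32 weights and ignoring zero points and other metadata; these are simplifying assumptions for illustration, so actual on-disk sizes will differ.

```python
def approx_bits_per_weight(weight_bits, float_bits, group_size=32):
    """Rough stored bits per weight: weight_bits == 0 means the weights
    are kept in the float type; otherwise each group of `group_size`
    quantized weights shares one float scale."""
    if weight_bits == 0:
        return float(float_bits)
    return weight_bits + float_bits / group_size

n_params = 7 * 10**9  # e.g. a 7B-parameter model
for code, wbits, fbits in [("q0f16", 0, 16), ("q3f16_1", 3, 16), ("q4f16_1", 4, 16)]:
    gib = approx_bits_per_weight(wbits, fbits) * n_params / 8 / 2**30
    print(f"{code}: ~{gib:.1f} GiB of weight storage")
```

Under these assumptions, a 7B-parameter model drops from roughly 13 GiB of fp16 weights to under 4 GiB with 4-bit grouped weights.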

More details to come.