Compile Model Libraries

To run a model with MLC LLM on any platform, you need:

  1. Model weights converted to MLC format (e.g. RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC).

  2. Model library that comprises the inference logic (see repo binary-mlc-llm-libs).

If you are simply adding a model variant, following Convert Weights via MLC suffices.

This page describes how to compile a model library with MLC LLM. Model compilation optimizes model inference for a given platform, allowing users to bring their own new model architecture, use different quantization modes, and customize the overall model optimization flow.

We compile RedPajama-INCITE-Chat-3B-v1 with q4f16_1 as an example for all platforms.

Note

Before you proceed, make sure you have followed Install TVM Unity Compiler, a required backend to compile models with MLC LLM.

Please also follow the instructions in CLI / Python API to obtain the CLI app / Python API that can be used to chat with the compiled model. Finally, we strongly recommend that you read Project Overview first to become familiar with the high-level terminology.

0. Verify Installation

Step 1. Verify mlc_llm

We use the Python package mlc_llm to compile models. It can be installed by following Install MLC LLM Python Package, either by building from source or by installing the prebuilt package. Verify the mlc_llm installation from the command line:

$ mlc_llm --help
# You should see help information with this line
usage: MLC LLM Command Line Interface. [-h] {compile,convert_weight,gen_config}

Note

If you get the error command not found: mlc_llm, try python -m mlc_llm --help.

Step 2. Verify TVM

To compile models, you also need to follow Install TVM Unity Compiler. Here we quickly verify tvm from the command line (for full verification, see Validate TVM Installation):

$ python -c "import tvm; print(tvm.__file__)"
/some-path/lib/python3.11/site-packages/tvm/__init__.py
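
Both checks can also be done from a single Python script. The following is a minimal sketch that only tests whether the two packages named above are importable; the helper name check_packages is made up for this example and is not part of mlc_llm or tvm.

```python
import importlib.util


def check_packages(names=("mlc_llm", "tvm")):
    """Return {package_name: bool} telling whether each top-level package is importable.

    Hypothetical helper for illustration; it does not validate the installation
    beyond checking that Python can locate the package.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

A package reported as missing usually means the corresponding installation step was run in a different Python environment.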

1. Clone from HF and convert_weight

This replicates Convert Weights via MLC; see that page for more details.

You can work inside the mlc-llm repo or in your own working directory. Note that all platforms can share the same compiled/quantized weights.

# Create directory
mkdir -p dist/models && cd dist/models
# Clone HF weights
git lfs install
git clone https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1
cd ../..
# Convert weight
mlc_llm convert_weight ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
    --quantization q4f16_1 \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC

2. Generate mlc-chat-config and compile

A model library is specified by:

  • The model architecture (e.g. llama-2, gpt-neox)

  • Quantization (e.g. q4f16_1, q0f32)

  • Metadata (e.g. context_window_size, sliding_window_size, prefill_chunk_size), which affects memory planning

  • Platform (e.g. cuda, webgpu, iOS)

All these knobs are specified in mlc-chat-config.json generated by gen_config.

# Create output directory for the model library compiled
mkdir -p dist/libs
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
    --quantization q4f16_1 --conv-template redpajama_chat \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
    --device cuda -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so
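
The three commands used so far (convert_weight, gen_config, compile) share the same model name, quantization, and output directory, so they can be generated from a handful of parameters. The sketch below only assembles the argument lists using the flags shown on this page; the function build_commands and the dist/ layout it encodes are assumptions that mirror this walkthrough, not an MLC LLM API. The .so suffix assumes a Linux shared library.

```python
import shlex


def build_commands(model_dir, model_name, quantization, conv_template, device):
    """Return the convert_weight, gen_config and compile invocations as argv lists.

    Hypothetical helper: paths follow the dist/ layout used in this walkthrough.
    """
    mlc_dir = f"dist/{model_name}-{quantization}-MLC"
    lib_path = f"dist/libs/{model_name}-{quantization}-{device}.so"
    convert = ["mlc_llm", "convert_weight", model_dir,
               "--quantization", quantization, "-o", mlc_dir]
    gen_config = ["mlc_llm", "gen_config", model_dir,
                  "--quantization", quantization,
                  "--conv-template", conv_template, "-o", mlc_dir]
    compile_lib = ["mlc_llm", "compile", f"{mlc_dir}/mlc-chat-config.json",
                   "--device", device, "-o", lib_path]
    return [convert, gen_config, compile_lib]


if __name__ == "__main__":
    commands = build_commands(
        model_dir="./dist/models/RedPajama-INCITE-Chat-3B-v1/",
        model_name="RedPajama-INCITE-Chat-3B-v1",
        quantization="q4f16_1",
        conv_template="redpajama_chat",
        device="cuda",
    )
    for argv in commands:
        # Print shell-quoted commands; nothing is executed here.
        print(shlex.join(argv))
```

Each list could be passed to subprocess.run(argv, check=True) to actually execute the step.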

Note

For the conv-template, conv_template.cc contains a full list of conversation templates that MLC provides. If the model you are adding requires a new conversation template, you need to add your own. Follow this PR as an example. Note that adding your own template requires you to build mlc_llm from source so that the runtime recognizes it.

For more details, please see Configure MLCChat in JSON.

3. Verify output and chat

By executing the commands above, we generate the model weights, the model library, and a chat config. We can check the output with the commands below:

~/mlc-llm > ls dist/libs
  RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so      # ===> the model library

~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
  mlc-chat-config.json                             # ===> the chat config
  ndarray-cache.json                               # ===> the model weight info
  params_shard_0.bin                               # ===> the model weights
  params_shard_1.bin
  ...
  tokenizer.json                                   # ===> the tokenizer files
  tokenizer_config.json
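
The same check can be scripted. This is a small sketch that looks only for the chat config, the weight info, and at least one weight shard from the listing above; tokenizer files are left out because their names vary by model. The function name missing_outputs is invented for this example.

```python
from pathlib import Path


def missing_outputs(model_dir):
    """Return the expected artifacts that are absent from model_dir.

    Hypothetical helper: checks file names taken from the listing above.
    """
    root = Path(model_dir)
    missing = [
        name
        for name in ("mlc-chat-config.json", "ndarray-cache.json")
        if not (root / name).is_file()
    ]
    # At least one weight shard (params_shard_0.bin, ...) should exist.
    if not any(root.glob("params_shard_*.bin")):
        missing.append("params_shard_*.bin")
    return missing


if __name__ == "__main__":
    problems = missing_outputs("dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC")
    print("all expected files present" if not problems else f"missing: {problems}")
```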

We can now chat with the model using the command line interface (CLI) app or the Python API.

python
>>> from mlc_llm import ChatModule
>>> cm = ChatModule(
...     model="./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC",
...     model_lib_path="./dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so",
... )
>>> cm.generate("hi")
'Hi! How can I assist you today?'

Compile Commands for More Models

This section lists compile commands for more models that you can try out. Note that this can be easily generalized to any model variant, as long as mlc-llm supports the architecture.

Please request access to the Llama-2 weights from Meta first. Once access is granted, create the directory dist/models and download the model into it. For example, you can run the following commands:

mkdir -p dist/models && cd dist/models
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
cd ../..

Then convert the HF weights into MLC-compatible weights. Note that all platforms can share the same compiled/quantized weights.

mlc_llm convert_weight ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC

Afterwards, run the following command to generate mlc config and compile the model.

# Create output directory for the model library compiled
mkdir -p dist/libs
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
    --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
    --device cuda -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-cuda.so

For each model and each backend, the above provides only the recommended build command, which is the most optimized one. You can also try different argument values (e.g. different quantization modes or context window sizes). These affect the runtime memory requirement, and the resulting builds may not run as fast or as robustly as the recommended one.

Note

Using 3-bit quantization can be overly aggressive and only works in limited settings. If the compiled model does not perform as expected, consider using a higher number of bits for quantization (e.g. 4-bit quantization).

If you are interested in distributing the model in addition to running it locally, please check out (Optional) 3. Upload weights to HF.

Compile Command Specification

As you have seen in the section above, the model compilation is split into three steps: convert weights, generate mlc-chat-config.json, and compile the model. This section describes the list of options that can be used during compilation.

1. Convert Weight

Weight conversion command follows the pattern below:

mlc_llm convert_weight \
    CONFIG \
    --quantization QUANTIZATION_MODE \
    [--model-type MODEL_TYPE] \
    [--device DEVICE] \
    [--source SOURCE] \
    [--source-format SOURCE_FORMAT] \
    --output OUTPUT

Note that CONFIG is a positional argument. Arguments wrapped with [ ] are optional.

CONFIG

It can be one of the following:

  1. Path to a HuggingFace model directory that contains a config.json, or

  2. Path to config.json in HuggingFace format, or

  3. The name of a pre-defined model architecture.

A config.json file in HuggingFace format defines the model architecture, including the vocabulary size, the number of layers, the hidden size, number of attention heads, etc. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json.

A HuggingFace directory often contains a config.json, which defines the model architecture, the non-quantized model weights in PyTorch or SafeTensor format, tokenizer configurations, and an optional generation_config.json that provides additional default configuration for text generation. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main.

For existing pre-defined model architecture, see MODEL_PRESETS here.

--quantization QUANTIZATION_MODE

The quantization mode we use to compile.

See Quantization Mode for more information. Available options are: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq.

We encourage you to use 4-bit quantization, as the text generated by 3-bit quantized models may have bad quality depending on the model.

--model-type MODEL_TYPE

Model architecture such as “llama”. If not set, it is inferred from config.json.

--device DEVICE

The device used for quantization, such as “cuda” or “cuda:0”. If not specified, it is detected from the locally available GPUs.

--source SOURCE

The path to the original model weights. If not provided, it is inferred from CONFIG.

--source-format SOURCE_FORMAT

The format of the source model weights. If not provided, it is inferred from CONFIG.

--output OUTPUT

The output directory in which to save the quantized model weights. The command creates params_shard_*.bin and ndarray-cache.json in this directory.

2. Generate MLC Chat Config

To compile a model, we first need to generate mlc-chat-config.json. This file contains specifications such as context_window_size and sliding_window_size, among others, that can alter the compiled model. We also process tokenizers in this step.

Config generation command follows the pattern below:

mlc_llm gen_config \
    CONFIG \
    --quantization QUANTIZATION_MODE \
    [--model-type MODEL_TYPE] \
    --conv-template CONV_TEMPLATE \
    [--context-window-size CONTEXT_WINDOW_SIZE] \
    [--sliding-window-size SLIDING_WINDOW_SIZE] \
    [--prefill-chunk-size PREFILL_CHUNK_SIZE] \
    [--tensor-parallel-shard TENSOR_PARALLEL_SHARDS] \
    --output OUTPUT

Note that CONFIG is a positional argument. Arguments wrapped with [ ] are optional.

CONFIG

It can be one of the following:

  1. Path to a HuggingFace model directory that contains a config.json, or

  2. Path to config.json in HuggingFace format, or

  3. The name of a pre-defined model architecture.

A config.json file in HuggingFace format defines the model architecture, including the vocabulary size, the number of layers, the hidden size, number of attention heads, etc. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json.

A HuggingFace directory often contains a config.json, which defines the model architecture, the non-quantized model weights in PyTorch or SafeTensor format, tokenizer configurations, and an optional generation_config.json that provides additional default configuration for text generation. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main.

For existing pre-defined model architecture, see MODEL_PRESETS here.

--quantization QUANTIZATION_MODE

The quantization mode we use to compile.

See Quantization Mode for more information. Available options are: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq.

We encourage you to use 4-bit quantization, as the text generated by 3-bit quantized models may have bad quality depending on the model.

--model-type MODEL_TYPE

Model architecture such as “llama”. If not set, it is inferred from config.json.

--conv-template CONV_TEMPLATE

Conversation template. It depends on how the model is tuned. Use “LM” for a vanilla base model. For existing pre-defined templates, see CONV_TEMPLATES here.

--context-window-size CONTEXT_WINDOW_SIZE

The maximum sequence length supported by the model. This is usually stated as the context length or context window in the model card. If this option is not set explicitly, it is determined by context_window_size or max_position_embeddings in config.json; note that the latter can be inaccurate for some models.
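
The fallback described above can be pictured with a short sketch. It is an illustration only, not mlc_llm's implementation: the function name is made up, and checking context_window_size before max_position_embeddings is an assumption based on the order in which the two keys are mentioned.

```python
def resolve_context_window_size(hf_config, cli_value=None):
    """Pick the context window size for a model.

    hf_config is the parsed config.json as a dict; cli_value is the value
    passed via --context-window-size, or None when the flag is omitted.
    Hypothetical helper; the key precedence is an assumption.
    """
    if cli_value is not None:
        # An explicit flag always wins over config.json.
        return cli_value
    for key in ("context_window_size", "max_position_embeddings"):
        if key in hf_config:
            return hf_config[key]
    raise ValueError("no context window size found; pass --context-window-size")


if __name__ == "__main__":
    config = {"max_position_embeddings": 2048}
    print(resolve_context_window_size(config))        # taken from config.json
    print(resolve_context_window_size(config, 1024))  # explicit flag overrides it
```

Setting the flag explicitly is the safer choice when the model card and config.json disagree.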

--sliding-window-size SLIDING_WINDOW_SIZE

(Experimental) The sliding window size in sliding window attention (SWA). This optional field overrides the sliding_window in config.json for models that use SWA. It is currently only useful when compiling Mistral-based models. This flag is subject to future refactoring.

--prefill-chunk-size PREFILL_CHUNK_SIZE

(Experimental) The chunk size during prefilling. By default, the chunk size is the same as context_window_size or sliding_window_size. This flag is subject to future refactoring.

--tensor-parallel-shard TENSOR_PARALLEL_SHARDS

Number of shards to split the model into for tensor-parallel multi-GPU inference.

--output OUTPUT

The output directory for generated configurations, including mlc-chat-config.json and tokenizer configuration.

3. Compile Model Library

After generating mlc-chat-config.json, we can compile the model into a model library (a file ending in .so, .tar, etc. that contains the inference logic of a model).

Model compilation command follows the pattern below:

mlc_llm compile \
    MODEL \
    [--quantization QUANTIZATION_MODE] \
    [--model-type MODEL_TYPE] \
    [--device DEVICE] \
    [--host HOST] \
    [--opt OPT] \
    [--system-lib-prefix SYSTEM_LIB_PREFIX] \
    --output OUTPUT \
    [--overrides OVERRIDES]

Note that MODEL is a positional argument. Arguments wrapped with [ ] are optional.

MODEL

A path to mlc-chat-config.json, or an MLC model directory that contains mlc-chat-config.json.

--quantization QUANTIZATION_MODE

The quantization mode we use to compile. If not provided, it is inferred from MODEL.

See Quantization Mode for more information. Available options are: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq.

We encourage you to use 4-bit quantization, as the text generated by 3-bit quantized models may have bad quality depending on the model.

--model-type MODEL_TYPE

Model architecture such as “llama”. If not set, it is inferred from mlc-chat-config.json.

--device DEVICE

The GPU device to compile the model to. If not set, it is inferred from GPUs available locally.

--host HOST

The host LLVM triple to compile the model to. If not set, it is inferred from the local CPU and OS. Examples of the LLVM triple:

  1. iPhones: arm64-apple-ios;

  2. ARM64 Android phones: aarch64-linux-android;

  3. WebAssembly: wasm32-unknown-unknown-wasm;

  4. Windows: x86_64-pc-windows-msvc;

  5. ARM macOS: arm64-apple-darwin.

--opt OPT

Optimization flags. MLC LLM maintains a predefined set of optimization levels, denoted O0, O1, O2, and O3, where O0 applies no optimization, O2 enables most optimizations, and O3 enables aggressive optimizations that could potentially break the system.

Alternatively, optimization flags can be specified explicitly via detailed knobs, e.g. --opt="cutlass_attn=1;cutlass_norm=0;cublas_gemm=0;cudagraph=0".

--system-lib-prefix SYSTEM_LIB_PREFIX

Adds a prefix to all exported symbols, similar to objcopy --prefix-symbols. This is useful when compiling multiple models into a single library, to avoid symbol conflicts. Unlike objcopy, this has no effect on shared libraries.

--output OUTPUT

The path to the output file. The suffix determines if the output file is a shared library or objects. Available suffixes:

  1. Linux: .so (shared), .tar (objects);

  2. macOS: .dylib (shared), .tar (objects);

  3. Windows: .dll (shared), .tar (objects);

  4. Android, iOS: .tar (objects);

  5. Web: .wasm (web assembly).
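
Because the suffix alone selects the output type, a build script can validate an output path before invoking the compiler. The lookup below is a sketch built from the suffix list above; the names SUFFIX_KINDS and output_kind are illustrative and not part of mlc_llm.

```python
from pathlib import PurePath

# Suffixes and output types as listed above (hypothetical table for illustration).
SUFFIX_KINDS = {
    ".so": "shared library",     # Linux
    ".dylib": "shared library",  # macOS
    ".dll": "shared library",    # Windows
    ".tar": "objects",           # all platforms, incl. Android and iOS
    ".wasm": "web assembly",     # Web
}


def output_kind(output_path):
    """Return the kind of artifact implied by the suffix of output_path."""
    suffix = PurePath(output_path).suffix
    if suffix not in SUFFIX_KINDS:
        raise ValueError(f"unsupported output suffix: {suffix!r}")
    return SUFFIX_KINDS[suffix]


if __name__ == "__main__":
    print(output_kind("dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so"))
    print(output_kind("dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-iphone.tar"))
```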

--overrides OVERRIDES

Model configuration overrides, which take precedence over the values in mlc-chat-config.json. Supports context_window_size, prefill_chunk_size, sliding_window, max_batch_size and tensor_parallel_shards. Overrides are given as semicolon-separated key=value pairs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128".
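
When the override values come from a script, the string can be assembled programmatically. This is a minimal sketch that produces the semicolon-separated format shown above and rejects keys outside the supported list; format_overrides is a made-up helper, not an mlc_llm function.

```python
# Keys listed as supported by --overrides on this page.
SUPPORTED_OVERRIDES = (
    "context_window_size",
    "prefill_chunk_size",
    "sliding_window",
    "max_batch_size",
    "tensor_parallel_shards",
)


def format_overrides(**values):
    """Build the value for --overrides from keyword arguments.

    Hypothetical helper: raises ValueError for keys that are not supported.
    """
    unknown = sorted(set(values) - set(SUPPORTED_OVERRIDES))
    if unknown:
        raise ValueError(f"unsupported override(s): {', '.join(unknown)}")
    # Keyword arguments keep their insertion order, so the output is stable.
    return ";".join(f"{key}={int(value)}" for key, value in values.items())


if __name__ == "__main__":
    print(format_overrides(context_window_size=1024, prefill_chunk_size=128))
```

The resulting string can be passed as a single quoted argument after --overrides.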