Compile Models via MLC¶

This page describes how to compile a model with MLC LLM. Model compilation takes the model inputs, produces quantized model weights, and optimizes the model library for a given platform. It enables users to bring their own new model weights, try different quantization modes, and customize the overall model optimization flow.

Note

Before you proceed, please make sure that you have TVM Unity correctly installed on your machine by following Install TVM Unity Compiler. TVM Unity is the necessary foundation for compiling models with MLC LLM. If you want to build for WebGPU, please also complete Install Wasm Build Environment. Please also follow the instructions in CLI and C++ API to obtain the CLI app that can be used to chat with the compiled model. Finally, we strongly recommend reading Project Overview first to get familiar with the high-level terminology.
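As a quick sanity check that TVM Unity is visible to your Python environment (assuming it is installed as the tvm Python package), you can print the package location and confirm it points at the build you installed:

python3 -c "import tvm; print(tvm.__file__)"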

Install MLC-LLM Package¶

Work with Source Code¶

The easiest way to use MLC-LLM is to clone the repository and compile models from the root directory of the repository.

# clone the repository
git clone https://github.com/mlc-ai/mlc-llm.git --recursive
# enter the root directory of the repo
cd mlc-llm
# install mlc-llm
pip install .

Verify Installation¶

python3 -m mlc_llm.build --help

You should see the help information of the build script.

Get Started¶

This section provides step-by-step instructions to guide you through the compilation process of one specific model. We take RedPajama-v1-3B as an example. You can select the platform where you want to run your model from the tabs below and run the corresponding command. We strongly recommend starting with Metal/CUDA/Vulkan, as it is easier to validate the compilation result on your personal computer.

On Apple Silicon powered Mac, compile for Apple Silicon Mac:

python3 -m mlc_llm.build --hf-path togethercomputer/RedPajama-INCITE-Chat-3B-v1 --target metal --quantization q4f16_1

On Apple Silicon powered Mac, compile for x86 Mac:

python3 -m mlc_llm.build --hf-path togethercomputer/RedPajama-INCITE-Chat-3B-v1 --target metal_x86_64 --quantization q4f16_1

By executing the compile command above, we generate the model weights, the model library, and a chat config. We can check the output with the commands below:

~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1
  RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal.so     # ===> the model library
  mod_cache_before_build_metal.pkl                 # ===> a cached file for future builds
  params                                           # ===> containing the model weights, tokenizer and chat config

~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1/params
  mlc-chat-config.json                             # ===> the chat config
  ndarray-cache.json                               # ===> the model weight info
  params_shard_0.bin                               # ===> the model weights
  params_shard_1.bin
  ...
  tokenizer.json                                   # ===> the tokenizer files
  tokenizer_config.json
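If you want to take a quick look at the generated chat config (for example, to see the conversation template and quantization recorded for this build), you can print it directly; this is purely an optional inspection step:

~/mlc-llm > cat dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1/params/mlc-chat-config.json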

We can now chat with the model using the command line interface (CLI) app.

# Run CLI
mlc_chat_cli --model RedPajama-INCITE-Chat-3B-v1-q4f16_1

The CLI will use the config file dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1/params/mlc-chat-config.json and model library dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1/RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal.so.

Each compilation target produces a specific model library for the given platform. The model weights are shared across different targets, as shown in the sketch below. If you are interested in distributing the model beyond local execution, please check out Distribute Compiled Models. You are also more than welcome to read the following sections for more details about the compilation.
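For example, if your machine also has a Vulkan-capable or NVIDIA GPU, rebuilding the same model for those targets only produces additional model libraries, while the weights under params are reused; a sketch following the same command pattern as above:

# same weights, additional model libraries for other targets
python3 -m mlc_llm.build --hf-path togethercomputer/RedPajama-INCITE-Chat-3B-v1 --target vulkan --quantization q4f16_1
python3 -m mlc_llm.build --hf-path togethercomputer/RedPajama-INCITE-Chat-3B-v1 --target cuda --quantization q4f16_1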

Compile Command Specification¶

This section describes the list of options that can be used during compilation. Note that the arguments are generated by the dataclass BuildArgs; read more in the API Reference. Generally, the model compile command is specified by a sequence of arguments following the pattern below:

python3 -m mlc_llm.build \
    --model MODEL_NAME_OR_PATH \
    [--hf-path HUGGINGFACE_NAME] \
    --target TARGET_NAME \
    --quantization QUANTIZATION_MODE \
    [--max-seq-len MAX_ALLOWED_SEQUENCE_LENGTH] \
    [--reuse-lib LIB_NAME] \
    [--use-cache=0] \
    [--debug-dump] \
    [--use-safetensors]

The compile command starts with either --model or --hf-path, and only one of them should be specified: when the model is publicly available on Hugging Face, you can use --hf-path to specify it; otherwise, specify the model via --model.

--model MODEL_NAME_OR_PATH

The name or local path of the model to compile. We will search for the model on your disk in the following two locations:

  • dist/models/MODEL_NAME_OR_PATH (e.g., --model Llama-2-7b-chat-hf),

  • MODEL_NAME_OR_PATH (e.g., --model /my-model/Llama-2-7b-chat-hf).

When running the compile command with --model, please make sure you have placed the model to compile under dist/models/ or provided its path on disk, as illustrated below.
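For instance, both of the following are valid ways to point the build script at a local copy of the weights (illustrative commands, assuming you have downloaded Llama-2-7b-chat-hf):

# model placed under dist/models/
python3 -m mlc_llm.build --model Llama-2-7b-chat-hf --target cuda --quantization q4f16_1
# model stored elsewhere on disk, referenced by its path
python3 -m mlc_llm.build --model /my-model/Llama-2-7b-chat-hf --target cuda --quantization q4f16_1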

--hf-path HUGGINGFACE_NAME

The name of the model’s Hugging Face repository. We will download the model to dist/models/HUGGINGFACE_NAME and load the model from this directory.

For example, by specifying --hf-path togethercomputer/RedPajama-INCITE-Chat-3B-v1, the build script will download the model from https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1 to dist/models/.

Two other required arguments for the compile command are the target and the quantization mode:

--target TARGET_NAME

The target platform to compile the model for. The default target is auto, which automatically detects among cuda, metal, vulkan and opencl. Besides auto, other available options are: metal (for M1/M2 Macs), metal_x86_64 (for Intel Macs), iphone, vulkan, cuda, webgpu, android, and opencl.

--quantization QUANTIZATION_MODE

The quantization mode we use to compile. The format of the code is qAfB(_id), where A represents the number of bits for storing weights and B represents the number of bits for storing activations. Available options are: q3f16_0, q4f16_1, q4f16_2, q4f32_0, q0f32, and q0f16. We encourage you to use 4-bit quantization, as the text generated by 3-bit quantized models may be of poor quality depending on the model.
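For example, to try a different quantization mode on the Get Started model, you only need to change the --quantization flag; the output goes to a separate dist/ directory named after the chosen mode, following the naming pattern seen earlier (a sketch):

python3 -m mlc_llm.build --hf-path togethercomputer/RedPajama-INCITE-Chat-3B-v1 --target metal --quantization q4f32_0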

The following arguments are optional:

--max-seq-len MAX_ALLOWED_SEQUENCE_LENGTH

The maximum allowed sequence length for the model. When it is not specified, we will use the maximum sequence length from the config.json in the model directory.
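For example, to cap the context window explicitly instead of relying on config.json (the value 2048 here is purely illustrative):

python3 -m mlc_llm.build --model Llama-2-7b-chat-hf --target cuda --quantization q4f16_1 --max-seq-len 2048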

--reuse-lib LIB_NAME

Specifies the previously generated library to reuse. This is useful when building the same model architecture with different weights. You can refer to the model distribution page for details of this argument.
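As a sketch, suppose you have already built Llama-2-7b-chat-hf for CUDA and now want to compile a fine-tuned variant that shares the same architecture; the model name below is hypothetical, and the exact library name to pass should match the one produced by your earlier build (see the model distribution page):

# "Llama-2-7b-chat-hf-my-finetune" is a hypothetical local model with the same architecture
python3 -m mlc_llm.build --model Llama-2-7b-chat-hf-my-finetune --target cuda --quantization q4f16_1 --reuse-lib Llama-2-7b-chat-hf-q4f16_1-cuda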

--use-cache

When --use-cache=0 is specified, model compilation does not use cached files from previous builds and compiles the model from scratch. Using the cache can help reduce the time needed to compile.
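For example, to force a clean rebuild of a model that was compiled before (say, after replacing the weights on disk), add --use-cache=0 to the original command:

python3 -m mlc_llm.build --model Llama-2-7b-chat-hf --target cuda --quantization q4f16_1 --use-cache=0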

--debug-dump

Specifies whether to dump debugging files during compilation.

--use-safetensors

Specifies whether to use .safetensors instead of the default .bin when loading the model weights.
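For example, if your local checkout of the model ships its weights as .safetensors files rather than .bin, add the flag to the build command (a sketch, assuming such a checkout under dist/models/):

python3 -m mlc_llm.build --model Llama-2-7b-chat-hf --target cuda --quantization q4f16_1 --use-safetensors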

More Model Compile Commands¶

This section lists compile commands for more models that you can try out.

Please request access to the Llama-2 weights from Meta first. After access is granted, create the directory dist/models and download the model into it. For example, you can run the following commands:

mkdir -p dist/models
cd dist/models
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
cd ../..

After downloading the model, run the following command to compile the model.

python3 -m mlc_llm.build --model Llama-2-7b-chat-hf --target cuda --quantization q4f16_1

For each model and each backend, the above only provides the most recommended build command (which is the most optimized one). You can also try different argument values (e.g., different quantization modes); the resulting builds may not run as fast or as robustly as the recommended one.
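As an illustration of such a variation, the same Llama-2 build with a different quantization mode follows the same pattern; as noted below, the result may be noticeably worse than the recommended q4f16_1 build:

python3 -m mlc_llm.build --model Llama-2-7b-chat-hf --target cuda --quantization q3f16_0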

Note

Using 3-bit quantization can be overly aggressive and only works in limited settings. If you encounter issues where the compiled model does not perform as expected, consider using a higher number of bits for quantization (e.g., 4-bit quantization).

If you are interested in distributing the model beyond local execution, please check out Distribute Compiled Models.