Compile Model Libraries¶
To run a model with MLC LLM on any platform, we need:
- Model weights converted to MLC format (e.g. RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC)
- A model library that comprises the inference logic
This page describes how to compile a model library with MLC LLM. Model compilation optimizes the model inference for a given platform, allowing users to bring their own new model architectures, use different quantization modes, and customize the overall model optimization flow.
Notably, in many cases you do not need to explicitly call compile:
- If you are using the Python API, you can skip specifying model_lib and the system will JIT compile the library (see the sketch below).
- If you are building an iOS/Android package, check out Package Libraries and Weights, which provides a simpler high-level command that invokes compile behind the scenes.
This page is still helpful for understanding the compilation flow behind the scenes, or for explicitly creating model libraries.
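For example, with the Python API you can construct the engine without model_lib and let the system JIT compile the library for your local device. A minimal sketch, assuming the MLC-format model directory used throughout this page (weights plus mlc-chat-config.json) already exists:
python
>>> from mlc_llm import MLCEngine
>>> # No model_lib passed: the model library is JIT compiled for the local device.
>>> engine = MLCEngine(model="./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC")
>>> engine.chat.completions.create(
...     messages=[{"role": "user", "content": "hello"}]
... )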
We compile RedPajama-INCITE-Chat-3B-v1
with q4f16_1
as an example for all platforms.
Note
Before you proceed, make sure you have followed Install TVM Unity Compiler, which provides the backend required to compile models with MLC LLM.
Please also follow the instructions in CLI / Python API to obtain the CLI app / Python API that can be used to chat with the compiled model.
0. Verify Installation¶
Step 1. Verify mlc_llm
We use the Python package mlc_llm to compile models. It can be installed by following Install MLC LLM Python Package, either by building from source or by installing the prebuilt package. Verify the mlc_llm installation on the command line via:
$ mlc_llm --help
# You should see help information with this line
usage: MLC LLM Command Line Interface. [-h] {compile,convert_weight,gen_config}
Note
If you run into the error command not found: mlc_llm, try python -m mlc_llm --help.
Step 2. Verify TVM
To compile models, you also need to follow Install TVM Unity Compiler.
Here we quickly verify tvm from the command line (for full verification, see Validate TVM Installation):
$ python -c "import tvm; print(tvm.__file__)"
/some-path/lib/python3.11/site-packages/tvm/__init__.py
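Optionally, you can also check that TVM sees the device you plan to compile and run on. A quick sketch (query only the devices relevant to your platform; .exist reports whether the runtime detects the device):
python
>>> import tvm
>>> # Each call constructs a device handle; .exist reports whether TVM can see it.
>>> for dev in (tvm.cuda(0), tvm.metal(0), tvm.vulkan(0)):
...     print(dev, dev.exist)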
1. Clone from HF and convert_weight¶
This replicates Convert Model Weights; see that page for more details.
You can work under the mlc-llm repo or your own working directory. Note that all platforms can share the same compiled/quantized weights.
# Create directory
mkdir -p dist/models && cd dist/models
# Clone HF weights
git lfs install
git clone https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1
cd ../..
# Convert weight
mlc_llm convert_weight ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
--quantization q4f16_1 \
-o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
2. Generate mlc-chat-config and compile¶
A model library is specified by:
- The model architecture (e.g. llama-2, gpt-neox)
- Quantization (e.g. q4f16_1, q0f32)
- Metadata (e.g. context_window_size, sliding_window_size, prefill-chunk-size), which affects memory planning
- Platform (e.g. cuda, webgpu, iOS)
All these knobs are specified in mlc-chat-config.json
generated by gen_config
.
# Create output directory for the model library compiled
mkdir dist/libs
For CUDA:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
--quantization q4f16_1 --conv-template redpajama_chat \
-o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
--device cuda -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so
For M-chip Mac:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
--quantization q4f16_1 --conv-template redpajama_chat \
-o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
--device metal -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal.so
Cross-Compiling for Intel Mac on M-chip Mac:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
--quantization q4f16_1 --conv-template redpajama_chat \
-o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
--device metal:x86-64 -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal_x86_64.dylib
For Intel Mac:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
--quantization q4f16_1 --conv-template redpajama_chat \
-o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
--device metal -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal_x86_64.dylib
For Linux:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
--quantization q4f16_1 --conv-template redpajama_chat \
-o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
--device vulkan -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-vulkan.so
For Windows:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
--quantization q4f16_1 --conv-template redpajama_chat \
-o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
--device vulkan -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-vulkan.dll
For iOS (you need a Mac to compile models for it):
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ --quantization q4f16_1 \
--conv-template redpajama_chat --context-window-size 768 \
-o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
--device iphone -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-iphone.tar
Note
If you run into the error
Compilation error:
xcrun: error: unable to find utility "metal", not a developer tool or in PATH
xcrun: error: unable to find utility "metallib", not a developer tool or in PATH
please check that you have Command Line Tools for Xcode installed correctly.
You can use xcrun metal to validate: if it prints metal: error: no input files, the Command Line Tools for Xcode are installed and can be found, and you can proceed with compiling the model.
For Android:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ --quantization q4f16_1 \
--conv-template redpajama_chat --context-window-size 768 \
-o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
--device android -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-android.tar
For WebGPU:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
--quantization q4f16_1 --conv-template redpajama_chat \
-o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
--device webgpu -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-webgpu.wasm
Note
To compile for webgpu, you need to build mlc_llm from source when installing it. Besides, you also need to follow Install Wasm Build Environment.
Otherwise, you would run into the error:
RuntimeError: Cannot find libraries: wasm_runtime.bc
Note
For webgpu, when compiling larger models like Llama-2-7B
, you may want to add --prefill-chunk-size 1024
or lower --context-window-size
to decrease memory usage.
Otherwise, you may run into issues like:
TypeError: Failed to execute 'createBuffer' on 'GPUDevice': Failed to read the 'size' property from
'GPUBufferDescriptor': Value is outside the 'unsigned long long' value range.
Note
For the conv-template
, conversation_template.py
contains a full list of conversation templates that MLC provides. If the model you are adding
requires a new conversation template, you would need to add your own.
Follow this PR as an example.
However, adding your own template requires building mlc_llm from source in order for it to be recognized by the runtime.
For more details, please see Customize MLC Chat Config.
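Optionally, before moving on, you can sanity-check the knobs that gen_config recorded. A minimal sketch in Python, assuming the output directory above; the key names shown are illustrative and the exact set may vary across mlc_llm versions:
python
>>> import json
>>> # Load the generated chat config and print a few of the knobs discussed above.
>>> with open("dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json") as f:
...     cfg = json.load(f)
>>> for key in ("model_type", "quantization", "context_window_size", "prefill_chunk_size"):
...     print(key, "=", cfg.get(key))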
3. Verify output and chat¶
By executing the commands above, we generate the model weights, the model library, and the chat config. We can check the output with the commands below:
~/mlc-llm > ls dist/libs
RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so # ===> the model library
~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
mlc-chat-config.json # ===> the chat config
ndarray-cache.json # ===> the model weight info
params_shard_0.bin # ===> the model weights
params_shard_1.bin
...
tokenizer.json # ===> the tokenizer files
tokenizer_config.json
We can now chat with the model using the command line interface (CLI) app or the Python API.
python
>>> from mlc_llm import MLCEngine
>>> engine = MLCEngine(model="./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC",
... model_lib="./dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so")
>>> engine.chat.completions.create(
... messages=[{"role": "user", "content": "hello"}]
... )
ChatCompletionResponse(
choices=[ChatCompletionResponseChoice(
message=ChatCompletionMessage(
content="Hi! How can I assist you today?", role='assistant'
)
)],
...
)
~/mlc-llm > ls dist/libs
RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal.so # ===> the model library (will be -metal_x86_64.dylib for Intel Mac)
~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
mlc-chat-config.json # ===> the chat config
ndarray-cache.json # ===> the model weight info
params_shard_0.bin # ===> the model weights
params_shard_1.bin
...
tokenizer.json # ===> the tokenizer files
tokenizer_config.json
We can now chat with the model using the command line interface (CLI) app or the Python API.
python
>>> from mlc_llm import MLCEngine
>>> engine = MLCEngine(model="./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC",
... model_lib="./dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal.so")
>>> engine.chat.completions.create(
... messages=[{"role": "user", "content": "hello"}]
... )
ChatCompletionResponse(
choices=[ChatCompletionResponseChoice(
message=ChatCompletionMessage(
content="Hi! How can I assist you today?", role='assistant'
)
)],
...
)
~/mlc-llm > ls dist/libs
RedPajama-INCITE-Chat-3B-v1-q4f16_1-vulkan.so # ===> the model library (will be .dll for Windows)
~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
mlc-chat-config.json # ===> the chat config
ndarray-cache.json # ===> the model weight info
params_shard_0.bin # ===> the model weights
params_shard_1.bin
...
tokenizer.json # ===> the tokenizer files
tokenizer_config.json
We can now chat with the model using the command line interface (CLI) app or the Python API.
python
>>> from mlc_llm import MLCEngine
>>> engine = MLCEngine(model="./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC",
... model_lib="./dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-vulkan.so")
>>> engine.chat.completions.create(
... messages=[{"role": "user", "content": "hello"}]
... )
ChatCompletionResponse(
choices=[ChatCompletionResponseChoice(
message=ChatCompletionMessage(
content="Hi! How can I assist you today?", role='assistant'
)
)],
...
)
~/mlc-llm > ls dist/libs
RedPajama-INCITE-Chat-3B-v1-q4f16_1-iphone.tar # ===> the model library
~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
mlc-chat-config.json # ===> the chat config
ndarray-cache.json # ===> the model weight info
params_shard_0.bin # ===> the model weights
params_shard_1.bin
...
tokenizer.json # ===> the tokenizer files
tokenizer_config.json
The model lib dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-iphone.tar
will be packaged as a static library into the iOS app. Check out iOS Swift SDK for more details.
~/mlc-llm > ls dist/libs
RedPajama-INCITE-Chat-3B-v1-q4f16_1-android.tar # ===> the model library
~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
mlc-chat-config.json # ===> the chat config
ndarray-cache.json # ===> the model weight info
params_shard_0.bin # ===> the model weights
params_shard_1.bin
...
tokenizer.json # ===> the tokenizer files
tokenizer_config.json
The model lib dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-android.tar
will be packaged as a static library into the Android app. Check out Android SDK for more details.
~/mlc-llm > ls dist/libs
RedPajama-INCITE-Chat-3B-v1-q4f16_1-webgpu.wasm # ===> the model library
~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
mlc-chat-config.json # ===> the chat config
ndarray-cache.json # ===> the model weight info
params_shard_0.bin # ===> the model weights
params_shard_1.bin
...
tokenizer.json # ===> the tokenizer files
tokenizer_config.json
To use this in the WebGPU runtime, check out the WebLLM Javascript SDK.
Compile Commands for More Models¶
This section lists compile commands for more models that you can try out. Note that this can be easily generalized to any model variant, as long as MLC LLM supports the architecture.
Please request access to the Llama-2 weights from Meta first.
After being granted access, create the directory dist/models and download the model into it.
For example, you can run the following commands:
mkdir -p dist/models && cd dist/models
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
cd ../..
Then convert the HF weights into MLC-compatible weights. Note that all platforms can share the same compiled/quantized weights.
mlc_llm convert_weight ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC
Afterwards, run the following commands to generate the MLC chat config and compile the model.
# Create output directory for the model library compiled
mkdir dist/libs
For CUDA:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
--conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
--device cuda -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-cuda.so
For M-chip Mac:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
--conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
--device metal -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-metal.so
Cross-Compiling for Intel Mac on M-chip Mac:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
--conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
--device metal:x86-64 -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-metal_x86_64.dylib
For Intel Mac:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
--conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
--device metal -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-metal_x86_64.dylib
For Linux:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
--conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
--device vulkan -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-vulkan.so
For Windows:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
--conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
--device vulkan -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-vulkan.dll
For WebGPU:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
--context-window-size 2048 --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
--device webgpu -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-webgpu.wasm
Note
To compile for webgpu, you need to build mlc_llm from source when installing it. Besides, you also need to follow Install Wasm Build Environment.
Otherwise, you would run into the error:
RuntimeError: Cannot find libraries: wasm_runtime.bc
For iOS (you need a Mac to compile models for it):
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
--conv-template llama-2 --context-window-size 768 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
--device iphone -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-iphone.tar
For Android:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
--conv-template llama-2 --context-window-size 768 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
--device android -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-android.tar
Note that Mistral uses sliding window attention (SWA). Thus, instead of specifying
context-window-size
, we specify sliding-window-size
.
First create the directory dist/models and download the model into it.
For example, you can run the following commands:
mkdir -p dist/models && cd dist/models
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
cd ../..
Then convert the HF weights into MLC-compatible weights. Note that all platforms can share the same compiled/quantized weights.
mlc_llm convert_weight ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
-o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC
Afterwards, run the following commands to generate the MLC chat config and compile the model.
# Create output directory for the model library compiled
mkdir dist/libs
For CUDA:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
--conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
--device cuda -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-cuda.so
For M-chip Mac:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
--conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
--device metal -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-metal.so
For Intel Mac:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
--conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
--device metal -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-metal_x86_64.dylib
For Linux:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
--conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
--device vulkan -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-vulkan.so
For Windows:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
--conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
--device vulkan -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-vulkan.dll
For WebGPU:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
--prefill-chunk-size 1024 --conv-template mistral_default \
-o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
--device webgpu -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-webgpu.wasm
Note
To compile for webgpu, you need to build mlc_llm from source when installing it. Besides, you also need to follow Install Wasm Build Environment.
Otherwise, you would run into the error:
RuntimeError: Cannot find libraries: wasm_runtime.bc
Note
For webgpu, when compiling larger models like Llama-2-7B
, you may want to add --prefill-chunk-size 1024
or lower --context-window-size
to decrease memory usage.
Otherwise, you may run into issues like:
TypeError: Failed to execute 'createBuffer' on 'GPUDevice': Failed to read the 'size' property from
'GPUBufferDescriptor': Value is outside the 'unsigned long long' value range.
For iOS (you need a Mac to compile models for it):
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
--conv-template mistral_default --sliding-window-size 1024 --prefill-chunk-size 128 \
-o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
--device iphone -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-iphone.tar
For Android:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
--conv-template mistral_default --sliding-window-size 1024 --prefill-chunk-size 128 -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
--device android -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-android.tar
First create the directory dist/models and download the model into it.
For example, you can run the following commands:
mkdir -p dist/models && cd dist/models
git lfs install
git clone https://huggingface.co/DISTRIBUTOR/HF_MODEL
cd ../..
Then convert the HF weights into MLC-compatible weights. Note that all platforms can share the same compiled/quantized weights.
mlc_llm convert_weight ./dist/models/HF_MODEL/ --quantization q4f16_1 -o dist/OUTPUT-MLC
Afterwards, run the following commands to generate the MLC chat config and compile the model.
# Create output directory for the model library compiled
mkdir dist/libs
For CUDA:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device cuda -o dist/libs/OUTPUT-cuda.so
For M-chip Mac:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device metal -o dist/libs/OUTPUT-metal.so
For Intel Mac:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device metal -o dist/libs/OUTPUT-metal_x86_64.dylib
For Linux:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device vulkan -o dist/libs/OUTPUT-vulkan.so
For Windows:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device vulkan -o dist/libs/OUTPUT-vulkan.dll
For WebGPU:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device webgpu -o dist/libs/OUTPUT-webgpu.wasm
Note
To compile for webgpu, you need to build mlc_llm from source when installing it. Besides, you also need to follow Install Wasm Build Environment.
Otherwise, you would run into the error:
RuntimeError: Cannot find libraries: wasm_runtime.bc
Note
For webgpu, when compiling larger models like Llama-2-7B
, you may want to add --prefill-chunk-size 1024
or lower --context-window-size
to decrease memory usage.
Otherwise, you may run into issues like:
TypeError: Failed to execute 'createBuffer' on 'GPUDevice': Failed to read the 'size' property from
'GPUBufferDescriptor': Value is outside the 'unsigned long long' value range.
For iOS (you need a Mac to compile models for it):
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE \
--context-window-size 768 -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device iphone -o dist/libs/OUTPUT-iphone.tar
For Android:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE \
--context-window-size 768 -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device android -o dist/libs/OUTPUT-android.tar
For each model and each backend, the above provides only the recommended build command (which is the most optimized). You can also try different argument values (e.g., different quantization modes, context window sizes, etc.); the resulting builds affect the runtime memory requirement and may not run as fast and robustly as the provided one.
Note
Using 3-bit quantization is usually overly aggressive and only works in limited settings. If you encounter issues where the compiled model does not perform as expected, consider using more bits for quantization (e.g., 4-bit quantization).
If you are interested in distributing the model beyond local execution, please check out (Optional) 3. Upload weights to HF.
Compile Command Specification¶
As you have seen in the sections above, model compilation is split into three steps: convert weights, generate mlc-chat-config.json, and compile the model. This section describes the options that can be used in each of these steps.
1. Convert Weight¶
Weight conversion command follows the pattern below:
mlc_llm convert_weight \
CONFIG \
--quantization QUANTIZATION_MODE \
[--model-type MODEL_TYPE] \
[--device DEVICE] \
[--source SOURCE] \
[--source-format SOURCE_FORMAT] \
--output OUTPUT
Note that CONFIG
is a positional argument. Arguments wrapped with [ ]
are optional.
- CONFIG
It can be one of the following:
- Path to a HuggingFace model directory that contains a config.json, or
- Path to a config.json in HuggingFace format, or
- The name of a pre-defined model architecture.
A config.json file in HuggingFace format defines the model architecture, including the vocabulary size, the number of layers, the hidden size, the number of attention heads, etc. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json.
A HuggingFace directory often contains a config.json which defines the model architecture, the non-quantized model weights in PyTorch or SafeTensors format, and tokenizer configurations, as well as an optional generation_config.json that provides additional default configuration for text generation. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main.
For existing pre-defined model architectures, see MODEL_PRESETS here.
- --quantization QUANTIZATION_MODE
The quantization mode we use to compile.
See Quantization Mode for more information. Available options are: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq.
We encourage you to use 4-bit quantization, as the text generated by 3-bit quantized models may have bad quality depending on the model.
- --model-type MODEL_TYPE
Model architecture such as “llama”. If not set, it is inferred from config.json.
- --device DEVICE
The device used to do quantization, such as “cuda” or “cuda:0”. If not specified, it is detected from the locally available GPUs.
- --source SOURCE
The path to the original model weights, inferred from config if missing.
- --source-format SOURCE_FORMAT
The format of the source model weights, inferred from config if missing.
- --output OUTPUT
The output directory to save the quantized model weights. Creates params_shard_*.bin and ndarray-cache.json in this directory.
2. Generate MLC Chat Config¶
In order to compile a model, we first need to generate the mlc-chat-config.json. This file contains specifications like context-window-size and sliding-window-size, among others, that can alter the compiled model. We also process tokenizers in this step.
Config generation command follows the pattern below:
mlc_llm gen_config \
CONFIG \
--quantization QUANTIZATION_MODE \
[--model-type MODEL_TYPE] \
--conv-template CONV_TEMPLATE \
[--context-window-size CONTEXT_WINDOW_SIZE] \
[--sliding-window-size SLIDING_WINDOW_SIZE] \
[--prefill-chunk-size PREFILL_CHUNK_SIZE] \
[--tensor-parallel-shards TENSOR_PARALLEL_SHARDS] \
--output OUTPUT
Note that CONFIG
is a positional argument. Arguments wrapped with [ ]
are optional.
- CONFIG
It can be one of the following:
- Path to a HuggingFace model directory that contains a config.json, or
- Path to a config.json in HuggingFace format, or
- The name of a pre-defined model architecture.
A config.json file in HuggingFace format defines the model architecture, including the vocabulary size, the number of layers, the hidden size, the number of attention heads, etc. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json.
A HuggingFace directory often contains a config.json which defines the model architecture, the non-quantized model weights in PyTorch or SafeTensors format, and tokenizer configurations, as well as an optional generation_config.json that provides additional default configuration for text generation. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main.
For existing pre-defined model architectures, see MODEL_PRESETS here.
- --quantization QUANTIZATION_MODE
The quantization mode we use to compile.
See Quantization Mode for more information. Available options are: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq.
We encourage you to use 4-bit quantization, as the text generated by 3-bit quantized models may have bad quality depending on the model.
- --model-type MODEL_TYPE
Model architecture such as “llama”. If not set, it is inferred from config.json.
- --conv-template CONV_TEMPLATE
Conversation template. It depends on how the model is tuned. Use “LM” for a vanilla base model. For existing pre-defined templates, see CONV_TEMPLATES here.
- --context-window-size CONTEXT_WINDOW_SIZE
Option to provide the maximum sequence length supported by the model. This is usually explicitly shown as context length or context window in the model card. If this option is not set explicitly, it is by default determined by context_window_size or max_position_embeddings in config.json, and the latter is usually inaccurate for some models.
- --sliding-window-size SLIDING_WINDOW
(Experimental) The sliding window size in sliding window attention (SWA). This optional field overrides the sliding_window in config.json for models that use SWA. Currently only useful when compiling Mistral-based models. This flag is subject to future refactoring.
- --prefill-chunk-size PREFILL_CHUNK_SIZE
(Experimental) The chunk size during prefilling. By default, the chunk size is the same as context_window_size or sliding_window_size. This flag is subject to future refactoring.
- --tensor-parallel-shards TENSOR_PARALLEL_SHARDS
Number of shards to split the model into for tensor-parallel multi-GPU inference.
- --output OUTPUT
The output directory for the generated configurations, including mlc-chat-config.json and the tokenizer configuration.
3. Compile Model Library¶
After generating mlc-chat-config.json, we can compile the model into a model library (a file ending in .so, .tar, etc. that contains the inference logic of the model).
Model compilation command follows the pattern below:
mlc_llm compile \
MODEL \
[--quantization QUANTIZATION_MODE] \
[--model-type MODEL_TYPE] \
[--device DEVICE] \
[--host HOST] \
[--opt OPT] \
[--system-lib-prefix SYSTEM_LIB_PREFIX] \
--output OUTPUT \
[--overrides OVERRIDES]
Note that MODEL
is a positional argument. Arguments wrapped with [ ]
are optional.
- MODEL
A path to mlc-chat-config.json, or an MLC model directory that contains mlc-chat-config.json.
- --quantization QUANTIZATION_MODE
The quantization mode we use to compile. If not provided, it is inferred from MODEL.
See Quantization Mode for more information. Available options are: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq.
We encourage you to use 4-bit quantization, as the text generated by 3-bit quantized models may have bad quality depending on the model.
- --model-type MODEL_TYPE
Model architecture such as “llama”. If not set, it is inferred from mlc-chat-config.json.
- --device DEVICE
The GPU device to compile the model for. If not set, it is inferred from the GPUs available locally.
- --host HOST
The host LLVM triple to compile the model for. If not set, it is inferred from the local CPU and OS. Examples of LLVM triples:
iPhones: arm64-apple-ios;
ARM64 Android phones: aarch64-linux-android;
WebAssembly: wasm32-unknown-unknown-wasm;
Windows: x86_64-pc-windows-msvc;
ARM macOS: arm64-apple-darwin.
- --opt OPT
Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2, O3, where O0 means no optimization, O2 means the majority of them, and O3 represents extreme optimization that could potentially break the system.
Meanwhile, optimization flags can be explicitly specified via detailed knobs, e.g. --opt="cutlass_attn=1;cutlass_norm=0;cublas_gemm=0;cudagraph=0".
- --system-lib-prefix SYSTEM_LIB_PREFIX
Adds a prefix to all exported symbols, similar to objcopy --prefix-symbols. This is useful when compiling multiple models into a single library to avoid symbol conflicts. Unlike objcopy, this has no effect for shared libraries.
- --output OUTPUT
The path to the output file. The suffix determines whether the output file is a shared library or objects. Available suffixes:
Linux: .so (shared), .tar (objects);
macOS: .dylib (shared), .tar (objects);
Windows: .dll (shared), .tar (objects);
Android, iOS: .tar (objects);
Web: .wasm (WebAssembly).
- --overrides OVERRIDES
Model configuration overrides, applied on top of mlc-chat-config.json. Supports context_window_size, prefill_chunk_size, sliding_window, max_batch_size, and tensor_parallel_shards. The overrides are specified via detailed knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128".