Distribute Compiled Models
This page describes how to distribute a model you compiled so that others can use it in the MLC chat runtime. For demonstration purposes, we show how to compile the RedPajama-3B instruct model (which has different weights from the RedPajama chat model).
If you have not compiled the RedPajama-3B instruct model, you can use one of the following commands to compile it, depending on your target backend:

```shell
# Metal (macOS)
python3 -m mlc_llm.build --hf-path togethercomputer/RedPajama-INCITE-Instruct-3B-v1 --target metal --quantization q4f16_1
# CUDA
python3 -m mlc_llm.build --hf-path togethercomputer/RedPajama-INCITE-Instruct-3B-v1 --target cuda --quantization q4f16_1
# Vulkan
python3 -m mlc_llm.build --hf-path togethercomputer/RedPajama-INCITE-Instruct-3B-v1 --target vulkan --quantization q4f16_1
```
To begin with, we can check that the compilation artifacts are ready on disk.
```shell
~/mlc-llm > ls dist/RedPajama-INCITE-Instruct-3B-v1-q4f16_1
RedPajama-INCITE-Instruct-3B-v1-q4f16_1-metal.so  # ===> the model library
mod_cache_before_build_metal.pkl                  # ===> a cached file for future builds
params                                            # ===> containing the model weights, tokenizer and chat config

~/mlc-llm > ls dist/RedPajama-INCITE-Instruct-3B-v1-q4f16_1/params
mlc-chat-config.json                              # ===> the chat config
ndarray-cache.json                                # ===> the model weight info
params_shard_0.bin                                # ===> the model weights
params_shard_1.bin
...
tokenizer.json                                    # ===> the tokenizer files
tokenizer_config.json
```
You are expected to see the same folder structure for the model you compiled.
You can optionally customize the chat config file
dist/RedPajama-INCITE-Instruct-3B-v1-q4f16_1/params/mlc-chat-config.json (check out Configure MLCChat in JSON for more detailed instructions).
You can also simply use the default configuration and skip this step.
For demonstration purposes, we update mean_gen_len to 32 and max_gen_len to 64. We also update conv_template to "LM" because the model is instruction-tuned.
An MLC chat app needs to look for the model library to run the model.
In the case of the RedPajama-3B instruct model, we already have a prebuilt model lib for the RedPajama-3B chat model that shares the same model architecture and quantization mode as the instruct model.
We can edit dist/RedPajama-INCITE-Instruct-3B-v1-q4f16_1/params/mlc-chat-config.json and update the value of the model_lib field to "RedPajama-INCITE-Chat-3B-v1-q4f16_1".
We recommend reusing the model lib for the same architecture with different weight variants.
You can use the --reuse-lib flag in the compilation command to specify the library you want to reuse, or edit the chat config afterward.
Reusing a model lib allows us to run the model on existing MLC apps (e.g., the iOS app) that require static packaging.
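As a sketch of the "edit the chat config afterward" route, the repointing can be done with a few lines of Python. The function name `point_to_chat_lib` is ours; the model_lib field and the chat-model library name are the ones used on this page.

```python
import json
from pathlib import Path

# Chat config of the instruct model we want to repoint (adjust as needed).
CONFIG = Path("dist/RedPajama-INCITE-Instruct-3B-v1-q4f16_1/params/mlc-chat-config.json")

def point_to_chat_lib(path: Path, lib: str = "RedPajama-INCITE-Chat-3B-v1-q4f16_1") -> dict:
    """Point model_lib at an existing library built for the same
    architecture and quantization mode, then write the config back."""
    config = json.loads(path.read_text())
    config["model_lib"] = lib
    path.write_text(json.dumps(config, indent=2))
    return config

# Usage: point_to_chat_lib(CONFIG)
```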
For example, if you have compiled RedPajama-3B chat model before, then you can use the following command to compile the instruct model, which reuses the compiled chat model library:
```shell
python3 -m mlc_llm.build --hf-path togethercomputer/RedPajama-INCITE-Instruct-3B-v1 --reuse-lib RedPajama-INCITE-Chat-3B-v1-q4f16_1 --target [your target] --quantization q4f16_1
```
In this way, mlc_llm.build does not produce a new model library for the instruct model, and the model_lib field in mlc-chat-config.json is set to "RedPajama-INCITE-Chat-3B-v1-q4f16_1".
Please note that only models with the same architecture, compiled with the same quantization mode, can reuse and share a model library.
We should distribute the generated model lib if we want to build a new model architecture or try out customized compilation optimizations.
In this case, we should keep the model_lib field as the default value, "RedPajama-INCITE-Instruct-3B-v1-q4f16_1".
You can upload the model library (e.g., RedPajama-INCITE-Instruct-3B-v1-q4f16_1-metal.so) and ask others to download it to the dist/prebuilt/lib directory so that the CLI app can pick it up.
As a next step, we need to upload the model weights.
We only need to upload the files in dist/RedPajama-INCITE-Instruct-3B-v1-q4f16_1/params.
If you also want to host the compiled models on Hugging Face, you can follow the instructions below:
```shell
# First, please create a repository on Hugging Face.
# With the repository created, run
git lfs install
git clone https://huggingface.co/my-huggingface-account/my-redpajama3b-weight-huggingface-repo
cd my-redpajama3b-weight-huggingface-repo
cp path/to/mlc-llm/dist/RedPajama-INCITE-Instruct-3B-v1-q4f16_1/params/* .
git add . && git commit -m "Add redpajama-3b instruct model weights"
git push origin main
```
Here we provide an example repository for the distributed RedPajama-3B instruct model that you can refer to.
Good job! You have successfully distributed the model you compiled. Next, we will talk about how to consume the model weights in applications.
The steps to run the compiled model in the CLI are similar to those for downloading prebuilt model weights and libraries.
```shell
# Clone prebuilt libs so we can reuse them:
mkdir -p dist/prebuilt
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib

# Or download the model library (only needed if we do not reuse the model lib):
cd dist/prebuilt/lib
wget url-to-my-model-lib
cd ../../..

# Download the model weights
cd dist/prebuilt
git clone https://huggingface.co/my-huggingface-account/my-redpajama3b-weight-huggingface-repo RedPajama-INCITE-Instruct-3B-v1-q4f16_1
cd ../..

# Run CLI
mlc_chat_cli --model RedPajama-INCITE-Instruct-3B-v1-q4f16_1
```
For the iOS app, model libraries are statically packed into the app at build time. Therefore, the iOS app can only run models whose model libraries are integrated into the app. You can check the list of supported model libraries.
To download and run the compiled RedPajama-3B instruct model on iPhone, we need to reuse the integrated
RedPajama-INCITE-Chat-3B-v1-q4f16_1 model library.
Please revisit Step 3. Specify the Model Lib and make sure the model_lib field of mlc-chat-config.json is set to "RedPajama-INCITE-Chat-3B-v1-q4f16_1".
Now we can download the model weights in the iOS app and run the model by following the steps below:
Paste the repository URL of the model built on your own, and click “Add”.
You can refer to the link in the image as an example.