Model Prebuilts¶
Overview¶
MLC-LLM is a universal solution for deploying different language models. Any model that can be described in TVM Relax (a general representation for neural networks, importable from models written in PyTorch) can be recognized by MLC-LLM and thus deployed to different backends with the help of TVM Unity.
There are two ways to run a model on MLC-LLM:
Compile your own models following the model compilation page.
Use off-the-shelf prebuilt models following this page.
This page focuses on the second option:
Documenting how to use prebuilts for various platforms, and
Tracking the prebuilt models we currently provide.
Prerequisite: Model Libraries and Compiled Weights¶
In order to run a specific model on MLC-LLM, you need:
1. A model library: a binary file containing the end-to-end functionality to run inference on a model (e.g. Llama-2-7b-chat-hf-q4f16_1-cuda.so). See the full list of all precompiled model libraries here.
2. Compiled weights: a folder containing multiple files that store the compiled and quantized weights of a model (e.g. https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1). See the full list of all precompiled weights here.
Using Prebuilt Models for Different Platforms¶
We quickly go over how to use prebuilt models for each platform. You can find detailed instructions on each platform's corresponding page.
Prebuilt Models on CLI / Python¶
For more, please see the CLI page and the Python page.
First create the conda environment if you have not done so.
conda create -n mlc-chat-venv -c mlc-ai -c conda-forge mlc-chat-cli-nightly
conda activate mlc-chat-venv
conda install git git-lfs
git lfs install
Download the prebuilt model libraries from GitHub.
mkdir -p dist/prebuilt
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib
Download the prebuilt model weights from Hugging Face for the model variant you want.
# Say we want to run rwkv-raven-7b-q8f16_0
cd dist/prebuilt
git clone https://huggingface.co/mlc-ai/mlc-chat-rwkv-raven-7b-q8f16_0
cd ../..

# The general format:
# cd dist/prebuilt
# git clone https://huggingface.co/mlc-ai/mlc-chat-[model-code]
# cd ../..
# mlc_chat_cli --model [model-code]
Run the model with CLI:
# For CLI
mlc_chat_cli --model rwkv-raven-7b-q8f16_0
To run the model with the Python API, see the Python page (all other downloading steps are the same as for the CLI).
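For reference, here is a minimal sketch of what this looks like with the mlc_chat package, assuming the model libraries and weights were downloaded into dist/prebuilt as above; please check the Python page for the exact API of your installed version.

from mlc_chat import ChatModule

# Load the prebuilt weights and model library downloaded above;
# the model is resolved by its model code under dist/prebuilt/.
cm = ChatModule(model="rwkv-raven-7b-q8f16_0")

# Generate a reply for a single prompt and print it.
output = cm.generate(prompt="What is the capital of Canada?")
print(output)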
Prebuilt Models on iOS¶
For more, please see the iOS page.
The iOS app has built-in RedPajama-3B and Llama-2-7b support.
All prebuilt models with an entry in the iOS column of the model library table are supported by the iOS app. Namely, we have:
| Model library name | Model Family | Quantization Mode |
|---|---|---|
| Llama-2-7b-chat-hf-q3f16_1 | LLaMA | q3f16_1 |
| vicuna-v1-7b-q3f16_0 | LLaMA | q3f16_0 |
| RedPajama-INCITE-Chat-3B-v1-q4f16_1 | GPT-NeoX | q4f16_1 |
As for prebuilt model weights, the ones we have integrated into the app are listed below:
| Model code | Model Series | Quantization Mode | Hugging Face repo |
|---|---|---|---|
| Llama-2-7b-q3f16_1 | Llama-2 | q3f16_1 |  |
| vicuna-v1-7b-q3f16_0 | Vicuna | q3f16_0 |  |
| RedPajama-INCITE-Chat-3B-v1-q4f16_1 | RedPajama | q4f16_1 |  |
To run a model variant you compiled on your own, you can directly reuse the above integrated prebuilt model libraries, as long as the model shares the architecture and is compiled with the same quantization mode. For example, if you compile OpenLLaMA-7B with quantization mode q3f16_0, then you can run the compiled OpenLLaMA model on iPhone without rebuilding the iOS app by reusing the vicuna-v1-7b-q3f16_0 model library. You can then upload the compiled weights to Hugging Face so that the app can download them (for more on uploading to Hugging Face, please check the model distribution page).
To add a model to the iOS app, follow the steps on the iOS page.
Prebuilt Models on Android¶
For more, please see the Android page.
The demo Android app APK includes the following models. To add more, check out the Android page.
| Model code | Model Series | Quantization Mode | Hugging Face repo |
|---|---|---|---|
| Llama-2-7b-q4f16_1 | Llama-2 | q4f16_1 |  |
| RedPajama-INCITE-Chat-3B-v1-q4f16_1 | RedPajama | q4f16_1 |  |
Level 1: Supported Model Architectures (The All-In-One Table)¶
For each model architecture (e.g. Llama), there are multiple variants (e.g. CodeLlama, WizardLM). The variants share the same code for inference and only differ in their weights. In other words, running CodeLlama and WizardLM can use the same model library file (specified in Level 2 tables), but different precompiled weights (specified in Level 3 tables). Note that we have not provided prebuilt weights for all model variants.
Each entry below hyperlinks to the corresponding level 2 and level 3 tables.
MLC-LLM supports the following model architectures:
| Model Architecture | Support | Available MLC Prebuilts | Unavailable in MLC Prebuilts |
|---|---|---|---|
| Llama |  | Llama-2, Code Llama, Vicuna, WizardLM, WizardMath, OpenOrca Platypus2, FlagAlpha Llama-2 Chinese, Llama2 uncensored |  |
| GPT-NeoX |  | RedPajama |  |
| RWKV |  | RWKV-raven |  |
| GPTBigCode |  | WizardCoder |  |
If the model variant you are interested in uses one of the model architectures we support (but we have not provided the prebuilt weights yet), you can check out Compile Models via MLC on how to compile your own models. Afterwards, you may follow Distribute Compiled Models to upload your prebuilt weights to Hugging Face, and submit a PR that adds an entry to this page, contributing back to the community.
For models structured in an architecture we have not supported yet, you could:
Either create a [Model Request] issue which automatically shows up on our Model Request Tracking Board.
Or follow our tutorial Define New Models, which introduces how to bring a new model architecture to MLC-LLM.
Level 2: Model Library Tables (Precompiled Binary Files)¶
As mentioned earlier, each model architecture corresponds to a different model library file. That is, you cannot use the same model library file to run RedPajama and Llama-2. However, you can use the same Llama model library file to run Llama-2, WizardLM, CodeLlama, etc., just with different weight files (from the tables in Level 3).
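As an illustrative sketch of this reuse with the mlc_chat Python package: the CodeLlama weight-folder name below is hypothetical, and model_lib_path is assumed to be the ChatModule parameter that points at a specific library file (check the Python page for the exact signature of your installed version).

from mlc_chat import ChatModule

# Both variants use the Llama-7B architecture, so they can share one
# compiled model library and differ only in their weight folders.
shared_lib = "dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-cuda.so"

llama2 = ChatModule(
    model="dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1",
    model_lib_path=shared_lib,
)
codellama = ChatModule(
    model="dist/prebuilt/mlc-chat-CodeLlama-7b-hf-q4f16_1",  # hypothetical weight folder
    model_lib_path=shared_lib,
)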
Each table below lists the precompiled model library files for each model architecture, categorized by:
Size: each model size has its own distinct model library file (e.g. 7B or 13B parameters)
Platform: the backend that the model library is intended to be run on (e.g. CUDA, ROCm, iphone, etc.)
Quantization scheme: the model library file also differs with the quantization scheme used (e.g. q3f16_1 vs. q4f16_1; roughly, the name encodes the bit-width of the quantized weights and the float precision used for computation). For more on this, please see the model compilation page.
Each entry links to the specific model library file found in this GitHub repo.
Llama¶
| Size | CUDA | ROCm | Vulkan (Linux) | Vulkan (Windows) | Metal (M1/M2) | Metal (Intel) | iOS | webgpu | mali |
|---|---|---|---|---|---|---|---|---|---|
| 7B |  |  |  |  |  |  |  |  |  |
| 13B |  |  |  |  |  |  |  |  |  |
| 34B |  |  |  |  |  |  |  |  |  |
| 70B |  |  |  |  |  |  |  |  |  |
GPT-NeoX (RedPajama-INCITE)¶
| Size | CUDA | ROCm | Vulkan (Linux) | Vulkan (Windows) | Metal (M1/M2) | Metal (Intel) | iOS | webgpu | mali |
|---|---|---|---|---|---|---|---|---|---|
| 3B |  |  |  |  |  |  |  |  |  |
RWKV¶
| Size | CUDA | ROCm | Vulkan (Linux) | Vulkan (Windows) | Metal (M1/M2) | Metal (Intel) | iOS | webgpu | mali |
|---|---|---|---|---|---|---|---|---|---|
| 1B5 |  |  |  |  |  |  |  |  |  |
| 3B |  |  |  |  |  |  |  |  |  |
| 7B |  |  |  |  |  |  |  |  |  |
GPTBigCode¶
Note that these all link to model libraries for WizardCoder (the older version released in June 2023). However, any GPTBigCode model variant should be able to reuse them (e.g. StarCoder, SantaCoder).
| Size | CUDA | ROCm | Vulkan (Linux) | Vulkan (Windows) | Metal (M1/M2) | Metal (Intel) | iOS | webgpu | mali |
|---|---|---|---|---|---|---|---|---|---|
| 15B |  |  |  |  |  |  |  |  |  |
Level 3: Model Variant Tables (Precompiled Weights)¶
Finally, for each model variant, we provide the precompiled weights we uploaded to Hugging Face.
Each precompiled weight is categorized by its model size (e.g. 7B vs. 13B) and the quantization scheme (e.g. q3f16_1 vs. q4f16_1). Note that the weights are platform-agnostic.
Each model variant also loads its conversation configuration from a pre-defined conversation template. Note that multiple model variants can share a common conversation template.
Some of these files were uploaded by our community contributors. Thank you!
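As a minimal sketch of how this works with the mlc_chat Python package (ChatConfig and its conv_template field are assumed from that package; normally the template is picked up automatically from the weight folder's configuration, so the override below is optional):

from mlc_chat import ChatModule, ChatConfig

# Explicitly select the llama-2 conversation template for a Llama-2
# chat variant; multiple variants can share the same template.
cm = ChatModule(
    model="Llama-2-7b-chat-hf-q4f16_1",
    chat_config=ChatConfig(conv_template="llama-2"),
)
print(cm.generate(prompt="Hello!"))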
Llama-2¶
Conversation template: llama-2
| Size | Hugging Face Repo Link |
|---|---|
| 7B |  |
| 13B |  |
| 70B |  |
Code Llama¶
Conversation template: codellama_completion
| Size | Hugging Face Repo Link |
|---|---|
| 7B |  |
| 13B |  |
| 34B |  |
Vicuna¶
Conversation template: vicuna_v1.1
| Size | Hugging Face Repo Link |
|---|---|
| 7B |  |
WizardLM¶
Conversation template: vicuna_v1.1
| Size | Hugging Face Repo Link |
|---|---|
| 13B |  |
| 70B |  |
WizardMath¶
Conversation template: wizard_coder_or_math
| Size | Hugging Face Repo Link |
|---|---|
| 7B |  |
| 13B |  |
| 70B |  |
OpenOrca Platypus2¶
Conversation template: llama-2
| Size | Hugging Face Repo Link |
|---|---|
| 13B |  |
FlagAlpha Llama-2 Chinese¶
Conversation template: llama-2
| Size | Hugging Face Repo Link |
|---|---|
| 7B |  |
Llama2 uncensored (georgesung)¶
Conversation template: llama-default
| Size | Hugging Face Repo Link |
|---|---|
| 7B |  |
RedPajama¶
Conversation template: LM
| Size | Hugging Face Repo Link |
|---|---|
| 3B |  |
RWKV-raven¶
Conversation template: rwkv
| Size | Hugging Face Repo Link |
|---|---|
| 1B5 |  |
| 3B |  |
| 7B |  |
WizardCoder¶
Conversation template: wizard_coder_or_math
| Size | Hugging Face Repo Link |
|---|---|
| 15B |  |
Contribute Models to MLC-LLM¶
Ready to contribute your compiled models or new model architectures? Awesome! Please check Contribute New Models to MLC-LLM for how to do so.