Model Prebuilts¶

Overview ¶

MLC-LLM is a universal solution for deploying different language models. Any models that can be described in TVM Relax (a general representation for Neural Networks and can be imported from models written in PyTorch) can be recognized by MLC-LLM and thus deployed to different backends with the help of TVM Unity.

There are two ways to run a model on MLC-LLM (this page focuses on the second one):

Compile your own models following the model compilation page.
Use off-the-shelf prebuilt models following this current page.

In order to run a specific model on MLC-LLM, you need:

1. A model library: a binary file containing the end-to-end functionality to inference a model (e.g. Llama-2-7b-chat-hf-q4f16_1-cuda.so). See the full list of all precompiled model libraries here.

2. Compiled weights: a folder containing multiple files that store the compiled and quantized weights of a model (e.g. https://huggingface.co/mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC). See the full list of all precompiled weights here.

In this page, we first quickly go over how to use prebuilts for different platforms, then track what current prebuilt models we provide.

Using Prebuilt Models for Different Platforms ¶

We quickly go over how to use prebuilt models for each platform. You can find detailed instruction on each platform’s corresponding page.

Prebuilt Models on CLI / Python

For more, please see the CLI page, and the the Python page.

Click to show details

First create the conda environment if you have not done so.

conda create -n mlc-chat-venv -c mlc-ai -c conda-forge mlc-chat-cli-nightly
conda activate mlc-chat-venv
conda install git git-lfs
git lfs install

Download the prebuilt model libraries from github.

mkdir dist/
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt_libs

Run the model with CLI:

mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

To run the model with Python API, see the Python page (all other downloading steps are the same as CLI).

Prebuilt Models on iOS

For more, please see the iOS page.

Click to show details

The iOS app has builtin RedPajama-3B and Mistral-7B-Instruct-v0.2 support.

All prebuilt models with an entry in iOS in the model library table are supported by iOS. Namely, we have:

Prebuilt Models for iOS¶
Model Code	Model Series	Quantization Mode	MLC HuggingFace Weights Repo
Mistral-7B-Instruct-v0.2-q3f16_1	Mistral	Weight storage data type: int3 Running data type: float16 Symmetric quantization	link
RedPajama-INCITE-Chat-3B-v1-q4f16_1	RedPajama	Weight storage data type: int4 Running data type: float16 Symmetric quantization	link
phi-2-q4f16_1	Microsoft Phi-2	Weight storage data type: int4 Running data type: float16 Symmetric quantization	link

Prebuilt Models on Android

For more, please see the Android page.

Click to show details

The apk for demo Android app includes the following models. To add more, check out the Android page.

Prebuilt Models for Android¶
Model code	Model Series	Quantization Mode	Hugging Face repo
Llama-2-7b-q4f16_1	Llama	Weight storage data type: int4 Running data type: float16 Symmetric quantization	link
RedPajama-INCITE-Chat-3B-v1-q4f16_1	RedPajama	Weight storage data type: int4 Running data type: float16 Symmetric quantization	link

Level 1: Supported Model Architectures (The All-In-One Table)¶

For each model architecture (e.g. Llama), there are multiple variants (e.g. CodeLlama, WizardLM). The variants share the same code for inference and only differ in their weights. In other words, running CodeLlama and WizardLM can use the same model library file (specified in Level 2 tables), but different precompiled weights (specified in Level 3 tables). Note that we have not provided prebuilt weights for all model variants.

Each entry below hyperlinks to the corresponding level 2 and level 3 tables.

MLC-LLM supports the following model architectures:

Supported Model Architectures¶
Model Architecture	Support	Available MLC Prebuilts	Unavailable in MLC Prebuilts
LLaMA	Prebuilt Model Library MLC Implementation	Llama-2-chat	Code Llama Vicuna WizardLM WizardCoder (new) OpenOrca Platypus2 FlagAlpha Llama-2 Chinese georgesung Llama-2 Uncensored Alpaca Guanaco OpenLLaMA Gorilla YuLan-Chat
Mistral	Prebuilt Model Library MLC Implementation	Mistral-7B-Instruct-v0.2 NeuralHermes-2.5-Mistral-7B OpenHermes-2.5-Mistral-7B WizardMath-7B-V1.1
GPT-NeoX	Prebuilt Model Library MLC Implementation	RedPajama	Dolly Pythia StableCode
GPTBigCode	Prebuilt Model Library MLC Implementation		StarCoder SantaCoder WizardCoder (old)
Phi	Prebuilt Model Library MLC Implementation	Phi-1_5 Phi-2
GPT2	Prebuilt Model Library MLC Implementation	GPT2

If the model variant you are interested in uses one of these model architectures we support, (but we have not provided the prebuilt weights yet), you can check out Convert Weights via MLC on how to convert the weights. Afterwards, you may follow (Optional) 3. Upload weights to HF to upload your prebuilt weights to hugging face, and submit a PR that adds an entry to this page, contributing to the community.

For models structured in an architecture we have not supported yet, you could:

Either create a [Model Request] issue which automatically shows up on our Model Request Tracking Board.
Or follow our tutorial Define New Models, which introduces how to bring a new model architecture to MLC-LLM.

Level 2: Model Library Tables (Precompiled Binary Files)¶

As mentioned earlier, each model architecture corresponds to a different model library file. That is, you cannot use the same model library file to run RedPajama and Llama-2. However, you can use the same Llama model library file to run Llama-2, WizardLM, CodeLlama, etc, but just with different weight files (from tables in Level 3).

Each table below demonstrates the pre-compiled model library files for each model architecture. This is categorized by:

Size: each size of model has its own distinct model library file (e.g. 7B or 13B number of parameters)
Platform: the backend that the model library is intended to be run on (e.g. CUDA, ROCm, iphone, etc.)
Quantization scheme: the model library file also differs due to the quantization scheme used. For more on this, please see the quantization page (e.g. q3f16_1 vs. q4f16_1).

Each entry links to the specific model library file found in this github repo.

If the model library you found is not available as a prebuilt, you can compile it yourself by following the model compilation page, and submit a PR to the repo binary-mlc-llm-libs afterwards.

Llama ¶

Llama¶
	CUDA	Vulkan (Linux)	Metal (M Chip)	Android	webgpu
7B	q4f16_1 q4f32_1	q4f16_1 q4f32_1	q4f16_1 q4f32_1	q4f16_1 q4f32_1	q4f16_1 q4f32_1
13B	q4f16_1	q4f16_1	q4f16_1		q4f16_1
34B
70B	q4f16_1	q4f16_1	q4f16_1		q4f16_1

Mistral ¶

Mistral¶
	CUDA	ROCm	Vulkan (Linux)	Vulkan (Windows)	Metal (M Chip)	Metal (Intel)	iOS	Android	webgpu	mali
7B	q4f16_1		q4f16_1		q4f16_1		q3f16_1	q4f16_1	q4f16_1

GPT-NeoX (RedPajama-INCITE)¶

GPT-NeoX (RedPajama-INCITE)¶
	CUDA	ROCm	Vulkan (Linux)	Vulkan (Windows)	Metal (M Chip)	Metal (Intel)	iOS	Android	webgpu	mali
3B	q4f16_1 q4f32_1		q4f16_1 q4f32_1		q4f16_1 q4f32_1		q4f16_1	q4f16_1 q4f32_1	q4f16_1 q4f32_1

GPTBigCode ¶

GPTBigCode¶
	CUDA	ROCm	Vulkan (Linux)	Vulkan (Windows)	Metal (M Chip)	Metal (Intel)	iOS	Android	webgpu	mali
15B

Phi ¶

Phi¶
	CUDA	ROCm	Vulkan (Linux)	Vulkan (Windows)	Metal (M Chip)	Metal (Intel)	iOS	Android	webgpu	mali
Phi-2 (2.7B)	q0f16 q4f16_1		q0f16 q4f16_1		q0f16 q4f16_1				q0f16 q4f16_1
Phi-1.5 (1.3B)	q0f16 q4f16_1		q0f16 q4f16_1		q0f16 q4f16_1				q0f16 q4f16_1

GPT2 ¶

GPT2¶
	CUDA	ROCm	Vulkan (Linux)	Vulkan (Windows)	Metal (M Chip)	Metal (Intel)	iOS	Android	webgpu	mali
GPT2 (124M)	q0f16		q0f16		q0f16				q0f16
GPT2-med (355M)	q0f16		q0f16		q0f16				q0f16

Level 3: Model Variant Tables (Precompiled Weights)¶

Finally, for each model variant, we provide the precompiled weights we uploaded to hugging face.

Each precompiled weight is categorized by its model size (e.g. 7B vs. 13B) and the quantization scheme (e.g. q3f16_1 vs. q4f16_1). We note that the weights are platform-agnostic.

Each model variant also loads its conversation configuration from a pre-defined conversation template. Note that multiple model variants can share a common conversation template.

Some of these files are uploaded by our community contributors–thank you!

Llama-2 ¶

Conversation template: llama-2

Llama-2¶
Size	Hugging Face Repo Link
7B	q4f16_1 (Chat) q4f32_1 (Chat)
13B	q4f16_1
70B	q4f16_1

Mistral ¶

Conversation template: mistral_default

Mistral¶
Size	Hugging Face Repo Link
7B	q3f16_1 (Instruct) q4f16_1 (Instruct)

NeuralHermes-2.5-Mistral ¶

Conversation template: neural_hermes_mistral

Neural Hermes¶
Size	Hugging Face Repo Link
7B	q4f16_1

OpenHermes-2-Mistral ¶

Conversation template: open_hermes_mistral

Open Hermes¶
Size	Hugging Face Repo Link
7B	q4f16_1

WizardMath V1.1 ¶

Conversation template: wizard_coder_or_math

WizardMath¶
Size	Hugging Face Repo Link
7B	q4f16_1

RedPajama ¶

Conversation template: redpajama_chat

Red Pajama¶
Size	Hugging Face Repo Link
3B	q4f16_1 (Chat) q4f32_1 (Chat)

Phi ¶

Conversation template: phi-2

Phi¶
Size	Hugging Face Repo Link
Phi-2 (2.7B)	q0f16 q4f16_1
Phi-1.5 (1.3B)	q0f16 q4f16_1

GPT2 ¶

Conversation template: gpt2

GPT2¶
Size	Hugging Face Repo Link
GPT2 (124M)	q0f16
GPT2-medium (355M)	q0f16

Contribute Models to MLC-LLM ¶

Ready to contribute your compiled models/new model architectures? Awesome! Please check Contribute New Models to MLC-LLM on how to contribute new models to MLC-LLM.