Model Prebuilts

Overview

MLC-LLM is a universal solution for deploying different language models. Any model that can be described in TVM Relax (a general representation for neural networks, importable from models written in PyTorch) can be recognized by MLC-LLM and thus deployed to different backends with the help of TVM Unity.
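To make this concrete, here is a rough, illustrative sketch of importing a PyTorch module into Relax via TVM Unity's torch.fx frontend (a minimal sketch: the toy module and shapes are assumptions for illustration, and the exact interface may differ across TVM versions):

import torch
from torch import fx
from tvm.relax.frontend.torch import from_fx

# A toy PyTorch module, standing in for a real language model.
class TwoLayerMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(16, 32)
        self.fc2 = torch.nn.Linear(32, 8)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Trace the module with torch.fx, then convert the traced graph into a
# Relax IRModule, declaring the input shape and dtype up front.
graph_module = fx.symbolic_trace(TwoLayerMLP())
mod = from_fx(graph_module, [((1, 16), "float32")])
print(mod)  # a Relax IRModule that TVM Unity can compile for many backends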

There are two ways to run a model on MLC-LLM:

  1. Compile your own models following the model compilation page.

  2. Use off-the-shelf prebuilt models following this current page.

This page focuses on the second option.

Prerequisite: Model Libraries and Compiled Weights

In order to run a specific model on MLC-LLM, you need:

1. A model library: a binary file containing the end-to-end functionality to run inference on a model (e.g. Llama-2-7b-chat-hf-q4f16_1-cuda.so). See the full list of all precompiled model libraries here.

2. Compiled weights: a folder containing multiple files that store the compiled and quantized weights of a model (e.g. https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1). See the full list of all precompiled weights here.
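For reference, after following the download steps in the CLI section below, a working setup might look like this (a sketch; the exact files vary by model, and the Llama-2 names here are illustrative):

dist/prebuilt/
  lib/                                   # model libraries, cloned from GitHub
    Llama-2-7b-chat-hf-q4f16_1-cuda.so   # one library per model, quantization, and platform
  mlc-chat-Llama-2-7b-chat-hf-q4f16_1/   # compiled weights, cloned from Hugging Face
    mlc-chat-config.json                 # runtime configuration, incl. the conversation template
    ndarray-cache.json                   # index of the quantized weight shards
    params_shard_0.bin                   # quantized weight shards (many such files)
    tokenizer.model                      # tokenizer file(s)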

Using Prebuilt Models for Different Platforms

We quickly go over how to use prebuilt models on each platform. You can find detailed instructions on each platform’s corresponding page.

Prebuilt Models on CLI / Python

For more, please see the CLI page and the Python page.


First, create the conda environment if you have not already done so.

conda create -n mlc-chat-venv -c mlc-ai -c conda-forge mlc-chat-cli-nightly
conda activate mlc-chat-venv
conda install git git-lfs
git lfs install

Download the prebuilt model libraries from GitHub.

mkdir -p dist/prebuilt
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib

Download the prebuilt model weights from Hugging Face for the model variant you want.

# Say we want to run rwkv-raven-7b-q8f16_0
cd dist/prebuilt
git clone https://huggingface.co/mlc-ai/mlc-chat-rwkv-raven-7b-q8f16_0
cd ../..

# The format being:
# cd dist/prebuilt
# git clone https://huggingface.co/mlc-ai/mlc-chat-[model-code]
# cd ../..
# mlc_chat_cli --model [model-code]

Run the model with CLI:

# For CLI
mlc_chat_cli --model rwkv-raven-7b-q8f16_0

To run the model with Python API, see the Python page (all other downloading steps are the same as CLI).
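As a minimal sketch of the Python route (assuming the mlc_chat package is installed and the libraries and weights were downloaded into dist/prebuilt as above; see the Python page for the authoritative API):

from mlc_chat import ChatModule

# Load the prebuilt weights and the matching model library from dist/prebuilt.
cm = ChatModule(model="rwkv-raven-7b-q8f16_0")

# Generate a reply for a single prompt and print it.
output = cm.generate(prompt="What is the capital of Canada?")
print(output)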


Prebuilt Models on iOS

For more, please see the iOS page.


The iOS app has built-in support for RedPajama-3B and Llama-2-7b.

All prebuilt models with an entry in the iOS column of the model library table are supported by the iOS app. Namely, we have:

Prebuilt model libraries integrated in the iOS app:

Model library name                  | Model family | Quantization mode
Llama-2-7b-chat-hf-q3f16_1          | LLaMA        | int3 weight storage, float16 running data, symmetric
vicuna-v1-7b-q3f16_0                | LLaMA        | int3 weight storage, float16 running data, symmetric
RedPajama-INCITE-Chat-3B-v1-q4f16_1 | GPT-NeoX     | int4 weight storage, float16 running data, symmetric

As for prebuilt model weights, the ones we have integrated into the app are listed below:

Tested prebuilt model weights for iOS:

Model code                          | Model series | Quantization mode                                    | Hugging Face repo
Llama-2-7b-q3f16_1                  | Llama        | int3 weight storage, float16 running data, symmetric | link
vicuna-v1-7b-q3f16_0                | Vicuna       | int3 weight storage, float16 running data, symmetric | link
RedPajama-INCITE-Chat-3B-v1-q4f16_1 | RedPajama    | int4 weight storage, float16 running data, symmetric | link

To run a model variant you compiled on your own, you can directly reuse the above integrated prebuilt model libraries, as long as your model shares the architecture and quantization mode of a supported model. For example, if you compile OpenLLaMA-7B with quantization mode q3f16_0, you can run it on iPhone without rebuilding the iOS app by reusing the vicuna-v1-7b-q3f16_0 model library. You can then upload the compiled weights to Hugging Face so that you can download them in the app as shown below (for more on uploading to Hugging Face, please check the model distribution page).
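Since compiled weight folders are plain files, a minimal upload sketch with git and Git LFS might look like this (the repo name is hypothetical; see the model distribution page for the authoritative steps):

# Create an empty model repo on huggingface.co first, then:
git lfs install
git clone https://huggingface.co/my-username/OpenLLaMA-7B-q3f16_0  # hypothetical repo
cp /path/to/compiled/OpenLLaMA-7B-q3f16_0/* OpenLLaMA-7B-q3f16_0/
cd OpenLLaMA-7B-q3f16_0
git add .
git commit -m "Add compiled OpenLLaMA-7B q3f16_0 weights"
git push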

To add a model to the iOS app, follow the steps below:

Open the “MLCChat” app and click “Add model variant”.

(Screenshot: https://raw.githubusercontent.com/mlc-ai/web-data/main/images/mlc-llm/tutorials/iPhone-custom-1.png)

Prebuilt Models on Android

For more, please see the Android page.


The demo Android APK includes the following models. To add more, check out the Android page.

Prebuilt Models for Android:

Model code                          | Model series | Quantization mode                                    | Hugging Face repo
Llama-2-7b-q4f16_1                  | Llama        | int4 weight storage, float16 running data, symmetric | link
RedPajama-INCITE-Chat-3B-v1-q4f16_1 | RedPajama    | int4 weight storage, float16 running data, symmetric | link


Level 1: Supported Model Architectures (The All-In-One Table)

For each model architecture (e.g. Llama), there are multiple variants (e.g. CodeLlama, WizardLM). The variants share the same code for inference and only differ in their weights. In other words, running CodeLlama and WizardLM can use the same model library file (specified in Level 2 tables), but different precompiled weights (specified in Level 3 tables). Note that we have not provided prebuilt weights for all model variants.
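For example, with the CLI workflow above, both of the following commands would load the same prebuilt Llama 13B q4f16_1 model library and differ only in the weight folder they read (the model codes here are illustrative; take the real ones from the Level 3 tables):

# Two Llama-architecture variants, one shared model library
mlc_chat_cli --model WizardLM-13B-V1.2-q4f16_1
mlc_chat_cli --model CodeLlama-13b-hf-q4f16_1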

Each entry below hyperlinks to the corresponding level 2 and level 3 tables.

MLC-LLM supports the following model architectures:

  • LLaMA

  • GPT-NeoX

  • GPT-J

  • RWKV

  • MiniGPT

  • GPTBigCode

  • ChatGLM

  • StableLM

If the model variant you are interested in uses one of these supported model architectures (but we have not provided the prebuilt weights yet), you can check out Compile Models via MLC on how to compile your own models. Afterwards, you may follow Distribute Compiled Models to upload your prebuilt weights to Hugging Face, and submit a PR that adds an entry to this page, contributing to the community.

For models structured in an architecture we have not supported yet, you could contribute support for the new architecture itself; see Contribute Models to MLC-LLM at the end of this page.

Level 2: Model Library Tables (Precompiled Binary Files)

As mentioned earlier, each model architecture corresponds to a different model library file. That is, you cannot use the same model library file to run RedPajama and Llama-2. However, you can use the same Llama model library file to run Llama-2, WizardLM, CodeLlama, etc., just with different weight files (from the tables in Level 3).

Each table below lists the precompiled model library files for one model architecture. The files are categorized by:

  • Size: each model size has its own distinct model library file (e.g. 7B vs. 13B parameters)

  • Platform: the backend that the model library is intended to run on (e.g. CUDA, ROCm, iOS, etc.)

  • Quantization scheme: the model library file also differs by the quantization scheme used (e.g. q3f16_1 vs. q4f16_1). As the iOS tables above illustrate, a code such as q3f16_1 denotes int3 weight storage with float16 running data; the trailing _0/_1 distinguishes variants of the quantization algorithm. For more on this, please see the model compilation page.

All tables share the same set of platform columns: CUDA, ROCm, Vulkan (Linux), Vulkan (Windows), Metal (M1/M2), Metal (Intel), iOS, webgpu, and mali. Not every size/quantization combination is prebuilt for every platform. Each entry links to the specific model library file in this GitHub repo.

Llama

Size | Quantization modes with prebuilt libraries
7B   | q4f16_1, q3f16_1 (iOS), q4f32_1 (webgpu)
13B  | q4f16_1, q4f32_1 (webgpu)
34B  | q4f16_1
70B  | q3f16_1, q4f16_1

GPT-NeoX (RedPajama-INCITE)

Size | Quantization modes with prebuilt libraries
3B   | q4f16_0, q4f16_1, q4f32_0, q4f32_1

RWKV

Size | Quantization modes with prebuilt libraries
1B5  | q8f16_0
3B   | q8f16_0
7B   | q8f16_0

GPTBigCode

Note that these all link to model libraries for WizardCoder (the older version released in June 2023). However, any GPTBigCode model variant should be able to reuse them (e.g. StarCoder, SantaCoder).

Size | Quantization modes with prebuilt libraries
15B  | q4f16_1, q4f32_1

Level 3: Model Variant Tables (Precompiled Weights)

Finally, for each model variant, we provide the precompiled weights we uploaded to Hugging Face.

Each precompiled weight is categorized by its model size (e.g. 7B vs. 13B) and the quantization scheme (e.g. q3f16_1 vs. q4f16_1). We note that the weights are platform-agnostic.

Each model variant also loads its conversation configuration from a pre-defined conversation template. Note that multiple model variants can share a common conversation template.
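Concretely, the template is recorded in the conv_template field of the weight folder's mlc-chat-config.json. A trimmed, illustrative excerpt (field values depend on the model; other fields omitted):

{
  "model_lib": "Llama-2-7b-chat-hf-q4f16_1",
  "local_id": "Llama-2-7b-chat-hf-q4f16_1",
  "conv_template": "llama-2"
}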

Some of these files were uploaded by our community contributors. Thank you!

Llama-2

Conversation template: llama-2

Sizes with prebuilt weights on Hugging Face: 7B, 13B, 70B

Code Llama

Conversation template: codellama_completion

Sizes with prebuilt weights on Hugging Face: 7B, 13B, 34B

Vicuna

Conversation template: vicuna_v1.1

Sizes with prebuilt weights on Hugging Face: 7B

WizardLM

Conversation template: vicuna_v1.1

Sizes with prebuilt weights on Hugging Face: 13B, 70B

WizardMath

Conversation template: wizard_coder_or_math

Sizes with prebuilt weights on Hugging Face: 7B, 13B (q4f16_1), 70B (q4f16_1)

OpenOrca Platypus2

Conversation template: llama-2

Sizes with prebuilt weights on Hugging Face: 13B (q4f16_1)

FlagAlpha Llama-2 Chinese

Conversation template: llama-2

Sizes with prebuilt weights on Hugging Face: 7B

Llama2 uncensored (georgesung)

Conversation template: llama-default

Sizes with prebuilt weights on Hugging Face: 7B

RedPajama

Conversation template: LM

Sizes with prebuilt weights on Hugging Face: 3B

RWKV-raven

Conversation template: rwkv

Sizes with prebuilt weights on Hugging Face: 1B5 (q8f16_0), 3B (q8f16_0), 7B (q8f16_0)

WizardCoder

Conversation template: wizard_coder_or_math

Sizes with prebuilt weights on Hugging Face: 15B (q4f16_1)


Contribute Models to MLC-LLM

Ready to contribute your compiled models or new model architectures? Awesome! Please check Contribute New Models to MLC-LLM for instructions on how to do so.