👋 Welcome to MLC LLM

Machine Learning Compilation for Large Language Models (MLC LLM) is a high-performance, universal deployment solution that lets any large language model be deployed natively, through native APIs and with compiler acceleration. The project's mission is to enable everyone to develop, optimize, and deploy AI models natively on their own devices using ML compilation techniques.

Getting Started

To get started, try MLC LLM's support for the int4-quantized Llama2 7B model. At least 6 GB of free VRAM is recommended to run it.
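As a rough sanity check on the VRAM recommendation, the size of the weights alone can be estimated from the parameter count and the bit width. The sketch below assumes roughly 7 billion parameters at 4 bits each and ignores quantization scales, the KV cache, and activations, all of which take up part of the remaining headroom. The function name is illustrative.

```python
def weight_footprint_bytes(num_params: int, bits_per_weight: int) -> int:
    """Approximate bytes needed to store the quantized weights alone."""
    return num_params * bits_per_weight // 8


# Assumption: ~7e9 parameters, 4 bits per weight, per-group scales not counted.
weights = weight_footprint_bytes(7_000_000_000, 4)
print(f"{weights / 1e9:.1f} GB of weights")  # prints: 3.5 GB of weights
```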

Install MLC LLM Python. MLC LLM is available via pip. Installing it in an isolated conda virtual environment is recommended.
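Before moving on, it can help to confirm that the package is visible from the active environment. This is a minimal sketch, assuming the pip package exposes the `mlc_llm` module imported in the Python script later in this guide. `CONDA_DEFAULT_ENV` is the variable conda sets when an environment is activated, and the helper name is illustrative.

```python
import importlib.util
import os


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# conda sets CONDA_DEFAULT_ENV when an environment is activated.
print("conda env:", os.environ.get("CONDA_DEFAULT_ENV", "<none active>"))
print("mlc_llm importable:", is_importable("mlc_llm"))
```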

Download pre-quantized weights. The commands below download the int4-quantized Llama2-7B weights from Hugging Face:

git lfs install && mkdir dist/
git clone https://huggingface.co/mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC
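Because the weights are stored with Git LFS, a clone made without `git lfs install` leaves small text pointer files in place of the real weight files. The sketch below flags such files by looking for the Git LFS pointer header. The helper names are illustrative and not part of MLC LLM, and the directory to scan is whatever location the clone ended up in.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """True if the file starts with the Git LFS pointer header."""
    with path.open("rb") as f:
        return f.read(len(LFS_HEADER)) == LFS_HEADER


def unfetched_files(repo_dir: str) -> list[str]:
    """List files under repo_dir (outside .git) that are still LFS pointers."""
    root = Path(repo_dir)
    return sorted(
        str(p.relative_to(root))
        for p in root.rglob("*")
        if p.is_file()
        and ".git" not in p.relative_to(root).parts
        and is_lfs_pointer(p)
    )
```

An empty list from `unfetched_files(...)` on the cloned directory suggests the real weight files were downloaded.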

Download pre-compiled model library. The pre-compiled model libraries can be cloned as follows:

git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt_libs

Run in Python. The following Python script demonstrates the MLC LLM Python API and its streaming capability:

from mlc_llm import ChatModule
from mlc_llm.callback import StreamToStdout

cm = ChatModule(
    # Arguments: the downloaded weights and the pre-compiled library for your platform.
    # Vulkan on Linux: Llama-2-7b-chat-hf-q4f16_1-vulkan.so
    # Metal on macOS: Llama-2-7b-chat-hf-q4f16_1-metal.so
    # Other platforms: Llama-2-7b-chat-hf-q4f16_1-{backend}.{suffix}
)

# Stream the reply to stdout as it is generated.
cm.generate(
    prompt="What is the meaning of life?",
    progress_callback=StreamToStdout(callback_interval=2),
)
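The library naming pattern listed in the comments above can be captured in a small helper. This is only a sketch: the function name and its defaults are assumptions made for illustration, and it produces just the file name, not the directory layout inside `dist/prebuilt_libs`.

```python
def prebuilt_lib_name(
    backend: str,
    suffix: str = "so",
    model: str = "Llama-2-7b-chat-hf-q4f16_1",
) -> str:
    """Build a library file name following the {model}-{backend}.{suffix} pattern."""
    return f"{model}-{backend}.{suffix}"


print(prebuilt_lib_name("vulkan"))  # Llama-2-7b-chat-hf-q4f16_1-vulkan.so
print(prebuilt_lib_name("metal"))   # Llama-2-7b-chat-hf-q4f16_1-metal.so
```

Keeping the backend and suffix as parameters mirrors the `{backend}.{suffix}` placeholders, so other platforms can be handled without changing the helper.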

Colab walkthrough. A Jupyter notebook on Colab provides a detailed walkthrough of the Python API.

Documentation and tutorial. The Python API reference and tutorials are available online.