Android App¶
Demo App¶
The demo APK below is built for the Samsung S23 with the Snapdragon 8 Gen 2 chip.

Prerequisite¶
Rust (install) is needed to cross-compile HuggingFace tokenizers to Android. Make sure rustc, cargo, and rustup are available in $PATH.
Android Studio (install) with NDK and CMake. To install NDK and CMake, in the Android Studio welcome page, click “Projects → SDK Manager → SDK Tools”. Set up the following environment variables:
ANDROID_NDK, so that $ANDROID_NDK/build/cmake/android.toolchain.cmake is available.
TVM_NDK_CC, which points to NDK’s clang compiler.
# Example on macOS
ANDROID_NDK: $HOME/Library/Android/sdk/ndk/25.2.9519653
TVM_NDK_CC: $ANDROID_NDK/toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android24-clang
# Example on Linux
ANDROID_NDK: $HOME/Android/Sdk/ndk/25.2.9519653
TVM_NDK_CC: $ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android24-clang
JDK, such as OpenJDK >= 17, to compile the Java bindings of the TVM Unity runtime. It can be installed via Homebrew on macOS, apt on Ubuntu, or other package managers. Set up the following environment variable:
JAVA_HOME, so that Java is available in $JAVA_HOME/bin/java.
The TVM Unity runtime is placed under 3rdparty/tvm in MLC LLM, so there is no need to install anything extra. Set up the following environment variable:
TVM_HOME, so that its headers are available under $TVM_HOME/include/tvm/runtime.
(Optional) TVM Unity compiler Python package (install or build from source). It is NOT required if models are prebuilt, but the compiler is required to compile PyTorch models from HuggingFace in the following section.
Note
❗ Whenever using Python, it is highly recommended to use conda to manage an isolated Python environment to avoid missing dependencies, incompatible versions, and package conflicts.
As a final check, make sure the environment variables are properly set. One way to ensure this is to place them in $HOME/.zshrc, $HOME/.bashrc, or an environment management tool.
source $HOME/.cargo/env # Rust
export ANDROID_NDK=... # Android NDK toolchain
export TVM_NDK_CC=... # Android NDK clang
export JAVA_HOME=... # Java
export TVM_HOME=... # TVM Unity runtime
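The variables can be sanity-checked before moving on. The snippet below is a sketch; check_env is a hypothetical helper, and the paths it checks are exactly the ones the steps above require:

```shell
# Hypothetical sanity-check helper: report whether each toolchain
# environment variable points at an existing path.
check_env() {
  name="$1"; path="$2"
  if [ -e "$path" ]; then
    echo "OK   $name -> $path"
  else
    echo "MISS $name -> $path"
  fi
}

check_env ANDROID_NDK "$ANDROID_NDK/build/cmake/android.toolchain.cmake"
check_env TVM_NDK_CC  "$TVM_NDK_CC"
check_env JAVA_HOME   "$JAVA_HOME/bin/java"
check_env TVM_HOME    "$TVM_HOME/include/tvm/runtime"
```

Any line printed as MISS indicates a variable that still needs fixing.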
Compile PyTorch Models from HuggingFace¶
To deploy models on Android with reasonable performance, one has to cross-compile them with TVM Unity so that they fully utilize the mobile GPU. MLC provides a few pre-compiled models, or one can compile the models on their own.
Clone MLC LLM from GitHub. Download MLC LLM via the following command:
git clone --recursive https://github.com/mlc-ai/mlc-llm/
cd ./mlc-llm/
Note
❗ The --recursive flag is necessary to download submodules like 3rdparty/tvm. If any file is missing during compilation, double-check that the git submodules are properly cloned.
Download the PyTorch model using Git Large File Storage (LFS); by default, it is placed under ./dist/models/:
MODEL_NAME=Llama-2-7b-chat-hf
QUANTIZATION=q4f16_1
git lfs install
git clone https://huggingface.co/meta-llama/$MODEL_NAME \
    ./dist/models/$MODEL_NAME
Compile Android-capable models. Install TVM Unity compiler as a Python package, and then run the command below:
# Show help message
python3 -m mlc_llm.build --help
# Compile a PyTorch model
python3 -m mlc_llm.build \
    --target android \
    --max-seq-len 768 \
    --model ./dist/models/$MODEL_NAME \
    --quantization $QUANTIZATION
This generates the directory ./dist/${MODEL_NAME}-${QUANTIZATION}, which contains the necessary components to run the model, as explained below.
Expected output format. By default, models are placed under ./dist/${MODEL_NAME}-${QUANTIZATION}, and the result consists of 3 major components:
Runtime configuration: it configures conversation templates including system prompts, repetition penalty, sampling parameters including temperature and top-p probability, maximum sequence length, etc. It is usually named mlc-chat-config.json, under params/ alongside the tokenizer configurations.
Model lib: the compiled library that uses the mobile GPU. It is usually named ${MODEL_NAME}-${QUANTIZATION}-android.tar, for example, Llama-2-7b-chat-hf-q4f16_1-android.tar.
Model weights: the model weights are sharded as params_shard_*.bin under params/, and the metadata is stored in ndarray-cache.json.
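The runtime configuration can be inspected directly. The snippet below is a sketch, assuming the sampling fields mentioned above (temperature, top-p probability, repetition penalty) appear as keys in mlc-chat-config.json; show_sampling is a hypothetical helper, not part of MLC LLM:

```shell
# Print sampling-related fields from the generated runtime configuration.
show_sampling() {
  grep -E '"(temperature|top_p|repetition_penalty)"' "$1"
}

CONFIG=./dist/${MODEL_NAME}-${QUANTIZATION}/params/mlc-chat-config.json
if [ -f "$CONFIG" ]; then
  show_sampling "$CONFIG"
fi
```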
Create Android Project using Compiled Models¶
The source code for MLC LLM is available under android/, including scripts to build dependencies and the main app under android/MLCChat/, which can be opened in Android Studio. Enter the directory first:
cd ./android/
Build necessary dependencies. Configure the list of models the app ships with via the JSON file below, which, by default, is configured to use both Llama2-7B and RedPajama-3B:
vim ./MLCChat/app/src/main/assets/app-config.json
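For reference, a model entry in app-config.json looks roughly like the sketch below. This is an illustrative assumption, not the authoritative schema: only the model_url field is confirmed by the weight-bundling section later, and the model_list and local_id names are guesses for illustration, so consult the shipped file:

```json
{
  "model_list": [
    {
      "model_url": "https://huggingface.co/mlc-ai/...",
      "local_id": "Llama-2-7b-chat-hf-q4f16_1"
    }
  ]
}
```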
Then bundle the android library ${MODEL_NAME}-${QUANTIZATION}-android.tar compiled by mlc_llm.build in the previous steps, together with TVM Unity’s Java runtime, by running the command below:
./prepare_libs.sh
which generates the two files below:
>>> find ./build/output -type f
./build/output/arm64-v8a/libtvm4j_runtime_packed.so
./build/output/tvm4j_core.jar
The model execution logic for mobile GPUs is incorporated into libtvm4j_runtime_packed.so, while tvm4j_core.jar is a lightweight (~60 KB) Java binding to it. Copy them to the right path so they can be found by the Android project:
cp -a ./build/output/. ./MLCChat/app/src/main/libs
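A quick way to confirm the copy succeeded is to check for both artifacts in the project tree. This is a sketch; check_libs is a hypothetical helper, and the paths follow the cp command above:

```shell
# Verify that both build artifacts reached the Android project.
check_libs() {
  dir="$1"; missing=0
  for f in arm64-v8a/libtvm4j_runtime_packed.so tvm4j_core.jar; do
    if [ -f "$dir/$f" ]; then
      echo "found: $f"
    else
      echo "missing: $f"; missing=1
    fi
  done
  return "$missing"
}

if ! check_libs ./MLCChat/app/src/main/libs; then
  echo "re-run ./prepare_libs.sh and repeat the cp command"
fi
```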
Build the Android app. Open folder ./android/MLCChat
as an Android Studio Project. Connect your Android device to your machine. In the menu bar of Android Studio, click “Build → Make Project”. Once the build is finished, click “Run → Run ‘app’” and you will see the app launched on your phone.
Note
❗ This app cannot run in an emulator, so a physical phone is required, because MLC LLM needs an actual mobile GPU to run at a meaningful speed.
Incorporate Model Weights¶
The previous sections describe how to build an Android app with MLC LLM, but the resulting app downloads the model weights from HuggingFace at runtime, as configured via the model_url field in app-config.json. However, it can be desirable to bundle the weights into the app to avoid downloading them over the network. This section provides a simple ADB-based walkthrough that hopefully helps with further development.
Generate the APK. Enter Android Studio, and click “Build → Generate Signed Bundle/APK” to build an APK for release. If this is the first time you generate an APK, you will need to create a key following the official guide from Android. The APK will be placed under android/MLCChat/app/release/app-release.apk.
Install ADB and enable USB debugging. Enable “USB debugging” in the developer mode of your phone settings. In the SDK manager, install Android SDK Platform-Tools, and add the platform-tools directory to the PATH environment variable. Run the following command; if ADB is installed correctly, your phone will appear as a device:
adb devices
Install the APK and weights on your phone. Run the commands below, replacing ${MODEL_NAME} and ${QUANTIZATION} with the actual model name (e.g. Llama-2-7b-chat-hf) and quantization format (e.g. q4f16_1):
adb install android/MLCChat/app/release/app-release.apk
adb push dist/${MODEL_NAME}-${QUANTIZATION}/params /data/local/tmp/${MODEL_NAME}-${QUANTIZATION}/
adb shell "mkdir -p /storage/emulated/0/Android/data/ai.mlc.mlcchat/files/"
adb shell "mv /data/local/tmp/${MODEL_NAME}-${QUANTIZATION} /storage/emulated/0/Android/data/ai.mlc.mlcchat/files/"
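When iterating on several models, the four commands above can be wrapped in a small helper. This is a sketch; push_weights and app_files_dir are hypothetical helpers, and the device paths are exactly those used above:

```shell
# Device directory that MLCChat reads bundled weights from.
app_files_dir() {
  echo "/storage/emulated/0/Android/data/ai.mlc.mlcchat/files"
}

# Install the release APK and push one model's weights to the device.
push_weights() {
  name="$1-$2"   # e.g. Llama-2-7b-chat-hf-q4f16_1
  adb install android/MLCChat/app/release/app-release.apk
  adb push "dist/${name}/params" "/data/local/tmp/${name}/"
  adb shell "mkdir -p $(app_files_dir)/"
  adb shell "mv /data/local/tmp/${name} $(app_files_dir)/"
}

# Usage (requires a connected device):
# push_weights Llama-2-7b-chat-hf q4f16_1
```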