Android App

Demo App

The demo APK below is built for the Samsung S23 with the Snapdragon 8 Gen 2 chip.

https://seeklogo.com/images/D/download-android-apk-badge-logo-D074C6882B-seeklogo.com.png

Prerequisite

Rust (install) is needed to cross-compile HuggingFace tokenizers to Android. Make sure rustc, cargo, and rustup are available in $PATH.

Android Studio (install) with NDK and CMake. To install NDK and CMake, on the Android Studio welcome page, click “Projects → SDK Manager → SDK Tools”. Set up the following environment variables:

  • ANDROID_NDK so that $ANDROID_NDK/build/cmake/android.toolchain.cmake is available.

  • TVM_NDK_CC that points to NDK’s clang compiler.

# Example on macOS
ANDROID_NDK: $HOME/Library/Android/sdk/ndk/25.2.9519653
TVM_NDK_CC: $ANDROID_NDK/toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android24-clang
# Example on Linux (the SDK location may differ on your machine)
ANDROID_NDK: $HOME/Android/Sdk/ndk/25.2.9519653
TVM_NDK_CC: $ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android24-clang

JDK, such as OpenJDK >= 17, to compile the Java bindings of the TVM Unity runtime. It can be installed via Homebrew on macOS, apt on Ubuntu, or other package managers. Set up the following environment variable:

  • JAVA_HOME so that Java is available in $JAVA_HOME/bin/java.

TVM Unity runtime is placed under 3rdparty/tvm in MLC LLM, so there is no need to install anything extra. Set up the following environment variable:

  • TVM_HOME so that its headers are available under $TVM_HOME/include/tvm/runtime.

(Optional) TVM Unity compiler Python package (install or build from source). It is NOT required if models are prebuilt, but it is required to compile PyTorch models from HuggingFace as described in the following section.

Note

❗ Whenever using Python, it is highly recommended to use conda to manage an isolated Python environment to avoid missing dependencies, incompatible versions, and package conflicts.

As a last step, check that the environment variables are properly set. One way to ensure this is to place them in $HOME/.zshrc, $HOME/.bashrc, or an environment management tool.

source $HOME/.cargo/env # Rust
export ANDROID_NDK=...  # Android NDK toolchain
export TVM_NDK_CC=...   # Android NDK clang
export JAVA_HOME=...    # Java
export TVM_HOME=...     # TVM Unity runtime
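The final check can be done from the shell. The following is a minimal sketch, assuming a POSIX-style shell; check_path and check_android_env are hypothetical helper names and are not part of MLC LLM. It only tests that the tools and paths named in this section exist.

```shell
# check_path: hypothetical helper. $1 = variable name, $2 = its value, $3 = path suffix.
check_path() {
    if [ -z "$2" ] || [ ! -e "$2$3" ]; then
        echo "missing path: $1 -> $2$3"
        missing=1
    fi
}

# check_android_env: hypothetical helper that reports every missing prerequisite.
check_android_env() {
    missing=0
    # Rust tools that must be on $PATH
    for tool in rustc cargo rustup; do
        command -v "$tool" >/dev/null 2>&1 || { echo "missing tool: $tool"; missing=1; }
    done
    # Paths that each environment variable must lead to
    check_path ANDROID_NDK "${ANDROID_NDK:-}" /build/cmake/android.toolchain.cmake
    check_path TVM_NDK_CC  "${TVM_NDK_CC:-}"  ""
    check_path JAVA_HOME   "${JAVA_HOME:-}"   /bin/java
    check_path TVM_HOME    "${TVM_HOME:-}"    /include/tvm/runtime
    return "$missing"
}

check_android_env || echo "Fix the items listed above before building."
```

If nothing is printed, all prerequisites in this section are visible to the current shell.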

Compile PyTorch Models from HuggingFace

To deploy models on Android with reasonable performance, they must be cross-compiled with TVM Unity so that they run on the mobile GPU. MLC provides a few pre-compiled models; alternatively, you can compile models yourself.

Clone MLC LLM from GitHub. Download MLC LLM via the following commands:

# Note the --recursive flag: it is required to fetch submodules
git clone --recursive https://github.com/mlc-ai/mlc-llm/
cd ./mlc-llm/

Note

❗ The --recursive flag is necessary to download submodules such as 3rdparty/tvm. If any file is missing during compilation, double-check that the git submodules were cloned properly.
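A quick way to confirm that the submodule was fetched is to check that 3rdparty/tvm is not empty. This is a minimal sketch to run from the root of the checkout; dir_has_files is a hypothetical helper name, and the suggested recovery command is the standard git way to fetch submodules after a non-recursive clone.

```shell
# dir_has_files: hypothetical helper. Succeeds if $1 is a directory with at least one entry.
dir_has_files() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

if dir_has_files ./3rdparty/tvm; then
    echo "3rdparty/tvm is populated"
else
    echo "3rdparty/tvm is missing or empty; try: git submodule update --init --recursive"
fi
```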

Download the PyTorch model using Git Large File Storage (LFS). By default, models are placed under ./dist/models/:

MODEL_NAME=Llama-2-7b-chat-hf
QUANTIZATION=q4f16_1

git lfs install
git clone https://huggingface.co/meta-llama/$MODEL_NAME \
          ./dist/models/$MODEL_NAME
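If Git LFS was not active during the clone, the weight files are small text stubs (LFS pointers) instead of the real weights. The sketch below detects such stubs by their first line, which the Git LFS pointer format fixes as "version https://git-lfs.github.com/spec/v1". It assumes the model sits in ./dist/models/$MODEL_NAME and that weights use the .bin or .safetensors extension; is_lfs_pointer is a hypothetical helper name.

```shell
# is_lfs_pointer: hypothetical helper. Succeeds if $1 is a Git LFS pointer stub
# rather than real file content.
is_lfs_pointer() {
    head -c 64 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Assumed location of the cloned model
MODEL_DIR=./dist/models/${MODEL_NAME:-Llama-2-7b-chat-hf}
for f in "$MODEL_DIR"/*.bin "$MODEL_DIR"/*.safetensors; do
    [ -f "$f" ] || continue   # skip patterns that matched nothing
    if is_lfs_pointer "$f"; then
        echo "pointer only (weights not downloaded): $f"
    fi
done
```

If any file is reported, running git lfs pull inside the model directory should fetch the actual weights.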

Compile Android-capable models. Install the TVM Unity compiler as a Python package, and then run the commands below:

# Show help message
python3 -m mlc_llm.build --help
# Compile a PyTorch model
python3 -m mlc_llm.build \
        --target android \
        --max-seq-len 768 \
        --model ./dist/models/$MODEL_NAME \
        --quantization $QUANTIZATION

This generates the directory ./dist/$MODEL_NAME-$QUANTIZATION which contains the necessary components to run the model, as explained below.

Expected output format. By default, models are placed under ./dist/${MODEL_NAME}-${QUANTIZATION}, and the result consists of three major components:

  • Runtime configuration: It configures conversation templates including system prompts, the repetition penalty, sampling parameters including temperature and top-p probability, maximum sequence length, etc. It is usually named mlc-chat-config.json and placed under params/ alongside the tokenizer configurations.

  • Model lib: The compiled library that uses the mobile GPU. It is usually named ${MODEL_NAME}-${QUANTIZATION}-android.tar, for example, Llama-2-7b-chat-hf-q4f16_1-android.tar.

  • Model weights: The model weights are sharded as params_shard_*.bin under params/, and the metadata is stored in ndarray-cache.json.
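The three components can be checked with a few file tests. The following is a minimal sketch, not part of MLC LLM: check_model_output is a hypothetical helper, and it assumes the model lib sits at the top level of the output directory while ndarray-cache.json sits under params/ next to the shards.

```shell
# check_model_output: hypothetical helper.
# $1 = model name, $2 = quantization, $3 = dist directory (default ./dist)
check_model_output() {
    out_dir="${3:-./dist}/$1-$2"
    missing=0
    # Assumed locations: model lib at the top level, everything else under params/
    for f in \
        "$out_dir/$1-$2-android.tar" \
        "$out_dir/params/mlc-chat-config.json" \
        "$out_dir/params/ndarray-cache.json"; do
        [ -f "$f" ] || { echo "missing: $f"; missing=1; }
    done
    # At least one weight shard must exist
    ls "$out_dir"/params/params_shard_*.bin >/dev/null 2>&1 \
        || { echo "missing: weight shards under $out_dir/params"; missing=1; }
    return "$missing"
}

check_model_output Llama-2-7b-chat-hf q4f16_1 || echo "Build output is incomplete."
```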

Create Android Project using Compiled Models

The source code for the MLC LLM Android app is available under android/, including scripts to build dependencies and the main app under android/MLCChat/, which can be opened in Android Studio. Enter the directory first:

cd ./android/

Build necessary dependencies. Configure the list of models the app ships with by editing the JSON file below, which by default is configured to use both Llama2-7B and RedPajama-3B:

vim ./MLCChat/app/src/main/assets/app-config.json

Then bundle the Android library ${MODEL_NAME}-${QUANTIZATION}-android.tar, compiled by mlc_llm.build in the previous steps, with TVM Unity’s Java runtime by running the command below:

./prepare_libs.sh

which generates the two files below:

>>> find ./build/output -type f
./build/output/arm64-v8a/libtvm4j_runtime_packed.so
./build/output/tvm4j_core.jar

The model execution logic on mobile GPUs is incorporated into libtvm4j_runtime_packed.so, while tvm4j_core.jar is a lightweight (~60 KB) Java binding to it. Copy them to the path where the Android project expects to find them:

cp -a ./build/output/. ./MLCChat/app/src/main/libs
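Because cp -a copies the contents of ./build/output/, both files should now sit under the app's libs directory with the same relative paths. A minimal sketch to confirm this from ./android/; check_libs is a hypothetical helper name.

```shell
# check_libs: hypothetical helper. $1 = the app's libs directory.
check_libs() {
    missing=0
    for f in arm64-v8a/libtvm4j_runtime_packed.so tvm4j_core.jar; do
        if [ -f "$1/$f" ]; then
            echo "ok: $1/$f"
        else
            echo "missing: $1/$f"
            missing=1
        fi
    done
    return "$missing"
}

check_libs ./MLCChat/app/src/main/libs || echo "Re-run prepare_libs.sh and the copy step."
```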

Build the Android app. Open folder ./android/MLCChat as an Android Studio Project. Connect your Android device to your machine. In the menu bar of Android Studio, click “Build → Make Project”. Once the build is finished, click “Run → Run ‘app’” and you will see the app launched on your phone.

Note

❗ This app cannot be run in an emulator and thus a physical phone is required, because MLC LLM needs an actual mobile GPU to meaningfully run at an accelerated speed.

Incorporate Model Weights

The previous sections build an Android app with MLC LLM, but that app downloads the weights from HuggingFace at run time, as configured under model_url in app-config.json. It can be desirable to bundle the weights with the app to avoid downloading over the network. This section provides a simple ADB-based walkthrough that may help with further development.

Generate the APK. In Android Studio, click “Build → Generate Signed Bundle/APK” to build an APK for release. If this is the first time you generate an APK, you will need to create a key according to the official guide from Android. The APK will be placed under android/MLCChat/app/release/app-release.apk.

Install ADB and enable USB debugging. Enable “USB debugging” in the developer options of your phone settings. In the SDK Manager, install Android SDK Platform-Tools. Add the platform-tools directory to the PATH environment variable. Run the following command; if ADB is installed correctly, your phone will appear as a device:

adb devices
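The output of adb devices starts with the header line "List of devices attached", followed by one line per device with its serial number and state. A phone is ready only when its state is "device" (not "unauthorized" or "offline"). The sketch below counts ready devices; count_ready_devices is a hypothetical helper, and the serial numbers in the sample are made up.

```shell
# count_ready_devices: hypothetical helper. Reads `adb devices` output on stdin
# and prints how many entries are in the "device" state (header line skipped).
count_ready_devices() {
    awk 'NR > 1 && $2 == "device" { n++ } END { print n + 0 }'
}

# Sample input with made-up serial numbers; on a real machine use:
#   adb devices | count_ready_devices
printf 'List of devices attached\nR5CT0000001\tdevice\nR5CT0000002\tunauthorized\n\n' \
    | count_ready_devices
# prints 1
```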

Install the APK and weights on your phone. Run the commands below from the root of the mlc-llm checkout, replacing ${MODEL_NAME} and ${QUANTIZATION} with the actual model name (e.g. Llama-2-7b-chat-hf) and quantization format (e.g. q4f16_1).

adb install android/MLCChat/app/release/app-release.apk
adb push dist/${MODEL_NAME}-${QUANTIZATION}/params /data/local/tmp/${MODEL_NAME}-${QUANTIZATION}/
adb shell "mkdir -p /storage/emulated/0/Android/data/ai.mlc.mlcchat/files/"
adb shell "mv /data/local/tmp/${MODEL_NAME}-${QUANTIZATION} /storage/emulated/0/Android/data/ai.mlc.mlcchat/files/"
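To confirm that the weights landed in the app's data directory, the target directory can be listed on the phone. The sketch below builds the on-device path from the same two variables and, by default, only prints the adb command. device_model_dir and the RUN_ADB switch are hypothetical names introduced here for illustration.

```shell
MODEL_NAME=${MODEL_NAME:-Llama-2-7b-chat-hf}
QUANTIZATION=${QUANTIZATION:-q4f16_1}

# device_model_dir: hypothetical helper that prints the on-device weight directory.
# $1 = model name, $2 = quantization
device_model_dir() {
    echo "/storage/emulated/0/Android/data/ai.mlc.mlcchat/files/$1-$2"
}

TARGET=$(device_model_dir "$MODEL_NAME" "$QUANTIZATION")
if [ "${RUN_ADB:-0}" = 1 ]; then
    adb shell "ls $TARGET"             # list the pushed files on the phone
else
    echo "adb shell \"ls $TARGET\""    # dry run: only print the command
fi
```

Setting RUN_ADB=1 before running the snippet executes the listing on the connected phone instead of printing it.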