Python API for Model Compilation

We expose a Python API for compiling/building models in the package mlc_llm, so that users may build a model in any directory in their program (i.e., not just within the mlc-llm repo).

Install MLC-LLM as a Package

To install, we first clone the repository (as mentioned in Compile Models via MLC):

# clone the repository
git clone git@github.com:mlc-ai/mlc-llm.git --recursive

Afterwards, we use pip to install mlc_llm as a package so that we can use it in any directory:

# enter the root directory of the repo
cd mlc-llm
# install the package
pip install -e .

To verify the installation, run the command below; you should see information about the package mlc_llm even when you are not inside the mlc-llm directory:

python -c "import mlc_llm; print(mlc_llm)"

Compiling a Model in Python

After installing the package, you can build the model using mlc_llm.build_model(), which takes in an instance of BuildArgs (a dataclass that represents the arguments for building a model).

For detailed instructions with code, please refer to the Python notebook (executable in Colab), where we walk you through compiling Llama-2 with mlc_llm in Python.
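
For a quick reference, a minimal sketch of a Python build might look like the following. The model name, target, and paths are placeholders; adjust them to your model folder and hardware:

import mlc_llm
from mlc_llm import BuildArgs

# Placeholder settings: per the docs below, the model is located under
# artifact_path/models/<model-name> or downloaded via hf_path.
args = BuildArgs(
    model="Llama-2-7b-chat-hf",
    quantization="q4f16_1",
    target="cuda",
    artifact_path="dist",
)

# Compile the model; outputs are written under artifact_path.
mlc_llm.build_model(args)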

API Reference

In order to use the Python API mlc_llm.build_model(), users need to create an instance of the dataclass BuildArgs. The corresponding command-line arguments shown in Compile Command Specification are automatically generated from the definition of BuildArgs and are equivalent.

Then with an instantiated BuildArgs, users can call the build API mlc_llm.build_model().

class mlc_llm.BuildArgs(model: str = 'auto', hf_path: Optional[str] = None, quantization: str = 'q4f16_1', max_seq_len: int = -1, max_vocab_size: int = 40000, target: str = 'auto', reuse_lib: Optional[str] = None, artifact_path: str = 'dist', use_cache: int = 1, convert_weights_only: bool = False, build_model_only: bool = False, debug_dump: bool = False, debug_load_script: bool = False, llvm_mingw: str = '', cc_path: str = '', system_lib: bool = False, sep_embed: bool = False, use_safetensors: bool = False, enable_batching: bool = False, max_batch_size: int = 80, no_cutlass_attn: bool = False, no_cutlass_norm: bool = False, no_cublas: bool = False, use_cuda_graph: bool = False, num_shards: int = 1, use_presharded_weights: bool = False, use_flash_attn_mqa: bool = False, sliding_window: int = -1, prefill_chunk_size: int = -1, pdb: bool = False, use_vllm_attention: bool = False)

BuildArgs is the dataclass that organizes the arguments we use in building a model.

To use mlc_llm.build_model(), users pass in an instance of BuildArgs; for CLI entry points, an equivalent ArgumentParser instance is generated based on the definition of this class using mlc_llm.convert_build_args_to_argparser() (see the sketch below, after the parameter list).

Parameters:
  • model (str) – The name of the model to build. If it is auto, we will automatically set the model name according to --model-path, --hf-path, or the model folders under --artifact-path/models.

  • hf_path (str) – Hugging Face path from which to download params, tokenizer, and config.

  • quantization (str) – The quantization mode we use to compile.

  • max_seq_len (int) – The maximum allowed sequence length for the model.

  • target (str) – The target platform to compile the model for.

  • db_path (str) – Path to log database for all models. Default: ./log_db/.

  • reuse_lib (str) – Whether to reuse a previously generated lib.

  • artifact_path (str) – Where to store the output.

  • use_cache (int) – Whether to use previously pickled IRModule and skip trace.

  • convert_weights_only (bool) – Whether to only convert the model weights without building the model. If both convert_weights_only and build_model_only are set, the behavior is undefined.

  • build_model_only (bool) – Whether to only build the model without converting model weights.

  • debug_dump (bool) – Whether to dump debugging files during compilation.

  • debug_load_script (bool) – Whether to load the script for debugging.

  • llvm_mingw (str) – Path to the llvm-mingw root; use llvm-mingw to cross-compile to Windows.

  • system_lib (bool) – A parameter to relax.build.

  • sep_embed (bool) – Build with a separated embedding layer; only applicable to LLaMA. This feature is in the testing stage and will be formally replaced after a massive overhaul of the embedding feature for all models and use cases.

  • sliding_window (int) – The sliding window size in sliding window attention (SWA). This optional field overrides the sliding_window in config.json for those models that use SWA. Currently only useful when compiling Mistral.

  • prefill_chunk_size (int) – The chunk size during prefilling. By default, the chunk size is the same as the max sequence length. Currently only useful when compiling Mistral.

  • cc_path (str) – Path to the cross compiler; currently only used to cross-compile for NVIDIA Jetson devices.

  • use_safetensors (bool) – Specifies whether to use .safetensors instead of the default .bin when loading in model weights.

  • enable_batching (bool) – Build the model for batched inference. This is a temporary flag used to control the model execution flow in single-sequence and batching settings for now. The two flows will eventually be merged and this flag removed.

  • no_cutlass_attn (bool) – Disable offloading attention operations to CUTLASS.

  • no_cutlass_norm (bool) – Disable offloading layer and RMS norm operations to CUTLASS.

  • no_cublas (bool) – Disable the step that offloads matmul to cuBLAS. Without this flag, matmul will be offloaded to cuBLAS if the quantization mode is q0f16 or q0f32, the target is CUDA, and TVM has been built with cuBLAS enabled.

  • use_cuda_graph (bool) – Specifies whether to enable CUDA Graph for the decoder. MLP and QKV projection between two attention layers are put into a graph.

  • num_shards (int) – Number of shards to split the model into for tensor-parallel multi-GPU inference. Only used when build_model_only is set.

  • use_flash_attn_mqa (bool) – Offload multi-query attention workload to Flash Attention.

  • pdb (bool) – If set, drop into a pdb debugger on error.

  • use_vllm_attention (bool) – Use vLLM paged KV cache and attention kernel, only relevant when enable_batching=True.

property convert_weight_only

A backwards-compatibility helper
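
As noted earlier, the same settings can be produced from command-line flags via mlc_llm.convert_build_args_to_argparser(); a small sketch, where the flag values are placeholders:

import mlc_llm

# ArgumentParser whose flags mirror the fields of BuildArgs
parser = mlc_llm.convert_build_args_to_argparser()

# Equivalent to BuildArgs(model="Llama-2-7b-chat-hf",
#                         quantization="q4f16_1", target="cuda")
parsed = parser.parse_args([
    "--model", "Llama-2-7b-chat-hf",
    "--quantization", "q4f16_1",
    "--target", "cuda",
])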

mlc_llm.build_model(args: BuildArgs)

Builds/compiles a model.

Parameters:

args (BuildArgs) – A dataclass of arguments for building models.

Returns:

  • lib_path (Optional[str]) – The path to the model library file. Return None if not applicable.

  • model_path (Optional[str]) – The path to the folder of the model’s parameters. Return None if not applicable.

  • chat_config_path (Optional[str]) – The path to the chat config .json file. Return None if not applicable.
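
Putting the pieces together, a call can be unpacked along these lines (a sketch, assuming the three values above are returned as a tuple in the listed order):

import mlc_llm
from mlc_llm import BuildArgs

args = BuildArgs(model="Llama-2-7b-chat-hf", quantization="q4f16_1", target="cuda")

# Each entry may be None when not applicable (see above).
lib_path, model_path, chat_config_path = mlc_llm.build_model(args)
print(lib_path, model_path, chat_config_path)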