Python API for Model Compilation
We expose a Python API for compiling/building models in the package mlc_llm, so that users may build a model in any directory in their program (i.e. not just within the mlc-llm repo).
To install, we first clone the repository (as mentioned in Compile Models via MLC):
```bash
# clone the repository
git clone git@github.com:mlc-ai/mlc-llm.git --recursive
```
Afterwards, we use pip to install mlc_llm as a package so that we can use it in any directory:

```bash
# enter the root directory of the repo
cd mlc-llm
# install the package
pip install -e .
```
To verify the installation, run the command below; you are expected to see information about the package mlc_llm even if you are not in the mlc-llm directory:

```bash
python -c "import mlc_llm; print(mlc_llm)"
```
After installing the package, you can build the model using mlc_llm.build_model(), which takes in an instance of BuildArgs (a dataclass that represents the arguments for building a model).
For detailed instructions with code, please refer to the Python notebook (executable in Colab), where we walk you through compiling Llama-2 with the Python API.
In order to use the Python API mlc_llm.build_model(), users need to create an instance of the dataclass BuildArgs. The corresponding command-line arguments shown in Compile Command Specification are automatically converted from the definition of BuildArgs and are equivalent. Then, with an instantiated BuildArgs, users can call the build API mlc_llm.build_model().
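For example, a minimal sketch of this flow (the model name "Llama-2-7b-chat-hf" and the cuda target below are illustrative placeholders; substitute the model folder and target you actually use):

```python
import mlc_llm
from mlc_llm import BuildArgs

# Describe the build: model, quantization mode, target platform, and
# where to place the output artifacts. All values here are examples.
args = BuildArgs(
    model="Llama-2-7b-chat-hf",
    quantization="q4f16_1",
    target="cuda",
    artifact_path="dist",
)

# Compile the model; artifacts are written under `artifact_path`.
mlc_llm.build_model(args)
```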
- class mlc_llm.BuildArgs(model: str = 'auto', hf_path: Optional[str] = None, quantization: str = 'q4f16_1', max_seq_len: int = -1, max_vocab_size: int = 40000, target: str = 'auto', reuse_lib: Optional[str] = None, artifact_path: str = 'dist', use_cache: int = 1, convert_weights_only: bool = False, build_model_only: bool = False, debug_dump: bool = False, debug_load_script: bool = False, llvm_mingw: str = '', cc_path: str = '', system_lib: bool = False, sep_embed: bool = False, use_safetensors: bool = False, enable_batching: bool = False, max_batch_size: int = 80, no_cutlass_attn: bool = False, no_cutlass_norm: bool = False, no_cublas: bool = False, use_cuda_graph: bool = False, num_shards: int = 1, use_presharded_weights: bool = False, use_flash_attn_mqa: bool = False, sliding_window: int = -1, prefill_chunk_size: int = -1, pdb: bool = False, use_vllm_attention: bool = False)
BuildArgs is the dataclass that organizes the arguments we use in building a model. To use mlc_llm.build_model(), users pass in an instance of BuildArgs; for CLI entry points, an equivalent ArgumentParser instance is generated based on the definition of this class (see the equivalence sketch after the parameter list below).

Parameters:
model (str) – The name of the model to build. If it is auto, we will automatically set the model name according to hf-path or the model folders under artifact_path/models.
hf_path (str) – Hugging Face path from which to download params, tokenizer, and config.
quantization (str) – The quantization mode we use to compile.
max_seq_len (int) – The maximum allowed sequence length for the model.
target (str) – The target platform to compile the model for.
db_path (str) – Path to the log database for all models.
reuse_lib (str) – Whether to reuse a previously generated lib.
artifact_path (str) – Where to store the output.
use_cache (int) – Whether to use previously pickled IRModule and skip trace.
convert_weights_only (bool) – Whether to only convert model weights and not build the model. If both convert_weights_only and build_model_only are set, the behavior is undefined.
build_model_only (bool) – Whether to only build the model and not convert model weights.
debug_dump (bool) – Whether to dump debugging files during compilation.
debug_load_script (bool) – Whether to load the script for debugging.
llvm_mingw (str) – Path to the llvm-mingw root (/path/to/llvm-mingw-root); use llvm-mingw to cross-compile to Windows.
system_lib (bool) – A parameter to relax.build.
sep_embed (bool) – Build with a separated embedding layer; only applicable to LLaMA. This feature is in the testing stage and will be formally replaced after a massive overhaul of the embedding feature for all models and use cases.
sliding_window (int) – The sliding window size in sliding window attention (SWA). This optional field overrides the sliding_window in config.json for those models that use SWA. Currently only useful when compiling Mistral.
prefill_chunk_size (int) – The chunk size during prefilling. By default, the chunk size is the same as the max sequence length. Currently only useful when compiling Mistral.
cc_path (str) – Path to the cross compiler (/path/to/cross_compiler_path); currently only used to cross-compile for NVIDIA Jetson devices.
use_safetensors (bool) – Specifies whether to use .safetensors instead of the default .bin when loading model weights.
enable_batching (bool) – Build the model for batched inference. This is a temporary flag used to control the model execution flow between single-sequence and batching settings; we will eventually merge the two flows and remove this flag.
no_cutlass_attn (bool) – Disable offloading attention operations to CUTLASS.
no_cutlass_norm (bool) – Disable offloading layer norm and RMS norm operations to CUTLASS.
no_cublas (bool) – Disable the step that offloads matmul to cuBLAS. Without this flag, matmul will be offloaded to cuBLAS if the quantization mode is q0f32, the target is CUDA, and TVM has been built with cuBLAS enabled.
use_cuda_graph (bool) – Specifies whether to enable CUDA Graph for the decoder. MLP and QKV projection between two attention layers are put into a graph.
num_shards (int) – Number of shards to split the model into for tensor-parallelism multi-GPU inference. Only useful when build_model_only is set.
use_flash_attn_mqa (bool) – Offload multi-query attention workload to Flash Attention.
pdb (bool) – If set, drop into a pdb debugger on error.
use_vllm_attention (bool) – Use vLLM paged KV cache and attention kernel, only relevant when enable_batching=True.
- property convert_weight_only
A backwards-compatibility helper.
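As a concrete illustration of the CLI equivalence mentioned above, the sketch below pairs a BuildArgs instance with the command-line invocation it mirrors (the model name and target are illustrative, and the python3 -m mlc_llm.build entry point is assumed from Compile Command Specification):

```python
# Roughly equivalent to the CLI invocation:
#   python3 -m mlc_llm.build --model Llama-2-7b-chat-hf \
#       --quantization q4f16_1 --target cuda
# Each BuildArgs field corresponds to the command-line flag of the same name.
from mlc_llm import BuildArgs, build_model

build_model(BuildArgs(
    model="Llama-2-7b-chat-hf",  # illustrative model name
    quantization="q4f16_1",
    target="cuda",
))
```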
- mlc_llm.build_model(args: BuildArgs)
Builds/compiles a model.

Parameters:
args (BuildArgs) – A dataclass of arguments for building models.

Returns:
lib_path (Optional[str]) – The path to the model library file. Returns None if not applicable.
model_path (Optional[str]) – The path to the folder of the model’s parameters. Returns None if not applicable.
chat_config_path (Optional[str]) – The path to the chat config .json file. Returns None if not applicable.
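For instance, a small sketch of consuming these return values (illustrative; which paths are None depends on your build configuration):

```python
from mlc_llm import BuildArgs, build_model

# Illustrative build arguments; see the BuildArgs reference above.
args = BuildArgs(model="Llama-2-7b-chat-hf", quantization="q4f16_1", target="cuda")
lib_path, model_path, chat_config_path = build_model(args)

# Per the return spec above, each path may be None if not applicable.
for name, path in [
    ("model library", lib_path),
    ("model parameters", model_path),
    ("chat config", chat_config_path),
]:
    print(f"{name}: {path if path is not None else 'not applicable'}")
```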