CLI¶

MLC Chat CLI is the command line tool to run MLC-compiled LLMs out of the box interactively.

Install MLC-LLM Package ¶

Chat CLI is a part of the MLC-LLM package. To use the chat CLI, first install MLC LLM by following the instructions here. Once you have install the MLC-LLM package, you can run the following command to check if the installation was successful:

mlc_llm chat --help

You should see serve help message if the installation was successful.

Quick Start ¶

This section provides a quick start guide to work with MLC-LLM chat CLI. To launch the CLI session, run the following command:

mlc_llm chat MODEL [--model-lib PATH-TO-MODEL-LIB]

where MODEL is the model folder after compiling with MLC-LLM build process. Information about other arguments can be found in the next section.

Once the chat CLI is ready, you can enter the prompt to interact with the model.

You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out stats of last request (token/sec)
  /metrics            print out full engine metrics
  /reset              restart a fresh chat
  /set [overrides]    override settings in the generation config. For example,
                      `/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
                      Note: Separate stop words in the `stop` option with commas (,).
  Multi-line input: Use escape+enter to start a new line.

>>> What's the meaning of life?
The meaning of life is a philosophical and metaphysical question related to the purpose or significance of life or existence in general...

Run CLI with Multi-GPU ¶

If you want to enable tensor parallelism to run LLMs on multiple GPUs, please specify argument --overrides "tensor_parallel_shards=$NGPU". For example,

mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --overrides "tensor_parallel_shards=2"

The `mlc_llm chat` Command ¶

We provide the list of chat CLI interface for reference.

mlc_llm chat MODEL [--model-lib PATH-TO-MODEL-LIB] [--device DEVICE] [--overrides OVERRIDES]

MODEL The model folder after compiling with MLC-LLM build process. The parameter: can either be the model name with its quantization scheme (e.g. Llama-2-7b-chat-hf-q4f16_1), or a full path to the model folder. In the former case, we will use the provided name to search for the model folder over possible paths.

--model-lib: A field to specify the full path to the model library file to use (e.g. a .so file).
--device: The description of the device to run on. User should provide a string in the form of device_name:device_id or device_name, where device_name is one of cuda, metal, vulkan, rocm, opencl, auto (automatically detect the local device), and device_id is the device id to run on. The default value is auto, with the device id set to 0 for default.
--overrides: Model configuration override. Supports overriding context_window_size, prefill_chunk_size, sliding_window_size, attention_sink_size, and tensor_parallel_shards. The overrides could be explicitly specified via details knobs, e.g. –overrides context_window_size=1024;prefill_chunk_size=128.

CLI¶

Install MLC-LLM Package¶

Quick Start¶

Run CLI with Multi-GPU¶

The mlc_llm chat Command¶

Install MLC-LLM Package ¶

Quick Start ¶

Run CLI with Multi-GPU ¶

The `mlc_llm chat` Command ¶