CLI¶
MLC Chat CLI is the command-line tool for running MLC-compiled LLMs interactively, out of the box.
Install MLC-LLM Package¶
Chat CLI is part of the MLC-LLM package. To use the chat CLI, first install MLC LLM by following the instructions here. Once you have installed the MLC-LLM package, you can run the following command to check whether the installation was successful:
mlc_llm chat --help
You should see the chat help message if the installation was successful.
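If you have not installed the package yet, the following is a minimal sketch of the install step; the wheel names and index URL are assumptions based on the MLC-LLM nightly wheels, vary by platform and GPU backend, and may have changed, so prefer the linked installation instructions:

python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly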
Quick Start¶
This section provides a quick start guide for working with the MLC-LLM chat CLI. To launch a CLI session, run the following command:
mlc_llm chat MODEL [--model-lib PATH-TO-MODEL-LIB]
where MODEL is the model folder after compiling with the MLC-LLM build process. Information about the other arguments can be found in the next section.
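For example, to chat with a pre-quantized Llama-3 model pulled directly from Hugging Face (the same model used in the multi-GPU example below):

mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC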
Once the chat CLI is ready, you can enter the prompt to interact with the model.
You can use the following special commands:
/help print the special commands
/exit quit the CLI
/stats print out stats of last request (token/sec)
/metrics print out full engine metrics
/reset restart a fresh chat
/set [overrides] override settings in the generation config. For example,
`/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.
>>> What's the meaning of life?
The meaning of life is a philosophical and metaphysical question related to the purpose or significance of life or existence in general...
Run CLI with Multi-GPU¶
If you want to enable tensor parallelism to run LLMs on multiple GPUs, specify the argument --overrides "tensor_parallel_shards=$NGPU". For example,
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --overrides "tensor_parallel_shards=2"
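Equivalently, you can keep the GPU count in a shell variable, as in the $NGPU form above (NGPU is just an illustrative variable name):

NGPU=2
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --overrides "tensor_parallel_shards=$NGPU"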
The mlc_llm chat Command¶
We provide the chat CLI interface below for reference.
mlc_llm chat MODEL [--model-lib PATH-TO-MODEL-LIB] [--device DEVICE] [--overrides OVERRIDES]
- MODEL The model folder after compiling with the MLC-LLM build process. The parameter can either be the model name with its quantization scheme (e.g. Llama-2-7b-chat-hf-q4f16_1), or a full path to the model folder. In the former case, we will use the provided name to search for the model folder over possible paths.
- --model-lib A field to specify the full path to the model library file to use (e.g. a .so file).
- --device The description of the device to run on. The user should provide a string in the form of device_name:device_id or device_name, where device_name is one of cuda, metal, vulkan, rocm, opencl, and auto (automatically detect the local device), and device_id is the id of the device to run on. The default value is auto, with the device id set to 0.
- --overrides Model configuration overrides. Supports overriding context_window_size, prefill_chunk_size, sliding_window_size, attention_sink_size, and tensor_parallel_shards. The overrides can be explicitly specified via individual knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128".
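Putting these flags together, an illustrative invocation that reuses the model from the quick start and the override values from the example above:

mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --device cuda:0 --overrides "context_window_size=1024;prefill_chunk_size=128"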