Configure MLCChat in JSON¶
This page explains the components of a chat configuration and how to customize them for your own purposes.
Each MLC Chat runtime can be configured via an mlc-chat-config.json
file under the directory of each compiled model (e.g.
the RedPajama chat config).
You can customize the chat behavior by modifying this file.
Additionally, the runtimes provide APIs to optionally override some of the configurations.
Structure of MLCChat Configuration¶
Below is the mlc-chat-config.json
file corresponding to the Llama-2 model:
// mlc-chat-config.json
{
  "model_lib": "Llama-2-7b-chat-hf-q4f16_1",
  "local_id": "Llama-2-7b-chat-hf-q4f16_1",
  "conv_template": "llama-2",
  "temperature": 0.7,
  "repetition_penalty": 1.0,
  "top_p": 0.95,
  "mean_gen_len": 128,
  "max_gen_len": 512,
  "shift_fill_factor": 0.3,
  "tokenizer_files": [
    "added_tokens.json",
    "tokenizer.json",
    "tokenizer.model"
  ],
  "model_category": "llama",
  "model_name": "Llama-2-7b-chat-hf"
}
The following fields contain metadata that affects system behavior.
model_lib
The model library required to launch this model architecture. We recommend reusing a model library whenever possible. For example, all LLaMA-7B models compiled with q4f16_1 quantization can use Llama-2-7b-chat-hf-q4f16_1, so you can distribute LLaMA-7B weight variants and still use them in prebuilt MLC chat apps (see the sketch after this list).
local_id
Uniquely identifies the model in the application. It is also used by the command-line interface app to specify which model to run.
tokenizer_files
The list of tokenizer files of the model.
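As a minimal, hedged sketch of such reuse (the local_id below is hypothetical, and only the relevant fields are shown), a fine-tuned LLaMA-7B weight variant could keep the prebuilt model_lib and change only local_id:
// mlc-chat-config.json for a hypothetical weight variant
{
  // ...
  "model_lib": "Llama-2-7b-chat-hf-q4f16_1",
  "local_id": "my-finetuned-llama-7b-q4f16_1"
}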
The following fields can be customized to change the behavior of the model:
conv_template
The name of the conversation template that this chat uses. For more information, please refer to conversation structure.
temperature
The temperature applied to the logits before sampling. The default value is 0.7. A higher temperature encourages more diverse outputs, while a lower temperature produces more deterministic outputs.
repetition_penalty
The repetition penalty controls the likelihood of the model generating repeated text. The default value is 1.0, meaning no repetition penalty is applied. Increasing the value reduces the likelihood of repeated text generation, but setting repetition_penalty too high may cause the model to generate meaningless text. The ideal value varies among models. For more details on how the repetition penalty controls text generation, please check out the CTRL paper.
top_p
This parameter determines the set of tokens from which we sample during decoding. The default value is 0.95. At each step, we select tokens from the minimal set whose cumulative probability exceeds top_p. For additional information on top-p sampling, please refer to this blog post. A sketch of overriding these sampling fields follows this list.
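The following is a hedged sketch of tuning these fields in mlc-chat-config.json; the values are arbitrary illustrations rather than recommended settings:
// mlc-chat-config.json
{
  // ...
  "temperature": 0.5,
  "repetition_penalty": 1.1,
  "top_p": 0.9
}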
Conversation Structure¶
There are three options for loading conversation configurations:
Load from pre-defined conversation templates.
Load from JSON format conversation configuration.
First load from pre-defined conversation templates, then override some fields with JSON format conversation configuration.
Load from Pre-defined Conversation Templates¶
MLC-LLM provides a set of pre-defined conversation templates, which you can use directly by specifying the template name in the conv_template
field of mlc-chat-config.json.
Below is an incomplete list of supported conversation templates:
llama-2
vicuna_v1.1
redpajama_chat
rwkv
dolly
…
Please refer to conv_template.cc for the full list of supported templates and their implementations.
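For example, to switch to the RedPajama chat template, the config only needs to name it (a minimal sketch; other fields are omitted):
// mlc-chat-config.json
{
  // ...
  "conv_template": "redpajama_chat"
}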
Load from JSON Conversation Configuration¶
Below is a generic structure of a JSON conversation configuration (we use vicuna as an example):
// mlc-chat-config.json
{
  // ...
  "conv_config": {
    "seps": [
      " ",
      "<\/s>"
    ],
    "stop_tokens": [
      2
    ],
    "offset": 0,
    "separator_style": 0,
    "messages": [],
    "stop_str": "<\/s>",
    "roles": [
      "USER",
      "ASSISTANT"
    ],
    "role_msg_sep": ": ",
    "role_empty_sep": ": ",
    "system": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.",
    "add_bos": true,
    "name": "vicuna_v1.1"
  }
}
roles
An array that describes the role names of the user and the model. These names are specific to the model being used.
system
The prompt encoded before starting the chat. It can be customized to a user-defined prompt.
add_bos
Determines whether a beginning-of-string (bos) token should be added before the input tokens.
stop_str
When the stop_str is encountered, the model will stop generating output.
stop_tokens
A list of token IDs that act as stop tokens.
seps
An array of strings indicating the separators to be used after a user message and a model message respectively.
messages
The chat history represented as an array of string pairs in the following format:
[[role_0, msg_0], [role_1, msg_1], ...]
offset
The offset used to begin the chat from the chat history. When offset is not 0, messages[0:offset-1] will be encoded.
separator_style
Specifies whether we are in chat-bot mode (0) or pure LM prompt mode (1).
role_msg_sep
A string indicating the separator between a role and a message.
role_empty_sep
A string indicating the separator to append to a role when there is no message yet.
When the value of separator_style
is set to 0 (or kSepRoleMsg
), each round of conversation follows the format:
{role[0]}{role_msg_sep}{user_input}{sep[0]}
{role[1]}{role_msg_sep}{model_output}{sep[1]}
Here, {user_input}
represents the input provided by the user, and {model_output}
represents the output generated by the model.
On the other hand, if the value of separator_style
is set to 1 (or kLM
), the model is not aware of the chat history and generates the response immediately after the user input prompt:
{user_prompt}{model_output}
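For illustration, substituting the Vicuna configuration shown earlier (roles USER/ASSISTANT, role_msg_sep ": ", seps [" ", "</s>"]) into the chat-bot format gives a round like the following; the user input and model output here are made-up placeholders:
{role[0]}{role_msg_sep}{user_input}{sep[0]}   ->  "USER: Hello! "
{role[1]}{role_msg_sep}{model_output}{sep[1]} ->  "ASSISTANT: Hi, nice to meet you.</s>"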
Customize Conversation Template¶
In the mlc-chat-config.json file, you have the option to specify both conv_template and conv_config. MLC-LLM will first load the predefined template whose name is given in conv_template and then override it with the fields specified in conv_config. Note that conv_config does not need to be complete; partial updates are allowed.
Example 1: Replace System Prompt¶
If you’re tired of the default system prompt, here’s an example of how you can replace it:
// mlc-chat-config.json
{
  // ...
  "conv_template": "vicuna_v1.1",
  "conv_config": {
    "system": "You are not Vicuna, your name is Guanaco, now let's chat!"
  }
}
The next time you run mlc_chat_cli
, you will start a chat with Vicuna using a new system prompt.
Example 2: Resume from Chat History¶
The following example demonstrates how to chat with Vicuna and resume from a chat history:
// mlc-chat-config.json
{
  // ...
  "conv_template": "vicuna_v1.1",
  "conv_config": {
    "messages": [
      ["USER", "Suppose we already have projects llama, alpaca and vicuna, what do you think would be a great name for the next project?"],
      ["ASSISTANT", "Based on the previous projects, a possible name for the next project could be \"cervidae\" which is the scientific name for deer family. This name reflects the collaboration and teamwork involved in the development of the project, and also nods to the previous projects that have been developed by the team."],
      ["USER", "I like cervidae, but the name is too long!"],
      ["ASSISTANT", "In that case, a shorter and catchier name for the next project could be \"DeerRun\" which plays on the idea of the project being fast and efficient, just like a deer running through the woods. This name is memorable and easy to pronounce, making it a good choice for a project name."]
    ],
    "offset": 4
  }
}
The next time you start mlc_chat_cli
, you will initiate a chat with Vicuna and resume from the provided chat history.