Python API and Gradio Frontend

We expose a Python API for MLC-Chat for easy integration into other Python projects. We also provide a web demo based on Gradio as an example of using the Python API to interact with MLC-Chat.

Python API

The Python API is part of the MLC-Chat package, for which we have prepared pre-built pip wheels; see the installation page.

Verify Installation

python -c "from mlc_chat import ChatModule; print(ChatModule)"

You are expected to see the information about the mlc_chat.ChatModule class.

If the prebuilt package is unavailable on your platform, or if you would like to build a runtime that supports a GPU backend other than the prebuilt one, please refer to our Build MLC-Chat Package From Source tutorial.

Get Started

After confirming that the package mlc_chat is installed, we can follow the steps below to chat with an MLC-compiled model in Python.

First, let us make sure that the MLC-compiled model we want to chat with already exists.

Note

model has the format f"{model_name}-{quantize_mode}". For instance, if you used q4f16_1 as the quantize_mode to compile Llama-2-7b-chat-hf, you would have model being Llama-2-7b-chat-hf-q4f16_1.

If you do not have the MLC-compiled model ready:

If you downloaded prebuilt models from MLC LLM, by default:

  • Model lib should be placed at ./dist/prebuilt/lib/$(model)-$(arch).$(suffix).

  • Model weights and chat config are located under ./dist/prebuilt/mlc-chat-$(model)/.

Note

Please make sure that you have the same directory structure as above, because the Python API relies on it to automatically search for the model lib and weights. If you would like to provide a full model lib path directly to override the auto-search, you can specify ChatModule.model_lib_path.

Example
>>> ls -l ./dist/prebuilt/lib
Llama-2-7b-chat-hf-q4f16_1-metal.so  # Format: $(model)-$(arch).$(suffix)
Llama-2-7b-chat-hf-q4f16_1-vulkan.so
...
>>> ls -l ./dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1  # Format: ./dist/prebuilt/mlc-chat-$(model)/
# chat config:
mlc-chat-config.json
# model weights:
ndarray-cache.json
params_shard_*.bin
...

After making sure that the files exist, activate the conda environment in which you installed mlc_chat. From the mlc-llm directory, create a Python file sample_mlc_chat.py and paste the following lines:

from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

# From the mlc-llm directory, run
# $ python sample_mlc_chat.py

# Create a ChatModule instance
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")
# You can change to other models that you downloaded, for example,
# cm = ChatModule(model="Llama-2-13b-chat-hf-q4f16_1")  # Llama2 13b model

output = cm.generate(
   prompt="What is the meaning of life?",
   progress_callback=StreamToStdout(callback_interval=2),
)

# Print prefill and decode performance statistics
print(f"Statistics: {cm.stats()}\n")

output = cm.generate(
   prompt="How many points did you list out?",
   progress_callback=StreamToStdout(callback_interval=2),
)

# Reset the chat module by
# cm.reset_chat()

Now run the Python file to start the chat:

python sample_mlc_chat.py

You can also check out the Model Prebuilts page to run other models.

See output
Using model folder: ./dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1
Using mlc chat config: ./dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1/mlc-chat-config.json
Using library model: ./dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-cuda.so

Thank you for your question! The meaning of life is a complex and subjective topic that has been debated by philosophers, theologians, scientists, and many others for centuries. There is no one definitive answer to this question, as it can vary depending on a person's beliefs, values, experiences, and perspectives.

However, here are some possible ways to approach the question:

1. Religious or spiritual beliefs: Many people believe that the meaning of life is to fulfill a divine or spiritual purpose, whether that be to follow a set of moral guidelines, to achieve spiritual enlightenment, or to fulfill a particular destiny.
2. Personal growth and development: Some people believe that the meaning of life is to learn, grow, and evolve as individuals, to develop one's talents and abilities, and to become the best version of oneself.
3. Relationships and connections: Others believe that the meaning of life is to form meaningful connections and relationships with others, to love and be loved, and to build a supportive and fulfilling social network.
4. Contribution and impact: Some people believe that the meaning of life is to make a positive impact on the world, to contribute to society in a meaningful way, and to leave a lasting legacy.
5. Simple pleasures and enjoyment: Finally, some people believe that the meaning of life is to simply enjoy the present moment, to find pleasure and happiness in the simple things in life, and to appreciate the beauty and wonder of the world around us.

Ultimately, the meaning of life is a deeply personal and subjective question, and each person must find their own answer based on their own beliefs, values, and experiences.

Statistics: prefill: 3477.5 tok/s, decode: 153.6 tok/s

I listed out 5 possible ways to approach the question of the meaning of life.

Note

You could also specify the paths of model and model_lib_path explicitly. If you only specify model as f"{model_name}-{quantize_mode}", we will search over possible paths for you. See more in the documentation of mlc_chat.ChatModule.__init__().
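
For example, the following sketch bypasses the auto-search by passing explicit paths (the paths below assume the default prebuilt layout and a CUDA library; adjust them to your platform):

from mlc_chat import ChatModule

# A minimal sketch: point ChatModule directly at the weights folder and the
# compiled model library instead of relying on the auto-search.
cm = ChatModule(
    model="./dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1",
    model_lib_path="./dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-cuda.so",
)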

Tutorial with Python Notebooks

Now that you have tried out how to chat with the model in Python, we recommend checking out the following Python notebook tutorials (all runnable in Colab):

Configure MLCChat in Python

If you have checked out Configure MLCChat in JSON, you know that you can configure MLCChat through various fields such as temperature. We provide the option of overriding any field you’d like in Python, so that you do not need to manually edit mlc-chat-config.json.

Since there are two concepts – MLCChat Configuration and Conversation Configuration – we correspondingly provide two dataclasses mlc_chat.ChatConfig and mlc_chat.ConvConfig.

We provide an example below.

from mlc_chat import ChatModule, ChatConfig, ConvConfig
from mlc_chat.callback import StreamToStdout

# Using a `ConvConfig`, we modify `system`, a field in the conversation template
# `system` refers to the prompt encoded before starting the chat
conv_config = ConvConfig(system='Please show as much happiness as you can when talking to me.')

# We then include the `ConvConfig` instance in `ChatConfig` while overriding `max_gen_len`
# Note that `conv_config` is an optional subfield of `chat_config`
chat_config = ChatConfig(max_gen_len=256, conv_config=conv_config)

# Using the `chat_config` we created, instantiate a `ChatModule`
cm = ChatModule('Llama-2-7b-chat-hf-q4f16_1', chat_config=chat_config)

output = cm.generate(
   prompt="What is one plus one?",
   progress_callback=StreamToStdout(callback_interval=2),
)

# You could also pass in a `ConvConfig` instance to `reset_chat()`
conv_config = ConvConfig(system='Please show as much sadness as you can when talking to me.')
chat_config = ChatConfig(max_gen_len=128, conv_config=conv_config)
cm.reset_chat(chat_config)

output = cm.generate(
   prompt="What is one plus one?",
   progress_callback=StreamToStdout(callback_interval=2),
)
See output
Using model folder: ./dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1
Using mlc chat config: ./dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1/mlc-chat-config.json
Using library model: ./dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-cuda.so

Oh, wow, *excitedly* one plus one? *grinning* Well, let me see... *counting on fingers* One plus one is... *eureka* Two!
...

*Sobs* Oh, the tragedy of it all... *sobs* One plus one... *chokes back tears* It's... *gulps* it's... *breaks down in tears* TWO!
...

Note

You do not need to specify the entire ChatConfig or ConvConfig. Instead, we will first load all the fields defined in mlc-chat-config.json, a file required when instantiating a mlc_chat.ChatModule. Then, we will load in the optional ChatConfig you provide, overriding the fields specified.

It is also worth noting that ConvConfig itself overrides the original conversation template specified by the field conv_template in the chat configuration. Learn more about it in Configure MLCChat in JSON.

Raw Text Generation in Python

Raw text generation gives the user more flexibility over their prompts, without being forced to create a new conversation template, which makes prompt customization easier. It also serves APIs that need to handle LLM generation without the usual system prompts and other conversation structure.

We provide an example below.

from mlc_chat import ChatModule, ChatConfig, ConvConfig
from mlc_chat.callback import StreamToStdout

# Use a `ConvConfig` to define the generation settings
# Since the "LM" template only supports raw text generation,
# System prompts will not be executed even if provided
conv_config = ConvConfig(stop_tokens=[2,], add_bos=True, stop_str="[INST]")

# Note that `conv_config` is an optional subfield of `chat_config`
# The "LM" template serves the basic purposes of raw text generation
chat_config = ChatConfig(conv_config=conv_config, conv_template="LM")

# Using the `chat_config` we created, instantiate a `ChatModule`
cm = ChatModule('Llama-2-7b-chat-hf-q4f16_1', chat_config=chat_config)

# To make the model follow conversations, a chat structure should be provided
# This allows users to build their own prompts without building a new template
system_prompt = "<<SYS>>\nYou are a helpful, respectful and honest assistant.\n<</SYS>>\n\n"
inst_prompt = "What is mother nature?"

# Concatenate system and instruction prompts, and add instruction tags
output = cm.generate(
   prompt=f"[INST] {system_prompt+inst_prompt} [/INST]",
   progress_callback=StreamToStdout(callback_interval=2),
)

# The LM template has no memory, so it will be reset every single generation
# In this case the model will just follow normal text completion
# because there isn't a chat structure
output = cm.generate(
   prompt="Life is a quality that distinguishes",
   progress_callback=StreamToStdout(callback_interval=2),
)

Note

The LM template has no memory, which means that the context is cleared after every generation. Additionally, system prompts will not be run when instantiating a mlc_chat.ChatModule, unless explicitly given inside the prompt.

Stream Iterator in Python

The stream iterator gives users the option to stream generated text back to the caller of the API, instead of streaming to stdout, which can be necessary when building services on top of MLC Chat.

We provide an example below.

from mlc_chat import ChatModule
from mlc_chat.callback import StreamIterator

# Create a ChatModule instance
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")

# Stream to an Iterator
from threading import Thread

stream = StreamIterator(callback_interval=2)
generation_thread = Thread(
   target=cm.generate,
   kwargs={"prompt": "What is the meaning of life?", "progress_callback": stream},
)
generation_thread.start()

output = ""
for delta_message in stream:
   output += delta_message

generation_thread.join()

API Reference

Users can initiate a chat module by creating an instance of the mlc_chat.ChatModule class, which is a wrapper around an MLC-compiled model. The mlc_chat.ChatModule class provides the following methods:

class mlc_chat.ChatModule(model: str, device: str = 'auto', chat_config: Optional[ChatConfig] = None, model_lib_path: Optional[str] = None)

Bases: object

The ChatModule for MLC LLM.

Examples

from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

# Create a ChatModule instance
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")

# Generate a response for a given prompt
output = cm.generate(
    prompt="What is the meaning of life?",
    progress_callback=StreamToStdout(callback_interval=2),
)

# Print prefill and decode performance statistics
print(f"Statistics: {cm.stats()}\n")

output = cm.generate(
    prompt="How many points did you list out?",
    progress_callback=StreamToStdout(callback_interval=2),
)
Parameters:
  • model (str) – The model folder after compiling with MLC-LLM build process. The parameter can either be the model name with its quantization scheme (e.g. Llama-2-7b-chat-hf-q4f16_1), or a full path to the model folder. In the former case, we will use the provided name to search for the model folder over possible paths.

  • device (str) – The description of the device to run on. User should provide a string in the form of ‘device_name:device_id’ or ‘device_name’, where ‘device_name’ is one of ‘cuda’, ‘metal’, ‘vulkan’, ‘rocm’, ‘opencl’, ‘auto’ (automatically detect the local device), and ‘device_id’ is the device id to run on. If no ‘device_id’ is provided, it will be set to 0 by default.

  • chat_config (Optional[ChatConfig]) – A ChatConfig instance partially filled. Will be used to override the mlc-chat-config.json.

  • model_lib_path (Optional[str]) – The full path to the model library file to use (e.g. a .so file). If unspecified, we will use the provided model to search over possible paths.
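
For example, a minimal sketch that selects the device explicitly; the model name and device string are assumptions to adapt to your setup:

from mlc_chat import ChatModule

# Run on the first CUDA GPU, using the 'device_name:device_id' form described above
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1", device="cuda:0")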

__init__(model: str, device: str = 'auto', chat_config: Optional[ChatConfig] = None, model_lib_path: Optional[str] = None)

benchmark_generate(prompt: str, generate_length: int) -> str

Controlled generation with input prompt and fixed number of generated tokens, ignoring system prompt. For example,

from mlc_chat import ChatModule

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")
output = cm.benchmark_generate("What's the meaning of life?", generate_length=256)
print(f"Generated text:\n{output}\n")
print(f"Statistics: {cm.stats()}")

will generate 256 tokens in total based on prompt “What’s the meaning of life?”. After generation, you can use cm.stats() to print the generation speed.

Notes

1. This function is typically used in controlled benchmarks. It generates text without a system prompt (i.e., it is pure text generation with no chat style) and ignores the model's stop token(s).

2. To make the benchmark as accurate as possible, we first do a round of warmup prefill and decode before text generation.

3. This function resets the previous performance statistics.

Parameters:
  • prompt (str) – The prompt of the text generation.

  • generate_length (int) – The target length of generation.

Returns:

output – The generated text output.

Return type:

str

embed_text(input: str)

Given a text input, returns its embedding in the LLM.

Parameters:

input (str) – The user input string.

Returns:

embedding – The embedding of the text.

Return type:

tvm.runtime.NDArray

Note

This is a high-level method and is only used for retrieving text embeddings. Users are not supposed to call generate() after calling this method in the same chat session, since the input to this method is not prefilled and will cause an error. If a user needs to call generate() later, please call reset_chat() first. For a more fine-grained embedding API, see _embed().
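
For example, a minimal sketch (the model name is an assumption):

from mlc_chat import ChatModule

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")

# Retrieve the embedding of a text input as a tvm.runtime.NDArray
embedding = cm.embed_text("What is the meaning of life?")
print(embedding.shape)

# Reset the chat session before calling generate() again
cm.reset_chat()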

generate(prompt: Union[str, List[ChatMessage]], generation_config: Optional[GenerationConfig] = None, progress_callback=None) -> Union[str, List[str]]

A high-level method that returns the full response from the chat module given a user prompt. The user can optionally specify which callback method to use upon receiving the response. By default, no callback will be applied.

Parameters:
  • prompt (Union[str, List[ChatMessage]]) –

    The user input prompt, i.e. a question to ask the chat module. It can also be the whole conversation history (a list of messages with role and content), e.g.:

    [
        ChatMessage(role="user", content="Hello, how are you?"),
        ChatMessage(role="assistant", content="I'm fine, thank you. How about you?"),
        ChatMessage(role="user", content="I'm good too."),
    ]
    

  • generation_config (Optional[GenerationConfig]) – The generation config object to override the ChatConfig generation settings.

  • progress_callback (object) – The optional callback method used upon receiving a newly generated message from the chat module. See mlc_chat/callback.py for a full list of available callback classes. Currently, only the streaming-to-stdout callback is supported; see Examples for more detailed usage.

Returns:

output – The generated full output from the chat module.

Return type:

string

Examples

# Suppose we would like to stream the response of the chat module to stdout
# with a refresh interval of 2. Upon calling generate(), we will see the response of
# the chat module streaming to stdout piece by piece, and in the end we receive the
# full response as a single string `output`.

from mlc_chat import ChatModule, GenerationConfig, callback
cm = ChatModule(xxx)
prompt = "what's the color of banana?"
output = cm.generate(
  prompt, GenerationConfig(temperature=0.8), callback.StreamToStdout(callback_interval=2)
)
print(output)
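
The sketch below passes a whole conversation history instead of a single string. The import location of ChatMessage is an assumption here; adjust it to where ChatMessage is defined in your installed version.

# Import path of ChatMessage is assumed; adjust if it lives elsewhere
from mlc_chat import ChatModule, ChatMessage

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")
history = [
    ChatMessage(role="user", content="Hello, how are you?"),
    ChatMessage(role="assistant", content="I'm fine, thank you. How about you?"),
    ChatMessage(role="user", content="I'm good too."),
]
output = cm.generate(prompt=history)
print(output)
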
reset_chat(chat_config: Optional[ChatConfig] = None)

Reset the chat session, clear all chat history, and potentially override the original mlc-chat-config.json.

Parameters:

chat_config (Optional[ChatConfig]) – A ChatConfig instance partially filled. If specified, the chat module will reload the mlc-chat-config.json, and override it with chat_config, just like in initialization.

Note

The model remains the same after reset_chat(). To reload module, please either re-initialize a ChatModule instance or use _reload() instead.
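
For example, a minimal sketch (the model name and the overridden field are assumptions):

from mlc_chat import ChatModule, ChatConfig

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")
output = cm.generate(prompt="What is the meaning of life?")

# Clear the history and start a new session with a lower temperature
cm.reset_chat(ChatConfig(temperature=0.5))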

stats(verbose=False) -> str

Get the runtime stats of the encoding step, decoding step (and embedding step if it exists) of the chat module in text form.

Returns:

stats – The runtime stats text.

Return type:

str
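
For example (the model name is an assumption; the exact stats text depends on your model and hardware):

from mlc_chat import ChatModule

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")
cm.generate(prompt="What is the meaning of life?")

# Print prefill/decode speed; pass verbose=True for more detailed stats
print(cm.stats())
print(cm.stats(verbose=True))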

class mlc_chat.ChatConfig(model_lib: Optional[str] = None, local_id: Optional[str] = None, conv_template: Optional[str] = None, temperature: Optional[float] = None, repetition_penalty: Optional[float] = None, top_p: Optional[float] = None, mean_gen_len: Optional[int] = None, max_gen_len: Optional[int] = None, shift_fill_factor: Optional[float] = None, tokenizer_files: Optional[List[str]] = None, conv_config: Optional[ConvConfig] = None, model_category: Optional[str] = None, model_name: Optional[str] = None, num_shards: Optional[int] = None, use_presharded_weights: Optional[bool] = None, max_window_size: Optional[int] = None)

A dataclass that represents user-defined partial configuration for the chat config file.

An instance of ChatConfig can be passed in to the instantiation of a mlc_chat.ChatModule instance to override the default setting in mlc-chat-config.json under the model folder.

Since the configuration is partial, everything will be Optional.

Note that we will exploit this class to also represent mlc-chat-config.json during intermediate processing.

Parameters:
  • model_lib (Optional[str]) – The necessary model library to launch this model architecture. We recommend reusing the model library when possible. For example, all LLaMA-7B models can use vicuna-v1-7b-{matching quantization scheme}, so you can distribute LLaMA-7B weight variants and still use them in prebuilt MLC chat apps.

  • local_id (Optional[str]) – Uniquely identifies the model in the application. This is also used by the command line interface app to specify which model to run.

  • conv_template (Optional[str]) – The name of the conversation template that this chat uses.

  • temperature (Optional[float]) – The temperature applied to logits before sampling. The default value is 0.7. A higher temperature encourages more diverse outputs, while a lower temperature produces more deterministic outputs.

  • repetition_penalty (Optional[float]) –

    The repetition penalty controls the likelihood of the model generating repeated texts. The default value is set to 1.0, indicating that no repetition penalty is applied. Increasing the value reduces the likelihood of repeat text generation. However, setting a high repetition_penalty may result in the model generating meaningless texts. The ideal choice of repetition penalty may vary among models.

    For more details on how repetition penalty controls text generation, please check out the CTRL paper (https://arxiv.org/pdf/1909.05858.pdf).

  • top_p (Optional[float]) –

    This parameter determines the set of tokens from which we sample during decoding. The default value is set to 0.95. At each step, we select tokens from the minimal set that has a cumulative probability exceeding the top_p parameter.

    For additional information on top-p sampling, please refer to this blog post: https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling.

  • mean_gen_len (Optional[int]) – The approximated average number of generated tokens in each round. Used to determine whether the maximum window size would be exceeded.

  • max_gen_len (Optional[int]) – The maximum number of tokens to be generated in each round. Would simply stop generating after this number is exceeded.

  • shift_fill_factor (Optional[float]) – The fraction of maximum window size to shift when it is exceeded.

  • tokenizer_files (Optional[List[str]]) – List of tokenizer files of the model.

  • conv_config (Optional[ConvConfig]) – The partial overriding configuration for conversation template. Will first load the predefined template with the name specified in conv_template and then override some of the configurations specified in conv_config.

  • model_category (Optional[str]) – The category of the model’s architecture (e.g. llama, gpt_neox, rwkv).

  • model_name (Optional[str]) – Name of the model (e.g. Llama-2-7b-chat-hf).

  • num_shards (Optional[int]) – Tensor parallel degree.

  • use_presharded_weights (Optional[bool]) – If True, the weights were saved with sharding already applied.

  • max_window_size (Optional[int]) – Maximum kv cache window size.
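
For example, a minimal sketch of a partial override (the model name and field values are assumptions):

from mlc_chat import ChatModule, ChatConfig

# Only the listed fields are overridden; everything else is loaded from
# mlc-chat-config.json under the model folder.
chat_config = ChatConfig(temperature=0.9, top_p=0.9, max_gen_len=512)
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1", chat_config=chat_config)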

class mlc_chat.ConvConfig(name: Optional[str] = None, system: Optional[str] = None, roles: Optional[List[str]] = None, messages: Optional[List[List[str]]] = None, offset: Optional[int] = None, separator_style: Optional[int] = None, seps: Optional[List[str]] = None, role_msg_sep: Optional[str] = None, role_empty_sep: Optional[str] = None, stop_str: Optional[str] = None, stop_tokens: Optional[List[int]] = None, prefix_tokens: Optional[List[int]] = None, add_bos: Optional[bool] = None)

A dataclass that represents user-defined partial configuration for conversation template.

This is an attribute of mlc_chat.ChatConfig, which can then be passed in to the instantiation of a mlc_chat.ChatModule instance to override the default setting in mlc-chat-config.json under the model folder. Note that we will first load the predefined template with the name specified in conv_template.

Since the configuration is partial, everything will be Optional.

Parameters:
  • name (Optional[str]) – Name of the conversation.

  • system (Optional[str]) – The prompt encoded before starting the chat.

  • roles (Optional[List[str]]) – An array that describes the role names of the user and the model. These names are specific to the model being used.

  • messages (Optional[List[List[str]]]) – The chat history represented as an array of string pairs in the following format: [[role_0, msg_0], [role_1, msg_1], ...].

  • offset (Optional[int]) – The offset used to begin the chat from the chat history. When offset is not 0, messages[0:offset-1] will be encoded.

  • separator_style (Optional[int]) – Specifies whether we are in chat-bot mode (0) or pure LM prompt mode (1).

  • seps (Optional[List[str]]) – An array of strings indicating the separators to be used after a user message and a model message respectively.

  • role_msg_sep (Optional[str]) – A string indicating the separator between a role and a message.

  • role_empty_sep (Optional[str]) – A string indicating the separator to append to a role when there is no message yet.

  • stop_str (Optional[str]) – When the stop_str is encountered, the model will stop generating output.

  • stop_tokens (Optional[List[int]]) – A list of token IDs that act as stop tokens.

  • prefix_tokens (Optional[List[int]]) – Token list prefixing the conversation.

  • add_bos (Optional[bool]) – Determines whether a beginning-of-string (bos) token should be added before the input tokens.
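
For example, a minimal sketch of a partial template override (the system prompt and stop string below are illustrative values):

from mlc_chat import ChatModule, ChatConfig, ConvConfig

# Override only the system prompt and stop string of the predefined template;
# the remaining fields keep the values of the template named in conv_template.
conv_config = ConvConfig(
    system="You are a concise assistant that answers in one sentence.",
    stop_str="</s>",
)
cm = ChatModule(
    model="Llama-2-7b-chat-hf-q4f16_1",
    chat_config=ChatConfig(conv_config=conv_config),
)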

class mlc_chat.GenerationConfig(temperature: Optional[float] = None, repetition_penalty: Optional[float] = None, top_p: Optional[float] = None, mean_gen_len: Optional[int] = None, max_gen_len: Optional[int] = None, presence_penalty: Optional[float] = 0.0, frequency_penalty: Optional[float] = 0.0, n: Optional[int] = None, stop: Optional[Union[str, List[str]]] = None)

A dataclass that represents user-defined generation configuration.

An instance of GenerationConfig can be passed in to the generate function of a mlc_chat.ChatModule instance to override the default generation setting in mlc-chat-config.json and ChatConfig under the model folder.

Once the generation ends, GenerationConfig is discarded, since the values will only override the ChatConfig generation settings during one generation, unless it is recurrently passed to generate function. This allows changing generation settings over time, without overriding ChatConfig permanently.

Since the configuration is partial, everything will be Optional.

Parameters:
  • temperature (Optional[float]) – The temperature applied to logits before sampling. The default value is 0.7. A higher temperature encourages more diverse outputs, while a lower temperature produces more deterministic outputs.

  • presence_penalty (Optional[float]) – Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics. Negative values can increase the likelihood of repetition.

  • frequency_penalty (Optional[float]) – Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim. Negative values can increase the likelihood of repetition.

  • repetition_penalty (Optional[float]) –

    The repetition penalty controls the likelihood of the model generating repeated texts. The default value is set to 1.0, indicating that no repetition penalty is applied. Increasing the value reduces the likelihood of repeated text generation. However, setting a high repetition_penalty may result in the model generating meaningless texts. The ideal choice of repetition penalty may vary among models. Only active when presence_penalty and frequency_penalty are both 0.0.

    For more details on how repetition penalty controls text generation, please check out the CTRL paper (https://arxiv.org/pdf/1909.05858.pdf).

  • top_p (Optional[float]) –

    This parameter determines the set of tokens from which we sample during decoding. The default value is set to 0.95. At each step, we select tokens from the minimal set that has a cumulative probability exceeding the top_p parameter.

    For additional information on top-p sampling, please refer to this blog post: https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling.

  • mean_gen_len (Optional[int]) – The approximated average number of generated tokens in each round. Used to determine whether the maximum window size would be exceeded.

  • max_gen_len (Optional[int]) – This parameter determines the maximum length of the generated text. If it is not set, the model will generate text until it encounters a stop token.

  • n (Optional[int]) – This parameter determines the number of text samples to generate. The default value is 1. Note that this parameter is only used when stream is set to False.

  • stop (Optional[Union[str, List[str]]]) – When stop is encountered, the model will stop generating output. It can be a string or a list of strings. If it is a list of strings, the model will stop generating output when any of the strings in the list is encountered. Note that this parameter does not override the default stop string of the model.
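
For example, a minimal per-call override (the model name and parameter values are assumptions):

from mlc_chat import ChatModule, GenerationConfig

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")

# These settings apply to this call only and do not change the ChatConfig
output = cm.generate(
    prompt="Name three colors.",
    generation_config=GenerationConfig(temperature=0.6, presence_penalty=0.5, stop="\n\n"),
)
print(output)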

Gradio Frontend

The Gradio frontend provides a web interface for the MLC-Chat model, which allows users to interact with the model in a more user-friendly way and switch between different models to compare performance. To use the Gradio frontend, you need to install Gradio first:

pip install gradio

Then you can run the following code to start the interface:

python -m mlc_chat.gradio --artifact-path ARTIFACT_PATH [--device DEVICE] [--port PORT_NUMBER] [--share]
--artifact-path

Please provide a path containing all the model folders you wish to use. The default value is dist.

--device

The description of the device to run on. User should provide a string in the form of ‘device_name:device_id’ or ‘device_name’, where ‘device_name’ is one of ‘cuda’, ‘metal’, ‘vulkan’, ‘rocm’, ‘opencl’, ‘auto’ (automatically detect the local device), and ‘device_id’ is the device id to run on. If no ‘device_id’ is provided, it will be set to 0. The default value is auto.

--port

The port number to run gradio. The default value is 7860.

--share

Whether to create a publicly shareable link for the interface.
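
For instance, to serve the prebuilt models downloaded under ./dist/prebuilt on the first CUDA GPU with a shareable link (the artifact path and device here are illustrative assumptions):

python -m mlc_chat.gradio --artifact-path ./dist/prebuilt --device cuda:0 --port 7860 --share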

After setting it up properly, you are expected to see the following interface in your browser:

https://raw.githubusercontent.com/mlc-ai/web-data/main/images/mlc-llm/tutorials/gradio-interface.png