Python API

Note

This page introduces the Python API with MLCEngine in MLC LLM.

MLC LLM provides its Python API through the classes mlc_llm.MLCEngine and mlc_llm.AsyncMLCEngine, which offer full OpenAI API compatibility for easy integration into other Python projects.

This page explains how to use the engines in MLC LLM. The Python API is part of the MLC-LLM package, for which pre-built pip wheels are available via the installation page.

Verify Installation

python -c "from mlc_llm import MLCEngine; print(MLCEngine)"

You should see the output <class 'mlc_llm.serve.engine.MLCEngine'>.

If the command above results in an error, follow Install MLC LLM Python Package to install the prebuilt pip packages or build MLC LLM from source.
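The same check can be made from inside a Python program, for example before enabling a code path that depends on MLC LLM. The sketch below uses only the standard library; the helper name mlc_llm_available is our own and is not part of MLC LLM.

```python
import importlib.util


def mlc_llm_available() -> bool:
    """Return True if the mlc_llm package can be located by this interpreter."""
    # find_spec returns None when a top-level package cannot be found.
    return importlib.util.find_spec("mlc_llm") is not None


if mlc_llm_available():
    print("mlc_llm is installed")
else:
    print("mlc_llm is not installed; see the installation page")
```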

Run MLCEngine

mlc_llm.MLCEngine provides a synchronous interface to OpenAI chat completion. Because of its synchronous design, mlc_llm.MLCEngine does not batch concurrent requests; use AsyncMLCEngine when you need request batching.

Stream Response. In Quick Start and Introduction to MLC LLM, we introduced the basic use of mlc_llm.MLCEngine.

from mlc_llm import MLCEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Run chat completion in OpenAI API.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()

This code example first creates an mlc_llm.MLCEngine instance with the 8B Llama-3 model, then streams a chat completion and prints each piece of text as it arrives, and finally terminates the engine. The mlc_llm.MLCEngine Python API is designed to align with the OpenAI API, so you can use it the same way you use OpenAI’s Python package, for both synchronous and asynchronous generation.
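Each streamed response carries only a fragment of the reply in choice.delta.content, so a caller that wants the whole reply as one string has to join the fragments. A minimal sketch of that pattern follows. The helper name collect_stream is hypothetical, the `or ""` guard is a defensive assumption about empty fragments, and the code assumes the default of one choice per response.

```python
from typing import Any


def collect_stream(engine: Any, model: str, prompt: str) -> str:
    """Run one streaming chat completion and return the concatenated reply text."""
    pieces = []
    for response in engine.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        stream=True,
    ):
        for choice in response.choices:
            # Defensive assumption: treat a missing fragment as an empty string.
            pieces.append(choice.delta.content or "")
    return "".join(pieces)
```

With the engine and model from the example above, collect_stream(engine, model, "What is the meaning of life?") would return the reply as a single string instead of printing it piece by piece.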

Non-stream Response. The code example above uses the synchronous chat completion interface and iterates over all the stream responses. If you want to run without streaming, you can run

response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=False,
)
print(response)

Please refer to OpenAI’s Python package and OpenAI chat completion API for the complete chat completion interface.
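Printing the response shows the whole response object. Since the interface follows OpenAI chat completion, the reply text is expected under choices[i].message.content. The sketch below assumes that layout; the helper name reply_texts is hypothetical.

```python
from typing import Any, List


def reply_texts(response: Any) -> List[str]:
    """Return the message text of every choice in a non-stream chat completion."""
    # Assumes the OpenAI chat completion layout: response.choices[i].message.content.
    return [choice.message.content for choice in response.choices]
```

For a single-choice request, reply_texts(response)[0] would be the reply text.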

Note

If you want to enable tensor parallelism to run LLMs on multiple GPUs, pass an EngineConfig that sets tensor_parallel_shards through the engine_config argument of the MLCEngine constructor. For example,

from mlc_llm import MLCEngine
from mlc_llm.serve.config import EngineConfig

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(
    model,
    engine_config=EngineConfig(tensor_parallel_shards=2),
)

Run AsyncMLCEngine

mlc_llm.AsyncMLCEngine provides an asynchronous interface to OpenAI chat completion. We recommend using mlc_llm.AsyncMLCEngine to batch concurrent requests for better throughput.

Stream Response. The core use of mlc_llm.AsyncMLCEngine for stream responses is as follows.

async for response in await engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)

The following is a complete runnable example of AsyncMLCEngine in Python.

import asyncio
from typing import Dict

from mlc_llm.serve import AsyncMLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
prompts = [
    "Write a three-day travel plan to Pittsburgh.",
    "What is the meaning of life?",
]


async def test_completion():
    # Create engine
    async_engine = AsyncMLCEngine(model=model)

    num_requests = len(prompts)
    output_texts: Dict[str, str] = {}

    async def generate_task(prompt: str):
        async for response in await async_engine.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            model=model,
            stream=True,
        ):
            if response.id not in output_texts:
                output_texts[response.id] = ""
            output_texts[response.id] += response.choices[0].delta.content

    tasks = [asyncio.create_task(generate_task(prompts[i])) for i in range(num_requests)]
    await asyncio.gather(*tasks)

    # Print output.
    for request_id, output in output_texts.items():
        print(f"Output of request {request_id}:\n{output}\n")

    async_engine.terminate()


asyncio.run(test_completion())
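The example above groups the outputs by response id. If you would rather get the outputs back in the same order as the prompts, asyncio.gather already preserves argument order. A minimal sketch, with hypothetical helper names generate_one and generate_all, and the same defensive `or ""` assumption about empty fragments:

```python
import asyncio
from typing import Any, List


async def generate_one(engine: Any, model: str, prompt: str) -> str:
    """Stream one chat completion and return the concatenated reply text."""
    pieces = []
    async for response in await engine.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        stream=True,
    ):
        for choice in response.choices:
            pieces.append(choice.delta.content or "")
    return "".join(pieces)


async def generate_all(engine: Any, model: str, prompts: List[str]) -> List[str]:
    """Issue all prompts concurrently; results follow the order of prompts."""
    # asyncio.gather returns results in the order of its arguments,
    # regardless of which request finishes first.
    return await asyncio.gather(*(generate_one(engine, model, p) for p in prompts))
```

Inside test_completion above, outputs = await generate_all(async_engine, model, prompts) would give outputs[i] as the reply to prompts[i].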

Non-stream Response. Similarly, mlc_llm.AsyncMLCEngine provides the non-stream response interface.

response = await engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=False,
)
print(response)

Please refer to OpenAI’s Python package and OpenAI chat completion API for the complete chat completion interface.

Note

If you want to enable tensor parallelism to run LLMs on multiple GPUs, pass an EngineConfig that sets tensor_parallel_shards through the engine_config argument of the AsyncMLCEngine constructor. For example,

from mlc_llm import AsyncMLCEngine
from mlc_llm.serve.config import EngineConfig

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = AsyncMLCEngine(
    model,
    engine_config=EngineConfig(tensor_parallel_shards=2),
)

Engine Mode

To simplify engine configuration, the constructors of mlc_llm.MLCEngine and mlc_llm.AsyncMLCEngine accept an optional argument mode, which takes one of three values: "local", "interactive" or "server". The default mode is "local".

Each mode denotes a pre-defined engine configuration for a different use case. The mode controls the engine’s request concurrency and its KV cache token capacity (the maximum number of tokens that the engine’s KV cache can hold), and therefore affects the engine’s GPU memory usage.

In short,

  • mode "local" uses low request concurrency and low KV cache capacity. It suits cases with few concurrent requests where the user wants to save GPU memory.

  • mode "interactive" uses a request concurrency of 1 and low KV cache capacity. It is designed for interactive use cases such as chats and conversations.

  • mode "server" uses as much request concurrency and KV cache capacity as possible. It aims to fully utilize the GPU memory in large server scenarios with potentially many concurrent requests.

For system benchmarking, please select mode "server". Please refer to API Reference for detailed documentation of the engine mode.
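As a compact summary, the max batch size presets documented in the API Reference section of this page can be written as a lookup table. The table and the helper describe_mode below are illustrative only and are not exported by mlc_llm; None stands for a value the engine infers automatically.

```python
from typing import Dict, Optional

# Illustrative summary of the documented presets; this table is not part of mlc_llm.
# None means the engine infers the largest possible value automatically.
MODE_MAX_BATCH_SIZE: Dict[str, Optional[int]] = {
    "local": 4,
    "interactive": 1,
    "server": None,
}


def describe_mode(mode: str) -> str:
    """Return a one-line description of the max batch size a mode implies."""
    if mode not in MODE_MAX_BATCH_SIZE:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_MAX_BATCH_SIZE)}")
    size = MODE_MAX_BATCH_SIZE[mode]
    return f"{mode}: max batch size {'inferred automatically' if size is None else size}"


for name in MODE_MAX_BATCH_SIZE:
    print(describe_mode(name))
```

Selecting a mode is a single keyword argument to the constructor, for example MLCEngine(model, mode="server").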

Deploy Your Own Model with Python API

The introduction page explains how to deploy your own models with MLC LLM. This section shows how to use the model weights you converted and the model library you built with mlc_llm.MLCEngine and mlc_llm.AsyncMLCEngine.

We use Phi-2 as the example model.

Specify Model Weight Path. Assuming you have converted the model weights for your own model, you can construct an mlc_llm.MLCEngine as follows:

from mlc_llm import MLCEngine

model = "models/phi-2"  # Assuming the converted phi-2 model weights are under "models/phi-2"
engine = MLCEngine(model)

Specify Model Library Path. Further, if you build the model library on your own, you can use it in mlc_llm.MLCEngine by passing the library path through argument model_lib.

from mlc_llm import MLCEngine

model = "models/phi-2"
model_lib = "models/phi-2/lib.so"  # Assuming the phi-2 model library is built at "models/phi-2/lib.so"
engine = MLCEngine(model, model_lib=model_lib)

The same applies to mlc_llm.AsyncMLCEngine.
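A mistyped path is easier to diagnose before the engine starts loading. The sketch below checks that the model directory contains mlc-chat-config.json and that an explicitly given library file exists, then returns keyword arguments for either engine constructor. The helper name engine_args is hypothetical, and the check covers only local directories, not HF:// links or a direct path to mlc-chat-config.json.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def engine_args(model_dir: str, model_lib: Optional[str] = None) -> Dict[str, Any]:
    """Check local paths and return keyword arguments for the engine constructor."""
    config = Path(model_dir) / "mlc-chat-config.json"
    if not config.is_file():
        raise FileNotFoundError(f"{config} not found; is {model_dir!r} an MLC model directory?")
    kwargs: Dict[str, Any] = {"model": model_dir}
    if model_lib is not None:
        # Only validate the library when the caller names one explicitly.
        if not Path(model_lib).is_file():
            raise FileNotFoundError(f"model library {model_lib!r} not found")
        kwargs["model_lib"] = model_lib
    return kwargs
```

With the Phi-2 layout assumed above, this could be used as MLCEngine(**engine_args("models/phi-2", "models/phi-2/lib.so")).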

API Reference

The mlc_llm.MLCEngine and mlc_llm.AsyncMLCEngine classes provide the following constructors.

MLCEngine and AsyncMLCEngine offer full OpenAI API compatibility. Please refer to OpenAI’s Python package and OpenAI chat completion API for the complete chat completion interface.

class mlc_llm.MLCEngine(model: str, device: Union[str, Device] = 'auto', *, model_lib: Optional[str] = None, mode: Literal['local', 'interactive', 'server'] = 'local', engine_config: Optional[EngineConfig] = None, enable_tracing: bool = False)

Bases: MLCEngineBase

The MLCEngine in MLC LLM, which provides synchronous interfaces that follow the OpenAI API.

Parameters:
  • model (str) – A path to mlc-chat-config.json, or an MLC model directory that contains mlc-chat-config.json. It can also be a link to a HF repository pointing to an MLC compiled model.

  • device (Union[str, Device]) – The device used to deploy the model, such as “cuda” or “cuda:0”. Defaults to “auto”, which detects the locally available GPUs.

  • model_lib (Optional[str]) – The full path to the model library file to use (e.g. a .so file). If unspecified, we will use the provided model to search over possible paths. If the model lib is not found, it will be compiled in a JIT manner.

  • mode (Literal["local", "interactive", "server"]) –

    The engine mode in MLC LLM. We provide three preset modes: “local”, “interactive” and “server”. The default mode is “local”. The choice of mode decides the values of “max_num_sequence”, “max_total_sequence_length” and “prefill_chunk_size” when they are not explicitly specified.

    1. Mode “local” refers to local server deployment, which has low request concurrency. The max batch size is set to 4, and the max total sequence length and prefill chunk size are set to the context window size (or sliding window size) of the model.

    2. Mode “interactive” refers to interactive use of the server, which has at most 1 concurrent request. The max batch size is set to 1, and the max total sequence length and prefill chunk size are set to the context window size (or sliding window size) of the model.

    3. Mode “server” refers to the large server use case, which may handle many concurrent requests and wants to use as much GPU memory as possible. In this mode, we automatically infer the largest possible max batch size and max total sequence length.

    You can manually specify arguments “max_num_sequence”, “max_total_sequence_length” and “prefill_chunk_size” to override the automatically inferred values.

  • engine_config (Optional[EngineConfig]) – Additional configurable arguments of MLC engine. See class “EngineConfig” for more detail.

  • enable_tracing (bool) – A boolean indicating whether to enable event logging for requests.

__init__(model: str, device: Union[str, Device] = 'auto', *, model_lib: Optional[str] = None, mode: Literal['local', 'interactive', 'server'] = 'local', engine_config: Optional[EngineConfig] = None, enable_tracing: bool = False) → None

abort(request_id: str) → None

Generation abortion interface.

Parameters:

request_id (str) – The id of the request to abort.

metrics() → EngineMetrics

Get engine metrics.

Returns:

metrics – The engine metrics

Return type:

EngineMetrics

class mlc_llm.AsyncMLCEngine(model: str, device: Union[str, Device] = 'auto', *, model_lib: Optional[str] = None, mode: Literal['local', 'interactive', 'server'] = 'local', engine_config: Optional[EngineConfig] = None, enable_tracing: bool = False)

Bases: MLCEngineBase

The AsyncMLCEngine in MLC LLM, which provides asynchronous interfaces that follow the OpenAI API.

Parameters:
  • model (str) – A path to mlc-chat-config.json, or an MLC model directory that contains mlc-chat-config.json. It can also be a link to a HF repository pointing to an MLC compiled model.

  • device (Union[str, Device]) – The device used to deploy the model, such as “cuda” or “cuda:0”. Defaults to “auto”, which detects the locally available GPUs.

  • model_lib (Optional[str]) – The full path to the model library file to use (e.g. a .so file). If unspecified, we will use the provided model to search over possible paths. If the model lib is not found, it will be compiled in a JIT manner.

  • mode (Literal["local", "interactive", "server"]) –

    The engine mode in MLC LLM. We provide three preset modes: “local”, “interactive” and “server”. The default mode is “local”. The choice of mode decides the values of “max_num_sequence”, “max_total_sequence_length” and “prefill_chunk_size” when they are not explicitly specified.

    1. Mode “local” refers to local server deployment, which has low request concurrency. The max batch size is set to 4, and the max total sequence length and prefill chunk size are set to the context window size (or sliding window size) of the model.

    2. Mode “interactive” refers to interactive use of the server, which has at most 1 concurrent request. The max batch size is set to 1, and the max total sequence length and prefill chunk size are set to the context window size (or sliding window size) of the model.

    3. Mode “server” refers to the large server use case, which may handle many concurrent requests and wants to use as much GPU memory as possible. In this mode, we automatically infer the largest possible max batch size and max total sequence length.

    You can manually specify arguments “max_num_sequence”, “max_total_sequence_length” and “prefill_chunk_size” to override the automatically inferred values.

  • engine_config (Optional[EngineConfig]) – Additional configurable arguments of MLC engine. See class “EngineConfig” for more detail.

  • enable_tracing (bool) – A boolean indicating whether to enable event logging for requests.

__init__(model: str, device: Union[str, Device] = 'auto', *, model_lib: Optional[str] = None, mode: Literal['local', 'interactive', 'server'] = 'local', engine_config: Optional[EngineConfig] = None, enable_tracing: bool = False) → None

async abort(request_id: str) → None

Generation abortion interface.

Parameters:

request_id (str) – The id of the request to abort.

async metrics() → EngineMetrics

Get engine metrics.

Returns:

metrics – The engine metrics

Return type:

EngineMetrics