Python API¶
Note
This page introduces the Python API with MLCEngine in MLC LLM.
MLC LLM provides a Python API through the classes mlc_llm.MLCEngine
and mlc_llm.AsyncMLCEngine,
which offer full compatibility with the OpenAI API for easy integration into other Python projects.
This page introduces how to use these engines in MLC LLM. The Python API is part of the MLC LLM package, for which we provide pre-built pip wheels via the installation page.
Verify Installation¶
python -c "from mlc_llm import MLCEngine; print(MLCEngine)"
You are expected to see the output <class 'mlc_llm.serve.engine.MLCEngine'>.
If the command above results in an error, follow Install MLC LLM Python Package to install prebuilt pip packages or build MLC LLM from source.
Run MLCEngine¶
mlc_llm.MLCEngine
provides a synchronous OpenAI chat completion interface.
Due to its synchronous design, mlc_llm.MLCEngine
does not batch concurrent requests;
please use AsyncMLCEngine if you need request batching.
Stream Response. In Quick Start and Introduction to MLC LLM,
we introduced the basic use of mlc_llm.MLCEngine.
from mlc_llm import MLCEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Run chat completion in OpenAI API.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()
This code example first creates an mlc_llm.MLCEngine
instance with the 8B Llama-3 model.
We design the Python API mlc_llm.MLCEngine
to align with the OpenAI API,
which means you can use mlc_llm.MLCEngine
in the same way as
OpenAI’s Python package
for both synchronous and asynchronous generation.
Non-stream Response. The code example above uses the synchronous chat completion interface and iterates over all the stream responses. If you want to run without streaming, you can run
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=False,
)
print(response)
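Since the response follows the OpenAI chat completion format, you can also read the generated text directly from the response object. A minimal sketch, assuming the standard OpenAI-style response structure:

# Read the generated message text from the non-stream response
# (standard OpenAI chat completion response structure).
print(response.choices[0].message.content)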
Please refer to OpenAI’s Python package and OpenAI chat completion API for the complete chat completion interface.
Note
If you want to enable tensor parallelism to run LLMs on multiple GPUs,
please specify tensor_parallel_shards in the engine_config
argument of the MLCEngine constructor.
For example,
from mlc_llm import MLCEngine
from mlc_llm.serve.config import EngineConfig

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(
    model,
    engine_config=EngineConfig(tensor_parallel_shards=2),
)
Run AsyncMLCEngine¶
mlc_llm.AsyncMLCEngine
provides an asynchronous OpenAI chat completion interface.
We recommend using mlc_llm.AsyncMLCEngine
to batch concurrent requests for better throughput.
Stream Response. The core use of mlc_llm.AsyncMLCEngine
for stream responses is as follows.
async for response in await engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
Below is a complete runnable example of AsyncMLCEngine in Python.
import asyncio
from typing import Dict

from mlc_llm.serve import AsyncMLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
prompts = [
    "Write a three-day travel plan to Pittsburgh.",
    "What is the meaning of life?",
]


async def test_completion():
    # Create engine
    async_engine = AsyncMLCEngine(model=model)

    num_requests = len(prompts)
    output_texts: Dict[str, str] = {}

    async def generate_task(prompt: str):
        async for response in await async_engine.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            model=model,
            stream=True,
        ):
            if response.id not in output_texts:
                output_texts[response.id] = ""
            output_texts[response.id] += response.choices[0].delta.content

    tasks = [asyncio.create_task(generate_task(prompts[i])) for i in range(num_requests)]
    await asyncio.gather(*tasks)

    # Print output.
    for request_id, output in output_texts.items():
        print(f"Output of request {request_id}:\n{output}\n")

    async_engine.terminate()


asyncio.run(test_completion())
Non-stream Response. Similarly, mlc_llm.AsyncMLCEngine
provides the non-stream response
interface.
response = await engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=False,
)
print(response)
Please refer to OpenAI’s Python package and OpenAI chat completion API for the complete chat completion interface.
Note
If you want to enable tensor parallelism to run LLMs on multiple GPUs,
please specify tensor_parallel_shards in the engine_config
argument of the AsyncMLCEngine constructor.
For example,
from mlc_llm import AsyncMLCEngine
from mlc_llm.serve.config import EngineConfig

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = AsyncMLCEngine(
    model,
    engine_config=EngineConfig(tensor_parallel_shards=2),
)
Engine Mode¶
To ease the engine configuration, the constructors of mlc_llm.MLCEngine
and mlc_llm.AsyncMLCEngine
have an optional argument mode,
which takes one of three options: "local", "interactive" or "server".
The default mode is "local".
Each mode denotes a pre-defined configuration of the engine to satisfy different use cases. The choice of the mode controls the request concurrency of the engine, as well as the engine’s KV cache token capacity (in other words, the maximum number of tokens that the engine’s KV cache can hold), and further affects the GPU memory usage of the engine.
In short,
- Mode "local" uses low request concurrency and low KV cache capacity. It is suitable for cases where there are not too many concurrent requests and the user wants to save GPU memory.
- Mode "interactive" uses a request concurrency of 1 and low KV cache capacity. It is designed for interactive use cases such as chats and conversations.
- Mode "server" uses as much request concurrency and KV cache capacity as possible. This mode aims to fully utilize GPU memory for large server scenarios where there may be many concurrent requests.
For system benchmarking, please select mode "server".
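For example, the following minimal sketch constructs an engine in "server" mode through the constructor argument mode (see the API Reference below for the full constructor signature):

from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
# "server" mode maximizes request concurrency and KV cache capacity.
engine = MLCEngine(model, mode="server")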
Please refer to API Reference for detailed documentation of the engine mode.
Deploy Your Own Model with Python API¶
The introduction page introduces how we can deploy our
own models with MLC LLM.
This section introduces how you can use the model weights you convert and the model library you build
in mlc_llm.MLCEngine
and mlc_llm.AsyncMLCEngine.
We use Phi-2 as the example model.
Specify Model Weight Path. Assuming you have converted the model weights for your own model,
you can construct an mlc_llm.MLCEngine
as follows:
from mlc_llm import MLCEngine
model = "models/phi-2" # Assuming the converted phi-2 model weights are under "models/phi-2"
engine = MLCEngine(model)
Specify Model Library Path. Further, if you build the model library on your own,
you can use it in mlc_llm.MLCEngine
by passing the library path through the argument model_lib.
from mlc_llm import MLCEngine
model = "models/phi-2"
model_lib = "models/phi-2/lib.so" # Assuming the phi-2 model library is built at "models/phi-2/lib.so"
engine = MLCEngine(model, model_lib=model_lib)
The same applies to mlc_llm.AsyncMLCEngine.
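As a minimal sketch, an AsyncMLCEngine can be constructed with the same converted weights and model library (the paths below are assumed to match the example above):

from mlc_llm import AsyncMLCEngine

model = "models/phi-2"
model_lib = "models/phi-2/lib.so"  # assumed path, matching the example above
async_engine = AsyncMLCEngine(model, model_lib=model_lib)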
API Reference¶
The mlc_llm.MLCEngine
and mlc_llm.AsyncMLCEngine
classes provide the following constructors.
MLCEngine and AsyncMLCEngine are fully compatible with the OpenAI API. Please refer to OpenAI’s Python package and OpenAI chat completion API for the complete chat completion interface.
- class mlc_llm.MLCEngine(model: str, device: Union[str, Device] = 'auto', *, model_lib: Optional[str] = None, mode: Literal['local', 'interactive', 'server'] = 'local', engine_config: Optional[EngineConfig] = None, enable_tracing: bool = False)¶
Bases:
MLCEngineBase
The MLCEngine in MLC LLM that provides the synchronous interfaces with regard to OpenAI API.
- Parameters:
model (str) – A path to mlc-chat-config.json, or an MLC model directory that contains mlc-chat-config.json. It can also be a link to a HF repository pointing to an MLC compiled model.
device (Union[str, Device]) – The device used to deploy the model, such as “cuda” or “cuda:0”. Will default to “auto” and detect from locally available GPUs if not specified.
model_lib (Optional[str]) – The full path to the model library file to use (e.g. a .so file). If unspecified, we will use the provided model to search over possible paths. If the model lib is not found, it will be compiled in a JIT manner.
mode (Literal["local", "interactive", "server"]) –
The engine mode in MLC LLM. We provide three preset modes: “local”, “interactive” and “server”. The default mode is “local”. The choice of mode decides the values of “max_num_sequence”, “max_total_sequence_length” and “prefill_chunk_size” when they are not explicitly specified. 1. Mode “local” refers to the local server deployment, which has low request concurrency. The max batch size will be set to 4, and the max total sequence length and prefill chunk size are set to the context window size (or sliding window size) of the model. 2. Mode “interactive” refers to the interactive use of the server, which has at most 1 concurrent request. The max batch size will be set to 1, and the max total sequence length and prefill chunk size are set to the context window size (or sliding window size) of the model. 3. Mode “server” refers to the large server use case, which may handle many concurrent requests and wants to use as much GPU memory as possible. In this mode, we will automatically infer the largest possible max batch size and max total sequence length.
You can manually specify arguments “max_num_sequence”, “max_total_sequence_length” and “prefill_chunk_size” to override the automatically inferred values.
engine_config (Optional[EngineConfig]) – Additional configurable arguments of the MLC engine. See class “EngineConfig” for more detail.
enable_tracing (bool) – A boolean indicating whether to enable event logging for requests.
- __init__(model: str, device: Union[str, Device] = 'auto', *, model_lib: Optional[str] = None, mode: Literal['local', 'interactive', 'server'] = 'local', engine_config: Optional[EngineConfig] = None, enable_tracing: bool = False) None ¶
- abort(request_id: str) None ¶
Generation abortion interface.
- Parameters:
request_id (str) – The id of the request to abort.
- metrics() EngineMetrics ¶
Get engine metrics
- Returns:
metrics – The engine metrics
- Return type:
EngineMetrics
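A minimal usage sketch of these methods, assuming an engine constructed as in the examples above (the request id shown is hypothetical; in practice it would come from the id field of a response you are streaming):

# Inspect engine metrics after running some requests.
metrics = engine.metrics()
print(metrics)

# Abort an in-flight request by its id (hypothetical id for illustration).
engine.abort("hypothetical-request-id")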
- class mlc_llm.AsyncMLCEngine(model: str, device: Union[str, Device] = 'auto', *, model_lib: Optional[str] = None, mode: Literal['local', 'interactive', 'server'] = 'local', engine_config: Optional[EngineConfig] = None, enable_tracing: bool = False)¶
Bases:
MLCEngineBase
The AsyncMLCEngine in MLC LLM that provides the asynchronous interfaces with regard to OpenAI API.
- Parameters:
model (str) – A path to mlc-chat-config.json, or an MLC model directory that contains mlc-chat-config.json. It can also be a link to a HF repository pointing to an MLC compiled model.
device (Union[str, Device]) – The device used to deploy the model, such as “cuda” or “cuda:0”. Will default to “auto” and detect from locally available GPUs if not specified.
model_lib (Optional[str]) – The full path to the model library file to use (e.g. a .so file). If unspecified, we will use the provided model to search over possible paths. If the model lib is not found, it will be compiled in a JIT manner.
mode (Literal["local", "interactive", "server"]) –
The engine mode in MLC LLM. We provide three preset modes: “local”, “interactive” and “server”. The default mode is “local”. The choice of mode decides the values of “max_num_sequence”, “max_total_sequence_length” and “prefill_chunk_size” when they are not explicitly specified. 1. Mode “local” refers to the local server deployment, which has low request concurrency. The max batch size will be set to 4, and the max total sequence length and prefill chunk size are set to the context window size (or sliding window size) of the model. 2. Mode “interactive” refers to the interactive use of the server, which has at most 1 concurrent request. The max batch size will be set to 1, and the max total sequence length and prefill chunk size are set to the context window size (or sliding window size) of the model. 3. Mode “server” refers to the large server use case, which may handle many concurrent requests and wants to use as much GPU memory as possible. In this mode, we will automatically infer the largest possible max batch size and max total sequence length.
You can manually specify arguments “max_num_sequence”, “max_total_sequence_length” and “prefill_chunk_size” to override the automatically inferred values.
engine_config (Optional[EngineConfig]) – Additional configurable arguments of the MLC engine. See class “EngineConfig” for more detail.
enable_tracing (bool) – A boolean indicating whether to enable event logging for requests.
- __init__(model: str, device: Union[str, Device] = 'auto', *, model_lib: Optional[str] = None, mode: Literal['local', 'interactive', 'server'] = 'local', engine_config: Optional[EngineConfig] = None, enable_tracing: bool = False) None ¶
- async abort(request_id: str) None ¶
Generation abortion interface.
- Parameters:
request_id (str) – The id of the request to abort.
- async metrics() EngineMetrics ¶
Get engine metrics
- Returns:
metrics – The engine metrics
- Return type:
EngineMetrics