Date: 2023-05-02

Large language models (LLMs) and generative artificial intelligence have transformed the field of machine learning, allowing developers to build powerful and accurate language-based models. However, deploying and running LLMs can be time-consuming and difficult, requiring powerful hardware and careful optimization. In this article, we look at MLC LLM, its benefits, and how to use it to build scalable and efficient AI systems.

What is MLC LLM?

In recent years, generative artificial intelligence (AI) and large language models (LLMs) have made significant advances and are becoming more widely used. Thanks to open-source projects, these models can now be used to build personal AI assistants. LLMs, however, are resource-intensive and demand a large amount of computing power. To provide a scalable service, developers may need to rely on powerful clusters and costly hardware for model inference. Deploying LLMs also presents several challenges, such as keeping up with continuous model innovation, working within memory constraints, and applying the optimization techniques each target needs.

The goal of MLC LLM is to make it easier to develop, optimize, and deploy AI models for inference on a variety of devices, including not only server-class hardware but also users’ browsers, laptops, and mobile apps. To achieve this, we must address the diverse characteristics of compute devices as well as deployment environments. Among the key challenges are:

Supporting different models of CPUs and GPUs, and potentially other co-processors and accelerators.

Deploying in the native environment of user devices, which might not have Python or other required dependencies readily available.

Managing memory constraints by carefully allocating and compressing model parameters.
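To make the memory constraint concrete, here is a rough back-of-the-envelope sketch. It assumes a model with 7 billion parameters (the "7B" in the model name used later in this article) and counts only the raw weights; activations, caches, and any per-group quantization metadata are ignored, so real usage will be higher. The function name is illustrative only and is not part of MLC LLM.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


if __name__ == "__main__":
    params = 7_000_000_000  # assumption: a "7B" model has about 7e9 parameters
    for label, bits in [("float32", 32), ("float16", 16), ("int4", 4), ("int3", 3)]:
        print(f"{label:>8}: {weight_memory_gib(params, bits):5.1f} GiB")
```

Under these assumptions, 16-bit weights need roughly 13 GiB, while 3-bit weights need under 3 GiB, which is the difference between requiring a server GPU and fitting on a consumer device.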

MLC LLM offers developers and AI system researchers a flexible, systematic, and repeatable workflow that prioritizes efficiency and a Python-first approach to implementing models and optimizations. This workflow enables quick experimentation with new models, ideas, and compiler passes, followed by native deployment to the desired targets. Furthermore, we are constantly improving LLM acceleration by expanding TVM backends to make model compilation more transparent and efficient.

How does MLC Enable Universal Native Deployment?

At the heart of our approach is machine learning compilation (MLC), which we use to automate the deployment of AI models. To do this, we depend on a variety of open-source ecosystems, including tokenizers supplied by Hugging Face and Google, as well as open-source LLMs such as Llama, Vicuna, Dolly, MOSS, and others. Our primary workflow revolves around Apache TVM Unity, an ongoing initiative within the Apache TVM community. It contributes the following pieces:

Dynamic shape: We bake a language model into a TVM IRModule with native dynamic shape support, which eliminates the need to pad inputs to the maximum length and reduces both computation and memory use.

Composable ML compilation optimizations: Many model deployment optimizations, such as better compilation code transformation, fusion, memory planning, library offloading, and manual code optimization, can be easily incorporated as TVM IRModule transformations exposed as Python APIs.

Quantization: To compress the model weights, we use low-bit quantization and leverage TVM's loop-level TensorIR to easily customize code generation for different compression encoding schemes.

Runtime: The final generated libraries run in the native environment, with a TVM runtime that has minimal dependencies and supports several GPU driver APIs as well as native language bindings (C, JavaScript, and so on).
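As an illustration of what low-bit weight quantization means, the following is a minimal sketch of uniform (min/max) quantization applied to one small group of weights. It is written in plain Python for readability and is not the encoding MLC LLM or TVM actually uses; the function names and the choice of an asymmetric min/max scheme are assumptions made for this example.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Map a group of floats to integer codes in [0, 2**bits - 1].

    Returns (codes, scale, offset) so that weight ~= offset + code * scale.
    """
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    # Avoid a zero scale when every weight in the group is identical.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, offset: float) -> List[float]:
    """Reconstruct approximate float weights from integer codes."""
    return [offset + c * scale for c in codes]


if __name__ == "__main__":
    group = [-1.0, -0.5, 0.0, 0.25, 1.0]
    codes, scale, offset = quantize_group(group, bits=4)
    restored = dequantize_group(codes, scale, offset)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print("codes:", codes)
    print("worst-case error:", worst)
```

Each weight is stored as a small integer plus a shared scale and offset per group, so the reconstruction error is bounded by about half a quantization step. Fewer bits shrink the model further at the cost of a larger step.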

In addition, we include a lightweight C++-based example CLI program that demonstrates how to wrap up the compiled artifacts together with the necessary pre- and post-processing, which should help simplify the workflow for embedding them into native apps.

As a starting point, MLC LLM generates GPU shaders for CUDA, Vulkan, and Metal. Support for more targets, such as OpenCL, SYCL, and webgpu-native, can be added by improving the TVM compiler and runtime. MLC also supports a variety of CPU targets, including ARM and x86, through LLVM.

We rely heavily on the open-source ecosystem, specifically TVM Unity, a new development in the TVM project that enables a Python-first, interactive MLC LLM development experience, allowing us to compose new optimizations entirely in Python and incrementally bring our app to the environment of interest. We also use optimizations including fused quantization kernels, first-class dynamic shape support, and a variety of GPU backends.
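To see why dynamic shape support matters, consider a rough count of how many token positions a model has to process when every input is padded to a fixed maximum length versus when each input is processed at its actual length. This is only an illustrative proxy for compute and memory, and the function name and example numbers are made up for this sketch.

```python
from typing import List, Tuple


def padded_vs_dynamic(lengths: List[int], max_len: int) -> Tuple[int, int]:
    """Return (positions with padding, positions with dynamic shapes)."""
    if any(n > max_len for n in lengths):
        raise ValueError("a sequence is longer than max_len")
    padded = len(lengths) * max_len  # every sequence padded to max_len
    dynamic = sum(lengths)           # every sequence kept at its real length
    return padded, dynamic


if __name__ == "__main__":
    # Hypothetical prompt lengths and context limit, chosen for illustration.
    padded, dynamic = padded_vs_dynamic([12, 80, 300], max_len=2048)
    print(f"padded: {padded} positions, dynamic: {dynamic} positions")
```

For short prompts and a long maximum length, most of the padded work is wasted, which is the overhead that native dynamic shapes avoid.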

How to install MLC LLM on Windows, Linux, and Mac

MLC LLM provides a CLI (command-line interface) app for chatting with the bot in your terminal. Before installing the CLI app, install a few dependencies.

The app is managed with Conda, so a version of Conda is required; either Miniconda or Miniforge works. On Windows and Linux, the chatbot application runs on the GPU through the Vulkan platform, so Windows and Linux users should install the latest Vulkan driver. NVIDIA GPU users should make sure the Vulkan driver is installed as well, since the CUDA driver alone may not be sufficient.

After installing all the dependencies, follow the instructions below to install the CLI app:

1. Open your terminal or command prompt.

2. Type the following command to create a new conda environment named “mlc-chat”:

conda create -n mlc-chat

3. After creating the environment, activate it by typing:

conda activate mlc-chat

4. Next, install Git and Git-LFS using the following command:

conda install git git-lfs

5. Now, install the chat CLI app from Conda using this command:

conda install -c mlc-ai -c conda-forge mlc-chat-nightly

6. Create a directory named “dist” using the following command:

mkdir -p dist

7. Download the model weights from Hugging Face and binary libraries from GitHub using the following commands:

git lfs install
git clone https://huggingface.co/mlc-ai/demo-vicuna-v1-7b-int3 dist/vicuna-v1-7b
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/lib

8. Finally, to start chatting with the bot running natively on your machine, enter the following command:

mlc_chat_cli

You should now be able to enjoy chatting with the bot!
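If the last command does not start, a quick check of the setup can help narrow down which step was missed. The following is a small, hypothetical helper (not part of MLC LLM) that looks for the tools installed in the steps above on your PATH and for the two directories created under dist. It assumes you run it from the directory where you created dist, inside the activated mlc-chat environment.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional

# Tools and directories taken from the installation steps above.
REQUIRED_TOOLS = ["conda", "git", "git-lfs", "mlc_chat_cli"]
REQUIRED_DIRS = ["dist/vicuna-v1-7b", "dist/lib"]


def find_missing(
    root: Path,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a description of every required tool or directory not found."""
    missing = [f"tool: {tool}" for tool in REQUIRED_TOOLS if which(tool) is None]
    missing += [f"directory: {d}" for d in REQUIRED_DIRS if not (root / d).is_dir()]
    return missing


if __name__ == "__main__":
    problems = find_missing(Path.cwd())
    if problems:
        print("Missing items:")
        for item in problems:
            print("  " + item)
    else:
        print("Everything from the steps above was found.")
```

The check only confirms that the files and commands exist; it does not verify the Vulkan driver or the integrity of the downloaded weights.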

This article explained how to install MLC LLM on Windows, Linux, and Mac. We hope it has been helpful. Please feel free to share your thoughts and feedback in the comment section below.