Discover MLC LLM, a scalable and cost-effective solution for deploying and running large language models. Learn about its benefits and how to install it to create powerful AI services.
Large language models (LLMs) and generative artificial intelligence have transformed the field of machine learning, allowing developers to build powerful and accurate language-based models. However, deploying and running LLMs can be time-consuming and difficult, requiring powerful hardware and optimization techniques. In this article, we look at what MLC LLM is, its benefits, and how to use it to build scalable and efficient AI systems.
What is MLC LLM
In recent years, generative artificial intelligence (AI) and large language models (LLMs) have made significant advances and are becoming more widely used. Thanks to open-source projects, these models can now be used to build personal AI assistants. LLMs, however, are often resource-intensive and demand a large amount of computing power. To provide a scalable service, developers may need to rely on powerful clusters and costly hardware for model inference. Furthermore, deploying LLMs presents several challenges, such as continuous model innovation, memory constraints, and the need for potential optimization techniques.
The goal of this project is to make it easier to develop, optimize, and deploy AI models for inference on a variety of devices, including not only server-class hardware but also users’ browsers, laptops, and mobile apps. To achieve this, we must address the diverse characteristics of computing devices as well as deployment environments. Among the key challenges are:
Supporting multiple CPU and GPU models and potentially other co-processors and accelerators.
Deploying in the native environment of user devices, which might not have Python or other required dependencies readily available.
Managing memory constraints by carefully allocating and compressing model parameters.
MLC LLM offers developers and AI system researchers a flexible, systematic, and repeatable workflow that prioritizes efficiency and a Python-first approach for building models and optimizations. This approach enables quick experimentation with new models, ideas, and compiler passes, followed by native deployment to the desired targets. Furthermore, we are constantly improving LLM acceleration by expanding TVM backends to improve model compilation transparency and efficiency.
At the heart of our approach is machine learning compilation (MLC), which we use to automate the deployment of AI models. To do this, we depend on a variety of open-source ecosystems, including tokenizers supplied by Hugging Face and Google, as well as open-source LLMs such as Llama, Vicuna, Dolly, MOSS, and others. The primary workflow revolves around Apache TVM Unity, an exciting ongoing initiative within the Apache TVM community.
Dynamic shape: We bake a language model as a TVM IRModule with native dynamic shape support, which eliminates the requirement for additional padding to the maximum length and reduces both computation amount and memory use.
Composable ML compilation optimizations: Many model deployment optimizations, such as better compilation code transformation, fusion, memory planning, library offloading, and manual code optimization, can be easily incorporated as TVM’s IRModule transformations exposed as Python APIs.
Quantization: To compress the model weights, we use low-bit quantization and leverage TVM’s loop-level TensorIR to easily customize code generators for multiple compression encoding strategies.
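To make the quantization idea more concrete, here is a minimal pure-Python sketch of low-bit quantization applied to one small group of weights. It is an illustration only, not MLC LLM's actual kernels: the function names, the asymmetric min/max scheme, and the 4-bit width are all assumptions chosen for the example.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Map floats to unsigned integer codes in [0, 2**bits - 1] for one group."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    # A constant group has zero range; use scale 1.0 to avoid dividing by zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]


if __name__ == "__main__":
    group = [-0.8, -0.1, 0.0, 0.35, 0.9]  # made-up weights for illustration
    codes, scale, lo = quantize_group(group, bits=4)
    restored = dequantize_group(codes, scale, lo)
    max_err = max(abs(a - b) for a, b in zip(group, restored))
    print(codes, f"max error = {max_err:.4f}")
```

Each weight is stored as a small integer plus a shared scale and offset per group, so the rounding error is bounded by half the scale while storage drops from 16 or 32 bits per weight to a few bits.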
In addition, we include a lightweight C++-based example CLI program that demonstrates how to wrap up the compiled artifacts and the necessary pre/post-processing, which should help simplify the workflow for embedding them into native apps.
MLC LLM generates GPU shaders for CUDA, Vulkan, and Metal as a starting point. More support, such as OpenCL, SYCL, and webgpu-native, can be added by improving the TVM compiler and runtime. MLC also supports a variety of CPU targets, including ARM and x86, through LLVM.
We rely heavily on the open-source ecosystem, specifically TVM Unity, an exciting new development in the TVM project that enables a Python-first interactive MLC LLM development experience, allowing us to easily compose new optimizations entirely in Python and incrementally bring the app to the environment of interest. We also employ optimizations including fused quantization kernels, first-class dynamic shape support, and a variety of GPU backends.
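The benefit of dynamic shape support can be illustrated with simple arithmetic. The sketch below compares how many token positions are processed when every prompt is padded to a fixed maximum length versus when each prompt keeps its real length. The function names, prompt lengths, and maximum length are hypothetical and are not taken from MLC LLM.

```python
from typing import List


def padded_tokens(lengths: List[int], max_length: int) -> int:
    """Token positions processed when every sequence is padded to max_length."""
    if any(n > max_length for n in lengths):
        raise ValueError("a sequence is longer than max_length")
    return len(lengths) * max_length


def dynamic_tokens(lengths: List[int]) -> int:
    """Token positions processed when each sequence keeps its real length."""
    return sum(lengths)


if __name__ == "__main__":
    prompts = [12, 40, 7, 128]  # hypothetical prompt lengths in tokens
    padded = padded_tokens(prompts, max_length=512)
    dynamic = dynamic_tokens(prompts)
    print(f"padded: {padded}, dynamic: {dynamic}, saved: {1 - dynamic / padded:.1%}")
```

The shorter the typical prompt is relative to the maximum length, the more computation and memory padding would waste.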
To try MLC LLM, we provide a CLI (command-line interface) app that lets you chat with the bot in your terminal. Before installing the CLI app, install the following dependencies.
We use Conda to manage the app, so you need to install a version of conda. You can install either Miniconda or Miniforge.
On Windows and Linux, the chatbot application runs on the GPU via the Vulkan platform, so Windows and Linux users should install the latest Vulkan driver. NVIDIA GPU users should also make sure to install the Vulkan driver, as the CUDA driver alone may not be sufficient.
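One way to confirm that a Vulkan driver is visible before continuing is to call the `vulkaninfo` utility. The sketch below assumes that `vulkaninfo` (distributed with the Vulkan SDK and the vulkan-tools packages) is installed and that your version accepts the `--summary` flag; the helper name `vulkan_summary` is hypothetical.

```python
import shutil
import subprocess


def vulkan_summary(timeout_s: float = 15.0) -> str:
    """Return vulkaninfo's summary text, or a short message if it is unavailable."""
    exe = shutil.which("vulkaninfo")
    if exe is None:
        return "vulkaninfo not found on PATH"
    try:
        result = subprocess.run(
            [exe, "--summary"], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"vulkaninfo failed: {exc}"
    # Older builds may reject --summary and report the problem on stderr.
    return result.stdout or result.stderr


if __name__ == "__main__":
    print(vulkan_summary())
```

If the output lists your GPU, the driver is installed; if the tool is missing or reports no devices, install or update the Vulkan driver first.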
After installing all the dependencies, follow the instructions below to install the CLI app:
1. Open your terminal or command prompt.
2. Type the following command to create a new conda environment named “mlc-chat”:
conda create -n mlc-chat
3. After creating the environment, activate it by typing:
conda activate mlc-chat
4. Next, install Git and Git-LFS using the following command:
conda install git git-lfs
5. Now, install the chat CLI app from Conda using this command:
8. Finally, to start chatting with the bot running natively on your machine, enter the following command:
You should now be able to enjoy chatting with the bot!
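Steps 2 to 4 of the list above can also be scripted. The following is a minimal sketch, not an official installer: it uses conda's `-n` option to target the environment directly instead of activating it, and `-y` to skip the confirmation prompt. The function names and the dry-run default are assumptions made for this example, and the remaining steps are not covered.

```python
import subprocess
from typing import List


def setup_commands(env_name: str = "mlc-chat") -> List[List[str]]:
    """Commands for steps 2 to 4: create the environment, then add Git and Git-LFS."""
    return [
        ["conda", "create", "-n", env_name, "-y"],
        # -n installs into the named environment, so no shell activation is needed.
        ["conda", "install", "-n", env_name, "-y", "git", "git-lfs"],
    ]


def run_setup(env_name: str = "mlc-chat", dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in setup_commands(env_name):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_setup(dry_run=True)  # set dry_run=False to actually invoke conda
```

Running it with the default dry run only prints the commands, which makes it easy to review them before anything is installed.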
How to install MLC LLM on iPhone
To install TestFlight on your iOS or iPadOS device and access the app for testing, follow these steps:
Open the App Store on your iOS or iPadOS device.
Search for “TestFlight” and install Apple’s TestFlight app.
Once TestFlight is installed, open the email invitation you received from the app developer on your device. Alternatively, if you have a public link, open it in Safari or another web browser on your device.
If you received an email invitation, click the “View in TestFlight” or “Start testing” button within the email. If you’re using a public link, click the “Install” or “Update” button on the webpage.
This will launch the TestFlight app, and you’ll see the app you wish to test listed there.
Tap the “Install” or “Update” button next to the app you wish to test.
The app will then be downloaded and installed on your device by TestFlight. Once the installation is complete, you may launch the app and begin testing it.
Note that the app has specific device requirements. Vicuna-7B requires 4GB of RAM to function properly, whereas RedPajama-3B requires 2.2GB of RAM. Taking into account the operating system and other running apps, we recommend using a recent iPhone with at least 6GB of RAM for Vicuna-7B or 4GB of RAM for RedPajama-3B.
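The figures above can be captured in a small lookup that classifies a device by its RAM. This is only a sketch built from the numbers quoted in this section; the names `Requirement`, `REQUIREMENTS`, and `check_device`, and the three result labels, are hypothetical.

```python
from typing import Dict, NamedTuple


class Requirement(NamedTuple):
    model_gb: float        # RAM the model itself needs
    recommended_gb: float  # device RAM suggested once the OS and other apps are counted


# Figures quoted in the text above.
REQUIREMENTS: Dict[str, Requirement] = {
    "Vicuna-7B": Requirement(model_gb=4.0, recommended_gb=6.0),
    "RedPajama-3B": Requirement(model_gb=2.2, recommended_gb=4.0),
}


def check_device(model: str, device_ram_gb: float) -> str:
    """Classify a device for the given model based on its total RAM in GB."""
    req = REQUIREMENTS[model]
    if device_ram_gb >= req.recommended_gb:
        return "recommended"
    if device_ram_gb >= req.model_gb:
        return "may run, but with little headroom"
    return "insufficient"


if __name__ == "__main__":
    for model in REQUIREMENTS:
        print(model, "on a 4GB device:", check_device(model, 4.0))
```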
The app has been tested on variants of the iPhone 14 Pro Max, iPhone 14 Pro, and iPhone 12 Pro.
If you want to build the iOS app from source, you can find additional information on the project's GitHub site.
Finally, please be aware that the text generation speed in the iOS app can be inconsistent at times. It may start slowly, but it should quickly pick up speed.
How to install MLC LLM on Android
To get started with the chat app demo on Android, please follow the steps below:
Download the APK file to your Android device from here.
Once the download is complete, look for the APK file on your device, which is normally in the “Downloads” folder or the downloads location you selected.
To begin the installation process, tap on the APK file. Allow installation from unknown sources in your device settings if asked. The exact steps vary depending on your Android version and device manufacturer.
To finish the installation, follow the on-screen directions. The app may take a few seconds to install on your phone.
After the installation is complete, you can open the app and begin a conversation with the LLM.
Please keep in mind that the app needs to download the model parameters the first time you launch it, which may make the initial load slow. In subsequent runs, however, the parameters are loaded from the cache, resulting in faster loading times. Furthermore, once the parameters are downloaded, the app can be used offline.
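The download-once behaviour described above follows a common caching pattern, sketched below in Python. This is not the app's actual code: `load_params`, the cache layout, and the `fake_download` stand-in for a network fetch are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def load_params(name: str, cache_dir: Path, download: Callable[[str], bytes]) -> bytes:
    """Return the bytes for `name`, downloading them only on a cache miss."""
    path = cache_dir / name
    if path.exists():
        return path.read_bytes()  # later runs: served from the cache, works offline
    data = download(name)         # first run: slow network fetch
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data


if __name__ == "__main__":
    calls: List[str] = []

    def fake_download(name: str) -> bytes:  # stand-in for a real network fetch
        calls.append(name)
        return b"weights-for-" + name.encode()

    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "params"
        load_params("shard_0.bin", cache, fake_download)
        load_params("shard_0.bin", cache, fake_download)
        print(f"downloads performed: {len(calls)}")
```

The second call never reaches the download function, which is why later launches are faster and can work without a network connection.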
The current demo requires OpenCL support on your phone and consumes about 6GB of RAM. You should be able to try out the demo if you have a phone with the latest Snapdragon processor.
During testing, we concentrated on the Samsung Galaxy S23, and we can affirm that the app works properly on this smartphone. However, owing to restricted OpenCL support, it is presently incompatible with Google Pixel phones.
If you want to build the Android app from source, you can find all of the necessary materials and information in the project's GitHub repository.
Web LLM chat
WebGPU is now available in Chrome, allowing you to try out the chatbot example. Please use Chrome version 113 or later, as version 112 and earlier are not supported. If you try to run the demo on an unsupported version, you may see an error message such as “Operation Error: Required limit (1073741824) is greater than supported limit (268435456) – While validating maxBufferSize – While validating required limits.”
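The two numbers in that error message are buffer sizes in bytes: 1073741824 is 1 GiB and 268435456 is 256 MiB, so the demo is asking for a larger maximum buffer than the browser offers. The sketch below extracts and converts those values and checks a Chrome version string against the minimum stated above. The helper names are hypothetical, and the regular expression only assumes the message format quoted in this section.

```python
import re
from typing import Tuple

MIN_CHROME_MAJOR = 113  # minimum version stated in the text


def limits_in_mib(message: str) -> Tuple[int, int]:
    """Extract (required, supported) from the error text and convert bytes to MiB."""
    match = re.search(
        r"Required limit \((\d+)\) is greater than supported limit \((\d+)\)", message
    )
    if match is None:
        raise ValueError("message does not contain a limit comparison")
    required, supported = (int(g) // (1024 * 1024) for g in match.groups())
    return required, supported


def chrome_supported(version: str) -> bool:
    """True if a version string such as '113.0.5672.63' meets the minimum major version."""
    return int(version.split(".")[0]) >= MIN_CHROME_MAJOR


if __name__ == "__main__":
    msg = "Required limit (1073741824) is greater than supported limit (268435456)"
    required, supported = limits_in_mib(msg)
    print(f"required {required} MiB, supported {supported} MiB")
    print("Chrome 112 supported:", chrome_supported("112.0.5615.137"))
```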
To get started, follow these instructions:
Upgrade your Chrome browser to version 113 or higher. This ensures that it is compatible with the chatbot demo.
For Mac users, use the following command to launch Chrome, preferably from the terminal:
This command disables Chrome’s robustness check, which might slow down the chatbot’s responses. While not required, it is strongly advised to use this command for a more seamless experience.
After Chrome has started, you can proceed to the chatbot demo. Choose the model you wish to test and enter your message. Then press the “Send” button to start the conversation.
Please note the following requirements and considerations:
The Vicuna-7B and RedPajama-3B chatbot models require a GPU with around 6GB and 3GB of memory, respectively.
Some models may require fp16 shaders. To enable fp16 shaders in Chrome Canary, use the following command (allow_unsafe_apis).
For Mac users with Apple Silicon, here are the specific instructions to run the chatbot demo locally in your browser:
Chrome should be updated to version 113 or higher.
Use the previously suggested command to launch Chrome.
Choose the desired model, enter your text, then click “Send” to begin the conversation.
The chatbot will load the model parameters into the local cache on the first run. This download might take a few minutes. Once the parameters are cached, subsequent refreshes and runs will be quicker.
This article is to help you learn how to install MLC LLM on Windows, Linux, and Mac. We trust that it has been helpful to you. Please feel free to share your thoughts and feedback in the comment section below.