Date: 2023-05-03

Web LLM is a project that brings language model chats directly to web browsers, with no server support required. It is accelerated with WebGPU, which provides GPU-speed inference while keeping user data on the device. With Web LLM, building AI assistants and exploring the possibilities of language models has never been easier.

This project brings language model chats directly into web browsers. Everything runs inside the browser, with no server support, and is accelerated with WebGPU. This opens up many opportunities to build AI assistants for everyone and to preserve privacy while enjoying GPU acceleration. Check out the demo webpage to try it out!

Large language models (LLMs) are typically big and computationally intensive. Building a chat service usually requires a large cluster to run an inference server, while clients send queries to the server and retrieve the inference output. We also generally have to run on a specific type of GPU for which popular deep-learning frameworks are readily available.

This project adds to the ecosystem’s variety. Specifically, can we build LLMs straight into the client side and execute them inside a browser? If that is realized, we will be able to support personal AI models on the client, with the benefits of cost reduction, better personalization, and privacy protection. The client side is becoming increasingly powerful.

Wouldn’t it be even better if we could simply open a browser and bring AI directly to a browser tab? The ecosystem is ready to some extent. WebGPU, which enables native GPU operations in the browser, has recently been released.

Still, there are significant obstacles to overcome, to name a few:

We need to bring the models to an environment that does not have the necessary GPU-accelerated Python frameworks.

Most AI frameworks rely heavily on optimized compute libraries supplied by hardware vendors. Here we must start from scratch.

Memory utilization must be carefully planned, and weights must be aggressively compressed in order to fit the models into memory.
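To make the memory pressure concrete, here is a rough back-of-the-envelope sketch of raw weight storage at different precisions. The parameter count of 7 billion is an assumption for a "7B" model, and the figures ignore quantization scales, activations, and caches, so actual usage will be higher.

```python
def weight_footprint_gib(num_params: int, bits_per_weight: int) -> float:
    """Return raw weight storage in GiB, ignoring scales, activations, and caches."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed parameter count for a "7B" model; real checkpoints differ slightly.
NUM_PARAMS = 7_000_000_000

for bits in (32, 16, 4):
    size = weight_footprint_gib(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {size:.1f} GiB")
```

Going from 32-bit to 4-bit weights shrinks the raw weight storage by a factor of eight, which is what makes a browser deployment plausible at all.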

We also don’t want to limit ourselves to just one model. Instead, we’d like to offer a repeatable and hackable workflow that allows anybody to quickly build and optimize these models in a productive, Python-first way, and then deploy them universally, including on the web.

In addition to supporting WebGPU, this project provides the harness for the other GPU backends that TVM supports (such as CUDA, OpenCL, and Vulkan), making deployment of LLMs broadly accessible.

Instructions for local deployment

1. Install TVM Unity. See the mlc.ai wheels page for more versions.

pip3 install -r requirements.txt

2. Install all the prerequisites for web deployment:

1. emscripten. It is an LLVM-based compiler that compiles C/C++ source code to WebAssembly. Follow the installation instructions to install the latest emsdk. Source emsdk_env.sh by running source path/to/emsdk_env.sh, so that emcc is reachable from PATH and the emcc command works.

2. Rust.

3. wasm-pack. It helps build Rust-generated WebAssembly, which is used for the tokenizer in our case.

4. Install jekyll by following the official guides. It is the package we use for the website.

5. Install jekyll-remote-theme with the following command. Try a gem mirror if the install is blocked.

gem install jekyll-remote-theme

6. Install Chrome Canary. It is a developer version of Chrome that enables the use of WebGPU.

We can verify that the installation succeeded by running emcc, jekyll, and wasm-pack in a terminal.
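As a convenience, the same check can be scripted. The following is a minimal sketch, not part of the project, that only tests whether the three commands named above are reachable from PATH; the helper names find_tools and missing_tools are hypothetical.

```python
import shutil

# Tools the text says should be runnable after installation.
REQUIRED_TOOLS = ("emcc", "jekyll", "wasm-pack")


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that could not be found on PATH."""
    return [name for name, path in find_tools(tools).items() if path is None]


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

If emcc is reported as missing, the most likely cause is that emsdk_env.sh has not been sourced in the current shell.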

3. Import, optimize and build the LLM model:

Get the model weights

Currently we support LLaMA and Vicuna.

Get the original LLaMA weights in the Hugging Face format by following the instructions here. Use the instructions here to get the Vicuna weights. Then create a soft link to the model path under dist/models:

mkdir -p dist/models
ln -s your_model_path dist/models/model_name
# For example:
# ln -s path/to/vicuna-7b-v1 dist/models/vicuna-7b-v1

Optimize and build the model for the WebGPU backend, and export the executable to disk in the WebAssembly file format.

python3 build.py --target webgpu

By default, build.py uses vicuna-7b-v1 as the model name. You can also specify the model name explicitly:

python3 build.py --target webgpu --model llama-7b

Note: build.py can be run on macOS with 32GB of memory, and on other operating systems with at least 50GB of CPU memory. We are currently optimizing the memory usage so that more people can try it out locally.

4. Deploy the model on the web with the WebGPU runtime

Prepare all the necessary dependencies for web build:

./scripts/prep_deps.sh

The last step is to set up the site with

./scripts/local_deploy_site.sh

With the site set up, you can go to localhost:8888/web-llm/ in Chrome Canary to try out the demo on your local machine. Remember: you will need 6.4 GB of GPU memory to run the demo. Don’t forget to launch Chrome Canary with the following command, which turns off Chrome’s robustness check:

/Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --enable-dawn-features=disable_robustness

How We Deploy Machine Learning Models on the Web with Machine Learning Compilation (MLC)

MLC (machine learning compilation) is the key technique here. Our solution stands on the shoulders of the open-source ecosystem, which includes Hugging Face, the LLaMA and Vicuna model variants, wasm, and WebGPU. The main flow is built on Apache TVM Unity, an ongoing development in the Apache TVM community.

We bake a language model’s IRModule in TVM with native dynamic shape support, which avoids padding to the maximum length and reduces both computation and memory usage.

Each function in TVM’s IRModule can be further transformed to generate runnable code that can be deployed universally on any environment that supports the minimal TVM runtime (JavaScript being one of them).

TensorIR is the key abstraction used to generate optimized programs. We produce efficient kernels by quickly transforming TensorIR programs using a combination of expert knowledge and an automated scheduler.

Heuristics are used when optimizing lightweight operators in order to reduce engineering effort.

The model weights are compressed using int4 quantization techniques so that they can fit into memory.

We build static memory planning optimizations to reuse memory across multiple layers.

Emscripten and TypeScript are used to build a TVM web runtime that can deploy the generated modules.

We also use a wasm port of the SentencePiece tokenizer.
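The int4 weight compression mentioned in the list above can be illustrated with a small pure-Python sketch. It assumes a simple symmetric per-group scheme with two 4-bit values packed per byte; the group size and the rounding rule are illustrative choices, and the project's actual quantization format may differ.

```python
def quantize_int4(values, group_size=4):
    """Symmetric per-group int4 quantization.

    Returns (packed, scales): two 4-bit codes per byte, one scale per group.
    """
    if len(values) % group_size or group_size % 2:
        raise ValueError("len(values) must be a multiple of an even group_size")
    scales, nibbles = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        # Map the largest magnitude to 7; fall back to 1.0 for an all-zero group.
        scale = max(abs(v) for v in group) / 7 or 1.0
        scales.append(scale)
        for v in group:
            q = max(-8, min(7, round(v / scale)))  # clamp to the int4 range
            nibbles.append(q + 8)                  # shift to unsigned 0..15
    packed = bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))
    return packed, scales


def dequantize_int4(packed, scales, group_size=4):
    """Unpack the 4-bit codes and rescale them back to floats."""
    nibbles = []
    for byte in packed:
        nibbles.extend((byte & 0x0F, byte >> 4))
    return [(n - 8) * scales[i // group_size] for i, n in enumerate(nibbles)]


weights = [0.7, -0.35, 0.1, 0.0, 2.1, -2.1, 1.05, 0.3]
packed, scales = quantize_int4(weights)
restored = dequantize_int4(packed, scales)
print(len(weights), "floats stored in", len(packed), "bytes plus", len(scales), "scales")
```

Each restored value differs from the original by at most half a quantization step of its group, which is the trade-off accepted in exchange for the smaller footprint.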

All parts of this workflow are written in Python, with the exception of the last step, which builds a JavaScript app of about 600 lines of code that connects everything together. This is also an enjoyable, interactive development process that makes it easy to bring in new models.

All of this is made possible by the open-source ecosystem on which we rely. We make extensive use of TVM Unity, a recent development in the TVM project that enables this Python-first, interactive MLC development experience, allowing us to quickly compose new optimizations entirely in Python and incrementally bring our app to the web.

TVM Unity also makes it simple to compose new solutions in the ecosystem. We will continue to add further optimizations, such as fused quantization kernels, and bring them to more platforms.

One distinctive characteristic of LLMs is their dynamic nature. Because the decoding and encoding processes rely on computations that grow with the number of tokens, we use TVM Unity’s first-class dynamic shape support, which represents sequence dimensions as symbolic integers. This allows us to plan ahead and statically allocate all of the memory required for the sequence window of interest, without padding.
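The idea of planning memory ahead of time for symbolic shapes can be sketched in plain Python. This is an illustrative greedy planner, not TVM's algorithm: the Tensor record, the symbolic dimension name "n", and the reuse rule are all assumptions made for the example. Each symbolic dimension is resolved to its upper bound (the sequence window), and a buffer is reused once its previous tenant is no longer read.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    shape: tuple     # ints, or symbolic names such as "n"
    first_use: int   # index of the op that produces the tensor
    last_use: int    # index of the last op that reads it


def upper_bound_size(shape, bounds):
    """Number of elements when every symbolic dimension takes its upper bound."""
    size = 1
    for dim in shape:
        size *= bounds[dim] if isinstance(dim, str) else dim
    return size


def plan_memory(tensors, bounds):
    """Greedily assign tensors to reusable buffers; returns (assignment, sizes)."""
    buffers = []     # each entry: [size, last step at which the buffer is busy]
    assignment = {}
    for t in sorted(tensors, key=lambda t: t.first_use):
        need = upper_bound_size(t.shape, bounds)
        for idx, buf in enumerate(buffers):
            if buf[1] < t.first_use and buf[0] >= need:
                buf[1] = t.last_use          # reuse a buffer that is free again
                assignment[t.name] = idx
                break
        else:
            buffers.append([need, t.last_use])
            assignment[t.name] = len(buffers) - 1
    return assignment, [size for size, _ in buffers]


# Four layer outputs of shape (n, 4); each is read only by the next op.
layers = [Tensor(f"x{i}", ("n", 4), i, i + 1) for i in range(4)]
assignment, sizes = plan_memory(layers, bounds={"n": 8})
print(assignment, sizes)
```

In this toy chain, two buffers are enough for four intermediate tensors, because each output can overwrite the buffer of a tensor that is no longer needed.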

We also used tensor expression integration to easily define partial-tensor computations such as rotary embedding directly, without materializing them as full-tensor matrix computations.
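To illustrate why rotary embedding is a partial-tensor computation, here is a minimal pure-Python sketch. It assumes the interleaved-pair convention, in which elements (2i, 2i+1) are rotated together, and the common base of 10000; the layout used in the project's kernels may differ. Each pair only needs a 2x2 rotation, so no full matrix multiplication over the whole vector is required.

```python
import math


def rotary_embed(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[i], vec[i+1]) by position-dependent angles."""
    if len(vec) % 2:
        raise ValueError("vector length must be even")
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)   # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend((x * c - y * s, x * s + y * c))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.0, 0.5, -0.5]
k = [0.2, 0.3, -1.0, 0.4]
# The score depends only on the relative offset between the two positions.
print(dot(rotary_embed(q, 5), rotary_embed(k, 3)))
print(dot(rotary_embed(q, 2), rotary_embed(k, 0)))
```

The two printed scores agree up to floating-point rounding, since both pairs of positions are two steps apart.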

Comparison to Native GPU Runtime, Limitations and Opportunities



In addition to the WebGPU runtime, we provide options for native deployment with a local GPU runtime. These can serve both as a tool for native deployment and as a reference point for comparing native GPU driver performance with WebGPU performance.

WebGPU works by translating WGSL (WebGPU Shading Language) shaders into native shaders. We observed opportunities to close the gap between the WebGPU runtime and the native environment.

Some of the current gaps arise because Chrome’s WebGPU implementation inserts bounds clamps for every array index access, such that a[i] becomes a[min(i, a.size)] . This can be optimized out as WebGPU support continues to mature.
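The effect of such clamping can be mimicked in a few lines of Python. This is only an illustration of the idea, not Chrome's implementation; it assumes the index is clamped to the last valid element, len(buffer) - 1, so that the access always stays in range.

```python
def clamped_load(buffer, index):
    """Read buffer[index] with the index clamped into the valid range."""
    # Assumption: clamp to [0, len(buffer) - 1] so the read is always in bounds.
    safe_index = min(max(index, 0), len(buffer) - 1)
    return buffer[safe_index]


data = [10, 20, 30]
print(clamped_load(data, 1))    # in-range index, read normally
print(clamped_load(data, 99))   # out-of-range index, redirected to the last element
```

The extra min and max operations are cheap individually, but they are applied to every array access inside every kernel, which is where the overhead comes from.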

You can get around this by launching Chrome with a special flag. Exit Chrome completely, then run the following from the command line:

/path/to/Chrome --enable-dawn-features=disable_robustness

The execution speed will then be as fast as in the native GPU environment. We expect this issue to be resolved as WebGPU matures. WebGPU has only just shipped, and we are excited to see what opportunities it opens up. There are also many promising upcoming features that we can use to improve things, such as fp16 extensions.

