Are you fascinated by the idea of creating realistic videos from images? Do you want to learn how to use a powerful generative model that can transform paintings into movies, or photos into video clips? If yes, then this article is for you.
In this article, you will learn how to run Stable Video Diffusion img2vid on Google Colab and on Windows, and how to use it with ComfyUI. By the end of this article, you will be able to create your own videos from images using Stable Video Diffusion img2vid. Let’s get started.
Stable Video Diffusion
Stable Video Diffusion (SVD) by Stability AI is an incredibly powerful tool that transforms images into captivating videos by adding movement and creating stunning visual sequences. SVD works as a special type of model that learns to generate short video clips starting from a single image.
It comes in two versions: img2vid, which produces 14 frames of motion at a resolution of 576×1024 pixels, and img2vid-xt, an improved version fine-tuned to create longer videos with 25 frames of motion at the same resolution.
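As a quick sanity check on what those frame counts mean in practice, clip length is just frames divided by playback rate. The sketch below assumes a 6 fps playback rate, a commonly used default; the models themselves do not fix one:

```python
# Rough clip lengths for the two SVD variants, assuming 6 fps playback
# (an assumption for illustration, not something the models mandate).
def clip_seconds(frames: int, fps: int = 6) -> float:
    """Duration in seconds for a given frame count and playback rate."""
    return frames / fps

print(round(clip_seconds(14), 1))  # img2vid: about 2.3 s
print(round(clip_seconds(25), 1))  # img2vid-xt: about 4.2 s
```

Either way, the output is a very short clip; the xt variant roughly doubles the usable length.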
How the SVD Model Was Trained to Generate Videos
The SVD model underwent three main stages of training to become proficient in creating videos:
- Training an Image Model: Initially, an image model known as Stable Diffusion 2.1 was trained. This model served as the foundation for subsequent video-related advancements.
- Expanding to a Video Model: The image model was extended to understand and generate videos. It was trained using a vast dataset of videos to familiarize itself with video sequences.
- Refinement with High-Quality Data: Following the video model’s initial training, it underwent further refinement. A smaller but high-quality dataset was used to fine-tune the model, enhancing its ability to generate better videos.
The quality and selection of the dataset played a crucial role in shaping the capabilities of the video model. To create the video model, specific enhancements were made to the image model. Temporal convolution and attention layers were added to the U-Net noise estimator. This modification transformed the model’s latent tensor, allowing it to represent and understand videos instead of just images. Additionally, all frames were denoised simultaneously using a reverse diffusion process, akin to the VideoLDM model.
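The shape change that the temporal layers rely on can be pictured with plain NumPy. The dimensions below are illustrative (a 576×1024 image maps to a 72×128 latent after the usual 8× spatial downscaling), not the model's exact internals:

```python
import numpy as np

# Illustrative shapes only: an SD-style image latent is (batch, channels, h, w);
# the video model adds a frames axis so all frames can be denoised together.
batch, frames, channels, h, w = 1, 14, 4, 72, 128  # 576x1024 / 8 = 72x128

video_latent = np.zeros((batch, frames, channels, h, w))

# Spatial layers still see individual frames: fold time into the batch axis...
spatial_view = video_latent.reshape(batch * frames, channels, h, w)
# ...while temporal attention layers see each position across the frame axis.
temporal_view = video_latent.reshape(batch, frames, channels * h * w)

print(spatial_view.shape)   # frames behave like a larger batch of images
print(temporal_view.shape)  # attention can mix information across frames
```

This reshaping trick is how an image U-Net can be extended to video: the pretrained spatial layers are reused unchanged, and only the temporal layers are new.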
This resulting video model boasts an impressive 1.5 billion parameters and was trained extensively using a substantial video dataset. Finally, it underwent a focused fine-tuning process using a smaller yet higher-quality dataset to maximize its video generation capabilities.
How to Use Stable Video Diffusion on Colab
To run Stable Video Diffusion locally, you’ll need a powerful NVIDIA GPU with plenty of VRAM (video memory). If you don’t have access to such hardware, a great alternative is Google Colab, an online platform that lets you run the model on a free account.
Here’s a step-by-step guide to using Stable Video Diffusion on Google Colab:
- Open the Colab Notebook: Visit the GitHub page hosting the Colab notebook and click on the “Open in Colab” icon to access the notebook directly.
- Review Notebook Options: The default settings in the notebook are usually good to go. However, there’s an optional setting to choose whether or not to save the final video in your Google Drive.
- Run the Notebook: Click the “run” button within the notebook to initiate the execution of the code.
- Start the GUI: Once the notebook finishes loading, you’ll see a gradio.live link. Click it to launch the graphical user interface (GUI) for Stable Video Diffusion.
- Upload an Initial Image: Drag and drop the image you want to use as the starting frame for your video. You can adjust the crop offset to fine-tune the position of the crop if needed.
- Begin Video Generation: Click the “run” button on the GUI interface to start the process of generating the video. The generated video will be displayed on the GUI once it’s completed.
How to Use Stable Video Diffusion with ComfyUI
Here’s a simplified version of the steps to use the text-to-video workflow in ComfyUI:
Load the text-to-video Workflow: Download the ComfyUI workflow and drag and drop it into the ComfyUI window.
- Update ComfyUI, install any missing custom nodes, and ensure all custom nodes are updated. You can use the ComfyUI manager to make this process easier.
- Completely restart ComfyUI and load the text-to-video workflow again. If everything is updated correctly, there should be no issues.
- Download the SVD XT model and place it in the ‘ComfyUI > models > checkpoints’ folder.
- Refresh the ComfyUI page and select the SVD XT model in the ‘Image Only Checkpoint Loader’ node.
- Use the SDXL 1.0 model for this workflow. Download it if you haven’t already and place it in the ‘ComfyUI > models > checkpoints’ folder.
- Refresh the ComfyUI page and select the SDXL model in the ‘Load Checkpoint’ node.
Run the Workflow: Click ‘Queue Prompt’ to execute the workflow. This action should generate a video. The following settings control the output:
- video_frames: Specifies the number of frames in the video. It’s recommended to keep it at 25, as this matches the model’s training.
- motion_bucket_id: Determines the amount of motion in the video. Higher values result in more motion.
- fps: Defines the frames per second for the video.
- augmentation_level: Controls the level of noise added to the initial image. Higher levels make the video more distinct from the starting frame.
- min_cfg: Establishes the CFG scale at the video’s start. The CFG scale linearly changes to the ‘cfg’ value set in the KSampler node by the video’s end.
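The min_cfg behaviour amounts to a linear ramp across frames. A small hypothetical helper (not ComfyUI’s actual code) makes the interpolation concrete:

```python
def cfg_schedule(min_cfg: float, cfg: float, num_frames: int) -> list[float]:
    """Linearly interpolate the CFG scale from min_cfg (first frame)
    to cfg (last frame), as described for the SVD workflow."""
    if num_frames == 1:
        return [cfg]
    step = (cfg - min_cfg) / (num_frames - 1)
    return [min_cfg + i * step for i in range(num_frames)]

scales = cfg_schedule(min_cfg=1.0, cfg=2.5, num_frames=25)
print(scales[0], scales[-1])  # 1.0 2.5
```

Early frames thus follow the initial image loosely (low CFG) while later frames are guided more strongly, which tends to reduce flicker at the start of the clip.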
By following these steps and adjusting these parameters, users can utilize ComfyUI’s text-to-video workflow, leveraging Stable Diffusion XL and SVD XT models to generate videos from text inputs.
How to Install Stable Video Diffusion on Windows
Here’s a simplified guide to installing and running Stable Video Diffusion software on your computer using Windows PowerShell:
Requirements: You’ll need Python 3.10, git, and a high-VRAM NVIDIA GPU card such as the RTX 4090.
Clone the Repository
- Open PowerShell (not Command Prompt).
- Check the Python version using:
python --version
It should report Python 3.10.x.
- Change to the desired directory and clone the repository using the command:
git clone https://github.com/Stability-AI/generative-models
Create a Virtual Environment
- Navigate into the cloned folder using:
cd generative-models
- Create a virtual environment with:
python -m venv venv
- Activate the virtual environment with:
.\venv\Scripts\Activate.ps1
Remove Triton Package
- Go to the requirements folder and open pt2.txt.
- Remove the line “triton==2.0.0” and save the file.
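Editing the file by hand is fine; for reference, here is the same edit as a small Python sketch (the helper name is made up for illustration, and the path assumes you run it from the generative-models folder):

```python
from pathlib import Path

def remove_triton(req_path: Path) -> None:
    """Drop any line pinning triton (it has no official Windows wheel)
    and keep every other requirement unchanged."""
    lines = req_path.read_text().splitlines()
    kept = [ln for ln in lines if not ln.strip().startswith("triton")]
    req_path.write_text("\n".join(kept) + "\n")

# Run from the generative-models folder:
req = Path("requirements/pt2.txt")
if req.exists():
    remove_triton(req)
```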
Install Required Libraries
- Ensure you’re in the virtual environment.
- Install PyTorch:
pip3 install torch==2.0.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
- Install required libraries:
pip3 install -r .\requirements\pt2.txt
- Install the generative model software:
pip3 install .
- Install a required library:
pip3 install -e git+https://github.com/Stability-AI/datapipelines.git@main#egg=sdata
Download the Video Model
- Create a folder named “checkpoints” in the generative-models folder.
- Download the svd_xt.safetensors model and place it in the checkpoints folder.
Run the GUI
- Navigate to the generative-models folder in PowerShell.
- Set the Python path so the demo script can import from the repository root, for example:
$env:PYTHONPATH = "."
- Start the GUI:
streamlit run scripts/demo/video_sampling.py
Generate a Video
- Open the provided local URL (usually http://localhost:8501) in a browser.
- Select ‘svd_xt’ in the Model Version dropdown and check ‘Load Model’.
- Drop an image as the initial frame in the Input box.
- Set ‘Decode t frames at a time’ to 1 and click ‘Sample’ to start video generation.
- Monitor PowerShell for progress, and once done, the video will appear in the GUI.
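The ‘Decode t frames at a time’ setting trades speed for VRAM: the VAE decodes the frame batch in chunks of t frames. The chunking itself is simple; this illustrative sketch (not the repository’s code) shows why t=1 means 25 separate decode passes for a 25-frame clip:

```python
def frame_chunks(num_frames: int, t: int):
    """Yield (start, end) index pairs so the decoder handles t frames
    per pass. t=1 minimizes peak VRAM at the cost of more passes."""
    for start in range(0, num_frames, t):
        yield start, min(start + t, num_frames)

print(list(frame_chunks(25, 1))[:3])  # [(0, 1), (1, 2), (2, 3)]
print(len(list(frame_chunks(25, 1))))  # 25
```

If you have VRAM to spare, a larger t decodes the clip in fewer, faster passes.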
Starting the GUI Again:
- Open PowerShell and navigate to the generative-models folder.
- Activate the virtual environment:
.\venv\Scripts\Activate.ps1
- Set the Python path, for example:
$env:PYTHONPATH = "."
- Start the GUI:
streamlit run scripts/demo/video_sampling.py
Benefits of Using Stable Video Diffusion
Stable Video Diffusion is a state-of-the-art AI video generation technology that creates dynamic videos from static images or text. It is built on the Stable Diffusion image model. Some of the benefits of using Stable Video Diffusion are:
- It can create realistic and high-quality videos from images, with customizable frame rates and resolutions.
- It can adapt to various downstream tasks, such as multi-view synthesis, text-to-video, and video editing.
- It can serve a wide range of video applications in fields such as media, entertainment, education, and marketing.
- It can animate your imagination and elevate concepts into live-action, cinematic creations.
In this article, you learned how to run Stable Video Diffusion img2vid, a generative model that can create videos from images using a technique called diffusion. You learned how to use the model on different platforms, such as Google Colab, ComfyUI, and Windows PowerShell, and how to adjust the parameters to improve the results. I hope you found this article helpful and informative, and that you enjoyed creating your own videos from images.