Date: 2023-05-30

Soundstorm-Pytorch is a PyTorch implementation of SoundStorm, Google DeepMind's efficient parallel audio generation method. It can generate high-quality audio faster and more consistently than the autoregressive approach of AudioLM. It can also synthesize natural dialogue segments from a transcript annotated with speaker turns and voice prompts.

SoundStorm applies MaskGiT to the residual vector quantized codes produced by SoundStream. Its transformer is a Conformer, an architecture well suited to the audio domain.
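To make "residual vector quantized codes" concrete, here is a toy sketch of residual quantization on a single number. It is not SoundStream's implementation: the scalar codebooks and the helper names `rvq_encode` and `rvq_decode` are assumptions chosen for illustration. Each stage quantizes whatever error the previous stage left behind, so one input yields one code index per quantizer.

```python
def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = x
    codes = []
    for codebook in codebooks:
        # pick the entry closest to what is still unexplained
        idx = min(range(len(codebook)), key=lambda i: abs(residual - codebook[i]))
        codes.append(idx)
        residual -= codebook[idx]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruct the value by summing the selected entry of every stage."""
    return sum(codebook[i] for codebook, i in zip(codebooks, codes))


# Toy codebooks (illustrative values): coarse first, finer afterwards.
codebooks = [[-1.0, 0.0, 1.0], [-0.3, 0.0, 0.3], [-0.1, 0.0, 0.1]]
codes, residual = rvq_encode(0.74, codebooks)
approx = rvq_decode(codes, codebooks)
```

In the real model each codebook entry is a vector rather than a number, and `num_quantizers` sets how many such stages there are.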

Installation

Install the soundstorm-pytorch library using pip:

$ pip install soundstorm-pytorch

Import the required modules in your Python script:

import torch
from soundstorm_pytorch import SoundStorm, ConformerWrapper

Create an instance of the ConformerWrapper class, which wraps the Conformer model:

conformer = ConformerWrapper(
    codebook_size=1024,
    num_quantizers=4,
    conformer=dict(
        dim=512,
        depth=2
    ),
)

Create an instance of the SoundStorm model by passing the conformer instance and other parameters:

model = SoundStorm(
    conformer,
    steps=18,          # 18 steps, as in the original MaskGiT paper
    schedule='cosine'  # currently the best schedule is cosine
)
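The `steps` and `schedule` arguments control how quickly masked tokens are revealed during generation. As a rough sketch of the cosine schedule described in the MaskGiT paper (the library's exact rounding and implementation may differ, and `cosine_mask_counts` is a hypothetical helper), the number of tokens still masked after each step can be computed like this:

```python
import math


def cosine_mask_counts(seq_len, steps=18):
    """Tokens left masked after each decoding step under a cosine schedule."""
    counts = []
    for t in range(steps):
        # ratio falls from just under 1 to 0 as t approaches the last step
        ratio = math.cos(math.pi / 2 * (t + 1) / steps)
        counts.append(math.floor(ratio * seq_len))
    return counts


counts = cosine_mask_counts(seq_len=1024, steps=18)
```

The schedule reveals few tokens in the early steps, when the model has little context, and many in the later steps.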

In practice, you obtain pre-encoded codebook IDs by running SoundStream over raw audio. The example below uses random codebook IDs as a stand-in:

codes = torch.randint(0, 1024, (2, 1024))

Run a training step. In practice, repeat this in a loop over a large amount of data:

loss, _ = model(codes)
loss.backward()
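The step above only computes gradients; a full loop also needs an optimizer to apply them. The following is a minimal sketch, not the library's own trainer. `train_steps` is a hypothetical helper that assumes only that the model returns a `(loss, extra)` tuple, as in the call above, and that the optimizer exposes the standard PyTorch `step()` and `zero_grad()` methods.

```python
def train_steps(model, optimizer, batches):
    """Run one optimizer update per batch and return the recorded losses.

    Hypothetical helper: assumes `model(batch)` returns `(loss, extra)`
    and that `optimizer` follows the torch.optim interface.
    """
    losses = []
    for batch in batches:
        loss, _ = model(batch)      # forward pass, as in the step above
        loss.backward()             # accumulate gradients
        optimizer.step()            # apply the update
        optimizer.zero_grad()       # clear gradients for the next batch
        losses.append(float(loss))  # works for 0-dim tensors and plain numbers
    return losses


# Possible usage with the objects defined earlier (assumed, not run here):
#   optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
#   batches = (torch.randint(0, 1024, (2, 1024)) for _ in range(100))
#   train_steps(model, optimizer, batches)
```

The optimizer choice and learning rate in the usage comment are placeholders, not recommendations from the library.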

Use the trained model to generate speech. Specify the desired length and batch size:

generated = model.generate(1024, batch_size=2) # (2, 1024)
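Generation in a fixed number of steps follows the MaskGiT idea: start with every position masked, predict all positions in parallel, keep the most confident predictions, and leave the rest masked for the next step. The toy sketch below illustrates that loop under loud assumptions: `toy_predict` is a random stand-in for the trained network, `maskgit_decode` is a hypothetical name, and the real SoundStorm procedure (which also works across quantizer levels) is more involved.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been decided yet


def toy_predict(tokens, rng, codebook_size=1024):
    """Stand-in for the model: a (token, confidence) guess for every position."""
    return [(rng.randrange(codebook_size), rng.random()) for _ in tokens]


def maskgit_decode(seq_len, steps=18, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for t in range(steps):
        preds = toy_predict(tokens, rng)
        # how many positions must remain masked after this step (cosine schedule)
        keep_masked = math.floor(math.cos(math.pi / 2 * (t + 1) / steps) * seq_len)
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        # commit the most confident predictions first
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = preds[i][0]
    return tokens


generated_toy = maskgit_decode(seq_len=1024, steps=18)
```

Because the schedule reaches zero on the final step, every position holds a token when the loop ends.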

If you want to train the model on raw audio, you can pass a pretrained SoundStream instance to the SoundStorm model. Here’s an example:

Import the required modules for training on raw audio:

import torch
from soundstorm_pytorch import SoundStorm, ConformerWrapper, Conformer, SoundStream

Create instances of the ConformerWrapper and SoundStream classes:

conformer = ConformerWrapper(
    codebook_size=1024,
    num_quantizers=4,
    conformer=dict(
        dim=512,
        depth=2
    ),
)

soundstream = SoundStream(
    codebook_size=1024,
    rq_num_quantizers=4,
    attn_window_size=128,
    attn_depth=2
)

Create an instance of the SoundStorm model, passing the conformer and soundstream instances:

model = SoundStorm(
    conformer,
    soundstream=soundstream  # pass in the soundstream
)

Prepare the audio data you want the model to learn from. The example below uses a random tensor as a stand-in for real audio:

audio = torch.randn(2, 10080)

Run a training step on the audio data. In practice, repeat this over many batches:

loss, _ = model(audio)
loss.backward()

Use the trained model to generate speech:

generated_audio = model.generate(seconds=30, batch_size=2) # generate 30 seconds of audio
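When `seconds` is given instead of a token count, the duration presumably has to be converted into a number of codec frames, which depends on the codec's sample rate and its total downsampling factor. The arithmetic can be sketched as follows. `seconds_to_frames` is a hypothetical helper, and the 16 kHz sample rate and 320x downsampling factor are illustrative assumptions, not values stated above.

```python
def seconds_to_frames(seconds, sample_rate=16_000, downsample_factor=320):
    """Convert a duration into a number of codec frames.

    Assumed values: 16 kHz audio and a codec that emits one frame
    per 320 input samples.
    """
    num_samples = int(seconds * sample_rate)
    return num_samples // downsample_factor


frames_for_30s = seconds_to_frames(30)
```

Under these assumptions, a longer clip or a codec with less downsampling means proportionally more tokens for the model to generate.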

Make sure you have the required dependencies installed and the necessary audio data available before running the code.

