Stability.ai, the company behind the popular AI art generator Stable Diffusion, recently released Stable Audio Open, an open-source model optimised for generating audio samples and sound effects from textual prompts. It supports the generation of drum beats, instrument riffs, ambient sounds and more. The model was trained on a large dataset drawn from the Free Music Archive and Freesound.
This article explores Stable Audio Open based on its implementation.
Contents
- Understanding Stable Audio Open
- Model Overview
- Key Features
- Model Limitations
- Stable Audio Tools
- Hands-on Implementation of Stable Audio Open
Understanding Stable Audio Open
Stability.ai’s Stable Audio Open is an open-source text-to-audio model that lets users generate short audio samples and sound effects, up to 47 seconds long, from textual prompts and descriptions. The model also supports audio variations and style transfer of audio samples, assisting in the generation of audio and sound effects.
The model weights have been released on Hugging Face so that users can fine-tune the model on their own data for customised audio generation.
Model Overview
Stable Audio Open 1.0 is a latent diffusion model built on a transformer architecture. It combines three primary components:
An autoencoder – It compresses raw audio waveforms into a compact latent representation for processing and decodes generated latents back into audio.
T5-based text embedding – It converts textual descriptions into embeddings the model can condition on. The open-source pre-trained T5 model is used for text conditioning.
Transformer-based diffusion model – It operates in the autoencoder’s latent space and, guided by the text embeddings, iteratively refines noise into the final audio sample.
The dataset used for training the model consists of 486,492 audio recordings, of which 472,618 are from Freesound and 13,874 are from the Free Music Archive.
Key Features
The key features of Stable Audio Open include the following:
Text-to-Audio Generation – The model is capable of generating corresponding audio clips based on user prompts describing the desired sound (e.g., “Indian Classical Music”).
Sample Creation – The model is ideal for creating sound effects, ambient sounds and different production elements for music and sound-design projects.
Audio Variation & Style Transfer – The model can be used for generating different variations of the same sound prompt or infusing the style of one genre into another sound clip.
Customisation – The open weights released on the Hugging Face platform enable users to fine-tune the model on specific data and customise it.
Model Limitations
Stable Audio Open comes with a wide range of advantages but also has certain limitations such as:
- The model is incapable of generating realistic vocal sounds or complex melodies and is limited to generating audio snippets up to 47 seconds long.
- The model may perform better for certain musical styles and cultures due to training data limitations.
- The model is better suited to generating sound effects and short samples than complete musical compositions.
- Prompt engineering may be required in certain scenarios to obtain the desired audio output, as illustrated in the example after this list.
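For instance, a vague prompt can be reworked into a more descriptive one. The prompt strings below are illustrative examples only, using the same conditioning format demonstrated in the hands-on section later in this article:
vague_conditioning = [{"prompt": "drums", "seconds_start": 0, "seconds_total": 30}]
detailed_conditioning = [{
    "prompt": "128 BPM tech house drum loop with a punchy kick and crisp hi-hats",
    "seconds_start": 0,
    "seconds_total": 30
}]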
Stable Audio Tools
This is the overarching project that houses the different tools and functionalities related to Stability.ai’s text-to-audio generation. It provides the framework for training and using different models, including Stable Audio Open.
Stable Audio Tools is a toolbox that provides the key functionalities for training and using audio-generation models:
Training Wrapper – These are tools that help manage the training process for new text-to-audio models, allowing for multi-GPU and multi-node setups for faster and more efficient training.
Model Unwrapping – This helps in extracting the trained model from the training environment, making it usable for actual audio generation (a workflow sketch follows below).
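As a rough sketch of that workflow, training and unwrapping are driven by scripts in the Stable Audio Tools repository. The invocations below are an assumption based on the repository’s documented interface; the exact script names and flags should be verified against the repo:
!python train.py --dataset-config dataset.json --model-config model_config.json --name my_finetune
!python unwrap_model.py --model-config model_config.json --ckpt-path checkpoint.ckpt --name my_model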
Hands-on Implementation of Stable Audio Open
Step 1 – Install the necessary libraries:
- stable-audio-tools – The code package for using Stable Audio Open that includes functions for generating audio from text prompts.
- torch – Stable Audio Open relies on PyTorch for its core functionalities.
- torchaudio – Official PyTorch library specifically designed for working with audio data.
- einops – It offers functionalities for manipulating the tensor shapes used in models.
!pip install stable-audio-tools torch torchaudio einops
Step 2 – Import the libraries:
- einops’s rearrange function for manipulating tensor shapes.
- stable_audio_tools’s get_pretrained_model function is used for downloading the trained Stable Audio Open model.
- stable_audio_tools.inference.generation’s generate_diffusion_cond is used for generating audio based on prompt and conditioning factors.
Also, check whether a CUDA-enabled GPU is available: the device variable is set to “cuda” to use the GPU for faster computation; otherwise, it defaults to “cpu”.
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond
device = "cuda" if torch.cuda.is_available() else "cpu"
Step 3 – Download the pre-trained model and extract the necessary configuration details from the model configuration dictionary:
sample_rate determines the number of samples per second in the audio signal, whereas sample_size is the total number of samples the model generates in one pass; dividing sample_size by sample_rate gives the clip length in seconds.
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
model = model.to(device)
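Printing these values shows what the configuration implies. For Stable Audio Open 1.0, the expected values are a 44,100 Hz sample rate and a sample_size of 2,097,152 samples, which corresponds to the model’s roughly 47-second maximum clip length:
print(sample_rate)                # expected: 44100
print(sample_size)                # expected: 2097152
print(sample_size / sample_rate)  # expected: ~47.55 seconds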
Step 4 – Model Conditioning through a dictionary of conditioning parameters:
The conditioning below requests a 30-second audio sample that sounds like a combination of various Indian classical music instruments (the prompt). The audio sample starts from the beginning, as seconds_start is set to 0.
conditioning = [{
"prompt": "Indian Classical Music instruments",
"seconds_start": 0,
"seconds_total": 30
}]
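The same structure works for other prompts and durations. For example, a hypothetical five-second sound effect could be requested as follows:
conditioning_sfx = [{
    "prompt": "rain falling on a tin roof",
    "seconds_start": 0,
    "seconds_total": 5
}]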
Step 5 – Model execution for generating the audio sample:
- steps indicates the number of iterations the diffusion process will take; more steps generally improve quality at the cost of generation time.
- cfg_scale indicates the classifier-free guidance scale; higher values make the output adhere more closely to the prompt.
- sample_size represents the length of the audio to generate, in samples.
- sigma_min and sigma_max are the minimum and maximum noise magnitudes for the diffusion schedule.
- sampler_type determines the type of sampler used for the diffusion process.
- The device parameter specifies whether computation runs on “cuda” or “cpu”.
output = generate_diffusion_cond(
model,
steps=100,
cfg_scale=7,
conditioning=conditioning,
sample_size=sample_size,
sigma_min=0.3,
sigma_max=500,
sampler_type="dpmpp-3m-sde",
device=device
)
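Before reshaping, it can be helpful to inspect the raw output. For a single prompt, the tensor is expected to have the shape (batch, channels, samples), e.g. torch.Size([1, 2, 2097152]), since the model generates stereo audio:
print(output.shape)  # expected: torch.Size([1, 2, 2097152]) -> (batch, channels, samples)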
Step 6 – Tensor shape manipulation:
The rearrange function from the einops library applies the pattern "b d n -> d (b n)", where b, d and n represent the batch, channel and time (sample) dimensions respectively. The rearrangement folds the batch dimension into the time dimension, producing the (channels, samples) layout that torchaudio.save expects; with a batch of one, as here, this is equivalent to output.squeeze(0).
output = rearrange(output, "b d n -> d (b n)")
Step 7 – Process and save the generated audio:
- output.to(torch.float32) converts the output tensor to 32-bit floating-point format.
- .div(torch.max(torch.abs(output))) normalises the audio by its peak absolute value.
- .clamp(-1, 1) clips the audio values to the range -1 to 1.
- .mul(32767) scales the values to the range of 16-bit signed integers.
- .to(torch.int16) converts the audio to 16-bit signed integer format.
- .cpu() moves the audio tensor to the CPU.
- torchaudio.save("output.wav", output, sample_rate) saves the generated audio to a WAV file named “output.wav”.
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
An audio sample is generated based on the prompt “Indian Classical Music Instruments”.
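When running in a notebook, the saved file can be auditioned inline using IPython’s display utilities:
from IPython.display import Audio
Audio("output.wav")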
Final Words
While limitations exist, Stable Audio Open is a great tool that empowers audio and sound designers and lowers the barrier to entry for AI-based audio creation. With Stable Audio Open, users can create unique audio samples with text-based prompts and explore the boundless possibilities of AI-generated audio.