Stability.ai’s Stable Audio Open: A Text-to-Audio Generation Model

Stability.ai's new Stable Audio Open generates audio from text prompts, enhancing creative possibilities in sound design.

Stability.ai, the company behind the popular AI art generator Stable Diffusion, recently released Stable Audio Open, an open-source model optimised for generating audio samples and sound effects from textual prompts. It can produce drum beats, instrument riffs, ambient sounds and more. The model was trained on a large dataset drawn from the Free Music Archive and Freesound. 

This article explores Stable Audio Open based on its implementation. 

Contents

  1. Understanding Stable Audio Open
    1. Model Overview
    2. Key Features
    3. Model Limitations
  2. Stable Audio Tools
  3. Hands-on Implementation of Stable Audio Open

Understanding Stable Audio Open

Stability.ai’s Stable Audio Open is an open-source text-to-audio model that lets users generate short audio samples and sound effects of up to 47 seconds from textual prompts and descriptions. The model also supports generating variations of a sound and transferring the style of one audio sample onto another, assisting in the creation of audio and sound effects. 

The model weights have been released on Hugging Face so that users can fine-tune the model on their own data for customised audio generation. 

Model Overview

Stable Audio Open 1.0 is a latent diffusion model based on a transformer architecture. It combines three primary components: 

Autoencoder – It compresses audio waveforms into a compact latent representation for processing and decodes generated latents back into audio. 

T5-based text embedding – It converts textual descriptions into a format the model understands. The open-source pre-trained T5 model is used for text conditioning. 

Transformer-based diffusion model – It generates audio in the latent space based on the text embedding and iteratively refines it to produce the final audio sample. 
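To make the interaction among these components concrete, the following is a minimal, purely illustrative Python sketch. The stub functions and tensor shapes are assumptions for demonstration only and are not the actual Stable Audio Open code; the real model replaces each stub with a trained network.

import torch

# Illustrative pipeline: T5 text embedding -> iterative denoising in the latent
# space -> autoencoder decoding back to a waveform. All modules are stubs.

text_embedding = torch.randn(1, 64, 768)    # stand-in for a T5 embedding (assumed shape)
latent = torch.randn(1, 64, 1024)           # random noise the diffusion model starts from

def diffusion_step(latent, text_embedding):
    # Stand-in for one denoising step of the transformer-based diffusion model.
    return 0.9 * latent                     # dummy update, not the real network

for _ in range(100):                        # iterative refinement over many steps
    latent = diffusion_step(latent, text_embedding)

def autoencoder_decode(latent):
    # Stand-in for the autoencoder decoder: latent -> stereo waveform.
    return torch.randn(1, 2, 44100 * 10)    # dummy 10-second stereo clip at 44.1 kHz

waveform = autoencoder_decode(latent)
print(waveform.shape)                       # torch.Size([1, 2, 441000])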

The dataset used to train the model consists of 486,492 audio recordings: 472,618 from Freesound and 13,874 from the Free Music Archive. 

Key Features

The key features of Stable Audio Open include the following: 

Text-to-Audio Generation – The model is capable of generating corresponding audio clips based on user prompts describing the desired sound (e.g., “Indian Classical Music”). 

Sample Creation – The model is ideal for creating sound effects, ambient sounds and other production elements for music and sound-design projects. 

Audio Variation & Style Transfer – The model can be used for generating different variations of the same sound prompt or infusing the style of one genre into another sound clip. 

Customisation – The open weights released on the Hugging Face platform enable users to fine-tune the model on specific data and customise it. 

Model Limitations

Stable Audio Open comes with a wide range of advantages but also has certain limitations: 

  • The model is incapable of generating realistic vocal sounds or complex melodies and is limited to generating audio snippets up to 47 seconds long. 
  • Because of limitations in the training data, the model may perform better for some musical styles and cultures than others. 
  • The model is better suited to generating sound effects than complete musical compositions. 
  • Prompt engineering may be required in certain scenarios to obtain proper and appropriate audio responses. 

Stable Audio Tools 

This is the overarching project that houses the different tools and functionalities related to Stability.ai’s text-to-audio generation. It provides the framework for training and using different models, including Stable Audio Open. 

Stable Audio Tools acts as a toolbox, providing key functionalities for audio generation: 

Training Wrapper – These are tools that help manage the training process for new text-to-audio models, allowing for multi-GPU and multi-node setups for faster and more efficient training. 

Model Unwrapping – This helps in the extraction of the trained model from the training environment, making it usable for actual audio generation. 

Hands-on Implementation of Stable Audio Open

Step 1 – Install the necessary libraries: 

  • stable-audio-tools – The code package for using Stable Audio Open that includes functions for generating audio from text prompts. 
  • torch – Stable Audio Open relies on PyTorch for its core functionalities. 
  • torchaudio – Official PyTorch library specifically designed for working with audio data. 
  • einops – It offers functionalities for manipulating the tensor shapes used in models. 
!pip install stable-audio-tools torch torchaudio einops

Step 2 – Import the libraries: 

  • einops’s rearrange function for manipulating tensor shapes. 
  • stable_audio_tools’s get_pretrained_model function is used for downloading the trained Stable Audio Open model. 
  • stable_audio_tools.inference.generation’s generate_diffusion_cond is used for generating audio based on prompt and conditioning factors. 

Also, check if a CUDA-enabled GPU is available. The device variable is set to “cuda” to use the GPU for faster computations; otherwise, it defaults to “cpu”. 

import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

Step 3 – Download the pre-trained model and extract the necessary configuration details from the model configuration dictionary:

sample_rate determines the number of samples per second in the audio signal, whereas sample_size is the total number of samples in each generated clip. 

model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
model = model.to(device)
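As a quick sanity check, these two values give the maximum clip length the model can generate. For the released checkpoint, the configuration is expected to specify a 44,100 Hz sample rate and roughly 2.1 million samples, which lines up with the ~47-second limit mentioned earlier (values assumed from the published model card, not hard-coded here).

# Quick sanity check of the configuration; exact numbers depend on the checkpoint.
print(f"Sample rate: {sample_rate} Hz")
print(f"Maximum clip length: {sample_size / sample_rate:.1f} seconds")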

Step 4 – Model Conditioning through a dictionary of conditioning parameters:

The conditioning here requests a 30-second audio sample that sounds like a combination of various Indian classical music instruments (the prompt). The audio sample starts from the beginning of the clip window, as seconds_start is set to 0. 

conditioning = [{
   "prompt": "Indian Classical Music instruments",
   "seconds_start": 0,
   "seconds_total": 30
}]
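The same structure works for any other request; the snippet below is a hypothetical second conditioning that asks for a shorter, 15-second ambient clip instead.

# Alternative conditioning with a hypothetical prompt: a 15-second ambient clip.
conditioning_ambient = [{
   "prompt": "gentle rain with distant thunder",
   "seconds_start": 0,
   "seconds_total": 15
}]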

Step 5 – Model Execution for generating audio sample: 

  • steps indicates the number of iterations the diffusion process will take. 
  • cfg_scale indicates the classifier-free guidance scale, i.e. how strongly the output follows the prompt. 
  • sample_size represents the length of audio to generate, in samples. 
  • sigma_min and sigma_max are the minimum and maximum noise magnitudes. 
  • sampler_type determines the type of sampler used for the diffusion process. 
  • device specifies whether computation runs on “cuda” or “cpu”.
output = generate_diffusion_cond(
   model,
   steps=100,
   cfg_scale=7,
   conditioning=conditioning,
   sample_size=sample_size,
   sigma_min=0.3,
   sigma_max=500,
   sampler_type="dpmpp-3m-sde",
   device=device
)
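Fewer diffusion steps and a lower guidance scale trade fidelity for speed, which can be useful while iterating on prompts. The call below reuses the same API with illustrative draft-quality settings; the values are assumptions, not recommended defaults.

# Faster, lower-fidelity draft generation (illustrative parameter choices).
draft_output = generate_diffusion_cond(
   model,
   steps=50,
   cfg_scale=5,
   conditioning=conditioning,
   sample_size=sample_size,
   sigma_min=0.3,
   sigma_max=500,
   sampler_type="dpmpp-3m-sde",
   device=device
)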

Step 6 – Tensor shape manipulation:

The rearrange function from the einops library applies the rearrangement pattern “b d n -> d (b n)”, where b, d and n represent the batch dimension, the audio channel dimension and the number of samples, respectively. The pattern folds the batch dimension into the sample dimension, leaving a (channels, samples) tensor in the layout torchaudio.save expects. 

output = rearrange(output, "b d n -> d (b n)")
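The effect of the pattern is easy to see on a small dummy tensor; the illustration below (reusing the imports from Step 2) folds a batch of one stereo clip with four samples into a plain (channels, samples) tensor.

# Illustration only: batch b=1, channels d=2 (stereo), n=4 samples.
dummy = torch.arange(8).reshape(1, 2, 4)
flat = rearrange(dummy, "b d n -> d (b n)")
print(dummy.shape, "->", flat.shape)   # torch.Size([1, 2, 4]) -> torch.Size([2, 4])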

Step 7 – Process and save the generated audio: 

  • output.to(torch.float32) is for converting the output tensor to 32-bit floating point format. 
  • .div(torch.max(torch.abs(output))) normalises the audio by its peak absolute value. 
  • .clamp(-1, 1) is for clipping the audio values between -1 and 1. 
  • .mul(32767) scales the audio values to the range of 16-bit signed integers. 
  • .to(torch.int16) is for converting the audio to 16-bit signed integer format. 
  • .cpu() moves the audio tensor to the CPU. 
  • torchaudio.save("output.wav", output, sample_rate) saves the generated audio to a WAV file named “output.wav”. 
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()

torchaudio.save("output.wav", output, sample_rate)
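Optionally, the saved file can be played back directly inside the notebook (assuming an IPython-based environment such as Jupyter or Colab):

# Optional: listen to the generated clip inside the notebook.
from IPython.display import Audio
Audio("output.wav")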

An audio sample is generated based on the prompt “Indian Classical Music instruments”. 


Final Words

While limitations exist, Stable Audio Open is a great tool that empowers audio and sound designers and lowers the barrier to entry for AI-based audio creation. With Stable Audio Open, users can create unique audio samples with text-based prompts and explore the boundless possibilities of AI-generated audio. 


