
How to Use Stable Audio, Stability AI's Text-to-Music Service

Stability AI has been celebrated as a best-in-class AI model developer since their initial launch back in 2021. One of their most popular products, Stable Diffusion, has rivaled other big text-to-image services like Midjourney and DALL·E 2.

The September 2023 release of Stable Audio marks a milestone for both Stability AI and the generative music sector. The new platform's audio fidelity seems to be a step up from the output of text-to-music competitors MusicGen and MusicLM.

Within hours of launching, the product received glowing reviews from Billboard, TechCrunch, VentureBeat, and The Verge. But less than 24 hours later, interest in the app surged past a breaking point and servers reached full capacity. One developer on the team reported that they had run up against an unexpected Cloudflare API bug. At the time of writing, the service has been back up and running intermittently.

Stable Audio downtime tweet

While the team "stabilizes" their platform (sorry, I couldn't resist), we've gone ahead and typed up a tutorial that will help readers learn how to formulate the best possible text-to-music prompts. These guidelines were tested and published by members of the Stability AI team.

After this tutorial, we'll summarize Stable Audio's terms of service so you know how to use it safely, and include a brief recap of the product's backstory. To round things out, we'll include a feature comparison with their main competitor, MusicGen. Meta still has a leg up in the race, thanks to capabilities like audio conditioning and track extension.


How to use Stable Audio's text-to-music generator

Stable Audio lets users generate raw audio from text descriptions of music and sound in general. This tutorial starts with some prompt techniques, then works through the app's terms of service, and finally covers how the product came to be through Stability's Harmonai music lab.

To get started, navigate to the Stable Audio website. Once you've signed up and accepted their terms of service, you'll arrive at the dashboard shown below.

Overview of the Stable Audio interface

Stable Audio interface

The top left quarter of Stable Audio's interface holds the text area where you'll enter your music prompts. It also offers controls for output duration. You can monitor the number of music generation credits left in your account by referencing the music note symbol in the upper right corner. The number goes down with each track you create.

Each time you submit a text prompt, a new list item appears in the bottom left container. Meanwhile, the right half of the interface displays playback controls and gives you the option to download the track or vote on its quality.

Using AudioSparx to guide Stable Audio prompts

Of course, the million dollar question is: What the hell do I type into this thing?

Stable Audio's AI model was trained on AudioSparx, a dataset with over 800,000 audio files and 19.5K hours of music, sound effects, and single-instrument stems.

Having trained exclusively on audio from this library, the model performs best when your prompts use terms that align with that dataset. To discover the terms in the training data, navigate to the AudioSparx website and click on the Music tab.

The following dropdown menu will appear:

AudioSparx genre list

Each of the top-level music genres is linked to a separate page, where visitors will find a list of relevant subgenres.

In the example below, we've selected electronic music and are viewing the first few subgenres, listed alphabetically. The number of tracks in each collection is shown to the left of the label. Subgenres with more tracks may give Stable Audio a richer, more diverse set of ideas to draw from when it's time to generate music.

AudioSparx subgenres

Click through a subgenre to view the full collection of audio files it contains. Under the title of each track, you'll find a rich text description. Try copying and pasting the descriptive text directly into Stable Audio's prompt field and see what happens. Tweak the text and iterate through multiple rounds until you're satisfied with the music it creates.

AudioSparx track descriptions

Just be careful when using descriptions that contain artist names. The third example above names Aphex Twin, Radiohead, and others.

As we explain later in this article, Stable Audio's terms of service forbid the misuse of intellectual property. I didn't see any line specifically stating that users cannot submit artist names in their prompts, but reading between the lines, that is the most obvious interpretation.

My interpretation is that you are probably safe to experiment with artist names as long as it's for your own enjoyment. For ethical and legal reasons, it would be best to avoid using music commercially that was seeded from artist names.

Stability AI's Jordi Pons shares text prompting techniques

Stability research scientist Jordi Pons published an article this month with some great tips for prompting Stable Audio. I'll summarize these techniques below, so you have a better idea of how to construct phrases from the genre, subgenre, and music descriptions found on AudioSparx.

Music prompt technique 1: Provide a list of musical attributes

One of the simplest places to start is with descriptors like genre, instrument, mood, and tempo.

Example: Lo-fi hip hop, piano, bass, drums, relaxing, chill, 90 BPM
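If you find yourself iterating on many attribute-list prompts, it can help to keep the attributes separate and join them at the last moment. Here's a minimal sketch of that idea; the helper name and structure are our own invention, not part of Stable Audio's interface:

```python
def build_prompt(genre, instruments, moods, bpm=None):
    """Join musical attributes into a comma-separated Stable Audio prompt.

    This is a hypothetical convenience function for organizing prompt text;
    Stable Audio itself just accepts a plain string in its prompt field.
    """
    parts = [genre, *instruments, *moods]
    if bpm is not None:
        parts.append(f"{bpm} BPM")  # tempo goes last, per the example above
    return ", ".join(parts)

print(build_prompt("Lo-fi hip hop", ["piano", "bass", "drums"],
                   ["relaxing", "chill"], bpm=90))
# → Lo-fi hip hop, piano, bass, drums, relaxing, chill, 90 BPM
```

Swapping out one list at a time (moods only, then instruments) makes it easier to tell which attribute actually changed the generated music.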

Music prompt technique 2: Combine musical and non-musical descriptions

Try adding in non-musical descriptions and see how it carries over into the feeling of the music.

Example: Island song, marimba, the relaxing experience of standing on the ocean with sand beneath your feet, listening to the waves while palm trees sway in the breeze.

Refining your prompts

  • If you find that musical output sounds too digital or electronic, Jordi suggests adding keywords like "Live" or "Band" to the prompt.

  • You may be able to improve the audio quality by typing in "stereo", "high-quality", and "44.1kHz".

  • To spice up the melody, try adding the word "Solo" after the name of the track's lead instrument.

These are the basic principles of text prompting, but there's always room for further experimentation. In the next section we'll share a novel example.

The Genre Fusion technique by CJ Carr of Dadabots

CJ Carr of Dadabots has been a part of the Harmonai team for several years. He's one of my personal favorites in the AI music space, because of his anachronistic and mind-bending approach to audio synthesis.

The video above begins with a quick overview from CJ, explaining how the system works, followed by a demo of his new genre fusion technique. What happens when you mash together two unlikely genres? Can we create entirely new styles of music, the likes of which the world would otherwise never have heard?

Stable Audio genre fusion

Genre Fusion prompt format: The prompt format in this demo combines two phrases, each prefixed with "Subgenre: " and separated by a pipe symbol (|).

Ideas for experimenting: Try entering two styles that are known to have opposite tempos, like "Subgenre: Breakbeat|Subgenre: Lo-fi Hip Hop". You can also try genres with opposite styles like "Subgenre: Death Metal|Subgenre: New Age Relaxation".
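The fusion format above is easy to generate programmatically if you want to sweep through many genre pairs. Below is a small sketch; only the "Subgenre: A|Subgenre: B" format comes from the demo, while the function itself is our own hypothetical helper:

```python
def fuse_genres(*subgenres):
    """Build a genre fusion prompt: each style prefixed with 'Subgenre: '
    and joined with a pipe symbol, per the format shown in the demo."""
    return "|".join(f"Subgenre: {s}" for s in subgenres)

print(fuse_genres("Breakbeat", "Lo-fi Hip Hop"))
# → Subgenre: Breakbeat|Subgenre: Lo-fi Hip Hop

# Sweep a few unlikely pairings to paste into the prompt field one at a time
for pair in [("Death Metal", "New Age Relaxation"),
             ("Bluegrass", "Drum and Bass")]:
    print(fuse_genres(*pair))
```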

At any other time in history, genre fusions like this would have been stuck in a kind of unexpressed, latent space. But now, with a few bits of text and a moment of processing, Stable Audio does the heavy lifting for us and presents us with new ideas.

Genre bending is a safer alternative to what I would call artist blending. The prompt can swap in the word "Artist" for "Subgenre" to create hybrids of multiple artists. But as I mentioned earlier, we're entering a legal gray area as soon as we start injecting individual artists' brands into our prompts.

So on that note, let's have a closer look at the terms of service.

Stable Audio's Terms of Service

Most of us skip right past the terms of service agreement when we register for a new app. But when it comes to AI music generation, it's important to have a basic familiarity with the terms.

Here are some of the most important things to know about Stable Audio's TOS:

  1. Age limit: You must be at least 13 years of age to legally use the service

  2. The music is yours: Users own the content they generate, subject to the terms and applicable law.

  3. Don't train other AI models with Stable Audio: Users are prohibited from using the service or its generated content to train other AI models.

  4. Respect artist IP: Users must not infringe on intellectual property.

  5. No class action lawsuits, please: Both parties waive the right to a jury trial. Disputes between you and "Stability" will be settled through binding arbitration rather than in court. Small claims court can be used for individual claims, and you can report to federal or state agencies.

    1. Dispute Resolution: Both parties must try to resolve disputes informally for 60 days after notice is given. If unresolved, arbitration can commence.

    2. Confidentiality: Arbitration proceedings are confidential.

    3. Opt Out: You can opt-out of the arbitration agreement within 30 days of account creation by notifying Stability.

  6. If you get sued, you pay the legal bills: Users indemnify Stability against claims arising from intellectual property infringement, misuse of the Services, or violation of the Terms. Stability and its representatives are not liable for indirect, special, or consequential damages or losses.

Summary of pricing for the Free, Professional and Enterprise tiers

  1. Free Tier users are capped at 20 tracks per month with a maximum duration of 45 seconds. They cannot use generated content commercially.

  2. Professional Tier users get 500 tracks per month with a 90-second maximum duration and can use the music in commercial projects with fewer than 100K monthly active users.

  3. Enterprise Tier users can customize the max duration and volume of music generation, but need to contact the company for a quote. See details on the pricing page.

This is only a summary of terms and should not be considered a complete report. We've covered the points that we found to be the most relevant, but you should still read the TOS before signing up.

How does Stable Audio compare to MusicGen?

Stable Audio and Meta's MusicGen are both AI text-to-music platforms, but MusicGen includes options for audio conditioning and track extension. Users can upload audio files to MusicGen and submit text prompts to modify them.

We've previously covered several use cases for MusicGen, including the creation of film scores, infinite music, and the ability to expand on songs we've heard in our dreams. Of these three, Stable Audio could be swapped in for the film scoring and infinite music techniques. To expand on music from our own mind, we need to be able to pass in a musical condition, so MusicGen is still the superior service in that regard.

You can watch a demo of MusicGen's audio conditioning below:

The MusicGen model doesn't have a dedicated user interface. Instead, it typically runs on services like Hugging Face and Google Colab, where users pay by the hour according to the amount of GPU time they use. Some people with sufficient VRAM run the models locally on their own computers.

All in all, I think it's fair to call Stable Audio the more consumer-friendly service. If they can deliver their service at scale and introduce audio conditioning, with the option to extend generated tracks beyond their initial duration, they will become the superior product.

Backstory: Harmonai, Dance Diffusion and Riffusion

After struggling to raise funding at a higher valuation in June 2023, Stability seems to have felt some pressure to build a profitable service. September's flood of user traffic will likely improve investor sentiment, especially if they can become profitable and deliver reliable service.

Stability's first popular AI music models, like Disco Diffusion and Dance Diffusion, were developed by the company's audio lab Harmonai. They were adopted at the grassroots level, by people running the models in Google Colab and on Hugging Face. These interfaces were a bit technical for the average user.

The decision to launch a text-to-music service may have been inspired in part by Riffusion. Published by third-party developers in December 2022, the web app leveraged Stability's image generation model to train on labeled spectrograms (images of sound) and generate new spectrogram images from text. Riffusion then stitched these clips together and sonified them, turning the image data back into sound.

Despite its low-fidelity audio output, Riffusion met the latent demand for text-to-audio services. It was the first such AI product to market, beating companies like Google and Meta to the punch. Within less than a month, Google's AI music team turned around and published a paper spilling the beans on their own text-to-music prototype, MusicLM.

The actual MusicLM app was released as an alpha in May, followed closely by Meta's MusicGen in June. Around this time, it became crystal clear that text prompts were on their way to becoming a staple of AI music generation.

Stable Audio represents a bold step forward into offering paid text-to-music services. To date, Google and Meta have remained in the legally safe space of free services.

None of these tools currently support text-to-MIDI, which means music producers can generate samples and ideas, but not MIDI material they can edit.

The AudioCipher VST is currently the only software available for music producers who want to explore text prompting within their DAW. It's important to note that our tool is not powered by an AI model. It doesn't create music that mirrors your descriptions; instead, it uses a centuries-old letter-to-note musical cryptogram technique.

While the absence of artificial intelligence disappoints some people, it delights others. AudioCipher lets users choose a specific scale, control rhythm automation, generate chord progressions and melodies, use the tool offline, and pay a one-time fee with lifetime upgrades. We don't monitor your text input remotely, and you can use any word or phrase you want without worrying about legal liability.

A new version of the app is due for release before the end of 2023. The money earned from AudioCipher sales goes toward maintaining this free blog, where we cover the latest developments in AI music.

AudioCipher v3

To learn more about our text-to-midi plugin, visit the AudioCipher website.
