
5 Reasons Google's MusicLM AI Text-to-Music App is Different

Google has officially entered the text-to-music arena with a new machine learning product called MusicLM. The developers published their first paper, with a companion GitHub page, in the last week of January 2023, and it received immediate attention from tech hubs like Hacker News. The paper claims that MusicLM can generate high-fidelity audio from text, and dozens of audio samples are provided for readers to reference.

This isn't the first time Google has taken a stab at creating music using artificial intelligence. We've previously covered their MIDI-generating software, Google Magenta Studio, along with other innovative tools like DDSP for tone transfer. However, MusicLM does represent their first effort to create a text-to-music app.

If you've been following tools like DALL-E and Midjourney over the past couple of years, you may already be familiar with OpenAI's MIDI generator, MuseNet, and their audio file generator, Jukebox. These apps were an impressive first step, but neither offered the freedom of text-to-music generation.

Near the end of December 2022, a small developer team published a text-to-music app called Riffusion. AI startup Mubert has also released a text-to-music web application that uses text prompts to splice together loops created by humans. Neither of these tools holds a candle to what MusicLM appears to be capable of, though.

In this article, we'll outline the qualities that make Google's MusicLM product exceptional. We'll also highlight the number one limitation of their software and how we'll be able to overcome it with AudioCipher.

What makes Google's MusicLM unique?

There are several other AI music apps out there, so why should we care about Google's contribution? Let's break it down one feature at a time.

  1. MusicCaps dataset

  2. Long Generation: Consistent musical output

  3. Audio generation from rich captions

  4. Story mode

  5. Melody conditioning

  6. Painting descriptions

The MusicCaps dataset: A new approach to descriptions

Figure: MusicCaps dataset sample

In the spirit of transparency, Google released their MusicCaps dataset through Kaggle. Each of the 5,521 music samples is labeled with English descriptions, including aspect lists and free text captions.

An aspect list is a comma-separated collection of short phrases describing the music, whereas the free-text captions are natural-language descriptions written by expert musicians.

Example of an aspect list: "pop, tinny wide hi hats, mellow piano melody, high pitched female vocal melody, sustained pulsating synth lead"

Example of a free text caption: "A low sounding male voice is rapping over a fast paced drums playing a reggaeton beat along with a bass. Something like a guitar is playing the melody along. This recording is of poor audio-quality. In the background a laughter can be noticed. This song may be playing in a bar."
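To make the difference between the two label types concrete, here is a minimal Python sketch that splits the aspect list quoted above into its individual phrases. The helper name `parse_aspect_list` is hypothetical and chosen for illustration; it is not part of the MusicCaps release.

```python
from typing import List


def parse_aspect_list(raw: str) -> List[str]:
    """Split a comma-separated aspect list into clean, non-empty phrases."""
    return [phrase.strip() for phrase in raw.split(",") if phrase.strip()]


# The aspect list quoted in this article.
raw_aspects = (
    "pop, tinny wide hi hats, mellow piano melody, "
    "high pitched female vocal melody, sustained pulsating synth lead"
)

aspects = parse_aspect_list(raw_aspects)

# Each aspect is a short tag; a free-text caption would stay a single string.
for aspect in aspects:
    print(aspect)
```

An aspect list breaks down into discrete tags like these, while a free-text caption is kept whole as several sentences of ordinary prose.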

This training set differs from OpenAI's Jukebox data because it focuses on how the music sounds instead of metadata about the music, like artist name or genre.

The MusicCaps developers have published a separate paper on arXiv describing their goal of generating music from text. It was released in tandem with the publication of MusicLM.

Long Generation: Consistent musical output over time

The newly published MusicLM paper claims that their network remains consistent for several minutes. AI developers have had a particularly hard time generating good AI music because of the limits of LSTM, or long short-term memory. LSTM is a type of recurrent neural network (RNN) designed to help a model retain context over a period of time.

To really get a feel for this problem, I suggest checking out OpenAI's Jukebox Sample Explorer. You'll find that the music tends to lose focus and devolve by the end of the clip, so that a rap song in the style of Machine Gun Kelly gradually morphs into a convoluted reggae death metal tune. MusicLM claims to outperform these other AI music generators with a hierarchical sequence-to-sequence model that outputs audio at 24 kHz.
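For a sense of scale, the short Python sketch below works out how many audio samples a clip contains at 24 kHz. It assumes a single (mono) channel, and the function name `samples_for` is made up for this example. It says nothing about how MusicLM works internally; it only shows how much raw audio "several minutes" represents.

```python
SAMPLE_RATE_HZ = 24_000  # 24 kHz, the output rate cited for MusicLM


def samples_for(duration_seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of samples in a mono clip of the given length."""
    return int(duration_seconds * sample_rate_hz)


one_second = samples_for(1)
three_minutes = samples_for(3 * 60)

print(f"1 second  -> {one_second:,} samples")
print(f"3 minutes -> {three_minutes:,} samples")
```

Three minutes of audio works out to more than four million samples, which helps explain why keeping a piece musically coherent from start to finish is difficult.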

Audio generation from rich captions

Figure: Example of a MusicLM caption

Thanks to the MusicCaps dataset, MusicLM is able to receive long-form text input with rich descriptions of music. This means that users will not need any technical music theory knowledge to create songs. Filmmakers and video game developers will eventually be able to generate the sounds they need on demand, simply by describing the scenes in question.

Story Mode: Fluid progression through a series of prompts

Figure: Story Mode example

MusicLM includes a story mode that lets users describe time stamps where the music should evolve. Prompts could include abstract feelings and words like "fireworks" as well as genres like "rock song" and "string quartet". Behind the scenes, the model works to create a smooth musical transition from one semantic framework to the next.
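One way to picture a story mode input is as a schedule of timestamped prompts. The Python sketch below is purely illustrative: MusicLM has no public API at the time of writing, so the `story` list, the 15-second spacing, and the `active_prompt` helper are all assumptions rather than anything taken from Google's code.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical schedule: (start time in seconds, text prompt).
# The prompts are the examples mentioned above; the timings are invented.
story: List[Tuple[float, str]] = [
    (0, "rock song"),
    (15, "string quartet"),
    (30, "fireworks"),
]


def active_prompt(schedule: List[Tuple[float, str]], t: float) -> str:
    """Return the prompt in effect at time t (schedule sorted by start time)."""
    starts = [start for start, _ in schedule]
    index = bisect_right(starts, t) - 1
    if index < 0:
        raise ValueError("t is earlier than the first prompt")
    return schedule[index][1]


for second in (0, 14, 15, 40):
    print(second, "->", active_prompt(story, second))
```

A real system would also need to blend the music across each boundary, which is the smooth transition described above. This sketch only captures which prompt applies at a given moment.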

Melody conditioning: Hum or whistle melodies into any style

Figure: MusicLM's melody and text prompt grid

Melody conditioning with text prompts is where things start to get pretty crazy.

MusicLM lets you input any kind of audio sample, like humming, whistling, or even guitar melodies. You can then type in a short text prompt describing the style of audio that you want to hear. In the published samples, it does a phenomenal job of replicating the provided melody in that style.

We'll return to this later, to explain how AudioCipher could be used to overcome the lack of key signature and tempo controls.

Painting Caption Conditioning

Figure: Turning painting captions into music

The paper includes a demo that turns image captions into audio. This isn't necessarily a feature so much as it is a display of how the software might be used. Instead of trying to look at artwork and interpret it, MusicLM looks at human descriptions of the art and generates music from those ideas.

Tuning MusicLM Output with AudioCipher

One of MusicLM's major shortcomings is the absence of music theory data. Their MusicCaps dataset does not include tempo or key signature information. As a result, users will not be able to gain full control over the output.

Fortunately, AudioCipher provides a text-to-music MIDI plugin that includes key signature parameters. This means you will be able to generate chords and melodies based on words in any key. From there, fine-tune the rhythm in your piano roll and set the BPM. Lastly, save the audio file and pass it into the MusicLM melody prompt tool with a description of the style you want.
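To show what key and tempo control mean in practice, here is a small Python sketch of the underlying arithmetic. It is not AudioCipher's algorithm; the function names are hypothetical, and it only assumes standard MIDI note numbering (middle C = 60) and the usual major scale pattern.

```python
from typing import List

# Semitone offsets of a major scale, measured from the root note.
MAJOR_SCALE_STEPS = [0, 2, 4, 5, 7, 9, 11]


def major_scale(root_midi: int) -> List[int]:
    """MIDI note numbers for one octave of the major scale on root_midi."""
    return [root_midi + step for step in MAJOR_SCALE_STEPS]


def seconds_per_beat(bpm: float) -> float:
    """Length of one beat in seconds at the given tempo."""
    return 60.0 / bpm


def bar_seconds(bpm: float, beats_per_bar: int = 4) -> float:
    """Length of one bar in seconds (4/4 by default)."""
    return beats_per_bar * seconds_per_beat(bpm)


c_major = major_scale(60)   # C major, starting on middle C
d_major = major_scale(62)   # same pattern, shifted up two semitones

print("C major:", c_major)
print("D major:", d_major)
print("One bar at 120 BPM lasts", bar_seconds(120), "seconds")
```

Fixing the key and BPM before exporting means the melody you hand to MusicLM already carries the tonal and rhythmic information that the text prompt alone cannot specify.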

Pick up a copy of AudioCipher to become familiar with the interface and start experimenting. Once Google publishes the MusicLM API, it's only a matter of time before open source developers create an interface like MuseTree. Then we'll be able to put this software to the test and enter a new phase of creative freedom.

