Google's New Music AI Sandbox Expands on MusicLM & MusicFX

Google announced its new Music AI Sandbox tool on May 14, 2024, during its annual I/O developer conference. The promotion included endorsements from music industry heavyweights like Wyclef Jean, Marc Rebillet, and Donald "Childish Gambino" Glover.

The artists focused on the value of writing faster and with less hassle. Gambino was quoted as saying, "You can make a mistake faster. That's all you really want at the end of the day, at least in art, it's just to make mistakes fast."

DeepMind is the Google team behind this new tech. It's the same AI music department that several researchers recently left to build the tech behind Udio, a music-centric company. This week, Music Business Worldwide reported that Udio is now churning out 10 songs per second (that's roughly 864,000 songs per day).

The popularity of Udio is a market indicator of how DeepMind's tech will be received and of the level of engagement we can expect to see in the very near future.

The Alphabet company officially entered the text-to-music arena in 2023 with their generative AI model called MusicLM.

Developers published their first GitHub paper in the last week of January 2023, receiving immediate attention from tech hubs like Hacker News. On May 10, 2023, TechCrunch broke the story that Google had made MusicLM available for public use.

Follow-up efforts surfaced when Google's DeepMind team announced Lyria and SynthID in December 2023.

The bridge between MusicLM and SynthID has since been completed and is now available through Google's AI Test Kitchen. It's available in a container called MusicFX, which is more or less identical to the original MusicLM interface.

From MusicLM to MusicFX to Music AI Sandbox

The new MusicFX interface could be described as a slightly more colorful version of MusicLM, with a tagging service that detects your most important prompt phrases. Users can click on settings to select a longer track length than before, including 50- and 70-second options.

There's also a new "looping" option that blends the beginning and end of your track to create an infinite song. One of my benchmarks for AI models is to see whether they understand odd time signatures. Indeed, MusicLM created a song section that alternated between two measures of 4/4 and a measure of 7/4.
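To make the looping idea concrete, here is a minimal sketch of one common way to blend the end of a clip into its beginning: a linear crossfade. This is an assumption for illustration only, since Google has not described how MusicFX builds its loops, and the function name and parameters are hypothetical.

```python
import math


def make_seamless_loop(samples, fade_len):
    """Blend the tail of a mono clip into its head so it can repeat smoothly.

    Generic linear crossfade; NOT a description of MusicFX's actual method.
    """
    if not 0 < fade_len <= len(samples) // 2:
        raise ValueError("fade_len must be positive and at most half the clip")

    head = samples[:fade_len]
    tail = samples[-fade_len:]

    blended = []
    for i in range(fade_len):
        weight = i / fade_len  # 0.0 at the loop point, rising toward 1.0
        # The tail fades out while the head fades in.
        blended.append(head[i] * weight + tail[i] * (1.0 - weight))

    # The blended segment replaces the head; the original tail is dropped.
    return blended + samples[fade_len:-fade_len]


# Usage: one second of a 220 Hz sine at 24 kHz, with a 100 ms crossfade.
sample_rate = 24_000
clip = [math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]
loop = make_seamless_loop(clip, fade_len=2_400)
```

The looped clip is shorter than the original by the fade length, because the tail is folded into the head rather than played on its own.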

MusicFX is a bit lightweight compared to most of the other popular text-to-music services available today. There's also been some fresh controversy surrounding the musical datasets they trained their model weights on. Outspoken AI music ethicist Ed Newton-Rex (formerly of Stability AI, the company behind Stable Audio) published this critical op-ed detailing Google's degrading values since their early efforts with Magenta.

Best alternatives to MusicLM for musicians

If the questionable ethics behind Google's model training rub you the wrong way, there are some great alternatives to explore. Musicians who work in a DAW can experiment with AudioCipher, the text-to-MIDI plugin described below.

AudioCipher is a text-to-MIDI plugin that loads within your DAW and gives you tight control over parameters like key signature, chord extensions, and rhythm automation. The app uses a musical cryptogram, which means that letters are swapped for notes, generating melodic MIDI sequences and chord progressions that still require a musician to shape them into something meaningful.
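A musical cryptogram can be sketched in a few lines. The mapping below is hypothetical: it cycles the alphabet across the notes of a chosen scale, which is one simple way to swap letters for notes, and it should not be read as AudioCipher's actual algorithm.

```python
import string

# Illustrative scale; any list of note names would work here.
C_MAJOR = ["C", "D", "E", "F", "G", "A", "B"]


def text_to_notes(text, scale=C_MAJOR):
    """Map each letter to a scale note by its position in the alphabet.

    Hypothetical mapping for illustration, not AudioCipher's real cipher.
    """
    notes = []
    for char in text.lower():
        if char in string.ascii_lowercase:
            position = string.ascii_lowercase.index(char)  # a=0, b=1, ...
            notes.append(scale[position % len(scale)])
        # Spaces, digits, and punctuation are skipped.
    return notes


melody = text_to_notes("bach")
print(melody)  # ['D', 'C', 'E', 'C']
```

Changing the scale argument keeps the same letter positions but lands them on different pitches, which loosely mirrors the idea of choosing a key signature before generating notes.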

Instead of creating an entire song with a text description, AudioCipher lets you set a focus point for your song. What is the concept you want to convey and what words would you use to describe it? Generating MIDI based on a word or phrase has been celebrated as a fun and simple way to overcome your creative blocks.

WavTool is a second option worth exploring. This text-to-MIDI DAW loads within your web browser and comes with many of the core features you would expect from a workstation. It includes a GPT-4 AI chatbot that understands text commands and can translate basic ideas into actions in the DAW. The video above demonstrates the power of the app along with some of its shortcomings.

Beyond MIDI generation, you might also enjoy trying out Splash Music and Stable Audio. Both of these platforms trained consensually on licensed music from partners who opted in. Stable Audio generates instrumental music only, while Splash creates both instrumentals and AI vocals in what is commonly called text-to-song.

In June 2023, Meta put out a competitive product called MusicGen. On paper, they did technically train the model consensually, in partnership with Pond5. However, we've spoken to a library holder who sells over 50,000 tracks through Pond5 and learned that they never had the chance to opt out and were poorly compensated for the deal.

We've since published several articles with tutorials on how to use text-to-music apps, including scenarios like AI film scoring, creating infinite songs, and turning melodies into full arrangements.

What makes Google's MusicLM unique?

This isn't the first time Google has taken a stab at creating music using artificial intelligence. We've previously covered their MIDI-generating software, Google Magenta Studio, along with other innovative tools like DDSP for tone transfer.

There are several other AI music apps out there, so why should we care about Google's contribution? Let's break it down one feature at a time.

The MusicCaps dataset: A new approach to descriptions

In the spirit of transparency, Google released their MusicCaps dataset through Kaggle. Each of the 5,521 music samples is labeled with English descriptions, including aspect lists and free text captions.

An aspect list is a comma-separated collection of short phrases describing the music, whereas the free text captions are natural-language descriptions written by expert musicians.

Example of an aspect list: "pop, tinny wide hi hats, mellow piano melody, high pitched female vocal melody, sustained pulsating synth lead"

Example of a free text caption: "A low sounding male voice is rapping over a fast paced drums playing a reggaeton beat along with a bass. Something like a guitar is playing the melody along. This recording is of poor audio-quality. In the background a laughter can be noticed. This song may be playing in a bar."
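The two label types map naturally onto simple data structures. The sketch below splits the example aspect list into phrases and pairs it with a caption; the class and field names are illustrative assumptions and are not taken from the published dataset schema.

```python
from dataclasses import dataclass


@dataclass
class CaptionedClip:
    """Hypothetical container for one labeled sample (names are illustrative)."""
    aspects: list
    caption: str


def parse_aspect_list(raw):
    """Split a comma-separated aspect list into trimmed, non-empty phrases."""
    return [phrase.strip() for phrase in raw.split(",") if phrase.strip()]


raw_aspects = (
    "pop, tinny wide hi hats, mellow piano melody, "
    "high pitched female vocal melody, sustained pulsating synth lead"
)
clip_labels = CaptionedClip(
    aspects=parse_aspect_list(raw_aspects),
    caption="A low sounding male voice is rapping over a fast paced drums "
            "playing a reggaeton beat along with a bass.",
)
print(len(clip_labels.aspects))  # 5
```

The aspect list breaks cleanly into discrete tags, while the caption stays a single free-form string, which is the distinction the two examples above illustrate.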

This training set differs from OpenAI's Jukebox data because it focuses on how the music sounds instead of metadata about the music, like artist name or genre.

The MusicCaps developers have published a separate paper on arXiv describing their goal of generating music from text. It was released in tandem with the publication of MusicLM.

Long Generation: Consistent musical output over time

The initial MusicLM paper claimed that their network remained consistent for several minutes, but when the app went live in May 2023, generations were capped at 30 seconds.

AI developers have had a particularly hard time generating good AI music because models struggle to stay coherent over long sequences. LSTM, or long short-term memory, is a type of recurrent neural network (RNN) designed to help the machine stay focused over a period of time.

To really get a feel for this problem, I suggest checking out OpenAI's Jukebox Sample Explorer. You'll find that the music tends to lose focus and devolve by the end of the clip, so that a rap song in the style of Machine Gun Kelly gradually morphs into a convoluted reggae death metal tune.

MusicLM claims to outperform these other AI music generators with a hierarchical sequence-to-sequence model that outputs audio at 24 kHz.
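For a rough sense of scale, the snippet below counts the raw audio samples in clips of the lengths mentioned in this article at a 24 kHz sample rate. It assumes mono audio, and it only illustrates how quickly raw audio grows with duration; it says nothing about the internal representation the model works with.

```python
SAMPLE_RATE_HZ = 24_000  # 24 kHz, as stated for MusicLM's output


def samples_for(seconds, sample_rate=SAMPLE_RATE_HZ):
    """Return the number of mono samples in a clip of the given length."""
    return seconds * sample_rate


for seconds in (30, 50, 70):
    print(f"{seconds} s -> {samples_for(seconds):,} samples")
# 30 s -> 720,000 samples
# 50 s -> 1,200,000 samples
# 70 s -> 1,680,000 samples
```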

Audio generation from rich captions

[Figure: Example of a MusicLM caption]

Thanks to the MusicCaps dataset, MusicLM is able to receive long form text input with rich descriptions of music. This means that users will not need any technical music theory knowledge in order to create songs. Filmmakers and video game developers will eventually be able to generate the sounds they need on demand, by simply describing the scenes in question.

Melody conditioning: Hum or whistle melodies into any style

[Figure: MusicLM's melody and text prompt grid]

We still haven't seen the hum-to-music feature promised by MusicLM in their original paper, but maybe that will change in the near future with the release of the Music AI Sandbox. Meta's MusicGen model released a feature like this nearly one year ago and it works pretty well, so there's no reason Google can't achieve something similar. It's just a matter of time.