
Google's AI Music Datasets: MusicCaps, AudioSet and MuLan

Interest in AI-generated music is reaching an all-time high, from celebrity voice generators to instant loop makers. In May 2023, Google released the first version of MusicLM, a text-to-music app that builds original songs from text prompts in a matter of seconds. With these innovations comes a justified concern that artist voices are just the beginning: instrumental deepfakes could be next.


Behind the scenes, Google has used three music datasets, called MusicCaps, AudioSet and MuLan, to train the models behind MusicLM. Far from being the dry, technical topic it might seem on the surface, each dataset provides critical insight into the way AI music apps work and how ethical transgressions could be baked into their design.


AI Music datasets may soon become a legal battleground for copyright holders, as machine learning developers appropriate existing, high-quality audio for their music models. We're still in the infancy of this technology, where devs can get away with what Silicon Valley calls "asking forgiveness instead of permission".


In this article, I'll outline some core principles of machine learning and the music datasets that Google trained on in order to roll out MusicLM.

What are music datasets?


Music datasets are made up of a collection of files, typically in MIDI or audio format, that can be used to train machine learning models.


MIDI files include granular and explicit data about the notes, rhythms, BPM, and dynamics of a piece of music. Audio files carry the same musical information, but it has to be extracted, either by a musician with a trained ear or by audio analysis software.


Audio analysis is typically accomplished with the Fast Fourier Transform (FFT) and its windowed variant, the Short-Time Fourier Transform (STFT). These techniques convert an audio file's waveform into spectral components, pulling out frequency information that can in turn be translated into more familiar musical concepts.
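As a minimal sketch of the idea (a synthetic 440 Hz tone stands in for a real recording), the snippet below computes a one-shot FFT and then an STFT spectrogram with NumPy and SciPy:

```python
import numpy as np
from scipy.signal import stft

# Synthesize one second of a 440 Hz tone as a stand-in for real audio
sr = 22050                       # sample rate in Hz
t = np.arange(sr) / sr
audio = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# Plain FFT: one frequency snapshot of the whole clip
spectrum = np.abs(np.fft.rfft(audio))
freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
print(f"Strongest frequency: {freqs[spectrum.argmax()]:.1f} Hz")  # ~440 Hz

# STFT: frequency content over time (a spectrogram), which is what most
# music analysis pipelines actually work from
f, times, Z = stft(audio, fs=sr, nperseg=2048)
print(Z.shape)  # (frequency bins, time frames)
```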


TensorFlow, a popular open-source machine learning platform, powers Google's Magenta research project, whose NSynth dataset and neural synthesizer annotate individual notes with important musical features like pitch, volume (velocity), instrument type, and even note qualities such as reverb and decay.
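For the curious, NSynth is also distributed through TensorFlow Datasets. The sketch below assumes the "nsynth/gansynth_subset" config and common feature names, which may differ between library versions (and the download is large):

```python
import tensorflow_datasets as tfds

# Assumed config and feature names; check the TFDS catalog for your version.
ds = tfds.load("nsynth/gansynth_subset", split="train", try_gcs=True)

for note in ds.take(1):
    # Each example is a single annotated note: pitch, velocity, instrument
    # family, and boolean "qualities" such as reverb or a fast decay.
    print(sorted(note.keys()))
    print("pitch:", int(note["pitch"]), "velocity:", int(note["velocity"]))
```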


Librosa is an even more specialized audio and music information retrieval library. The Python package can detect pitch, beats, tempo, onsets, and more, and it offers utilities for converting detected frequencies and note names into MIDI note numbers.
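Here is a hedged sketch of that workflow; the file path is a placeholder, and the pitch range is an assumption you would tune to your material:

```python
import librosa

# Path is a placeholder; librosa resamples to 22,050 Hz by default
y, sr = librosa.load("my_clip.wav")

# Tempo and beat positions
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print("Estimated tempo (BPM):", tempo)
print("First few beats (seconds):", beat_times[:4])

# Fundamental frequency (pitch) over time, then map the detected
# frequencies to MIDI note numbers with librosa's conversion utilities
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
print("First detected MIDI notes:", librosa.hz_to_midi(f0[voiced_flag])[:8].round())
```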


MIDI and audio datasets are often enhanced with layers of metadata, written in some combination of natural language, numerical values, and machine-readable code. A neural network will train on the labeled audio and MIDI files, which is how text-to-audio tools like MusicLM manage to convert written prompts into music.
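As a purely hypothetical illustration (the field names below are invented for clarity, not taken from any real dataset), a single labeled training example might look something like this:

```python
# Hypothetical labeled training record combining audio, natural language,
# numerical metadata, and machine-readable tags
labeled_clip = {
    "audio_path": "clips/funk_groove_001.wav",    # raw audio (or a MIDI file)
    "caption": "upbeat funk groove with slap bass and tight drums",
    "tempo_bpm": 104,                             # numerical metadata
    "labels": ["funk", "slap bass", "drum kit"],  # machine-readable tags
}
print(labeled_clip["caption"])
```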


What is Google MusicLM?



MusicLM is a web application hosted on Google's AI Test Kitchen that lets users submit text prompts and turn them into short music clips. No technical understanding of music theory is required to use the app, but it helps to have some very basic knowledge about instrument names and other simple descriptive terms.


When you enter a descriptive prompt, Google analyzes the meaning of your text and cross-references it with a model trained on the MusicCaps, AudioSet and MuLan music datasets. Sound generation takes only a few seconds, and each prompt returns two different results to review.


What is the MusicCaps music dataset?


The MusicCaps dataset consists of 5,521 music clips sourced from YouTube. Each clip is 10 seconds long and has been labeled with English-language text written by musicians. These captions help MusicLM match users' text input to existing clips in their database, in order to generate something new that resembles the prompt.


Text captions in this dataset are written in natural language, like "this song contains digital drums playing a simple groove along with two guitars". A second column labeled aspects pulls out the most important keywords and places the list of strings in an array, for example ['digital drums', 'simple groove', 'two guitars'].


These aspect lists reduce the semantic noise and make captions more machine-readable. You can find a screenshot sample of the dataset below:


MusicCaps dataset sample

The first three columns in MusicCaps represent the YouTube id and start/end times of the clip. To view one of these videos, paste the hyperlink "https://www.youtube.com/watch?v=" into your browser and add the YTID value after the equal sign.
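Here is a hedged sketch of working with the dataset in pandas. It assumes a local copy of the MusicCaps CSV and the commonly published column names (ytid, start_s, aspect_list, caption), which may vary between releases:

```python
import ast
import pandas as pd

# Assumed filename and column names for the public MusicCaps CSV
df = pd.read_csv("musiccaps-public.csv")

row = df.iloc[0]
print(row["caption"])

# The aspect list is stored as a string, e.g. "['digital drums', 'two guitars']"
aspects = ast.literal_eval(row["aspect_list"])
print(aspects)

# Rebuild the YouTube link from the YTID and the clip's start time
url = f"https://www.youtube.com/watch?v={row['ytid']}&t={int(row['start_s'])}s"
print(url)
```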


The fourth column, labeled AudioSet, refers to a second Google dataset where the same clips can also be found. AudioSet is a much more elaborate database, with 2,084,320 10-second YouTube sound clips. Just like MusicCaps, the AudioSet database includes human labels.


What is Google's AudioSet?


With more than two million reference files, AudioSet represents a much bigger collection of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.


AudioSet ontology

As the list above demonstrates, Google's AudioSet encompasses every imaginable type of sound. The ontology focuses on concrete descriptors like music mood, genre, and instrument, as well as more abstract ideas like music role and music concepts. As with MusicCaps, metadata like artist, song title, and album is suspiciously absent.
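If you want to explore the ontology yourself, it is published as a JSON file in the audioset/ontology GitHub repository; the sketch below assumes that URL and file layout, which could change:

```python
import json
import urllib.request

# Assumed location of the published ontology file
URL = "https://raw.githubusercontent.com/audioset/ontology/master/ontology.json"
with urllib.request.urlopen(URL) as resp:
    ontology = json.load(resp)

by_id = {node["id"]: node for node in ontology}

# "/m/04rlf" is the Knowledge Graph MID for the top-level "Music" category
music = by_id["/m/04rlf"]
print(music["name"], "-", len(music["child_ids"]), "direct children")
for child_id in music["child_ids"][:5]:
    print("  ", by_id[child_id]["name"])
```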


Roughly fifty percent of AudioSet's 2.1 million clips are music clips from YouTube. Can you think of a reason why Google would make it difficult to trace this music data back to the artists who created those clips?


It's tempting to interpret the lack of transparency as an effort to obstruct copyright claims from artists and record labels. However, Google owns YouTube and may already have some legalese in their terms and conditions that grants them permission to sample and train on the data hosted on their platform.

Example of AudioSet YouTube clip labeling

Many of these clips come from DIY creators. When you click a video in AudioSet, a popover showcases the embedded clip with a list of labels attached to it. You have to click through a second time to view the channel name on YouTube itself. But these roughly one million music videos are just scratching the surface.


What is the MuLan music dataset?


The MuLan music dataset is a private collection of 44,000,000 music recordings that amount to 370,000 hours of recorded music. Unlike MusicCaps and AudioSet, MuLan made it possible to train on music audio without hand-written text descriptions, by linking audio and free-form text in a shared embedding space.


According to the technical paper, MuLan started with a collection of 50 million YouTube music videos. The team extracted a 30-second audio clip from the 30-second mark of each video, using a music audio detector to discard any clips that were less than 50% music. After filtering that data, they were left with 44 million 30-second clips, which amounts to roughly 370,000 hours of audio.
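To illustrate that filtering step, here is a purely hypothetical sketch; the music detector and raw-audio array are stand-ins, not Google's actual tooling:

```python
def extract_training_clip(video_audio, sr, music_detector):
    """Take a 30-second clip starting at the 30-second mark and keep it only
    if the detector judges at least half of it to be music."""
    start = length = 30 * sr
    clip = video_audio[start:start + length]
    if len(clip) < length:
        return None                               # video shorter than 60 seconds
    music_fraction = music_detector(clip, sr)     # assumed to return a value in [0, 1]
    return clip if music_fraction >= 0.5 else None

# Toy usage with a dummy detector that always reports 80% music
demo = extract_training_clip([0.0] * (16000 * 90), 16000, lambda clip, sr: 0.8)
print(len(demo) / 16000, "seconds kept")
```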


This means MuLan has substantially longer musical clips to train on than MusicCaps or AudioSet, and far more of them. If you're wondering how MusicLM is able to translate these labeled audio clips into songs, check out this paper breakdown video from AI audio consultant Valerio Velardo:



As Velardo points out in the video above, MusicLM leverages SoundStream for audio encoding and decoding, along with a tool called w2v-BERT as an intermediary layer.


SoundStream is a neural audio codec that converts audio input into a coded signal, compresses it, and transforms it back into audio using a decoder. A crude comparison could be made to the reduction of a large WAV file to a smaller MP3 file. The neural audio synthesis reconstructs audio to sound as close as possible to the original uncompressed file. Smaller files reduce the burden on the neural network.


Visualization of how SoundStream works
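As a crude toy analogy (a fixed 8-bit quantizer, not a learned neural codec), the snippet below shows the same encode, shrink, decode loop that SoundStream performs with neural networks and a learned codebook:

```python
import numpy as np

# Toy analogy only: quantize a waveform into a small set of discrete codes
# and reconstruct it, to illustrate the encode -> compress -> decode loop
sr = 16000
t = np.arange(sr) / sr
audio = 0.6 * np.sin(2 * np.pi * 220.0 * t)

levels = 256                                                          # pretend codebook size
codes = np.round((audio + 1) / 2 * (levels - 1)).astype(np.uint8)     # "encode"
restored = codes.astype(np.float64) / (levels - 1) * 2 - 1            # "decode"

error = np.max(np.abs(audio - restored))
print(f"{audio.nbytes} bytes -> {codes.nbytes} bytes, max error {error:.4f}")
```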

To summarize, SoundStream is responsible for the acoustic layer of MusicLM. The intermediary step, w2v-BERT, handles the semantic layer (the longer-term musical structure), while MuLan's joint embeddings link the audio to the text labels in its music dataset. You can find a visualization of the full training architecture below:


Velardo's outline of MusicLM's training architecture
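To summarize the dataflow in code form, here is a heavily simplified, purely illustrative sketch; every function below is a dummy stand-in for the real models, not Google's published implementation:

```python
# Conceptual pipeline only: text -> MuLan embedding -> semantic tokens ->
# acoustic tokens -> waveform. All outputs here are dummy placeholders.
def mulan_embed(prompt: str) -> list:
    return [0.0] * 128                 # shared text/audio embedding

def semantic_stage(embedding: list) -> list:
    return [0] * 150                   # w2v-BERT-style tokens: long-term structure

def acoustic_stage(semantic_tokens: list, embedding: list) -> list:
    return [0] * 600                   # fine acoustic detail

def soundstream_decode(acoustic_tokens: list) -> list:
    return [0.0] * 24000               # tokens back to waveform samples

def generate_music(prompt: str) -> list:
    e = mulan_embed(prompt)
    return soundstream_decode(acoustic_stage(semantic_stage(e), e))

print(len(generate_music("calm piano with soft strings")), "samples")
```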

When compared to two other major text-to-music AI tools, Riffusion and Mubert, MusicLM showed much better audio fidelity and adherence to the text prompt. Velardo shared the results of these studies in the image below:


Comparison of Riffusion, Mubert, and MusicLM

Despite MusicLM's substantially better results, the app's music output still leaves a lot to be desired. Public opinion has been mixed, with a sizable cohort expressing excitement for the progress despite its numerous limitations. In the future, users would like to see more creative control and better audio fidelity from text-to-music generators.


text-to-midi generator

AudioCipher's text-to-MIDI plugin is a grassroots effort to anchor these recent innovations within the experience of a DAW, improving accessibility and usability for music producers. Instead of artificial intelligence, the VST uses a home-brewed algorithm that gets users comfortable with text-to-music workflows while shaping the output into something uniquely their own.


The MIDI generator comes with perks like a one-time purchase (instead of a monthly subscription or pay-to-play model). You can use the app with or without an internet connection, and because the output is MIDI, you maintain maximum control over the sound design and composition.


Our dream is to eventually bridge the gap between our current MIDI software and text-to-music generators like MusicLM. With each iteration, existing customers always receive free software upgrades, so early adopters don't have to worry about recurring costs later down the road. We appreciate everyone who has joined us so far and look forward to continuing to share our journey through this wild new world of music making.
