Google officially entered the text-to-music arena in 2023 with its generative AI model, MusicLM. Follow-up efforts surfaced when Google DeepMind announced Lyria and SynthID in December 2023.
The bridge between MusicLM and SynthID has since been completed and is now available through Google's AI Test Kitchen. It's available in a container called MusicFX, which is more or less identical to the original MusicLM interface.
Developers published their first GitHub paper in the last week of January 2023, receiving immediate attention from tech hubs like Hacker News. On May 10th, 2023, TechCrunch broke the story that Google had made MusicLM available for public use.
We hurried over to the App Store to give it a spin, only to discover that there was a waitlist. Fortunately, we were admitted within 24 hours. These days, signing up for the MusicFX service is much simpler. Just authenticate with your Google account and away you go!
What does MusicLM look like in 2024?
The new MusicFX interface could be described as a slightly more colorful version of MusicLM, with a tagging service that detects your most important prompt phrases. Users can click on settings to select a longer track length than before, including 50 and 70 second options as shown below.
There's also a new "looping" option that blends the beginning and end of your track to create an infinite song. One of my benchmarks for AI models is to see whether they understand the notion of odd time signatures. Indeed, MusicLM created a song section that alternated between two measures of 4/4 and a measure of 7/4.
MusicFX is a bit lightweight compared to most of the other popular text-to-music services available today. There's also been some fresh controversy surrounding the musical datasets the model weights were trained on. Outspoken AI music ethicist Ed Newton-Rex (formerly of Stable Audio) published this critical op-ed detailing Google's degrading values since their early efforts with Magenta.
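How MusicFX implements its looping option isn't documented here, but one common way to blend the end of a clip into its beginning is a linear crossfade. The sketch below is only an illustration under that assumption: make_loop and fade_len are hypothetical names, and the audio is represented as a plain list of sample values.

```python
def make_loop(samples, fade_len):
    """Blend the tail of a clip into its head so playback can wrap around.

    Assumption: a simple linear crossfade, not MusicFX's actual method.
    """
    if not 0 < fade_len <= len(samples) // 2:
        raise ValueError("fade_len must be between 1 and half the clip length")

    head = samples[:fade_len]            # opening samples of the clip
    tail = samples[-fade_len:]           # closing samples of the clip
    body = samples[fade_len:-fade_len]   # everything in between, left untouched

    blended = []
    for i in range(fade_len):
        weight = i / fade_len            # ramps from 0 toward 1 across the fade
        # Start out sounding like the tail, end up sounding like the head.
        blended.append(tail[i] * (1 - weight) + head[i] * weight)

    # The loop now begins with the blended region, so the jump from the
    # end of the body back to the start lands on the original tail material.
    return blended + body


if __name__ == "__main__":
    clip = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    print(make_loop(clip, fade_len=2))
```

The looped clip is shorter than the original by fade_len samples, because the tail is folded into the head rather than played separately.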
Best alternatives to MusicLM for musicians
If the questionable ethics behind Google's model training rubs you the wrong way, there are some great alternatives to explore. Musicians who work in a DAW can experiment with AudioCipher, the text-to-MIDI plugin shown below:
AudioCipher is a text-to-MIDI plugin that loads within your DAW and gives you tight control over parameters like key signature, chord extensions, and rhythm automation. The app uses a musical cryptogram, which means that letters are swapped for notes, generating melodic MIDI sequences and chord progressions that still require a musician to shape them into something meaningful.
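To make the idea of a musical cryptogram concrete, here is a minimal sketch of swapping letters for notes. It is not AudioCipher's actual algorithm: the mapping (letters A to Z wrapped around the seven degrees of a scale) and the names text_to_notes, C_MAJOR, and A_MINOR are assumptions chosen for illustration.

```python
import string

# Hypothetical scales; a real plugin would offer many keys and modes.
C_MAJOR = ["C", "D", "E", "F", "G", "A", "B"]
A_MINOR = ["A", "B", "C", "D", "E", "F", "G"]


def text_to_notes(text, scale=C_MAJOR):
    """Map each letter to a scale degree, skipping spaces and punctuation.

    Assumption: the alphabet simply wraps around the scale (A -> degree 1,
    B -> degree 2, ... H -> degree 1 again, and so on).
    """
    notes = []
    for ch in text.upper():
        if ch in string.ascii_uppercase:
            index = string.ascii_uppercase.index(ch)
            notes.append(scale[index % len(scale)])
    return notes


if __name__ == "__main__":
    print(text_to_notes("Bach"))            # same word, default key
    print(text_to_notes("Bach", A_MINOR))   # same word, different key
```

Changing the scale changes the resulting notes while the word stays the same, which is the kind of key signature control described above. The output is only a note sequence; rhythm and phrasing are still up to the musician.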
Instead of creating an entire song with a text description, AudioCipher lets you set a focus point for your song. What is the concept you want to convey and what words would you use to describe it? Generating MIDI based on a word or phrase has been celebrated as a fun and simple way to overcome your creative blocks.
WavTool is a second option worth exploring. This text-to-MIDI DAW loads within your web browser and comes with many of the core features you would expect from a workstation. It includes a GPT-4 AI chat bot that understands text commands and can translate basic ideas into actions in the DAW. The video above demonstrates the power of the app along with some of its shortcomings.
Beyond MIDI generation, you might also enjoy trying out Splash Music and Stable Audio. Both of these platforms trained consensually on licensed music from partners who opted in. Stable Audio generates instrumental music only, while Splash creates both instrumentals and AI vocals in what is commonly called text-to-song.
In June 2023, Meta put out a competitive product called MusicGen. On paper, they did technically train the model consensually, in partnership with Pond5. However, we've spoken to a library holder who sells over 50,000 tracks through Pond5 and learned that they never had the chance to opt out and were poorly compensated for the deal.
What makes Google's MusicLM unique?
This isn't the first time Google has taken a stab at creating music using artificial intelligence. We've previously covered their MIDI generating software, Google Magenta Studio, along with other innovative tools like DDSP for tone transfer.
There are several other AI music apps out there, so why should we care about Google's contribution? Let's break it down one feature at a time.
The MusicCaps dataset: A new approach to descriptions
In the spirit of transparency, Google released their MusicCaps dataset through Kaggle. Each of the 5,521 music samples is labeled with English descriptions, including aspect lists and free text captions.
An aspect list is a comma separated collection of short phrases describing the music, whereas the free text captions are written descriptions in natural language by expert musicians.
Example of an aspect list: "pop, tinny wide hi hats, mellow piano melody, high pitched female vocal melody, sustained pulsating synth lead"
Example of a free text caption: "A low sounding male voice is rapping over a fast paced drums playing a reggaeton beat along with a bass. Something like a guitar is playing the melody along. This recording is of poor audio-quality. In the background a laughter can be noticed. This song may be playing in a bar."
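The two label types can be pictured as fields on a single record. The sketch below is an illustration only: the class name MusicCaption and its field names are hypothetical and not taken from the published dataset files. It splits the aspect list example above into its individual phrases.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MusicCaption:
    """One labeled clip: short aspect phrases plus a free text caption.

    Field names are hypothetical, chosen for this example.
    """
    aspects: List[str]
    caption: str


def parse_aspect_list(raw):
    """Split a comma separated aspect list into trimmed, non-empty phrases."""
    return [phrase.strip() for phrase in raw.split(",") if phrase.strip()]


if __name__ == "__main__":
    raw_aspects = (
        "pop, tinny wide hi hats, mellow piano melody, "
        "high pitched female vocal melody, sustained pulsating synth lead"
    )
    entry = MusicCaption(
        aspects=parse_aspect_list(raw_aspects),
        caption="A low sounding male voice is rapping over a fast paced drums "
                "playing a reggaeton beat along with a bass.",
    )
    for phrase in entry.aspects:
        print("-", phrase)
```

Keeping the aspects as a list makes them easy to search or count, while the caption stays as one natural language string.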
This training set differs from OpenAI's Jukebox data because it focuses on how the music sounds instead of metadata about the music, like artist name or genre.
The MusicCaps developers have published a separate paper on arXiv describing their goal of generating music from text. It was released in tandem with the publication of MusicLM.
Long Generation: Consistent musical output over time
The newly published MusicLM paper claims that the network remains consistent for several minutes. AI developers have had a particularly hard time generating good AI music because models struggle to hold on to musical structure over long stretches of time. Long short-term memory (LSTM) is a type of recurrent neural network (RNN) designed to help the machine stay focused over a period of time.
To really get a feel for this problem, I suggest checking out OpenAI's Jukebox Sample Explorer. You'll find that the music tends to lose focus and devolve by the end of the clip, so that a rap song in the style of Machine Gun Kelly gradually morphs into a convoluted reggae death metal tune. MusicLM claims to outperform these other AI music generators with a hierarchical sequence-to-sequence model that outputs 24 kHz audio quality.
Audio generation from rich captions
Thanks to the MusicCaps dataset, MusicLM is able to receive long form text input with rich descriptions of music. This means that users will not need any technical music theory knowledge in order to create songs. Filmmakers and video game developers will eventually be able to generate the sounds they need on demand, by simply describing the scenes in question.
Story Mode: Fluid progression through a series of prompts
MusicLM includes a story mode that lets users describe time stamps where the music should evolve. Prompts could include abstract feelings and words like "fireworks" as well as genres like "rock song" and "string quartet". Behind the scenes, the model works to create a smooth musical transition from one semantic framework to the next.
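A story mode input can be thought of as a schedule of prompts with start times. Since no public MusicLM API is described here, the sketch below only models that schedule; the function active_prompt, the tuple layout, and the specific times are assumptions for illustration, and no audio is generated.

```python
from bisect import bisect_right

# Hypothetical schedule: (start time in seconds, prompt), sorted by start time.
STORY = [
    (0.0, "string quartet"),
    (15.0, "rock song"),
    (30.0, "fireworks"),
]


def active_prompt(story, t):
    """Return the prompt in effect at time t (seconds).

    Assumes the story is sorted by start time.
    """
    starts = [start for start, _ in story]
    index = bisect_right(starts, t) - 1   # last entry that starts at or before t
    if index < 0:
        raise ValueError("t is earlier than the first prompt")
    return story[index][1]


if __name__ == "__main__":
    for t in (0, 14.9, 15, 42):
        print(t, "->", active_prompt(STORY, t))
```

The smooth transitions between prompts are the model's job; a schedule like this only says which description applies at each moment.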
Melody conditioning: Hum or whistle melodies into any style
Melody conditioning with text prompts is where things start to get pretty crazy.
MusicLM lets you input any kind of audio sample like humming, whistling, or even guitar melodies. You can then type in a short text prompt describing the style of audio that you want to hear and it does a phenomenal job replicating the melody provided in that style.
We'll return to this later, to explain how AudioCipher could be used to overcome the lack of key signature and tempo controls.
Painting Caption Conditioning
The paper includes a demo that turns image captions into audio. This isn't necessarily a feature so much as it is a display of how the software might be used. Instead of trying to look at artwork and interpret it, MusicLM looks at human descriptions of the art and generates music from those ideas.
Tuning MusicLM Output with AudioCipher
One of MusicLM's major shortcomings is the absence of music theory data. The MusicCaps dataset does not include tempo or key signature information. As a result, users will not be able to gain full control over the output.
Fortunately, AudioCipher provides a text-to-music MIDI plugin that includes key signature parameters. This means you will be able to generate chords and melodies based on words in any key. Fine-tune the rhythm in your piano roll and set the BPM. Lastly, save the audio file and pass it into the MusicLM melody prompt tool with a description of the style you want.
Pick up a copy of AudioCipher to become familiar with the interface and start experimenting. Once Google publishes the MusicLM API, it's only a matter of time before open source developers create an interface like MuseTree. Then we'll be able to put this software to the test and enter a new phase of creative freedom.