OpenAI Turns Words Into Images with Dalle-2. Is Music Next?

OpenAI was one of the first companies to roll out a high-performing text-to-image generator with Dalle-2. Since then, competitors like Midjourney and Stable Diffusion have emerged to capture their own share of the market. So where are the equivalent solutions for musicians?

text-to-music generator

Over the past couple of years, text-to-music generators have been exploding in popularity. In 2020, we rolled out the text-to-MIDI plugin, AudioCipher, for musicians to use in a DAW. At the end of December 2022, Riffusion released a novelty web app that was trained on images of sound (spectrograms) and used text prompts to generate new sound-images that could then be sonified and turned into music.

Google's MusicLM app debuted in May 2023, providing a text-to-song experience that anyone can use regardless of their musical background. Each tool comes with pros and cons, depending on the goals you have as an artist.

To understand how Dalle-2 set the stage for this creative revolution, we'll need to go back to 2018 and take a look at an early prototype.

  1. Early origins of Dalle-2

  2. How OpenAI's Dalle-2 generates images

  3. OpenAI's Jukebox and Musenet: Dalle-2 for musicians

  4. How OpenAI generates audio with Jukebox

  5. Turning words and phrases into full songs

Artbreeder: Early origins of Dalle-2

Back in 2018, I was turned on to an image-generating web app called Ganbreeder, later rebranded Artbreeder. The site allows you to pick from what they call "image genetics" and combine them to generate unusual images. Under the hood, Artbreeder is a collaborative, machine learning-based tool that generates and modifies images publicly, so users can see one another's process.

This community approach to image generation and sharing was later picked up by Midjourney, whose Discord users share their prompts in public channels and learn techniques from one another.

Artbreeder and Ganbreeder Image Genetics
Sample of Artbreeder image genetics at play

I played with Artbreeder a lot when it first came out, generating images and even producing some stop-motion animations. As a mediocre visual artist, I found that it gave me a new way to express myself through the medium, which was satisfying. Over time I became a bit frustrated by the disconnect between the words and the image output. It was too abstract and vague.

When OpenAI announced Dalle-2 and experimental artists like Danielle Baskin began sharing images of what it could do, I was amazed by the quality and accuracy. It was an incredible leap forward. Instead of combing through pre-defined words on Artbreeder, Dalle-2 gave users the liberty to type in any phrase and turn it into an image almost instantly.

How OpenAI's Dalle-2 Generates Images

Dalle-2 Astronaut Vaporwave

The images above and below were all generated by Dalle-2, based on the text prompt "An astronaut lounging in a tropical resort in space in a particular style". The AI can express your ideas visually in an infinite number of variations, from all angles and in any aesthetic. If you had an actual user account with Dalle-2, it would be possible to type in your own ideas instead of selecting predefined ones.

Dalle-2 Astronaut Photorealistic

Dalle-2 uses neural networks trained on a large collection of images paired with text descriptions. During training, the model learns the relationship between pictures and the text used to describe them. To generate an image, it relies on a process known as diffusion: it starts with a pattern of random dots (noise) and gradually alters that pattern, step by step, toward an image that matches the text prompt.
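For readers who like to see ideas in code, the gradual noise-to-image process behind diffusion can be sketched in a few lines of Python. This is only a toy under loud assumptions: the "image" is a short list of numbers, and `toy_denoiser` is a hypothetical stand-in that simply returns the target, whereas a real system like Dalle-2 uses a trained neural network to guess the clean image from the noisy one and the text prompt. The function names are ours, not OpenAI's.

```python
import random

def add_noise(pixels, noise_level, rng):
    """Forward step (used when training): blend clean pixels with Gaussian noise."""
    keep = (1.0 - noise_level) ** 0.5
    mix = noise_level ** 0.5
    return [keep * p + mix * rng.gauss(0.0, 1.0) for p in pixels]

def toy_denoiser(noisy, prompt_target):
    """Hypothetical stand-in for a trained network: it 'cheats' and returns the target."""
    return prompt_target

def denoise_step(noisy, predicted_clean, step_size):
    """Reverse step: nudge every pixel a little toward the model's guess."""
    return [n + step_size * (c - n) for n, c in zip(noisy, predicted_clean)]

def generate(prompt_target, steps=50, step_size=0.2, seed=0):
    """Start from pure noise and refine it step by step."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in prompt_target]
    for _ in range(steps):
        guess = toy_denoiser(image, prompt_target)
        image = denoise_step(image, guess, step_size)
    return image

# A four-"pixel" target standing in for whatever the prompt describes.
result = generate([0.0, 0.5, 1.0, 0.5])
```

Each pass removes a fraction of the remaining noise, which is why diffusion output looks like static slowly resolving into a picture.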

OpenAI's Jukebox and Musenet: Dalle-2 for Musicians

As a musician, I couldn't help but wonder how Dalle-2's text prompt system could impact the future of songwriting. As it turned out, OpenAI was already exploring these possibilities with their Musenet and Jukebox systems.

Musenet was OpenAI's first effort to improve on Google Magenta's AI MIDI generation tools. Launched in 2019, the web app generated full-length tunes in a variety of musical styles. The quality of the output was shaky at best, rendered in MIDI with low-quality virtual instruments. It showcased their compositional advances but lacked the high-fidelity appeal that something like Dalle-2 delivers.

Then in 2020, OpenAI released a second paper describing Jukebox, a system that generates raw audio instead of MIDI. Jukebox delivered a higher quality of music with a fairly straightforward set of controls: users select the genre and artist that they want Jukebox to mimic.

Music has been de-prioritized by OpenAI, in favor of their latest innovation, ChatGPT. We've since published articles on how musicians can use ChatGPT to create music and a second article about using AI agents like AutoGPT to improve on ChatGPT's MIDI composition workflows.

How OpenAI Generates Audio with Jukebox

This could get a little technical so I'll try to keep it short and sweet. To generate new audio from existing tracks, Jukebox starts by studying a source file. As their website shows in the graphic below, the original music goes through a chain of encoding, upsampling, and decoding before it's available for the listener to review.

OpenAI Jukebox
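To make the encode, upsample, and decode chain a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the five-value `CODEBOOK`, the averaging encoder, and the "repeat each code" upsampler are hypothetical stand-ins. Jukebox itself uses trained neural networks for each of these stages, so treat this as a cartoon of the data flow rather than the real method.

```python
# Hypothetical codebook: every compressed "token" points at one of these values.
CODEBOOK = [-1.0, -0.5, 0.0, 0.5, 1.0]

def encode(samples, hop):
    """Compress audio: average each block of `hop` samples, snap to the nearest code."""
    codes = []
    for start in range(0, len(samples), hop):
        block = samples[start:start + hop]
        mean = sum(block) / len(block)
        nearest = min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - mean))
        codes.append(nearest)
    return codes

def upsample(codes, factor):
    """Stand-in upsampler: repeat each coarse code `factor` times."""
    return [code for code in codes for _ in range(factor)]

def decode(codes):
    """Turn codes back into audio samples by looking them up in the codebook."""
    return [CODEBOOK[code] for code in codes]

audio = [0.9, 1.0, 0.1, -0.1, -0.6, -0.4, -1.0, -0.9]
coarse = encode(audio, hop=2)      # [4, 2, 1, 0]
fine = upsample(coarse, factor=2)  # [4, 4, 2, 2, 1, 1, 0, 0]
rebuilt = decode(fine)             # [1.0, 1.0, 0.0, 0.0, -0.5, -0.5, -1.0, -1.0]
```

The rebuilt audio is a rough copy of the original. In the real system, the detail lost to compression is filled back in by trained models, which is part of why generation is slow.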

It takes Jukebox roughly nine hours to render a single minute of audio, so it doesn't lend itself to rapid prototyping or experimentation. There are no public user interfaces available either. This makes the technology interesting but not very accessible to the everyday user.

As we outlined in an article on the best AI Music Apps, several companies are stepping forward to fill the market demand. Almost all of the major players in this niche are focused on generating loops for non-musicians. Some of those companies include Boomy, AIVA, Mubert, and Soundful.

Turning Words and Phrases Into Full Songs

The Jukebox app attempts to translate ideas like "A country western song in the style of Dolly Parton" into actual music. You can imagine how their system would comb through a database of metadata, find Dolly's music, train on it, and deliver something in a similar style. But what happens when you ask for Dolly Parton to write death metal or hyperpop?

Google's MusicLM app is the first free and widely available tool to make text-to-song generation accessible. Unlike the AI loop-generating services, MusicLM synthesizes audio from scratch.

A second company, WarpSound, published a demo in May 2023 highlighting an upcoming text-to-song API that will deliver continuous, adaptive music. Based on that demo, we think their software outperforms MusicLM, combining granular MIDI composition with exceptional sound design.

Dalle-2 tested the limits of our own imagination and showed us that if we could imagine it, so could a neural network. Unlike images, music is abstract rather than representational. This means that it can be difficult to know whether a non-musical prompt is accurately linked to the musical output.

For example, a prompt like "pigs flying through a city" can be illustrated a hundred ways and each variation would be recognizable as an image. It's much harder to trace a song back to a prompt like that, which means that it's also hard to validate which parts of our text-to-music prompts are actually influencing the audio output.

To genuinely capture the feeling of a phrase, the model would have to analyze the meaning of the prompt and then cross-reference it with labeled music in a massive dataset. This is what MusicLM does with its MusicCaps dataset of captioned music clips and MuLan, a model that links music and text descriptions.
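The cross-referencing idea can be illustrated with a small Python sketch. It is a deliberately crude approximation: the clip names and captions in `LABELED_CLIPS` are made up, and plain word counts stand in for the learned representations that a real text-to-music system would use. The point is only to show a prompt being scored against labeled music.

```python
import math
from collections import Counter

# Hypothetical labeled dataset: file names and captions are invented for this example.
LABELED_CLIPS = {
    "clip_a.wav": "airy synth pad with a soaring melody",
    "clip_b.wav": "heavy distorted guitar and fast drums",
    "clip_c.wav": "playful bouncing melody on a silly kazoo",
}

def embed(text):
    """Toy 'embedding': count how often each word appears."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_match(prompt):
    """Return the clip whose caption is most similar to the prompt."""
    prompt_vec = embed(prompt)
    return max(LABELED_CLIPS, key=lambda name: cosine(prompt_vec, embed(LABELED_CLIPS[name])))

match = best_match("soaring airy melody")  # "clip_a.wav"
```

Word counts only match literal vocabulary, so a prompt like "pigs flying" would score zero against every caption here. Learned models are needed to connect "flying" with "soaring".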

For example, flying could imply a soaring melody, an airy synth pad for ambient chords, and a percussive rhythm matching the flapping of pig wings. The idea of a pig flying is fantastical and silly, so the melody might reflect that. A pig has a round shape with a coiled tail that could be incorporated into the lead instrument's sound design or the percussion. You get the idea.

Music producers have taken to social media to debate AI music software. Some have expressed that they want the text-to-music experience but don't want AI to do all the creative work. For people who share that sentiment, AudioCipher is a great middle ground.

AudioCipher V3

AudioCipher is a text-to-MIDI generator that transforms words into melodies and chord progressions. You set the parameters and encode the word into a chunk of raw MIDI that you can continue to fine tune in a DAW. This approach lets musicians leverage the inspiration they draw from a word or name, without handing over their entire creative process to artificial intelligence.
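To show what "encoding a word into MIDI" can look like in principle, here is a short Python sketch. It is not AudioCipher's actual algorithm, which this article doesn't detail. The letter-to-scale mapping and the `word_to_midi` function are hypothetical, chosen only because they are easy to follow.

```python
# MIDI note numbers for one octave of C major (C4 to B4).
C_MAJOR = [60, 62, 64, 65, 67, 69, 71]

def word_to_midi(word, scale=C_MAJOR):
    """Map each letter to a scale note: a -> 1st note, b -> 2nd, wrapping around."""
    notes = []
    for ch in word.lower():
        if ch.isascii() and ch.isalpha():
            index = (ord(ch) - ord("a")) % len(scale)
            notes.append(scale[index])
    return notes

melody = word_to_midi("Dolly")  # [65, 60, 67, 67, 65]
```

Because the mapping is deterministic, the same word always yields the same notes. That leaves rhythm, octave, and arrangement as creative decisions for the musician to make in the DAW.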

Follow along with our blog and join the mailing list to stay up to date on the latest news in next-generation music tech!
