OpenAI Turns Words Into Images with Dalle-2, is Music Next?
OpenAI has tackled a creative problem with Dalle-2 that hardly seems possible. What if you could take your most random thoughts, dreams, or fantasies and turn them into an image instantly? Or better yet for musicians, what would it look like to rapidly prototype song ideas before spending hours trying to write, practice and record them?
While updates to Dalle-2 have stalled in the second half of 2022, OpenAI did just announce a new system called Point-E that generates 3D models. If you're wondering what that has to do with music, the answer lies in a recent innovation called Riffusion.
Riffusion was trained on spectrograms, images that plot the frequencies in a piece of audio over time, each paired with text describing its genre and musical style. Users type in the style of music they want to hear and the neural network generates a new spectrogram image. But the genius innovation is an additional step that turns that new, AI-generated spectrogram back into sound.
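To make the round trip from sound to picture and back more concrete, here is a minimal sketch in plain Python. It is not Riffusion's code. The frame size, the function names, and the test tone are all assumptions chosen for readability. It also cheats in one important way: it keeps the phase of each frequency, which an actual spectrogram image does not store. A system working from images alone has to estimate that missing phase before it can rebuild audio.

```python
import cmath
import math

FRAME = 64  # samples per analysis frame (assumed; tiny so the example stays readable)

def dft(frame):
    """Naive discrete Fourier transform of one frame of audio samples."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform: turn one frame of frequencies back into samples."""
    n = len(spectrum)
    return [(sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spectrum)) / n).real
            for t in range(n)]

def to_spectrogram(samples):
    """Slice the audio into non-overlapping frames and transform each one."""
    starts = range(0, len(samples) - FRAME + 1, FRAME)
    return [dft(samples[i:i + FRAME]) for i in starts]

def to_audio(spectrogram):
    """Rebuild the audio by inverting every frame and joining the results."""
    samples = []
    for spectrum in spectrogram:
        samples.extend(idft(spectrum))
    return samples

# A test tone that completes 8 cycles per frame.
tone = [math.sin(2 * math.pi * 8 * t / FRAME) for t in range(FRAME * 4)]

spec = to_spectrogram(tone)

# The "picture": one column of brightness values (magnitudes) per frame.
magnitudes = [[abs(c) for c in frame[:FRAME // 2]] for frame in spec]
loudest_bin = max(range(FRAME // 2), key=lambda k: magnitudes[0][k])

rebuilt = to_audio(spec)
max_error = max(abs(a - b) for a, b in zip(tone, rebuilt))
```

The brightest row of the picture lands on the frequency of the test tone, and the rebuilt samples match the original to within floating-point rounding. Drop the phase and that exact match disappears, which is why the image-to-sound step is the hard part.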
If Point-E were trained on 3D spectrograms, it might capture higher-fidelity models of sound and turn them into higher-quality audio output. This is all very exciting, but let's take a moment to back up and get some context for why OpenAI is well positioned to take on such a monumental effort.
A few years ago I was turned on to an image-generating web app called Ganbreeder, later rebranded Artbreeder. The site allows you to pick from what they call "image genetics" and combine them to generate unusual images.
I played with this site a lot when it first came out, generating images and even producing some stop-motion animations. As a very mediocre visual artist, I found it gave me a new way to express myself through the medium, which was satisfying. I created some social media content and album art with it. But over time I became a bit frustrated by the disconnect between the words and the image output. It was too abstract and vague. Eventually I stopped using Artbreeder, circling back every six months to check for new features.
Fast-forward to the present day. When OpenAI announced Dalle-2 and began sharing images of what it could do, I was amazed. Instead of combining pre-defined image genetics, users are able to type in any phrase and instruct the AI to turn it into a high-quality graphic.
Think that sounds a bit farfetched? Check out the Dalle-2 website and spend some time exploring. There's a million-person wait list at the moment, but you can poke through their pre-rendered examples to understand what the tech is capable of.
How OpenAI's Dalle-2 Generates Images
The images above and below were all generated by Dalle-2, based on the text prompt "An astronaut lounging in a tropical resort in space in a particular style". The AI can express your ideas visually in an infinite number of variations, from all angles and in any aesthetic. If you had an actual user account with Dalle-2, it would be possible to type in your own ideas instead of selecting predefined ones.
As a musician, I can't help but wonder how this translates to the future of songwriting. As some readers will already know, we're a music software company that turns words into melodies. But our algorithm runs on a transcoded substitution cipher. What OpenAI has achieved with Dalle-2 is way more advanced, but to be fair, they haven't applied the same engine to music yet.
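For readers who haven't met a substitution cipher in a musical setting, here is a minimal sketch of the general idea: every letter is swapped for a note from a chosen scale. This is an illustration only and not our actual algorithm. The function name `phrase_to_melody`, the major-scale table, and the rule that the alphabet wraps around the seven scale degrees are all assumptions made up for the example.

```python
import string

# Semitone offsets of the major scale above its root note.
MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def phrase_to_melody(phrase, root="C"):
    """Swap each letter for a note of the major scale built on `root`.

    Hypothetical mapping: a -> 1st degree, b -> 2nd, ... g -> 7th,
    then h wraps back to the 1st degree, and so on.
    Anything that is not a letter is skipped.
    """
    root_index = NOTE_NAMES.index(root)
    melody = []
    for ch in phrase.lower():
        if ch not in string.ascii_lowercase:
            continue
        degree = (ord(ch) - ord("a")) % len(MAJOR_STEPS)
        melody.append(NOTE_NAMES[(root_index + MAJOR_STEPS[degree]) % 12])
    return melody

print(phrase_to_melody("Dolly"))            # ['F', 'C', 'G', 'G', 'F']
print(phrase_to_melody("Dolly", root="G"))  # ['C', 'G', 'D', 'D', 'C']
```

The same word always produces the same melody in a given key, which is the point of a cipher. It is also the limitation: the mapping knows nothing about what the word means.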
There are two public music projects currently operated by OpenAI. An early effort from 2019 was called Musenet, capable of generating full-length tunes in a variety of musical styles. The quality of the output was shaky at best, rendered in MIDI with low-quality virtual instruments. It showcased their compositional advances but lacked the high-fidelity appeal that something like Dalle-2 delivers.
Then in 2020, OpenAI released a second iteration called Jukebox that generated raw audio instead of MIDI. We saw a higher quality of music emerge from a fairly straightforward set of inputs: users select the genre and artist that they want Jukebox to mimic. This could very well be the first step toward music generated through NLP (natural language processing).
How OpenAI Generates Audio with Jukebox
This could get a little technical so I'll try to keep it short and sweet. To generate new audio from existing tracks, Jukebox starts by studying a source file. As their website shows in the graphic below, the original music goes through a chain of encoding, upsampling, and decoding before it's available for the listener to review.
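If it helps to see the three stages as code, here is a toy sketch of an encode, upsample, and decode chain. It is not Jukebox. Jukebox uses trained neural networks at every stage, while this example uses simple averaging and repetition so it fits on one screen. The five-value codebook, the compression factor `HOP`, and all function names are assumptions invented for the illustration.

```python
import math

CODEBOOK = [-1.0, -0.5, 0.0, 0.5, 1.0]  # assumed toy "vocabulary" of amplitudes
HOP = 4                                  # assumed compression factor: 4 samples per code

def encode(samples):
    """Compress audio: replace every HOP samples with the index of the nearest codebook value."""
    codes = []
    for i in range(0, len(samples), HOP):
        chunk = samples[i:i + HOP]
        mean = sum(chunk) / len(chunk)
        codes.append(min(range(len(CODEBOOK)), key=lambda c: abs(CODEBOOK[c] - mean)))
    return codes

def upsample(codes):
    """Stretch the coarse codes back to full length by repeating each one.

    A learned upsampler would fill in new detail here; repetition is a stand-in.
    """
    return [c for c in codes for _ in range(HOP)]

def decode(codes):
    """Turn code indices back into audio amplitudes."""
    return [CODEBOOK[c] for c in codes]

# A short sine wave stands in for the source track.
source = [math.sin(2 * math.pi * t / 32) for t in range(64)]

coarse = encode(source)              # 16 codes instead of 64 samples
restored = decode(upsample(coarse))  # back to 64 samples, with detail lost

max_error = max(abs(a - b) for a, b in zip(source, restored))
```

The restored wave follows the shape of the original but comes out blocky. That lost detail is what the real system's learned stages work to put back, and it goes some way to explaining why the process is so slow.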
By OpenAI's own account, it takes Jukebox roughly nine hours to render a single minute of audio, so it doesn't lend itself to rapid prototyping or experimentation. There are no public user interfaces available either. This makes the technology interesting but not very accessible. In the future we anticipate that Jukebox could evolve into something closer to Dalle-2, but for music.
Give it another 3-5 years and I predict that this kind of generative music software will be commercially available. As we outlined in this article on the best AI Music Apps, several companies are already trying to fill the gap, albeit with less interesting tools.
With a quick search you'll find that music industry trend forecasters think AI music is one of the most promising new developments in the field. Just last week, SoundCloud announced that it had acquired the AI music software company Musiio. These changes are likely to transform how music is created as well as how people listen and relate to it as a whole.
Turning Words and Phrases Into Full Songs
The Jukebox app attempts to translate ideas like "A country western song in the style of Dolly Parton" into actual music. You can imagine how their system would comb through a database of metadata, find Dolly's music, train on it, and deliver something in a similar style. But what happens when you ask for Dolly Parton to write death metal or hyperpop?
Dalle-2 is amazing because the limits of our own imagination can be transcended by the neural network. There's not a lot of imagination at play when you ask something to mimic an artist and genre. But things get more interesting when you consider turning abstract thoughts like "the feeling of throwing a hamburger against a wall in the style of Kanye West" into music. That's because, unlike images, music is abstract and representative.
To genuinely capture the feeling of a phrase, OpenAI would have to link the emotive capacity of GPT-3 to symbolic expression. For example, throwing something implies a quick burst of energy. Throwing a hamburger against the wall will be recognized as unusual and maybe funny. Once Kanye's name is brought in, we know the genre will be hip hop.
For the ordinary person, trying to write from a cue like this would be ridiculous. Cooking up a Kanye-type beat is hard enough, but capturing the feeling of throwing food at a wall? It would be too specific for any musician to take seriously. But for the right artificially intelligent composer, these types of prompts might feel more achievable.
While the world waits for OpenAI to introduce the Dalle-2 of music, our company has created a stepping stone called AudioCipher. Designed for musicians and beatmakers, the app lets you type in any phrase, choose a musical key and rhythm, and then drag the melody to your DAW.
Music has always been a fundamentally human form of creativity. Combining our natural songwriting talents with a melody generator, based on a theme of our choice, is a great way to overcome creative block. Fine-tune the melodic output and start arranging your song based on the phrase that inspired you.
If you're more inclined to playing with visuals than text, check out our article on MIDI art. We'll show you how content creators draw and add color to notes in their piano roll to produce music that actually sounds good!
At the end of the day, our brains are organically intelligent and predisposed to making music. We still have a big upper hand on AI music generators: they can help us get started, but they lack emotional intelligence. It's the reason that Mubert, one of the best music AI services out there, markets itself as a collaboration between humans and AI.
Until AI music generators reach the creative capacity of tools like Dalle-2, it's on each of us to channel our thoughts and feelings into our music directly.
Update from November 26 2022: Composer and thought leader David Bruce recently published this comprehensive overview of OpenAI's music generation software, expanding on the concepts presented in this article. If you'd like to take a deeper dive into this space, be sure to check this out!