TechCrunch recently covered the launch of Riffusion, a free AI music generator that turns text prompts into audio in real time. Users can generate waveforms, visualize them, listen to the results, and download the audio clips to their local computer for fine-tuning.
Riffusion started trending almost immediately. While the audio quality and the range of usable prompts still need work, we're excited to see an easy-to-use tool like this made available to the public.
Riffusion Generates AI Music from Spectrograms
This has been a wild year for artificial intelligence. Earlier this month, OpenAI released ChatGPT and Elon Musk warned that it was dangerously good. We can only hope that AI music generators start delivering a similar output quality.
Like most AI-generated music, Riffusion produces low-fidelity audio files that are only a few seconds long. There’s a well-known problem in this space where neural networks lose the plot and veer way off course from their initial prompt, sometimes very quickly.
To address this problem, the latent space created by these deep learning algorithms has to be smoothed over with interpolation, using a Stable Diffusion technique called img2img. This allows the short-duration waveforms to be stitched together to form longer audio clips.
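To make the interpolation idea concrete, here is a minimal sketch of spherical linear interpolation (slerp), one common way to blend between two latent vectors in diffusion models. Whether Riffusion uses exactly this formula is an assumption on our part, and the function name, array shapes, and random stand-in latents are hypothetical, chosen only for illustration.

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-7):
    """Spherical linear interpolation between two latent arrays (illustrative)."""
    v0 = np.asarray(v0, dtype=np.float64)
    v1 = np.asarray(v1, dtype=np.float64)
    # Angle between the two latents, from their normalized dot product.
    dot = np.sum(v0 * v1) / (np.linalg.norm(v0) * np.linalg.norm(v1))
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    # Nearly parallel latents: fall back to plain linear interpolation.
    if np.sin(theta) < eps:
        return (1.0 - t) * v0 + t * v1
    s0 = np.sin((1.0 - t) * theta) / np.sin(theta)
    s1 = np.sin(t * theta) / np.sin(theta)
    return s0 * v0 + s1 * v1

# Random stand-ins for the latents of a "start" clip and an "end" clip.
# The (4, 64, 64) shape is an assumption, not Riffusion's actual latent size.
rng = np.random.default_rng(0)
latent_a = rng.standard_normal((4, 64, 64))
latent_b = rng.standard_normal((4, 64, 64))

# Five evenly spaced steps that move gradually from latent_a to latent_b.
steps = [slerp(t, latent_a, latent_b) for t in np.linspace(0.0, 1.0, 5)]
```

In a real pipeline, each intermediate latent would be decoded into a spectrogram and then into audio, so that neighboring clips change gradually instead of jumping between unrelated sounds.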
Spectrograms are sound waves that have been visualized and mapped onto a graph. Time runs along one axis and frequency along the other, while the color of each pixel represents the amplitude of that frequency at that moment. For a general introduction, check out the free Spectrogram tool from Google's Chrome Experiment page, shown above.
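For readers who want to see how a spectrogram is built, the following is a rough sketch using only NumPy. It slices a signal into short overlapping frames and takes the Fourier transform of each one. The frame size, hop length, sample rate, and the 440 Hz test tone are all arbitrary choices for the demo, not values taken from Riffusion.

```python
import numpy as np

def spectrogram(signal, frame_size=1024, hop=256):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_size] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of each real-valued frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8000                        # samples per second (demo value)
t = np.arange(sample_rate) / sample_rate  # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)      # a pure 440 Hz sine wave

spec = spectrogram(tone)
freqs = np.fft.rfftfreq(1024, d=1.0 / sample_rate)
# The brightest row of the spectrogram should sit within one bin of 440 Hz.
peak_hz = freqs[spec.mean(axis=1).argmax()]
```

Plotting `spec` as an image, with color mapped to magnitude, would show a single bright horizontal line at the pitch of the tone.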
How to use Riffusion as a music producer
After experimenting with this new software, we’ve found that some prompts work better than others. You won’t have much luck naming specific artists, even if they’re well known. Instead, focus on single instruments (church bells, bongo drums) and genres (rock, hip hop, etc.). The application is still in its infancy, but as it continues to grow in popularity, the performance will likely improve.
If you’re having trouble getting meaningful output, click the dice icon located just to the left of the text prompt. The app will suggest prompts that it’s familiar with and you’ll have a better chance of hearing something accurate.
To create variations on the same prompt, press play and let the process run for a while. You can also access the settings and choose from a handful of seed images and denoising options. According to the developers, this will help generate new melodic output.
Once you’re ready, you can download an MP3 file by clicking the share icon on the main screen and then clicking the vertical ellipsis. Drop the audio file into your digital audio workstation and be prepared to do some cleanup with EQ, denoising, and any other tools you have at your disposal.
OpenAI's Point-E: A Riffusion Competitor?
OpenAI recently announced a new point cloud diffusion model called Point-E. According to its GitHub page, Point-E generates 3D models using point clouds (hence the name). Riffusion, by contrast, is trained on 2D spectrograms and creates a 2.5D visualization of the waveform for users of the web app.
If Point-E were to train on 3D spectrograms directly, it's possible that it would generate waveforms with more data and, by extension, produce higher-fidelity audio. This is only speculation at this point, because Point-E has a long way to go and OpenAI has not announced any plans to roll out waveform sonification.
Even if OpenAI decides not to pursue 3D-to-Audio, other services like Nvidia's Magic3D AI could be candidates. With the growing demand for AI music generation, 3D alternatives to Riffusion are bound to arise.
History of image-to-spectrogram in music
Riffusion isn't the first time audio engineers have toyed with spectrograms in the music-making process. Famed electronic music artist Aphex Twin released a track commonly known as [Equation] on his 1999 EP Windowlicker. It was so noisy and barely listenable that one fan thought to run it through a spectrogram visualizer.
Sure enough, near the end of the track they found a hidden message: the image of a sinister-looking face with glasses staring back at the listener. This was on brand for Aphex Twin, whose music videos often dealt with dark and absurd themes like this.
Another electronic artist, Venetian Snares, released an equally noisy track, Look, on the album Songs About My Cats, featuring the following image:
The soundtrack to a cult classic video game, Fez, has moments of hissing and noise that also contain images hidden in the spectrograms. These examples just go to show that music producers have been toying with this method for a long time now.
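As a rough illustration of how an image can end up inside a spectrogram, the sketch below assigns each row of a picture its own sine wave and uses pixel brightness as the volume of that wave over time. This is a simplified approach of our own, not the method used by any of the artists mentioned, and the function name, frequency range, and timing values are hypothetical.

```python
import numpy as np

def image_to_audio(image, sample_rate=8000, col_duration=0.05,
                   f_min=500.0, f_max=3500.0):
    """Turn a 2D brightness array into audio whose spectrogram resembles it."""
    image = np.asarray(image, dtype=np.float64)
    n_rows, n_cols = image.shape
    # The top image row maps to the highest frequency, the bottom to the lowest.
    freqs = np.linspace(f_max, f_min, n_rows)
    samples_per_col = int(round(sample_rate * col_duration))
    t = np.arange(n_cols * samples_per_col) / sample_rate
    # Stretch each pixel column so that it lasts col_duration seconds.
    envelope = np.repeat(image, samples_per_col, axis=1)
    # One sine wave per row, scaled by that row's brightness over time.
    audio = np.sum(envelope * np.sin(2 * np.pi * freqs[:, None] * t), axis=0)
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio

# A tiny 3x4 test "image" in which only the middle row is lit.
picture = np.array([
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
audio = image_to_audio(picture)
```

With this test picture, the result is a steady tone at the middle row's frequency. A real photograph would produce the kind of dense hiss heard in the tracks above, with the image only visible in a spectrogram viewer.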
As Riffusion becomes more advanced, it's easy to imagine a scenario where users would be able to hide images within songs.