They say a picture’s worth a thousand words... but how much music is a picture worth? Or put another way, what does an image sound like?
AI music generator companies have started adding image-to-music workflows that can turn any visuals into audio, ranging from songs to AI sound effects.
In this article we’ll explore this topic in detail, providing tutorials and video demos that show you exactly how each service works. We also make some predictions about what's on the horizon for video-to-music workflows, particularly for AI film scoring techniques in video DAWs like Audio Design Desk.
What is AI image to music?
The expression image to music refers to a generative AI technique where image files are converted into musical audio files. It's become a popular experimental feature on a few commercial AI music generators including SoundGen, Mubert and Splash Music.
We've previously published articles speculating that it would one day be possible to create music from the output of image generators like Midjourney and DALL-E. That time has finally come, and it took less than a year to happen!
OpenAI released an AI-powered image captioning feature near the end of 2023 that made all of this possible. Users can upload any picture file, and GPT-4 not only describes what it sees but can also engage conversationally about it.
Routing this kind of “image to text caption” service into a generative “text to music” tool opens up a whole new phase of musical creativity. I’ve been amazed to find that a couple of these services really can translate images into related music accurately.
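To make the caption-then-generate idea concrete, here is a minimal Python sketch of the chain. It is an illustration under stated assumptions, not how any of the services below are actually built: `image_to_music`, `placeholder_caption` and `placeholder_music` are hypothetical names, and the two placeholder functions stand in for a real vision-language model and a real text-to-music model.

```python
from typing import Callable

# Hypothetical stand-ins: a real system would call a vision-language model
# for captioning and a text-to-music model for audio generation.
Captioner = Callable[[bytes], str]
MusicGenerator = Callable[[str, float], bytes]


def image_to_music(image: bytes, caption: Captioner, generate: MusicGenerator,
                   duration_s: float = 15.0, style_hint: str = "") -> bytes:
    """Chain an image captioner into a text-to-music generator."""
    description = caption(image)
    # Optionally steer the music with extra words appended to the caption.
    prompt = f"{description}, {style_hint}" if style_hint else description
    return generate(prompt, duration_s)


def placeholder_caption(image: bytes) -> str:
    # Assumption: a fixed caption instead of a real model call.
    return "two salsa dancers on a crowded dance floor"


def placeholder_music(prompt: str, duration_s: float) -> bytes:
    # Assumption: returns a text placeholder instead of real audio.
    return f"[{duration_s:.0f}s of music for: {prompt}]".encode()


if __name__ == "__main__":
    print(image_to_music(b"fake image bytes", placeholder_caption, placeholder_music))
```

The point of the sketch is that the music model never sees the picture. It only sees the caption, so the quality of the description sets a ceiling on how relevant the music can be.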
Best image-to-music generators of 2024
Image to music may seem like a wild AI side quest, and it is, but the idea was incubating long before generative AI was powerful enough to deliver on the dream the way it can today.
Melobytes planted their flag in the proverbial moonrock of image-to-music several years ago. Their web app image2music set a goalpost that other companies could aspire toward. As for the actual quality and relevance of the music, that’s a different story. Melobytes' output is not very good, but it's a cult classic among meme makers.
So, let's have a look at the biggest contenders of 2024, starting with SoundGen.
Image file types: GIF, PNG, JPEG/JPG
Video and audio file types: MP4, MOV, WAV, MP3
SoundGen launched its AI text-to-music generator near the end of 2023 and has continued to innovate every month with new features. Among those is an embedded video player that loads files from your personal computer or YouTube URLs. When users generate an audio clip, they can sync it to any timestamp they choose in the video, so playback always starts from the same point.
In January 2024, SoundGen announced a new image-to-music feature. The video at the top of this article explores the feature in depth, but here's what that looks like when you're in the app:
Quick tutorial: To get started, sign up for free and verify your account. Once you’re in SoundGen’s web app, create a new project. Notice that the text prompt field has a camera icon located in the bottom right. If you click on that, you can upload any common image file type.
The image will be turned into music of whatever duration you choose. Start with short clips and use SoundGen’s extend feature to lengthen songs to several minutes or more.
Examples of image-to-music in action:
An image of two salsa dancers returns a flamenco track with classical guitar and castanets
A picture of Europeans in traditional outfits returns classical guitar with melodic violins
A painting of a floating space ship returns an ethereal cybernetic soundscape à la Vangelis
A crowd photo taken by someone at a rock show returns uptempo, anthemic indie rock
SoundGen’s audio quality varies with the quality of the prompt and also a bit of luck. We typically ran four projects in tandem to make sure we got a healthy collection of audio files back after a couple of minutes of waiting.
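We ran our four projects by hand in the web app, but the same "several takes at once" habit is easy to picture in code. The Python sketch below is purely hypothetical: it does not call SoundGen (we are not assuming any public API here), and `generate_batch` and `placeholder_take` are made-up names. It only shows the pattern of requesting several variations in parallel and collecting them in order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def generate_batch(prompt: str, generate: Callable[[str, int], bytes],
                   variations: int = 4) -> List[bytes]:
    """Request several variations of one prompt at once and return them in order."""
    with ThreadPoolExecutor(max_workers=variations) as pool:
        # One job per seed, so each take can differ from the others.
        futures = [pool.submit(generate, prompt, seed) for seed in range(variations)]
        return [future.result() for future in futures]


def placeholder_take(prompt: str, seed: int) -> bytes:
    # Assumption: stands in for a slow call to a music generator.
    return f"{prompt} (take {seed + 1})".encode()


if __name__ == "__main__":
    for take in generate_batch("uptempo, anthemic indie rock", placeholder_take):
        print(take.decode())
```

Since generation takes a couple of minutes either way, waiting on four takes in parallel costs about the same time as waiting on one, and it gives you more to choose from.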
The best part is that SoundGen will soon integrate with Audio Design Desk, a sound design DAW for video post production.
Image file types: GIF, PNG, or an image URL
Audio file types: MP3 and WAV
Mubert was one of the first AI tools to focus on generative AI music. They’ve continued to innovate each year, exploring the outer edges of what most people imagined to be possible.
AI text to music may seem obvious today, but a year ago Mubert and Riffusion were the only ones out there really doing it. So I was pleasantly not-surprised to see them add image to music in late 2023.
Quick tutorial: To access Mubert’s image to music app, sign up and log into your account. Open the Render page if you’re not already there. Here you’ll find a text prompt with a camera icon fixed to the right edge of the field. Click the camera to open the drag-and-drop popover.
Drag gif and png files from your computer into the upload area. Alternatively, paste an image URL into the website field to reference a hosted file. Click upload or next to continue.
Having uploaded the file, you’ll return to the prompt dashboard. Select the type of music file that you want to create (track, loop, mix or jingle). Set your duration and hit generate track.
The name of the track is often a reference to what Mubert’s AI model detected in the image. You can download the file and use it, but first you’ll have to share the URL where you publish music.
AI Image Generators: Text-to-image-to-music-to-video
Artificial intelligence enthusiasts should try this out at least once. Create an AI image and convert it into music. Then bring both files into a video editor. I recommend Neural Frames for creating AI music videos, but you can use whichever editor you prefer. Apply animation templates or motion graphics to the image and voilà: you've got original video clips to share on social media.
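If you just want a quick still-image video without opening an editor, the command-line tool ffmpeg can pair one image with one audio file. This is a swapped-in alternative to the editor workflow described above, not a Neural Frames feature. The Python sketch below only builds the command. The file names are placeholders and `still_image_video_cmd` is a hypothetical helper name.

```python
import shlex
from typing import List


def still_image_video_cmd(image_path: str, audio_path: str, out_path: str) -> List[str]:
    """Build an ffmpeg command that shows one image for the length of an audio file."""
    return [
        "ffmpeg",
        "-loop", "1",            # repeat the single image as a video stream
        "-i", image_path,
        "-i", audio_path,
        "-c:v", "libx264",       # H.264 video
        "-tune", "stillimage",   # encoder tuning for static pictures
        "-c:a", "aac",           # AAC audio
        "-pix_fmt", "yuv420p",   # pixel format most players accept
        "-shortest",             # stop when the audio ends
        out_path,
    ]


if __name__ == "__main__":
    # Placeholder file names; print the command so it can be pasted into a terminal.
    print(shlex.join(still_image_video_cmd("cover.png", "track.wav", "clip.mp4")))
```

Printing the command rather than running it keeps the sketch safe to try. Images with odd pixel dimensions may need resizing first, since this pixel format expects even width and height.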
Fffiloni presents Image2SFX (AudioLDM2)
The video above showcases a HuggingFace space by Fffiloni that turned images into music. Unfortunately, the music generator has been deprecated but there's a similar one called Image2SFX that works great and provides a different spin on the musical offerings from SoundGen and Mubert.
To get started, all you need to do is upload an image and choose a model. Right now, the options include MAGNet, AudioLDM-2, and AudioGen. Of the three, AudioLDM-2 seems to create the best AI sound effects. It generates audio quickly and accurately. Hit play in the audio output section to have a listen and click the download icon to grab the WAV file for free.
AudioLDM-2 turned the above picture into a gentle atmospheric ambience with wet squishing sounds that seemed to represent the shiny, watery blue color of the sphere and creature. It created something similar on follow-up generations.
By comparison, the MAGNet model threw an error repeatedly and AudioGen generated some scratchy sounds that seemed to lack purpose.
The Splash Music easter egg (youtube-to-music)
Website: Splash Music
Video type: YouTube URLs
Splash offers a text-to-music service, and it seems that for some period, they quietly allowed URLs to be provided as a text prompt.
In the screenshot below, you can see how AmliArt’s video was translated into a series of songs called Sample 1, 2 and 3. We weren't able to reproduce it but the feature might be available soon, since they're promoting it.
For those who haven’t tried Splash Music before, it’s a lyric-to-song generator that turns your words into AI vocals. They rap and sing in ways that are comparable to Suno and Riffusion. Of the three, Splash Music specializes in electronic music.
Social sharing is big on Splash, so along with regular MP3 audio files you can also download an MP4 video. The videos include the lyrics for each song section, in a landscape layout best suited for Instagram and YouTube videos.
AI video-to-music: The next logical step
AI image-to-music can be extended across multiple frames to become a video-to-audio generator. One solution involves analyzing individual frames, creating matching musical sketches, and then blending the clips together to form a coherent soundtrack.
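The blending step is the part that's easiest to show in code. Here is a minimal Python sketch of a linear crossfade between clips, treating each clip as a plain list of mono samples. It is one simple way to join sketches, not a description of any product's method, and `crossfade` and `blend_clips` are hypothetical names.

```python
from typing import List, Sequence


def crossfade(a: Sequence[float], b: Sequence[float], overlap: int) -> List[float]:
    """Join two clips, linearly fading a out and b in over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)
    head = list(a[:len(a) - overlap])   # part of a that plays on its own
    tail = list(b[overlap:])            # part of b that plays on its own
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)     # weight of clip b rises across the overlap
        mixed.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return head + mixed + tail


def blend_clips(clips: Sequence[Sequence[float]], overlap: int) -> List[float]:
    """Chain any number of clips into one track with crossfades between them."""
    track: List[float] = []
    for clip in clips:
        track = crossfade(track, clip, overlap) if track else list(clip)
    return track


if __name__ == "__main__":
    loud, quiet = [1.0] * 4, [0.0] * 4
    print(blend_clips([loud, quiet, loud], overlap=2))
```

A crossfade only hides the seams. Musical sketches generated from different frames can still clash in tempo or key, which is why fully automated video-to-music is harder than it first looks.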
SoundGen is positioned to deliver video-to-music. Their image-to-music generator and embedded video editor could be combined in interesting ways.
There doesn’t seem to be a web app that offers fully automated video-to-music generation, but if I’m missing something please let me know in the comments.
Let’s take this beyond speculation and have a closer look at a research paper, SonicVisionLM, where raw video has been interpreted with AI to create matching audio sound effects and foley.
SonicVisionLM: AI Generated SFX for Video
Evidence for frame-by-frame analysis and audio generation can be found in this arXiv paper, SonicVisionLM: Playing Sound with Vision Language Models, published January 9th, 2024.
This diagram from their report depicts individual frames of a video analyzed by MiniGPT-v2. The model is instructed to create “action sound effects” to match a scene. When the response is issued, it’s passed through a latent diffusion model with an adapter, to generate sounds.
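To give a feel for the frame-by-frame idea, here is a loose Python sketch of the first half of such a pipeline: pick timestamps, ask a model to describe each sampled frame, and keep a list of timed sound cues. This is not the paper's code. `frame_timestamps`, `sound_cues` and `placeholder_describe` are hypothetical names, the describe function stands in for a vision-language model, and skipping repeated descriptions is our own simplification. In a full system each cue would then be handed to an audio generator.

```python
from typing import Callable, List, Tuple


def frame_timestamps(duration_s: float, interval_s: float) -> List[float]:
    """Evenly spaced sampling times, starting at 0 and staying inside the clip."""
    if duration_s <= 0 or interval_s <= 0:
        raise ValueError("duration and interval must be positive")
    count = int(duration_s // interval_s)
    times = [round(i * interval_s, 3) for i in range(count + 1)]
    return [t for t in times if t < duration_s]


def sound_cues(duration_s: float, interval_s: float,
               describe_frame: Callable[[float], str]) -> List[Tuple[float, str]]:
    """Describe each sampled frame and keep a cue whenever the description changes."""
    cues: List[Tuple[float, str]] = []
    for t in frame_timestamps(duration_s, interval_s):
        text = describe_frame(t)
        if not cues or cues[-1][1] != text:   # skip repeated descriptions
            cues.append((t, text))
    return cues


def placeholder_describe(t: float) -> str:
    # Assumption: a fixed script instead of a real vision-language model.
    return "footsteps on gravel" if t < 5 else "door slams shut"


if __name__ == "__main__":
    for start, text in sound_cues(10.0, 2.5, placeholder_describe):
        print(f"{start:>5.1f}s  {text}")
```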
Put simply, the image-to-music feature that we’ve seen from SoundGen and Mubert so far may evolve into a more advanced video-to-soundtrack service before the end of 2024. Let's see where this takes us!