
Audio to Audio AI: Melody-to-Song & Style Transfer in 2024

Generative AI music has evolved rapidly over the past two years. Following decades of research and development in academic circles, the general public is finally gaining access to consumer-friendly interfaces. Musicians are enjoying a renaissance of new plugins and DAW features powered by the same technology.

The first wave of AI music creation in 2023 centered around text-to-music. That trend was followed by a second wave of apps like Suno and Udio, offering text-to-song. However, there's a sleeping dragon in the field that's scheduled to have its big debut in 2024.

Audio to Audio AI is a subset of music generation that takes in an audio file, runs it through an AI model, and transforms it into a new audio file. There are several companies offering commercial audio-to-audio products today.

Here are the most common subcategories within this niche:

  1. Melodic conditioning: Humming, whistling, or performing a solo melody on an instrument and turning it into a complete arrangement.

  2. Music samples into songs: Turning multi-instrument music arrangements into new music clips and extending them to generate new song sections.

  3. Audio-to-MIDI-to-audio: Converting vocal performances into beats by analyzing the pitch and timbre of mouth sounds and connecting them to MIDI-triggered samples in your DAW.

  4. Tone transfer: Using AI models that were trained on an instrument to turn user input, like vocals or a guitar performance, into new instruments like a violin or saxophone.

  5. Style transfer: Changing the style of audio and music inputs to another, with the option for more than one output instrument. It combines melody conditioning, tone transfer, and remixing in a single function.

  6. Voice cloning for singers: Transferring an audio recording of a singing voice into another vocalist's style and timbre.

  7. Stem separation: Splitting full arrangements into individual tracks.

In this article we'll share specific apps that fall into each category, along with videos of how people are using them in music creation workflows.

Music-to-song generators

AI powered music-to-song generation refers to models that take in an initial audio file and transform it into a complete arrangement.

Musicians are using these models to quickly expand on ideas and kickstart new projects. Non-musicians can hum a melodic idea or tap on a table to experience the magic of song creation, without running into usual roadblocks like music theory, instrument performance, and digital audio recording.

Udio (audio-to-audio extension)

Udio interface for audio to audio AI

Udio currently offers a powerful AI audio-to-audio feature. The system takes in music clips and combines them with tags provided by the user, generating new music in that same style. To access this option you'll need a paid subscription.

We experimented with this feature extensively and discovered a few interesting details about the model's performance:

  1. When we uploaded and extended music with vocals, the default settings generated music that incorporates instrument stems from the input. This helped with track cohesion, producing new verses with variations in the instrumental arrangement.

  2. With each extension, new sonic elements are introduced. By the third or fourth extension, we found that the audio could end up in very different territory from where we started.

  3. You can extend forward or backward relative to the whole clip, meaning you can use Udio to generate intros that segue into your music.

To try the feature out for yourself, log into your account and at the top of the page, find the upload icon at the right corner of the text prompt field. After uploading a track, the file name will appear to the left of the prompt box. Add a series of comma separated tags to further influence the style of the musical output. Before hitting extend, review the additional parameters below.

How do you extend audio files in Udio?

  1. Lyrics: Select custom to input your own lyrics, instrumental to omit vocals, and auto-generated to let Udio's system come up with words for you. Try to come up with your own. AI generated lyrics tend to be a bit mediocre.

  2. Prompt strength: Strength refers to how much influence the input has over the output. Maximum strength values will force your input into the new generation and can create less natural sounding music, while very low strength settings may sound more natural but drift far away from the original input. For this reason, it's best to start in the middle and adjust incrementally.

  3. Seed: The default value of -1 randomizes the seed, so that the output resulting from each generation is sufficiently different from the previous one. If you type in a random positive value and keep other settings the same, you'll get a more closely related output.

  4. Clip start: This feature is used to approximate whether you want the clip to be based on a song's intro, middle, or outro. It doesn't indicate where the AI model actually extends from. Those controls are located in the section above, labeled extension placement.

  5. Context length: This refers to how much of the song Udio should take into consideration when composing the extended section. The more context you include, the closer it will adhere and the more of your melodic or instrumental concepts will be incorporated.
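The seed behavior described in the list above can be illustrated with a toy sketch. This is not Udio's code or API: `resolve_seed` and `generate_variation` are hypothetical names, and the "generator" just emits pseudo-random MIDI pitches. It only shows why a fixed seed gives repeatable results while -1 picks a fresh seed on every run.

```python
import random


def resolve_seed(seed: int) -> int:
    """Return the seed to use; -1 means 'pick a fresh random seed'."""
    if seed == -1:
        return random.SystemRandom().randrange(2**31)
    return seed


def generate_variation(seed: int, length: int = 8) -> list[int]:
    """Hypothetical stand-in for a music generator: returns MIDI pitches."""
    rng = random.Random(resolve_seed(seed))
    return [rng.randint(60, 72) for _ in range(length)]


# Same seed, same settings: the two outputs are identical.
same_a = generate_variation(seed=1234)
same_b = generate_variation(seed=1234)

# Seed of -1: a new seed is drawn, so the result will usually differ.
fresh = generate_variation(seed=-1)
```

In a real model the seed controls the starting noise rather than the notes directly, so reusing a seed while changing other settings gives related rather than identical output.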

Udio's terms of service allow them to train on your music

Be aware that if you're uploading original music to Udio's system, their terms of service specifically state that "input content" will be used to improve and modify their machine learning models. The grant of rights includes the option for their company and affiliates to reproduce, store and modify your files.

Udio terms of service

Suno AI Music (audio-to-song)

Suno's new AI audio to audio model

Suno is currently the top AI song generator on the commercial market. Until recently, they only offered text inputs for prompting music style and lyrics. In May 2024, they announced a $125M round of funding and rolled out a new model, version 3.5, with a teaser for their upcoming audio-to-audio feature.

The gif above was sampled from their short promotional demo on social media (here). It shows a person tapping on a watering can, coupled with a music description that reads "In the style of heavy psych rock". The rhythm of their tapping is transformed into a complete song arrangement in that same BPM.

Models of this kind are called multimodal because they accept two or more modes of input; in this case it's text and audio. We've not yet seen how Suno's audio-to-audio model will handle melodic inputs. As soon as the feature's published, we'll update this article and let you know.

SoundGen (music-to-song)

SoundGen is already on the market and currently supports the option to turn melodies, chord progressions, percussion and music clips into full arrangements in any style. The interface is built with musicians in mind, but anyone can use it.

Like Suno, SoundGen runs on a multimodal engine that combines audio input with a text description to reimagine that clip in a new way. Check out the video above for a demo of how users are turning raw musical ideas, like a melody or guitar riff, into a complete arrangement.

The company's vision for audio-to-audio is more like a bandmate or collaborator than an instant song generator. They've built out granular editing tools like trimming, audio file management, and a standalone app that supports dragging-and-dropping creations right into a DAW.

Song expansion is a second interesting use case for SoundGen. Rather than turning a single instrument into an arrangement, it can turn finished music clips into new variations. There's an "extend" feature that allows users to continue where the generation left off and keep making music.

Check out the demo below to see how beat makers are iterating on music samples from Splice to get new ideas.

Google's MusicLM, MusicFX, and Lyria models

Google made a big splash at the I/O conference in May 2024, opening with a live performance of their latest MusicFX web app. Despite a lot of hype surrounding the event, DeepMind is still withholding their audio-to-audio feature. There's a chance we'll see it drop some time later this year.

The I/O event came roughly sixteen months after the original MusicLM paper was published in January 2023. That document had promised a melodic conditioning feature that would include humming and whistling as inputs. Have a listen to those examples here.

Text and melody conditioning with MusicLM

In November 2023, Google's DeepMind team published a follow-up report that proposed the following: "Imagine singing a melody to create a horn line, transforming chords from a MIDI keyboard into a realistic vocal choir, or adding an instrumental accompaniment to a vocal track."

This system was tied to a new model called Lyria and an AI-generated watermark called SynthID that Google can use to trace songs back to their system. We've shared a screenshot of that interface below, but just to be clear, the Lyria app is still not available.

Audio-to-audio tone transfer

Tone transfer, also known as timbre transfer, refers to AI models that transform audio inputs into unrelated instruments. Some common examples of this include singing-to-saxophone, humming-to-violin, and beatbox-to-conga.

Neutone Morpho

Neutone is currently the leader in timbre transfer technology, or as they call it, tone morphing. In 2024 they released a new, state of the art VST called Morpho and shocked their fanbase with a gorgeous interface that adapts to audio in realtime.

As of May 2024 they have 19 models and counting. Each one has been trained on a unique collection of sounds, ranging from Eastern string instruments and vocal choirs to effect layers. Some of their models would qualify as style transfer, which we'll address in the next section.

There are two points that I want to underscore about Neutone. One is that they've proved AI music services can be created ethically. Each of their models was trained consensually through partnerships with musicians and audio engineers.

The second point is that they're in the process of rolling out a new product called Cocoon that will give users the opportunity to train their own tone morphing models. Until now, this capability has been reserved for programmers and tech savvy tinkerers using services like Google Colab.

Ultimately, Neutone is an experimental musician's tool more than it is a song generation app. For those who are excited to venture into that territory, it's worth signing up for the free trial and giving the app a spin.

Combobulator by DataMind Audio

DataMind Audio's Combobulator plugin is a second example of experimental tone transfer aimed at musicians who operate out of a DAW. The company refers to their approach as neural synthesis, because it combines neural networks with a kind of audio synthesis.

Each AI model provided by DataMind was trained in partnership with a popular underground electronic artist, including heavyweights like Mr Bill, Rob Clouth, and Woulg. These fine tuned AI models are branded as artist brains and give users the opportunity to transfer audio inputs into the style of those musicians. 50% of the revenue from each sale goes directly to the artist.

Check out our full review of Combobulator here to learn more.

Mawf Plugin

Mawf timbre transfer plugin

Mawf was a short-lived timbre-transfer VST designed by Yale graduate and ex-Google-turned-TikTok employee, Hanoi Hantrakul. The app never made it past its beta version, perhaps because his efforts were absorbed into an AI research scientist role at ByteDance.

The plugin offers a small collection of audio-to-audio models, ranging from Western instruments like trumpet and saxophone to a traditional Thai stringed instrument. Output controls on the pedal range from effects and dynamics to even expressive nuance like vibrato and tremolo.

The plugin is available for free when you sign up to their beta program.

Audio-to-audio style transfer

Style transfer is a third category of audio-to-audio that includes tone transfer and melodic conditioning. It's one of the most difficult to describe or pin down because of the technical details of how these models work. One employee called it "redenoising" or "init audio", terms that would be incomprehensible to most people. So we're sticking with a simpler term, style transfer, borrowed from the parallel image-to-image world.

Stable Audio 2.0

Stability AI's music generation app, Stable Audio 2.0, is a multimodal web application that converts text and audio input into a wide range of outputs, based on advanced parameters configured by the user.

On the surface, Stable Audio seems to belong in the same audio-to-audio camp as SoundGen. But I spoke with members of Stable Audio's team and learned that something more complex is happening under the hood.

Apps like SoundGen use a melody-centric technique called chroma conditioning that prioritizes the melody over everything else. It's pretty good at generating compatible chord progressions and arrangements with that approach.
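To make the idea of chroma more concrete: a chroma representation folds every pitch into one of 12 pitch classes and ignores the octave, which is why it captures melody and harmony while discarding most timbre. The sketch below is a minimal illustration on MIDI note numbers, not SoundGen's actual implementation. The function name `chroma_vector` is hypothetical, and real systems compute chroma from audio spectra rather than from note lists.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chroma_vector(midi_notes: list[int]) -> list[float]:
    """Fold MIDI notes into a normalized 12-bin pitch-class histogram."""
    counts = [0] * 12
    for note in midi_notes:
        counts[note % 12] += 1  # octave information is discarded here
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 12


# The same melody in two different octaves yields the same chroma vector.
low = chroma_vector([60, 64, 67, 64])   # C4 E4 G4 E4
high = chroma_vector([72, 76, 79, 76])  # C5 E5 G5 E5
```

A model conditioned on this kind of feature is free to choose its own instruments and octaves, as long as the pitch classes over time stay close to the input.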

Stable Audio's diffusion model includes text conditioning but also performs timbre transfer, voice transfer, and remixing all in one function. The latent space has dimensions that describe melodic content, and the transformer is aware of those.

For this reason, I needed a new term for it and landed on style transfer. The breadth of possibility is greater with Stable Audio, but users can't easily pick and choose the function they want to target.

VAE and latent space diagram

Above is a simplified diagram of how this class of generative AI models, called variational autoencoders, works behind the scenes.

Stable Audio traces a unique path in the VAE's latent space. It adds noise to that path, to shake things up a bit, and uses diffusion to reshape that path the remaining ~5-10% of the way.

During that phase, the VAE introduces biases that pull the path closer to whatever curves the transformer has learned during training and closer to the line or "shape" of your prompt.

In practice, this means you can add latent noise to an acoustic song and use audio-to-audio style transfer to make it sound like an electronic track. It's keeping most of the shape of that curve, but (within the noise range you provide) it shifts toward the curves that your diffusion transformer model recognizes as electronic.
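The trade-off between keeping the input's shape and shifting toward a new style can be sketched with a toy example. This is not Stable Audio's algorithm. The latent is a short list of numbers, the noise step is a plain linear blend rather than a real diffusion noise schedule, and the "denoiser" simply pulls toward a fixed style target. All names here are hypothetical.

```python
import random


def add_latent_noise(latent, noise_level, seed=0):
    """Blend the latent with Gaussian noise; 0.0 keeps it intact, 1.0 replaces it."""
    rng = random.Random(seed)
    return [(1 - noise_level) * x + noise_level * rng.gauss(0.0, 1.0) for x in latent]


def style_transfer_sketch(latent, style_target, noise_level, seed=0):
    """Toy 'denoiser': pull the noised latent toward a style target.

    The pull is proportional to noise_level, mimicking the idea that the
    model can only reshape as much of the path as the noise has opened up.
    """
    noisy = add_latent_noise(latent, noise_level, seed)
    return [n + noise_level * (t - n) for n, t in zip(noisy, style_target)]


original = [0.5, -1.0, 2.0]          # stand-in for an acoustic song's latent
electronic_style = [1.0, 1.0, 1.0]   # stand-in for what the model "knows" as electronic

untouched = style_transfer_sketch(original, electronic_style, noise_level=0.0)
subtle = style_transfer_sketch(original, electronic_style, noise_level=0.1)
replaced = style_transfer_sketch(original, electronic_style, noise_level=1.0)
```

With no noise the input comes back unchanged, with full noise only the target style survives, and small values in between keep most of the original shape.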

If you're interested in learning more about the technical nuts and bolts of VAEs and AI generated music, check out Valerio Velardo's free Sound of AI course on YouTube.

Here's a second demo of Stable Audio 2.0 where text and audio are combined to create new musical material. You can sign up and try the app for free here.

Audio-to-MIDI-to-Audio (A2M2A)

How many hyphenated words can a music tech marketer chain together?

Believe it or not, audio-to-MIDI-to-audio is actually a popular music software niche. The industry hasn't given it a proper name yet, so I cobbled this one together as an umbrella for tools that operate on a shared premise.

Ordinary audio-to-MIDI tools, like Samplab and RipX, convert audio into MIDI notation for the purpose of transcription rather than sound design or tone transfer. It's up to the user to take that MIDI content and drag it into their DAW.

A2M2A (audio-midi-audio) software leverages the MIDI intermediary to provide more granular control over the end goal, which is tone transference. It also eliminates the weird audio artifacts that sometimes result from A2A models.

By contrast, AI timbre transfer tools convert audio files directly to audio, so it's not possible to edit the pitch or rhythm values of the result.

Vochlea Dubler 2: Voice to MIDI to Audio

Vochlea currently offers one of the most popular A2M2A plugins. As the company's name implies, it's a tool centered on voice inputs. Popular use cases include turning vocals into bass lines and synth leads or beatboxing into drum beats.

Musicians appreciate the granular controls over key signature, which come in handy if you don't sing with perfect intonation. Think of it like autotune, but instead of adjusting the pitch of your audio, it funnels your tones into the nearest note in that key signature.
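Snapping a sung pitch to the nearest note in a key can be sketched in a few lines. This is a minimal illustration of the general idea, not Dubler 2's implementation. The function names and the choice of a major scale are assumptions, and a real tool would first run pitch detection on the audio to get the frequency.

```python
import math

MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]  # semitone offsets from the tonic


def freq_to_midi(freq_hz: float) -> float:
    """Convert a frequency to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69 + 12 * math.log2(freq_hz / 440.0)


def snap_to_key(freq_hz: float, tonic: int = 0, scale=MAJOR_SCALE) -> int:
    """Return the MIDI note in the given key that is closest to the sung pitch.

    tonic is a pitch class from 0 (C) to 11 (B).
    """
    midi = freq_to_midi(freq_hz)
    allowed = {(tonic + step) % 12 for step in scale}
    candidates = [n for n in range(128) if n % 12 in allowed]
    return min(candidates, key=lambda n: abs(n - midi))


# 270 Hz sits between C4 (MIDI 60) and C#4 (MIDI 61); in C major it snaps to C4.
snapped = snap_to_key(270.0, tonic=0)
```

Because the output is a MIDI note rather than processed audio, the snapped pitch can drive any virtual instrument without the artifacts of pitch-shifting the voice itself.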

In the video above, the artist Beatox demos how to combine live instrument performance on bass and keys with Dubler 2's voice-activated drum triggering.

MIDI allows users to choose the precise samples that they want to trigger, based on the pitches that they intend to hit. This is particularly helpful with beatboxing, because you can set up a kit and assign "s" sounds to hi-hats, "t" sounds to snares, "p" sounds to kick drums, and so forth.
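The beatbox kit described above boils down to a lookup table from a recognized sound to a MIDI drum note. The sketch below assumes the sounds have already been classified into the letters "p", "t", and "s", which is the hard part that a tool like Dubler handles on the audio. The note numbers follow the General MIDI percussion map, and the function name is hypothetical.

```python
# General MIDI percussion note numbers
GM_KICK = 36        # Bass Drum 1
GM_SNARE = 38       # Acoustic Snare
GM_CLOSED_HAT = 42  # Closed Hi-Hat

DRUM_MAP = {"p": GM_KICK, "t": GM_SNARE, "s": GM_CLOSED_HAT}


def beatbox_to_midi(sounds: str) -> list[int]:
    """Translate a string of classified mouth sounds into MIDI drum notes.

    Characters that are not in the kit (such as spaces) are skipped.
    """
    return [DRUM_MAP[ch] for ch in sounds if ch in DRUM_MAP]


pattern = beatbox_to_midi("p s t s")
```

Swapping the kit is just a matter of changing the values in `DRUM_MAP`, which is the kind of control that direct audio-to-audio models don't expose.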

Jam Origin's Guitar-to-Instrument MIDI controller

Any instrument can be used in the audio-to-midi-to-audio conversion chain. Jam Origin is an example of a MIDI controller that uses guitar audio as its input and facilitates realtime output for live performances. The timbre of the output is flawless, unlike AI-powered timbre transfer which tends to introduce unwanted audio artifacts.

In the video above, a guitarist turns the timbre of his guitar into an upright classical piano, accordion, cello, woodwind, saxophone, and more. The demo includes screenshots of the virtual instrument they used in their DAW to achieve these high quality sounds.

Stem separation (Track-to-instruments)

Stem splitting is a bit different from all of the other examples we've shared in this article. Audio-to-audio usually refers to transforming an initial file into something that's substantially different, like a voice to a synthesizer.

However, stem separation is one of the most popular types of AI audio models today. There's no argument that the audio output differs from the input, so we need to include it in the list. People are using splitters to isolate or remove vocals from a mixed track, or to grab just the drums, guitar, keys, bass, or some other instrument in the mix.

Here are a few examples of the different types of stem splitters out there.

  1. Splitter.ai is one of the only free web applications we've found offering this service. It's powered by a site called VocalRemover. If you try to process too much audio at once, you'll get throttled, so just try to pace yourself.

  2. RipX and Samplab 2 are stem separators that include note-manipulation so you can adjust individual tones up or down while retaining the instrument timbre.

  3. Logic Pro 11 and FL Studio are two popular DAWs that have started offering stem separation as a native feature, so you don't have to use websites anymore.

There are dozens of stem separators out there, so do your homework and find a tool that's in your price range. Just be aware that some companies charge a premium on a per-track basis, while the DAWs and standalone apps offer similar quality output for a lifetime of unlimited processing.

Voice cloning for singers

Voice-to-voice cloning is the last category we'll touch on in this article. It's been a popular and controversial niche, with dozens of startups popping up in 2023 and 2024 to offer variations on the same basic service.

Most of these companies are using the same open source model called RVC under the hood. That means that the biggest differentiators you'll find are the ability to train your own models, the ethical training of voice models by the service, and the audio output quality of the voice models.

We've penned a comprehensive overview of singing AI voice generators here. Check it out to explore some of the most popular apps in the field.

This concludes our roundup of audio-to-audio tools. Sign up to AudioCipher's newsletter for updates about Suno, Google's Lyria model, and other newsworthy products that may arrive in the A2A space later this year.

