AutoGPT Music Agent Solves The GPT-4 Text-to-MIDI Problem

AutoGPT is an "AI agent" that expands on the existing capabilities of ChatGPT. Users define a goal and the agent generates a list of sub-tasks that should lead to the achievement of that goal. AutoGPT is not presented as a music app, but it could be the key to unlocking the full power of GPT-4 MIDI generation tools.

In this article, I'll share a quick overview of ChatGPT and how people have started using the underlying tech (GPT-4) to create music in an AI DAW like WavTool. From there, we'll take a look at GPT's biggest limitation with MIDI composition and how AI agents like AutoGPT can help.

What is text-to-MIDI?
How does WavTool use GPT-4 to create music?
The big problem with GPT-4 text-to-MIDI composing
AutoGPT music: Goals and sub-tasks to the rescue
Demo of AutoGPT music composition with sub-tasks
Text-to-MIDI may be immune to copyright claims

What is text-to-MIDI?

Text-to-MIDI is an emerging trend in music software that allows users to type in a word or phrase and transform it into MIDI output, for the purpose of making music.

AudioCipher is one of the most popular text-to-MIDI tools for musicians. The plugin loads directly into a DAW and can be fine-tuned according to your preferred key signature, chord extensions, and rhythm automation. These MIDI melodies and chord progressions are based on a musical cryptogram algorithm.
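For readers new to the idea, a musical cryptogram maps letters to pitches. The sketch below is a generic illustration and not AudioCipher's actual algorithm: the function name text_to_midi_notes, the letter-to-scale-degree mapping, and the choice of C major starting at middle C (MIDI note 60) are all assumptions made for the example.

```python
# Hypothetical musical cryptogram; NOT AudioCipher's real algorithm.
C_MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]  # semitone offsets of a major scale


def text_to_midi_notes(text, root=60, scale=C_MAJOR_STEPS):
    """Map each ASCII letter to a MIDI note by walking the alphabet up the scale."""
    notes = []
    for ch in text.lower():
        if not (ch.isascii() and ch.isalpha()):
            continue  # skip spaces, digits, punctuation, and non-ASCII letters
        index = ord(ch) - ord("a")                  # a=0, b=1, ...
        octave, degree = divmod(index, len(scale))  # wrap into higher octaves
        notes.append(root + 12 * octave + scale[degree])
    return notes


melody = text_to_midi_notes("Cab!")
```

With these assumed defaults, every letter lands inside the valid MIDI range of 0 to 127, so the resulting list can be handed to any MIDI writer.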

A second company, Mubert, offers text-to-MIDI services through Hugging Face. Unlike AudioCipher, the app runs in your browser and generates full songs with a single prompt. Their product analyzes your text input and attempts to match it with the metadata labels on existing MIDI loops.

AutoGPT isn't relevant for using AudioCipher or Mubert, but it could play a pivotal role in the success of a third text-to-MIDI DAW called WavTool. We'll be focusing on that use case for the remainder of this article.

How does WavTool use GPT-4 to create music?

WavTool is a next-gen digital audio workstation that comes with an AI composing assistant, powered by GPT-4. Users can ask the DAW to perform any number of tasks related to MIDI composition. In your own words, you can ask it to generate chord progressions and melodies, synthesize instruments in a wavetable, and add sound effects like reverb or delay.

Our first experience with WavTool was awe-inspiring. We had previously covered ChatGPT and its ability to create written descriptions of music. The novelty of watching a computer generate chord charts and tablature had entertained us, but WavTool brought everything down to Earth.

The DAW has a general understanding of GPT's musical descriptions and makes its best effort to act on those instructions. I've spoken with the company's founder, who explained that they're continuously training and improving upon the DAW's musical skills and capabilities.

For a complete overview of WavTool and how it works, check out this featured article we wrote earlier this year.

The big problem with GPT-4 text-to-MIDI composing

I don't like seeing AI tools hyped up unless they actually work. Most AI software has some fatal flaw that you only discover once you start using it. We always try to cover these tools with a balanced and honest perspective.

I experimented with hundreds of WavTool prompts and watched it successfully act upon every layer of the DAW. Despite this impressive functionality, it seems GPT-4 has one major flaw when it comes to composing MIDI tracks. To understand this problem, we need to confront the elephant in the room.

GPT-4 is a large language model, and what it knows about music comes from written descriptions. It has an advanced understanding of music theory concepts, but that knowledge does not grant it the ability to compose complex music from a single prompt, the way AI image generators can produce a detailed picture from one.

An AI music theorist with no ears and a bad memory

Imagine a person who never heard a sound in their life but memorized every concept from every public book on music theory. They have mediocre short-term memory and will only self-reflect on the music they create when you tell them to. If you ask them to write a unique melody in the DAW, they do it impulsively and hand over the first half-baked clip they come up with.

This doesn't describe the overall state of ChatGPT but it does represent my experience as a user, trying to generate music with WavTool.

Listening and improving on music is essential to songwriting

GPT-4 lacks the ability to experience the sounds it creates. But the real problem with GPT-4 is that it relies on a simple call-and-response format. Users submit a prompt and receive a single response in turn. The AI model does not look back at what it just created to ask itself whether it met the criteria of the prompt.

When you ask WavTool's AI composer to write a "unique 16-bar chord progression in C# minor", it makes a single attempt to do so and then prints the MIDI track.

Given prompts like this, WavTool might produce a C# minor scale that repeats for 8 bars. That's because the prompt uses a vague quality of "uniqueness" that the large language model can't adequately translate to more complex musical ideas.

The more specific your prompt, the better your outcome.

Bad prompt: "Write a good melody"

Better prompt: "Write a syncopated melody in F# harmonic minor"

Best prompt: "Write a syncopated melody in F# harmonic minor, using a balanced combination of quarter, eighth, and sixteenth notes. Do not repeat a musical phrase more than two times".

The better prompts require a more advanced musical vocabulary. The same is true for AI image generators like DALL-E 2 and Midjourney, which is why prompt engineers have been able to sell highly descriptive text commands to people seeking a specific visual aesthetic.

I've been exploring a new technique that may solve the need for a rich music vocabulary. This would help beginner musicians use GPT-4 MIDI generation tools without prior music theory knowledge.

AutoGPT music: Goals and sub-tasks to the rescue

AI agents like AutoGPT run on GPT-4, just like WavTool, but they introduce a single new feature that could change everything. When you provide a prompt, AutoGPT interprets it as a goal and generates a linear series of sub-tasks that it tries to accomplish.

Typically, if WavTool provides a mediocre MIDI track, I will use natural language to point out the need for improvement. It acknowledges the failure to produce a unique melody and will iterate on the idea. However, the second and third rounds are typically pretty close to the original. I learned that I would have to come up with a better-written prompt if I wanted to guide it toward the ideal outcome.

My experiments with AutoGPT showed me that a single goal like "write an interesting melody" can be broken into richly descriptive sub-tasks. The AI agent actually includes its own quality checks, to ensure that a sub-task was successfully completed before moving on to the next one.

ChatGPT and WavTool's AI composer assistant both lack self-reflection, but an AI agent's ability to course-correct automatically like this could soon change that.
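To make the generate-then-check idea concrete, here is a minimal sketch of such a loop. It is not how AutoGPT or WavTool is implemented: the generate callable stands in for a model call, the helper names are hypothetical, and the only quality check is the "do not repeat a musical phrase more than two times" rule from the prompt examples earlier, with a phrase assumed to be four consecutive notes.

```python
from collections import Counter


def max_phrase_repeats(notes, phrase_len=4):
    """Return how often the most common run of `phrase_len` notes occurs."""
    windows = [tuple(notes[i:i + phrase_len])
               for i in range(len(notes) - phrase_len + 1)]
    return max(Counter(windows).values(), default=0)


def compose_with_checks(generate, is_acceptable, max_rounds=5):
    """Call `generate` until `is_acceptable` passes or the rounds run out.

    `generate(feedback)` is a hypothetical stand-in for a model call.
    Returns (melody, rounds_used, passed).
    """
    feedback = None
    melody = []
    for round_number in range(1, max_rounds + 1):
        melody = generate(feedback)
        if is_acceptable(melody):
            return melody, round_number, True
        feedback = "A phrase repeats more than twice; vary the melody."
    return melody, max_rounds, False


# Stub generator: the first attempt loops one figure, the second varies it.
# Pitches are MIDI note numbers drawn from C# minor.
attempts = iter([
    [61, 63, 64, 66] * 4,
    [61, 64, 68, 66, 63, 61, 69, 68, 64, 66, 73, 71],
])


def stub_generate(feedback):
    return next(attempts)


def no_phrase_more_than_twice(melody):
    return max_phrase_repeats(melody) <= 2


result = compose_with_checks(stub_generate, no_phrase_more_than_twice)
```

The point of the design is that the check runs in code rather than relying on the user to notice a problem and re-prompt. A real agent would need richer checks than this one, and a model call in place of the stub.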

If you want to try AutoGPT, clone the latest stable version from their GitHub repository. Here are instructions for installing AutoGPT on Mac and Windows.

Demo of AutoGPT music composition with sub-tasks

Now that you understand the value of AutoGPT for making music, it's time to give you a demo of what these goals and sub-tasks look like in practice. I gave AutoGPT the goal of writing a unique melody. You can review that process in the screenshot below.

AutoGPT's first thought was to brainstorm melodies and generate a MIDI file. It would draw from basic music theory concepts, avoid plagiarizing any existing music (to avoid legal complications), and double check that the melody it created was actually pleasing to the ear. That final step is the crux of this whole article, because it's precisely the action that's missing in WavTool today.

After each round of generations, AutoGPT asks the user whether they would like to continue by authorizing the next command. You can approve them one at a time or use the command 'y -N', where N represents the number of rounds AutoGPT should go through before checking back in. So for example, you could type y -10 to authorize the next ten rounds of generations.

Now here's the rub. AutoGPT can technically create MIDI files on your hard drive, but they have no content in them and won't work when you drag them into a DAW. That's a bummer on the surface, but it's only a short-term limitation. The AI agent is working through the full plan conceptually. It just needs something else to execute on those instructions.
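For context, a playable Standard MIDI File needs a header chunk plus at least one track chunk that contains timed note events. The sketch below builds a small format-0 file using only the Python standard library. The function names, the 480 ticks-per-quarter resolution, the fixed velocity, and the example pitches are illustrative assumptions. This is not how AutoGPT or WavTool writes MIDI.

```python
import struct


def encode_vlq(value):
    """Encode a non-negative int as a MIDI variable-length quantity."""
    out = [value & 0x7F]                   # last byte has its high bit clear
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)  # earlier bytes set the high bit
        value >>= 7
    return bytes(reversed(out))


def build_midi(notes, ticks_per_quarter=480, velocity=80):
    """Return bytes for a format-0 MIDI file playing `notes` as quarter notes.

    `notes` is a list of MIDI pitches (0-127), played one after another.
    """
    track = bytearray()
    for pitch in notes:
        # Note on (channel 1) with no delay after the previous event.
        track += encode_vlq(0) + bytes([0x90, pitch, velocity])
        # Note off one quarter note later.
        track += encode_vlq(ticks_per_quarter) + bytes([0x80, pitch, 0])
    # Mandatory end-of-track meta event.
    track += encode_vlq(0) + bytes([0xFF, 0x2F, 0x00])

    # Header: chunk length 6, format 0, one track, timing resolution.
    header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, ticks_per_quarter)
    return header + b"MTrk" + struct.pack(">I", len(track)) + bytes(track)


data = build_midi([61, 64, 68])  # C#, E, G#: a C# minor arpeggio
```

To save the result, write the bytes in binary mode, for example with open("melody.mid", "wb").write(data). The file name is arbitrary.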

This is why I believe WavTool, or some other AI MIDI DAW, will eventually integrate with an AI agent, so that the user can set a descriptive music generation prompt as their goal, while the DAW handles all the sub-tasks necessary to deliver the final product in a reasonably good state.

Text-to-MIDI may be immune to copyright claims

The music industry has been voicing opposition to AI music tools as they continue to surface and pour out across the internet. Universal Music Group has won some of its takedown requests while the RIAA makes legal threats to indie developers.

Vocal deepfakes are center stage in the conflict, dominating our collective conversation about artificial intelligence and music creation.

As non-musicians rush in to imitate famous artists and capitalize on their likeness, other non-musicians chime in with predictable objections. One camp gleefully trolls on Twitter while the other announces the imminent death of music.

Experts are drawing comparisons to other phases of history, when synthesizers and drum machines evoked a similar public response.

Our obsession with fake Grimes and Drake songs seems to be eclipsing other interesting developments in this space. Niche AI music generators like WavTool, a text-to-MIDI AI DAW, are still off the radar for most people. People chasing the one-click solution to music generation are less likely to explore a digital audio workstation powered by GPT-4.

This tide could change in the coming year, as AI agents find a home within more DAWs like WavTool. As soon as the first non-musicians get into one of these workstations and have the text-to-music experience our readers long for, word of this new app will spread like wildfire.

I don't think the industry has prepared for this. They're more likely to feel threatened by apps like Harmonai and Jukebox that create entire finished songs in the likeness of copyrighted material.

Unlike with a vocal deepfake, there is no legal precedent for suing someone who draws inspiration from another artist. The ambiguity here sets the stage for a phase of music creation where users can imitate an artist's style without directly copying their music.

Voice changers are low-hanging fruit. Composing and modifying music in the style of another artist will be impossible for streaming platforms to detect and too expensive for the recording industry to micromanage.

We will continue to watch this space closely and report back with updates. In the meantime, music producers who want to start practicing their text-to-MIDI workflows in the DAW can use AudioCipher to get started.