
Muzic: Microsoft AI Music Team Builds Text-to-MIDI & More

Microsoft's AI music research team, Muzic, is based in Asia and has at least 14 research papers to its name. All of them are available in English, as is the GitHub repository where most of these projects live. The research team operates independently of Microsoft's core product teams, so its experiments don't necessarily imply that similar software is on the company's product roadmap.


As a research team, Muzic has not released any consumer products to date. That said, generative AI presents an interesting opportunity for Microsoft to get in the race and contend with Apple in a new generation of emerging software. It's also worth noting that research programs have influenced the development of new Microsoft features in the past.


In December 2023, Microsoft announced that Copilot would partner with Suno AI to deliver text-to-song functionality. It makes sense that they outsourced this challenge; judging from the pace of development at Muzic, Microsoft's in-house efforts are slightly behind the curve.


We first heard murmurs of Microsoft's efforts to build AI music software in 2021, when a few news sites reported on a technical paper called DeepRapper, describing a neural rap generation tool. The research projects have continued to pile up in the two years since, swelling into an impressive collection of machine learning tools.


In the near future, some AI developers have predicted a transition from digital audio workstations to generative audio workstations. To my knowledge, the term "GAW" was coined by AI music visionary Samim Winiger of Okio.


This shift toward AI collaboration tools in the DAW could easily take place over the next 5 years, as generative workflows like text-to-music and sonic intelligence become more commonplace. We've already seen some signs of this with the chat-to-music features of an AI DAW called WavTool and self-organizing AI sample managers.


In this article, I've done my best to compile a comprehensive overview of Muzic's entire catalog of research papers, summarizing the key ideas from each project. Before diving into those summaries, I'll start by sharing some open questions about the benefits that might come from Microsoft's growing partnership with Meta.



Microsoft bonds with Meta, commits to AI safety


On May 31st, Microsoft's Muzic team published a new AI music paper describing a text-to-MIDI generator called MuseCoco. Based on what we gleaned from the report, their efforts seem to be lagging behind other text-to-audio generation tools like Google's MusicLM and Meta's MusicGen, in the sense that there is no consumer-facing application (or even a video demo of the software).


There is some precedent for Windows to focus more on MIDI generation than audio. For example, Microsoft's principal software engineer Pete Brown chairs the Executive Board of the MIDI Association, and he has been leading an initiative to implement MIDI 2.0 on Windows machines. On June 19th, 2023, he announced a big update to the Microsoft MIDI GitHub repo and detailed many of the improvements to their system.


Clearly Microsoft is invested in MIDI, which may explain why their ML team has been so focused on this space. Then again, MIDI data is also easier to procure from the public domain, cheaper to run experiments with, and sidesteps many of the challenges associated with achieving high-quality timbre.


For these reasons, we've been a bit bearish on Microsoft's AI music efforts here at AudioCipher.


But in July 2023, two big announcements caught our attention and rekindled our interest in this program. They got me thinking about how Microsoft might soon become a real contender in the AI music software space, if they want to.


The first was a public statement from Microsoft on July 18th, confirming that Meta would continue to partner closely with them to make Meta's latest large language model, Llama 2, available to developers on Windows and Azure. A key stipulation of the agreement is that Azure would be the chosen cloud provider for Meta's machine learning library, PyTorch.


Bear in mind that in June 2023, Meta released MusicGen, the most powerful AI music generator to date. Benchmark evaluations suggest that MusicGen outperforms Google's competing text-to-music model, MusicLM. But more than that, Meta's codebase is entirely open source, meaning any developer can fork it and build on AudioCraft, the underlying PyTorch-based library.


Additionally, on July 21st, US president Joe Biden wrangled Meta, Microsoft, Amazon, OpenAI, and other leading AI companies into a common agreement to reduce the risks associated with artificial intelligence. Music is a particularly safe space to experiment with machine learning, which leads me to believe that both Meta and Microsoft will continue to develop their tools in a non-competitive spirit.


Meanwhile, OpenAI has seemingly abandoned its AI music efforts with Jukebox and MuseNet in favor of ChatGPT. To my knowledge, OpenAI's music generation efforts were not based on consensual music licensing partnerships. The potential for legal repercussions could be the reason they bailed, although no public statement has been made at this time.


Meta, on the other hand, has partnerships with Pond5 and Shutterstock Music, and used those music libraries as training data for MusicGen. This means Meta can continue to use that dataset for training without any risk of legal repercussions. The potential for misuse by end users could be mitigated through clear user agreements about publishing rights, and Meta already has music revenue-sharing programs in place as of 2022.


So if Meta and Microsoft are closely partnered on AI, what could this mean for the future of generative audio? Will a next-generation DAW, or a collection of consumer software, emerge from Microsoft and Meta's teams? If so, will it pose a meaningful threat to Apple's dominance in this space?


We'll have to put that speculation aside for the time being. In the remainder of this article, we'll take a deep look at each of Microsoft's Muzic tools. I've done what I can to summarize each paper in simpler terms, explaining some of the technical jargon (insofar as I understand it) and making it more accessible to the public.


14 AI music tools in the Microsoft "Muzic" suite

Microsoft Muzic logo

Muzic is a research project from Microsoft that focuses on two branches of AI music: understanding and generation. Both efforts are based on deep learning and artificial intelligence. The program was started by researchers at Microsoft Research Asia, with some contribution from outside collaborators.


The majority of commercial AI music apps keep their tech sealed within a vault, delivering only the final output of their music generation. In contrast, we are seeing a very open approach from Microsoft Muzic and Meta's MusicGen teams. As well-funded companies, they can afford to experiment in the open without fear of going bankrupt from a "vampire attack" by some competing company.


The information in this article comes from the Muzic team's Github page and some related repositories in its orbit. You can view the following diagram for an example of how they split up their efforts into two main categories:


Comparing Microsoft's music generation and music understanding systems
Diagram from the Muzic Github page

Muzic's understanding of music includes phases of stem separation, transcription, classification, retrieval, and recognition. These inform its ability to generate song lyrics and melodies, chord accompaniments and arrangements, AI voices, and instrument timbre (tone quality), and to mix all of those layers together.


We previously covered the AI music datasets used by Google's MusicLM, to provide a digestible overview of where they sourced their music and how it was leveraged for music generation. This summary was well received, which is why we'll be attempting to do the same for these Microsoft Muzic papers.

AI Music Understanding


The three music-understanding tools in Muzic's toolkit are MusicBERT, PDAugment, and CLaMP. Each one has a unique function: training efficiently on MIDI data, studying the relationship between vocal tracks and the phonemes (mouth sounds) of lyrics, and using AI to search for existing music in a collection.

MusicBERT: Symbolic Music Understanding


The phrase "symbolic music" might bring to mind film scores that symbolize the emotions driving a story or narrative. However, in machine learning, symbolic music refers to the use of MIDI in training data. That's because MIDI notation merely symbolizes audio frequencies (in contrast to raw audio).


MusicBERT is a large-scale model powered by a technique the team calls OctupleMIDI. It encodes each note with eight elements: time signature, tempo, bar, position, instrument, pitch, duration, and velocity. It uses the Lakh MIDI Dataset, aligned to the Million Song Dataset, for its training.


MusicBert OctupleMIDI model structure

The OctupleMIDI encoding method significantly reduces the length of music sequences compared to other AI methods, making it easier to model them with Transformers.
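To make the encoding concrete, here's a minimal sketch of what an OctupleMIDI-style note token could look like in Python. The field names and example values are my own assumptions based on the description above, not the team's actual implementation:

```python
from dataclasses import dataclass, astuple

@dataclass
class OctupleToken:
    """One note as a single 8-element token, following the OctupleMIDI idea."""
    time_signature: str  # e.g. "4/4"
    tempo: int           # BPM (quantized in practice)
    bar: int             # which bar the note falls in
    position: int        # position within the bar (e.g. 16th-note steps)
    instrument: int      # General MIDI program number
    pitch: int           # MIDI pitch 0-127
    duration: int        # length in quantized steps
    velocity: int        # loudness 0-127

# A single C4 quarter note on piano becomes one token instead of
# several separate time/note-on/note-off events:
note = OctupleToken("4/4", 120, bar=0, position=0, instrument=0,
                    pitch=60, duration=4, velocity=90)
print(astuple(note))  # ('4/4', 120, 0, 0, 0, 60, 4, 90)
```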


MusicBERT also uses something called a "bar-level masking strategy" to prevent information leakage during pre-training. In less complex terms, this encourages the model to learn general features that can be applied to a wide range of music, rather than memorizing specific details of individual bars.


By learning general features of MIDI music, the system becomes more flexible and can support a wider range of music understanding and classification tasks.
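To illustrate the bar-level masking idea described above, here's a rough simplification (my own, not the paper's exact scheme): hide one attribute for every note in a chosen bar and ask the model to reconstruct it.

```python
MASK = "<mask>"

def bar_level_mask(tokens, attribute_index, target_bar, bar_index=2):
    """Mask one attribute (e.g. pitch) for every note in a given bar.

    tokens: list of 8-element lists in OctupleMIDI-style order,
            where position `bar_index` holds the bar number.
    """
    masked = []
    for tok in tokens:
        tok = list(tok)
        if tok[bar_index] == target_bar:
            tok[attribute_index] = MASK  # hide this attribute across the bar
        masked.append(tok)
    return masked

sequence = [["4/4", 120, 0, 0, 0, 60, 4, 90],
            ["4/4", 120, 0, 4, 0, 64, 4, 90],
            ["4/4", 120, 1, 0, 0, 67, 8, 80]]
# Mask the pitch (index 5) of every note in bar 0:
print(bar_level_mask(sequence, attribute_index=5, target_bar=0))
```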


For an example of how MusicBERT could be implemented, check out this public Google Colab for the MIDIFormers project.


PDAugment: Automatic Lyrics Transcription


If you've ever tried using AI lyric generators or spinning up a verse in ChatGPT, you probably noticed that it struggles with creating good phrasing.


PDAugment addresses this problem by creating a more nuanced system for understanding lyrics, based on actual musical qualities like pitch and duration. It's not a lyric generator, but it could provide the foundation for creating better content further down the pipeline.


The letters "PD" stand for the pitch and duration adjustments that are made to the input speech signals during the training process.


The other half of its name refers to augmentation, a common practice in machine learning. It's used to improve the performance of existing models by exposing them to a wider range of inputs. By increasing the size and diversity of the training dataset, data augmentation can help reduce overfitting and improve the generalization ability of the model.
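As a loose sketch of the general idea of pitch and duration augmentation on audio (not the authors' actual alignment pipeline), something like this is possible with librosa; the shift and stretch amounts are arbitrary placeholders:

```python
import librosa

def pitch_duration_augment(path, n_steps=2.0, stretch_rate=1.1):
    """Return a copy of the audio shifted in pitch and stretched in time.

    n_steps: semitones to shift the pitch up (negative shifts down).
    stretch_rate: >1 speeds the audio up, <1 slows it down.
    """
    y, sr = librosa.load(path, sr=None)
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    y = librosa.effects.time_stretch(y, rate=stretch_rate)
    return y, sr

# augmented, sr = pitch_duration_augment("speech_clip.wav")
```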

Using PDAugment to improve monophonic vocal melodies

So which training datasets does PDAugment apply its pitch and duration adjustments to?

  1. DSing30: This dataset contains 4,000 monophonic karaoke recordings of English pop songs with nearly 80,000 utterances, performed by 3,205 singers. It is often used as a benchmark for evaluating the performance of ALT (automatic lyrics transcription) systems.

  2. Dali Corpus: This dataset consists of 1200 English polyphonic songs with a total duration of 70 hours.

The authors of the paper successfully augmented these datasets by applying their PDAugment algorithm to the audio recordings contained within them. By doing so, they aimed to increase the size and diversity of the training datasets, improving the performance of ALT systems.


CLaMP: Contrastive Language-Music Pre-training


Have you ever wished for a better way to find that song you just can't seem to remember the name of? That's exactly the problem that the team behind CLaMP set out to solve.


The secret sauce? A process known as Contrastive Language-Music Pre-training, or CLaMP for short. Rather than relying on keywords or exact matches, CLaMP pairs up language and music in a whole new way. It was trained on a massive dataset of 1.4 million music-text pairs, though the team does not disclose where those pairs came from.
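Contrastive pre-training generally works by embedding the text and the music into a shared space, pulling matching pairs together and pushing mismatched pairs apart. Here's a minimal PyTorch sketch of that kind of objective (a generic contrastive loss, not CLaMP's actual code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, music_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching text/music pairs.

    text_emb, music_emb: (batch, dim) embeddings where row i of each
    tensor describes the same piece of music.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    music_emb = F.normalize(music_emb, dim=-1)
    logits = text_emb @ music_emb.T / temperature  # pairwise similarities
    targets = torch.arange(len(logits))            # the diagonal pairs match
    loss_t = F.cross_entropy(logits, targets)      # text -> music direction
    loss_m = F.cross_entropy(logits.T, targets)    # music -> text direction
    return (loss_t + loss_m) / 2
```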


The project included the release of WikiMusicText (WikiMT), a new dataset with over 1,000 lead sheets accompanied by a title, artist, genre, and description. These descriptions were obtained through AI automation that scanned the lead sheets, pulled out the artist and title information, and then scraped genre and description data from a corresponding Wikipedia entry.


The CLaMP team provides a few Hugging Face Spaces to showcase their tech in action.



  1. CLaMP - Semantic Music Search: Users can search for music based on natural descriptions of a song, including attributes like genre and mood.

  2. CLaMP - Zero-Shot Music Classification: Users input a piece of music and retrieve a written prediction about the music, for example who the composer might be.

  3. CLaMP - Similar Music Recommendation: Users input a musical piece and get recommendations for similar pieces, guided by matching text descriptions.
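Under the hood, semantic search in a system like this boils down to comparing the embedding of a text query against pre-computed music embeddings. A minimal sketch of that retrieval step (my own illustration, not CLaMP's API):

```python
import torch
import torch.nn.functional as F

def search(query_emb, music_embs, top_k=3):
    """Rank pre-computed music embeddings by similarity to a text query.

    query_emb: (dim,) embedding of the text description.
    music_embs: (n_pieces, dim) embeddings of the music collection.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), music_embs)
    return torch.topk(sims, k=top_k).indices  # indices of the best matches
```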


AI Music Generation


So far the examples we've provided have been based entirely on music analysis. Next up, we're going to look at Microsoft's AI music generation tools.


Lyric Generation: DeepRapper


I mentioned earlier that Muzic's DeepRapper has received the lion's share of attention from news media, first in 2021 and again in 2023, as music business news sites seek to report on competitors to Google and Meta's products.


Most AI lyric generators train on text only. They focus solely on lyric generation without any awareness of deeper rhythmic patterns.


DeepRapper is unique because it trains on actual audio files: it separates vocals from the instrumental accompaniment, studies the relationship between vocal patterns and lyrics, and then models the relationship between the vocal lyrics and the beat. You can see a diagram of that below:


DeepRapper vocal separation

Their data mining efforts resulted in three separate datasets:

  • D-Rap: 16,246 songs and 832,646 sentences

  • D-Song: 52,737 songs and 2,083,143 sentences

  • D-Lyrics: 272,839 songs and 9,659,503 sentences

Exciting as DeepRapper's technology sounds on the surface, the current output is a bit underwhelming. That's because their notation system is based on underlining words in a sentence, to show that a beat is aligned with that word. A screenshot from their paper is shown below:


Muzic's DeepRapper output example

There is no accompanying music for these lyrics, which makes the placement of the beats feel highly arbitrary. It would need to be combined with music to make sense.


The best example I've seen of a commercial product that successfully pulls off matching user-generated lyrics to a generative melody is VoiceMod's text-to-song app, dubbed the Musical Meme Machine. It's a silly use case but works well enough to illustrate the idea in practice.


SongMASS: Lyric-to-Melody and Melody-to-Lyric


Whether you're writing music on your own or with a group, singers typically have two problems to solve: writing melodies and writing lyrics. Each gives way to the inverse issue. Once a melody is written, you have to come up with lyrics, and lyrics in turn need a good melody.


SongMASS uses generative AI to tackle these two major challenges in lyric-to-melody and melody-to-lyric generation. Existing models have limited paired data and tend to follow strict alignments between lyrics and melodies, which means they have a difficult time being truly creative or flexible in their output.


SongMASS lyric and melody encoding and decoding

To conquer these obstacles, SongMASS employed a strategy called masked sequence-to-sequence pre-training, along with attention-based alignment constraints. This means the model learns to recognize patterns in your lyrics and melodies, allowing for more seamless relationships between the two. Let's break down how this works.


"Masked" refers to a technique where some of the tokens in a melody or lyrical phrase (individual elements in the sequences) are temporarily hidden during training. The goal is to train the model to learn how to generate complete songs even when some parts are missing or obscured. This leads to greater creativity and flexibility from the model.


"Sequence-to-sequence" means that the model is trained to convert one sequence (lyrics) into another sequence (melody), with the help of the masked tokens. Think of it like a game of fill-in-the-blanks, where the model has to figure out the missing bits to create a cohesive and meaningful lyrical melody.


So, putting it all together, "masked sequence-to-sequence" pre-training is a way to prepare the SongMASS model to handle incomplete or imperfect inputs, and to learn generalizable knowledge about the relationship between lyrics and melodies.
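As a toy illustration of the fill-in-the-blanks idea (a generic masked sequence-to-sequence setup, not SongMASS's actual code), you can hide a span of tokens and train the model to reconstruct it:

```python
import random

MASK = "<mask>"

def mask_span(tokens, span_len=3):
    """Hide a random contiguous span and return (corrupted_input, targets)."""
    start = random.randint(0, max(0, len(tokens) - span_len))
    corrupted = list(tokens)
    targets = tokens[start:start + span_len]
    corrupted[start:start + span_len] = [MASK] * span_len
    return corrupted, targets

lyric = ["twinkle", "twinkle", "little", "star", "how"]
corrupted, targets = mask_span(lyric)
# The seq2seq model sees `corrupted` and is trained to generate `targets`,
# learning to fill in missing lyric (or melody) tokens from context.
print(corrupted, "->", targets)
```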


Lyric-to-Melody Generation


There are three newer lyric-to-melody systems in the Muzic ecosystem, and they all outperform SongMASS. In this section we'll briefly touch on TeleMelody, ReLyMe, and Re-creation of Creations (ROC).


TeleMelody


The first stage of TeleMelody involves training a machine learning model on a dataset of existing songs. This model learns to recognize patterns and relationships between lyrics and melodies, and can then use this knowledge to generate new melodies based on given lyrics.

Architecture of TeleMelody

The template used in TeleMelody includes key elements like tonality, chord progressions, rhythm patterns, and cadence - all of which are crucial in shaping the overall sound and feel of a song. By incorporating these elements into the template, songwriters can exert more control over the melody-generation process and create songs that are more nuanced and emotionally resonant.
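For a sense of what such a template might contain, here's a hypothetical example; the field names and values are my own illustration of the elements named above, not TeleMelody's actual format:

```python
# A hypothetical lyric-to-melody "template" capturing the musical elements
# the paper says the melody is conditioned on, before any notes are generated.
template = {
    "tonality": "C major",
    "chord_progression": ["C", "G", "Am", "F"],
    "rhythm_pattern": [1.0, 0.5, 0.5, 1.0, 1.0],  # note lengths in beats
    "cadence": "authentic",  # how the phrase resolves at its end
}
```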


As the image below shows, TeleMelody outperforms the SongMASS system with both English and Chinese lyrics.

TeleMelody vs SongMASS

ReLyMe


Example of ReLyMe's underlying music theory logic

ReLyMe pushes lyric-to-melody generation even further by elaborating on the music theory principles baked into its deep learning model. The screenshot above shows some of the underlying logic that it takes into consideration. ReLyMe scores higher than TeleMelody in both melody and melody + lyric generation, as shown below:



ReLyMe vs TeleMelody performance

Re-creation of Creations



Unlike other approaches, ROC doesn't rely on end-to-end neural networks, which often struggle to capture the complex relationships between lyrics and melodies. Instead, ROC's retrieval-based composition allows for better alignment of rhythm and structure between the lyrics and melody, resulting in higher-quality generations.

ROC outperforms the SongMASS and TeleMelody systems when working with both English and Chinese lyrics. The paper doesn't mention ReLyMe (published November 2022), even though ROC was published in January 2023. This may suggest that the new system did not achieve statistically higher performance than ReLyMe.


Music Form/Structure Generation


MeloForm: Music Form Generation

Creating melodies that sound good and have emotional impact is tough. Even some of the best melody generators struggle to deliver high quality output. MeloForm uses neural networks in an effort to generate melodies that have precise musical form control and rich melodic expression.

MeloForm melody generator

The system starts by developing motifs, turning them into musical phrases, and finally into sections that repeat with variations. When the generated melody needs a bit more oomph, MeloForm uses a transformer-based refinement model to improve the melody without changing its musical form.


Where random note generators try to land on a good melody by chance, MeloForm checks its generated melodies for accuracy by comparing them to a set of reference melodies. These reference files come from the Lakh MIDI dataset, just like the aforementioned MusicBERT analysis tool.


It's worth noting that this MIDI dataset only covers thirteen genres of music, and none of the subgenres are indicated. Pop/rock is highly over-represented in the data, with 8,603 of 11,946 MIDI files falling into that genre, while genres like the blues have only 32 MIDI files. This may limit MeloForm's ability to compose equally well across genres, since higher volumes of data generally produce better outcomes in generative AI models.

Quantity of MIDI in Lakh MIDI Dataset

Museformer: Long/Short Structure Modeling


Just like a long sentence, a piece of music can be very long, containing thousands of notes. The program needs to be able to process and understand all of these notes effectively. Music also has its own special structure, including things like melody, harmony, and rhythm. A generative music model needs to recognize and replicate these structures in order to create music that sounds good.


Museformer tries to solve these two challenges, called long-sequence modeling and music structure modeling, using transformer-based models. It uses something called "attention" to focus on different parts of the music, allowing it to process long sequences and capture complex structures.


The diagram below shows how Museformer summarizes a collection of attributes from a single bar of music (like the shape and volume of notes in a melody). These summaries are then aggregated, or combined, into a single representation of that input data.

The Summarization and Aggregation steps

Each aggregation represents a particular section or theme within the song, and the algorithm generates a separate aggregation for each one. For example, a song might have one aggregation for the verse, another for the chorus, and another for the bridge. The aggregations capture unique characteristics and patterns present in that particular section of the song.
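Here's a very rough sketch of that summarization step, assuming simple attention pooling over the notes in each bar; this is my simplification, and Museformer's actual fine- and coarse-grained attention is more involved:

```python
import torch
import torch.nn.functional as F

def summarize_bar(note_embs, query):
    """Pool a bar's note embeddings into one summary vector via attention.

    note_embs: (n_notes, dim) embeddings of the notes in one bar.
    query: (dim,) learned summary query vector for that bar.
    """
    scores = note_embs @ query          # how relevant each note is
    weights = F.softmax(scores, dim=0)  # attention weights over the notes
    return weights @ note_embs          # (dim,) bar summary

# Later layers attend over these bar summaries (the "aggregation" step)
# instead of over every individual note, keeping long pieces tractable.
```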


Snippet of music from Muzic's Museformer

They test their algorithm on several benchmark datasets and show that it performs well, generating music that is both aesthetically pleasing and structurally sound. They also compare their algorithm to other state-of-the-art methods and show that it outperforms them in certain tasks. Here is the summary from their paper:

Ablation results comparing MuseFormer to other music transformers

Multi-Track Generation


We're almost at the finish line. The final step in this process is multi-track music generation. Let's have a look at those before we throw in the towel!


MuseCoco: Text-to-MIDI Generation

MuseCoco is Microsoft's first stab at a text-to-MIDI generator. The app comes across as a bit disconnected from the expectations of today's audience. Where Meta's MusicGen and Google's MusicLM text-to-music apps create downloadable samples, MuseCoco produces sheet music with no sound design. They're behind the curve in this regard. As if to underscore the point, their benchmarks for success were comparisons to GPT-4.



GPT is a large language model and is not trained as a composer, so it's a bit like saying that a beginner composer outperformed a mathematician who used equations to produce a melody without ever having heard a song in their life.


On the other hand, MuseCoco does offer one key advantage over MusicGen and MusicLM. By composing in multi-instrumental MIDI tracks, it's possible to add your own layers of sound design. The question at this point is whether MuseCoco has enough intelligence to create music that sounds good.


MuseCoco text-to-music diagram

The following MIDI datasets were used to train MuseCoco, amounting to almost 950,000 MIDI files. That's a substantial collection of music, though it's unclear what the contents of those datasets are.


When researching the datasets cited here, some of the numbers did not add up. For example, the MMD dataset shown in the screenshot below is listed at 1.5M files, but according to the MMD GitHub repository it contains only about 436,000. That's a differential of more than one million files. Did the researchers have access to an additional private repository? It's unclear what to make of this.



MuseCoco consists of two stages: text-to-attribute understanding and attribute-to-music generation. The system uses natural language processing techniques and machine learning algorithms to extract musical attributes from text descriptions and generate corresponding music.
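Conceptually, the two-stage pipeline could be sketched like this; both functions are hypothetical stand-ins for the paper's text-to-attribute and attribute-to-music models:

```python
def text_to_attributes(prompt: str) -> dict:
    """Stage 1 (hypothetical): extract musical attributes from a description."""
    # A trained classifier would infer these; hard-coded here for illustration.
    return {"instrument": "piano", "tempo": "fast", "key": "A minor",
            "time_signature": "4/4", "emotion": "tense"}

def attributes_to_midi(attributes: dict) -> bytes:
    """Stage 2 (hypothetical): generate MIDI conditioned on the attributes."""
    raise NotImplementedError("stand-in for the attribute-to-music model")

attrs = text_to_attributes("A fast, tense piano piece in a minor key")
# midi_bytes = attributes_to_midi(attrs)
```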


There are no public demos of the software, in their research paper or otherwise, which makes it difficult to evaluate the quality of its musical output. Check out our compilation of usable text-to-music apps to learn more about this software niche.


PopMAG: Accompaniment Generation


Until now, all of the Microsoft Muzic projects have been focused on creating melodies only. PopMAG was an early attempt from 2020 to understand MIDI input in a novel way. The technology has advanced significantly since then, but we'll start at the beginning.


In the system, multi-track MIDI events (MuMIDI) were represented as a single sequence of tokens, rather than each track being generated separately. This allowed the model to directly observe the relationship between separate tracks, like melody and chord harmony.


PopMAG's MIDI representation of chord and melody tokens

Music consists of many sections, and the sum total of their relationships is just as important as what happens at any given moment between a melody and chord. To account for this, PopMAG used a sequence-to-sequence model with a TransformerXL backbone. This allowed the model to take advantage of the long-range dependencies in the music, while still capturing the local information within each track.
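The key representational trick is flattening events from every track into one time-ordered stream. Here's a simplified illustration of that idea (my own sketch, loosely inspired by the MuMIDI description, not the actual tokenizer):

```python
def flatten_tracks(tracks):
    """Merge per-track note events into one time-ordered token sequence.

    tracks: dict mapping a track name to a list of (onset, pitch) tuples.
    Returns tokens like (track_name, onset, pitch), sorted by onset so the
    model sees melody and chord events interleaved in context.
    """
    events = [(onset, name, pitch)
              for name, notes in tracks.items()
              for onset, pitch in notes]
    return [(name, onset, pitch) for onset, name, pitch in sorted(events)]

tracks = {"melody": [(0, 72), (2, 74)], "chords": [(0, 48), (0, 52), (0, 55)]}
print(flatten_tracks(tracks))
# [('chords', 0, 48), ('chords', 0, 52), ('chords', 0, 55),
#  ('melody', 0, 72), ('melody', 2, 74)]
```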


PopMAG's Token embeddings

GETMusic: Any Track Music Generation


GETMusic is a more recent AI music generation effort from Muzic that lets users create customizable instrumental tracks from scratch or based on existing ones. The name GET stands for "GEnerate music Tracks". A bit of a stretch, but hey, there are only so many cool AI music generator names available.


At its core, GETMusic employs a unified representation called GETScore, which arranges notes as tokens in a 2D structure, with tracks stacked vertically and progressing horizontally over time. Each musical note is represented by a single token, making it easier for the model to learn and generate harmonious music. The diffusion model, named GETDiff, is non-autoregressive, meaning it can predict missing tokens without relying on sequential predictions like other models do.
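To picture GETScore, imagine a small grid with one row per track and one column per time step, where missing cells can be masked and then filled in all at once by the non-autoregressive model. A toy illustration (not the real token vocabulary):

```python
import numpy as np

EMPTY = 0  # placeholder for silent positions; real systems use special token ids

# Rows are tracks, columns are time steps; values stand in for note tokens.
getscore = np.array([
    [60, 62, 64, EMPTY],   # lead melody track
    [48, EMPTY, 48, 48],   # bass track
    [36, 36, 36, 36],      # drum track
])

# To generate a missing track, mask its whole row and let the diffusion
# model predict every masked cell in parallel (non-autoregressively),
# conditioned on the tracks that are already filled in.
mask = np.zeros_like(getscore, dtype=bool)
mask[1, :] = True  # ask the model to (re)generate the bass line
```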


We're starting to get into some complex machine learning territory here. It gets harder and harder to summarize how these things work. Here's a diagram below.


Fortunately, there's a GETMusic tutorial that came out just one month ago, showcasing the tool in real time. I mentioned earlier that most of the Muzic team is located in Asia, which is why the screenshare is in Chinese. You can still get a general sense of how it works from watching.



HiFiSinger: Singing Voice Synthesis


Singing voice synthesis (SVS) is a rapidly advancing field that seeks to create realistic and expressive singing voices based on musical scores. We've recently published a list of the best AI voice generators for musicians, which are actually available and usable.


HiFiSinger was an SVS system from 2020 that utilized a high sampling rate of 48kHz to better convey expression and emotion. The system employed a FastSpeech-based neural acoustic model and a Parallel WaveGAN-based neural vocoder to ensure fast training and inference speeds while maintaining high voice quality. The repository is not available, which makes it difficult to vouch for how well it worked at the time of publication.


At the end of the day, our research into Microsoft's AI music collection showed us that a number of efforts are being made to keep pace with other big tech companies. Clearly a decent amount of brainpower has been devoted to these problems. However, in classic Microsoft style, they have missed the boat on what today's AI music culture is looking for: an accessible user interface with high-quality audio production.


Even with open-source code, Microsoft will need help from Meta's MusicGen to catch up and become a serious contender in Apple's existing music production software market. Without that, they're destined to repeat history and remain a toolmaker for businesses rather than artists.

