ElevenLabs Releases AI Sound Effect Generator, AI Songs Next

Ezra Sandzer-Bell
Jun 3, 2024

Updated: Apr 27

ElevenLabs is one of the world's top sites for generating realistic AI speaking voices. Users pick between text-to-speech and voice-to-voice cloning, leveraging ethically trained models or training their own custom AI voices.

In June 2024, the company released an AI sound effect generator with 1,000 seconds of audio for new users who want to try the service risk-free.

Wondering how they arrived at SFX from speaking voices?

Back in February 2024, ElevenLabs joined Disney's accelerator program. In May, they teased an upcoming AI song generator with high quality voices that could give competitors like Suno and Udio a run for their money.

Historically, they haven't supported AI singing voices or melodic voice transfer. The SFX and generative songs signal an intent to move deeper into the entertainment sector. Low-cost SFX, voiceovers and background music are an obvious play, since they would cut production costs for media companies.

Have a listen to a demo of their AI song generation capabilities below:

ElevenLabs hasn't announced a release date for the AI song generator yet. While we wait on that bomb to drop, we've started testing their new sound effect generator and evaluating its output quality.

I'm going to level with you right out of the gate. The SFX don't sound great today, but with $80M in funding and a $1.1B valuation, they're likely to continue improving on the product. It seems they may have rushed this product to market to keep eyes on the brand and perhaps to achieve some investor milestones.

In this article we'll share a brief tutorial on accessing and prompting the system, with takeaways on its strengths and weaknesses. We'll also cover some competing services in the same niche and close with info you ought to have on how their tech has been used to push political psyops and election interference.

How to access the AI sound effect generator
Report: Our experiments using text-to-SFX prompts
Bugs and UX issues in the current version
Alternatives to ElevenLabs for creating AI SFX
11Labs in the news: Deepfakes and election interference

How to access the AI sound effect generator

To get started, head over to the ElevenLabs website and sign up for a free account. You can use this deep link to the sound effects page as a shortcut. If you're already signed in, use the left navigation menu and select sound effects below the speech tab.

You don't need a credit card, and you'll get a 10,000-character quota as a bonus for signing up. The service consumes 10 characters per second of audio, meaning you'll get up to 1,000 seconds of audio for free.
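The quota arithmetic can be sketched in a few lines of Python. The rate and the bonus quota are the figures quoted in this article and may change; the function name is purely illustrative and not part of any ElevenLabs tooling.

```python
# Figures as quoted in this article (June 2024); ElevenLabs may change them.
CHARS_PER_SECOND = 10
FREE_QUOTA_CHARS = 10_000


def seconds_available(quota_chars: int, chars_per_second: int = CHARS_PER_SECOND) -> int:
    """Return how many whole seconds of audio a character quota covers."""
    if quota_chars < 0 or chars_per_second <= 0:
        raise ValueError("quota must be non-negative and rate must be positive")
    return quota_chars // chars_per_second


print(seconds_available(FREE_QUOTA_CHARS))  # 1000
```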

If you stick with the default settings, ElevenLabs will try to automatically determine the best length for each clip. In our experience testing the service, clips ranged from one to five seconds in duration, but the majority of SFX were 2 seconds long.

Report: Our experiments using text-to-SFX prompts

We recorded a demo of roughly 40 text-to-sfx prompts and trimmed out the processing time, so you can hear each sample and judge the output for yourself.

I've been building a database of cinematic sound effects at Audio Design Desk over the past 9 months, so my standards are set high. These were some of my high level observations and takeaways from the ElevenLabs experiment:

As a whole, the sfx sound thin and lack character.
The AI model is hyper-sensitive to word choice. Slightly different text prompts can create very different output. Unlike other popular AI music generators, there doesn't seem to be an LLM translating the meaning of the input into corresponding meta labels. The system feels brittle as a result.
Generations of 3-5 seconds often feature a 1-second sfx followed by a long tail of silence. Sometimes the model simply repeats the same sound a second time to fill the space.
The default setting costs 200 credits or "characters", which means their system is creating four 5-second clips and eliminating the dead space or repetitive sounds on your behalf. Better to turn off the automation and start at one second until you find the right prompt.
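If you do end up with clips that carry a tail of dead air, the silence can be trimmed after download. Here is a minimal sketch, assuming the audio has already been decoded into a list of float samples between -1.0 and 1.0. The threshold value and the function name are illustrative choices of ours, not anything ElevenLabs provides.

```python
def trim_trailing_silence(samples: list[float], threshold: float = 0.01) -> list[float]:
    """Drop everything after the last sample louder than the threshold."""
    end = len(samples)
    # Walk backwards until we reach a sample louder than the threshold.
    while end > 0 and abs(samples[end - 1]) <= threshold:
        end -= 1
    return samples[:end]


clip = [0.0, 0.4, -0.6, 0.2, 0.005, 0.0, 0.0]
print(trim_trailing_silence(clip))  # [0.0, 0.4, -0.6, 0.2]
```

A fixed amplitude threshold is the simplest approach. Real clips with background noise may need a higher threshold or a short window average instead.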

ElevenLabs SFX summary: The good, the bad, the ugly

I had mixed results with the model, ranging from reasonably good to flat out horrible. Fortunately, the service is free and didn't cost us anything to play around with. Here's a list of the prompts we used and the results we got.

10 prompts that generated the audio output we expected

Glitch sound effect: An assortment of glitchy sounds
Chainsaw: Revving and sawing sounds
Zany cartoon sound: A diverse range of wacky sounds
Wind Chimes: Nice assortment of twinkly chimes
Elephants: Standard elephants crying sound
Chickens clucking in the yard: Hens with ambient noises
Woman whispering: Sounds female, she speaks made-up words
Man whispering: Sounds male, one spoke without whispering
Dial tone: Three dial tones, one phone ringing
R2D2: The familiar Star Wars sound (Did they license that audio?)

8 prompts that generated okay audio output but not great

Train whistle sound: Cheap timbre, sounds like toy trains in a dead room
Running through the snow: Running pace, but minimal snow crunch texture
Wooden door slam: Door closing sounds lack depth and low end
Light saber battle: Familiar saber buzz but singular, not a battle
Angelic choir: Sounds like a tinny MIDI synth
Tapping metal on metal: One-off metal impact sounds
Light rainstorm: White noise with a little pattering, mediocre
Brick going through a glass window: Two decent broken glass sounds

Bad prompts that got better after making a small change

Bad - Machine gun sounds: Clicking, cocking and reloading sounds

Better - Rapidfire machine gun: The machine gun sounds we expected

Bad - Slot machine: A 'fail' sound effect, two clicks, one coin sound

Better - Slot machine coins: Three jingling coin sounds, one tinny click

Bad - Punch in the stomach: Three light tapping sounds, one light whoosh sound

Better - Deep punch sound: Two deep impact sounds, a thud and a swoosh

Bad - Punching numbers into a phone: Random tapping sounds

Better - Dialing numbers into a phone: The expected phone sounds

Bad - Crackling fireplace: Totally off base. No fire sounds, random assortment.

Still Bad - Fire: Again, not a single ambient fire sound effect. Seems strange.

Better - Striking a match: Perfect, all four sounded like matches being lit

7 terrible results that didn't sufficiently match the prompts

Zebras: Wildly random sounds, totally off base
Audience clapping: One-off clap sounds and a single person clapping
Audience applause: Same issue, no "round of applause" sfx available
Dinosaur stomping through the jungle: Slow, low quality thuds
Car chase: A couple of revving car engines, two tapping sounds
Explosion: Muffled impact sounds, they're not explosion sounds
Balloon popping: Low quality tapping, no popping sound

The AI sound effect generator is a mixed bag. Some of the sounds are usable. The real question is, why spend money gambling on AI SFX when you can license them for cheap through Audio Design Desk, Splice, and Artlist?

Bugs and UX issues in the current version

Sound quality aside, there are several UX issues and a flat-out bug that gave me the impression ElevenLabs rushed this product to market.

To start, their system's default to "automatically pick the best length" is actually a compensation for the model's tendency to produce one-shot clips followed by dead air or repeating sfx. So while it's framed as a benefit, it's actually inefficient and would cost you money if you're on a paid plan.

Click on the settings button to get more control over your output. You'll be able to switch off the random length feature and dictate the duration yourself.

The minimum and maximum lengths are currently set at 1 and 22 seconds. The cost for four 1-second clips will be 40 character credits instead of the 200 that it costs when you use their default automation.

Warning: Increasing the duration of your output will consume more character credits. Hover over the "generate sound" button as shown above to check how many credits you're about to spend.
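To estimate the cost before you hit generate, the numbers above can be put into a rough calculator. This is a sketch under stated assumptions: four clips per prompt, 10 characters per second, a 1 to 22 second range, and a cost that scales linearly with duration. All of those come from our observations in this article. The names are hypothetical and not part of any ElevenLabs API.

```python
CHARS_PER_SECOND = 10             # rate quoted in this article
CLIPS_PER_GENERATION = 4          # each prompt returned four clips in our tests
MIN_SECONDS, MAX_SECONDS = 1, 22  # duration limits at the time of writing


def generation_cost(duration_seconds: int, clips: int = CLIPS_PER_GENERATION) -> int:
    """Estimate the character credits spent on one prompt."""
    if not MIN_SECONDS <= duration_seconds <= MAX_SECONDS:
        raise ValueError(f"duration must be between {MIN_SECONDS} and {MAX_SECONDS} seconds")
    return duration_seconds * clips * CHARS_PER_SECOND


print(generation_cost(1))   # 40
print(generation_cost(5))   # 200, the same as the default automatic setting
print(generation_cost(22))  # 880
```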

There is no "save settings" button, which is strange for a menu of this kind. Instead, hitting "X" stores your changes and applies them on the main screen, where ElevenLabs generates sounds from the text prompt you entered earlier. I found this counterintuitive, and it took me a moment to realize what was happening.

Another UX nitpick: The interface includes preset tabs with suggestions. When you click on a tab, it doesn't simply populate the text field with a suggestion as most AI apps do. Instead, it auto-submits the prompt and consumes more credits.

For example, if you click "animal sounds" it will automatically generate 4 sounds before you have a chance to verify whether you actually want that type of animal.

Last but not least, as of June 3rd, the history section is completely broken. SFX that you've generated never show up, even after a hard refresh on the page. This means each time you run a new prompt, the previous sounds are lost. If you find sounds you like, be sure to download them first.

5 Alternatives to ElevenLabs for generating AI SFX

Several other companies already offer AI sound effect generators. Most of them use the same "text-to-sfx" approach. Some also offer images and video as input options instead. Here's a quick round up of the most popular services we're aware of:

Pika's AI video-to-sfx generator: Pika is currently one of the top AI video generation tools on the market. In March 2024 they announced a new feature that supports text-to-sfx as well as video-to-sfx. Read a full review and watch some demos here.
WavTool: We're big fans of this AI DAW. They're an OG AI MIDI generating web app. Part of their tool suite includes a text based AI SFX generator.
AudioLDM on Hugging Face: You'll need a Hugging Face account to use the service. This was the first text-to-SFX workflow I ever used. In my experience the quality is on par with 11Labs, which isn't saying much because neither is particularly good compared to professional libraries.
Image-to-SFX on Hugging Face: This Hugging Face space uses the same AudioLDM model, but presents an image-to-sfx workflow instead of text prompting.
MyEdit AI SFX generator: The interface is better than HuggingFace but the quality is worse. You can try it for free and see for yourself. SFX duration is displayed as 10 seconds but in reality, they're often 3-5 seconds of noise followed by 5-7 seconds of dead air.

11Labs in the news: Election interference deepfakes

ElevenLabs markets its AI voice cloning technology as a service for content creators and businesses. Their use cases are valid, but bad actors have proven this year that the technology can easily be abused.

In January 2024, a Democratic consultant hired a freelance stage magician to create deepfakes of President Joe Biden. The audio was made with 11Labs and sent out through a telecom company as robocalls urging voters not to show up to their state's primary election.

The consultant was busted, slapped with a felony charge and fined $6M by the FCC. The offending telecom company was also fined $2M. ElevenLabs got off scot-free, but made a public statement that they banned the account and intend to take a more aggressive stance against fraud.

Pivoting into the entertainment industry through AI music and sound effects may help them build a legal defense against accusations that they're profiting from black markets and fueling socially disruptive trends.

We'll report back as ElevenLabs fixes the bugs and UX issues that we catalogued here. Our team will also run another round of tests in a few months to see whether the sound effects match their text prompts more closely.