Yes, we made this image using AI

You really can’t talk about media or production these days without someone bringing up artificial intelligence (AI). This year, news and social media have exploded with talk of programs like ChatGPT, Midjourney, Adobe Firefly, DALL-E, and countless others that creatives can use to generate images, text, voices, and even pop songs. With AI being one of the key sticking points in both the WGA and SAG-AFTRA strikes, those of us in production and post-production have a lot of questions yet to be answered. How will we use AI in our creative process in the future? What does this mean for job security?

Recently, Allyson and I had the chance to virtually attend the AI Creative Summit, put together in partnership with NAB. At the summit, various experts discussed the latest advancements in AI and demonstrated some of the capabilities of this technology within the creative production realm. Our main area of interest, obviously, is how this technology affects audio production. We know that generating a voice over from text has gotten better and better. But one of the latest advancements that has really piqued everyone’s interest is something called voice cloning. This is where pre-recorded audio of a particular person’s voice is used to train a computer model, which can then simulate that voice reading whatever script you want. Most famously, voice cloning was recently used to de-age actor Mark Hamill’s voice in order to recreate a young Luke Skywalker for both “The Mandalorian” and “The Book of Boba Fett” TV series. Lucasfilm used a Ukrainian company called Respeecher to recreate the famous actor’s voice, training the software on archived ADR, audiobook, and radio play recordings.

AI was also used to recreate Val Kilmer’s voice for both “Top Gun: Maverick” and “VAL,” a documentary on the actor’s life. Kilmer lost much of his voice years ago after undergoing treatment for throat cancer.

Allyson and I were able to see a few demonstrations of similar voice cloning technology at the summit. One company called Instreamatic can generate text-to-speech advertisements, complete with music, as well as personalized versions for whatever platform the audience is listening on. Although the built-in library of voices that this company can generate sounded pretty good, it was during the demonstration of their voice cloning tool that I thought the effect started to fall apart a bit. The demonstrator uploaded a recording of Will Ferrell playing the character “Ron Burgundy.” The resulting voice (to me) sounded more like Will Ferrell doing a less-than-perfect impression of his own character. Perhaps with a longer recording to learn from, it might have sounded better? Hard to say from this short demonstration.

Another company in the generative AI voice realm is one called ElevenLabs. In this demonstration, the host showed the various built-in voices users can choose to read a script - complete with various (although somewhat ambiguously-named) controls for adjusting the style of the performance. This program was also able to clone a voice pretty well; however, what we noticed was just how difficult it would be to customize the performance. Although there are settings you can adjust, such as “Stability” and “Style Exaggeration,” these amount at best to a shot-in-the-dark, guess-and-check method of “refining” the performance until you get something you like.

Screenshot of the ElevenLabs user interface
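For those curious what those knobs look like under the hood, ElevenLabs also exposes them through a developer API. Here’s a minimal sketch of a text-to-speech request in Python - the endpoint shape and parameter names reflect ElevenLabs’ public docs as of this writing, but the API key, voice ID, and settings values below are placeholder assumptions you’d swap in yourself:

```python
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder; use your own key
VOICE_ID = "your-voice-id"           # hypothetical ID for a built-in or cloned voice

# Endpoint shape per ElevenLabs' public docs at the time of writing.
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

payload = {
    "text": "Thanks for listening, and we'll see you next time.",
    "voice_settings": {
        # The same ambiguously-named knobs from the web UI: lower
        # "stability" allows a more varied (and less predictable) read.
        "stability": 0.5,
        "similarity_boost": 0.75,
        "style": 0.3,  # the "Style Exaggeration" slider
    },
}

response = requests.post(url, json=payload, headers={"xi-api-key": API_KEY})
response.raise_for_status()

# The API returns raw audio bytes (MP3 by default).
with open("generated_vo.mp3", "wb") as f:
    f.write(response.content)
```

Even from code, though, you’re still turning the same vague dials and listening back to see what you get.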

For those of us in the advertising world, where reading a script to an exact time is imperative, ElevenLabs has yet to come out with a control for that. As near as we can tell, you would still need to bring your resulting audio file into an editing program, chop it up to take out extra-long pauses, and time-compress things to get them to fit properly. There are also no controls for intonation or inflection. One thing we have learned over years of directing voice talent is that sometimes the way you read something can completely change the meaning of the message. Listening back to the resulting voice, you occasionally get odd-sounding phrasing, an emphasis on the wrong word, etc. I would be very curious to see a text-to-voice program like this generate a VO that required sarcasm or any sort of play on words within a script. In these instances, a professional voice talent’s ability to “wink at the camera” (or in this case, the microphone) cannot be overstated.
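To give you an idea of what that manual cleanup involves, here’s a rough sketch of the kind of pass we’d do ourselves using ffmpeg - the file names and thresholds are purely illustrative assumptions, not anything ElevenLabs offers:

```python
import subprocess

# Hypothetical file names; "src" would be the AI-generated voice over.
src = "generated_vo.mp3"
dst = "generated_vo_tightened.wav"

# silenceremove (with a negative stop_periods) trims pauses longer than
# stop_duration anywhere in the file; atempo=1.05 plays the read about
# 5% faster without shifting pitch. Tune both by ear per recording.
subprocess.run([
    "ffmpeg", "-y", "-i", src,
    "-af", "silenceremove=stop_periods=-1:stop_duration=0.6:stop_threshold=-40dB,atempo=1.05",
    dst,
], check=True)
```

In other words, hitting an exact :30 still means a trip through your editor (or a script like this), not a checkbox in the voice generator.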

We did notice that the style of the generated voice is largely at the mercy of the style of the voice that is put in. Meaning, if you input a slower, more relaxed read for the AI to learn from, that’s what the program will generate. If you input a faster, more upbeat read, that’s more likely what you will get when the new recording is generated. Unfortunately, this particular demonstration didn’t put that to the test, although the presenter did mention that he wanted to try it in the future. The presenter also pointed out that if your original recording has small mistakes in it, such as plosives (popping “P” sounds), the program will actually bake those into the cloned voice. Therefore, it is important to start with a clean, high-quality recording.
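If you wanted a quick sanity check before uploading a sample for cloning, a simple level scan catches the most obvious problems. Here’s a small illustrative sketch (the file name and thresholds are our own assumptions, not part of any cloning tool) using the numpy and soundfile libraries:

```python
import numpy as np
import soundfile as sf

# Hypothetical source file: the sample you plan to upload for cloning.
audio, sample_rate = sf.read("voice_sample.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # fold stereo to mono for analysis

peak = float(np.max(np.abs(audio)))
rms_db = 20 * np.log10(np.sqrt(np.mean(audio ** 2)) + 1e-12)

# Rule-of-thumb thresholds; adjust to your own delivery specs.
if peak >= 0.99:
    print("Warning: the sample appears to clip - plosives and distortion "
          "can get baked into the cloned voice.")
elif rms_db < -30:
    print(f"Warning: average level is low ({rms_db:.1f} dBFS); "
          "consider a louder, cleaner recording.")
else:
    print(f"Peak {peak:.2f}, RMS {rms_db:.1f} dBFS - looks usable.")
```

A scan like this won’t catch everything a trained ear will, but it beats discovering a popped “P” after the clone has already learned it.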

So what practical applications does a program like ElevenLabs have? Considering the need to have advertising copy fit a very specific amount of time while delivering very specific messaging, we are dubious that this would be the perfect tool for that. However, for corporate videos or scientific explainer videos, where reading with style or reading to an exact time isn’t much of a factor, we could see this being a useful application.

It should be mentioned, however, that we have heard of instances of people’s voices being illegally uploaded and cloned on several of these types of sites, including ElevenLabs. And according to a voice actor we have spoken with, ElevenLabs has done very little to combat this. We won’t cover the ethical or legal implications of voice cloning technology in this post. That will be a separate post in the future, because there is A LOT to cover there. For the time being, if you are interested in how the voice acting community is trying to work with AI and get a framework in place to protect artists’ rights, we encourage you to visit the website of the National Association of Voice Actors (NAVA).

As for the other big questions, such as “How will we creatives use AI in our workflows?” - we can reiterate what you have probably already heard others say: “AI is a tool.” When it comes to editing audio, there are AI programs to reduce background noise, edit music to specific lengths, create voices, create volume automation for mixing, and much more. All of these are tools for a person to control while making creative decisions on the way to a final, polished product. They are tools to help us work faster and smarter, and when used properly (i.e., to enhance an already original product), they simply add to a professional’s “bag of tricks.”

As for the job security question, that may be a bit more up in the air. For years, voice talent and recording studios have been competing with cut-rate internet VO products, where those on a budget, or those who don’t really care how their scripts are read, can get two takes of an undirected read from an anonymous speaker with a fairly quick turnaround. The advancements in AI voices only add to the pile of competition. Will studios have to pivot yet again and become more “audio producer” than “engineer?” Will it be our job as experts to be the ones who feed prompts and data into AI programs in order to yield the best possible result? Or will studios simply become a novelty for those who like the idea of using old-school methods for creativity? Time will tell, and at the rate AI is advancing, we may find out sooner rather than later. For now, we will stay on top of the latest and greatest so we can continue to do what we have already been doing for years: deliver a damn good product.