
Breathing Life Into Automated Voices: Reading The Script Behind Coqui

Welcome to this week’s Deep-fried Dive with Fry Guy! In these long-form articles, Fry Guy conducts an in-depth analysis of a cutting-edge AI development or developer. Today, our dive is about Coqui.ai, which uses AI to create and edit voice-overs to deliver lines of dialogue in a way nobody has before. We hope you enjoy!

*Notice: We do not gain any monetary compensation from the people and projects we feature in the Sunday Deep-fried Dives with Fry Guy. We explore these projects and developers solely for the purpose of revealing to you interesting and cutting-edge AI projects, developers, and uses.*



Can automated voices feel and make you feel? Can they cause you to be happy, sad, angry, or excited? The power of artificial intelligence (AI) to replicate human voice in a way that moves the human heart and evokes the human spirit is here.

Coqui.ai is a text-to-speech platform that allows users to input dialogue and design a customizable AI voice to deliver those lines in ways indistinguishable from human speech.


Josh Meyer, the co-founder of Coqui, earned his Ph.D. in speech and language technology, focusing mostly on speech recognition and speech-to-text systems. Toward the end of his academic research, he joined Mozilla to make their speech recognition technology work for as many languages as possible. He collaborated with Mozilla's machine learning team for a couple of years, and the technology gained a lot of traction in commercial applications. From there, he and a small team spun out of Mozilla to focus primarily on this type of project. They raised venture capital and began working on what is now Coqui.ai.

Coqui has been in the works for about three years now. At Mozilla, the focus was mostly on speech-to-text, but as the team began working on Coqui, breakthroughs in speech synthesis led them to focus on voice cloning: taking a short sample of audio and synthesizing that speaking voice in multiple languages. Coqui was the first project to have that working in a production setting.


Coqui offers a few different products. First, Coqui has an application programming interface (API) that can be integrated into speech synthesis or voice cloning applications. They also maintain an open-source side of the project, containing the raw models and code, which is available on platforms like GitHub.

The main service Coqui offers is what is called Coqui Studio. Meyer dubs this "the GarageBand of voiceover." The platform allows users to organize their projects (such as a movie or video game) into "scenes" that contain dialogue between characters, and it provides tools to alter and edit those lines of dialogue as desired. For example, the user can change the emotion of the voice delivering a line, making it sound happy or excited, or even sad or angry. The platform also lets the user adjust the pitch and inflection of individual words, allowing for an immersive and entirely customizable text-to-speech experience.

Beyond voice editing, Coqui gives the user tools to create new voices via "prompt-to-voice." This can help users avoid the copyright concerns that come with using, for instance, celebrity voices in their commercials or video games. The feature works by letting the user describe the kind of voice they want in a prompt; if the model works well, that's the voice you get. For example, one could type, "I want a 30-year-old man with a New York accent who sounds like he smokes too many cigarettes," and receive a voice that meets that description. Coqui is the only platform with this type of product in production, and they are continuing to develop it to make the prompts simpler and easier to use.

While the prompt-to-voice side of Coqui Studio continues to be developed, Coqui also offers a guided version of voice creation. Here, the user picks out the qualities of the voice they want, and the platform creates that customized voice for them. For example, the user can select "male, teenager, Australian accent, happy-go-lucky," and based on each selection, the feature offers more refined options that pair well with the previous choices. Often, users save these created voices in a personalized "voice bank" and reuse the characters across different projects.

Meyer views Coqui Studio as a team collaboration tool, where various people working on a project can come into the studio and edit and save voices, use saved voices, and create dialogue that is in line with the team’s goals.

Currently, Coqui Studio offers seven languages, and the team is working on polishing more via their open-source projects on GitHub and Hugging Face. Meyer says, "We try to make the core models we are working on available whether or not you know how to code and also whether or not you have a big budget."


Meyer emphasizes that the goal of Coqui is to "create tools for humans." When humans create voice-overs for a project, they have a specific goal in mind. For example, when recording a line for a movie or video game, the director will not settle for a "mediocre" delivery; they will work until that line of dialogue is perfect. They choose a voice actor for the part based on the tone of that person's voice, and they coach that individual on the line until they get the emotion and inflection they are looking for. Coqui keeps this approach in mind as they continually develop their tools. The goal is to offer a platform that makes customizable voices easier to create and gives the user immediate editing tools to achieve their desired output.

Meyer summarizes, “Getting the right lines of dialogue is a very creativity-intensive process—it is very rigorous—and that’s why in a lot of video games you have ten characters who get voiced and 300 others who get subtitles. We are trying to make it possible to have that same level of creative control, but much more efficient so users can get excellent audio for all of those normally unvoiced characters.”


In the past year, AI has allowed for a major breakthrough in voice creation. As Meyer remarked, “Speech recognition has been around for a while, but only in the last year with these breakthroughs in diffusion models have we gotten to the point of not just human-like speech, but convincing, entertaining, fun human-like speech.”

Since about 2016, artificial voice has been able to pass human perception tests, proving almost indistinguishable from the voice of a human being. In the past year, however, Meyer points out that "we have gone from 'human-like' to 'entertaining humans.'" For years, artificial voice was purely a means of conveying information, such as Siri telling someone the time. Meyer explains, "Siri sounds human, but sounds like a boring human who hasn't had enough coffee … but now we have AI voices that can convince us of something more than that it's just a human voice; it can convince us it's sad or happy and can make us feel afraid or excited." Coqui is also working on ways for their models to predict the emotions associated with certain lines of dialogue. For example, if the text reads, "My cat died yesterday," the underlying model should understand that this is something a human would typically say in a sad tone. Meyer sees AI voice having this ability in the near future.
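Coqui's emotion prediction would presumably be a learned model; purely as a toy illustration of the idea (not Coqui's method), a keyword-based sketch shows the kind of mapping involved, from a line of text to a suggested delivery emotion:

```python
# Toy illustration only: map dialogue text to a suggested delivery
# emotion using simple keyword cues. A production system would use a
# trained model rather than hand-picked word lists.
SAD_CUES = {"died", "lost", "miss", "sorry"}
HAPPY_CUES = {"won", "love", "great", "birthday"}

def suggest_emotion(line: str) -> str:
    """Return a suggested emotion label for a line of dialogue."""
    words = {w.strip(".,!?").lower() for w in line.split()}
    if words & SAD_CUES:
        return "sad"
    if words & HAPPY_CUES:
        return "happy"
    return "neutral"

print(suggest_emotion("My cat died yesterday."))  # sad
```

The suggested label could then seed the emotion control in a tool like Coqui Studio, with the human still free to override it.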

The application of AI to voice synthesis causes one to wonder what impact it will have on voice actors, whose livelihoods rely on delivering dialogue that might easily be replaced by projects like Coqui Studio. Meyer doesn't see this technology as much of a concern for talented voice actors. However, he points out that "the bar that you have to achieve to be a voice actor is being raised a lot. You can't be a sub-par voice actor anymore because you're getting beat out by AI any day of the week."


The line between AI voice and human voice continues to blur. As AI voice control improves, the landscape is shifting dramatically: AI is able not only to replicate human voices but also to move humans emotionally. This opens the door to a wide range of applications, from personal use to commercial use in video games, movies, and more.

Coqui.ai is taking this industry by storm, and as they continue to develop new ways to create voices, the sky is the limit on what AI can do. AI is developing its own voice, and it is a voice that rings loud and clear.