Alongside cooking for myself and walking laps around the house, Japanese cartoons (or “anime,” as the kids are calling it) are something I’ve learned to love during quarantine.
The problem with watching anime, though, is that short of learning Japanese, you become dependent on human translators and voice actors to port the content to your language. Sometimes you get the subtitles (“subs”) but not the voicing (“dubs”). Other times, entire seasons of shows aren’t translated at all, and you’re left on the edge of your seat with only Wikipedia summaries and 90s web forums to ferry you through the darkness.
So what are you supposed to do? The answer is obviously not to ask a computer to transcribe, translate, and voice-act entire episodes of a TV show from Japanese to English. Translation is a careful art that can’t be automated and requires the loving touch of a human hand. Besides, even if you did use machine learning to translate a video, you couldn’t use a computer to dub… I mean, who would want to listen to machine voices for an entire season? It’d be awful. Only a real sicko would want that.
So in this post, I’ll show you how to use machine learning to transcribe, translate, and voice-act videos from one language to another, i.e. “AI-Powered Video Dubs.” It might not get you Netflix-quality results, but you can use it to localize online talks and YouTube videos in a pinch. We’ll start by transcribing audio to text using Google Cloud’s Speech-to-Text API. Next, we’ll translate that text with the Translate API. Finally, we’ll “voice act” the translations using the Text-to-Speech API, which produces voices that are, according to the docs, “humanlike.”
(By the way, before you flame-blast me in the comments, I should tell you that YouTube will automatically and for free transcribe and translate your videos for you. So you can treat this project like your new hobby of baking sourdough from scratch: a really inefficient use of 30 hours.)
AI-dubbed videos: Do they sound any good?
Before you embark on this journey, you probably want to know what you have to look forward to. What quality can we realistically expect to achieve from an ML-video-dubbing pipeline?
Here’s one example dubbed automatically from English to Spanish (the subtitles are also automatically generated in English). I haven’t done any tuning or adjusting on it:
As you can see, the transcriptions are decent but not perfect, and the same goes for the translations. (Ignore the fact that the speaker sometimes speaks too fast; more on that later.) Overall, you can easily get the gist of what’s going on from this dubbed video, but it’s not exactly near human-quality.
What makes this project trickier (read: more fun) than most is that there are at least three possible points of failure:
- The video can be incorrectly transcribed from audio to text by the Speech-to-Text API
- That text can be incorrectly or awkwardly translated by the Translation API
- Those translations can be mispronounced by the Text-to-Speech API
In my experience, the most successful dubbed videos were those that featured a single speaker over a clear audio stream and that were dubbed from English to another language. This is largely because the quality of transcription (Speech-to-Text) was much higher in English than in other source languages.
Dubbing from non-English languages proved substantially more challenging. Here’s one particularly unimpressive dub from Japanese to English of one of my favorite shows, Death Note:
If you want to leave translation/dubbing to humans, well, I can’t blame you. But if not, read on!
Building an AI Translating Dubber
As always, you can find all the code for this project in the Making with Machine Learning Github repo. To run the code yourself, follow the README to configure your credentials and enable APIs. Here in this post, I’ll just walk through my findings at a high level.
First, here are the steps we’ll follow:
- Extract audio from video files
- Convert audio to text using the Speech-to-Text API
- **Split transcribed text into sentences/segments for translation**
- Translate text
- Generate spoken audio versions of the translated text
- **Speed up the generated audio to align with the original speaker in the video**
- **Stitch the new audio on top of the old audio/video**
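The first step is the easiest to show concretely. Here’s a minimal sketch (my own, not the repo’s exact code) that shells out to ffmpeg to pull a mono WAV track out of a video file; the particular flags are my own choices:

```python
import subprocess

def ffmpeg_extract_cmd(video_path, out_path):
    # Build the ffmpeg command for step 1: extract the audio track as
    # mono WAV, since Speech-to-Text works best with lossless,
    # single-channel audio.
    return [
        "ffmpeg", "-y", "-i", video_path,
        "-ac", "1",       # downmix to one channel
        "-ar", "44100",   # 44.1 kHz sample rate
        out_path,
    ]

def extract_audio(video_path, out_path="audio.wav"):
    subprocess.run(ffmpeg_extract_cmd(video_path, out_path), check=True)
    return out_path
```

The resulting WAV file is what gets uploaded for transcription in the next step.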
I admit that when I first set out to build this dubber, I was full of hubris: all I had to do was plug a few APIs together; what could be easier? But as a programmer, all hubris must be punished, and boy, was I punished.
The challenging bits are the ones I bolded above, and they mainly come from having to align translations with video. But more on that in a bit.
Using the Google Cloud Speech-to-Text API
The first step in translating a video is transcribing its audio to words. To do this, I used Google Cloud’s Speech-to-Text API. This tool can recognize audio spoken in 125 languages, but as I mentioned above, the quality is highest in English. For our use case, we’ll want to enable a couple of special features, like:
- Enhanced models. These are Speech-to-Text models that have been trained on specific data types (“video,” “phone_call”) and are usually higher-quality. We’ll use the “video” model, of course.
- Profanity filters. This flag prevents the API from returning any naughty words.
- Word time offsets. This flag tells the API that we want transcribed words returned along with the times that the speaker said them. We’ll use these timestamps to help align our subtitles and dubs with the source video.
- Speech adaptation. Typically, Speech-to-Text struggles most with uncommon words or phrases. If you know certain words or phrases are likely to appear in your video (i.e. “gradient descent,” “support vector machine”), you can pass them to the API in an array that will make them more likely to be transcribed:
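To make those options concrete, here’s roughly what the request configuration looks like, assuming the `google-cloud-speech` client library. I’ve written it as a plain dict (which the client accepts in place of a `RecognitionConfig` proto); the phrase hints are just example values:

```python
# Speech-to-Text request config enabling the features described above.
# A dict like this can be passed to the google-cloud-speech client
# wherever a RecognitionConfig object is expected.
config = {
    "language_code": "en-US",
    "use_enhanced": True,              # opt in to the enhanced models
    "model": "video",                  # the model trained on video audio
    "profanity_filter": True,          # censor any naughty words
    "enable_word_time_offsets": True,  # return per-word timestamps
    # Speech adaptation: phrases likely to appear in this video
    "speech_contexts": [
        {"phrases": ["gradient descent", "support vector machine"]}
    ],
}
```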
The API returns the transcribed text along with word-level timestamps as JSON. As an example, I transcribed this video. You can see the JSON returned by the API in this gist. The output also lets us do a quick quality sanity check:
What I actually said:
“Software Developers. We’re not known for our rockin’ style, are we? Or are we? Today, I’ll show you how I used ML to make me trendier, taking inspiration from influencers.”
What the API thought I said:
“Software developers. We’re not known for our Rock and style. Are we or are we today? I’ll show you how I use ml to make new trendier taking inspiration from influencers.”
In my experience, this is about the quality you can expect when transcribing high-quality English audio. Note that the punctuation is a little off. If you’re happy with viewers getting the gist of a video, this is probably good enough, although it’s easy to manually correct the transcripts yourself if you speak the source language.
At this point, we can use the API output to generate (non-translated) subtitles. In fact, if you run my script with the `--srt` flag, it will do exactly that for you (SRT is a file type for closed captions):
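If you’re curious what that conversion involves, SRT is just numbered text blocks with `HH:MM:SS,mmm` time ranges. Here’s a minimal sketch (a simplification of my own, not the script’s exact code) that turns timed segments into SRT text:

```python
def srt_timestamp(seconds):
    # Format a time in seconds as an SRT timecode: HH:MM:SS,mmm.
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    # segments: list of (start_sec, end_sec, text) tuples, in order.
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

The start/end times per segment come straight from the word time offsets we asked the API for.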
Now that we have the video transcripts, we can use the Translate API to… uh… translate them.
This is where things start to get a little 🤪.
Our goal is this: we want to be able to translate words in the original video and then play them back at roughly the same point in time, so that my “dubbed” voice is speaking in alignment with my actual voice.
The problem, though, is that translations aren’t word-for-word. A sentence translated from English to Japanese may have its word order jumbled. It may contain fewer words, more words, different words, or (as is the case with idioms) completely different wording.
One way we can get around this is by translating entire sentences and then trying to align the time boundaries of those sentences. But even this becomes complicated, because how do you denote a single sentence? In English, we can split words by punctuation mark, i.e.:
But punctuation differs by language (there’s no ¿ in English), and some languages don’t separate sentences by punctuation marks at all.
Plus, in real-life speech, we often don’t talk in complete sentences. Y’know?
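In code, that naive English-only split might look like this (a sketch; real sentence segmentation is much hairier):

```python
import re

def split_sentences(text):
    # Split after ., !, or ? followed by whitespace. English-only, and
    # easily fooled by abbreviations like "Dr." or "i.e."
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```

For example, `split_sentences("We're not known for our style. Or are we? Today I'll show you.")` gives back three separate sentences.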
Another wrinkle that makes translating transcripts difficult is that, in general, the more context you feed into a translation model, the higher quality translation you can expect. So for example, if I translate the following sentence into French:
“I’m feeling blue, but I like pink too.”
I’ll get the translation:
“Je me sens bleu, mais j’aime aussi le rose.”
This is accurate. But if I split that sentence in two (“I’m feeling blue” and “But I like pink too”) and translate each part separately, I get:
“Je me sens triste, mais j’aime aussi le rose,” i.e. “I’m feeling sad, but I like pink too.”
This is all to say that the more we chop up text before sending it to the Translate API, the worse quality the translations will be (though it’ll be easier to temporally align them with the video).
Ultimately, the strategy I chose was to split up spoken words whenever the speaker took a greater-than-one-second pause in speaking. Here’s an example of what that looked like:
This naturally led to some awkward translations (i.e. “or are we” is a weird fragment to translate), but I found it worked well enough. Here’s what that logic looks like in code.
Side bar: I also noticed that the accuracy of the timestamps returned by the Speech-to-Text API was significantly lower for non-English languages, which further decreased the quality of non-English-to-English dubbing.
And one last thing. If you already know how you want certain words to be translated (i.e. my name, “Dale,” should always be translated simply to “Dale”), you can improve translation quality by taking advantage of the “glossary” feature of the Translation API Advanced. I wrote a blog post about that here.
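The repo has the real implementation; below is a condensed sketch of the idea, assuming the word time offsets have already been flattened into dicts with `word`/`start`/`end` fields in seconds (the field names are my own):

```python
def split_on_pauses(words, max_gap=1.0):
    # Group transcribed words into segments wherever the speaker pauses
    # for more than `max_gap` seconds between one word ending and the
    # next beginning.
    segments, current = [], []
    for w in words:
        if current and w["start"] - current[-1]["end"] > max_gap:
            segments.append(current)
            current = []
        current.append(w)
    if current:
        segments.append(current)
    # Collapse each group into one translatable chunk with time bounds.
    return [
        {"start": seg[0]["start"], "end": seg[-1]["end"],
         "text": " ".join(w["word"] for w in seg)}
        for seg in segments
    ]
```

Each resulting chunk gets translated as a unit and later dubbed into the time window between its `start` and `end`.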
The Media Translation API
As it happens, Google Cloud is working on a new API to handle exactly the problem of translating spoken words. It’s called the Media Translation API, and it runs translation on audio directly (i.e. no transcribed text intermediary). I wasn’t able to use that API in this project because it doesn’t yet return timestamps (the tool is currently in beta), but I think it’d be great to use in future iterations!
Now for the fun bit: picking out computer voices! If you check out my PDF-to-Audiobook converter, you know that I love me a funny-sounding computer voice. To generate audio for dubbing, I used the Google Cloud Text-to-Speech API. The TTS API can generate lots of different voices in different languages with different accents, which you can find and play with here. The “Standard” voices might sound a bit, er, tinny, if you know what I mean, but the WaveNet voices, which are generated by high-quality neural networks, sound decently human.
Here I ran into another problem I didn’t foresee: what if a computer voice speaks a lot slower than a video’s original speaker does, so that the generated audio file is too long? Then the dubs would be impossible to align to the source video. Or, what if a translation is more verbose than the original wording, leading to the same problem?
To deal with this issue, I played around with the speakingRate parameter available in the Text-to-Speech API. This allows you to speed up or slow down a computer voice:
So, if it took the computer longer to speak a sentence than it did for the video’s original speaker, I increased the speakingRate until the computer and human took up about the same amount of time.
Sound a bit complicated? Here’s what the code looks like:
This solved the problem of aligning audio to video, but it did sometimes mean the computer speakers in my dubs were a bit awkwardly fast. But that’s a problem for V2.
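The actual snippet lives in the repo, but the gist is simple: synthesize the sentence once at normal speed, compare its duration to the time slot the original speaker occupied, and re-synthesize at an adjusted rate. Here’s a small sketch of the rate calculation (the 0.25–4.0 clamp reflects the range the TTS API documents for speakingRate):

```python
def speaking_rate(natural_duration, slot_duration, lo=0.25, hi=4.0):
    # If the synthesized voice at rate 1.0 takes `natural_duration`
    # seconds, but the original speaker's slot is only `slot_duration`
    # seconds, speak this many times faster (clamped to the range the
    # Text-to-Speech API accepts for speakingRate).
    return max(lo, min(hi, natural_duration / slot_duration))
```

For example, a sentence that takes 6 seconds at normal speed but has to fit a 4-second slot gets a speakingRate of 1.5.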
Was it worth it?
You know the expression, “Play stupid games, win stupid prizes”? It feels like every ML project I build here is something of a labor of love, but this time, I love my stupid prize: the ability to generate an unlimited number of weird, robotic, awkward anime dubs, which are sometimes kind of decent.
Check out my results here: