Synthesia Offers Professional-Grade, GAN-Based ADR Services

Synthesia recently released its “Native Dubbing” technology in collaboration with the BBC, which can seamlessly replace the facial expressions and lip movements of hosts or actors, addressing existing issues in video translation and Automated Dialogue Replacement (ADR).

Synthesia aims to eliminate language barriers in video, letting producers release content, and audiences watch it, in many languages. This expands the audience and makes videos more culturally inclusive. The company plans to turn the technology into a service, offering professional, high-end ADR to selected partners.

At the core of the Synthesia system is ENACT, its in-house native dubbing tool, which uses artificial intelligence and machine learning to automatically detect and track faces so that they can be selectively modified. After continuous refinement, it largely avoids jitter and blurriness and achieves professional-grade results. It works in two phases:

Phase One: Collect or create the small dataset needed to build a digital face. This requires 3 to 5 minutes of footage of the host or actor speaking naturally, including shots in which they turn their head while speaking. The requirements for the material are flexible: it does not have to be shot in a specific studio or under special lighting, because the development team wants the technology to be broadly applicable.

Phase Two: Process the clips. The system automatically converts the clips supplied by the user into the new target language. If producers prefer to record the target-language audio themselves so that it matches the ambient sound and the original audio, the system can work from that new audio instead.

The key to the technology is precise markerless face tracking of the source clips and “learning” the facial expressions, combined with lighting matching in post-production. The process uses no traditional 3D modeling, texturing, animation, or rendering. Instead, it is built on Generative Adversarial Networks (GANs) that use Convolutional Neural Networks (CNNs) for deep learning, so each frame is generated from the training data and the company's own software techniques.
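
For readers unfamiliar with GANs, the adversarial idea can be illustrated with the standard textbook losses. This is a minimal sketch of the general technique, not Synthesia's implementation: the article does not describe the losses or network details ENACT uses, and the function names below are hypothetical. A discriminator outputs the probability that a frame is real; it is trained to separate real frames from generated ones, while the generator is trained to make its frames pass as real.

```python
import math

def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Cross-entropy loss for the discriminator on one real and one generated frame.

    d_real: discriminator's probability that a real frame is real, in (0, 1).
    d_fake: discriminator's probability that a generated frame is real, in (0, 1).
    (Hypothetical helper for illustration; not taken from the article.)
    """
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake: float) -> float:
    """Non-saturating generator loss: low when the discriminator is fooled."""
    return -math.log(d_fake)

# A discriminator that cannot tell real from generated outputs 0.5 for both.
print(discriminator_loss(0.5, 0.5))  # equals 2 * ln(2)
print(generator_loss(0.5))           # equals ln(2)
```

In a real system both probabilities would come from convolutional networks evaluated on video frames, and the two losses would be minimized in alternation during training.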

Training the convolutional neural networks takes time, potentially 12 hours or more. Once training is complete, the generated facial reenactment looks close to realistic. Although the Synthesia system is fully automated, if the results are not convincing, the team can adjust the settings and processes and reprocess the footage to deliver a satisfactory result.

The entire system runs on Amazon Web Services (AWS) cloud servers, which lets the company scale up when workloads surge. For regular live programs, the team recommends allowing several weeks for processing. The service is priced per minute, and the team is working to keep costs affordable for content providers.

This content is sourced from FXGuide
