
Reported by Xinjiyuan
[Xinjiyuan Guide] NeRF, the most talked-about technique in computer vision right now, is expected to replace Deepfake as the next-generation visual tool. Let's see how powerful it really is.
What, you don’t know NeRF?
As the hottest AI technology in the field of computer vision this year, NeRF is widely applicable and has a promising future.
Friends on Bilibili have played with this technology in new ways.
Setting the stage
So, what exactly is NeRF?
NeRF (Neural Radiance Fields) was first proposed at ECCV 2020, where the paper received a best-paper honorable mention. It elevated implicit representation to a new level, representing complex 3D scenes using only 2D posed images as supervision.
This sparked a wave of follow-up work, and NeRF has since evolved rapidly and been applied in several technical directions, such as novel view synthesis and 3D reconstruction.
NeRF trains a neural radiance field model from sparse multi-view posed images and can then render sharp images from arbitrary viewpoints, as shown in the figure below. In brief, it uses an MLP to implicitly learn a 3D scene.
Netizens naturally compare NeRF with the equally popular Deepfake.
Recently, an article published by Metaphysic reviewed the evolution of NeRF, the challenges it faces, and its advantages, predicting that NeRF will eventually replace Deepfake.
Most of the attention-grabbing coverage of deepfake technology refers to the two open-source packages that became popular after deepfakes entered the public eye in 2017: DeepFaceLab (DFL) and FaceSwap.
Although both packages have a wide user base and active developer communities, neither has deviated significantly from its original GitHub code.
Of course, the developers of DFL and FaceSwap have not been idle: they can now train deepfake models using larger input images, although this requires more expensive GPUs.
However, over the past three years, the improvements in deepfake image quality touted by the media have come mainly from end users.
They have accumulated hard-won, time-saving experience in data gathering and in best practices for training models (a single experiment can take weeks), and have learned how to work around and stretch the limits of the original 2017 code.
Some in the VFX and ML research community are trying to break through the “hard limits” of popular deepfake packages by extending architectures so that machine learning models can be trained on images up to 1024×1024.
This resolution is twice the current practical range of DeepFaceLab or FaceSwap, making it closer to the useful resolutions in film and television production.
Next, let’s learn more about NeRF~
Unveiling the mystery
NeRF (Neural Radiance Fields), which appeared in 2020, is a method for reconstructing objects and environments within a neural network from photos taken at multiple viewpoints.
It achieves state-of-the-art results in synthesizing views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views.
The algorithm represents a scene with a fully connected deep network whose input is a single continuous 5D coordinate (spatial position (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and the view-dependent emitted radiance at that spatial location.
Views are synthesized by querying these 5D coordinates along camera rays, and classical volume rendering techniques composite the output colors and densities into an image.
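Concretely, the classical volume rendering step works as follows: the expected color of a camera ray r(t) = o + td is an integral of the emitted color, weighted by the density and by the transmittance accumulated along the ray. The continuous form from the paper (arXiv:2003.08934), together with the stratified-sampling quadrature that is actually computed, is:

```latex
% Expected color of camera ray r(t) = o + t d between near/far bounds t_n, t_f:
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\right)

% Discrete approximation with N stratified samples t_i along the ray:
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
\qquad
T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right),
\quad
\delta_i = t_{i+1} - t_i
```

Because this quadrature is differentiable, the whole pipeline can be trained end to end with a simple rendering loss against the input photos.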
Implementation process:
First, the continuous scene is represented as a 5D vector-valued function whose input is a 3D position x and a 2D viewing direction, and whose output is an emitted color c and a volume density σ.
In practice, the direction is expressed as a 3D Cartesian unit vector d, and an MLP approximates this continuous 5D scene representation by optimizing its weights.
Additionally, constraining the network to predict the volume density σ from the position x alone, while allowing the RGB color c to depend on both position and viewing direction, encourages the representation to be multi-view consistent.
To achieve this, the MLP first processes the input 3D coordinates x with 8 fully connected layers (using ReLU activation and 256 channels per layer), outputting σ and a 256-dimensional feature vector.
This feature vector is then concatenated with the viewing direction of the camera ray and passed to an additional fully connected layer, outputting the view-dependent RGB color.
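As an illustration, here is a minimal PyTorch sketch of this MLP. The layer widths follow the description above; the skip connection that re-injects the input at the fifth layer follows the architecture diagram in the paper's appendix, and the default input dimensions (60 for position, 24 for direction) come from the positional encoding discussed next:

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Sketch of the NeRF MLP: 8 x 256-channel ReLU layers map the encoded
    position to a density sigma and a 256-d feature; the feature, concatenated
    with the encoded view direction, passes through one more layer to RGB."""

    def __init__(self, pos_dim=60, dir_dim=24, width=256):
        super().__init__()
        # 8 fully connected layers on the (positionally encoded) 3D position;
        # the paper re-injects the input at the fifth layer (skip connection).
        self.layers = nn.ModuleList(
            [nn.Linear(pos_dim, width)] +
            [nn.Linear(width + (pos_dim if i == 4 else 0), width) for i in range(1, 8)]
        )
        self.sigma_head = nn.Linear(width, 1)      # volume density
        self.feature = nn.Linear(width, width)     # 256-d feature vector
        self.rgb_head = nn.Sequential(             # view-dependent color
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x, d):
        h = x
        for i, layer in enumerate(self.layers):
            if i == 4:                             # skip connection
                h = torch.cat([h, x], dim=-1)
            h = torch.relu(layer(h))
        sigma = torch.relu(self.sigma_head(h))     # keep density non-negative
        rgb = self.rgb_head(torch.cat([self.feature(h), d], dim=-1))
        return rgb, sigma
```

Calling `NeRFMLP()(x_enc, d_enc)` on batches of encoded positions and directions returns per-point colors and densities, which are then fed into the volume rendering quadrature above.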
Furthermore, NeRF introduces two improvements for representing high-resolution complex scenes: positional encoding, which helps the MLP represent high-frequency functions, and a hierarchical sampling procedure that samples this high-frequency representation efficiently.
As is well-known, positional encoding in Transformer architectures provides discrete positions of tokens in a sequence as input to the entire architecture. NeRF uses positional encoding to map continuous input coordinates to a higher-dimensional space, making it easier for the MLP to approximate higher frequency functions.
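A minimal sketch of that mapping, following the paper's formula γ(p) = (sin(2⁰πp), cos(2⁰πp), …, sin(2^(L−1)πp), cos(2^(L−1)πp)) applied to each coordinate, with L = 10 for positions and L = 4 for viewing directions:

```python
import torch

def positional_encoding(p, num_freqs):
    """Map each coordinate p to (sin(2^0 pi p), cos(2^0 pi p), ...,
    sin(2^(L-1) pi p), cos(2^(L-1) pi p)), lifting the input to a
    higher-dimensional space so the MLP can fit high-frequency detail.

    p: tensor of shape (..., 3); returns shape (..., 3 * 2 * num_freqs).
    """
    freqs = 2.0 ** torch.arange(num_freqs) * torch.pi        # 2^k * pi
    angles = p[..., None] * freqs                            # (..., 3, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                         # (..., 6L)

# L = 10 for the 3D position (60 dims), L = 4 for the direction (24 dims).
x_enc = positional_encoding(torch.rand(1024, 3), num_freqs=10)  # (1024, 60)
d_enc = positional_encoding(torch.rand(1024, 3), num_freqs=4)   # (1024, 24)
```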
From the figure, it can be observed that removing positional encoding significantly reduces the model’s ability to represent high-frequency geometry and textures, ultimately leading to overly smooth appearances.
Because densely evaluating the neural radiance field at N query points along every camera ray is inefficient, NeRF adopts a hierarchical representation that improves rendering efficiency by allocating samples in proportion to their expected contribution to the final rendering.
In short, NeRF no longer uses a single network to represent the scene, but optimizes two networks simultaneously: a “coarse” network and a “fine” network.
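To make the coarse-to-fine step concrete, here is a minimal sketch under the paper's formulation: the coarse network's per-sample weights w_i = T_i(1 − exp(−σ_i δ_i)) are normalized into a piecewise-constant PDF along the ray, and additional "fine" sample depths are drawn from it by inverse-CDF sampling, so they concentrate where the coarse pass found visible content:

```python
import torch

def sample_fine(bins, weights, n_fine):
    """Inverse-CDF sampling along each ray: draw n_fine sample depths from
    the piecewise-constant PDF defined by the coarse network's weights.

    bins:    (n_rays, n_coarse + 1) depth-bin edges along each ray
    weights: (n_rays, n_coarse)     coarse weights w_i = T_i (1 - exp(-sigma_i * delta_i))
    """
    pdf = weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)     # normalize
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)  # prepend 0

    u = torch.rand(weights.shape[0], n_fine)                        # uniform draws
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)

    # Linearly interpolate a depth inside the selected bin.
    cdf_lo, cdf_hi = cdf.gather(-1, idx - 1), cdf.gather(-1, idx)
    bin_lo, bin_hi = bins.gather(-1, idx - 1), bins.gather(-1, idx)
    t = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-8)
    return bin_lo + t * (bin_hi - bin_lo)                           # (n_rays, n_fine)
```

The fine network is then evaluated at the union of the coarse and fine samples, and both networks are optimized together with the rendering loss.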
The future is promising
NeRF addresses these past shortcomings by using an MLP to represent objects and scenes as continuous functions, and it produces better renderings than previous methods.
However, NeRF also faces technical bottlenecks: for example, NeRF accelerators achieve lower latency, more interactive environments, and shorter training times by sacrificing other relatively useful features, such as flexibility.
Therefore, while NeRF is a key breakthrough, achieving perfect results will still take some time.
Technology is advancing, and the future remains promising!
References:
https://metaphysic.ai/nerf-successor-deepfakes/
https://arxiv.org/pdf/2003.08934.pdf