NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”, ECCV, 2020.
Quick Summary
The paper introduces a new method for novel view synthesis. The authors show that a fully connected (non-convolutional) neural network can “learn” the representation of a complex scene. The input to the network is a 5D coordinate that defines a spatial position (x, y, z) and viewing direction (\(\theta\), \(\phi\)), and the output is the color (view-dependent emitted radiance (r, g, b)) and volume density (\(\sigma\)). Essentially, for a given scene, a set of images with known camera poses is used to train the network to “memorize” the scene fully (deliberately over-fit it), which means the network embeds a latent description of the scene in its parameters. Thus, when it is queried along a new camera pose (5D positions and view directions), it outputs the emitted radiance (color) and volume density (similar to opacity) for the novel view. The authors then use classic volume rendering to project these outputs into a 2D image.
Ideas, Approach and Results
The algorithm works similarly to a classic ray tracer. A ray is marched from the known camera position along the viewing direction into the scene (which is assumed to be enclosed inside a bounded box), and points are evenly sampled along the ray. The MLP then outputs a color and volume density for each point. The density value ranges from \(0\) to \(\infty\), from fully transparent to fully opaque. The outputs from the MLP are then composited along the ray, from the far end of the scene towards the camera, to render the final 2D image. This rendered image is compared against the ground-truth image (from the original camera) using a squared L2 loss; since the whole pipeline is differentiable, the network can be trained end to end.
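To make the compositing step concrete, here is a minimal NumPy sketch for a single ray, assuming the MLP has already produced densities and colors at the sampled points; the function names are illustrative, not from the authors' code.

```python
import numpy as np

def composite_ray(sigmas, rgbs, t_vals):
    """Discretized volume rendering along a single ray.

    sigmas: (N,) densities predicted by the MLP at the N sample points
    rgbs:   (N, 3) colors predicted at the same points
    t_vals: (N,) distances of the samples along the ray
    """
    # Distances between adjacent samples; the last interval is effectively open-ended.
    deltas = np.append(np.diff(t_vals), 1e10)
    # Opacity of each interval: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Accumulated transmittance T_i = prod_{j < i} (1 - alpha_j)
    trans = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))
    weights = trans * alphas                     # contribution of each sample to the pixel
    color = (weights[:, None] * rgbs).sum(axis=0)
    return color, weights                        # weights are reused for hierarchical sampling

# Training signal: squared L2 loss between rendered and ground-truth pixel colors.
def l2_loss(rendered, ground_truth):
    return np.sum((rendered - ground_truth) ** 2)
```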
The 5D input is actually a 3D location \(x\) = (x, y, z) and a 2D viewing direction (\(\theta\), \(\phi\)), which in practice is expressed as a 3D Cartesian unit vector \(d\). The MLP itself has 9 layers: the first 8 layers (ReLU activations, 256 channels each) process \(x\) and output the density \(\sigma\) together with a 256-dimensional feature vector; this feature vector, concatenated with the viewing direction, is fed to a final 128-channel layer that produces the output color. Conditioning on the viewing direction allows the model to “learn” non-Lambertian effects - reflections off the same surface appear different from different viewing angles.
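Below is a rough PyTorch sketch of this architecture as described above; the layer widths follow the paper, but the class and attribute names are my own, and the paper's skip connection (which re-injects the encoded position partway through the trunk) is omitted for brevity. The input dimensions assume the positional encodings discussed further below (60-dim for position, 24-dim for direction).

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Sketch of the 9-layer MLP: encoded position -> (density, feature),
    then feature + encoded view direction -> color."""

    def __init__(self, pos_dim=60, dir_dim=24, width=256):
        super().__init__()
        # 8 fully connected ReLU layers of width 256 that process the encoded position.
        layers = [nn.Linear(pos_dim, width), nn.ReLU()]
        for _ in range(7):
            layers += [nn.Linear(width, width), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.sigma_head = nn.Linear(width, 1)      # volume density (view-independent)
        self.feature = nn.Linear(width, width)     # 256-dim feature vector
        # Final 128-channel layer that mixes the feature with the viewing direction.
        self.color_head = nn.Sequential(
            nn.Linear(width + dir_dim, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid(),       # RGB in [0, 1]
        )

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = torch.relu(self.sigma_head(h))     # density must be non-negative
        rgb = self.color_head(torch.cat([self.feature(h), d_enc], dim=-1))
        return rgb, sigma
```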
The authors also introduce two optimizations that greatly improve performance.
Hierarchical Volume Sampling - Instead of sampling evenly along the ray only once, the authors use two-pass rendering: a coarse network first outputs an initial estimate of the volume densities at evenly sampled points, and then the regions around the volumetrically dense points are sampled again and evaluated by a second, fine network for a finer estimate. This allows for more efficient sample allocation both for training the MLP and for volumetric rendering; a sketch of the resampling step follows below.
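Here is a small NumPy sketch of that resampling step, assuming we already have the per-sample compositing weights from the coarse pass (e.g., the weights returned by the compositing sketch above); it is standard inverse-transform sampling from the piecewise-constant distribution that the weights define, and the function name is illustrative.

```python
import numpy as np

def sample_fine(t_vals, weights, n_fine, rng=None):
    """Draw extra sample locations along the ray, concentrated where the
    coarse network placed high volume density (large compositing weights)."""
    if rng is None:
        rng = np.random.default_rng()
    # Normalize coarse weights into a PDF over the intervals between samples.
    pdf = weights[:-1] + 1e-5                  # small epsilon avoids empty bins
    pdf = pdf / pdf.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    # Inverse-transform sampling: uniform draws mapped through the inverse CDF.
    u = rng.uniform(size=n_fine)
    idx = np.searchsorted(cdf, u, side="right") - 1
    # Place each new sample uniformly inside its chosen interval.
    lo, hi = t_vals[idx], t_vals[idx + 1]
    frac = (u - cdf[idx]) / np.maximum(cdf[idx + 1] - cdf[idx], 1e-10)
    return lo + frac * (hi - lo)
```

The fine network is then evaluated at the union of the coarse and the newly drawn sample locations, and the final pixel color is composited from all of them.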
Positional Encoding - The authors find that the naive implementation of directly feeding the 5D coordinates to the network performs poorly on high-frequency variation in color and geometry. Thus, they map each input coordinate to a higher-dimensional space using sinusoids - from \(\mathbb{R}\) to \(\mathbb{R}^{2L}\) - before passing it into the network. In a later paper, they analyze why this is the case, drawing connections to the Neural Tangent Kernel (NTK), and show that random Fourier features can work even better than these positional encodings [1].
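For concreteness, a minimal sketch of this encoding applied independently to each scalar coordinate; the paper uses \(L = 10\) for the spatial position and \(L = 4\) for the viewing direction.

```python
import numpy as np

def positional_encoding(p, L):
    """Map each scalar coordinate to sines and cosines at frequencies
    2^0*pi, 2^1*pi, ..., 2^(L-1)*pi (the ordering here groups all sines
    before all cosines per coordinate, which the MLP is indifferent to).

    p: (..., D) array of coordinates (D = 3 for position or direction)
    Returns an (..., 2*L*D) array.
    """
    freqs = (2.0 ** np.arange(L)) * np.pi          # 2^0 pi, ..., 2^(L-1) pi
    angles = p[..., None] * freqs                  # shape (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)

# L = 10 for the 3D position and L = 4 for the viewing direction give
# 60- and 24-dimensional encodings respectively.
```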
The authors report significantly higher PSNR and SSIM than previous techniques such as SRN, NV and LLFF on both real and synthetic images. The supplementary materials and online videos show the differences in performance much more clearly [2].
Comparison, Strengths and Weaknesses
The paper shows that a neural network can be trained to “memorize” a scene and then queried to synthesize novel views. The only inputs needed are the 5D coordinates derived from the camera poses and the corresponding ground-truth images. The authors also show that concatenating the viewing direction after a few layers improves performance on realistic non-Lambertian materials. The hierarchical volume sampling improves sample allocation, which is significant for such data-hungry networks. Lastly, the positional encoding of the 5D coordinates into a higher-dimensional space greatly improves performance on scenes with high-frequency variations in color and geometry.
In spite of the novel use of an MLP to learn and render novel radiance fields, the method has some weaknesses. First, the time needed to train and over-fit the network is quite long, and since each scene requires a new network to be trained, the method cannot be directly used for real-time view synthesis. Second, the need to sample twice and train both a coarse and a fine network could also be a bottleneck for practical real-time use. Nevertheless, the paper shows significant performance improvements over previous methods such as SRN (also a neural scene representation technique).
Questions/Issues
I am curious about the theory behind why positional encodings - mapping the 5D coordinates into a higher-dimensional space using just sinusoidal harmonics - improve the performance to such an extent. Neural networks are known to be “universal approximators”, yet somehow they struggle to learn high-frequency variations in color and geometry. Here, the authors use just 9 layers. Can increasing the depth of the network mitigate this to some extent? Additionally, the authors use a basic MLP. Would other architectures, such as a 1D CNN (over the sinusoidal harmonics), provide any benefit?
References
[1] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, Ren Ng, “Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains”, NeurIPS, 2020.
[2] Jon Barron - Understanding and Extending Neural Radiance Fields, Link, Accessed on March 26, 2025.