GIRAFFE - Controllable 3D scenes

Problems With Image Generation

While GANs have shown great success in synthesizing realistic images, they have only shown moderate success in controlling the synthesis process. One issue is that even when a GAN is trained to control specific attributes (like shape or size), changing one attribute is not decoupled from changes in the others. This coupling is sometimes a good thing: when you are trying to generate controlled realistic images and you forcibly insert a door, the GAN will enforce relationships between objects, i.e. the style of the door will match the scene and it will be placed on a wall rather than in the sky or in the middle of nature (Bau et al.).

This is why much work is being done on disentangling the features of generated images: generation should be controllable. 2D-based GANs typically do not disentangle features at all. GAN-based models like GAN-Control (a state-of-the-art controllable GAN method based on 3D features) disentangle explicit attributes well, but they do not disentangle individual objects, or the foreground from the background, because these are not explicit attributes.

For example, in this demonstration the age of a person is being changed, but so are the length of the hair, the type of earrings, and the background of the image, since these are not explicit attributes the model was trained to control. Implicit attributes therefore remain entangled with explicit ones.

GIRAFFE is a learning-based image synthesis model whose primary purpose is to resolve some of these entanglement limitations by treating images as three-dimensional scenes composed of individual objects. This way, each object can be edited individually without altering the rest of the scene.

This demonstration showcases the capabilities of a typical 2D GAN, and here are the capabilities of GIRAFFE. As you can see, moving one object in the GIRAFFE demo is completely disentangled from the other objects in the scene. Also noticeably, the 2D GAN's objects all change in shape and color whenever anything in the scene changes position.

GIRAFFE

NeRFs

At the highest level, the idea behind GIRAFFE is that each object is represented as a Neural Radiance Field (NeRF). A NeRF is a function mapping a 3D point $(x, y, z) \in \mathbb{R}^3$ and a viewing direction $(\theta, \phi) \in \mathbb{S}^2$ to an RGB color $\in \mathbb{R}^3$ and a volume density $\in \mathbb{R}^+$. The viewing direction can be thought of as the pair of angles (roughly, the horizontal and vertical angles) of a sphere described in spherical coordinates, as in the figure below.

polar angle and azimuthal angle (credit to Akihiro Oyamada)

You can convince yourself that by choosing some $\theta, \phi$ pair, you can draw a ray angled by $\theta$ and $\phi$ from the center of the sphere toward any 3D point in the space. The important idea here is that the $(x, y, z)$ points are sampled along such rays, but the rays used to sample points for a NeRF originate at various camera poses rather than at the sphere's center. To summarize, a NeRF takes in these $(x, y, z, \theta, \phi)$ coordinates and produces a color and a volume density at that coordinate. The NeRF function is learned as an MLP network.
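To make the geometry concrete, here is a minimal NumPy sketch of how a viewing direction and a camera origin yield the 3D points fed into a NeRF. The function name, near/far bounds, and sample count are illustrative assumptions, not values from the paper:

```python
import numpy as np

def ray_points(origin, theta, phi, near=0.5, far=6.0, n_samples=64):
    """Sample 3D points along a ray given spherical viewing angles.

    Illustrative sketch: theta is the polar angle, phi the azimuthal angle,
    and the ray starts at a camera origin rather than the sphere's center.
    """
    # Convert the (theta, phi) direction to a unit vector in Cartesian coordinates.
    direction = np.array([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.cos(theta),
    ])
    # Evenly spaced depths between the near and far bounds of the scene.
    depths = np.linspace(near, far, n_samples)
    # Each sampled point is origin + depth * direction; shape (n_samples, 3).
    points = origin[None, :] + depths[:, None] * direction[None, :]
    return points, direction

# Example: one ray cast from a camera placed at (2, 0, 0).
pts, d = ray_points(np.array([2.0, 0.0, 0.0]), theta=np.pi / 3, phi=np.pi)
```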

The volume density can be thought of as the (differential) probability of a ray terminating at a given $(x, y, z)$ location. In simpler terms, it indicates how likely it is that something occupies that location - that there is some volume of material there.

Finally, the NeRF paper also proposed a positional encoding for these $(x, y, z, \theta, \phi)$ coordinates: $\gamma(t, L) = (\sin(2^0 t \pi), \cos(2^0 t \pi), \ldots, \sin(2^L t \pi), \cos(2^L t \pi))$, where $t$ is an input (the encoding is applied elementwise to the coordinates, so $t$ is $x, y, z, \theta,$ or $\phi$) and $L$ is a hyperparameter choosing how many frequencies are considered. This encoding looks very similar to a Fourier series, which decomposes a complex signal into a linear combination of simple sines and cosines. The encoding is beneficial because it lets the model pick out whichever frequency components carry the important information: highly variable textures or geometries can be captured by the high-frequency terms, while coarse structure can be captured by the low-frequency ones.
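As a small illustration of the encoding above (the function name and example values are just for demonstration):

```python
import numpy as np

def positional_encoding(t, L=10):
    """Encode a scalar coordinate t as sin(2^0 t pi), cos(2^0 t pi), ..., sin(2^L t pi), cos(2^L t pi).

    Applied elementwise to each of x, y, z, theta, phi; L controls how many
    frequency bands the downstream MLP can attend to.
    """
    encoded = []
    for k in range(L + 1):
        encoded.append(np.sin((2 ** k) * t * np.pi))
        encoded.append(np.cos((2 ** k) * t * np.pi))
    return np.array(encoded)

# A coordinate value of 0.37 encoded with 4 frequency bands -> 10 numbers.
print(positional_encoding(0.37, L=4))
```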

Generative NeRFs

GIRAFFE is based on NeRFs, but the variant it uses is a little different. The inputs to the NeRFs representing objects in GIRAFFE also include latent codes for each object. The reason is that GIRAFFE's variant of NeRF is not trained on posed images of a single scene; instead, it is trained on typical datasets of unposed images. These latent codes are similar to the latent vectors used in typical 2D GANs (e.g. they are drawn from normal distributions), except that they are more interpretable: there are two specific codes per object, one responsible for shape and one responsible for appearance. Thus, in GIRAFFE, the MLP responsible for generating the volume density and color takes in not only the 5D coordinates but also the latent codes with which to generate an object. Look here to see how changing the appearance code changes a generated image; notice how the shape of each generated image stays the same despite the change in appearance. Moreover, the NeRF construction is generalized in GIRAFFE: instead of outputting a color vector ($\mathbb{R}^3$), the MLP outputs a general feature vector. This feature vector is not directly interpretable, and its size is a hyperparameter that can be tuned.
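The PyTorch sketch below shows what such a conditioned MLP might look like. The layer sizes and module names are assumptions for illustration rather than GIRAFFE's exact architecture, but it captures the key idea: the shape code conditions the density branch, while the appearance code (together with the viewing direction) conditions the feature branch.

```python
import torch
import torch.nn as nn

class GenerativeNeRF(nn.Module):
    """Illustrative sketch of an object MLP conditioned on latent codes.

    Inputs: positionally encoded point and viewing direction, plus a shape
    code and an appearance code. Outputs: a volume density and a feature vector.
    """
    def __init__(self, pos_dim, dir_dim, z_dim=256, hidden=128, feat_dim=128):
        super().__init__()
        # Density depends on the encoded point and the shape code only.
        self.density_net = nn.Sequential(
            nn.Linear(pos_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        # Features additionally depend on the viewing direction and appearance code.
        self.feature_head = nn.Sequential(
            nn.Linear(hidden + dir_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, enc_pos, enc_dir, z_shape, z_appearance):
        h = self.density_net(torch.cat([enc_pos, z_shape], dim=-1))
        sigma = torch.relu(self.density_head(h))  # volume density >= 0
        feat = self.feature_head(torch.cat([h, enc_dir, z_appearance], dim=-1))
        return sigma, feat
```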

Object Representation

Typically, a single NeRF is used to represent an entire scene, but GIRAFFE decouples objects by creating a separate NeRF for each object. These NeRFs share weights, however.

Each object also has a transformation associated with it, which enables control of that object. This control is applied through affine transformations (a generalization of linear transformations), i.e. some combination of translation, scaling, and rotation. These transformations relate the object space - a coordinate system from the object's point of view (think of a space centered at the center of the object, or perhaps its lower rear-left corner) - to the scene space, in which a single fixed coordinate system is shared by all objects. Each NeRF is evaluated in its object space, but every object's rendering is done in the scene space, as we will shortly see.
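As a sketch, suppose an object's pose is given by a scale $s$, a rotation $R$, and a translation $t$, so the affine map is $k(x) = R \, \mathrm{diag}(s) \, x + t$. Evaluating the object's NeRF at a scene-space point then amounts to mapping that point back into object space with the inverse transform. The helper below is illustrative, not the paper's code:

```python
import numpy as np

def scene_to_object(x_scene, scale, rotation, translation):
    """Map a scene-space point into an object's own coordinate frame.

    If the object's pose is k(x) = R @ diag(s) @ x + t, the object's NeRF is
    queried at k^{-1}(x), so the MLP always sees a canonical, object-centered space.
    """
    # Undo translation, then rotation, then scaling.
    x = x_scene - translation
    x = rotation.T @ x    # R is orthonormal, so its inverse is its transpose
    x = x / scale         # elementwise inverse of diag(s)
    return x

# Example: an object shifted by (1, 0, 0) and uniformly scaled by 2.
identity = np.eye(3)
print(scene_to_object(np.array([3.0, 0.0, 0.0]),
                      np.array([2.0, 2.0, 2.0]),
                      identity,
                      np.array([1.0, 0.0, 0.0])))   # -> [1. 0. 0.]
```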

Scene Composition

Once the volume densities and feature vectors for each of the objects have been computed, the objects must be placed in the same scene. As a quick note, the background is treated as just another object in this framework, and any transformation applied to it is simply applied over the whole scene space. To compose objects together, GIRAFFE takes, at each 5D coordinate, the mean of the objects' feature vectors weighted by their volume densities, and sums the volume densities across all objects.
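Concretely, writing $\sigma_i$ and $\mathbf{f}_i$ for the density and feature that object $i$'s NeRF produces at a given 5D coordinate, the composition amounts to

$$\sigma = \sum_{i=1}^{N} \sigma_i, \qquad \mathbf{f} = \frac{1}{\sigma} \sum_{i=1}^{N} \sigma_i \mathbf{f}_i,$$

so each object contributes to the scene's features only in proportion to its own density at that point.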

Rendering

Given this composition, a feature representation of each pixel of the (low-resolution) image is computed by a volume renderer. This is a non-learned method in which the densities of the samples along each pixel's ray are converted into alpha (opacity) values, and the samples' features are alpha-composited into a single feature vector per pixel.
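For samples $j = 1, \ldots, N_s$ along a pixel's ray, with composed densities $\sigma_j$, features $\mathbf{f}_j$, and spacings $\delta_j$ between neighboring samples, this is the standard NeRF-style quadrature:

$$\mathbf{f} = \sum_{j=1}^{N_s} \tau_j \, \alpha_j \, \mathbf{f}_j, \qquad \alpha_j = 1 - e^{-\sigma_j \delta_j}, \qquad \tau_j = \prod_{k=1}^{j-1} (1 - \alpha_k),$$

where $\alpha_j$ is the alpha value of sample $j$ and $\tau_j$ is the transmittance, i.e. the probability that the ray reaches sample $j$ without being blocked earlier.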

The volume rendering produces a feature image at a very small resolution. This is done for speed, and because neural-network upsampling has produced good results in the past. Thus, the last step in generating a full-sized image (with RGB colors instead of generic features) is to upsample the feature image. This is done with a fairly typical CNN, though the network is somewhat interesting in that it uses residual connections: intermediate RGB outputs at different resolutions are scaled up with bilinear interpolation and added to the predictions of later layers.
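A rough PyTorch sketch of such an upsampling network is below; the channel counts, activations, and block structure are assumptions for illustration rather than the exact architecture from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralRenderer(nn.Module):
    """Sketch of a 2D upsampling CNN with bilinear residual RGB connections.

    A low-resolution feature image is repeatedly upsampled and refined with
    convolutions; at each resolution an RGB image is predicted, and the
    previous (coarser) RGB estimate is bilinearly upsampled and added to it.
    """
    def __init__(self, feat_dim=128, n_blocks=4):
        super().__init__()
        dims = [feat_dim // (2 ** i) for i in range(n_blocks + 1)]  # e.g. 128, 64, 32, 16, 8
        self.convs = nn.ModuleList(
            [nn.Conv2d(dims[i], dims[i + 1], 3, padding=1) for i in range(n_blocks)])
        self.to_rgb = nn.ModuleList(
            [nn.Conv2d(dims[i + 1], 3, 3, padding=1) for i in range(n_blocks)])

    def forward(self, feat):
        rgb = None
        for conv, head in zip(self.convs, self.to_rgb):
            # Double the spatial resolution, then refine with a convolution.
            feat = F.interpolate(feat, scale_factor=2, mode="bilinear", align_corners=False)
            feat = F.leaky_relu(conv(feat), 0.2)
            new_rgb = head(feat)
            if rgb is not None:
                # Residual connection: carry the coarser RGB estimate upward.
                new_rgb = new_rgb + F.interpolate(rgb, scale_factor=2, mode="bilinear",
                                                  align_corners=False)
            rgb = new_rgb
        return torch.sigmoid(rgb)

# A 16x16 feature image becomes a 256x256 RGB image after four doublings.
img = NeuralRenderer()(torch.randn(1, 128, 16, 16))
```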

Wrapping Up

Altogether, GIRAFFE works as a generator, so the entire process just described is paired with a discriminator in a GAN, where the generator produces images and the discriminator tries to determine whether an image is real (from the original dataset) or generated. To train this, the non-saturating GAN objective is used. The difference from the typical GAN objective is that the typical (minimax) objective has the generator minimize the probability that the discriminator classifies generated images as fake, while the non-saturating objective has the generator maximize the probability that generated images are classified as real; this yields larger gradients, faster convergence, and more stable training. The evaluation metric used is FID, which compares the distribution of Inception v3 features (from a network pretrained on ImageNet) of generated images against that of real images - the lower the distance, the better. By this metric, GIRAFFE is a strong performer relative to other 3D-aware generation models.
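Working with raw discriminator logits, the non-saturating loss is commonly implemented as follows (a sketch, not GIRAFFE's exact training code; the paper additionally regularizes the discriminator with an R1 gradient penalty, which is omitted here):

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits):
    """Non-saturating generator loss: maximize log D(G(z)).

    softplus(-x) = -log(sigmoid(x)), so this equals -E[log D(G(z))]; unlike the
    original minimax loss E[log(1 - D(G(z)))], its gradients stay large even
    when the discriminator easily rejects the generated images.
    """
    return F.softplus(-d_fake_logits).mean()

def discriminator_loss(d_real_logits, d_fake_logits):
    """Standard binary logistic discriminator loss on raw logits."""
    return F.softplus(-d_real_logits).mean() + F.softplus(d_fake_logits).mean()

# Example with dummy logits for a batch of 8 images.
g_loss = generator_loss(torch.randn(8))
d_loss = discriminator_loss(torch.randn(8), torch.randn(8))
```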

In conclusion, GIRAFFE was successful in disentangling different objects during image generation and in allowing some level of control over individual objects. However, there is still much more work to be done. For example, the latent codes used in the model are uninterpretable (beyond the split into shape and appearance), so it is not possible to choose a code that modifies a given object in a specific, predictable way. Providing even finer control over how images are generated is one direction for further improvement.

References

Akihiro Oyamada. camera-controls. https://github.com/yomotsu/camera-controls.

Bau, D., Zhu, J.Y., Strobelt, H., Lapedriza, A., Zhou, B., & Torralba, A. (2020). Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences.

Niemeyer, M., & Geiger, A. (2021). GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).