Mohanned Elkholy

Introduction

This article discusses a different approach to image classification: converting images into sequential data and using transformers for the classification. Transformers [1] have seen many improvements since Google published the transformer paper in 2017, and they have lately become the backbone of almost all state-of-the-art NLP models because they are easier to train and more scalable. The article discusses how the power of transformers can be used to achieve high accuracy on image classification. It first goes through a quick review of language transformers. It then explains how these transformers are adapted to image classification problems (since images and text have different representations, and converting one representation to the other is not a straightforward task). After that, the article discusses why transformers can surpass CNNs and how the internal representations of transformers are more robust.

Figure 1

A quick recap of transformer components

Positional encoding

A positional encoding is a function that takes an input index and maps it onto a non-linear curve, which gives the model a better sense of where each position is in the sequence. Positional encoding is important because the attention layers themselves have no notion of order. The original paper chose a sinusoidal function; a learned positional embedding was also tried, but it did not affect performance much. This function is applied at the very beginning of the transformer. The positional encoding is defined as:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where pos is the position, i is the dimension index, and d_model is the embedding size.
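To make the formula concrete, here is a minimal NumPy sketch of the sinusoidal encoding (the function name and shapes are illustrative, not taken from any particular library):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Build the (seq_len, d_model) sinusoidal positional encoding table."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                            # odd dimensions use cos
    return pe

# Example: encodings for a 16-token sequence with 64-dimensional embeddings.
print(sinusoidal_positional_encoding(16, 64).shape)  # (16, 64)
```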

Why do transformers use attention?

In theory, an MLP (multi-layer perceptron) is a universal approximator. In practice, however, it does not converge to appealing results on its own. The reason is that training can converge to many different solutions, and the model has no guidance about which part of the solution space yields the best results. It needs self-guidance, which can also be called self-attention. The intuition behind this layer is that the model should pay attention not to all the inputs equally, but to the specific inputs that matter for the learning process. The way it pays attention is itself learnable and is trained along with the other neural network parameters.

As a brief explanation of self-attention, it is a layer that takes three matrices as input: Query, Key, and Value. The Query matrix is multiplied by the transpose of the Key matrix, the result is scaled by the square root of the key dimension, and the output passes through the softmax activation function. This creates the attention map: a matrix with one row per query and one column per key, whose values all lie between 0 and 1 and whose rows each sum to 1. The attention map is then multiplied with the Value matrix, so the model pays closer attention to the values with higher attention weights. This is how self-attention is achieved. Self-attention is explained better in figure 2.

Figure 2
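As a reference point, here is a minimal NumPy sketch of this scaled dot-product attention, assuming Q, K, and V are already the projected matrices described above:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention map itself."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n_queries, n_keys)
    attention_map = softmax(scores, axis=-1)     # each row sums to 1
    return attention_map @ V, attention_map

# Example: 4 tokens with 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (4, 8) (4, 4)
```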

Attention is important because the attention map contains information about the relation between every possible pair of positions in the input. However, since it is a quadratic operation, it requires a lot of memory, so computing it over a long sentence is a hardware challenge.

Why not just one attention head?

Attention can be multi-headed by passing the input through multiple linear projections and computing attention for every head independently. This is useful because concatenating the outputs of all the heads enriches the representation, since every head may focus on different information. In the decoder, masked attention is used so that the model can only attend to previous positions, which works as an alternative to the notion of history in RNNs. It is visualized in figure 3.

Figure 3
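Building on the scaled_dot_product_attention sketch above, a minimal multi-head variant could look like the following (the weight matrices are assumed to be learned parameters, and the names are illustrative):

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Project X into per-head queries, keys, and values, attend per head,
    concatenate the head outputs, and apply a final output projection.
    W_q, W_k, W_v, W_o are assumed to be (d_model, d_model) learned matrices."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)            # this head's slice of dims
        out, _ = scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s])
        heads.append(out)
    return np.concatenate(heads, axis=-1) @ W_o            # (n, d_model)
```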

Simple classification model

In the end, the resulting representation passes through a regular feedforward network that produces the classification output.
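As a rough sketch of such a head (a single linear layer followed by a softmax, with hypothetical learned parameters W and b):

```python
import numpy as np

def classification_head(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map a d_model-dimensional representation x to class probabilities,
    assuming W has shape (d_model, n_classes) and b has shape (n_classes,)."""
    logits = x @ W + b
    logits = logits - logits.max()                   # numerical stability
    return np.exp(logits) / np.exp(logits).sum()     # softmax over classes
```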

Why language transformers?


The reason transformers have been adopted so widely lately is that they only use feedforward and attention layers, which can be parallelized easily, in contrast with previous sequence models such as RNNs, whose recurrent units cannot be parallelized. Therefore, transformers are more practical to train.


Vision transformers

Why vision transformers?

Convolutional neural networks have been widely used for computer vision tasks over the last few years because of how effective they are at vision problems. The advantage of convolutional neural networks is their inductive biases, such as locality, the two-dimensional neighborhood structure, and translation equivariance. These inductive biases are useful when training on small datasets, since they guide the model. However, a model with fewer inductive biases will generalize better, although it will require more data to do so. That is where vision transformers come into play.

Vision transformers [3] lack some of the inductive biases inherent to CNNs, such as locality and translation equivariance, so on small datasets they do not generalize well. However, given a big enough dataset, the absence of these built-in assumptions lets the model learn its own patterns and generalize even better than CNNs.

Why isn’t using naive attention in vision transformers a straightforward process?

One challenge with transformers is that attention is a quadratic operation. Computing attention over, for instance, a 256×256 feature map in an intermediate layer would produce an attention map of dimension 65536×65536 (about 4.3×10^9 entries), which is obviously a hardware challenge. Therefore, computing attention at the pixel level is not feasible.
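A quick back-of-the-envelope calculation makes the problem concrete (assuming one token per pixel and float32 entries; the numbers are only illustrative):

```python
tokens = 256 * 256               # one token per pixel of a 256x256 feature map
entries = tokens ** 2            # 65536 x 65536 ≈ 4.3e9 attention entries
bytes_per_entry = 4              # float32
print(entries * bytes_per_entry / 1e9, "GB")  # ≈ 17 GB for a single attention map
```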

One way around it is to divide the image not at the pixel level but at the patch level, so that far fewer input tokens are fed to the transformer.

How is the image converted to a sentence for the transformer?

The vision transformer divides the image into patches of P×P pixels each. The idea is to reshape the image x ∈ R^(H×W×C) into a sequence of flattened 2D patches x_p ∈ R^(N×(P²·C)), where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P² is the resulting number of patches. Figure 4 illustrates this with 9×9 patches.

Figure 4
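Here is a minimal NumPy sketch of that reshaping step, assuming the image height and width are divisible by the patch size:

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Reshape an (H, W, C) image into (N, P*P*C) flattened patches, N = H*W / P^2."""
    H, W, C = image.shape
    P = patch_size
    patches = image.reshape(H // P, P, W // P, P, C)   # split height and width into blocks
    patches = patches.transpose(0, 2, 1, 3, 4)         # (H/P, W/P, P, P, C)
    return patches.reshape(-1, P * P * C)              # (N, P*P*C)

# Example: a 224x224 RGB image with 16x16 patches gives 196 patches of length 768.
img = np.zeros((224, 224, 3))
print(image_to_patches(img, 16).shape)  # (196, 768)
```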

Projecting and embedding the image sentence

Once we have the image patches, we need to transform them into sequential data, since transformers expect a sequence of token vectors as input. To do that, two essential steps are needed:

1- Having a positional embedding

2- Converting the image patch into a vector

In the linear projection layer, each image patch is converted into a vector; this layer is a simple feed-forward layer. The output of this layer is then combined with a learned positional embedding looked up by the patch index (in the ViT paper the positional embedding is added to the patch embedding), and a learnable classification token is prepended to the sequence. The result of these operations is a sequence of vectors that can be used as the input to an ordinary transformer. The projection and the embedding are shown in figure 5.

Figure 5
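Putting the last two steps together, a minimal sketch of this input pipeline could look like the following (W_proj, pos_embed, and cls_token are assumed to be learned parameters; the names are illustrative):

```python
import numpy as np

def embed_patches(patches, W_proj, pos_embed, cls_token):
    """Project flattened patches, prepend a classification token, add positions.
    patches  : (N, P*P*C) flattened image patches
    W_proj   : (P*P*C, d_model) linear projection matrix
    pos_embed: (N + 1, d_model) learned positional embeddings
    cls_token: (1, d_model) learnable classification token"""
    tokens = patches @ W_proj                             # (N, d_model) patch embeddings
    tokens = np.concatenate([cls_token, tokens], axis=0)  # (N + 1, d_model)
    return tokens + pos_embed                             # add positional information

# Example: 196 patches of length 768 projected to d_model = 256.
rng = np.random.default_rng(0)
seq = embed_patches(rng.normal(size=(196, 768)), rng.normal(size=(768, 256)),
                    rng.normal(size=(197, 256)), rng.normal(size=(1, 256)))
print(seq.shape)  # (197, 256)
```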

What would a trained transformer tell us?

- Transformers are data hungry

Transformers require more data to surpass the performance of CNNs. In the experiments reported in the paper, ViT-L/16 (one variant of the vision transformer) performs much worse than ResNet152x2 when pre-trained on 9M data points, but performs better once 90M+ data points are available. This result supports the hypothesis that the convolutional inductive bias is useful for small datasets, while for larger ones it is better to let the model learn the relevant patterns on its own.

- Transformers can see the whole image at first glance

One additional advantage of transformers is that they can relate far-away pixels in the very first layers, because attention can be global from the start. One way this was measured was with the mean attention distance: for each attention head, the image-space distance between a query and the positions it attends to is averaged, weighted by the attention weights, and the result is then averaged over 5000 data points.
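A plausible implementation of this metric for a single head, under the assumption that each token has a known 2D position in the image, is sketched below:

```python
import numpy as np

def mean_attention_distance(attn: np.ndarray, coords: np.ndarray) -> float:
    """Mean attention distance for one attention head.
    attn  : (N, N) attention weights for one head (each row sums to 1)
    coords: (N, 2) spatial (row, col) position of each token in the image"""
    # Pairwise Euclidean distances between all token positions.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # (N, N)
    # Attention-weighted distance per query, averaged over all queries.
    return float((attn * dists).sum(axis=-1).mean())
```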

As shown in figure 6, some heads in the very first layers already have a high mean attention distance. This means that far-apart pixels can interact even in the earliest layers, a behavior that CNNs do not exhibit because their early receptive fields are small.

Figure 6

Layer representations in transformers are very similar to each other


Analyzing the layer representations of neural networks is challenging, because comparing two layers with different numbers of neurons is not straightforward. One method that can be used is Centered Kernel Alignment (CKA). CKA is chosen because it is invariant to orthogonal transformations of the representations (including permutations of neurons), and its normalization term ensures invariance to isotropic scaling; these properties make it a meaningful way to compare the hidden representations of neural networks. Specifically, CKA takes as input X ∈ R^(m×p1) and Y ∈ R^(m×p2), the representations of two layers with p1 and p2 neurons respectively, evaluated on the same m examples. Letting K = XX^T and L = YY^T denote the Gram matrices of the two layers, CKA computes:

CKA(K, L) = HSIC(K, L) / sqrt(HSIC(K, K) · HSIC(L, L))

where HSIC is the Hilbert-Schmidt independence criterion. Given the centering matrix H = I − (1/m)·11^T and the centered Gram matrices K' = HKH and L' = HLH, HSIC(K, L) = tr(K'L')/(m−1)², a similarity between the centered Gram matrices. HSIC measures the statistical dependence between the two sets of representations through their Hilbert space embeddings; you can read more about it in [4]. Figure 7 shows these CKA similarities between layers for both ViT and ResNet [2]; it is clear that the ViT layers are much more similar to each other.

Figure 7
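A minimal NumPy sketch of this linear CKA computation (a simplified version of the metric used in [2], with illustrative names) could look like:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between representations X: (m, p1) and Y: (m, p2)
    evaluated on the same m examples."""
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m                       # centering matrix
    K, L = X @ X.T, Y @ Y.T                                    # Gram matrices
    hsic = lambda A, B: np.trace(A @ H @ B @ H) / (m - 1) ** 2
    return float(hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L)))

# Example: compare two random 100-example representations with 64 and 128 neurons.
rng = np.random.default_rng(0)
print(linear_cka(rng.normal(size=(100, 64)), rng.normal(size=(100, 128))))
```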


When should you choose transformers over CNNs?

The choice of model should depend on the problem itself; each model has its own pros and cons. If the dataset is relatively small, the inductive biases of CNNs are useful, so a CNN is the better choice. If the dataset is relatively big, transformers are the better choice, because they generalize better on big datasets. Another advantage reported in the ViT paper is that, at that scale, transformers can reach comparable accuracy with fewer computational resources for pre-training, which makes them attractive for large-dataset problems.

References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv:1706.03762. https://arxiv.org/abs/1706.03762

[2] Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., & Dosovitskiy, A. (2021). Do vision transformers see like convolutional neural networks? arXiv:2108.08810. https://arxiv.org/pdf/2108.08810.pdf

[3] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. https://arxiv.org/pdf/2010.11929

[4] Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., & Smola, A. J. (2007). A kernel statistical test of independence. In Advances in Neural Information Processing Systems (NIPS), 20, 585–592.