The Low-Rank Simplicity Bias in Deep Networks

Minyoung Huh1
Hossein Mobahi2
Richard Zhang3
Brian Cheung1,4
Pulkit Agrawal1
Phillip Isola1

2Google Research
3Adobe Research


Modern deep neural networks are highly over-parameterized compared to the data on which they are trained, yet they often generalize remarkably well. A flurry of recent work has asked: why do deep networks not overfit to their training data? We investigate the hypothesis that deeper nets are implicitly biased to find lower rank solutions, and these are solutions that generalize well. We prove that the percent volume of low effective-rank solutions increases monotonically as linear neural networks are made deeper. We empirically find that a similar result holds for non-linear networks: deeper non-linear networks learn a feature space whose kernel has a lower rank. We then demonstrate how linear over-parameterization of deep non-linear models can be used to induce low-rank bias, improving generalization performance without changing model capacity. We evaluate on various model architectures and demonstrate that linearly over-parameterized models outperform existing baselines on image classification tasks, including ImageNet.


Insight 1: The volume of low-rank parameters increases as a function of the number of layers.

There is more probability mass on lower-rank solutions as more layers are added. The effective rank is computed on the effective weights for linear networks and on the kernel for non-linear networks.
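To make the effective-rank measurement concrete, here is a minimal NumPy sketch. It assumes the entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values), and it samples the effective weights of a deep linear network as a product of random Gaussian matrices. The helper names `effective_rank` and `mean_effective_rank`, the matrix size, and the initialization scale are illustrative choices, not taken from the released code.

```python
import numpy as np


def effective_rank(matrix):
    """Exponential of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so that log(p) is defined
    return float(np.exp(-np.sum(p * np.log(p))))


def mean_effective_rank(depth, dim=32, trials=20, seed=0):
    """Average effective rank of a product of `depth` random Gaussian matrices."""
    rng = np.random.default_rng(seed)
    ranks = []
    for _ in range(trials):
        w_eff = np.eye(dim)
        for _ in range(depth):
            layer = rng.standard_normal((dim, dim)) / np.sqrt(dim)
            w_eff = layer @ w_eff  # effective weight of the stacked linear layers
        ranks.append(effective_rank(w_eff))
    return float(np.mean(ranks))


if __name__ == "__main__":
    for depth in (1, 2, 4, 8):
        print(depth, round(mean_effective_rank(depth), 2))
```

Under these assumptions, the averaged effective rank at initialization should shrink as the depth grows, which is the qualitative trend described above.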

Insight 2: The parameterization of the network ultimately determines which solution the model will converge to.

In a low-rank, under-determined regime, models with the same training error can reach different test errors. Networks that are too shallow or too deep perform sub-optimally. Conversely, if the underlying solution is full-rank, deep models fail to converge.
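The under-determined setting can be illustrated with a small NumPy sketch. It does not train any network. It only shows that, with fewer training samples than input dimensions, two different linear maps can fit the training data exactly and still disagree on held-out data. The rank-1 ground truth, the problem sizes, and all variable names are hypothetical choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_train, n_test = 20, 20, 10, 1000  # n_train < d: under-determined

# Hypothetical rank-1 ground-truth map and Gaussian inputs (columns are samples).
w_true = np.outer(rng.standard_normal(k), rng.standard_normal(d))
x_train = rng.standard_normal((d, n_train))
x_test = rng.standard_normal((d, n_test))
y_train, y_test = w_true @ x_train, w_true @ x_test

# Solution A: minimum Frobenius-norm interpolant via the pseudo-inverse.
x_pinv = np.linalg.pinv(x_train)
w_min = y_train @ x_pinv

# Solution B: add a component that the training inputs cannot see.
null_proj = np.eye(d) - x_train @ x_pinv  # projector orthogonal to the training inputs
w_alt = w_min + rng.standard_normal((k, d)) @ null_proj


def mse(w, x, y):
    """Mean squared error of the linear map w on inputs x with targets y."""
    return float(np.mean((w @ x - y) ** 2))


train_min, train_alt = mse(w_min, x_train, y_train), mse(w_alt, x_train, y_train)
test_min, test_alt = mse(w_min, x_test, y_test), mse(w_alt, x_test, y_test)
print("train:", train_min, train_alt)
print("test: ", test_min, test_alt)
```

Both maps have a training error of zero up to floating-point precision, so the training loss alone cannot choose between them. Which interpolating solution a model reaches is left to other factors, such as its parameterization.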

Insight 3: Linear over-parameterization of non-linear networks can be used to improve generalization performance.

Linear over-parameterization induces low-rank weights without increasing the modeling capacity. The figure below shows the singular values of a CNN for both the original (left) and the linearly over-parameterized (right) model throughout training. The over-parameterized model exhibits less overfitting, with lower training accuracy and higher testing accuracy.
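The claim that linear over-parameterization leaves the modeling capacity unchanged can be checked directly: two stacked linear layers with no non-linearity between them compute the same function as a single linear layer. The sketch below uses plain NumPy rather than the released `overparam` package; the layer sizes and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4  # hidden width is an illustrative choice

# Two stacked linear layers with biases and no activation in between.
w1, b1 = rng.standard_normal((d_hidden, d_in)), rng.standard_normal(d_hidden)
w2, b2 = rng.standard_normal((d_out, d_hidden)), rng.standard_normal(d_out)


def two_layer(x):
    """Forward pass through the over-parameterized (two-layer) form."""
    return w2 @ (w1 @ x + b1) + b2


# Collapse the two layers into a single effective weight and bias.
w_eff = w2 @ w1
b_eff = w2 @ b1 + b2


def collapsed(x):
    """Forward pass through the equivalent single linear layer."""
    return w_eff @ x + b_eff


x = rng.standard_normal(d_in)
print(np.allclose(two_layer(x), collapsed(x)))
```

Because the expanded layers can be multiplied back into one matrix, the extra parameters change the training dynamics but not the set of functions the layer can represent.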

Try our PyTorch code

Install our code [github]

      >>> git clone
      >>> cd overparam
      >>> pip install .

Integrate into your existing PyTorch code base

      from overparam import OverparamLinear, OverparamConv2d

      # over-parameterized nn.Linear layer
      layer = OverparamLinear(32, 32, depth=4)

      # over-parameterized nn.Conv2d layer (3 layers with 3x3, 3x3, 1x1 kernels)
      layer = OverparamConv2d(32, 64, kernel_sizes=(3, 3, 1), stride=1, padding=1)

Automatically apply linear over-parameterization to existing models

      import torchvision.models as models
      from overparam.utils import overparameterize

      model = models.alexnet()
      model = overparameterize(model, depth=2)


We would like to thank Anurag Ajay, Lucy Chai, Tongzhou Wang, and Yen-Chen Lin for reading over the manuscript and Jeffrey Pennington and Alexei A. Efros for fruitful discussions. Minyoung Huh is funded by DARPA Machine Common Sense and MIT STL. Brian Cheung is funded by an MIT BCS Fellowship.

This research was also partly sponsored by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation herein.

Website template edited from Colorful Colorization.