GitHub | arXiv
Tremendous efforts have been made to enable large-scale model fine-tuning on memory-limited hardware devices using quantization and adapters. Methods such as QLoRA have demonstrated the feasibility of fine-tuning a 7B LLM with just 6GB of GPU memory.
But what about model pre-training? Can we leverage these memory-efficient techniques when training a model from scratch?
We demonstrate that it is possible to pre-train a model using only low-rank adapters and still match standard pre-training performance. This enables training much larger models on computing devices with a fraction of the memory and communication cost.
Our method, LoRA-the-Explorer (LTE), involves the following steps:
1. Attach N LoRA heads to the (frozen) main weights of the model.
2. Train each head independently, and in parallel, on its own shard of the data for a fixed number of iterations.
3. Merge the heads by averaging their low-rank updates into the main weights, reset the heads, and repeat.
This approach ensures that the LoRA parameters serve as individual "gradient" updates to the main set of weights, enabling collaborative training of the model with LoRA adapters ... similar to an open-source GitHub project :)
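Below is a minimal, self-contained PyTorch sketch of this scheme. The toy model, objective, head count, rank, learning rate, and merge interval are illustrative assumptions, not the paper's actual implementation.

# Sketch of parallel LoRA training with periodic merging (LTE-style); all hyperparameters are placeholders.
import copy
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen dense weight plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, in_dim, out_dim, rank=4, alpha=8.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02, requires_grad=False)
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.02)  # trainable
        self.B = nn.Parameter(torch.zeros(out_dim, rank))        # trainable, starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return x @ (self.weight + self.scale * self.B @ self.A).T


def merge_and_reset(main, heads):
    """Average each head's low-rank update into the main weights, then reset the heads."""
    with torch.no_grad():
        delta = sum(h.scale * h.B @ h.A for h in heads) / len(heads)
        main.weight += delta
        for h in heads:
            h.weight.copy_(main.weight)  # re-sync the frozen base weight
            h.B.zero_()                  # reset the low-rank update to zero
            h.A.normal_(std=0.02)        # re-initialize the other factor


torch.manual_seed(0)
main = LoRALinear(16, 16, rank=2)                # holds the "main" weights
heads = [copy.deepcopy(main) for _ in range(4)]  # N parallel LoRA heads
opts = [torch.optim.SGD([h.A, h.B], lr=1e-2) for h in heads]

for step in range(200):
    for head, opt in zip(heads, opts):           # in practice: separate devices / data shards
        x = torch.randn(32, 16)
        loss = (head(x) - x).pow(2).mean()       # placeholder reconstruction objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    if (step + 1) % 10 == 0:                     # infrequent synchronization
        merge_and_reset(main, heads)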
Observation 1: Models trained with LoRA alone cannot fully recover pre-training performance. However, increasing the rank r or employing a multi-head representation enables these models to match it.
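For concreteness, one plausible form of such a multi-head representation (our reading of the term, not necessarily the paper's exact parameterization) sums several low-rank products on top of a shared frozen weight:

import torch
import torch.nn as nn


class MultiHeadLoRALinear(nn.Module):
    """Hypothetical multi-head low-rank layer: W + (alpha / r) * sum_h B_h @ A_h."""

    def __init__(self, in_dim, out_dim, rank=4, heads=4, alpha=8.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02, requires_grad=False)
        self.A = nn.Parameter(torch.randn(heads, rank, in_dim) * 0.02)
        self.B = nn.Parameter(torch.zeros(heads, out_dim, rank))
        self.scale = alpha / rank

    def forward(self, x):
        delta = self.scale * torch.einsum("hor,hri->oi", self.B, self.A)  # sum over heads h and rank r
        return x @ (self.weight + delta).T


x = torch.randn(8, 16)
print(MultiHeadLoRALinear(16, 16)(x).shape)  # torch.Size([8, 16])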
Observation 2: We empirically observe that the effective rank of the gradients increases over time, which signals why a single LoRA alone cannot match pre-training performance.
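As an illustration, the effective rank of a gradient matrix can be tracked with a spectral measure such as the entropy of its normalized singular values; this is one common definition and may differ from the exact metric used in the paper.

import torch


def effective_rank(mat: torch.Tensor) -> float:
    """Entropy-based effective rank: exp of the entropy of the normalized singular values."""
    s = torch.linalg.svdvals(mat)
    p = s / s.sum()
    entropy = -(p * torch.log(p.clamp_min(1e-12))).sum()
    return torch.exp(entropy).item()


grad = torch.randn(64, 8) @ torch.randn(8, 64)  # example rank-8 matrix, standing in for a gradient
print(effective_rank(grad))                     # <= 8, since only 8 singular values are (numerically) nonzero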
Observation 3: Sequential merging, while a viable approach to recovering a full-rank weight representation, is unable to recover full-model pre-training performance in practice, suggesting the need for full-rank update iterates.
Observation 4: We demonstrate that parallel LoRA merging can precisely replicate the full-rank weight update. Furthermore, we show that even with infrequent synchronization, our method remains effective for pre-training.
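To see why this can work, note that the merged update (1/N) * sum_n B_n A_n can have rank up to N*r even though each head is only rank r. A quick numerical check of this fact (illustrative only, not the paper's experiment):

import torch

d, r, N = 64, 4, 16
torch.manual_seed(0)
updates = [torch.randn(d, r) @ torch.randn(r, d) for _ in range(N)]  # stand-ins for each head's B_n @ A_n

print(torch.linalg.matrix_rank(updates[0]).item())  # 4: a single head is only rank r
merged = torch.stack(updates).mean(dim=0)           # the averaged (merged) update
print(torch.linalg.matrix_rank(merged).item())      # 64: full rank, since N * r >= d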
Observation 5: Even when trained in parallel, LoRA heads maintain orthogonality throughout the training process.
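One simple way to probe this (a sketch of the kind of check one might run, not the paper's exact analysis) is to measure the pairwise cosine similarity of the flattened per-head updates B_n A_n; values near zero indicate near-orthogonal heads.

import torch
import torch.nn.functional as F

d, r, N = 64, 4, 8
torch.manual_seed(0)
deltas = [(torch.randn(d, r) @ torch.randn(r, d)).flatten() for _ in range(N)]  # stand-ins for B_n @ A_n

for i in range(N):
    for j in range(i + 1, N):
        sim = F.cosine_similarity(deltas[i], deltas[j], dim=0).item()
        print(f"heads {i} and {j}: cosine similarity = {sim:+.3f}")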
Our results suggest that LTE is a competitive parameter-efficient framework for distributed training, and we highlight several directions for further exploration.
Our initial findings validate the potential of low-rank adapters for training neural networks from scratch, marking a significant step forward. However, further testing on larger models is essential to assess the scalability of our approach. We believe our method opens up, and contributes to, many avenues of future work.
By addressing these open questions, we hope to build toward a collaborative ecosystem that embodies the "wisdom of the crowd."
@article{huh2024lte,
  title={Training Neural Networks from Scratch with Parallel Low-Rank Adapters},
  author={Huh, Minyoung and Cheung, Brian and Bernstein, Jeremy and Isola, Phillip and Agrawal, Pulkit},
  journal={arXiv preprint arXiv:2402.16828},
  year={2024}
}
The research was sponsored by the Army Research Office and was accomplished under Grant Number W911NF-23-1-0277. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. MH was supported by the ONR MURI grant N00014-22-1-2740 and the MIT-IBM Watson AI Lab. JB was funded by the MIT-IBM Watson AI Lab and the Packard Fellowship. PI was funded by the Packard Fellowship. BC was funded by the NSF STC award CCF-1231216. We thank Han Guo, Lucy Chai, Wei-Chiu Ma, Eunice Lee, Dave Epstein, and Yen-Chen Lin for their feedback and emotional support on the project.