Training Neural Networks From Scratch
with Parallel Low-Rank Adapters



Minyoung Huh¹   Brian Cheung¹ ²   Jeremy Bernstein¹   Phillip Isola¹   Pulkit Agrawal¹

¹MIT CSAIL   ²MIT BCS

GitHub · arXiv


Abstract

The scalability of deep learning models is fundamentally limited by computing resources, memory, and communication. Although low-rank adaptation (LoRA) has reduced the cost of model finetuning, its application to model pre-training remains largely unexplored. This paper explores extending LoRA to model pre-training, identifying the inherent constraints and limitations of standard LoRA in this context. We introduce LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm designed to enable parallel training of multiple low-rank heads across computing nodes, thereby reducing the need for frequent synchronization. Our approach includes extensive experimentation on vision transformers using various vision datasets, demonstrating that LTE is competitive with standard pre-training.



TLDR takeaways

Tremendous efforts have been made to enable large-scale model fine-tuning on memory-limited hardware devices using quantization and adapters. Methods such as QLoRA have demonstrated the feasibility of fine-tuning a 7B LLM with just 6GB of GPU memory.

But what about model pre-training? Can we leverage these memory-efficient algorithms when training from scratch?

We demonstrate that it is possible to train with LoRA alone and match standard pre-training performance. This enables training much larger models on computing devices with a fraction of the memory and communication cost.
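For reference, below is a minimal PyTorch sketch of a LoRA-parameterized linear layer: the base weight W is frozen and all learning goes through the low-rank factors B and A. The class name and initialization are illustrative (standard LoRA zero-initializes B), not the exact scheme used in our experiments.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Linear layer trained only through low-rank factors: y = x (W + s*B@A)^T."""
        def __init__(self, in_dim, out_dim, rank=16, scale=1.0):
            super().__init__()
            # Frozen base weight; it only changes when a LoRA update is merged in.
            self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * in_dim**-0.5,
                                       requires_grad=False)
            self.A = nn.Parameter(torch.randn(rank, in_dim) * in_dim**-0.5)  # trainable down-projection
            self.B = nn.Parameter(torch.zeros(out_dim, rank))                # trainable up-projection
            self.scale = scale

        def forward(self, x):
            return x @ (self.weight + self.scale * self.B @ self.A).T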

Our method, LoRA-the-Explorer (LTE), trains multiple LoRA heads in parallel across workers and periodically merges their updates into the main weights, so that synchronization is only needed infrequently.

This approach ensures that the LoRA parameters serve as individual "gradient" updates for the main set of weights. By doing so, it enables collaborative training of the model using LoRA adapters, much like an open-source GitHub project :)
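The sketch below shows one such round for a single weight matrix: each head is optimized independently on its own data shard, the averaged low-rank products are folded into the frozen main weight, and the heads are re-initialized for the next round. All names here are illustrative rather than taken from our released code; W is assumed to be a frozen tensor and each head a pair of trainable (B_i, A_i) tensors.

    import torch

    def lte_round(W, heads, shards, steps=250, lr=4e-4, scale=1.0):
        """One round of LoRA-the-Explorer for a single weight matrix (sketch).

        heads  : list of (B_i, A_i) trainable low-rank factor pairs
        shards : list of (x, y) batches, one per head / worker
        """
        # 1) Local phase: each head trains independently on its own shard.
        #    In practice this loop runs in parallel on separate workers.
        for (B, A), (x, y) in zip(heads, shards):
            opt = torch.optim.AdamW([B, A], lr=lr)
            for _ in range(steps):
                pred = x @ (W + scale * B @ A).T      # W is frozen; only B, A get gradients
                loss = ((pred - y) ** 2).mean()
                opt.zero_grad(); loss.backward(); opt.step()

        with torch.no_grad():
            # 2) Synchronize: fold the averaged low-rank products into W, so each
            #    head's B_i @ A_i acts as an individual "gradient" update.
            W += scale * sum(B @ A for B, A in heads) / len(heads)
            # 3) Re-initialize the heads before the next round (simplified; the
            #    exact reset / initialization scheme may differ from the paper).
            for B, A in heads:
                B.zero_()
                A.normal_(std=A.shape[1] ** -0.5)
        return W

Only the merge step requires communication between workers, which is what keeps synchronization infrequent.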


Key observations

Observation 1: Models trained with LoRA alone cannot fully recover pre-training performance. However, increasing the rank r or employing a multi-head representation allows these models to match standard pre-training performance.
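A quick numeric illustration of the rank argument: a single product B A has rank at most r, whereas merging N independent rank-r products can reach rank min(N*r, d). The tensors below are random placeholders rather than trained weights.

    import torch

    d, r, N = 256, 8, 16
    heads = [(torch.randn(d, r), torch.randn(r, d)) for _ in range(N)]

    single = heads[0][0] @ heads[0][1]              # one head: rank <= r
    merged = sum(B @ A for B, A in heads)           # N heads merged: rank up to min(N*r, d)

    print(torch.linalg.matrix_rank(single).item())  # 8
    print(torch.linalg.matrix_rank(merged).item())  # 128 (= N*r) with overwhelming probability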

Observation 2: We empirically observe that the effective rank of the gradients increases over the course of training, which explains why a single low-rank LoRA head cannot match full pre-training performance on its own.
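One concrete way to track this is the entropy-based effective rank of Roy & Vetterli, computed from the singular values of a layer's gradient. The helper below is a sketch of such a diagnostic; other effective-rank estimators (e.g. threshold-based ones) would serve the same purpose.

    import torch

    def effective_rank(grad, eps=1e-12):
        """Entropy-based effective rank: exp of the entropy of the normalized
        singular-value distribution. Lies in [1, min(grad.shape)]."""
        s = torch.linalg.svdvals(grad)
        p = s / (s.sum() + eps)                        # singular values as a distribution
        entropy = -(p * torch.log(p + eps)).sum()
        return torch.exp(entropy).item()

    # e.g. log effective_rank(layer.weight.grad) at different points in training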


Observation 3: Sequential merging, while a viable approach to recovering a full-rank weight representation, is unable to recover full-model pre-training performance in practice, suggesting the need for full-rank update iterates.
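For contrast with the parallel scheme above, the sketch below spells out what we mean by sequential merging, reduced to a single weight matrix: one LoRA head is trained, folded into the main weights, and replaced by a fresh head. Every merge changes W by a matrix of rank at most r, which is exactly the limitation this observation points to. The function is an illustrative simplification, not our training code.

    import torch

    def sequential_lora(W, x, y, rounds=10, rank=16, steps=250, lr=4e-4, scale=1.0):
        """Train-and-merge one LoRA head at a time (sketch).

        Each merge adds a rank-<=`rank` matrix to W, so every weight-update
        iterate stays low-rank even though W itself drifts over time.
        """
        out_dim, in_dim = W.shape
        for _ in range(rounds):
            B = torch.zeros(out_dim, rank, requires_grad=True)                 # fresh head
            A = (torch.randn(rank, in_dim) * in_dim**-0.5).requires_grad_()
            opt = torch.optim.AdamW([B, A], lr=lr)
            for _ in range(steps):
                loss = ((x @ (W + scale * B @ A).T - y) ** 2).mean()
                opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():
                W += scale * B @ A       # merge, then start over with a new head
        return W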

Observation 4: We demonstrate that parallel LoRA merging can precisely replicate the full-rank weight update. Furthermore, we show that even with infrequent synchronization, our method remains effective for pre-training.

[Figures: results on various datasets with ViT-S/32, on ImageNet100 across ViT scales, and on ImageNet100 across MLP-Mixer scales.]

Observation 5: Even when trained in parallel, LoRA heads maintain orthogonality throughout the training process.
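A simple way to probe this is to measure pairwise cosine similarities between the flattened head updates B_i A_i: off-diagonal values near zero mean the heads explore nearly orthogonal directions in weight space. The helper below is one such diagnostic; other metrics (e.g. principal angles between the heads' subspaces) could be used instead.

    import torch

    def pairwise_head_similarity(heads):
        """Cosine-similarity matrix between the flattened head updates B_i @ A_i."""
        with torch.no_grad():
            updates = torch.stack([(B @ A).flatten() for B, A in heads])  # (N, out*in)
            updates = updates / updates.norm(dim=1, keepdim=True)
            return updates @ updates.T                                    # (N, N)

    # e.g. print(pairwise_head_similarity(heads)) every few thousand steps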



Conclusion, limitations, and implications

Our results suggest that LTE is a competitive parameter-efficient framework for distributed training.

Our initial findings validate the potential of low-rank adapters for training neural networks from scratch. However, experiments on larger models are still needed to establish the scalability of our approach, and we believe our method opens up and contributes to several avenues for future work.

By addressing these open questions, we hope to move toward a collaborative training ecosystem that embodies the "wisdom of the crowd."



    @article{huh2024lte,
        title={Training Neural Networks from Scratch with Parallel Low-Rank Adapters},
        author={Huh, Minyoung and Cheung, Brian and Bernstein, Jeremy 
                and Isola, Phillip and Agrawal, Pulkit},
        journal={arXiv preprint arXiv:2402.16828},
        year={2024}
    }
    

Acknowledgements

MH was supported by the ONR MURI grant N00014-22-1-2740 and the MIT-IBM Watson AI Lab. JB was funded by the MIT-IBM Watson AI Lab and the Packard Fellowship. PI was funded by the Packard Fellowship. BC was funded by the NSF STC award CCF-1231216. We thank Han Guo, Lucy Chai, Wei-Chiu Ma, Eunice Lee, Dave Epstein, and Yen-Chen Lin for their feedback and emotional support on the project.





Website template adapted from Colorful Colorization.