Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
This article explores a different paradigm for Transformer training: train a large model first, then compress it. While the conventional approach is to train a small model from scratch to fit a fixed inference budget, this work argues that training a large-scale model on the full data and then applying quantization, pruning, or knowledge distillation often yields a better performance-to-efficiency trade-off, in part because larger models converge faster and tolerate heavier compression with less accuracy loss. The article analyzes how well different compression strategies preserve model expressiveness, discusses the trade-off between training scale and compression ratio, and offers practical engineering guidelines for deployment.
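To make the workflow concrete, here is a minimal sketch of the "compress after training" step using post-training dynamic quantization in PyTorch. It assumes the torch and transformers packages are installed; the bert-large-uncased checkpoint is only a stand-in for a large model you have already trained, and dynamic quantization is just one of the compression options the article discusses (pruning and distillation follow the same train-large-first pattern).

```python
import io
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Step 1: start from a LARGE model that has already been trained or
# fine-tuned on the full data (the expensive step the article advocates).
# "bert-large-uncased" stands in for your own trained checkpoint.
model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased")
model.eval()

# Step 2: compress after training. Dynamic quantization converts the
# weights of all Linear layers to int8 and needs no retraining or
# calibration data.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Rough storage comparison: serialize each state_dict to memory.
def state_dict_mb(m: torch.nn.Module) -> float:
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32 checkpoint : {state_dict_mb(model):.0f} MB")
print(f"int8 checkpoint : {state_dict_mb(quantized):.0f} MB")

# Step 3: run inference with the compressed model as usual (CPU).
tok = AutoTokenizer.from_pretrained("bert-large-uncased")
inputs = tok("Train large, then compress.", return_tensors="pt")
with torch.no_grad():
    logits = quantized(**inputs).logits
print(logits.shape)
```

Dynamic quantization is the lightest-touch option here because it requires no extra training pass; pruning and distillation typically need additional fine-tuning but can reach smaller footprints, which is the trade-off between compression ratio and accuracy examined later in the article.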