How to Train Your ViT
Without the translational equivariance of CNNs, ViT models generally perform best with large amounts of training data, or else require strong AugReg (augmentation + regularization) schemes to avoid overfitting.
Carefully selected regularization and augmentation correspond to roughly a 10x increase in training data size.
AugReg
Regularization - dropout applied to the intermediate activations of the ViT, and stochastic depth
Data augmentation - Mixup and RandAugment
Weight decay (all three ingredients are combined in the code sketch after this list)
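
Putting these together, here is a minimal sketch of an AugReg training setup in PyTorch using the timm and torchvision libraries (the paper's own code is JAX-based); all hyperparameter values below are illustrative placeholders, not the paper's swept settings.

```python
import torch
import timm
from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy
from torchvision import transforms

# ViT with dropout on intermediate activations and stochastic depth
# (rates here are placeholders, not the paper's values).
model = timm.create_model(
    "vit_base_patch16_224",
    drop_rate=0.1,       # dropout
    drop_path_rate=0.1,  # stochastic depth
    num_classes=1000,
)

# RandAugment in the input pipeline.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandAugment(num_ops=2, magnitude=10),
    transforms.ToTensor(),
])

# Mixup produces soft targets, so pair it with a soft-target loss.
mixup_fn = Mixup(mixup_alpha=0.2, num_classes=1000)
criterion = SoftTargetCrossEntropy()

# Weight decay via AdamW.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

def train_step(images, labels):
    # images: [B, 3, 224, 224]; labels: [B] int64; B must be even for timm's Mixup.
    images, targets = mixup_fn(images, labels)
    loss = criterion(model(images), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```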
Transfer
For most practical purposes, transferring a pre-trained model is more cost-efficient than training from scratch and also leads to better results.
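
For illustration, a sketch of a transfer run using timm, which in recent versions hosts the AugReg checkpoints released with the paper; the downstream class count and optimizer settings below are placeholders.

```python
import timm
import torch

# Load an AugReg-pretrained ViT and replace its head for the downstream task.
model = timm.create_model(
    "vit_base_patch16_224.augreg_in21k",  # an AugReg checkpoint tag in timm
    pretrained=True,
    num_classes=10,  # placeholder downstream class count; re-initializes the head
)

# Plain SGD with momentum is a common choice when fine-tuning ViTs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)
```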
Choosing which pre-trained model to transfer
- One approach is to run downstream adaptation for every available pre-trained model and select the best performer by downstream validation score (expensive)
- Alternatively, select a single pre-trained model by upstream validation accuracy and adapt only that model (cheaper)
The cheaper strategy works about as well as the expensive one in the majority of scenarios; both are sketched below.
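
In the sketch below, the checkpoint names and scores are entirely made up, and the two dictionaries stand in for real fine-tuning runs and logged pre-training metrics.

```python
# Hypothetical checkpoints with made-up validation scores, purely to
# illustrate the two selection procedures.
UPSTREAM_VAL_ACC = {"vit_ti16": 0.69, "vit_s16": 0.76, "vit_b16": 0.80}
DOWNSTREAM_VAL_ACC = {"vit_ti16": 0.88, "vit_s16": 0.91, "vit_b16": 0.92}

def select_expensive() -> str:
    # Fine-tune every checkpoint and keep the best downstream score:
    # costs one adaptation run per candidate.
    return max(DOWNSTREAM_VAL_ACC, key=DOWNSTREAM_VAL_ACC.get)

def select_cheap() -> str:
    # Pick the best upstream checkpoint first, then fine-tune only it:
    # costs a single adaptation run.
    return max(UPSTREAM_VAL_ACC, key=UPSTREAM_VAL_ACC.get)

print(select_expensive())  # "vit_b16"
print(select_cheap())      # "vit_b16" -- agrees here, as it often does
```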