How to Train Your ViT
Without the translational equivariance of CNNs, ViT models generally perform best with large amounts of training data, or else require strong AugReg (augmentation + regularization) schemes to avoid overfitting.
Carefully selected regularization and augmentation correspond to roughly a 10x increase in training data size.
AugReg
Regularization - dropout applied to the intermediate activations of the ViT, and stochastic depth
Data augmentation - Mixup and RandAugment
Weight decay (all three ingredients are combined in the code sketch after this list)
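
Putting these together, here is a minimal sketch of an AugReg training setup in PyTorch using the timm and torchvision libraries (the paper's own code is JAX-based); all hyperparameter values below are illustrative placeholders, not the paper's swept settings.

```python
import torch
import timm
from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy
from torchvision import transforms

# ViT with dropout on intermediate activations and stochastic depth
# (rates here are placeholders, not the paper's values).
model = timm.create_model(
    "vit_base_patch16_224",
    drop_rate=0.1,       # dropout
    drop_path_rate=0.1,  # stochastic depth
    num_classes=1000,
)

# RandAugment in the input pipeline.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandAugment(num_ops=2, magnitude=10),
    transforms.ToTensor(),
])

# Mixup produces soft targets, so pair it with a soft-target loss.
mixup_fn = Mixup(mixup_alpha=0.2, num_classes=1000)
criterion = SoftTargetCrossEntropy()

# Weight decay via AdamW.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

def train_step(images, labels):
    # images: [B, 3, 224, 224]; labels: [B] int64; B must be even for timm's Mixup.
    images, targets = mixup_fn(images, labels)
    loss = criterion(model(images), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```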
Transfer
For most practical purposes, transferring a pre-trained model is more cost-efficient than training from scratch and also leads to better results.
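
For illustration, a sketch of a transfer run using timm, which in recent versions hosts the AugReg checkpoints released with the paper; the downstream class count and optimizer settings below are placeholders.

```python
import timm
import torch

# Load an AugReg-pretrained ViT and replace its head for the downstream task.
model = timm.create_model(
    "vit_base_patch16_224.augreg_in21k",  # an AugReg checkpoint tag in timm
    pretrained=True,
    num_classes=10,  # placeholder downstream class count; re-initializes the head
)

# Plain SGD with momentum is a common choice when fine-tuning ViTs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)
```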
Choosing which pre-trained model to transfer
- One approach is to run downstream adaptation for every available pre-trained model and select the best performer by downstream validation score (expensive)
- Alternatively, select a single pre-trained model by upstream validation accuracy and adapt only that model (cheaper)
The cheaper strategy works about as well as the expensive one in the majority of scenarios; both are sketched below.
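
In the sketch below, the checkpoint names and scores are entirely made up, and the two dictionaries stand in for real fine-tuning runs and logged pre-training metrics.

```python
# Hypothetical checkpoints with made-up validation scores, purely to
# illustrate the two selection procedures.
UPSTREAM_VAL_ACC = {"vit_ti16": 0.69, "vit_s16": 0.76, "vit_b16": 0.80}
DOWNSTREAM_VAL_ACC = {"vit_ti16": 0.88, "vit_s16": 0.91, "vit_b16": 0.92}

def select_expensive() -> str:
    # Fine-tune every checkpoint and keep the best downstream score:
    # costs one adaptation run per candidate.
    return max(DOWNSTREAM_VAL_ACC, key=DOWNSTREAM_VAL_ACC.get)

def select_cheap() -> str:
    # Pick the best upstream checkpoint first, then fine-tune only it:
    # costs a single adaptation run.
    return max(UPSTREAM_VAL_ACC, key=UPSTREAM_VAL_ACC.get)

print(select_expensive())  # "vit_b16"
print(select_cheap())      # "vit_b16" -- agrees here, as it often does
```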