BERT

BERT comes in two model sizes: BERT Base and BERT Large.

BERT model sizes

BERT is a trained Transformer encoder stack.

Both sizes have large feedforward networks (768 and 1024 hidden units, respectively) and more attention heads (12 and 16, respectively); BERT Base stacks 12 encoder layers, while BERT Large stacks 24.
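For reference, the published hyperparameters of the two sizes can be laid out in a small Python sketch (the numbers come from the BERT paper; the dictionary itself is only illustrative):

```python
# A minimal sketch of the two published BERT sizes (values from the BERT paper).
bert_sizes = {
    "BERT Base":  {"encoder_layers": 12, "hidden_units": 768,  "attention_heads": 12, "parameters": "110M"},
    "BERT Large": {"encoder_layers": 24, "hidden_units": 1024, "attention_heads": 16, "parameters": "340M"},
}

for name, cfg in bert_sizes.items():
    print(name, cfg)
```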

BERT prepends a special [CLS] token to the input, and that position's output is used for classification. The output vector from [CLS] goes through a feedforward network followed by a softmax to produce the classification.

Classification with BERT
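A minimal PyTorch sketch of such a classification head, assuming the [CLS] output vector has already been produced by the encoder (the class and variable names are illustrative, not BERT's actual module names; 768 matches BERT Base's hidden size):

```python
import torch
import torch.nn as nn

class ClsClassifier(nn.Module):
    """Feed the [CLS] output vector through a linear layer + softmax."""
    def __init__(self, hidden_size=768, num_classes=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, cls_vector):
        # cls_vector: (batch_size, hidden_size) - the encoder output at the [CLS] position
        logits = self.classifier(cls_vector)
        return torch.softmax(logits, dim=-1)

# Example: a batch of two [CLS] vectors from BERT Base (hidden size 768)
probs = ClsClassifier()(torch.randn(2, 768))
print(probs.shape)  # torch.Size([2, 2])
```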

BERT builds on clever ideas from ELMo, ULM-FiT, and the Transformer.

ELMo - Contextualized word-embeddings

With static word embeddings (e.g. word2vec or GloVe), the embedding for a word remains the same irrespective of the context. ELMo instead provides contextualized word embeddings: it looks at the entire sentence before assigning each word in it an embedding. It uses a bi-directional LSTM (looking at the previous words and the next words), trained on a specific task, to create those embeddings.

ELMo gained its language understanding from being trained to predict the next word in a sequence of words, i.e. language modeling.

ELMo bi-directional LSTM training
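A toy sketch of that next-word-prediction objective with a single forward LSTM (illustrative only; ELMo additionally runs a backward LSTM and uses character-level inputs, both omitted here):

```python
import torch
import torch.nn as nn

class TinyLstmLm(nn.Module):
    """Toy forward language model: embed tokens, run an LSTM, score the next word."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.next_word = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        hidden_states, _ = self.lstm(self.embed(token_ids))
        return self.next_word(hidden_states)  # logits over the vocabulary at each position

logits = TinyLstmLm()(torch.randint(0, 10000, (1, 6)))
print(logits.shape)  # torch.Size([1, 6, 10000])
```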

ELMo comes up with the contextualized embedding by grouping together the hidden states (and the initial embedding) in a certain way: concatenation followed by weighted summation.

Concatenation and weighted summation of embeddings
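A minimal sketch of that grouping step for a single word, assuming its per-layer vectors are already available (the dimensions and the mixing weights are illustrative; in ELMo the weights are learned for each downstream task):

```python
import torch

# Per-word vectors for one token: the initial embedding plus the forward/backward
# LSTM hidden states from two layers. The dimensions (512 per direction) are illustrative.
initial_embedding = torch.randn(1024)            # layer 0 (already at the concatenated width, for simplicity)
layer1_fwd, layer1_bwd = torch.randn(512), torch.randn(512)
layer2_fwd, layer2_bwd = torch.randn(512), torch.randn(512)

# Step 1: concatenate the forward and backward states at each layer.
layers = torch.stack([
    initial_embedding,
    torch.cat([layer1_fwd, layer1_bwd]),
    torch.cat([layer2_fwd, layer2_bwd]),
])                                               # shape: (3, 1024)

# Step 2: weighted summation across layers. In ELMo these weights are learned
# per downstream task; here they are just example values.
weights = torch.softmax(torch.tensor([0.2, 0.5, 0.3]), dim=0)
elmo_embedding = (weights.unsqueeze(1) * layers).sum(dim=0)
print(elmo_embedding.shape)                      # torch.Size([1024])
```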

ULM-FiT

ULM-FiT introduced a language model and process to effectively fine-tune that language model for various tasks.

OpenAI Transformer

It turned out that we don't need an entire Transformer to adopt transfer learning; the decoder stack of the Transformer alone is enough.

The model stacked twelve decoder layers. Since there is no encoder in this setup, these decoder layers do not have the encoder-decoder attention sublayer that vanilla Transformer decoder layers have. They still have the self-attention layer, however, masked so that a position cannot peek at future tokens.
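A minimal sketch of that masking idea, building a causal mask so each position can only attend to itself and earlier positions (illustrative, not the OpenAI implementation):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)   # raw attention scores: query positions x key positions

# Causal mask: position i may only attend to positions <= i (no peeking at future tokens).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))

attention = torch.softmax(masked_scores, dim=-1)
print(attention[0])   # the first position can only attend to itself
```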

The model was trained on 7,000 books, which allowed it to learn long-range dependencies in text.

Once trained, the OpenAI Transformer can be used for downstream tasks.

Using different input transformations to carry out different tasks with the OpenAI Transformer

BERT: From Decoders to Encoders

ELMo’s language model was bi-directional, while the OpenAI Transformer only trains a forward language model.

BERT randomly masks 15% of the input tokens and is trained to predict the masked words (a masked language modeling task).
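A simplified sketch of that masking step (BERT's actual recipe also replaces some chosen tokens with random words or leaves them unchanged; here every chosen token simply becomes [MASK]):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly mask ~15% of tokens; the model is trained to predict the originals."""
    masked, labels = [], []
    for token in tokens:
        if random.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)      # target the model must recover
        else:
            masked.append(token)
            labels.append(None)       # no prediction needed at this position
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(tokens))
```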

BERT is also pre-trained on a two-sentence task (predicting whether the second sentence follows the first), which equips it for downstream tasks that take sentence pairs as input.
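A minimal sketch of how a two-sentence input is packed for BERT, with [CLS] and [SEP] markers and segment ids (the whitespace tokenization here is purely illustrative):

```python
def build_sentence_pair(sentence_a, sentence_b):
    """Pack two sentences into one BERT input: [CLS] A [SEP] B [SEP], plus segment ids."""
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"] + sentence_b.split() + ["[SEP]"]
    # Segment (token type) ids: 0 for sentence A and its separators, 1 for sentence B.
    segment_ids = [0] * (len(sentence_a.split()) + 2) + [1] * (len(sentence_b.split()) + 1)
    return tokens, segment_ids

tokens, segments = build_sentence_pair("the man went to the store", "he bought a gallon of milk")
print(list(zip(tokens, segments)))
```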

Using BERT for different downstream tasks

Using BERT for feature extraction

Which layers' outputs should be used for feature extraction
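A minimal sketch of feature extraction, assuming the Hugging Face transformers library is available; it concatenates each token's hidden states from the last four encoder layers, one of the better-performing choices:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("Help! I need somebody!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of 13 tensors (embeddings + 12 encoder layers), each (1, seq_len, 768)
hidden_states = outputs.hidden_states

# One option: concatenate the last four layers per token to get contextualized features.
features = torch.cat(hidden_states[-4:], dim=-1)   # (1, seq_len, 4 * 768)
print(features.shape)
```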
