GPT

GPT 2

GPT-2 was trained on WebText, a dataset of about 40 GB of text
GPT-2 is built using only transformer decoder blocks
GPT-2 uses byte pair encoding to create tokens for its vocabulary
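To make the byte pair encoding idea concrete, here is a minimal sketch of how merge rules can be learned: repeatedly find the most frequent adjacent symbol pair and fuse it into one new symbol. This is a toy that works on characters for readability, whereas GPT-2 applies the same idea to bytes. The function name `learn_bpe_merges` and the tiny corpus are illustrative assumptions, not GPT-2's actual tokenizer code.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (character-level toy version)."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical toy corpus: after two merges, "low" becomes a single token.
merges, vocab = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
```

With this corpus the frequent prefix "low" ends up as one symbol, while rarer suffixes such as "e", "r", "s", "t" stay as separate symbols. That is how BPE keeps common words whole and splits rare words into pieces.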
A few differences between GPT-2 and BERT:
- GPT-2 uses decoder blocks and BERT uses encoder blocks
- GPT-2 is an autoregressive model and BERT is not. By giving up autoregression, BERT gained the ability to incorporate the context on both sides of a word to get better results.
- GPT-2 uses masked self-attention to prevent a position from looking at future tokens. BERT's self-attention is unmasked, so every token can attend to tokens on both sides.
- GPT-2 is trained as a language generation model (predicting the next token), while BERT is trained to be used as a backbone in transfer learning for different downstream tasks.
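The masked self-attention mentioned in the list above can be sketched in a few lines. This is a minimal, pure-Python illustration, assuming the raw attention scores have already been computed. The function name `causal_attention_weights` is hypothetical, and real implementations do this with batched tensor operations.

```python
import math


def causal_attention_weights(scores):
    """Apply a causal mask and a row-wise softmax to a square matrix of attention scores."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only attend to positions 0..i; future positions get -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With all-equal scores, each position spreads its attention evenly
# over itself and the positions before it, and gives zero to the future.
weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
```

Dropping the `j <= i` condition gives the unmasked, bidirectional attention that BERT uses.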
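The GPT-2 decoder blocks are similar to the original transformer decoder block, except that they do not have the second attention layer (the encoder-decoder attention), which gets its inputs from the encoder blocks.
GPT-2 can process sequences of up to 1024 tokens
The smallest GPT-2 model has 12 decoder layers with 12 attention heads each; the larger variants have more of both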
We can allow GPT-2 to ramble on its own or we can give it a prompt to have it speak about a certain topic.
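Both modes, rambling and prompted, come down to the same autoregressive loop: feed in the tokens so far, take the predicted next token, append it, and repeat. The sketch below assumes a hypothetical `next_token_fn` standing in for the trained model. Here it is a toy lookup table, so the loop can run without any model weights. Starting unprompted generation from the `<|endoftext|>` token mirrors how GPT-2 is commonly run, but treat that detail as an assumption of this sketch.

```python
END = "<|endoftext|>"  # used here as both the start token and the stop token


def generate(next_token_fn, prompt_tokens=(), max_new_tokens=10, context_size=1024):
    """Greedy autoregressive generation: append one predicted token at a time."""
    # With no prompt, start from the special token and let the model ramble.
    tokens = list(prompt_tokens) or [END]
    for _ in range(max_new_tokens):
        # The model only sees the most recent `context_size` tokens.
        context = tokens[-context_size:]
        next_token = next_token_fn(context)
        tokens.append(next_token)
        if next_token == END:
            break
    return tokens


# Hypothetical stand-in for a trained model: a fixed lookup on the last token.
TOY_TABLE = {END: "the", "the": "cat", "cat": "sat", "sat": "down", "down": END}


def toy_next_token(context):
    return TOY_TABLE.get(context[-1], END)


unprompted = generate(toy_next_token)                          # model rambles on its own
prompted = generate(toy_next_token, ["the"], max_new_tokens=2)  # continue from a prompt
```

A real model returns a probability distribution over the vocabulary rather than a single token, and the next token is usually sampled (for example with top-k sampling) instead of looked up.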

GPT 3

The GPT-3 architecture is also a transformer decoder model.
A total of 96 transformer decoder layers.
300 billion tokens are used for training.
The model consists of 175 Billion parameters.
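As a rough sanity check, the 175 billion figure can be approximated from the layer count. The sketch below uses the common rule of thumb of about 12 × d_model² weights per decoder block (4 × d_model² for the attention projections plus 8 × d_model² for a feed-forward layer that is four times wider). The hidden size of 12288 and the vocabulary size of 50257 are not stated in these notes. They are assumptions taken from the published GPT-3 configuration, and biases and layer-norm weights are ignored.

```python
def estimate_decoder_params(n_layers, d_model, vocab_size, context_size):
    """Rough parameter count for a decoder-only transformer (ignores biases and layer norms)."""
    attention = 4 * d_model * d_model           # query, key, value and output projections
    feed_forward = 2 * d_model * (4 * d_model)  # two linear layers with a 4x hidden width
    blocks = n_layers * (attention + feed_forward)
    embeddings = vocab_size * d_model + context_size * d_model  # token + position tables
    return blocks + embeddings


# Assumed GPT-3 configuration: 96 layers and 2048-token context from the notes,
# d_model=12288 and vocab_size=50257 from the published model description.
total = estimate_decoder_params(n_layers=96, d_model=12288, vocab_size=50257, context_size=2048)
print(f"{total / 1e9:.1f} billion parameters")
```

The estimate comes out at roughly 174.6 billion, close to the quoted 175 billion, and almost all of it sits in the 96 decoder blocks rather than in the embeddings.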
GPT-3 is 2048 tokens wide. This is called its “context window”.
One architectural difference between GPT-2 and GPT-3 is that GPT-3 alternates dense and locally banded sparse self-attention patterns across its layers.