GPT
GPT 2
- Trained on WebText, a ~40 GB text dataset
- GPT-2 is built using only transformer decoder blocks
- GPT-2 uses byte pair encoding (BPE) to create the tokens in its vocabulary (see the sketch after this list)
- A few differences between GPT-2 and BERT:
    - GPT-2 uses decoder blocks, while BERT uses encoder blocks
    - GPT-2 is an autoregressive model and BERT is not. By giving up autoregression, BERT gained the ability to incorporate the context on both sides of a word, which yields better results on many tasks.
    - GPT-2 uses masked self-attention to prevent a token from looking at tokens in the future; BERT does not do that (see the sketch after this list).
    - GPT-2 is trained as a language generation model (predicting the next word), while BERT is trained to be used as a backbone for transfer learning on different downstream tasks.
- The GPT-2 decoder blocks are similar to the original transformer decoder blocks, except that they drop the second attention layer (the encoder-decoder attention, which takes its inputs from the encoder blocks).
- GPT-2 can process up to 1024 tokens (its context size)
- The smallest GPT-2 model has 12 attention heads per layer
- We can allow GPT-2 to ramble on its own or we can give it a prompt to have it speak about a certain topic.
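
To make the byte pair encoding bullet concrete, here is a minimal, self-contained sketch of the merge step that BPE repeats while building a vocabulary. The toy corpus, character-level symbols, and five-merge loop are illustrative assumptions; GPT-2's actual tokenizer operates on raw bytes and learns roughly 50,000 merges.

```python
from collections import Counter

# Toy corpus as lists of symbols (characters here). Real GPT-2 BPE works on raw
# bytes and learns ~50,000 merges; this sketch only shows the core merge loop.
corpus = [list("low"), list("lower"), list("lowest"), list("newest"), list("widest")]

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across the corpus and return the most common one."""
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(sequences, pair):
    """Replace every occurrence of `pair` with a single merged symbol (a new token)."""
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

# Each iteration adds one merged symbol to the vocabulary.
for _ in range(5):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged:", pair)

print(corpus)  # the corpus re-tokenized with the learned merges
```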
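
The masked self-attention difference can also be sketched directly: attention scores for "future" positions are pushed to a large negative value before the softmax, so each token attends only to itself and earlier tokens. This is a single-head NumPy sketch with made-up dimensions and random weights, not GPT-2's real configuration.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head masked self-attention: each position may attend only to itself
    and to earlier positions, which is what makes GPT-2 autoregressive."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                  # (seq_len, seq_len) attention logits
    mask = np.triu(np.ones_like(scores), k=1)        # 1s above the diagonal mark future tokens
    scores = np.where(mask == 1, -1e9, scores)       # block attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over allowed positions
    return weights @ v

# Tiny example: 4 tokens, 8-dimensional embeddings, random projection weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(causal_self_attention(x, Wq, Wk, Wv).shape)    # -> (4, 8)
```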
GPT 3
- The architecture is a transformer decoder model.
- A total of 96 transformer decoder layers.
- Roughly 300 billion tokens are used for training.
- The model consists of 175 billion parameters (see the back-of-the-envelope count below).
- GPT-3 is 2048 tokens wide. This is called its “context window”
- One difference between GPT-2 and GPT-3 is that GPT-3 alternates dense and locally banded sparse self-attention layers in its transformer blocks
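
As a rough check on the parameter count above, here is a back-of-the-envelope sketch. It assumes the hyperparameters reported in the GPT-3 paper (d_model = 12288, ~50k BPE vocabulary) alongside the 96 layers and 2048-token context listed above, and it ignores biases and LayerNorm parameters.

```python
# Approximate parameter count for GPT-3 175B.
# d_model and vocab_size are taken from the GPT-3 paper ("Language Models are
# Few-Shot Learners"); biases and LayerNorm parameters are ignored.
n_layers = 96
d_model = 12288
vocab_size = 50257
context = 2048

embeddings = vocab_size * d_model + context * d_model   # token + position embeddings
attention_per_layer = 4 * d_model * d_model             # W_q, W_k, W_v, W_o projections
mlp_per_layer = 2 * d_model * (4 * d_model)             # feed-forward block, 4x hidden width
total = embeddings + n_layers * (attention_per_layer + mlp_per_layer)

print(f"{total / 1e9:.1f}B parameters")                  # ~174.6B, i.e. roughly 175B
```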