Text Preprocessing
Normalization
- Stripping whitespace
- Lowercasing
- Unicode Normalization - The same character can often be encoded in more than one way. Unicode normalization forms such as NFC, NFD, NFKC and NFKD map these variants to a single standard form.
- Converting numbers to words (e.g. 3 to "three")
- Expanding abbreviations
- Removing characters like @
- Correcting spelling
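A few of the normalization steps above can be sketched with the Python standard library. This is a minimal illustration, not a complete normalizer: the function name `normalize` is hypothetical, the choice of NFKC and of "@" as the only removed character are assumptions made for the example, and dictionary-based steps (abbreviations, spelling, numbers) are left out.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply a small, illustrative subset of the normalization steps."""
    # NFKC folds compatibility variants and composes base letter + combining
    # mark into a single code point (e.g. "e" + U+0301 becomes "é").
    text = unicodedata.normalize("NFKC", text)
    # Lowercasing.
    text = text.lower()
    # Removing characters like "@" (only "@" here, as an assumption).
    text = text.replace("@", "")
    # Stripping leading and trailing whitespace.
    return text.strip()


if __name__ == "__main__":
    # "Cafe" written with a combining acute accent, plus stray spaces and "@".
    print(normalize("  Cafe\u0301 @Home "))
```

Running Unicode normalization first means the later steps see one canonical spelling of each character.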
Pretokenization
- Splitting text into words
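One simple way to split text into words is a regular expression that separates word characters from punctuation. The sketch below is only one possible rule, and the name `pretokenize` is hypothetical; real pretokenizers often use more elaborate patterns.

```python
import re


def pretokenize(text: str) -> list[str]:
    """Split text into words and standalone punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


if __name__ == "__main__":
    print(pretokenize("hello, world!"))
```

Whitespace is discarded by this rule, so the original spacing cannot be recovered from the output.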
Tokenizer model
- Split into subwords with Byte-Pair Encoding (BPE)
- The tokenizer must be trained on the corpus, unless we use a pretrained tokenizer, which has already been trained.
- Words are divided into subwords to reduce the size of the vocabulary and the number of out-of-vocabulary tokens.
- Algorithms for subword tokenization include BPE, WordPiece and Unigram.
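The training step for BPE can be sketched in a few lines: start from characters, then repeatedly merge the most frequent adjacent pair of symbols. This is a toy version under simplifying assumptions (no end-of-word marker, no byte-level fallback, ties broken by first occurrence); the names `train_bpe`, `merge_pair` and `encode` are hypothetical and do not belong to any library.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words: list, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of characters with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word: str, merges: list) -> list:
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
    rules = train_bpe(corpus, 3)
    print(rules)
    print(encode("lowest", rules))
```

A word never seen in training, such as "lowest" here, is still encoded from known pieces, which is how subword tokenization reduces out-of-vocabulary tokens.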
Postprocessing
- Adding special tokens at the beginning or end
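As a small illustration of postprocessing, the sketch below wraps a token sequence in a start and an end marker. The defaults "[CLS]" and "[SEP]" follow the BERT convention and are chosen only as an example; the function name `add_special_tokens` is hypothetical.

```python
def add_special_tokens(tokens: list, bos: str = "[CLS]", eos: str = "[SEP]") -> list:
    """Return a new list with a special token at the beginning and the end."""
    return [bos, *tokens, eos]


if __name__ == "__main__":
    print(add_special_tokens(["lo", "w", "est"]))
```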
SentencePiece Tokenizer
- It treats each input text as a sequence of Unicode characters, which is useful for handling multilingual corpora.
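The idea of working on raw Unicode characters can be illustrated without the library itself. The sketch below is a character-level toy, not the SentencePiece API or its trained model: it assumes only that whitespace is replaced by the meta symbol "▁" (U+2581), as SentencePiece does, so that the text can be restored exactly. The names `to_pieces` and `from_pieces` are hypothetical.

```python
WS = "\u2581"  # "▁", the meta symbol used in place of a space


def to_pieces(text: str) -> list:
    """Turn text into a list of Unicode characters, with spaces made visible."""
    return list(text.replace(" ", WS))


def from_pieces(pieces: list) -> str:
    """Invert to_pieces: join the characters and restore the spaces."""
    return "".join(pieces).replace(WS, " ")


if __name__ == "__main__":
    pieces = to_pieces("héllo 世界")
    print(pieces)
    print(from_pieces(pieces))
```

Because no language-specific word splitting is involved, the same procedure applies to scripts that do not separate words with spaces.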