Text Preprocessing

Steps involved in tokenization

Normalization

Stripping whitespace
Lowercasing
Unicode normalization - The same character can often be written in several ways (for example, é as a single code point or as e followed by a combining accent). Normalization forms such as NFC, NFD, NFKC, and NFKD replace these variants with one standard form.
Converting numbers to words (e.g. 3 to three)
Expanding abbreviations
Removing special characters such as @
Correcting spelling errors
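
A minimal sketch of a few of the normalization steps above, using only the Python standard library. The function name `normalize` and the `ABBREVIATIONS` table are hypothetical, chosen for illustration; number conversion and spelling correction are left out because they normally need a dedicated library or dictionary.

```python
import unicodedata

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "approx.": "approximately"}


def normalize(text):
    """Apply a few of the normalization steps listed above."""
    # Unicode normalization: NFKC folds compatibility variants
    # (ligatures, full-width letters, ...) into one standard form.
    text = unicodedata.normalize("NFKC", text)
    # Lowercase, then strip leading and trailing whitespace.
    text = text.lower().strip()
    # Expand abbreviations word by word.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    text = " ".join(words)
    # Remove characters like "@".
    return text.replace("@", "")


print(normalize("  Dr. Smith @home  "))  # doctor smith home
print(normalize("\ufb01ne"))             # the "fi" ligature becomes "fine"
```

NFKC is used here because it also folds compatibility characters; NFC would be the more conservative choice if only canonical equivalents should be merged.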

Pretokenization

Splitting text into words
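
As a rough sketch, splitting on whitespace and punctuation can be done with one regular expression. The function name `pretokenize` is hypothetical, and real tokenizers use more elaborate rules (contractions, numbers, scripts without spaces).

```python
import re


def pretokenize(text):
    """Split text into words and standalone punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


print(pretokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```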

Tokenizer model

Split words into subwords, for example with Byte-Pair Encoding (BPE).
The tokenizer must first be trained on a corpus; a pretrained tokenizer has already been through this training.
Words are divided into subwords to reduce the size of the vocabulary and the number of out-of-vocabulary tokens.
Algorithms for subword tokenization include:
- BPE
- Unigram
- WordPiece
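
To make the training step concrete, here is a toy sketch of BPE training: start from single characters and repeatedly merge the most frequent adjacent pair. The function names and the word-frequency input format are assumptions made for this example; production tokenizers add end-of-word markers, byte-level fallbacks, and far faster data structures.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word -> count mapping."""
    # Each word starts as a tuple of single-character symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


merges, vocab = train_bpe({"hug": 10, "pug": 5, "hugs": 5}, num_merges=2)
print(merges)  # [('u', 'g'), ('h', 'ug')]
print(vocab)   # {('hug',): 10, ('p', 'ug'): 5, ('hug', 's'): 5}
```

After two merges the frequent word "hug" is a single token, while the rarer "pug" is still split into "p" and "ug". This is how subwords keep the vocabulary small without producing out-of-vocabulary words.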

Postprocessing

Adding special tokens at the beginning or end
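
A small sketch of this step. The marker strings `[CLS]` and `[SEP]` follow the BERT convention and are used here only as an illustration; other models use different markers, and the function name `add_special_tokens` is hypothetical.

```python
def add_special_tokens(tokens, start="[CLS]", end="[SEP]"):
    """Wrap a token sequence with start and end markers."""
    return [start, *tokens, end]


print(add_special_tokens(["hug", "s"]))  # ['[CLS]', 'hug', 's', '[SEP]']
```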

SentencePiece Tokenizer

It treats each input text as a raw sequence of Unicode characters, whitespace included, so it does not need language-specific pretokenization. This makes it useful for handling multilingual corpora.
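
The idea can be illustrated without the library itself. The sketch below imitates how SentencePiece escapes spaces with the meta symbol "▁" (U+2581) before working on the character sequence. The helper names are hypothetical and this is not the SentencePiece API, only a picture of the input it works on before a subword model such as BPE or Unigram is applied.

```python
META = "\u2581"  # the "▁" symbol SentencePiece uses to mark spaces


def to_char_sequence(text):
    """Escape spaces with the meta symbol and return the characters."""
    # A leading meta symbol marks the start of the first word.
    return list(META + text.replace(" ", META))


def from_char_sequence(chars):
    """Invert to_char_sequence, recovering the original text."""
    return "".join(chars).replace(META, " ")[1:]


chars = to_char_sequence("Hello world")
print(chars)
print(from_char_sequence(chars))  # Hello world
```

Because spaces are kept as ordinary symbols, the original text can be recovered from the tokens, which is what allows the same procedure to work for languages that do not separate words with spaces.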