CatBoost

  • CatBoost is a Newton boosting algorithm

  • Newton Boosting + Ordering = CatBoost

  • Target leakage: an example's own target value leaks into the encoding of its categorical features

  • Prediction shift: predictions on training examples are systematically biased relative to predictions on unseen data

  • Hold-out encoding computes target statistics on a separate split of the data, which avoids leakage but wastes data

  • Leave-one-out encoding does not fully eliminate target leakage and prediction shift (see the sketch after this list)

  • CatBoost introduces three major modifications to the classical Newton boosting approach

    • It is specialized to categorical features
    • It uses ordered boosting as its underlying approach, which allows it to address target leakage and prediction shift
    • It uses oblivious decision trees as base estimators which often leads to faster training times
  • CatBoost addresses prediction shift by using random permutations to order the training examples
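
A tiny demonstration, in the spirit of the example in the CatBoost paper, of why leave-one-out target encoding still leaks the target; the data and threshold below are made up purely for illustration:

```python
# With a single, uninformative category, the leave-one-out encoding of each
# training example is a deterministic function of its own label, so a simple
# threshold "learns" the training labels perfectly even though the feature
# carries no real signal.
import numpy as np

rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, size=20)            # random binary labels
total = y_train.sum()
n = len(y_train)

# leave-one-out encoding of a feature whose value is the same category everywhere
loo = (total - y_train) / (n - 1)                # excludes the example's own label

threshold = (total - 0.5) / (n - 1)
train_pred = (loo < threshold).astype(int)       # smaller encoding <=> this example's label was 1
print("train accuracy:", (train_pred == y_train).mean())   # 1.0 -- pure leakage

# at prediction time the encoding is computed from the training targets only,
# so every new example gets the same constant value -> no real signal
print("test-time encoding (constant):", total / n)
```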

Ordered Target Statistics

  • Ordered target statistics encode each training example's categorical values using only the targets of examples that appear before it in a random permutation, so an example's own target never leaks into its own encoding

  • One downside to this approach is that training examples that occur early in the randomized sequence are encoded using far fewer preceding examples, so their target statistics are noisier.

  • To compensate for this in practice and increase robustness, CatBoost maintains several random permutations, or "histories", and chooses among them at random. This means CatBoost recomputes the target statistics for categorical variables at each boosting iteration.
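
A minimal sketch of ordered target statistics for one categorical column with a binary target, assuming a single random permutation and a smoothing prior (the function name and parameters are illustrative, not CatBoost's actual API):

```python
# Example i is encoded using only the targets of examples that appear *before*
# it in a random permutation, plus a smoothed prior, so its own label never
# leaks into its encoding.
import numpy as np

def ordered_target_statistics(cats, y, prior=0.5, weight=1.0, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))                # one random "history"
    sums, counts = {}, {}                         # running per-category target sums/counts
    encoded = np.empty(len(y), dtype=float)
    for i in perm:                                # walk the history in order
        c = cats[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + weight * prior) / (n + weight)   # smoothed mean of preceding targets only
        sums[c] = s + y[i]                        # only now does example i enter the statistics
        counts[c] = n + 1
    return encoded

cats = np.array(["a", "a", "b", "a", "b", "b"])
y    = np.array([ 1,   0,   1,   1,   0,   1 ])
print(ordered_target_statistics(cats, y))
```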

Ordered Boosting

  • In addition to the issues with encoding features, gradient and Newton boosting both reuse the same data between iterations, leading to a gradient distribution shift, which ultimately causes a further prediction shift. That is to say, even if we had no categorical features at all, we would still have a prediction shift problem, which would bias our estimates of model generalization!

  • Ordered boosting addresses this: the residual for each training example is computed with a model that was trained only on the examples preceding it in a random permutation, so an example's own target never influences the model used to compute its own residual
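
A naive, purely illustrative sketch of ordered boosting for squared-loss regression: the update for each example comes from a model trained only on the examples that precede it in the permutation. CatBoost's actual implementation is far more efficient (it shares a small number of "histories" rather than fitting a model per example); sklearn's DecisionTreeRegressor merely stands in for the base learner here:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=40)

n = len(y)
perm = rng.permutation(n)                 # random ordering of the training examples
learning_rate = 0.3
n_rounds = 5

# preds[i] = prediction for example i made only by models that never saw example i
preds = np.zeros(n)

for _ in range(n_rounds):
    residuals = y - preds                 # leakage-free residuals by construction
    new_preds = np.zeros(n)
    for pos in range(n):
        i = perm[pos]
        prefix = perm[:pos]               # examples strictly before i in the ordering
        if len(prefix) >= 2:
            tree = DecisionTreeRegressor(max_depth=2)
            tree.fit(X[prefix], residuals[prefix])          # trained without example i
            new_preds[i] = preds[i] + learning_rate * tree.predict(X[i:i+1])[0]
        else:
            new_preds[i] = preds[i]       # too little history: keep the previous prediction
    preds = new_preds

print("train MSE of prefix-model predictions:", np.mean((y - preds) ** 2))
```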

Oblivious Decision Trees

  • Oblivious decision trees use the same splitting criterion (the same feature and threshold) in all the nodes across an entire level/depth of the tree

  • Oblivious decision trees require far fewer distinct split criteria than a standard decision tree (one per depth level rather than one per node). This makes them easier and more efficient to learn, which speeds up overall training.

  • Oblivious trees are balanced and symmetric, making them less complex and less prone to overfitting
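
A small sketch of how an oblivious (symmetric) tree is evaluated: because every level shares one (feature, threshold) test, a sample's leaf index is just the integer formed by its per-level split decisions. The splits and leaf values below are made up for illustration:

```python
import numpy as np

# one (feature_index, threshold) pair per depth level of the tree
splits = [(0, 0.5), (2, 1.3), (1, -0.2)]                 # depth-3 oblivious tree -> 2**3 = 8 leaves
leaf_values = np.arange(2 ** len(splits), dtype=float)   # toy leaf predictions

def oblivious_predict(x):
    leaf = 0
    for depth, (feat, thr) in enumerate(splits):
        bit = int(x[feat] > thr)          # same test for every node at this depth
        leaf |= bit << depth              # pack the decisions into a leaf index
    return leaf_values[leaf]

x = np.array([0.7, -1.0, 2.0])
print(oblivious_predict(x))               # evaluated with one comparison per level
```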

Tips to handle Categorical Features

  • Use string similarity to handle spelling mistakes in categorical features (string-similarity approaches can be character-based or token-based)
  • The dirty-cat package provides this functionality off the shelf; its main methods for handling dirty categories are listed below

Methods in dirty-cat

  • SimilarityEncoder: encodes each category by its string similarity to the known categories, so misspelled variants get similar representations
  • GapEncoder: learns interpretable, topic-like latent categories from substrings
  • MinHashEncoder: hashes substrings into a fixed-size representation that scales to high-cardinality columns
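
A short sketch using dirty-cat's SimilarityEncoder (a scikit-learn-style transformer) on a messy city column; it assumes the dirty_cat package is installed and uses its default n-gram string similarity:

```python
import numpy as np
from dirty_cat import SimilarityEncoder

# misspelled variants of the same city should get similar encodings,
# instead of entirely unrelated one-hot columns
cities = np.array(
    [["London"], ["Londn"], ["Paris"], ["paris"], ["New York"]]
)
enc = SimilarityEncoder()                 # n-gram string similarity under the hood
X = enc.fit_transform(cities)
print(X.round(2))                         # misspellings map to nearby vectors
```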

Reference

  • Ensemble Methods for Machine Learning, Gautam Kunapuli, Manning Publications