CatBoost

  • CatBoost is a Newton boosting algorithm

  • Newton Boosting + Ordering = CatBoost

  • Target leakage: an example's own target value leaks into the encoding of its categorical features

  • Prediction shift: predictions on training examples are systematically biased relative to predictions on unseen data

  • Hold-out encoding computes target statistics on a separate split of the data, which avoids leakage but wastes data

  • Leave-one-out encoding does not fully eliminate target leakage and prediction shift (see the sketch after this list)

  • CatBoost introduces three major modifications to the classical Newton boosting approach

    • It is specialized to categorical features
    • It uses ordered boosting as its underlying approach, which allows it to address target leakage and prediction shift
    • It uses oblivious decision trees as base estimators which often leads to faster training times
  • CatBoost addresses prediction shift by using random permutations to order the training examples
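
A tiny demonstration, in the spirit of the example in the CatBoost paper, of why leave-one-out target encoding still leaks the target; the data and threshold below are made up purely for illustration:

```python
# With a single, uninformative category, the leave-one-out encoding of each
# training example is a deterministic function of its own label, so a simple
# threshold "learns" the training labels perfectly even though the feature
# carries no real signal.
import numpy as np

rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, size=20)            # random binary labels
total = y_train.sum()
n = len(y_train)

# leave-one-out encoding of a feature whose value is the same category everywhere
loo = (total - y_train) / (n - 1)                # excludes the example's own label

threshold = (total - 0.5) / (n - 1)
train_pred = (loo < threshold).astype(int)       # smaller encoding <=> this example's label was 1
print("train accuracy:", (train_pred == y_train).mean())   # 1.0 -- pure leakage

# at prediction time the encoding is computed from the training targets only,
# so every new example gets the same constant value -> no real signal
print("test-time encoding (constant):", total / n)
```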

Ordered Target Statistics

  • Ordered target statistics encode each training example's categorical values using only the targets of examples that appear before it in a random permutation, so an example's own target never leaks into its own encoding

  • One downside to this approach is that training examples that occur early in the randomized sequence are encoded using far fewer preceding examples, so their target statistics are noisier.

  • To compensate for this in practice and increase robustness, CatBoost maintains several random permutations, or "histories", and chooses among them at random. This means CatBoost recomputes the target statistics for categorical variables at each boosting iteration.
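
A minimal sketch of ordered target statistics for one categorical column with a binary target, assuming a single random permutation and a smoothing prior (the function name and parameters are illustrative, not CatBoost's actual API):

```python
# Example i is encoded using only the targets of examples that appear *before*
# it in a random permutation, plus a smoothed prior, so its own label never
# leaks into its encoding.
import numpy as np

def ordered_target_statistics(cats, y, prior=0.5, weight=1.0, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))                # one random "history"
    sums, counts = {}, {}                         # running per-category target sums/counts
    encoded = np.empty(len(y), dtype=float)
    for i in perm:                                # walk the history in order
        c = cats[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + weight * prior) / (n + weight)   # smoothed mean of preceding targets only
        sums[c] = s + y[i]                        # only now does example i enter the statistics
        counts[c] = n + 1
    return encoded

cats = np.array(["a", "a", "b", "a", "b", "b"])
y    = np.array([ 1,   0,   1,   1,   0,   1 ])
print(ordered_target_statistics(cats, y))
```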

Ordered Boosting

  • In addition to the issues with encoding features, gradient and Newton boosting both reuse the same data between iterations, leading to a gradient distribution shift, which ultimately causes a further prediction shift. That is to say, even if we had no categorical features at all, we would still have a prediction shift problem, which would bias our estimates of model generalization!

  • Ordered boosting addresses this: the residual for each training example is computed with a model that was trained only on the examples preceding it in a random permutation, so an example's own target never influences the model used to compute its own residual
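
A naive, purely illustrative sketch of ordered boosting for squared-loss regression: the update for each example comes from a model trained only on the examples that precede it in the permutation. CatBoost's actual implementation is far more efficient (it shares a small number of "histories" rather than fitting a model per example); sklearn's DecisionTreeRegressor merely stands in for the base learner here:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=40)

n = len(y)
perm = rng.permutation(n)                 # random ordering of the training examples
learning_rate = 0.3
n_rounds = 5

# preds[i] = prediction for example i made only by models that never saw example i
preds = np.zeros(n)

for _ in range(n_rounds):
    residuals = y - preds                 # leakage-free residuals by construction
    new_preds = np.zeros(n)
    for pos in range(n):
        i = perm[pos]
        prefix = perm[:pos]               # examples strictly before i in the ordering
        if len(prefix) >= 2:
            tree = DecisionTreeRegressor(max_depth=2)
            tree.fit(X[prefix], residuals[prefix])          # trained without example i
            new_preds[i] = preds[i] + learning_rate * tree.predict(X[i:i+1])[0]
        else:
            new_preds[i] = preds[i]       # too little history: keep the previous prediction
    preds = new_preds

print("train MSE of prefix-model predictions:", np.mean((y - preds) ** 2))
```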

Oblivious Decision Trees

  • Oblivious decision trees use the same splitting criterion (the same feature and threshold) in all the nodes across an entire level/depth of the tree

  • Oblivious decision trees require far fewer distinct split criteria than a standard decision tree (one per depth level rather than one per node). This makes them easier and more efficient to learn, which speeds up overall training.

  • Oblivious trees are balanced and symmetric, making them less complex and less prone to overfitting
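
A small sketch of how an oblivious (symmetric) tree is evaluated: because every level shares one (feature, threshold) test, a sample's leaf index is just the integer formed by its per-level split decisions. The splits and leaf values below are made up for illustration:

```python
import numpy as np

# one (feature_index, threshold) pair per depth level of the tree
splits = [(0, 0.5), (2, 1.3), (1, -0.2)]                 # depth-3 oblivious tree -> 2**3 = 8 leaves
leaf_values = np.arange(2 ** len(splits), dtype=float)   # toy leaf predictions

def oblivious_predict(x):
    leaf = 0
    for depth, (feat, thr) in enumerate(splits):
        bit = int(x[feat] > thr)          # same test for every node at this depth
        leaf |= bit << depth              # pack the decisions into a leaf index
    return leaf_values[leaf]

x = np.array([0.7, -1.0, 2.0])
print(oblivious_predict(x))               # evaluated with one comparison per level
```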

Tips to handle Categorical Features

  • Use string similarity to handle spelling mistakes in categorical features (string-similarity approaches can be character-based or token-based)
  • The dirty-cat package provides this functionality off the shelf; its main methods for handling dirty categories are listed below

Methods in dirty-cat

  • SimilarityEncoder: encodes each category by its string similarity to the known categories, so misspelled variants get similar representations
  • GapEncoder: learns interpretable, topic-like latent categories from substrings
  • MinHashEncoder: hashes substrings into a fixed-size representation that scales to high-cardinality columns
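
A short sketch using dirty-cat's SimilarityEncoder (a scikit-learn-style transformer) on a messy city column; it assumes the dirty_cat package is installed and uses its default n-gram string similarity:

```python
import numpy as np
from dirty_cat import SimilarityEncoder

# misspelled variants of the same city should get similar encodings,
# instead of entirely unrelated one-hot columns
cities = np.array(
    [["London"], ["Londn"], ["Paris"], ["paris"], ["New York"]]
)
enc = SimilarityEncoder()                 # n-gram string similarity under the hood
X = enc.fit_transform(cities)
print(X.round(2))                         # misspellings map to nearby vectors
```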

Reference

  • Ensemble Methods for Machine Learning, Gautam Kunapuli, Manning Publications