CatBoost
CatBoost is a Newton boosting algorithm:
Newton Boosting + Ordering = CatBoost
CatBoost introduces three major modifications to the classical Newton boosting approach:
- It has specialized handling for categorical features
- It uses ordered boosting as its underlying approach, which allows it to address target leakage and prediction shift
- It uses oblivious decision trees as base estimators, which often leads to faster training times
CatBoost addresses prediction shift by using random permutations to order the training examples.
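Before digging into the details, here is a minimal usage sketch (assuming the catboost package is installed; the toy data and hyperparameter values are purely illustrative):

```python
from catboost import CatBoostClassifier

# Toy dataset: one categorical column and one numeric column.
X = [["red", 1.0], ["blue", 2.0], ["red", 3.0],
     ["green", 4.0], ["blue", 5.0], ["green", 6.0]]
y = [0, 1, 0, 1, 1, 0]

model = CatBoostClassifier(iterations=50, depth=3, verbose=False)
# cat_features marks which columns CatBoost should encode internally
# (with ordered target statistics) rather than expecting numeric input.
model.fit(X, y, cat_features=[0])
print(model.predict([["blue", 2.5]]))
```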
Ordered Target Statistics
To encode a categorical value, CatBoost computes target statistics using only the training examples that precede it in a random permutation, so an example's own target never leaks into its encoding. One downside to this approach is that training examples that occur early in the permutation are encoded using far fewer examples.
To compensate for this in practice and increase robustness, CatBoost maintains several permutations, or "histories", and randomly chooses among them. This means that CatBoost recomputes the target statistics for categorical variables at each iteration.
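The following is a minimal sketch of ordered target statistics in plain NumPy. The function name, the smoothing parameter a, and the prior are illustrative choices, not CatBoost's internal API:

```python
import numpy as np

def ordered_target_statistics(categories, targets, prior, a=1.0, seed=0):
    """Encode each example using only the targets of earlier examples
    (in a random permutation) that share its category value."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(categories))
    sums, counts = {}, {}
    encoded = np.empty(len(categories))
    for i in perm:  # walk the permutation; earlier examples form the "history"
        c = categories[i]
        # Smoothed statistic: (sum of earlier targets + a * prior) / (count + a)
        encoded[i] = (sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a)
        sums[c] = sums.get(c, 0.0) + targets[i]
        counts[c] = counts.get(c, 0) + 1
    return encoded

cats = ["red", "blue", "red", "green", "blue", "red"]
y = np.array([1, 0, 1, 1, 0, 0])
print(ordered_target_statistics(cats, y, prior=y.mean()))
```

Note how the first occurrence of each category falls back to the prior: with no history yet, the smoothing term is all the encoder has.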
Ordered Boosting
- In addition to issues with encoding features, gradient and Newton boosting both reuse data between iterations, leading to a gradient distribution shift, which ultimately causes a further prediction shift. That is to say, even without categorical features we would still have a prediction shift problem, which would bias our estimates of model generalization! Ordered boosting addresses this by computing each example's gradient with a model that was never trained on that example (see the sketch below).
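A naive sketch of the idea, using a hypothetical helper built on scikit-learn trees. The real CatBoost implementation avoids the per-example model fits shown here by maintaining only a logarithmic number of "support" models per permutation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ordered_residuals(X, y, seed=0):
    """Residual for each example is computed against a model fit only on
    examples that precede it in a random permutation, so the example's own
    target never leaks into the prediction used for its gradient."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    residuals = np.zeros(len(y))
    for k, i in enumerate(perm):
        if k == 0:
            pred = 0.0  # no history yet for the first example
        else:
            hist = perm[:k]  # only examples earlier in the permutation
            tree = DecisionTreeRegressor(max_depth=2).fit(X[hist], y[hist])
            pred = tree.predict(X[i:i + 1])[0]
        residuals[i] = y[i] - pred
    return residuals

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=20)
print(ordered_residuals(X, y))
```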
Oblivious Decision Trees
- Oblivious decision trees use the same splitting criterion in all the nodes across an entire level/depth of the tree (see the sketch after this list)
- Oblivious decision trees have far fewer distinct split criteria to learn than standard decision trees, which makes them easier and more efficient to learn and speeds up overall training
- Oblivious trees are balanced and symmetric, making them less complex and less prone to overfitting
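A small sketch of prediction with an oblivious tree (the class and its field names are illustrative): because every level shares one split, a depth-d tree is just d (feature, threshold) pairs plus 2^d leaf values, and prediction reduces to building a d-bit index:

```python
import numpy as np

class ObliviousTree:
    """A depth-d oblivious tree: one (feature, threshold) split per level,
    shared by every node at that level, plus 2**d leaf values."""
    def __init__(self, splits, leaf_values):
        self.splits = splits            # list of (feature_index, threshold)
        self.leaf_values = leaf_values  # length 2 ** len(splits)

    def predict_one(self, x):
        # Each level contributes one bit; the bits index directly into the leaves.
        idx = 0
        for feature, threshold in self.splits:
            idx = (idx << 1) | int(x[feature] > threshold)
        return self.leaf_values[idx]

tree = ObliviousTree(splits=[(0, 0.5), (1, 2.0)],
                     leaf_values=[10.0, 11.0, 12.0, 13.0])
print(tree.predict_one(np.array([0.7, 1.0])))  # bits (1, 0) -> leaf 2 -> 12.0
```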
Tips for Handling Categorical Features
- Use string similarity to handle spelling mistakes in categorical features. String similarity approaches can be character-based or token-based. The dirty-cat package provides this functionality off-the-shelf; its methods for handling dirty categories include similarity encoding, gap encoding, and min-hash encoding (see the sketch after this list).
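A minimal sketch of similarity encoding with dirty-cat (assuming the dirty_cat package is installed; the toy city names are illustrative):

```python
from dirty_cat import SimilarityEncoder

# Misspelled variants of the same underlying categories.
cities = [["London"], ["Londen"], ["Paris"], ["Pariss"]]

enc = SimilarityEncoder()
# Each value is encoded as a vector of n-gram similarities to the categories
# seen during fit, so near-duplicates like "London"/"Londen" end up with
# nearly identical encodings instead of being treated as unrelated levels.
print(enc.fit_transform(cities))
```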
Reference
- Gautam Kunapuli, Ensemble Methods for Machine Learning, Manning Publications