Boosting
- Sequential ensembles exploit the dependence between base estimators
- Sequential ensembles train a new base estimator so that it minimizes the mistakes made by the base estimator trained in the previous step
- Boosting aims to combine weak learners, or “simple” base estimators. In contrast, parallel ensembles like bagging use strong learners (such as full decision trees) as base estimators
- Boosting typically uses decision stumps, or decision trees of depth 1, as base estimators (weak learners)
- Sequential ensemble methods such as boosting aim to combine several weak learners into a single strong learner. These methods literally “boost” weak learners into a strong learner
AdaBoost: ADAptive BOOSTing
To ensure that the base learning algorithm prioritizes misclassified training examples, AdaBoost maintains weights over individual training examples. Misclassified examples receive more weight
When we train the next base estimator sequentially, the weights will allow the learning algorithm to prioritize (and hopefully fix) mistakes from the previous iteration. This is the “adaptive” component of AdaBoost, which ultimately leads to a powerful ensemble. The whole training set is used for each iteration
Each weak estimator is different from the others and classifies the problem in its own way. Reweighting allows AdaBoost to train a different base estimator at each iteration, one that is often different from the estimators trained at previous iterations.
Adaptive reweighting, or updating the weights adaptively, promotes ensemble diversity
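For reference, the standard AdaBoost quantities behind this reweighting (the implementation below follows the same form), written in LaTeX with labels encoded as y_i in {-1, +1}:

  \epsilon_t = \sum_i D_t(i)\, \mathbb{1}[h_t(x_i) \neq y_i], \qquad
  \alpha_t = \tfrac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}, \qquad
  D_{t+1}(i) \propto D_t(i)\, \exp\bigl(-\alpha_t\, y_i\, h_t(x_i)\bigr)

where \epsilon_t is the weighted training error of weak learner h_t and \alpha_t is its weight in the final ensemble.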
AdaBoost Steps
Use decision stumps (decision trees of depth 1), which are weak learners, as base estimators
Keep track of weights on individual training examples
Keep track of weights of individual base estimators
AdaBoost Implementation
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np

def fit_boosting(X, y, n_estimators=10):
    n_samples, n_features = X.shape
    # Initialize the weights for the training set
    D = np.ones((n_samples,))
    estimators = []
    for t in range(n_estimators):
        # Normalize the weights (sum to 1)
        D = D / np.sum(D)
        # Train a decision stump as the weak learner, weighted by D
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=D)
        ypred = h.predict(X)
        # Calculate the weighted error of the weak learner
        e = 1 - accuracy_score(y, ypred, sample_weight=D)
        # Calculate the weight of the weak learner
        a = 0.5 * np.log((1 - e) / e)
        # Calculate the sign for each training example:
        # if the training example is correct, its weight will be reduced;
        # if the training example is incorrect, its weight will be increased
        m = (y == ypred) * 1 + (y != ypred) * -1
        # Calculate the new weights for the training set
        D *= np.exp(-a * m)
        # Save the weak learner and its weight
        estimators.append((a, h))
    return estimators
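To make predictions with the fitted ensemble, the weak learners' outputs are combined additively using their weights. A minimal sketch, assuming the labels are encoded as -1/+1 (the function name predict_boosting is illustrative):

import numpy as np

def predict_boosting(X, estimators):
    # Weighted additive (linear) combination of the weak learners' predictions;
    # assumes labels were encoded as -1/+1, consistent with the reweighting above
    scores = np.zeros((X.shape[0],))
    for a, h in estimators:
        scores += a * h.predict(X)
    # The final prediction is the sign of the weighted sum
    return np.sign(scores)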
AdaBoost with scikit-learn
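A minimal sketch with scikit-learn's AdaBoostClassifier, using decision stumps as base estimators. The dataset is a toy example; note that in scikit-learn versions before 1.2 the estimator parameter is named base_estimator instead.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data for illustration
X, y = make_classification(n_samples=500, random_state=42)
Xtrn, Xtst, ytrn, ytst = train_test_split(X, y, random_state=42)

# Decision stumps (depth-1 trees) as weak learners
stump = DecisionTreeClassifier(max_depth=1)
ensemble = AdaBoostClassifier(estimator=stump, n_estimators=100, learning_rate=1.0)
ensemble.fit(Xtrn, ytrn)
print(ensemble.score(Xtst, ytst))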
AdaBoost in practice
- The adaptive property (adapting to mistakes made by previous weak learners) becomes a disadvantage when outliers are present. The weak learners continue to misclassify the outlier, in which case AdaBoost increases its weight further, which in turn causes succeeding weak learners to misclassify it as well, so its weight keeps growing
- Alternatively, a weak learner correctly classifies the outlier, in which case AdaBoost has overfit the data
Training Robust AdaBoost Models
Learning Rate
The learning rate adjusts the contribution of each estimator to the ensemble (a learning rate of 0.75 decreases the overall contribution of each estimator by a factor of 0.75). In the presence of outliers, the learning rate should be lowered.
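A sketch of tuning the learning rate jointly with the number of estimators via cross-validated grid search. The grid values and dataset here are illustrative assumptions; the estimator parameter is named base_estimator in scikit-learn versions before 1.2.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)  # toy data

# Illustrative grid: lower learning rates are traded off against more estimators
param_grid = {'learning_rate': [0.25, 0.5, 0.75, 1.0],
              'n_estimators': [50, 100, 200]}
search = GridSearchCV(AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1)),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)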
Early Stopping and Pruning
- Identifying the smallest number of base estimators needed to build an effective ensemble is known as early stopping.
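One way to find that point is to evaluate the ensemble on held-out data after each boosting round, which scikit-learn exposes through staged_predict. A minimal sketch (dataset and parameter values are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)  # toy data
Xtrn, Xval, ytrn, yval = train_test_split(X, y, random_state=42)

ensemble = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                              n_estimators=200)
ensemble.fit(Xtrn, ytrn)

# staged_predict yields ensemble predictions after 1, 2, ..., n_estimators rounds,
# so we can see how many base estimators are actually needed on held-out data
val_acc = [accuracy_score(yval, ypred) for ypred in ensemble.staged_predict(Xval)]
best_n = int(np.argmax(val_acc)) + 1
print(best_n, val_acc[best_n - 1])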
Summary
- AdaBoost, or Adaptive Boosting, is a sequential ensemble algorithm that uses weak learners as base estimators.
- In classification, a weak learner is a simple model that performs only slightly better than random guessing, that is 50% accuracy. Decision stumps and shallow decision trees are examples of weak learners.
- AdaBoost maintains and updates weights over training examples. It uses reweighting both to prioritize misclassified examples and to promote ensemble diversity.
- AdaBoost is also an additive ensemble in that it makes final predictions through weighted additive (linear) combinations of the predictions of its base estimators.
- AdaBoost is generally robust to overfitting as it ensembles several weak learners. However, it is sensitive to outliers owing to its adaptive reweighting strategy, which repeatedly increases the weight of outliers over iterations.
- The performance of AdaBoost can be improved by finding a good tradeoff between the learning rate and the number of base estimators
- Cross-validation with grid search is commonly used to identify the best tradeoff between the learning rate and the number of estimators.
- Under the hood, AdaBoost ultimately optimizes the exponential loss function.
- LogitBoost is another boosting algorithm that optimizes the logistic loss function. It differs from AdaBoost in two other ways
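For reference, the two losses mentioned above, written for labels y in {-1, +1} and ensemble prediction F(x) (the logistic loss is shown up to the scaling convention used by LogitBoost):

  L_{\exp}(y, F(x)) = e^{-y F(x)}, \qquad
  L_{\log}(y, F(x)) = \log\bigl(1 + e^{-y F(x)}\bigr)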
Reference
- Ensemble Methods for Machine Learning (Manning)