Fitting functions to data

  • Linear functions are both concave and convex. When we take the maximum of linear functions, we are rewarded with convexity
  • If we have a non-linear convex function, the pointwise maximum of all the linear functions that stay below it is exactly equal to it (see the sketch after this list). This gives us a path to exploit the simplicity of linear functions when we work with convex functions.
  • For convex functions, every local minimum is also a global minimum
  • A non-linear function can be either convex or non-convex
  • For non-convex functions, with their peaks, valleys, and saddle points, we run the risk of getting stuck at a local minimum
  • A numerical method’s search for the minimum happens at ground level (in the space of the parameters); it does not search in the higher-dimensional space in which the function’s landscape is embedded.
  • It is important to find, at ground level, a direction that quickly decreases the function height, and how far we can move in that direction at ground level (the learning rate) while still decreasing the function height above us
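
A minimal sketch of the tangent-line idea in Python (the function and anchor points are made-up toys): the pointwise maximum of a few tangent lines of f(x) = x² stays below f everywhere and touches it at the anchors; with more tangents the envelope reconstructs f exactly.

```python
import numpy as np

# Convex toy function and its derivative: f(x) = x^2, f'(x) = 2x.
f = lambda x: x**2
df = lambda x: 2 * x

# Tangent line at anchor a: t_a(x) = f(a) + f'(a) * (x - a).
# Convexity guarantees every tangent stays below f.
anchors = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
x = np.linspace(-2, 2, 9)

tangents = np.array([f(a) + df(a) * (x - a) for a in anchors])
envelope = tangents.max(axis=0)  # pointwise maximum of the tangents

# The gap f(x) - envelope is zero at the anchors and non-negative
# everywhere else; adding more tangents shrinks it toward zero.
print(np.round(f(x) - envelope, 3))
```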

Linear regression

  • Both analytic and numeric methods exist for linear regression. For most other models an analytic solution is difficult, so we depend on numeric methods. (figure: analytic solution for linear regression)
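
A minimal sketch of the analytic route, assuming the standard normal-equation solution θ = (XᵀX)⁻¹Xᵀy (the toy data here is made up):

```python
import numpy as np

# Hypothetical toy data: y = 3x + 2 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 3 * x + 2 + 0.1 * rng.standard_normal((100, 1))

# Add a bias column so the intercept is learned as a weight.
X = np.hstack([x, np.ones_like(x)])

# Analytic solution: solve (X^T X) theta = X^T y directly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta.ravel())  # approximately [3, 2]
```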

Logistic regression

(figures: reason for the loss function; cross-entropy loss function)
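
A minimal sketch of the binary cross-entropy loss (function names and inputs are illustrative): it penalizes confident wrong predictions heavily, which is the usual motivation for preferring it over squared error in classification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, p, eps=1e-12):
    # Binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)], averaged.
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1])
print(cross_entropy(y, sigmoid(np.array([2.0, -3.0, 0.5]))))   # low loss
print(cross_entropy(y, sigmoid(np.array([-2.0, 3.0, -0.5]))))  # high loss
```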

Multiclass logistic function

(figures: train function for multiclass classification; multiclass sigmoid function; multiclass loss function)
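
A minimal sketch of the multiclass pieces, assuming the "multiclass sigmoid" refers to the softmax function (names and data are illustrative):

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multiclass_loss(Y, P, eps=1e-12):
    # Cross-entropy with one-hot labels Y and predicted probabilities P.
    return -np.mean(np.sum(Y * np.log(np.clip(P, eps, 1.0)), axis=1))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
Y = np.array([[1, 0, 0],
              [0, 0, 1]])  # true classes: 0 and 2
print(multiclass_loss(Y, softmax(logits)))
```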

SVM

  • Seeks to separate labeled data using the widest possible margin
  • An optimal highway of separation instead of a thin line of separation
  • SVM uses the hinge loss function. A point whose margin is at least 1 incurs no penalty; a margin between 0 and 1 is still penalized; and a negative margin (the point is on the wrong side of the boundary) incurs a high penalty. (figures: train function; train loss function; loss explained; final loss)
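
A minimal sketch of the hinge loss on the margin y·f(x) (names and scores are illustrative):

```python
import numpy as np

def hinge_loss(y, score):
    # y in {-1, +1}; score = w.x + b. Margin >= 1 -> no penalty;
    # margin in (0, 1) -> small penalty; margin < 0 -> large penalty.
    return np.maximum(0.0, 1.0 - y * score)

y = np.array([+1, +1, +1, -1])
score = np.array([2.0, 0.5, -1.0, -3.0])
print(hinge_loss(y, score))  # [0.  0.5 2.  0. ]
```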

Decision Tree

  • It is a non-parametric model: it does not fix the shape of the function ahead of time.
  • This flexibility makes it prone to overfitting the data
  • Entropy and the Gini index are used to measure the quality of a candidate split (and hence a feature’s importance). The Gini index is less expensive to compute, so it is the default in most packages.
  • Entropy approach - choose the feature split that provides the maximum information gain
  • Gini impurity - choose the split that provides the lowest weighted-average Gini impurity (see the sketch after this list)
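
A minimal sketch of both impurity measures for a node’s label array (names and labels are illustrative):

```python
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions p.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # G = 1 - sum(p^2): cheaper to compute, no logarithm needed.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

pure, mixed = np.array([1, 1, 1, 1]), np.array([0, 1, 0, 1])
print(entropy(pure), gini(pure))    # ~0.0 and 0.0 for a pure node
print(entropy(mixed), gini(mixed))  # 1.0 and 0.5 for a 50/50 node
```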

(figures: entropy and information gain; Gini index; regression trees)

Pros & Cons
  • Unstable: small changes in the training data can produce a very different tree
  • Sensitive to rotations in the data, since their decision boundaries are usually horizontal and vertical. Fix - transform the data to align with its principal axes, using SVD
  • Prone to overfitting the data

K-Means clustering

  • It minimizes the within-cluster variance: the sum of squared Euclidean distances from each point to its cluster’s mean
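
A minimal sketch of the classic Lloyd’s-algorithm iteration (assumes toy, well-separated data; no handling for clusters that go empty):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    # Alternate assignment and mean-update steps; each step can only
    # decrease the within-cluster sum of squared distances.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)
print(np.round(centers, 1))  # roughly [0, 0] and [5, 5]
```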

Feature selection

  • The F-test and mutual information are common scoring tests for feature selection
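
A minimal sketch using scikit-learn’s SelectKBest with both scoring functions (the synthetic dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Hypothetical data: 10 features, only 3 of which are informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)

# F-test: scores the linear dependence between each feature and the target.
f_selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
# Mutual information: also captures non-linear dependence.
mi_selector = SelectKBest(score_func=mutual_info_classif, k=3).fit(X, y)

print(f_selector.get_support(indices=True))   # indices of the kept features
print(mi_selector.get_support(indices=True))
```
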
What is bagging, pasting, random patches and stacking?
  • Bagging is training each model on random subsets of the data sampled with replacement; pasting samples the subsets without replacement
  • Random patches means sampling both training instances and features; sampling only the features (keeping all instances) is known as the random subspaces method
  • Stacking trains a final model on the predictions of all the base models to get the final prediction
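
A minimal sketch with scikit-learn’s BaggingClassifier, where the bootstrap flag toggles bagging versus pasting and the feature-sampling options give random patches (the dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# bootstrap=True -> bagging (sampling with replacement);
# bootstrap=False -> pasting (sampling without replacement).
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            max_samples=0.8, bootstrap=True, random_state=0)

# Also sampling features (max_features < 1.0) gives random patches.
patches = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            max_samples=0.8, max_features=0.5,
                            bootstrap=True, bootstrap_features=True,
                            random_state=0)

print(bagging.fit(X, y).score(X, y), patches.fit(X, y).score(X, y))
```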

Gradient Descent

  • Gradient - the effect on the loss of changing a single parameter while keeping everything else constant; how much the loss changes if one parameter changes a little bit

  • Gradient is steepness. The learning rate can be equated to step size, and the steepness dictates the number of steps (the relative impact of the parameter): we then take a number of steps proportional to the relative impact - more impact, more steps
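
A minimal sketch of the update rule θ ← θ − lr·∇L on a made-up loss whose two parameters have very different steepness (this also previews the next bullet’s point about a shared learning rate):

```python
import numpy as np

# Made-up loss with very different steepness per parameter:
# L(a, b) = 10*a^2 + b^2, so grad = (20*a, 2*b).
def grad(theta):
    return np.array([20.0, 2.0]) * theta

theta = np.array([1.0, 1.0])
lr = 0.04  # must stay below 2/20 = 0.1, a limit set by the steepest curve
for _ in range(50):
    theta = theta - lr * grad(theta)  # step length is proportional to steepness

# The steep parameter a converges almost immediately, while the shallow
# parameter b crawls along under the same shared learning rate.
print(np.round(theta, 4))
```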

  • As we will be using a single learning rate for all the parameters, the size of the learning rate is limited by the steepest curve; all the other curves will be using a suboptimal learning rate, given their shapes. (Why can’t we use different learning rates for different curves? Adaptive optimizers such as AdaGrad and Adam effectively do this, scaling each parameter’s step by its gradient history.)

  • To make the gradients equally steep, we should do feature standardization or normalization. (figure: impact on loss with and without scaling for two parameters) We should use the training set only to fit the StandardScaler, then use its transform method to apply the preprocessing step to all datasets: training, validation, and test
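
A minimal sketch of the fit-on-train-only discipline with scikit-learn’s StandardScaler (the data is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.randn(100, 2) * [10.0, 0.1]  # very different scales
X_val = np.random.randn(20, 2) * [10.0, 0.1]

scaler = StandardScaler().fit(X_train)  # fit on the training set ONLY
X_train_s = scaler.transform(X_train)   # then transform every split
X_val_s = scaler.transform(X_val)       # with the same training statistics

print(X_train_s.std(axis=0))  # ~[1, 1]: the gradients are now equally steep
```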

(figure: change in loss)

(figure: effect of learning rate on loss)