Gradient Descent
- The algorithm estimates the direction of steepest descent by computing the gradient of the loss function with respect to the model's parameters. The gradient points uphill, so the algorithm steps in the opposite direction, subtracting a fraction of the gradient (a fraction set by the learning rate) from the parameters. A minimal sketch of this update appears after the list.
- Too bad your phone is out of juice, because the algorithm may not have propelled you to the bottom of a convex mountain. Instead, you may be stuck in a nonconvex landscape of multiple valleys (local minima), peaks (local maxima), saddles (saddle points), and plateaus. In fact, the loss functions for tasks like image recognition, text generation, and speech recognition are nonconvex, and many variations on gradient descent have emerged to handle such situations. For example, the algorithm may have momentum that helps it zoom over small rises and dips, giving it a better chance of arriving at the bottom (see the momentum sketch below). Luckily, in such high-dimensional landscapes, local minima tend to yield losses roughly equivalent to the global minimum's.
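A minimal sketch of the basic update described in the first item, assuming a toy quadratic loss and illustrative values for the learning rate, step count, and starting point (none of these come from the text above):

```python
import numpy as np

def gradient_descent(grad_fn, theta, learning_rate=0.1, steps=100):
    """At each step, move opposite the gradient, scaled by the learning rate."""
    for _ in range(steps):
        theta = theta - learning_rate * grad_fn(theta)  # gradient points uphill; subtract it
    return theta

# Toy convex loss L(theta) = ||theta||^2, whose gradient is 2 * theta.
grad = lambda theta: 2 * theta
print(gradient_descent(grad, theta=np.array([3.0, -4.0])))  # converges toward the origin
```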
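And a sketch of the momentum variant mentioned in the second item, again with illustrative hyperparameters (the 0.9 decay factor is a common default, not a value from the text):

```python
import numpy as np

def gradient_descent_momentum(grad_fn, theta, learning_rate=0.1,
                              momentum=0.9, steps=100):
    """Accumulate a velocity so each step carries past small rises and dips."""
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * grad_fn(theta)
        theta = theta + velocity  # move along the accumulated velocity
    return theta
```

Because the velocity averages recent gradients, the update keeps moving through shallow bumps and plateaus where the plain gradient would nearly vanish.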