# Multilayer Perceptron, SGD, and Optimization

## Multilayer Perceptron

### Background

Multilayer perceptrons (MLPs) are a type of neural network that serves as the fundamental architecture for many deep learning algorithms. They consist of a series of fully connected layers of nodes, or "neurons." With a sufficient number of nodes, MLPs can approximate any continuous function, and they are useful in a number of applications, including classification, regression, image recognition, and natural language processing (NLP).

*A simple MLP with two hidden layers that performs binary classification.*

On a forward pass during training, a vector $ \mathbf{x} \in \mathbb{R}^d $ is multiplied by a learnable weight matrix and sent through an activation function to generate the input for the next layer; this process repeats until a final output $ \hat{\mathbf{y}} $—whether a class label or a vector—is generated. After the prediction is made, the loss is computed and backpropagation is used to update the weight matrices as needed. During testing, a single feed-forward pass through the network generates the corresponding label or prediction $ \hat{\mathbf{y}} $ for an input vector $ \mathbf{x} $.

### Forward Pass

#### Through a Single Node

We will first consider a forward pass through a *single node*, node $ 0 $, within hidden layer $ i $ of an MLP. Let $ x_i \in \mathbb{R}^{m \times 1} $ be the input vector to this node; i.e., our setup is as follows:

*In the above example,* $ x_i \in \mathbb{R}^{4 \times 1} $ *is the output of hidden layer* $ i-1 $ *and the input to our node* $ 0 $ *in hidden layer* $ i $.

During this forward pass through our node in layer $ i $, we execute the following steps:

1. __Multiplication by a (learned) weight vector $ w_0 $ associated with this node.__ Since $ x_i \in \mathbb{R}^{4 \times 1} $, we know $ w_0 \in \mathbb{R}^{4 \times 1} $ as well. We perform the multiplication as follows: $$ a_0 = w_{0}^T x_i $$ which outputs a scalar value: $ a_0 $ is a weighted sum of the entries in $ x_i $. Note that we can also include a bias term $ b_0 $—i.e., compute $ w_{0}^T x_i + b_0 $—by appending $ b_0 $ to $ w_0 $ and appending $ 1 $ to our input $ x_i $: $$ \begin{bmatrix}w_{0} \\ b_0\end{bmatrix}^T \begin{bmatrix}x_i \\ 1 \end{bmatrix} = w_{0}^T x_i + b_0 $$ For the sake of brevity, we will usually write this weight-vector multiplication without the bias term, knowing that the weights can include a bias.

2. __Pass through an activation function__ $ \sigma (\cdot) $. The second and final step is to send $ a_0 $ through an activation function. Selecting an appropriate activation function has important ramifications for the behavior of our node. Setting $ \sigma(\cdot) $ to be a step function, for example, $$ \sigma(a_0) = \begin{cases} 1 & \text{if } \ w_{0}^T x_i > 0 \\ 0 & \text{otherwise} \end{cases} $$ turns this node into the classic Perceptron, which can only learn a linear decision boundary. Furthermore, the step function is non-differentiable, which will become problematic later during backpropagation. Commonly used activation functions, which are both non-linear and differentiable, are listed below.

The final output from our node $ 0 $ in Hidden Layer $ i $, then, is $$ z_{i}^0=\sigma(a_0) $$
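To make these two steps concrete, here is a minimal NumPy sketch of the single-node computation. The input, weight, and bias values are made up for illustration, and a sigmoid activation is assumed:

```python
import numpy as np

def sigmoid(a):
    """Sigmoid activation, applied to the scalar pre-activation."""
    return 1.0 / (1.0 + np.exp(-a))

# Made-up input from the previous layer (m = 4) and parameters for node 0.
x_i = np.array([0.5, -1.2, 0.3, 2.0])   # input vector x_i in R^4
w_0 = np.array([0.1, 0.4, -0.7, 0.2])   # learned weight vector w_0 in R^4
b_0 = 0.05                              # bias term

a_0 = w_0 @ x_i + b_0    # step 1: weighted sum w_0^T x_i + b_0 (a scalar)
z_0 = sigmoid(a_0)       # step 2: pass through the activation function
print(a_0, z_0)
```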
__Common activation functions:__

| Name | Formulation | Range | Use Cases |
| --- | --- | --- | --- |
| ReLU (Rectified Linear Unit) | $ \sigma(a) = \max(0,a) $ | $ [0,\infty) $ | Most common default activation function; used in CNNs |
| Sigmoid | $ \sigma(a)=\frac{1}{1+e^{-a}} $ | $ (0,1) $ | Used in binary classification problems |
| Tanh (Hyperbolic Tangent) | $ \sigma(a)=\frac{e^a - e^{-a}}{e^a+e^{-a}} $ | $ (-1,1) $ | Hidden layers of deep networks, especially Recurrent Neural Networks (RNNs) |

#### Through a Whole Layer

We know that each of the $ n $ nodes in our hidden layer $ i $ has a corresponding weight vector $ w_j $ and final output value $ z_{i}^j=\sigma(a_j)=\sigma(w_{j}^T x_i) $ for $ j=0,1,2,...,n-1 $:

*In our example, each of the three nodes in Hidden Layer $ i $ outputs a single value after multiplication by the corresponding weight vector and a pass through the chosen activation function $ \sigma(\cdot) $.*

Thus, our output for the entire Hidden Layer $ i $ is given by the vector $$ z_i= \begin{bmatrix} z_{i}^0 \\ z_{i}^{1} \\ \vdots \\ z_{i}^{n-1} \end{bmatrix} $$ where $ n $ is the number of nodes in Hidden Layer $ i $.

How do we get here? Observe that each weight vector $ w_{j} \in \mathbb{R}^{m \times 1} $ for a node $ j $ in layer $ i $ is multiplied by the input vector $ x_i $. We can concatenate these weight vectors to define a weight matrix $ \mathbf{W}_i $ for this entire Hidden Layer $ i $ as follows:

*Concatenating the individual weight vectors $ \in \mathbb{R}^{4 \times 1} $ to form a weight matrix $ \mathbf{W}_i \in \mathbb{R}^{4 \times 3} $ for Hidden Layer $ i $, which generates the vector $ a_i \in \mathbb{R}^{3 \times 1} $ via multiplication with the input vector $ x_i \in \mathbb{R}^{4 \times 1} $.*

Thus, we now have a weight matrix $$ \mathbf{W}_i \in \mathbb{R}^{m \times n} $$ for our Hidden Layer $ i $, where the input $ x_i \in \mathbb{R}^{m \times 1} $ and $ n $ is the number of nodes in Hidden Layer $ i $ (and thus the dimension of the output). As shown above, we multiply the transpose of this weight matrix by our input $ x_i \in \mathbb{R}^{m \times 1} $ to obtain our vector $ a_i \in \mathbb{R}^{n \times 1} $ as follows: $$ a_i = \mathbf{W}_{i}^T x_i = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{n-1}\end{bmatrix} $$

Now we simply apply our chosen activation $ \sigma(\cdot) $ element-wise to $ a_i $ in order to obtain $ z_i $: $$ z_i = \sigma(a_i)=\begin{bmatrix} \sigma(a_0) \\ \sigma(a_1) \\ \vdots \\ \sigma(a_{n-1}) \end{bmatrix} $$ which is our final output for our Hidden Layer $ i $.

*Back to our example, applying $ \sigma(\cdot) $ element-wise to our vector $ a_i \in \mathbb{R}^{3 \times 1} $ generates our final output vector $ z_i \in \mathbb{R}^{3 \times 1} $ for Hidden Layer $ i $.*

That is, the final output $ z_i \in \mathbb{R}^{n \times 1} $ of a Hidden Layer $ i $ with weight matrix $ \mathbf{W}_i \in \mathbb{R}^{m \times n} $, given an input vector $ x_i \in \mathbb{R}^{m \times 1} $, is calculated as follows: $$ z_i = \sigma(a_i)=\sigma(\mathbf{W}_{i}^T x_i) $$
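As a quick illustration of the layer-wise computation $ z_i = \sigma(\mathbf{W}_i^T x_i) $, here is a minimal NumPy sketch using the dimensions from the running example ($ m = 4 $ inputs, $ n = 3 $ nodes). The random weights and input are placeholders, and ReLU is assumed as the activation:

```python
import numpy as np

def relu(a):
    """ReLU activation, applied element-wise."""
    return np.maximum(0.0, a)

m, n = 4, 3                        # input dimension and number of nodes in layer i
rng = np.random.default_rng(0)

x_i = rng.normal(size=(m, 1))      # input x_i in R^{m x 1} (output of layer i-1)
W_i = rng.normal(size=(m, n))      # weight matrix W_i in R^{m x n}, one column per node
b_i = np.zeros((n, 1))             # one bias per node

a_i = W_i.T @ x_i + b_i            # a_i = W_i^T x_i + b_i, in R^{n x 1}
z_i = relu(a_i)                    # z_i = sigma(a_i), the output of hidden layer i
print(z_i.ravel())
```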
### Loss Functions

Once we complete our forward pass through the layers of our MLP as described above, we generate a final prediction $ \hat{y} $. For supervised learning, we need to quantify the difference between our predicted value and the ground truth in order to adjust the weights of our MLP—that is, we have to pick a loss function.

We use $ \ell(\cdot) $ to denote the per-sample loss (the loss for a single data point) and $ \mathcal{L}(\cdot) $ to denote the total or average loss over the entire dataset. The two most common loss functions for MLPs, as well as their specific use cases, are shown below:

1. Binary Cross-Entropy (BCE, or Log Loss): $ BCE= -\frac{1}{N}\sum_{i=1}^N \left[ y_i \log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i) \right] $
   * Commonly used in binary classification settings
2. Mean Squared Error (MSE): $ MSE= \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 $
   * Used in regression settings

Regardless of which loss function we choose, let $ \ell(\cdot) $ be the loss at a single point $ (\mathbf{x}_i,\mathbf{y}_i) $ in our dataset of $ N $ points. As above, we can quantify the (average) loss over our training dataset $ \mathit{D}_{TR} $ given weight matrices $ \mathbf{W} $ as follows: $$ \mathcal{L}(\mathbf{W};\mathit{D}_{TR}) = \frac{1}{N} \sum_{i=1}^N \ell(\mathbf{y}_i, \hat{\mathbf{y}}_i) $$

### Backpropagation

The goal of backpropagation is to update the weight matrices and bias terms (i.e., the parameters) of our MLP. To do this, we calculate the gradient of the loss function with respect to our parameters and update them accordingly, adjusting most strongly those weights that contributed most to the error (as quantified by the loss function). We return to our earlier example of a simple MLP, which is now labeled with the corresponding weight matrices $ \mathbf{W}_i $ and output values $ z_i $ for a given layer $ i $:

*Our MLP from above, with calculations for the $ \mathbf{W}_i $ (weight matrix) and $ z_i $ (final output) terms at each layer. Note that we write the bias term $ b_i $ separately from the weight matrix $ \mathbf{W}_i $ for this section.*

Recall that after our forward pass, we generate a final prediction $ \hat{y}_i $. We also have a loss function $ \mathcal{L} $ that we can use to quantify the difference between the true label $ y_i $ and our prediction $ \hat{y}_i $: $$ \mathcal{L}(y_i, \hat{y}_i) $$ However, since our goal is ultimately to take the gradient of $ \mathcal{L} $ with respect to the weights and biases, the current formulation of the loss function is unhelpful, since it is only in terms of $ \hat{y}_i $ and $ y_i $. Thank goodness for the chain rule! Observe that $ \hat{y}_i $ is exactly the output $ z_3 $ of the last layer, and that $ z_3 $ itself is $ \sigma(a_3) $. Therefore, using the chain rule, we can write the gradient of the loss with respect to the weights and bias of the last layer as follows: $$ \frac{\partial \mathcal{L}}{\partial \mathbf{W}_3} = \frac{\partial \mathcal{L}}{\partial a_3} \frac{\partial a_3}{\partial \mathbf{W}_3}, \qquad \frac{\partial \mathcal{L}}{\partial b_3} = \frac{\partial \mathcal{L}}{\partial a_3} \frac{\partial a_3}{\partial b_3} $$ We can do this because the loss function $ \mathcal{L} $ is a function of $ \hat{y}_i=z_3=\sigma(a_3) $, and $ a_3=\mathbf{W}_{3}^Tz_2+b_3 $ is a function of both $ \mathbf{W}_3 $ and $ b_3 $.

Observe that both $ \frac{\partial \mathcal{L}}{\partial \mathbf{W}_3} $ and $ \frac{\partial \mathcal{L}}{\partial b_3} $ have the term $ \frac{\partial \mathcal{L}}{\partial a_3} $ in their formulation. Let's find this term first—recalling that $ a_3=\mathbf{W}_{3}^Tz_2+b_3 $—using the chain rule: $$ \frac{\partial \mathcal{L}}{\partial a_3} = \frac{\partial \mathcal{L}}{\partial z_3} \odot \sigma'(a_3) $$ (where $ \odot $ means element-wise multiplication). Now, we just have to find $ \frac{\partial a_3}{\partial \mathbf{W}_3} $ and $ \frac{\partial a_3}{\partial b_3} $, which we do below: $$ \frac{\partial a_3}{\partial \mathbf{W}_3} = z_2, \qquad \frac{\partial a_3}{\partial b_3} = 1 $$ Therefore, we find that the gradients of the loss function with respect to $ \mathbf{W}_3 $ and $ b_3 $, the weight matrix and bias for the last layer, are as follows: $$ \frac{\partial \mathcal{L}}{\partial \mathbf{W}_3} = z_2 \left( \frac{\partial \mathcal{L}}{\partial a_3} \right)^T, \qquad \frac{\partial \mathcal{L}}{\partial b_3} = \frac{\partial \mathcal{L}}{\partial a_3} $$ Excellent—but our work isn't done.
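To see these last-layer formulas in action, here is a minimal NumPy sketch, assuming a sigmoid activation on the output, a squared-error per-sample loss, and made-up values for $ z_2 $, $ \mathbf{W}_3 $, and $ y $:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
m, n = 3, 1                            # last layer maps R^3 -> R^1 (a single prediction)

z_2 = rng.normal(size=(m, 1))          # output of layer 2 (placeholder values)
W_3 = rng.normal(size=(m, n))          # weight matrix of the last layer
b_3 = np.zeros((n, 1))                 # bias of the last layer
y = np.array([[1.0]])                  # true label

# Forward pass through the last layer.
a_3 = W_3.T @ z_2 + b_3
z_3 = sigmoid(a_3)                     # prediction y_hat = z_3
loss = ((y - z_3) ** 2).item()         # per-sample squared-error loss

# Backward pass, exactly as derived above.
dL_dz3 = 2.0 * (z_3 - y)               # dL/dz_3 for the squared error
dL_da3 = dL_dz3 * sigmoid_prime(a_3)   # dL/da_3 = dL/dz_3 ⊙ sigma'(a_3)
dL_dW3 = z_2 @ dL_da3.T                # dL/dW_3 = z_2 (dL/da_3)^T
dL_db3 = dL_da3                        # dL/db_3 = dL/da_3
dL_dz2 = W_3 @ dL_da3                  # dL/dz_2, passed backward to layer 2
```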
We now need to continue working *backwards* to calculate $ \frac{\partial \mathcal{L}}{\partial \mathbf{W}_2} $ and $ \frac{\partial \mathcal{L}}{\partial b_2} $, the gradients of the loss function with respect to the parameters of the previous layer. Where should we begin? Recall that computing $ \frac{\partial \mathcal{L}}{\partial a_3}=\frac{\partial \mathcal{L}}{\partial z_3} \odot \sigma'(a_3) $ was critical for both of our gradients in layer 3. Thus, for layer 2, we first want to find the term $ \frac{\partial \mathcal{L}}{\partial a_2}=\frac{\partial \mathcal{L}}{\partial z_2} \odot \sigma'(a_2) $, which we do (using the chain rule, once again) below: $$ \frac{\partial \mathcal{L}}{\partial z_2} = \mathbf{W}_3 \frac{\partial \mathcal{L}}{\partial a_3} \quad \Longrightarrow \quad \frac{\partial \mathcal{L}}{\partial a_2} = \left( \mathbf{W}_3 \frac{\partial \mathcal{L}}{\partial a_3} \right) \odot \sigma'(a_2) $$ Note that we calculated $ \frac{\partial \mathcal{L}}{\partial a_3} $ in the previous layer, so we can readily substitute this value to find $ \frac{\partial \mathcal{L}}{\partial a_2} $, which is the key to calculating $ \frac{\partial \mathcal{L}}{\partial \mathbf{W}_2} $ and $ \frac{\partial \mathcal{L}}{\partial b_2} $ (following the same steps we used above to calculate $ \frac{\partial \mathcal{L}}{\partial \mathbf{W}_3} $ and $ \frac{\partial \mathcal{L}}{\partial b_3} $): $$ \frac{\partial \mathcal{L}}{\partial \mathbf{W}_2} = z_1 \left( \frac{\partial \mathcal{L}}{\partial a_2} \right)^T, \qquad \frac{\partial \mathcal{L}}{\partial b_2} = \frac{\partial \mathcal{L}}{\partial a_2} $$

Once we find $ \frac{\partial \mathcal{L}}{\partial \mathbf{W}_2} $ and $ \frac{\partial \mathcal{L}}{\partial b_2} $ using $ \frac{\partial \mathcal{L}}{\partial a_2} $, we can continue further backward to layer 1, repeating the process (first finding $ \frac{\partial \mathcal{L}}{\partial a_1} $ and then calculating the appropriate gradients). Once we have calculated the gradient of the loss function with respect to each parameter, our work is (almost) done.

#### Update Step

Now that we have all of these gradients, how do we actually change the weights and bias terms? The update step depends on our optimization algorithm (see the "Optimization" section for a detailed description of the tradeoffs between different optimization algorithms). If we are just using vanilla gradient descent (GD), the update step simply looks like the following: $$ \mathbf{W}_i= \mathbf{W}_i-\alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}_i} $$ $$ b_i = b_i - \alpha \frac{\partial \mathcal{L}}{\partial b_i} $$ where $ \alpha $ is our learning rate. Again, see below for other (better) optimization algorithms. Once we have updated all of our weight matrices and bias terms, our backward pass is complete.

### Strengths and Limitations

MLPs are an essential architecture for many machine learning algorithms, but they still have their challenges. Below we explore some of the strengths and weaknesses of MLPs in practice:

#### Strengths

* __Non-linear decision boundaries:__ As shown above, composing nodes with a series of non-linear activation functions helps MLPs learn complicated decision boundaries; in fact, they can approximate any continuous function.
  * See [Neural Network Decision Boundary Visualization](https://youtu.be/k-Ann9GIbP4) for a helpful visualization
* __Adaptability:__ With some modifications/variations, MLPs are suitable for a range of tasks like classification, regression, and natural language processing (NLP).

#### Limitations

* __Vanishing or Exploding Gradients:__ Notice that as we calculate the gradients during backpropagation, we perform a series of multiplications. If the gradients are all extremely large or extremely small, multiplying them together compounds the problem and can lead to vanishingly small or explosively large gradients, which slows (or stops) learning.
  * Solution: Residual connections (ResNets)
* __Large Number of Parameters:__ For very deep networks, the number of parameters can come at a high cost in terms of memory and computation. This makes regular MLPs intractable for high-dimensional data, such as raw image pixels.
  * Solutions: Regularization, alternative architectures (e.g., CNNs for image data)

Overall, MLPs are especially useful for regression and classification. If you are working with image data or temporal data, however, you may want to opt for a different model architecture. Nonetheless, MLPs are a foundational architecture for many deep learning applications.

__Sources__:

* [Multi-Layer Perceptron Learning in Tensorflow](https://www.geeksforgeeks.org/multi-layer-perceptron-learning-in-tensorflow/) (GeeksforGeeks)
* [Recap & Multi-Layer Perceptrons Slides](https://www.cs.cornell.edu/courses/cs4782/2025sp/slides/pdf/week1_2_slides_complete.pdf) (CS 4/5782: Deep Learning, Spring 2025)
* [Mastering Tanh: A Deep Dive into Balanced Activation for Machine Learning](https://medium.com/ai-enthusiast/mastering-tanh-a-deep-dive-into-balanced-activation-for-machine-learning-4734ec147dd9) (Medium)
* [Exploring the Power and Limitations of Multi-Layer Perceptron (MLP) in Machine Learning](https://shekhar-banerjee96.medium.com/exploring-the-power-and-limitations-of-multi-layer-perceptron-mlp-in-machine-learning-d97a3f84f9f4) (Medium)

## Optimization

In deep learning, improving a model involves minimizing the difference between its predictions and the true labels (i.e., minimizing how far the model's predictions are from the actual values). This difference is quantified using a loss function. Let $ \ell(\mathbf{w}, \mathbf{x}_i) $ denote the per-sample loss for data point $ \mathbf{x}_i $ with model weights $ \mathbf{w} $. The overall objective is to minimize the empirical risk, i.e., the average loss across the dataset of $ n $ samples, which can be defined as: $$ \mathcal{L}(\mathbf{w})= \frac{1}{n} \sum_{i=1}^n \ell(\mathbf{w}, \mathbf{x}_i) $$

The goal of optimization algorithms is to find the model weights $ \mathbf{w} $ that minimize this loss function. Broadly, optimizers can be categorized into two types:

1. Non-Adaptive Optimizers: Use the same learning rate for all parameters. Examples include Gradient Descent (GD), Stochastic Gradient Descent (SGD), Minibatch SGD, and SGD with Momentum.
2. Adaptive Optimizers: Adjust the learning rate for each parameter individually based on past gradients. Examples include AdaGrad, RMSProp, and Adam.

In the sections that follow, we will explore both non-adaptive and adaptive optimization methods in detail.

### Non-Adaptive Optimizers

#### Gradient Descent

Gradient descent is an iterative optimization algorithm commonly used to minimize a loss function in ML and DL models. At each step, the model's parameters are updated in the direction that reduces the loss the most—i.e., the direction of the negative gradient of the loss function with respect to the parameters. The standard update rule for traditional gradient descent is $$ \mathbf{w}^{t+1} = \mathbf{w}^t - \alpha \ \nabla \mathcal{L}(\mathbf{w}^t) $$ where $ \mathbf{w}^t $ denotes the weights at iteration $ t $, $ \alpha $ is the learning rate, and $ \nabla \mathcal{L}(\mathbf{w}^t) $ is the gradient of the empirical loss.
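As a concrete (toy) illustration of this update rule, the following sketch runs gradient descent on a small least-squares problem; the dataset, learning rate, and iteration count are made up, and the loss is the mean squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # toy dataset: 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])           # targets from a made-up "true" weight vector

w = np.zeros(3)                              # initial weights w^0
alpha = 0.1                                  # learning rate

for t in range(200):
    residual = X @ w - y                     # predictions minus targets
    grad = 2.0 / len(X) * (X.T @ residual)   # gradient of the mean-squared-error loss
    w = w - alpha * grad                     # w^{t+1} = w^t - alpha * grad

print(w)                                     # should be close to [1.0, -2.0, 0.5]
```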
We can expand this gradient as an average over the dataset, so the update becomes $$ \mathbf{w}^{t+1} = \mathbf{w}^t - \alpha \ \cdot \frac{1}{n} \sum_{i=1}^n \nabla \ell(\mathbf{w}^t, \mathbf{x}_i) $$

__Key Limitation__: Each update step requires computing the gradient across all training data points. For large datasets, this is computationally expensive, slow, and memory-intensive, making standard gradient descent inefficient in practice (especially when fast updates are desired).

#### Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent addresses the inefficiency of full-batch (or traditional) gradient descent by computing the gradient using only one randomly selected sample $ \mathbf{x}^t $ at each iteration $ t $. As a result, the update rule for SGD is: $$ \mathbf{w}^{t+1} = \mathbf{w}^t - \alpha \ \nabla \ell(\mathbf{w}^t, \mathbf{x}^t) $$

While this approach is much faster per iteration and suitable for large datasets, it introduces a new challenge: the gradients from individual samples are often noisy and do not necessarily point toward the direction of the true minimum. This may cause the optimization path to fluctuate significantly, introducing a "noise ball" effect where updates wander before settling near a minimum. Despite this, stochastic gradients are unbiased estimates of the true gradient. In expectation, they still guide the optimizer toward the correct direction: $$ \mathbb{E}_{i} [\nabla \ell(\mathbf{w}^t, \mathbf{x}_i)] = \nabla \mathcal{L}(\mathbf{w}^t) $$ This means that, with appropriate tuning of the learning rate, SGD can still converge effectively in practice.

#### Minibatch SGD

Minibatch SGD strikes a balance between the extremes of full gradient descent and pure SGD. Instead of computing the gradient over all data or a single sample, it does so over a randomly selected batch of samples $ \boldsymbol{\beta}^t $ with batch size $ b $ at iteration $ t $. The update rule then becomes $$ \mathbf{w}^{t+1} = \mathbf{w}^t - \alpha \cdot \frac{1}{b} \sum_{i \in \boldsymbol{\beta}^t} \nabla \ell(\mathbf{w}^t, \mathbf{x}_i) $$

This approach benefits from reduced noise compared to pure SGD due to averaging over multiple samples, while still being significantly faster and more memory-efficient than full-batch gradient descent. However, minibatch SGD still involves some noise and does not provide as smooth a convergence path as full gradient descent.

Let us now visualize how different optimization strategies traverse the loss surface. The figure below compares Gradient Descent, Stochastic Gradient Descent, and Minibatch SGD in terms of their update trajectories:

![Visuals of Gradient Descent, SGD, and Minibatch SGD](https://raw.githubusercontent.com/jpersonalacct/MLP-Optimization/main/THREE.jpg)

These visualizations highlight the core trade-offs between stability, speed, and computational cost across these optimization strategies. To summarize: while full-batch gradient descent provides stable and consistent updates, it is computationally expensive. In contrast, SGD and Minibatch SGD offer faster and more scalable updates, but at the cost of increased noise in the optimization path.
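To make the difference from full-batch gradient descent concrete, here is a toy NumPy sketch of minibatch SGD on the same kind of least-squares problem as before; setting the batch size $ b $ to $ 1 $ recovers pure SGD, and setting it to the full dataset size recovers gradient descent. All values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy dataset
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
alpha, b = 0.1, 16                            # learning rate and batch size

for t in range(500):
    batch = rng.choice(len(X), size=b, replace=False)  # random minibatch at iteration t
    Xb, yb = X[batch], y[batch]
    grad = 2.0 / b * Xb.T @ (Xb @ w - yb)     # average per-sample gradient over the batch
    w = w - alpha * grad                      # minibatch SGD update

print(w)                                      # noisier than full GD, but close to [1.0, -2.0, 0.5]
```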
#### Behavior in Non-Convex Loss Surfaces

In addition to trade-offs in performance and efficiency, SGD and GD also face challenges that stem from the complex shape of the loss surface, where the optimization landscape is often highly non-convex and filled with irregularities. Common issues include local minima, saddle points, and flat regions.

__Local Minima__: Non-convex loss functions naturally contain local minima. Traditional gradient descent can easily get stuck at such points. In contrast, SGD is less likely to get stuck in sharp local minima. This is due to the inherent randomness in its updates: the noise introduced by sampling a single example (or a small batch) allows SGD to "bounce out" of shallow or sharp local minima and continue exploring the loss surface.

__Saddle Points & Flat Regions__: SGD often slows down or becomes stuck around saddle points or flat regions in the loss surface. For context:

* Saddle points are locations where the gradient is close to zero, but the point is neither a local minimum nor a maximum.
* Flat regions, or plateaus, have very small gradients across a wide area, resulting in negligible parameter updates.

In both cases, the optimizer may make very slow progress or fail to converge altogether. To address these challenges, we turn to techniques like momentum and adaptive optimizers.

#### SGD with Momentum

To improve convergence speed, reduce oscillations, and help escape saddle points, SGD with Momentum introduces a concept called momentum, which accumulates a history of past gradients to smooth out updates. Instead of updating the weights directly based on the current gradient, we compute an exponentially weighted moving average (EWMA) of the gradients. This accumulated value acts like "velocity" in the direction of descent. The update rules for SGD with Momentum are (using a randomly selected sample $ \mathbf{x}^t $ at iteration $ t $): $$ \mathbf{m}^{t+1} = \mu \ \mathbf{m}^t - \alpha \ \nabla \ell(\mathbf{w}^t, \mathbf{x}^t) $$ $$ \mathbf{w}^{t+1} = \mathbf{w}^t + \mathbf{m}^{t+1} $$ where $ \mu \in [0,1] $ is the momentum coefficient and $ \alpha $ is the learning rate as before. The momentum term helps the optimizer build speed in consistent descent directions and dampen oscillations in noisy regions. In practice, using momentum typically leads to faster convergence than standard SGD.

### Adaptive Optimizers

In the variants of SGD we covered previously, the same learning rate $ \alpha $ is used for all parameters. But in many cases, especially with sparse data, we want more fine-grained control. Some features might appear very frequently, while others are rare. We want to take smaller steps (lower learning rates) in directions where the gradients have been consistently large, and larger steps in directions where the gradients are small. This is the core idea behind adaptive optimizers: learning rates that adapt individually for each parameter.

#### AdaGrad

AdaGrad (short for Adaptive Gradient Algorithm) modifies SGD by assigning individual learning rates to each parameter based on the history of past gradients. It effectively keeps learning rates large for rarely-updated parameters and shrinks them for frequently-updated ones. The AdaGrad update rule consists of two steps. First, it maintains an accumulator vector $ \mathbf{v}^t $, which stores the sum of squared gradients up to time $ t $: $$ \mathbf{v}^{t+1} = \mathbf{v}^t + (\mathbf{g}^{t})^2 $$ Here, $ \mathbf{g}^t $ is the gradient at time $ t $, and the square is applied element-wise. Second, it updates the weights as follows: $$ \mathbf{w}^{t+1} = \mathbf{w}^t - \frac{\alpha}{\sqrt{\mathbf{v}^{t+1}+\varepsilon}} \ \odot \ \mathbf{g}^t $$ Each component of the gradient vector is scaled by the inverse square root of its accumulated squared gradients, effectively assigning a unique, diminishing learning rate to each parameter.
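Below is a small NumPy sketch of AdaGrad applied to the same toy least-squares problem, using single-sample (SGD-style) gradients; the learning rate, $ \varepsilon $, and iteration count are made-up illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # toy dataset
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
v = np.zeros(3)                              # accumulator of squared gradients
alpha, eps = 0.5, 1e-8

for t in range(1000):
    i = rng.integers(len(X))                 # one randomly selected sample, as in SGD
    g = 2.0 * X[i] * (X[i] @ w - y[i])       # per-sample gradient g^t
    v = v + g ** 2                           # v^{t+1} = v^t + (g^t)^2, element-wise
    w = w - alpha / np.sqrt(v + eps) * g     # per-parameter scaled update

print(w)                                     # gradually approaches [1.0, -2.0, 0.5]
```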
Although AdaGrad often leads to faster and more stable convergence early in training, one of its key limitations is that the accumulated squared gradients grow over time, causing the learning rates to shrink excessively. This overly aggressive decay can eventually stall learning altogether, especially in longer training runs.

#### RMSProp

Much like momentum, RMSProp (Root Mean Square Propagation) maintains an exponentially weighted moving average; instead of averaging the gradients themselves, however, it averages the squared gradients. This technique helps prevent the aggressive learning rate decay that occurs in AdaGrad by giving more weight to recent gradients and gradually "forgetting" older ones. The update rule is as follows: $$ \mathbf{v}^{t+1} = \beta \mathbf{v}^t + (1-\beta) \ (\mathbf{g}^{t})^2 $$ $$ \mathbf{w}^{t+1} = \mathbf{w}^t - \frac{\alpha}{\sqrt{\mathbf{v}^{t+1} + \varepsilon}} \ \odot \ \mathbf{g}^t $$ where $ \beta \in [0,1] $ is the exponential moving average constant.

RMSProp adapts the learning rate for each parameter based on the recent magnitude of its gradients, which makes it well-suited for non-stationary objectives, such as those encountered in recurrent neural networks. Compared to AdaGrad, RMSProp generally stabilizes learning and leads to faster, more reliable convergence. One limitation of RMSProp is that it considers only the magnitude of recent gradients, not their direction. As a result, it cannot accelerate in consistent descent directions, so it can struggle to navigate narrow valleys or curved surfaces efficiently. This is addressed by Adam.

#### Adam

Adam (short for Adaptive Moment Estimation) combines the strengths of momentum and RMSProp into a single, highly effective optimization algorithm. In practice, Adam often outperforms other optimizers and is widely used as the default choice in many deep learning frameworks. Like momentum, Adam maintains an exponentially weighted decaying average of past gradients to capture the direction of movement. At the same time, like RMSProp, Adam maintains an exponential moving average of squared gradients to adaptively scale the learning rate for each parameter. Adam's update step includes all of the following computations:

* First moment estimate (like momentum): $ \mathbf{m}^{t+1} = \beta_1 \mathbf{m}^t + (1-\beta_1) \ \mathbf{g}^t $
* Second moment estimate (like RMSProp): $ \mathbf{v}^{t+1} = \beta_2 \mathbf{v}^t + (1-\beta_2) \ (\mathbf{g}^{t})^2 $
* Bias-corrected estimates, which correct for the bias introduced by initializing $ \mathbf{m}^0 $ and $ \mathbf{v}^0 $ at zero: $ \hat{\mathbf{m}}^{t+1} = \frac{\mathbf{m}^{t+1}}{1-\beta_{1}^{t+1}} $ and $ \hat{\mathbf{v}}^{t+1} = \frac{\mathbf{v}^{t+1}}{1-\beta_{2}^{t+1}} $
* Weight update: $ \mathbf{w}^{t+1} = \mathbf{w}^t - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}^{t+1}+\varepsilon}} \ \odot \ \hat{\mathbf{m}}^{t+1} $

Adam is popular because it combines the stability of RMSProp with the acceleration of momentum, often achieving faster convergence and requiring less hyperparameter tuning than SGD. However, despite its performance benefits, Adam does not always generalize as well as SGD, particularly in tasks like image classification.
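The following sketch implements these Adam equations in NumPy and applies them to the same toy least-squares problem with single-sample gradients. The $ \beta_1 $, $ \beta_2 $, and $ \varepsilon $ values are common defaults; the learning rate, data, and iteration count are made up for this toy example:

```python
import numpy as np

def adam_step(w, g, m, v, t, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following the equations above (iteration index t starts at 0)."""
    m = beta1 * m + (1 - beta1) * g               # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2          # second moment estimate
    m_hat = m / (1 - beta1 ** (t + 1))            # bias-corrected first moment
    v_hat = v / (1 - beta2 ** (t + 1))            # bias-corrected second moment
    w = w - alpha / np.sqrt(v_hat + eps) * m_hat  # adaptive, momentum-accelerated step
    return w, m, v

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # toy dataset
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)                   # moment estimates start at zero
for t in range(2000):
    i = rng.integers(len(X))                      # single-sample (SGD-style) gradient
    g = 2.0 * X[i] * (X[i] @ w - y[i])
    w, m, v = adam_step(w, g, m, v, t)

print(w)                                          # should end up close to [1.0, -2.0, 0.5]
```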
### Visualization

To build deeper intuition for how different optimization methods behave, you can explore an excellent interactive tool: [Gradient Descent Viz](https://github.com/lilipads/gradient_descent_viz/tree/master?tab=readme-ov-file). This desktop application visualizes several popular optimization algorithms, including Gradient Descent, Momentum, AdaGrad, RMSProp, and Adam. You can experiment with learning rates, momentum values, and other hyperparameters to see how they affect convergence and path trajectories. Try it yourself!

In addition to the interactive app, the following two GIFs offer helpful visual comparisons that highlight how various optimizers behave side by side in different optimization scenarios.

Source: [Complete Guide to Adam Optimization](https://medium.com/@LayanSA/complete-guide-to-adam-optimization-1e5f29532c3d) (Medium)

In the first GIF, we can clearly observe that Adam and RMSProp converge at a similar and notably faster rate than the others, quickly navigating toward the minimum. In contrast, AdaGrad struggles to converge, reflecting its tendency to decay learning rates too aggressively over time.

Source: [Complete Guide to Adam Optimization](https://medium.com/@LayanSA/complete-guide-to-adam-optimization-1e5f29532c3d) (Medium)

In the second GIF, we can observe that SGD, AdaGrad, and RMSProp all take a similar path, but AdaGrad and RMSProp are clearly faster, which shows the advantage of adaptive learning rates. Additionally, we can see that momentum initially explores a wider region of the loss surface before taking a direct path to the minimum; this allows it to escape early traps like saddle points.

**Sources:**

* [Optimization Slides](https://www.cs.cornell.edu/courses/cs4782/2025sp/slides/pdf/week2_1_slides_complete.pdf) (CS 4/5782: Deep Learning, Spring 2025)
* [Complete Guide to Adam Optimization](https://medium.com/@LayanSA/complete-guide-to-adam-optimization-1e5f29532c3d) (Medium)
* [ML | Stochastic Gradient Descent (SGD)](https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/) (GeeksforGeeks)