Cover: "Cheese Rolling on Cooper's Hill" by Charles March Gere, 1948, from the Museum of Gloucester Collection. The earliest recorded mention of cheese rolling is a message to the town crier of Gloucester in 1826; even then the event was regarded as an old tradition, believed to date back at least six centuries. Source: http://www.visitgloucester.co.uk
Every saint helps downwards.
If there’s a cheese rolling downhill,
there’s a Mineiro(*) chasing after it.
Brazilian Popular Sayings
(*)Mineiro = person from Minas Gerais (a Brazilian state).
1. Introduction
Stochastic Gradient Descent (SGD), in tandem with Backpropagation, is a foundational cornerstone among the optimization algorithms used to train neural networks in machine learning and deep learning. This paper's primary objective is to provide a comprehensive elucidation of the core principles that underlie SGD, with a specific focus on Gradient Descent (GD). It aims to illuminate the fundamental concepts, inherent advantages, and the remarkable effectiveness of GD in overcoming significant challenges. Notably, while the conceptual parallels with physics are readily apparent, including analogies like a ball rolling downhill and the principles of least action and Feynman integrals, it is equally important to recognize that the mathematical foundations of gradient descent resonate within the realm of data science techniques. This mathematical congruence extends to various data science methodologies, such as polynomial interpolation, least-squares curve fitting, data visualization, and more. Consequently, the synergy between the mathematical underpinnings of gradient descent and its applications in data science underscores the versatility and wide-ranging applicability of these optimization techniques across diverse fields.
2. What is Gradient Descent?
To explore the fundamental principles that form the basis of Stochastic Gradient Descent (SGD), it’s beneficial to begin by introducing Gradient Descent (GD). By drawing comparisons with a simple physical analogy involving a ball’s descent down a hill and the concept of traversing a curve towards its function’s minimum, we can establish an approachable foundation. These analogies offer an accessible starting point for grasping the mechanics of Gradient Descent (GD), which serves as the foundation for SGD.
2.1. A Ball (or Cheese) Rolling Down a Hill
Imagine you’re watching a round ball perched on top of a landscape filled with hills and valleys, much like the famous Gloucester cheese that rolls down Cooper’s Hill.
You know how things are pulled down due to gravity. Well, in this case, gravity gives the ball (or a cheese) a little push, making it roll down the steepest slope. Now, think about this as if we’re trying to find the best path down a hill, kind of like solving a puzzle.
Now, imagine the ball starting to roll from a high point. It follows the shape of the land as it goes down. This is similar to how computer programs work to solve problems. They take small steps to improve, like the ball adjusting its position while rolling down.
Just like the ball is guided by gravity and the shape of the land, these computer techniques use something called “gradients” to guide them. Think of gradients as markers that show the best way to go down the hill in a computer problem. This comparison helps us understand that both solving problems and the ball’s journey are trying to find the best way to reach a solution.
Let’s dive a bit deeper into this. Imagine teaching a computer program to recognize things in pictures. Using these techniques can really help the program get better. Think of the program’s settings like a hilly landscape, and the “cost function” as a map that tells us how much the program’s guesses differ from reality. When we use these techniques, the program gradually adjusts its settings, similar to how the ball fine-tunes its path while going downhill. This helps the program work better and make more accurate guesses.
Here’s a cool part: when the ball starts at the top of the hill, it has energy. As it goes down, that energy changes into speed, and some is lost to friction. This is similar to how these techniques work: they move from bigger problems (large “cost function”) to smaller ones (smallest “cost function”), just like the ball goes from a higher point to a lower one. This shows how ideas from physics and solving problems can connect in a really interesting way.
Now imagine you’re on a quest to locate the lowest point in a landscape full of hills and valleys. Even though you can’t directly see the whole landscape, you can sense which way is steeper as you walk. Now, think about how computers learn to improve things. They also want to find the best “lowest point” in a virtual landscape (the smallest “cost function”) and they use a similar idea, the “Gradient Descent.”
In this digital landscape, instead of actual hills and valleys, we have mathematical functions that describe different shapes (a curve for a simple one-variable function, a surface for a two-variable function, or a hypersurface in higher dimensions). Similar to how you sense steepness while walking, these functions have a property called “slope” or “gradient” that indicates the direction they’re going up or down. Gradient descent is akin to a clever hiker who starts at a random spot on the landscape and takes small steps in the direction that makes the slope decrease the fastest.
However, there’s a twist: rather than using hiking boots, we employ a clever mathematical trick to determine the steepest direction. We follow the slope downward and adjust our position to get closer and closer to the lowest point. This very concept is used in machine learning, where computers “walk” down these mathematical hills to enhance tasks like predicting images or comprehending text. It’s like adopting nature’s way of guiding us down actual hills, but translated into numbers and computers.
2.2. The Math Trick of Walking the Steepest Direction
Let’s delve into a more technical and informative description of the parallel between the mathematical process of numerically walking towards the minimum of a function and the commonly used gradient descent in machine learning.
Consider the simple function y = x^2. We aim to move along the curve towards the minimum using gradient descent. First, we calculate the gradient (first derivative for this one-variable function): y’ = dy/dx = 2x. Since the gradient indicates the direction of the function’s steepest increase, it’s positive for x > 0 and negative for x < 0. Clearly, there’s a critical point at x = 0, where the positive second derivative suggests a minimum.
For the time being, let’s temporarily set aside the lessons involving differential and integral calculus, and instead, concentrate on a numerical example. This approach tends to be more instructive and accessible to a wide range of readers.
We’ll illustrate how an arbitrary point on the curve can be brought to the function’s minimum using the gradient descent technique (the function y can represent the cost function to be minimized in machine learning).
In the context of gradient descent, it’s important to remember that we utilize the negative gradient, which corresponds to the direction of steepest descent for the function. Therefore, considering the given curve or function, the calculation -y’ = -2x becomes necessary.
We’ll start with x1 = 4 (which yields y1 = 16). To progress, we’ll update x1 by adding the negative gradient at this particular point, -y'(x1), multiplied by the step size LR, the learning rate. This update of the x value is iterated N times: x2 = x1 - y'(x1)*LR, x3 = x2 - y'(x2)*LR, …, xN = xN-1 - y'(xN-1)*LR.
Executing this sequence for a total of N iterations leads the point to converge towards the origin, representing the steepest descent and progressing contrary to the gradient’s ascent. Both LR and N function as hyperparameters in this context. This procedure can be executed manually or facilitated through an Excel spreadsheet.
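The iteration described above fits in a few lines of Python (a minimal sketch; the starting point x1 = 4 follows the text, while the LR and N values shown are illustrative choices):

```python
# Gradient descent on y = x**2, whose gradient is y'(x) = 2x.
# LR (learning rate) and N (number of iterations) are the two hyperparameters.
def gradient_descent(x_start=4.0, lr=0.1, n=25):
    x = x_start
    for _ in range(n):
        x = x - lr * 2 * x  # x_new = x_old - y'(x_old) * LR
    return x

x_final = gradient_descent()
print(x_final)  # drifts toward 0, the minimum of y = x**2
```

Each step multiplies x by the factor (1 - 2*LR), so for LR between 0 and 1 the point contracts geometrically towards the origin, exactly the convergence the spreadsheet exercise shows.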
As an alternative approach, you can also ask ChatGPT to generate Python code on your behalf. The generated code can be conveniently copied and pasted into JupyterLite for execution. The output presented below demonstrates the iterative progression towards the minimum point, analogous to a ball (or cheese) smoothly descending a hill towards the valley’s bottom.
An iterative path of a ‘ball’ moving along a quadratic curve (y = x^2) is depicted using the Gradient Descent technique. The trajectory is marked by red dots, and blue arrows positioned along the x-axis indicate the direction and magnitude of the negative gradient. Connecting the dots to the arrows are gray dashed lines, emphasizing the notion of pursuing the steepest descent. With each iteration, the ‘ball’ progressively approaches the curve’s minimum, effectively demonstrating the fundamental principle of Gradient Descent (Code 1 in the Appendix).
By introducing alterations to this code, it becomes feasible to expand its capabilities and generate a comprehensive 3D plot.
This visualization showcases the essence of gradient descent optimization. On the left, a 3D plot exhibits a parabolic “hill” representing a mathematical function, with red markers depicting the iterative movement of a ball symbolizing optimization progress. Gray contour lines help visualize the function’s contours on the ground. On the right, a colormap contour plot offers a 2D view of the same function, with a red ball’s path illustrating gradient descent steps. The arrows signify the direction of steepest descent (negative gradient), while the color intensity corresponds to function values. This visual representation highlights how gradient descent method navigates the landscape of the function to find its lowest point, aiding in understanding optimization techniques in machine learning and mathematics (Code 2 in the Appendix).
Importantly, within the realm of machine learning, gradient descent emerges as a pivotal optimization technique employed to minimize the cost function, which quantifies the disparity between model predictions and actual observations. This minimization process transpires within the domains of weight and bias, critical parameters that shape the neural network model’s behavior. The learning rate, a significant hyperparameter that dictates the step size in each iteration of gradient descent, plays an essential role. This rate determines how swiftly the model approaches the optimal solution.
3. Linear Regression and Polynomial Interpolation: a few words
In the realm of data analysis, two pivotal techniques, linear regression and polynomial interpolation, take center stage. It’s worth dedicating a few words to distinguish these concepts, especially for beginners who might find them initially confusing.
Linear regression (and its non-linear counterpart) holds the primary objective of finding an optimal-fitting line, curve, or even a surface that minimizes the difference between predicted and actual values. This approach finds relevance across various scenarios, including those where inputs and outputs are labeled within a supervised-trained neural network. It’s akin to discovering the best-suited way to connect the dots, revealing trends and relationships inherent in the data.
Parallel to linear regression, polynomial interpolation emerges as a versatile strategy that grapples with the intricacies of data representation. Through matrix-based techniques like Gauss-Jordan elimination, polynomial interpolation steps in to bridge gaps within incomplete data points by utilizing polynomial equations. This bridging process is crucial in preparing data for consumption, say, by a neural network. While linear regression hones in on optimal linear fits, polynomial interpolation takes a step further. It accommodates polynomial curves of varying degrees, weaving through data points in a more nuanced manner. The Gauss-Jordan technique of reducing a matrix to row-echelon form through elementary row operations, fundamental to this process, empowers us to accurately decipher polynomial coefficients, granting a keen understanding of trends and associations woven within the data fabric.
The synergy between linear regression and polynomial interpolation equips analysts with potent tools for untangling patterns and projecting trends across an array of diverse datasets.
In our next discussion, we’ll focus on comparing linear regression’s gradient descent method with the least squares linear fit. While both techniques have distinct roles, we’ll set aside the exploration of polynomial interpolation for now. For a deeper dive into polynomial interpolation, you can turn to the Appendix.
4. Linear Regression: Least Squares versus Gradient Descent
Both least squares linear regression and gradient descent are crucial components in the realm of machine learning and statistics, sharing a central objective: determining optimal parameters for a linear model that fits provided data with precision. Let’s delve into a comprehensive comparison of these two techniques.
In the case of the least squares fit (as discussed in our article Curve Fitting: A Comprehensive Introduction for Machine Learning), the main objective is to minimize the sum of squared differences between observed data points and the predicted values projected onto the regression line. This process involves, for a linear fit, determining the parameters (slope and intercept) that result in the smallest sum of squared errors.
On the other hand, gradient descent operates as an optimization algorithm that seeks to locate the minimum of a cost function through iterative parameter adjustments. In the context of linear regression, this cost function frequently aligns with the mean squared error (MSE), which is parallel to the core objective of least squares linear regression.
When considering initialization and iteration, least squares linear regression employs a direct formula to efficiently compute the optimal parameters (slope and intercept) that minimize the sum of squared errors. This approach benefits from streamlined computation through the use of matrix operations. For non-linear least-squares fits, a system of equations, usually in matrix form, must typically be solved iteratively.
In contrast, gradient descent for linear regression starts with initial parameter values and progressively hones them in a direction that reduces the cost function. The magnitude of updates is proportional to the negative gradient of the cost function, guiding the algorithm toward optimal parameter values.
This comparison becomes even clearer when we look at gradient descent as an iterative approach. The direct formula of least squares linear regression can be viewed as a one-step optimization process that directly calculates the optimal parameters, providing a closed-form solution with precision.
Conversely, gradient descent unfolds as an iterative methodology. With each iteration, it updates parameters through small steps in the direction of steepest descent. These incremental adjustments continuously decrease the cost function, ultimately leading to convergence and optimal parameter values.
The colormap illustrates the cost function landscape during the Gradient Descent process for linear regression. The black trajectory represents the path of parameter updates (A and B) as the algorithm iteratively minimizes the cost function. The colormap intensity indicates the cost values at different parameter combinations, showcasing the gradual descent towards the optimal parameter values that best fit the given data points (Code 5 in the Appendix).
Gradient descent introduces the concept of a learning rate, a hyperparameter that dictates the step size taken in each iteration. The learning rate significantly impacts the speed of convergence and the overall stability of the algorithm.
In terms of learning rates, least squares linear regression does not involve this concept. Optimal parameters are directly calculated from the provided data points.
To illustrate, consider having a dataset that exhibits a linear pattern. Both least squares linear regression and gradient descent can be employed to determine the best-fitting line.
In the least squares approach, computations directly yield the slope and intercept that minimize the sum of squared errors.
Conversely, the gradient descent approach begins with parameter estimates and iteratively refines them. In each iteration, the gradient of the cost function relative to the parameters is computed, guiding the parameter updates.
While both techniques aim for the ideal-fitting line, their optimization strategies differ. Least squares provides a precise one-step solution, while gradient descent gradually hones parameters over multiple iterations.
In essence, the overarching goal of both least squares linear regression and the gradient descent method is to identify optimal parameters for linear models. While least squares offers a definitive solution, gradient descent provides a versatile iterative optimization path that extends beyond linear regression.
When choosing between Least Squares Linear Regression and Gradient Descent for model optimization, it’s crucial to weigh their advantages and disadvantages, especially in the context of dataset size and computational resources. Least Squares Linear Regression shines when dealing with smaller datasets, as it can efficiently and accurately calculate optimal parameters. Moreover, it boasts a lower computation cost overall, making it a suitable choice for small to medium-sized datasets. On the other hand, Gradient Descent is a valuable contender, particularly when dealing with substantial datasets. While it can work reasonably well with small datasets, its true potential becomes evident when handling large datasets, where the closed-form solution becomes computationally expensive and memory-intensive. However, it’s worth noting that Gradient Descent comes with a higher computation cost due to its iterative nature. Still, its scalability makes it a robust choice for optimizing models on larger scales. Thus, the decision between these methods ultimately hinges on the dataset’s size and the available computational resources.
In conclusion, while both least squares linear regression and gradient descent pursue the same goal of finding optimal parameters, their advantages and disadvantages become apparent in different contexts. Least squares is efficient for small datasets and offers a closed-form solution, while gradient descent excels for larger datasets, trading off computation cost for scalability. The choice between these methods depends on the dataset size and computational resources available. For a more detailed discussion and access to the Python code examples, please refer to the Appendix section.
5. Stochastic Gradient Descent: Addressing Large-Scale Challenges
The emergence of Stochastic (from Ancient Greek στόχος (stókhos) ‘aim, guess’) Gradient Descent (SGD) provides a solution to the computational limitations of traditional GD in very large-scale scenarios. SGD introduces an element of randomness into the gradient computation. Instead of calculating the gradient using the complete dataset, SGD works with smaller, randomly sampled subsets called “mini-batches” during each optimization iteration. This modification reduces computational intensity, making SGD viable for large-scale optimization tasks.
The stochastic nature of SGD allows it to approximate the true gradient using mini-batch sampling. This controlled randomness in parameter updates enables dynamic navigation of the complex parameter space, leading to faster convergence and enhanced generalization. These traits position SGD as a pivotal optimization technique in the evolving landscape of machine learning.
A modified code introduces the concept of Stochastic Gradient Descent (SGD) and offers a closer look at the vicinity of the cost function’s minimum. It’s important to observe the distinct wandering pattern of the parameter points that is characteristic of the stochastic approach. By reducing the number of iterations and iterating over random mini-batches of data, Stochastic Gradient Descent aims to accelerate convergence (modified from Code 5 in the Appendix).
One of the advantages of Stochastic Gradient Descent over the standard Gradient Descent lies in its efficiency and suitability for machine learning tasks. In standard Gradient Descent, the entire dataset is utilized to compute the gradient and update the model parameters (weights and biases of the deep neural network) in each iteration. However, in large-scale machine learning scenarios, this can be computationally expensive and time-consuming. Stochastic Gradient Descent addresses this challenge by updating parameters using only a subset or a single data point in each iteration, making the process faster and often leading to quicker convergence. Additionally, SGD enhances model generalization by introducing noise during updates and improving adaptability to unseen data. Its stochastic nature also helps in escaping sharp and narrow minima in the loss (cost function) landscape, resulting in more robust and less overfit-prone solutions. These advantages position SGD as a valuable optimization technique, especially in the context of training deep neural networks on large and complex datasets.
One of the most significant benefits of Stochastic Gradient Descent (SGD) is its remarkable ability to achieve faster convergence. Especially when dealing with extensive datasets, computing the full gradient can be computationally prohibitive. SGD addresses this by estimating the gradient using mini-batches, leading to frequent parameter updates and accelerated convergence. This property is particularly advantageous in training complex models, such as convolutional neural networks for image classification, where rapid progress is essential for accurate predictions. Furthermore, let’s explore some additional key considerations regarding Stochastic Gradient Descent (SGD).
Improved Generalization: Battling Overfitting
SGD’s introduction of controlled noise through stochastic updates enhances model generalization. Overfitting, where models memorize noise in the training data, is a significant concern. SGD’s stochastic updates help the model focus on essential patterns, preventing overfitting and improving performance on new, unseen data. This advantage is pivotal when training complex models like natural language processing systems, ensuring robustness in diverse scenarios.
Escaping Sharp Minima: Navigating Complexity
Stochastic Gradient Descent’s stochastic nature empowers it to navigate away from sharp, narrow minima in the loss landscape. This unique property prevents convergence to overly sensitive configurations, ensuring solutions are robust and less prone to perturbations. For instance, in training generative adversarial networks for image generation, SGD’s ability to escape sharp and local minima ensures diverse and high-quality image outputs.
Challenges and Solutions
Stochastic Gradient Descent (SGD) offers significant advantages in optimization but is not without its challenges. Two primary hurdles include selecting an appropriate learning rate, where striking a balance between convergence speed and stability is crucial. Various techniques like learning rate schedules and adaptive algorithms such as RMSProp (Root Mean Square Propagation is a method invented by Geoffrey Hinton in 2012 in which the learning rate is adapted for each of the parameters of a NN) and Adam (Adaptive Moment Estimation, 2014, a modification of the RMSProp optimizer) help address this. Additionally, SGD’s stochastic nature may introduce too much noise into the optimization process, which can impede progress. Strategies like momentum, variance reduction, and adaptive methods play a role in stabilizing optimization under these noisy conditions. Overcoming these challenges ensures SGD’s robustness and effectiveness in optimizing diverse models and datasets.
6. Parallels with Modern Physics Ideas
The connections between optimization strategies and fundamental physics concepts are fascinating. Concepts such as Feynman’s integrals and the least action principle find parallels with the principles of Stochastic Gradient Descent (SGD). Just as optimization seeks optimal states, physics seeks states of least action. This confluence underscores the unity between exploration and minimization processes in both computational optimization and the tangible universe.
In physics, the least action principle states that the path taken by a system between two points in its configuration space is the one for which the action is minimized. Similarly, in optimization, SGD seeks to find the optimal state (in parameters space) of a system by iteratively minimizing an objective function. The parallels between these two concepts highlight the deep connections between the fields of physics and optimization.
Feynman’s integrals, which are used to calculate probability amplitudes in quantum mechanics by summing contributions over every possible path a system could take, also have connections to optimization. This sum over paths is loosely similar to the way that SGD explores the space of possible solutions to an optimization problem, iteratively updating its estimate of the optimal solution based on the gradient of the objective function.
These connections between optimization strategies and fundamental physics concepts are intriguing and highlight the deep unity between these two fields. By exploring these connections further, we may gain new insights into both computational optimization and our understanding of the physical universe. This confluence is sure to be attractive to physicists and others interested in exploring the connections between these two fascinating fields.
7. Conclusions
Stochastic Gradient Descent (SGD) stands as a transformative optimization algorithm within machine learning, and its influence resonates across diverse applications. It excels in efficiently optimizing models, whether dealing with vast high-dimensional datasets, linear regression, or training complex deep neural networks. As machine learning advances, SGD continues to provide a reliable foundation for adapting to emerging complexities and driving progress in the field. SGD’s versatility, effectiveness, and scalability make it indispensable, underpinning the quest to build intelligent systems that impact diverse domains. Its contributions are indisputable, and ongoing research ensures that SGD remains a robust and relevant tool, further enriching the landscape of machine learning optimization.
References
https://en.wikipedia.org/wiki/Stochastic_gradient_descent
Several authoritative references up to 2021 provide foundational insights into the concepts and limitations of Gradient Descent and the emergence of Stochastic Gradient Descent as a solution for large-scale optimization in machine learning:
- L. Bottou, “Large-Scale Machine Learning with Stochastic Gradient Descent,” Proceedings of COMPSTAT’2010, 2010. DOI: 10.1007/978-3-7908-2604-3_1 This seminal paper by L. Bottou meticulously examines the challenges posed by scaling traditional Gradient Descent to accommodate large datasets. It introduces Stochastic Gradient Descent as a compelling alternative for large-scale machine learning. The insights from this work have had a profound influence on shaping the direction of scalable optimization methods in the machine learning community.
- Y. LeCun, L. Bottou, G. Orr, and K. Muller, “Efficient BackProp,” Neural Networks: Tricks of the Trade, Springer, 1998. DOI: 10.1007/978-3-642-35289-8_30 The book “Neural Networks: Tricks of the Trade” authored by Y. LeCun et al. holds a pivotal position in the machine learning literature. It highlights the efficiency enhancements brought by Stochastic Gradient Descent, emphasizing its suitability for training neural networksāa cornerstone of modern machine learning. This reference demonstrates how SGD plays a fundamental role in enabling the training of complex models.
- S. Shalev-Shwartz and S. Ben-David, “Understanding Machine Learning: From Theory to Algorithms,” Cambridge University Press, 2014. ISBN: 978-1107057135 The book “Understanding Machine Learning: From Theory to Algorithms” authored by S. Shalev-Shwartz and S. Ben-David provides essential knowledge on optimization in the context of machine learning. It introduces Stochastic Gradient Descent as a critical technique for addressing the challenges of large-scale optimization. This foundational reference offers comprehensive insights into the theoretical underpinnings of machine learning algorithms, including the role of SGD.
- B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent,” Advances in Neural Information Processing Systems (NIPS), 2011. This groundbreaking NIPS paper introduces the “Hogwild!” algorithm, demonstrating how to parallelize Stochastic Gradient Descent efficiently in a lock-free manner. This work has been instrumental in making SGD feasible for large-scale parallel processing, addressing one of the significant challenges in applying SGD to big data scenarios.
- L. Xiao and T. Zhang, “A Proximal Stochastic Gradient Descent Algorithm,” Proceedings of the 30th International Conference on Machine Learning (ICML), 2013. This ICML paper introduces a proximal version of Stochastic Gradient Descent, enhancing its convergence properties and making it more suitable for optimization tasks with non-smooth regularizers. This development extends the applicability of SGD to a broader class of machine learning problems.
- J. Duchi, E. Hazan, and Y. Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization,” Journal of Machine Learning Research (JMLR), 2011. This JMLR paper presents the AdaGrad algorithm, an adaptive learning rate method that addresses the learning rate selection challenge in SGD. AdaGrad’s ability to adjust learning rates based on the historical gradients has significantly impacted the effectiveness of SGD in practical applications.
- D. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” International Conference on Learning Representations (ICLR), 2015. The “Adam” optimization algorithm introduced in this ICLR paper has become one of the most widely used adaptive learning rate methods for SGD. It combines the benefits of both momentum and adaptive learning rates, resulting in a robust and efficient optimization algorithm that has been instrumental in training deep neural networks.
- L. Jiang, Z. Luo, A. G. Dimakis, and I. Dhillon, “Duality-Gap Based Optimal Stochastic Block Coordinate Descent for Regularized Loss Minimization,” Proceedings of the 36th International Conference on Machine Learning (ICML), 2019. This ICML paper contributes to the theoretical understanding of Stochastic Gradient Descent by introducing the concept of duality-gap based optimization. This innovative approach offers insights into optimal strategies for stochastic block coordinate descent, enriching the theoretical foundations of SGD-based optimization.
- N. Srebro, “Optimization for Machine Learning,” Foundations and TrendsĀ® in Machine Learning, 2010. DOI: 10.1561/2200000006 This comprehensive survey paper provides a deep understanding of optimization techniques in the context of machine learning, including a thorough exploration of Stochastic Gradient Descent. It serves as a valuable resource for understanding the theoretical and practical aspects of optimization methods used in modern machine learning.
- M. R. Garey and D. S. Johnson, “Computers and Intractability: A Guide to the Theory of NP-Completeness,” W. H. Freeman, 1979. ISBN: 0716710455 Although published before 2021, this classic reference remains relevant due to its foundational impact. It introduces the concept of NP-completeness, emphasizing the computational challenges inherent in optimization problems. While not specific to SGD, this work underscores the broader context of optimization complexity in computer science.
Appendix
Code 1: Gradient Descent for y = x^2: 2D Plot
Here we work with the quadratic function y(x) = x^2, a symmetric parabola whose gradient at any point x is 2 * x, the slope of the tangent line. The iteration starts from an initial ball position (x_start = 5.0); learning_rate controls the step size and num_iterations sets the number of steps. In each iteration the gradient is computed and the ball's position (x_guess) is updated by subtracting the gradient scaled by the learning rate, simulating a descent along the curve. The successive values of x_guess and the corresponding y values are logged in x_trajectory and y_trajectory so the ball's path can be traced. The plot shows the quadratic function y(x) = x^2, the ball's trajectory as red dots, and the gradient vectors as blue arrows; gray dashed lines connect the dots to the arrows on the x-axis, illustrating the descent. Labels, a title, and a legend are added for clarity.
import numpy as np
import matplotlib.pyplot as plt

# Define the quadratic function y(x)
def y(x):
    return x**2

# Define the gradient of the function
def gradient(x):
    return 2 * x

# Initial ball position
x_start = 5.0
x_guess = x_start
learning_rate = 0.1
num_iterations = 20

# Lists to store the trajectory of x and y values
x_trajectory = []
y_trajectory = []

# Perform gradient descent
for i in range(num_iterations):
    # Calculate the gradient at the current x
    grad = gradient(x_guess)
    # Update x using the gradient and learning rate
    x_guess -= learning_rate * grad
    # Calculate the corresponding y value
    y_val = y(x_guess)
    # Store the x and y values in the trajectory lists
    x_trajectory.append(x_guess)
    y_trajectory.append(y_val)

# Plot the function, the gradient descent trajectory,
# and lines connecting red dots to arrows
x_vals = np.linspace(-5, 5, 100)
plt.plot(x_vals, y(x_vals), label='y(x) = x^2')
plt.scatter(x_trajectory, y_trajectory, color='red', label='Gradient Descent')
plt.scatter([-x for x in x_trajectory], y_trajectory, color='red')

# Plot positive gradient arrows (inverted direction) and lines
for i in range(len(x_trajectory)):
    arrow_length = -0.5 * learning_rate * gradient(x_trajectory[i])
    plt.arrow(x_trajectory[i], 0, arrow_length, 0,
              head_width=0.15, head_length=0.2, fc='blue', ec='blue')
    plt.plot([x_trajectory[i], x_trajectory[i]], [y_trajectory[i], 0],
             '--', color='gray')

# Plot negative gradient arrows and lines
for i in range(len(x_trajectory)):
    arrow_length = -0.5 * learning_rate * gradient(-x_trajectory[i])
    plt.arrow(-x_trajectory[i], 0, arrow_length, 0,
              head_width=0.15, head_length=0.2, fc='blue', ec='blue')
    plt.plot([-x_trajectory[i], -x_trajectory[i]], [y_trajectory[i], 0],
             '--', color='gray')

plt.xlabel('x')
plt.ylabel('y')
plt.title('Symmetric Gradient Descent in 2D')
plt.legend()
plt.grid(True)
plt.show()
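For this particular function the update has a closed form: x_new = x - learning_rate * 2x = (1 - 2 * learning_rate) * x, so after k steps the ball sits at x_start * (1 - 2 * learning_rate)^k. The short, plot-free sketch below (the variable names mirror the listing; the closed-form comparison is our addition, not part of the original code) confirms that the loop contracts x by the factor 0.8 at every step:

```python
# Plot-free check: gradient descent on y = x**2 is a geometric contraction.
def gradient(x):
    return 2 * x  # derivative of y = x**2

x_guess = 5.0          # same starting point as the plotting code
learning_rate = 0.1
num_iterations = 20

for _ in range(num_iterations):
    x_guess -= learning_rate * gradient(x_guess)

# Closed form: x_start * (1 - 2 * learning_rate)**k
closed_form = 5.0 * (1 - 2 * 0.1) ** 20
print(x_guess, closed_form)
```

After 20 steps the ball is at roughly 0.058, already close to the minimum at x = 0; shrinking the learning rate slows this geometric convergence, while a rate above 0.5 would make the factor exceed 1 in magnitude and the iteration diverge.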
Code 2: Gradient Descent for z = x^2 + y^2: 3D Plot
With a few alterations, the previous code can be extended to a full 3D visualization of the function z = x^2 + y^2, using Matplotlib's Axes3D module. The extended code draws two plots side by side. The left plot shows a 3D mountain-like surface whose height corresponds to the function's output, with contour lines projected onto the ground plane to convey the function's shape; a red ball placed on the surface moves down the slope as gradient descent unfolds. The right plot is a 2D heatmap whose colors represent function values; it shows the same contour lines, and a red ball again moves according to gradient descent. Together the two views illustrate, visually and intuitively, how gradient descent finds the lowest point of a surface by iteratively adjusting the ball's position.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Define the hill function and its gradient
def hill(x, y):
    return x**2 + y**2

# Generate grid for the plot
x_vals = np.linspace(-5, 5, 100)
y_vals = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x_vals, y_vals)
Z = hill(X, Y)

# Initialize the figure
fig = plt.figure(figsize=(14, 6))

# Create the 3D subplot for the hill plot (left side)
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(X, Y, Z, cmap='viridis', alpha=0.7)
ax1.set_xlabel('X')
ax1.set_ylabel('Y')
ax1.set_zlabel('Cost')
ax1.set_title('3D Plot of Hill Function')

# Plot the level curves on the xy-plane
ax1.contour(X, Y, Z, levels=np.linspace(np.min(Z), np.max(Z), 10),
            offset=np.min(Z) - 1, cmap='gray', alpha=0.5)

# Initial position of the ball
ball_pos = np.array([4.0, 4.0, hill(4.0, 4.0)])  # Include z-coordinate
learning_rate = 0.1
num_iterations = 20

# Perform gradient descent
for i in range(num_iterations):
    grad = np.array([2 * ball_pos[0], 2 * ball_pos[1]])
    ball_pos[:2] -= learning_rate * grad
    # Update the z-coordinate using the hill function
    ball_pos[2] = hill(*ball_pos[:2])
    # Plot the ball's current position on the surface
    ax1.scatter(ball_pos[0], ball_pos[1], ball_pos[2],
                color='red', marker='o', s=50)

# Create the contour subplot (right side)
ax2 = fig.add_subplot(122)
contour = ax2.contourf(X, Y, Z, levels=20, cmap='viridis')
ax2.set_xlim(0, 4)
ax2.set_ylim(0, 4)
ax2.set_xlabel('X')
ax2.set_ylabel('Y')
ax2.set_title('Contour Plot/Colormap of Hill Function')
fig.colorbar(contour)

# Initial position of the ball
ball_pos = np.array([4.0, 4.0])
learning_rate = 0.1
num_iterations = 20

# Perform gradient descent
for i in range(num_iterations):
    grad = np.array([2 * ball_pos[0], 2 * ball_pos[1]])
    ball_pos -= learning_rate * grad
    # Plot the ball's current position
    ax2.scatter(ball_pos[0], ball_pos[1], color='red', marker='o', s=50)
    # Plot gradient arrows
    ax2.quiver(ball_pos[0], ball_pos[1], -grad[0]/20, -grad[1]/20,
               color='black', angles='xy', scale_units='xy',
               scale=1, width=0.004)

# Show the plots
plt.tight_layout()
plt.show()
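Setting the plotting aside, the descent on z = x^2 + y^2 is just the two-dimensional analogue of the same contraction seen in Code 1: each coordinate is multiplied by (1 - 2 * learning_rate) = 0.8 per step. A minimal sketch (names mirror the listing; the closed-form check is our addition) shows the ball reaching the neighborhood of the minimum at the origin after the same 20 steps:

```python
import numpy as np

# Gradient descent on hill(x, y) = x**2 + y**2, without any plotting
def hill(x, y):
    return x**2 + y**2

ball_pos = np.array([4.0, 4.0])
learning_rate = 0.1

for _ in range(20):
    grad = np.array([2 * ball_pos[0], 2 * ball_pos[1]])
    ball_pos -= learning_rate * grad

# Each step multiplies both coordinates by (1 - 2 * learning_rate) = 0.8,
# so after 20 steps each coordinate is 4.0 * 0.8**20, about 0.046.
print(ball_pos, hill(*ball_pos))
```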
Code 3: Polynomial Interpolation in a Nutshell
The following code finds the equation of the line through the given points (-1, 15) and (12, 7) by setting up the corresponding 2×2 linear system and solving it with NumPy's linalg.solve (carrying out, in effect, the Gauss-Jordan elimination one would perform by hand).
import numpy as np
import matplotlib.pyplot as plt

# Given data points
data_points = [(-1, 15), (12, 7)]

# Extracting x and y values from the data points
x_values = [point[0] for point in data_points]
y_values = [point[1] for point in data_points]

# Create a matrix for the linear system
matrix = np.array([[1, x] for x in x_values])
constants = np.array(y_values)

# Solve the linear system using numpy's linalg.solve
coefficients = np.linalg.solve(matrix, constants)

# Define the equation of the line
def line_equation(x):
    return coefficients[0] + coefficients[1] * x

# Generate points for plotting the line
x_line = np.linspace(min(x_values), max(x_values), 100)
y_line = line_equation(x_line)

# Plot the data points and the line
plt.scatter(x_values, y_values, color='red', label='Data Points')
plt.plot(x_line, y_line, label='Fitted Line')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Line Fitting using Gauss-Jordan Elimination')
plt.legend()

# Adding the line equation to the plot
line_formula = f'y = {coefficients[1]:.2f}x + {coefficients[0]:.2f}'
plt.text(0, 10, line_formula, fontsize=12)

# Display the plot
plt.show()
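The same linear-system idea generalizes beyond two points: n points with distinct x-values determine a unique polynomial of degree n - 1, whose coefficients solve a Vandermonde system. A brief sketch (the three sample points here are our own, chosen for illustration and not taken from the article):

```python
import numpy as np

# Three points determine a unique quadratic; np.vander builds the
# system [[1, x, x**2], ...] whose solution is the coefficient vector.
points = [(-1.0, 6.0), (0.0, 1.0), (2.0, 3.0)]
xs = np.array([p[0] for p in points])
ys = np.array([p[1] for p in points])

V = np.vander(xs, 3, increasing=True)   # columns: 1, x, x**2
coeffs = np.linalg.solve(V, ys)         # c0 + c1*x + c2*x**2

# The interpolant reproduces every data point exactly
for x, y in points:
    assert abs(coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 - y) < 1e-9
print(coeffs)  # the quadratic 1 - 3x + 2x**2 passes through all three points
```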
Code 4: Gradient Descent Linear Regression in a Nutshell
Here, we’ll delve into the mathematical foundation of Gradient Descent for Linear Regression. This will be followed by a detailed Python code example, including a numeric dataset, the code’s execution, and the generated output. Gradient Descent is an iterative optimization technique commonly used to minimize a function, in this case, the error function in Linear Regression. For each iteration, the coefficients A and B are updated based on the gradient of the error function with respect to these coefficients. Below is a Python code snippet that demonstrates the application of Gradient Descent for Linear Regression. The code includes comments for clarity and explanation.
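Concretely, writing the error function as the sum of squared residuals over the n data points, the gradient components and update rules (with the learning rate denoted alpha in the code) are:

```latex
E(A, B) = \sum_{i=1}^{n} \bigl(y_i - (A x_i + B)\bigr)^2
```

```latex
\frac{\partial E}{\partial A} = -2 \sum_{i=1}^{n} x_i \bigl(y_i - (A x_i + B)\bigr),
\qquad
\frac{\partial E}{\partial B} = -2 \sum_{i=1}^{n} \bigl(y_i - (A x_i + B)\bigr)
```

```latex
A \leftarrow A - \alpha \, \frac{\partial E}{\partial A},
\qquad
B \leftarrow B - \alpha \, \frac{\partial E}{\partial B}
```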
import numpy as np
import matplotlib.pyplot as plt

# Given data points
data = np.array([[1, 3], [2, 5], [3, 8], [4, 9], [5, 12]])

# Extract x and y values
x = data[:, 0]
y = data[:, 1]

# Initialize coefficients A and B
A = 0.0
B = 0.0

# Learning rate
alpha = 0.01

# Number of iterations
iterations = 100

# Gradient Descent
for _ in range(iterations):
    # Calculate predicted values
    y_pred = A * x + B
    # Calculate partial derivatives of the error
    dA = -2 * np.sum(x * (y - y_pred))
    dB = -2 * np.sum(y - y_pred)
    # Update coefficients
    A -= alpha * dA
    B -= alpha * dB

# Output the coefficients
print("Coefficient A:", A)
print("Coefficient B:", B)

# Plotting
plt.scatter(x, y, label='Data Points')
plt.plot(x, A * x + B, color='red', label='Fitted Line')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
For the dataset (1,3), (2,5), (3,8), (4,9), (5,12), the code iteratively updates the coefficients A and B to fit the best linear line through the data points. The output displays the final values of A and B and a plot of the data points together with the fitted line.
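As a sanity check (our addition, not part of the original article), the gradient-descent estimates can be compared against the closed-form least-squares solution, which for this dataset works out to A = 2.2 and B = 0.8:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 8.0, 9.0, 12.0])

# Closed-form least-squares slope and intercept:
# A = (n*Sxy - Sx*Sy) / (n*Sxx - Sx**2), B = (Sy - A*Sx) / n
n = len(x)
Sx, Sy = x.sum(), y.sum()
Sxy, Sxx = (x * y).sum(), (x * x).sum()
A_exact = (n * Sxy - Sx * Sy) / (n * Sxx - Sx**2)
B_exact = (Sy - A_exact * Sx) / n

print(A_exact, B_exact)  # 2.2 and 0.8 for this dataset
```

After 100 iterations the gradient-descent values of A and B should be close to these targets; running the loop longer brings them arbitrarily close.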
Code 5: Visualization of GD Parameter Trajectory Toward Minimizing the Cost Function
This Python code employs Gradient Descent to iteratively refine the coefficients A and B, minimizing the cost function associated with the fit, and adds a colormap so that the trajectory of the parameters toward the minimum of the cost function can be visualized.
import numpy as np
import matplotlib.pyplot as plt

# Given data points
data = np.array([[1, 3], [2, 5], [3, 8], [4, 9], [5, 12]])

# Extract x and y values
x = data[:, 0]
y = data[:, 1]

# Initialize coefficients A and B
A = 0.0
B = 0.0

# Learning rate
alpha = 0.01

# Number of iterations
iterations = 100

# Arrays to store parameter values and cost function values
param_A = []
param_B = []
cost_values = []

# Gradient Descent
for _ in range(iterations):
    # Calculate predicted values
    y_pred = A * x + B
    # Calculate cost function
    cost = np.sum((y_pred - y)**2) / (2 * len(y))
    # Store parameter values and cost function values
    param_A.append(A)
    param_B.append(B)
    cost_values.append(cost)
    # Calculate partial derivatives of the error
    dA = -2 * np.sum(x * (y - y_pred)) / len(y)
    dB = -2 * np.sum(y - y_pred) / len(y)
    # Update coefficients
    A -= alpha * dA
    B -= alpha * dB

# Create a meshgrid for parameter values
param_A_vals = np.linspace(A - 2, A + 2, 100)
param_B_vals = np.linspace(B - 2, B + 2, 100)
param_A_grid, param_B_grid = np.meshgrid(param_A_vals, param_B_vals)

# Calculate cost values for the grid
cost_grid = np.zeros_like(param_A_grid)
for i in range(len(param_A_vals)):
    for j in range(len(param_B_vals)):
        y_pred = param_A_vals[i] * x + param_B_vals[j]
        cost_grid[j, i] = np.sum((y_pred - y)**2) / (2 * len(y))

# Plotting the colormap
plt.figure()
plt.contourf(param_A_grid, param_B_grid, cost_grid, levels=20, cmap='viridis')
plt.colorbar(label='Cost Function')
plt.plot(param_A, param_B, c='black', marker='o')
plt.xlabel('Parameter A')
plt.ylabel('Parameter B')
plt.title('Cost Function Minimization using Gradient Descent')
plt.show()
The colormap output showcases cost variations for different parameter combinations, while the black trajectory tracks A and B updates across iterations. This representation enhances comprehension of cost function minimization and parameter tuning in action.
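The double loop that fills the cost grid can also be written with NumPy broadcasting, which is both shorter and faster for fine grids. A sketch under the same five-point dataset (the grid ranges here are our own illustrative choice) that computes an identical grid in one vectorized expression:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 8.0, 9.0, 12.0])

A_vals = np.linspace(0.0, 4.0, 50)
B_vals = np.linspace(-2.0, 2.0, 50)
A_grid, B_grid = np.meshgrid(A_vals, B_vals)

# Broadcast: predictions have shape (50, 50, 5); summing squared
# residuals over the last axis yields the (50, 50) cost surface.
pred = A_grid[..., None] * x + B_grid[..., None]
cost_grid = ((pred - y) ** 2).sum(axis=-1) / (2 * len(y))

# Loop version, for comparison with the vectorized result
cost_loop = np.zeros_like(A_grid)
for i in range(len(A_vals)):
    for j in range(len(B_vals)):
        y_pred = A_vals[i] * x + B_vals[j]
        cost_loop[j, i] = np.sum((y_pred - y) ** 2) / (2 * len(y))
```

Either grid can be passed unchanged to plt.contourf to reproduce the colormap above.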
#AI #CheeseRolling #CoopersHill #DataVisualization #DIY #GradientDescent #ArtificialIntelligence #LeastSquareCurveFitting #LinearRegression #MachineLearning #NeuralNetworkTraining #PolynomialInterpolation #Python #StochasticGradientDescent
Copyright 2024 AI-Talks.org