Stochastic Gradient Descent and the Law of Large Numbers: A Probabilistic Analysis of Convergence
Keywords
stochastic gradient descent, law of large numbers, convergence analysis, unbiased estimation, machine learning optimization
Abstract
Stochastic Gradient Descent (SGD) is one of the most fundamental optimization algorithms in modern large-scale machine learning. By estimating the gradient at each iteration from only a small random subset of the data, however, it inevitably introduces stochastic noise. This paper examines, from a probabilistic perspective, the mathematical mechanism by which this highly randomized algorithm nevertheless achieves stable convergence. The paper first rigorously reviews the statements and proofs of the Weak and Strong Laws of Large Numbers. It then constructs a probabilistic model of stochastic gradients and analyzes the update mechanism of the SGD algorithm. The study shows that the stochastic gradients drawn at successive iterations form a sequence of independent and identically distributed random vectors whose expectation is the true full-batch gradient. By the Law of Large Numbers, the sample mean of these stochastic gradients converges to the true gradient in probability (or almost surely), so the random noise is averaged out over long runs of iterations. The conclusion is that the effectiveness of SGD rests mathematically on “unbiased estimation combined with the Law of Large Numbers,” and that its overall optimization behavior asymptotically approximates standard, noise-free gradient descent.
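The mechanism described above can be illustrated numerically. The following minimal sketch (with an assumed least-squares setup and hypothetical parameter choices, not taken from the paper) draws minibatch gradients, which are unbiased estimates of the full-batch gradient, and shows that their sample mean approaches the true gradient as the number of draws grows, as the Law of Large Numbers predicts.

```python
import numpy as np

# Assumed toy problem: least-squares loss L(w) = ||Xw - y||^2 / (2n).
# Full-batch gradient: X^T (Xw - y) / n. A minibatch gradient computed on a
# uniformly sampled subset is an unbiased estimate of it; by the Law of Large
# Numbers, the average of many i.i.d. minibatch gradients converges to it.

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)  # evaluate gradients at an arbitrary fixed point

def full_gradient(w):
    return X.T @ (X @ w - y) / n

def stochastic_gradient(w, batch_size=10):
    idx = rng.integers(0, n, size=batch_size)  # minibatch sampled uniformly
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size   # unbiased gradient estimate

# Average K i.i.d. stochastic gradients; the error to the true gradient shrinks.
errors = []
for K in (10, 1000, 100000):
    avg = np.mean([stochastic_gradient(w) for _ in range(K)], axis=0)
    err = np.linalg.norm(avg - full_gradient(w))
    errors.append(err)
    print(f"K={K:6d}  ||mean stochastic grad - true grad|| = {err:.4f}")
```

The printed error decreases as K grows, mirroring the paper's claim that long-run iteration averages out the gradient noise.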
References
- [1] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- [2] Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT (pp. 177–186).
- [3] Durrett, R. (2010). Probability: Theory and Examples (4th ed.). Cambridge University Press.
- [4] Billingsley, P. (1995). Probability and Measure (3rd ed.). Wiley.
- [5] Tian, Y., Zhang, Y., & Zhang, H. (2023). Recent advances in stochastic gradient descent in deep learning. Mathematics, 11(3), 682.
- [6] Sclocchi, A., & Wyart, M. (2024). On the different regimes of stochastic gradient descent. Proceedings of the National Academy of Sciences, 121(9), e2316301121.
- [7] Li, T., Wang, B., Peng, C., & Yin, H. (2024). Stochastic gradient descent for kernel-based maximum correntropy criterion. Entropy, 26(12), 1104.
- [8] Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3), 400–407.
- [9] Xia, L., Massei, S., & Hochstenbach, M. E. (2025). On the convergence of gradient descent with stochastic rounding errors under the Polyak–Łojasiewicz inequality. Computational Optimization and Applications, 90, 753–799.
- [10] Lovas, A., & Rásonyi, M. (2023). Functional central limit theorem and strong law of large numbers for stochastic gradient Langevin dynamics. Applied Mathematics and Optimization, 88, 78.
- [11] Nguegnang, G. M., Rauhut, H., & Terstiege, U. (2024). Convergence of gradient descent for learning linear neural networks. Advances in Continuous and Discrete Models, 2024, 23.
