Stochastic Gradient Descent and the Law of Large Numbers: A Probabilistic Analysis of Convergence
Keywords
stochastic gradient descent, law of large numbers, convergence analysis, unbiased estimation, machine learning optimization
Abstract
Stochastic Gradient Descent (SGD) is one of the most fundamental optimization algorithms in modern large-scale machine learning. By estimating the gradient at each iteration from only a small random subset of the data, however, it inevitably introduces stochastic noise. This paper examines, from a probabilistic perspective, the mathematical mechanism by which this highly randomized algorithm nevertheless achieves stable convergence. The paper first rigorously reviews the statements and proofs of the Weak and Strong Laws of Large Numbers. It then constructs a probabilistic model of stochastic gradients and analyzes the update mechanism of the SGD algorithm. The study shows that the stochastic gradients drawn at successive iterations form a sequence of independent and identically distributed random vectors whose expectation is the true full-batch gradient. By the Law of Large Numbers, the sample mean of these stochastic gradients converges to the true gradient in probability (or almost surely), so the random noise is averaged out over long runs of iterations. The conclusion is that the effectiveness of SGD rests mathematically on “unbiased estimation combined with the Law of Large Numbers,” and that its overall optimization behavior asymptotically approximates standard, noise-free gradient descent.
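The mechanism described above can be illustrated numerically. The following minimal sketch (with an assumed least-squares setup and hypothetical parameter choices, not taken from the paper) draws minibatch gradients, which are unbiased estimates of the full-batch gradient, and shows that their sample mean approaches the true gradient as the number of draws grows, as the Law of Large Numbers predicts.

```python
import numpy as np

# Assumed toy problem: least-squares loss L(w) = ||Xw - y||^2 / (2n).
# Full-batch gradient: X^T (Xw - y) / n. A minibatch gradient computed on a
# uniformly sampled subset is an unbiased estimate of it; by the Law of Large
# Numbers, the average of many i.i.d. minibatch gradients converges to it.

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)  # evaluate gradients at an arbitrary fixed point

def full_gradient(w):
    return X.T @ (X @ w - y) / n

def stochastic_gradient(w, batch_size=10):
    idx = rng.integers(0, n, size=batch_size)  # minibatch sampled uniformly
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size   # unbiased gradient estimate

# Average K i.i.d. stochastic gradients; the error to the true gradient shrinks.
errors = []
for K in (10, 1000, 100000):
    avg = np.mean([stochastic_gradient(w) for _ in range(K)], axis=0)
    err = np.linalg.norm(avg - full_gradient(w))
    errors.append(err)
    print(f"K={K:6d}  ||mean stochastic grad - true grad|| = {err:.4f}")
```

The printed error decreases as K grows, mirroring the paper's claim that long-run iteration averages out the gradient noise.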
References
- [1] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- [2] Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT (pp. 177–186).
- [3] Durrett, R. (2010). Probability: Theory and Examples (4th ed.). Cambridge University Press.
- [4] Billingsley, P. (1995). Probability and Measure (3rd ed.). Wiley.
- [5] Tian, Y., Zhang, Y., & Zhang, H. (2023). Recent advances in stochastic gradient descent in deep learning. Mathematics, 11(3), 682.
- [6] Sclocchi, A., & Wyart, M. (2024). On the different regimes of stochastic gradient descent. Proceedings of the National Academy of Sciences, 121(9), e2316301121.
- [7] Li, T., Wang, B., Peng, C., & Yin, H. (2024). Stochastic gradient descent for kernel-based maximum correntropy criterion. Entropy, 26(12), 1104.
- [8] Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3), 400–407.
- [9] Xia, L., Massei, S., & Hochstenbach, M. E. (2025). On the convergence of gradient descent with stochastic rounding errors under the Polyak–Łojasiewicz inequality. Computational Optimization and Applications, 90, 753–799.
- [10] Lovas, A., & Rásonyi, M. (2023). Functional central limit theorem and strong law of large numbers for stochastic gradient Langevin dynamics. Applied Mathematics and Optimization, 88, 78.
- [11] Nguegnang, G. M., Rauhut, H., & Terstiege, U. (2024). Convergence of gradient descent for learning linear neural networks. Advances in Continuous and Discrete Models, 2024, 23.
