Frontiers in Artificial Intelligence Algorithm Optimization: A Comprehensive Review of Training-Time and Inference-Time Advances

Juntong Lu

Keywords

artificial intelligence optimization, deep learning efficiency, large language models (LLMs), training-time acceleration, inference-time acceleration, reinforcement learning with human feedback (RLHF), sustainable AI

Abstract

The rapid progress of artificial intelligence (AI) has been largely driven by the scaling of deep neural networks, advances in hardware accelerators, and the availability of large-scale datasets. However, the computational, memory, and energy demands of training and deploying foundation models such as GPT-5 and LLaMA-3 have created scalability and sustainability bottlenecks. Algorithmic optimization has emerged as a central strategy to alleviate these challenges across training-time efficiency, inference-time acceleration, long-context extension, and alignment learning. This article provides a comprehensive review of the state of the art in AI algorithm optimization, systematically categorizing approaches, benchmarking them under unified metrics (memory, throughput, latency, perplexity, stability, complexity, portability), and identifying failure modes and boundary conditions. We further present reproducibility artifacts, including minimal training and inference stacks (GaLore + Sophia optimizer; vLLM + FlashAttention-3 + QServe) and standardized datasets (MMLU, GSM8K, LongBench, DCLM). Our synthesis underscores that algorithm–system co-design—spanning optimizer innovations, quantization-aware serving, context length generalization, and efficient preference alignment—is critical to achieving both efficiency and ethical sustainability in next-generation AI systems.
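To make the reproducibility claim above concrete, the sketch below illustrates what the inference half of the stack (vLLM serving a weight-quantized checkpoint, with FlashAttention kernels selected automatically when available) might look like in practice. This is a minimal, illustrative sketch, not the article's exact configuration: the model identifier, quantization mode, and sampling settings are assumptions chosen for demonstration.

```python
# Minimal sketch of a quantization-aware serving setup with vLLM.
# The checkpoint name below is a hypothetical AWQ-quantized model used for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-3-8B-AWQ",  # assumed checkpoint name, illustrative only
    quantization="awq",               # weight-only 4-bit post-training quantization
    gpu_memory_utilization=0.90,      # fraction of GPU memory reserved for weights + KV cache
    max_model_len=8192,               # serving context window (assumed value)
)

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

# Batched generation; vLLM's PagedAttention manages the KV cache across requests.
outputs = llm.generate(["Explain speculative decoding in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```

A more aggressive serving configuration (for example, a W4A8KV4 engine in the spirit of QServe, or adding a speculative-decoding draft model) would follow the same pattern, trading a modest amount of setup complexity for higher serving throughput.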

References

  • Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., & McKinnon, C. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint. https://doi.org/10.48550/arXiv.2212.08073
  • Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning [Paper presentation]. Proceedings of the 26th Annual International Conference on Machine Learning, New York, NY, United States.
  • Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., & Dao, T. (2024). MEDUSA: Simple LLM inference acceleration framework with multiple decoding heads. arXiv preprint. https://arxiv.org/abs/2401.10774
  • Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences [Paper presentation]. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
  • Dao, T. (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint. https://arxiv.org/abs/2307.08691
  • Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness [Paper presentation]. 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA.
  • Ding, Y., Zhang, L. L., Zhang, C., Xu, Y., Shang, N., Xu, J., Yang, F., & Yang, M. (2024). LongRoPE: Extending LLM context window beyond 2 million tokens [Paper presentation]. ICML'24: Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria.
  • Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2021). Sharpness-aware minimization for efficiently improving generalization [Paper presentation]. ICLR 2021 - 9th International Conference on Learning Representations, Virtual Only Conference.
  • Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint. https://arxiv.org/abs/2210.17323
  • Graves, A., Bellemare, M. G., Menick, J., Munos, R., & Kavukcuoglu, K. (2017). Automated curriculum learning for neural networks [Paper presentation]. 34th International Conference on Machine Learning, ICML 2017, Sydney, Australia.
  • Gu, A., & Dao, T. (2024). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint. https://arxiv.org/abs/2312.00752
  • Gu, Y., Yan, Z., Wang, Y., Zhang, Y., Zhou, Q., Wu, F., & Yang, H. (2025). InfiFPO: Implicit model fusion via preference optimization in large language models. arXiv preprint. https://doi.org/10.48550/arXiv.2505.13878
  • Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint. https://arxiv.org/abs/2001.08361
  • Kingma, D. P., & Ba, J. L. (2015). Adam: A method for stochastic optimization [Paper presentation]. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, San Diego, CA, USA.
  • Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., & Stoica, I. (2023). Efficient memory management for large language model serving with PagedAttention [Paper presentation]. Proceedings of the 29th Symposium on Operating Systems Principles, Koblenz, Germany.
  • Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding [Paper presentation]. International Conference on Machine Learning (ICML), 2023, Honolulu, HI, USA.
  • Lin, J., Tang, J., Tang, H., Yang, S., Xiao, G., & Han, S. (2025). AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. GetMobile: Mobile Computing and Communications, 28(4), 12-17. https://doi.org/10.1145/3714983.3714987
  • Lin, Y., Tang, H., Yang, S., Zhang, Z., Xiao, G., Gan, C., & Han, S. (2025). QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. arXiv preprint. https://arxiv.org/abs/2405.04532
  • Liu, H., Li, Z., Hall, D., Liang, P., & Ma, T. (2024). Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint. https://arxiv.org/abs/2305.14342
  • Liu, X., Lei, B., Zhang, R., & Xu, D. D. K. (2025). Adaptive draft-verification for efficient large language model decoding. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23), 24668-24676. https://doi.org/10.1609/aaai.v39i23.34647
  • Müller, R., Kornblith, S., & Hinton, G. (2019). When does label smoothing help? [Paper presentation]. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
  • Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., & Zaharia, M. (2021). Efficient large-scale language model training on GPU clusters using Megatron-LM [Paper presentation]. SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, New York, NY, United States.
  • Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback [Paper presentation]. 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, United States.
  • Peng, B., Quesnelle, J., Fan, H., & Shippole, E. (2024). YaRN: Efficient context window extension of large language models. arXiv preprint. https://arxiv.org/abs/2309.00071
  • Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model [Paper presentation]. 37th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, United States.
  • Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). ZeRO: Memory optimizations toward training trillion parameter models [Paper presentation]. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA.
  • Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., & Dao, T. (2024). FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint. https://arxiv.org/abs/2407.08608
  • Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2020). Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint. https://arxiv.org/abs/1909.08053
  • Thompson, N., Greenewald, K., Lee, K., & Manso, G. F. (2023). The computational limits of deep learning. arXiv preprint. https://arxiv.org/abs/2007.05558
  • Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention [Paper presentation]. Proceedings of the 38th International Conference on Machine Learning, Virtual Only Conference.
  • Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., & Han, S. (2023). SmoothQuant: Accurate and efficient post-training quantization for large language models [Paper presentation]. Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA.
  • Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., & Tian, Y. (2024). GaLore: Memory-efficient LLM training by gradient low-rank projection. arXiv preprint. https://arxiv.org/abs/2403.03507
