Frontiers in Artificial Intelligence Algorithm Optimization: A Comprehensive Review of Training-Time and Inference-Time Advances

Juntong Lu

Keywords

artificial intelligence optimization, deep learning efficiency, large language models (LLMs), training-time acceleration, inference-time acceleration, reinforcement learning with human feedback (RLHF), sustainable AI

Abstract

The rapid progress of artificial intelligence (AI) has been largely driven by the scaling of deep neural networks, advances in hardware accelerators, and the availability of large-scale datasets. However, the computational, memory, and energy demands of training and deploying foundation models such as GPT-5 and LLaMA-3 have created scalability and sustainability bottlenecks. Algorithmic optimization has emerged as a central strategy to alleviate these challenges across training-time efficiency, inference-time acceleration, long-context extension, and alignment learning. This article provides a comprehensive review of the state of the art in AI algorithm optimization, systematically categorizing approaches, benchmarking them under unified metrics (memory, throughput, latency, perplexity, stability, complexity, portability), and identifying failure modes and boundary conditions. We further present reproducibility artifacts, including minimal training and inference stacks (GaLore + Sophia optimizer; vLLM + FlashAttention-3 + QServe) and standardized datasets (MMLU, GSM8K, LongBench, DCLM). Our synthesis underscores that algorithm–system co-design—spanning optimizer innovations, quantization-aware serving, context length generalization, and efficient preference alignment—is critical to achieving both efficiency and ethical sustainability in next-generation AI systems.
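To make the reproducibility claim above concrete, the sketch below illustrates what the inference half of the stack (vLLM serving a weight-quantized checkpoint, with FlashAttention kernels selected automatically when available) might look like in practice. This is a minimal, illustrative sketch, not the article's exact configuration: the model identifier, quantization mode, and sampling settings are assumptions chosen for demonstration.

```python
# Minimal sketch of a quantization-aware serving setup with vLLM.
# The checkpoint name below is a hypothetical AWQ-quantized model used for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-3-8B-AWQ",  # assumed checkpoint name, illustrative only
    quantization="awq",               # weight-only 4-bit post-training quantization
    gpu_memory_utilization=0.90,      # fraction of GPU memory reserved for weights + KV cache
    max_model_len=8192,               # serving context window (assumed value)
)

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

# Batched generation; vLLM's PagedAttention manages the KV cache across requests.
outputs = llm.generate(["Explain speculative decoding in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```

A more aggressive serving configuration (for example, a W4A8KV4 engine in the spirit of QServe, or adding a speculative-decoding draft model) would follow the same pattern, trading a modest amount of setup complexity for higher serving throughput.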

References

  • Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., & McKinnon, C. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint. https://doi.org/10.48550/arXiv.2212.08073
  • Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning [Paper presentation]. Proceedings of the 26th Annual International Conference on Machine Learning, New York, NY, United States.
  • Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., & Dao, T. (2024). MEDUSA: Simple LLM inference acceleration framework with multiple decoding heads. arXiv preprint. https://arxiv.org/abs/2401.10774
  • Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences [Paper presentation]. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
  • Dao, T. (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint. https://arxiv.org/abs/2307.08691
  • Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness [Paper presentation]. 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA.
  • Ding, Y., Zhang, L. L., Zhang, C., Xu, Y., Shang, N., Xu, J., Yang, F., & Yang, M. (2024). LongRoPE: Extending LLM context window beyond 2 million tokens [Paper presentation]. ICML'24: Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria.
  • Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2021). Sharpness-aware minimization for efficiently improving generalization [Paper presentation]. ICLR 2021 - 9th International Conference on Learning Representations, Virtual Only Conference.
  • Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint. https://arxiv.org/abs/2210.17323
  • Graves, A., Bellemare, M. G., Menick, J., Munos, R., & Kavukcuoglu, K. (2017). Automated curriculum learning for neural networks [Paper presentation]. 34th International Conference on Machine Learning, ICML 2017, Sydney, Australia.
  • Gu, A., & Dao, T. (2024). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint. https://arxiv.org/abs/2312.00752
  • Gu, Y., Yan, Z., Wang, Y., Zhang, Y., Zhou, Q., Wu, F., & Yang, H. (2025). InfiFPO: Implicit model fusion via preference optimization in large language models. arXiv preprint. https://doi.org/10.48550/arXiv.2505.13878
  • Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint. https://arxiv.org/abs/2001.08361
  • Kingma, D. P., & Ba, J. L. (2015). Adam: A method for stochastic optimization [Paper presentation]. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, San Diego, CA, USA.
  • Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., & Stoica, I. (2023). Efficient memory management for large language model serving with PagedAttention [Paper presentation]. Proceedings of the 29th Symposium on Operating Systems Principles, Koblenz, Germany.
  • Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding [Paper presentation]. International Conference on Machine Learning (ICML), 2023, Honolulu, HI, USA.
  • Lin, J., Tang, J., Tang, H., Yang, S., Xiao, G., & Han, S. (2025). AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. GetMobile: Mobile Computing and Communications, 28(4), 12-17. https://doi.org/10.1145/3714983.3714987
  • Lin, Y., Tang, H., Yang, S., Zhang, Z., Xiao, G., Gan, C., & Han, S. (2025). QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. arXiv preprint. https://arxiv.org/abs/2405.04532
  • Liu, H., Li, Z., Hall, D., Liang, P., & Ma, T. (2024). Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint. https://arxiv.org/abs/2305.14342
  • Liu, X., Lei, B., Zhang, R., & Xu, D. D. K. (2025). Adaptive draft-verification for efficient large language model decoding. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23), 24668-24676. https://doi.org/10.1609/aaai.v39i23.34647
  • Müller, R., Kornblith, S., & Hinton, G. (2019). When does label smoothing help? [Paper presentation]. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
  • Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., & Zaharia, M. (2021). Efficient large-scale language model training on GPU clusters using Megatron-LM [Paper presentation]. SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, New York, NY, United States.
  • Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback [Paper presentation]. 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, United States.
  • Peng, B., Quesnelle, J., Fan, H., & Shippole, E. (2024). YaRN: Efficient context window extension of large language models. arXiv preprint. https://arxiv.org/abs/2309.00071
  • Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model [Paper presentation]. 37th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, United States.
  • Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). ZeRO: Memory optimizations toward training trillion parameter models [Paper presentation]. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA.
  • Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., & Dao, T. (2024). FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint. https://arxiv.org/abs/2407.08608
  • Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2020). Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint. https://arxiv.org/abs/1909.08053
  • Thompson, N., Greenewald, K., Lee, K., & Manso, G. F. (2023). The computational limits of deep learning. arXiv preprint. https://arxiv.org/abs/2007.05558
  • Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention [Paper presentation]. Proceedings of the 38th International Conference on Machine Learning, Virtual Only Conference.
  • Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., & Han, S. (2023). SmoothQuant: Accurate and efficient post-training quantization for large language models [Paper presentation]. Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA.
  • Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., & Tian, Y. (2024). GaLore: Memory-efficient LLM training by gradient low-rank projection. arXiv preprint. https://arxiv.org/abs/2403.03507
