From RNN to Transformer: A Review of Neural Network Architectures for Sequence Modeling in Time Series Prediction
Keywords
neural networks, time series prediction, sequence modeling, RNN, transformer
Abstract
This paper presents a narrative review of the architectural evolution of sequence-modeling neural networks for time series prediction. It traces the development from Recurrent Neural Networks (RNNs) to Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), and subsequently to Transformers. By synthesizing findings from peer-reviewed studies published between 1997 and 2025, it compares these models in terms of temporal feature capture, long-range dependency modeling, and gradient propagation stability. The analysis indicates that while RNNs established the foundational framework for sequence processing, their gradient instability restricts their practical use to short sequences. LSTM and GRU architectures significantly improve long-sequence modeling through gating mechanisms but remain constrained by sequential computation. Transformer-based models use self-attention to process all time steps in parallel and to capture global dependencies more effectively, albeit at higher computational cost. Emerging strategies such as sparse attention, patching, and knowledge augmentation aim to address these limitations. This review provides a structured reference for model selection and architectural optimization in time series prediction.
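The contrast between sequential recurrence and parallel self-attention can be illustrated with a minimal NumPy sketch. It is not taken from any of the reviewed models: the function names, the random weights, and the sizes (96 time steps, 4 input features, 8 hidden units) are assumptions chosen only for illustration. The recurrent pass must loop over time steps in order, whereas self-attention handles all steps with a few matrix products but builds a T-by-T weight matrix, which is where the higher computational cost comes from.

```python
import numpy as np

def rnn_forward(x, W_x, W_h):
    """Vanilla RNN pass: step t needs the hidden state from step t-1,
    so the time loop cannot be parallelized."""
    h = np.zeros(W_h.shape[0])
    states = []
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention: every time step
    attends to every other step in one set of matrix products."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # shape (T, T): quadratic in T
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
T, d_in, d_model = 96, 4, 8
x = rng.normal(size=(T, d_in))

rnn_states = rnn_forward(
    x,
    rng.normal(size=(d_in, d_model)) * 0.1,
    rng.normal(size=(d_model, d_model)) * 0.1,
)
attn_out, attn_weights = self_attention(
    x, *(rng.normal(size=(d_in, d_model)) for _ in range(3))
)

print(rnn_states.shape, attn_out.shape, attn_weights.shape)
# (96, 8) (96, 8) (96, 96)
```

Both functions return one 8-dimensional vector per time step, but the attention weights occupy a 96 x 96 matrix. Sparse attention and patching, mentioned above, can be read as ways of shrinking that matrix, either by dropping entries or by reducing the number of tokens T.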
