Survey of Human-computer Interaction Based on Multimodal Fusion

Zihui Zhao

Keywords

multi-modal fusion, human-computer interaction, feature fusion, cross-modal attention, emotion recognition

Abstract

Multimodal fusion technology enables information exchange between humans and computers by integrating information from different modalities such as vision, speech, and touch, and has become an important research direction in human-computer interaction. This paper focuses on four mainstream multimodal fusion approaches: graph-based feature fusion, cross-modal attention, cross-correlation attention architectures, and multimodal emotion recognition. It compares and analyzes their technical principles, strengths, weaknesses, and application scenarios, and systematically summarizes the differences in their technical characteristics. By integrating multiple input modalities, these methods significantly improve the user interface interaction experience, increase the efficiency of multi-source information processing, and offer new ideas for interaction design in complex scenarios. The survey shows that multimodal fusion-based human-computer interaction can effectively reduce users' cognitive load and improve operational efficiency, and has important application value in education, healthcare, smart homes, and other fields. Future work needs to address the challenges of insufficient cross-modal data alignment accuracy and demanding real-time requirements, and to explore a deeper combination of affective computing and multimodal fusion.
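
To make the cross-modal attention idea mentioned above concrete, the following minimal sketch shows how features from two modalities (for example, speech and vision) can each attend to the other before being pooled into a joint representation for a downstream task such as emotion recognition. It is purely illustrative and not taken from any of the surveyed systems; the module name CrossModalAttentionFusion, the feature dimensions, and the mean-pooling fusion step are all assumptions.

# Illustrative sketch of cross-modal attention fusion (assumed design, not from the surveyed papers).
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Fuses two modality feature sequences by letting each modality
    attend to the other, then concatenating the pooled results."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Query comes from one modality, keys/values from the other.
        self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, T_a, dim), visual_feats: (batch, T_v, dim)
        a2v, _ = self.audio_to_visual(audio_feats, visual_feats, visual_feats)
        v2a, _ = self.visual_to_audio(visual_feats, audio_feats, audio_feats)
        # Pool each attended sequence over time and fuse into one joint vector.
        fused = torch.cat([a2v.mean(dim=1), v2a.mean(dim=1)], dim=-1)
        return self.proj(fused)  # (batch, dim) joint representation


if __name__ == "__main__":
    model = CrossModalAttentionFusion()
    audio = torch.randn(2, 50, 256)   # e.g., 50 speech frames per sample
    video = torch.randn(2, 30, 256)   # e.g., 30 video frames per sample
    print(model(audio, video).shape)  # torch.Size([2, 256])

In this sketch the bidirectional attention lets each modality weight the other's time steps, which is the core mechanism the cross-modal attention methods discussed in the paper share; graph-based and cross-correlation variants replace this step with graph convolution or correlation-based weighting.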
