Integrating Speech into Large Language Models: Architectures, Training Strategies, and Emerging Challenges

Jiajun Li

Keywords

speech language models, multimodal large language models, cross-modal alignment, speech-text integration, foundation models

Abstract

This paper examines methods for incorporating speech into large language models (LLMs), with particular emphasis on architectural frameworks, training methodologies, and evaluation protocols. The analysis covers three dimensions: the performance of speech encoders on tasks such as speech recognition, translation, dialogue, and affective computing; cross-modal alignment training techniques; and approaches for integrating speech encoders with LLMs. A total of 18 studies published between 2023 and 2024 were included in the survey. Three architectural approaches are identified: the unified decoder framework, the encoder-adapter-LLM pipeline, and the multi-stream hierarchical model. Each presents distinct trade-offs between modularity and integration depth. Our findings indicate that dual-encoder architectures and hierarchical token representations significantly improve model robustness. Additionally, curriculum learning and activation tuning effectively mitigate catastrophic forgetting in cross-modal training. Compared with text-based models, spoken language models still face persistent challenges in computational efficiency, evaluation consistency, and scaling behavior. Further research is needed on real-time full-duplex communication, systematic scaling techniques for speech foundation models, and low-resource language documentation.

References

  • [1] Gaido, M., Papi, S., Negri, M., & Bentivogli, L. (2024). Speech translation with speech foundation models and large language models: What is there and what is missing? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 14760–14778). Association for Computational Linguistics.
  • [2] Zhang, D., Li, S., Zhang, X., Zhan, J., Wang, P., Zhou, Y., & Qiu, X. (2023). SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 15757–15773). Association for Computational Linguistics.
  • [3] Tang, C., Yu, W., Sun, G., Chen, X., Tan, T., Li, W., Lu, L., Ma, Z., & Zhang, C. (2024). SALMONN: Towards generic hearing abilities for large language models. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024).
  • [4] Cuervo, S., & Marxer, R. (2024). Scaling properties of speech language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 351–361). Association for Computational Linguistics.
  • [5] Hu, S., Zhou, L., Liu, S., Chen, S., Meng, L., Hao, H., Pan, J., Liu, X., Li, J., Sivasankaran, S., Liu, L., & Wei, F. (2024). WavLLM: Towards robust and adaptive speech large language model. In Findings of the Association for Computational Linguistics: EMNLP 2024 (pp. 4552–4572). Association for Computational Linguistics.
  • [6] Rubenstein, P. K., Asawaroengchai, C., Nguyen, D. D., Bapna, A., Borber, Z., Riesa, J., Tanaka, K., Lamania, T., Chen, J., Ghaffarizadeh, A., Mengibar, R., & others. (2023). AudioPaLM: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925.
  • [7] Chu, Y., Xu, J., Zhou, X., Yang, Q., Zhang, S., Yan, Z., Zhou, C., & Zhou, J. (2023). Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919.
  • [8] Maiti, S., Peng, Y., Choi, S., Jung, J.-W., Chang, X., & Watanabe, S. (2024). VoxtLM: Unified decoder-only models for consolidating speech recognition, synthesis and speech, text continuation tasks. In ICASSP 2024 – IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE.
  • [9] Nguyen, T. A., Muller, B., Yu, B., Costa-jussà, M. R., Elbayad, M., Popuri, S., Ropers, C., Duquenne, P.-A., Algayres, R., Mavlyutov, R., Gat, I., Williamson, M., Synnaeve, G., Pino, J., Sagot, B., & Dupoux, E. (2024). SPIRIT-LM: Interleaved spoken and written language model. arXiv preprint arXiv:2402.05755.
  • [10] Shaik, Z. H., Hegde, P., Bannulmath, P., & T, D. K. (2024). LaRA: Large rank adaptation for speech and text cross-modal learning in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics.
  • [11] Défossez, A., Mazaré, L., Orsini, M., Royer, A., Pérez, P., Jégou, H., Grave, E., & Zeghidour, N. (2024). Moshi: A speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037.
  • [12] Zhu, Y., Su, D., He, L., Xu, L., & Yu, D. (2024). Generative pre-trained speech language model with efficient hierarchical transformer. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024).
  • [13] Hu, Y., Chen, C., Yang, C.-H. H., Li, R., Zhang, D., Chen, Z., & Chng, E. S. (2024). GenTranslate: Large language models are generative multilingual speech and machine translators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
  • [14] Wang, M., Wang, Y., Vu, T.-T., Shareghi, E., & Haffari, R. (2024). Exploring the potential of multimodal LLM with knowledge-intensive multimodal ASR. In Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics.
  • [15] Voas, J., Mooney, R., & Harwath, D. (2024). Multimodal contextualized semantic parsing from speech. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
  • [16] Zhang, X., Liu, H., Xu, K., Zhang, Q., Liu, D., Ahmed, B., & Epps, J. (2024). When LLMs meet acoustic landmarks: An efficient approach to integrate speech into large language models for depression detection. In Proceedings of Interspeech 2024. ISCA.
  • [17] He, T., Choi, K., Tjuatja, L., Levin, L., Neubig, G., & Mortensen, D. R. (2024). WAV2GLOSS: Generating interlinear glossed text from speech. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
  • [18] Yan, B. C., Li, J. T., Wang, Y. C., Wang, H. W., Lo, T. H., Hsu, Y. C., Chao, W. C., & Chen, B. (2024). An effective pronunciation assessment approach leveraging hierarchical transformers and pre-training strategies. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.