Volume 26, Issue 3
ISSN 1572-0373 | E-ISSN 1572-0381
DOI: 10.1075/is.25014.gar

Abstract

Managing conversational interactions with groups of people remains an open challenge in human-robot interaction, requiring a multi-modal combination of sensory inputs and outputs with dialogue systems. In this paper, we present an integrated multi-modal system that connects a Large Language Model (LLM) with a social robot's perception and action modules to manage situated multi-party interactions. We describe and discuss exploratory results from a system-wide performance evaluation, conducted as a within-subjects user study in which 27 unique pairs of participants interacted with a social robot under two conditions: a multi-party capable system and a baseline system with only single-party capabilities. Participants interacted with the two systems in a combination of task-based and open-ended scenarios, for a total of 108 interactions with each system. Our evaluation showed a slight preference for the multi-party system and a more balanced interaction overall, and it highlights both the potential of and the open challenges in integrating LLM capabilities into robotic conversational systems.
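The architecture summarized above pairs perception (who is present, who is speaking, and whom they address) with an LLM-based dialogue manager. As a minimal illustrative sketch of that idea, and not the authors' implementation, the Python snippet below shows a multi-party turn loop in which speaker-labelled utterances and an addressee estimate gate whether the robot replies; all names here (`Utterance`, `DialogueState`, `llm_generate`, `step`) are hypothetical.

```python
# Minimal sketch of a multi-party interaction loop: perception output
# (speaker label + addressee estimate) gates an LLM-backed reply.
# Hypothetical names throughout; not the system described in the paper.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Utterance:
    speaker: str              # diarized speaker label, e.g. "P1"
    text: str                 # ASR transcript of the utterance
    addressed_to_robot: bool  # gaze/addressee estimate from perception

@dataclass
class DialogueState:
    history: list = field(default_factory=list)

    def prompt(self) -> str:
        # Speaker-labelled history lets the LLM keep track of who said
        # what -- the key difference from a single-party prompt.
        lines = [f"{u.speaker}: {u.text}" for u in self.history]
        return "\n".join(lines) + "\nROBOT:"

def llm_generate(prompt: str) -> str:
    """Stand-in for the actual LLM call (hypothetical)."""
    return "Interesting! P2, do you agree?"

def step(state: DialogueState, utt: Utterance) -> Optional[str]:
    state.history.append(utt)
    # Reply only when perception says the robot is the addressee;
    # otherwise stay silent so the participants can talk to each other.
    if not utt.addressed_to_robot:
        return None
    reply = llm_generate(state.prompt())
    state.history.append(Utterance("ROBOT", reply, False))
    return reply

if __name__ == "__main__":
    state = DialogueState()
    print(step(state, Utterance("P1", "Robot, what should we cook tonight?", True)))
    print(step(state, Utterance("P2", "I was thinking pasta.", False)))
```

In a deployed system the `speaker` and `addressed_to_robot` fields would be produced by modules such as speaker diarization and gaze-based addressee estimation; modelling them as plain input fields here keeps the turn-gating logic visible.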
