Volume 17, Issue 2
  • ISSN: 1572-0373
  • E-ISSN: 1572-0381


Human instructors often refer to the objects and actions involved in a task description using both linguistic and non-linguistic means of communication. Hence, for robots to engage in natural human-robot interactions, we need to better understand the relevant aspects of human multi-modal task descriptions. We analyse reference resolution to objects in a data collection comprising two object manipulation tasks (22 teacher-student interactions in Task 1 and 16 in Task 2) and find that 78.76% of all referring expressions to the objects relevant in Task 1 are verbally underspecified, as are 88.64% of all referring expressions in Task 2. The data strongly suggest that a language processing module for robots must be genuinely multi-modal, allowing for seamless integration of information transmitted in the verbal and the visual channels; tracking the speaker's eye gaze and gestures, as well as object recognition, are necessary preconditions for such integration.






  • Article Type: Research Article
  • Keywords: human-robot interaction; multi-modal communication; reference resolution