Grounded Visual Dialogue
This project will explore how to model dialogue phenomena in visual dialogue and how these phenomena contribute to task success.
Visual dialogue requires an agent to hold a meaningful conversation with a user in the context of an image. Such systems have potential applications in VR technology, dialogue-based image retrieval, and agents that can provide useful information about an image or other visual content to visually impaired people.
The current focus of visual dialogue is mainly on question answering, i.e. providing one question and answer at a time. For example, the visual question answering shared task (Antol et al., 2015) and the visual dialogue dataset (Das et al., 2017) both focus on answer retrieval with no or very little dialogue context needed. The GuessWhat?! challenge (de Vries et al., 2017) reduces dialogue to retrieving a list of questions and answers to identify a common object. None of these datasets/shared tasks currently displays human-like dialogue phenomena, such as discourse structure, evidence of joint understanding/grounding, alignment, and on-the-fly repair and clarification.
This work will:
- design a new shared task and data collection for visual dialogue
- model the dynamics of meaning and grounding in visual dialogue by combining early work on symbolic dialogue modelling with current advances in deep learning for vision-and-language tasks, such as zero-shot and few-shot learning
- investigate how dynamic models of grounding and meaning in visual dialogue contribute to task success between user and system