Supun Bhathiya Hemanthage
Text-only as well as multimodal conversational agents have attracted an unprecedented level of interest in the AI community in recent years, especially with the emergence of deep learning techniques in natural language processing and computer vision. Despite ground-breaking work in the field, most current deep learning models rely on the co-occurrence of tokens in large text corpora and generate text devoid of meaning. This work aims to develop conversational agents capable of comprehending the real-world meaning of tokens with respect to the knowledge context and visual context of multimodal conversations, and of generating responses accordingly. In particular, we focus on an effective fusion between graph-structured representations (such as knowledge graphs and scene graphs) and language models.