Shreyank Narayana Gowda
Computer vision has made remarkable progress in recognition tasks in recent years. However, this progress is largely limited to the fully supervised setting, where large amounts of labeled data are available. This dependency on data has two limitations. First, collecting labels is extremely costly, especially for video, where annotation is far more expensive than for images. This limits the tasks and classes that are learnable. Second, datasets do not contain joint labels across different vision tasks such as object recognition, action recognition, and scene recognition, so this joint information cannot be leveraged. To mitigate these issues, we explore weak supervision of videos from captions and video descriptions, by modeling vision and language together. We argue that captions and descriptions that already exist for some video data (such as movies) contain joint information about objects, scenes, and actions, which can help improve current video understanding technology.
I'm a PhD student affiliated with IPAB in the School of Informatics at the University of Edinburgh, where I am supervised by Laura Sevilla-Lara and co-supervised by Frank Keller. I am fortunate to collaborate with Marcus Rohrbach of Facebook AI Research, and I also collaborate with Yannis Kalantidis of Naver Labs Europe. I broadly work on video-based computer vision applications. Before coming to Edinburgh, I spent two wonderful years as a master's student at Tsinghua University, after being awarded a Chinese Government Scholarship on the recommendation of the university, under the supervision of Chun Yuan.