Visual Commonsense Reasoning (VCR) is a new task and large-scale dataset for cognition-level visual understanding.
With one glance at an image, we can effortlessly imagine the world beyond the pixels (e.g. that [person1] ordered pancakes). While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. In addition to answering challenging visual questions expressed in natural language, a model must provide a rationale explaining why its answer is true.
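To make the two-part requirement concrete, here is a minimal sketch of how an answer plus its rationale could be scored. It assumes a multiple-choice setup for both the answer and the rationale, and counts a prediction as correct only when both are right. The class `VCRExample`, its field names, the helper `joint_accuracy`, and the sample question text are all hypothetical illustrations, not the dataset's actual schema or the official evaluation code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VCRExample:
    """Hypothetical container for one question (field names are illustrative)."""
    question: str
    answer_choices: List[str]
    rationale_choices: List[str]
    answer_label: int      # index of the correct answer
    rationale_label: int   # index of the correct rationale


def joint_accuracy(
    examples: Sequence[VCRExample],
    answer_preds: Sequence[int],
    rationale_preds: Sequence[int],
) -> float:
    """Fraction of examples where BOTH the answer and the rationale are correct."""
    if not examples:
        return 0.0
    correct = sum(
        1
        for ex, a, r in zip(examples, answer_preds, rationale_preds)
        if a == ex.answer_label and r == ex.rationale_label
    )
    return correct / len(examples)


# Made-up example data, loosely based on the "[person1] ordered pancakes" illustration.
examples = [
    VCRExample(
        question="What did [person1] order?",
        answer_choices=["[person1] ordered pancakes.", "[person1] ordered soup."],
        rationale_choices=["A plate of pancakes is in front of [person1].",
                           "[person1] is holding a spoon."],
        answer_label=0,
        rationale_label=0,
    ),
    VCRExample(
        question="Why is [person2] smiling?",
        answer_choices=["[person2] is bored.", "[person2] is happy to see [person1]."],
        rationale_choices=["[person2] is looking at [person1].",
                           "[person2] is looking at the floor."],
        answer_label=1,
        rationale_label=0,
    ),
]

# First example: answer and rationale both right. Second: right answer, wrong rationale.
score = joint_accuracy(examples, answer_preds=[0, 1], rationale_preds=[0, 1])
```

Under this joint criterion, a model that picks the right answer for the wrong reason gets no credit, which is the point of requiring a rationale.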
Overview of VCR
- 290k multiple choice questions
- 290k correct answers and rationales: one per question
- 110k images
- Counterfactual choices obtained with minimal bias, via our new Adversarial Matching approach
- Answers average 7.5 words; rationales average 16 words
- High human agreement (>90%)
- Scaffolded on top of 80 object categories from COCO
- Questions are highly diverse and challenging: browse and see for yourself!
Subscribe to our list
VCR is an ongoing effort. For announcements, follow Rowan on Twitter and subscribe to our Google group.
From Recognition to Cognition: Visual Commonsense Reasoning
If the paper inspires you, please cite us:
@inproceedings{zellers2019vcr,
  author    = {Zellers, Rowan and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
  title     = {From Recognition to Cognition: Visual Commonsense Reasoning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2019}
}