Visual Commonsense Reasoning
On VCR, a model must not only answer commonsense visual questions, but also provide a rationale that explains why the answer is true.
Submitting to the leaderboard
Submission is easy! You just need to email Rowan (rowanz at cs.washington.edu) with your predictions. Formatting instructions are below:
Please include in your email 1) a name for your model, 2) your team name (including your affiliation), and optionally, 3) a GitHub repo or paper link.
I'll try to get back to you within a few days, usually sooner. Teams can only submit results from a model once every 7 days.
I reserve the right to not score any of your submissions if you cheat -- for instance, please don't make up a bunch of fake names / email addresses and send me multiple submissions under those names.
What kinds of submissions are allowed?
The only constraint is that your system must predict the answer first, then the rationale. (The rationales were selected to be highly relevant to the correct question-answer pair, so they leak information about the correct answer.)
- To deter models from exploiting this leakage, the submission format involves submitting predictions for each possible rationale, conditioned on each possible answer.
- A simple way of setting up the experiments (used in the paper) is to treat each subtask as a query with four response choices. For Q->A, the query is the question and the response choices are the answers. For QA->R, the query is the question and answer concatenated together, and the response choices are the rationales. A sketch of the resulting prediction file is below.
Questions?
If it's not about something private, please ask in the Google group.
VCR Leaderboard
There are two different subtasks to VCR:
- Question Answering (Q->A): In this setup, a model is provided a question, and has to pick the best answer out of four choices. Only one of the four is correct.
- Answer Justification (QA->R): In this setup, a model is provided a question, along with the correct answer, and it has to justify it by picking the best rationale out of four choices.
We combine the two parts with the Q->AR metric, in which a model only gets a question right if it answers correctly and picks the right rationale. Models are evaluated in terms of accuracy (%). How well will your model do?
Rank | Date | Model | Team | Paper / Code | Q->A | QA->R | Q->AR |
---|---|---|---|---|---|---|---|
- | - | Human Performance (Zellers et al. '18) | University of Washington | | 91.0 | 93.0 | 85.0 |
1 | November 19, 2020 | BLENDER (single model) | WeSee AI team, Tencent | | 81.6 | 86.4 | 70.8 |
2 | June 24, 2020 | ERNIE-ViL-large (ensemble of 15 models) | ERNIE-team - Baidu | https://arxiv.org/abs/2006.16934 | 81.6 | 86.1 | 70.5 |
3 | October 28, 2020 | MMCNet (ensemble of 4 models) | UC Berkeley | | 80.0 | 83.1 | 66.9 |
4 | September 30, 2019 | UNITER-large (ensemble of 10 models) | MS D365 AI | https://arxiv.org/abs/1909.11740 | 79.8 | 83.4 | 66.8 |
5 | June 24, 2020 | ERNIE-ViL-large (single model) | ERNIE-team - Baidu | https://arxiv.org/abs/2006.16934 | 79.2 | 83.5 | 66.3 |
6 | May 22, 2020 | VILLA-large (single model) | MS D365 AI | https://arxiv.org/pdf/2006.06195.pdf | 78.9 | 82.8 | 65.7 |
7 | October 30, 2020 | VitsNet | Carnegie Mellon University | | 78.8 | 81.8 | 64.6 |
8 | July 16, 2020 | gnimix | anonymous | | 78.9 | 81.0 | 64.1 |
9 | September 23, 2019 | UNITER-large (single model) | MS D365 AI | https://arxiv.org/abs/1909.11740 | 77.3 | 80.8 | 62.8 |
10 | June 24, 2020 | ERNIE-ViL-base (single model) | ERNIE-team - Baidu | https://arxiv.org/abs/2006.16934 | 77.0 | 80.3 | 62.1 |
11 | October 22, 2020 | MMCNet | UC Berkeley | | 76.0 | 79.3 | 60.6 |
12 | February 17, 2021 | Test_VILLA | CCR | | 76.6 | 79.0 | 60.6 |
13 | May 22, 2020 | VILLA-base (single model) | MS D365 AI | https://arxiv.org/pdf/2006.06195.pdf | 76.4 | 79.1 | 60.6 |
14 | April 23, 2020 | KVL-BERT | Beijing Institute of Technology | | 76.4 | 78.6 | 60.3 |
15 | August 9, 2019 | ViLBERT (ensemble of 10 models) | Georgia Tech & Facebook AI Research | https://arxiv.org/abs/1908.02265 | 76.4 | 78.0 | 59.8 |
16 | September 23, 2019 | VL-BERT (single model) | MSRA & USTC | https://arxiv.org/abs/1908.08530 | 75.8 | 78.4 | 59.7 |
17 | March 23, 2019 | VVT | runningcat | | 75.3 | 78.9 | 59.7 |
18 | January 8, 2021 | vlt (single model) | Anonymous | | 75.3 | 77.8 | 58.9 |
19 | August 9, 2019 | ViLBERT (ensemble of 5 models) | Georgia Tech & Facebook AI Research | https://arxiv.org/abs/1908.02265 | 75.7 | 77.5 | 58.8 |
20 | September 4, 2019 | Unicoder-VL (ensemble of 2 models) | MSRA & PKU | https://arxiv.org/abs/1908.06066 | 76.0 | 77.1 | 58.6 |
21 | September 23, 2019 | UNITER-base (single model) | MS D365 AI | https://arxiv.org/abs/1909.11740 | 75.0 | 77.2 | 58.2 |
22 | November 5, 2020 | TDN | VARMS (Sun Yat-sen University) | | 75.7 | 76.4 | 58.0 |
23 | May 13, 2019 | B2T2 (ensemble of 5 models) | Google Research | http://arxiv.org/abs/1908.05054 | 74.0 | 77.1 | 57.1 |
24 | May 13, 2019 | B2T2 (single model) | Google Research | http://arxiv.org/abs/1908.05054 | 72.6 | 75.7 | 55.0 |
25 | August 27, 2019 | Unicoder-VL (single model) | MSRA & PKU | https://arxiv.org/abs/1908.06066 | 73.4 | 74.5 | 54.9 |
26 | July 30, 2019 | ViLBERT (single model) | Georgia Tech & Facebook AI Research | https://arxiv.org/abs/1908.02265 | 73.3 | 74.6 | 54.8 |
27 | November 14, 2019 | HGL | Sun Yat-sen University | | 72.2 | 73.4 | 53.2 |
28 | May 22, 2019 | TNet (ensemble of 5 models) | FlyingPig (Sun Yat-sen University) | | 72.7 | 72.6 | 53.0 |
29 | July 5, 2019 | VisualBERT | UCLA & AI2 & PKU | https://arxiv.org/abs/1908.03557 | 71.6 | 73.2 | 52.4 |
30 | March 2, 2021 | YXY | Peking University | | 71.6 | 72.8 | 52.3 |
31 | February 27, 2021 | SAC | SAC | | 71.6 | 72.4 | 52.1 |
32 | February 23, 2021 | CMR | flying melon | | 71.3 | 72.4 | 51.8 |
33 | September 4, 2019 | MKDN | Peking University | | 70.7 | 71.5 | 50.5 |
34 | October 25, 2019 | TAB-VCR | UIUC | https://arxiv.org/abs/1910.14671 | 70.4 | 71.7 | 50.5 |
35 | June 3, 2020 | RobustCL | NeurIPS 2020 submission ID 1218 | | 70.4 | 71.5 | 50.5 |
36 | May 22, 2019 | TNet (single model) | FlyingPig (Sun Yat-sen University) | | 70.9 | 70.6 | 50.4 |
37 | November 29, 2019 | WWR-Net | Tianjin University | | 70.8 | 70.7 | 50.2 |
38 | October 13, 2019 | transformer-r2c | SYSU NEW BANNER (Sun Yat-sen University) | | 70.7 | 70.5 | 50.0 |
39 | September 4, 2019 | CAR | Sun Yat-sen University | | 70.5 | 70.4 | 49.8 |
40 | December 6, 2020 | SIA V1 | SIA | | 69.6 | 71.3 | 49.8 |
41 | May 13, 2019 | HGL | HCP | https://arxiv.org/abs/1910.11475 | 70.1 | 70.8 | 49.8 |
42 | September 3, 2020 | SFW1 | Jiangsu University | | 70.2 | 70.7 | 49.7 |
43 | September 3, 2020 | SFW2 | Jiangsu University | | 69.6 | 71.1 | 49.6 |
44 | February 3, 2021 | vlb (single model) | Anonymous | | 69.8 | 69.8 | 48.9 |
45 | May 18, 2019 | CCD | Anonymous | | 68.5 | 70.5 | 48.4 |
46 | May 16, 2019 | MRCNet | MILAB (Seoul National University) | | 68.4 | 70.5 | 48.4 |
47 | May 17, 2019 | MUGRN | NeurIPS 2019 submission ID 21 | | 68.2 | 69.4 | 47.5 |
48 | May 13, 2019 | SGRE | UTS | https://github.com/AmingWu/Multi-modal-Circulant-Fusion/ | 67.5 | 69.7 | 46.9 |
49 | February 19, 2019 | FAIR | Facebook AI Research | | 65.7 | 70.1 | 46.3 |
50 | November 28, 2019 | DAF | Beijing Institute of Technology | | 66.9 | 68.7 | 46.0 |
51 | February 25, 2019 | CKRE | Peking University | | 66.9 | 68.2 | 45.9 |
52 | May 14, 2019 | emnet | DCP | | 66.6 | 68.0 | 45.4 |
53 | November 28, 2018 | Recognition to Cognition Networks | University of Washington | https://github.com/rowanz/r2c | 65.1 | 67.3 | 44.0 |
54 | May 20, 2019 | DVD | SL | | 66.3 | 65.0 | 43.3 |
55 | March 27, 2019 | GS Reasoning | UC San Diego | | 65.7 | 61.0 | 41.1 |
56 | May 17, 2019 | R2R (text only) | Anonymous | | 58.4 | 69.1 | 40.5 |
57 | November 28, 2018 | BERT-Base | Google AI Language (experiment by Rowan) | https://github.com/google-research/bert | 53.9 | 64.5 | 35.0 |
58 | November 28, 2018 | MLB | Seoul National University (experiment by Rowan) | https://github.com/jnhwkim/MulLowBiVQA | 46.2 | 36.8 | 17.2 |
59 | November 15, 2019 | Visual-Lang-base (ensemble) | Tiny-Group | | 17.7 | 74.1 | 13.3 |
- | - | Random Performance | | | 25.0 | 25.0 | 6.2 |