Visual Commonsense Reasoning
On VCR, a model must not only answer commonsense visual questions, but also provide a rationale that explains why the answer is true.
Submitting to the leaderboard
Submission is easy! You just need to email Rowan (rowanz at cs.washington.edu) with your predictions. Formatting instructions are below:
Please include in your email 1) a name for your model, 2) your team name (including your affiliation), and optionally, 3) a github repo or paper link.
If you submit an ensemble, please tell me how many models you used in your ensemble.
I'll try to get back to you within a few days, usually sooner. Teams can only submit results from a model once every 7 days.
I reserve the right to not score any of your submissions if you cheat -- for instance, please don't make up a bunch of fake names / email addresses and send me multiple submissions under those names.
What kinds of submissions are allowed?
The only constraint is that your system must predict the answer first, then the rationale. (The rationales were selected to be highly relevant to the correct Q,A pair, so they leak information about the correct answer.)
- To deter this, the submission format involves submitting predictions for each possible rationale, conditioned on each possible answer.
- A simple way of setting up the experiments (used in the paper) is to consider a task with a query and four response choices. For Q->A, the query is the question and the response choices are the answers. For QA->R, the query is the question and answer concatenated together, and the response choices are the rationales (see the sketch below).
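As a rough sketch of what that conditioning looks like in practice (the method names `score_q_to_a` and `score_qa_to_r` and the prediction layout are placeholders for your own model, not part of any official VCR API), each question needs four answer scores plus a full 4x4 grid of rationale scores, one row per candidate answer:

```python
# Illustrative sketch only: the scoring methods are hypothetical stand-ins
# for whatever model you trained.

def predict_example(model, question, answers, rationales):
    # Q->A: one score per candidate answer.
    answer_scores = [model.score_q_to_a(question, a) for a in answers]

    # QA->R: a 4x4 grid of rationale scores, conditioned on *every* candidate
    # answer (not just the predicted one), so the rationale choices cannot be
    # used to peek at the correct answer.
    rationale_scores = [
        [model.score_qa_to_r(question, a, r) for r in rationales]
        for a in answers
    ]
    return answer_scores, rationale_scores
```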
Questions?
If it's not about something private, check out the VCR google group.
VCR Leaderboard
There are two different subtasks to VCR:
- Question Answering (Q->A): In this setup, a model is provided a question, and has to pick the best answer out of four choices. Only one of the four is correct.
- Answer Justification (QA->R): In this setup, a model is provided a question, along with the correct answer, and it has to justify it by picking the best rationale out of four choices.
We combine the two parts with the Q->AR metric, in which a model only gets a question right if it answers correctly and picks the right rationale. Models are evaluated in terms of accuracy (%). How well will your model do?
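As a minimal illustration of the joint metric (this is not the official evaluation script; the function and argument names are made up for this sketch), an example only counts toward Q->AR when both the answer and the rationale are predicted correctly:

```python
# Minimal sketch of Q->AR: an example counts only if the predicted answer
# index and the predicted rationale index are both correct.

def q2ar_accuracy(answer_preds, rationale_preds, answer_labels, rationale_labels):
    correct = [
        a_hat == a and r_hat == r
        for a_hat, r_hat, a, r in zip(answer_preds, rationale_preds,
                                      answer_labels, rationale_labels)
    ]
    return 100.0 * sum(correct) / len(correct)
```

Guessing uniformly at random gets each subtask right about 25% of the time, so the joint metric lands near 25% x 25% = 6.25%, in line with the Random Performance row at the bottom of the table.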
Rank, Date | Model, Team | Q->A | QA->R | Q->AR |
---|---|---|---|---|
Human Performance, University of Washington (Zellers et al. '18) | 91.0 | 93.0 | 85.0 |
1 July 10, 2024 | OV-Grounding Anonymous | 91.4 | 92.8 | 85.0 |
2 September 11, 2023 | GPT4RoI Anonymous | 89.4 | 91.0 | 81.6 |
3 January 19, 2024 | ViP-LLaVa Anonymous | 89.2 | 90.9 | 81.3 |
4 March 14, 2024 | LLME-VCR Advanced Data Engineering and Real-time Computing Laboratory (ADE), School of Computer Science and Technology, Huazhong University of Science and Technology (HUST). | 87.0 | 86.6 | 75.7 |
5 November 28, 2022 | HunYuan_vcr Tencent Data Platform | 85.8 | 88.0 | 75.6 |
6 January 28, 2023 | SP-VCR (ensemble of 4 models) Shopee MMU | 83.6 | 88.6 | 74.4 |
7 April 8, 2022 | KS-MGSR KDDI Research and SNAP | 85.3 | 86.9 | 74.3 |
8 January 17, 2022 | VLUA+ Kuaishou MMU | 84.8 | 87.0 | 74.0 |
9 February 21, 2022 | VQA-GNN + MerlotReserve-Large (ensemble of 2 models) Anonymous | 85.2 | 86.6 | 74.0 |
10 April 8, 2022 | VLMT-VCR Anonymous | 84.7 | 85.9 | 72.9 |
11 December 14, 2021 | VL-RoBERTa Joint Laboratory of HIT and iFLYTEK Research (HFL) | 84.2 | 86.4 | 72.8 |
12 August 25, 2022 | SP-VCR (single model) Shopee MMU | 82.3 | 87.4 | 72.2 |
13 June 28, 2021 | VLUA (single model) Kuaishou MMU | 82.3 | 87.0 | 72.0 |
14 November 1, 2021 | MerlotReserve-Large University of Washington / AI2 https://rowanzellers.com/merlotreserve | 84.0 | 84.9 | 71.5 |
15 May 10, 2021 | UNIMO+ERNIE(ensemble of 7 models) Baidu NLP https://arxiv.org/abs/2012.15409 | 82.3 | 86.5 | 71.4 |
16 November 19, 2020 | BLENDER (single model) WeSee AI team, Tencent | 81.6 | 86.4 | 70.8 |
17 June 24, 2020 | ERNIE-ViL-large(ensemble of 15 models) ERNIE-team - Baidu https://arxiv.org/abs/2006.16934 | 81.6 | 86.1 | 70.5 |
18 January 12, 2024 | EventLens-large Advanced Data Engineering and Real-time Computing Laboratory (ADE), School of Computer Science and Technology, Huazhong University of Science and Technology (HUST). | 82.7 | 82.7 | 68.5 |
19 October 28, 2020 | MMCNet (ensemble of 4 models) UC Berkeley | 80.0 | 83.1 | 66.9 |
20 September 30, 2019 | UNITER-large (ensemble of 10 models) MS D365 AI https://arxiv.org/abs/1909.11740 | 79.8 | 83.4 | 66.8 |
21 April 7, 2022 | RobustNet Anonymous | 78.8 | 84.0 | 66.5 |
22 June 24, 2020 | ERNIE-ViL-large(single model) ERNIE-team - Baidu https://arxiv.org/abs/2006.16934 | 79.2 | 83.5 | 66.3 |
23 November 15, 2021 | ADVL (single model, formerly CLIP-TD) Anonymous | 79.6 | 82.9 | 66.2 |
24 June 8, 2021 | DELV Intel Labs Cognitive AI & MSRA NLC | 79.4 | 82.5 | 65.8 |
25 May 22, 2020 | VILLA-large (single model) MS D365 AI https://arxiv.org/pdf/2006.06195.pdf | 78.9 | 82.8 | 65.7 |
26 January 12, 2024 | EventLens-large Advanced Data Engineering and Real-time Computing Laboratory (ADE), School of Computer Science and Technology, Huazhong University of Science and Technology (HUST). | 81.0 | 80.3 | 65.5 |
27 May 28, 2021 | MERLOT (single model) University of Washington / AI2 https://rowanzellers.com/merlot | 80.6 | 80.4 | 65.1 |
28 October 30, 2020 | VitsNet Carnegie Mellon University | 78.8 | 81.8 | 64.6 |
29 July 16, 2020 | gnimix anonymous | 78.9 | 81.0 | 64.1 |
30 June 5, 2021 | SEITU Anonymous https://github.com/MyLittleChange/SEITU | 77.9 | 80.7 | 63.0 |
31 September 23, 2019 | UNITER-large (single model) MS D365 AI https://arxiv.org/abs/1909.11740 | 77.3 | 80.8 | 62.8 |
32 February 15, 2022 | VQA-GNN Anonymous | 77.9 | 80.0 | 62.8 |
33 November 1, 2021 | MerlotReserve-Base University of Washington / AI2 https://rowanzellers.com/merlotreserve | 79.3 | 78.7 | 62.6 |
34 March 15, 2022 | VCR-test Anonymous | 77.3 | 80.2 | 62.4 |
35 June 24, 2020 | ERNIE-ViL-base(single model) ERNIE-team - Baidu https://arxiv.org/abs/2006.16934 | 77.0 | 80.3 | 62.1 |
36 September 10, 2021 | Kam-net Anonymous | 77.4 | 79.2 | 61.8 |
37 April 9, 2024 | ICAR: Image Compression and Attentional Redundancy for Visual Commonsense Reasoning Prof. Xie Yiyuan and student Shang Chaofan, School of Electronic Information Engineering, Southwest University, Beibei District, Chongqing, China | 77.1 | 79.1 | 61.3 |
38 May 22, 2020 | VILLA-base (single model) MS D365 AI https://arxiv.org/pdf/2006.06195.pdf | 76.4 | 79.1 | 60.6 |
39 October 22, 2020 | MMCNet UC Berkeley | 76.0 | 79.3 | 60.6 |
40 February 17, 2021 | Test_VILLA CCR | 76.6 | 79.0 | 60.6 |
41 December 3, 2022 | VILLA_TEST Alignment, Shandong University | 76.6 | 79.0 | 60.6 |
42 December 3, 2022 | VILLA_GIST Alignment, Shandong University | 76.4 | 79.0 | 60.4 |
43 April 23, 2020 | KVL-BERT Beijing Institute of Technology | 76.4 | 78.6 | 60.3 |
44 August 9, 2019 | ViLBERT (ensemble of 10 models) Georgia Tech & Facebook AI Research https://arxiv.org/abs/1908.02265 | 76.4 | 78.0 | 59.8 |
45 March 23, 2019 | VVT runningcat | 75.3 | 78.9 | 59.7 |
46 September 23, 2019 | VL-BERT (single model) MSRA & USTC https://arxiv.org/abs/1908.08530 | 75.8 | 78.4 | 59.7 |
47 August 2, 2021 | PVL (single model) PVL-team (UCLA) | 76.9 | 77.1 | 59.7 |
48 September 10, 2021 | SGEITL Anonymous | 76.0 | 78.0 | 59.6 |
49 April 1, 2023 | UNITER_kd WalkingDog, Shandong University | 75.7 | 77.7 | 59.0 |
50 January 8, 2021 | vlt (single model) Anonymous | 75.3 | 77.8 | 58.9 |
51 August 9, 2019 | ViLBERT (ensemble of 5 models) Georgia Tech & Facebook AI Research https://arxiv.org/abs/1908.02265 | 75.7 | 77.5 | 58.8 |
52 December 14, 2022 | UNITER_joint WalkingDog, Shandong University | 75.6 | 77.5 | 58.8 |
53 June 16, 2021 | GITRL Anonymous | 75.5 | 77.5 | 58.7 |
54 September 4, 2019 | Unicoder-VL (ensemble of 2 models) MSRA & PKU https://arxiv.org/abs/1908.06066 | 76.0 | 77.1 | 58.6 |
55 March 5, 2022 | PEVL Anonymous | 76.0 | 76.7 | 58.6 |
56 December 14, 2022 | UNITER_independent WalkingDog, Shandong University | 75.5 | 77.3 | 58.6 |
57 September 23, 2019 | UNITER-base (single model) MS D365 AI https://arxiv.org/abs/1909.11740 | 75.0 | 77.2 | 58.2 |
58 November 5, 2020 | TDN VARMS (Sun Yat-sen University) | 75.7 | 76.4 | 58.0 |
59 May 13, 2019 | B2T2 (ensemble of 5 models) Google Research http://arxiv.org/abs/1908.05054 | 74.0 | 77.1 | 57.1 |
60 April 5, 2024 | BLIP-VCR Anonymous | 75.8 | 74.3 | 56.8 |
61 March 13, 2023 | GPT4 4-shot OpenAI (experiment by Rowan) | 73.5 | 75.4 | 56.2 |
62 September 11, 2021 | YTX Harbin Institute of Technology | 73.8 | 74.8 | 55.4 |
63 April 1, 2023 | VL-BERT_kd WalkingDog, Shandong University | 73.2 | 75.7 | 55.4 |
64 May 13, 2019 | B2T2 (single model) Google Research http://arxiv.org/abs/1908.05054 | 72.6 | 75.7 | 55.0 |
65 August 27, 2019 | Unicoder-VL (single model) MSRA & PKU https://arxiv.org/abs/1908.06066 | 73.4 | 74.5 | 54.9 |
66 July 30, 2019 | ViLBERT (single model) Georgia Tech & Facebook AI Research https://arxiv.org/abs/1908.02265 | 73.3 | 74.6 | 54.8 |
67 November 29, 2022 | VL-BERT_prec WalkingDog, Shandong University | 73.4 | 74.5 | 54.8 |
68 March 5, 2022 | ALBEF Salesforce Research (experiment by Anonymous) https://arxiv.org/abs/2107.07651 | 72.9 | 74.5 | 54.7 |
69 May 26, 2021 | ViLBERT_GCN Beijing Institute of Technology | 73.3 | 74.4 | 54.6 |
70 April 7, 2021 | DCGR Harbin Institute of Technology | 72.8 | 74.2 | 54.3 |
71 March 11, 2021 | CARC Harbin Institute of Technology | 72.9 | 73.8 | 54.1 |
72 November 14, 2019 | HGL Sun Yat-sen University | 72.2 | 73.4 | 53.2 |
73 May 22, 2019 | TNet (ensemble of 5) FlyingPig (Sun Yat-sen University) | 72.7 | 72.6 | 53.0 |
74 March 31, 2021 | CMR flying melon | 72.3 | 72.8 | 52.8 |
75 July 5, 2019 | VisualBERT UCLA & AI2 & PKU https://arxiv.org/abs/1908.03557 | 71.6 | 73.2 | 52.4 |
76 February 22, 2022 | TAB-KD Anonymous | 71.6 | 72.8 | 52.4 |
77 March 8, 2021 | SAC SAC | 71.7 | 72.8 | 52.2 |
78 January 5, 2022 | GTEHG MIC-Tongji University | 71.2 | 72.4 | 51.7 |
79 April 13, 2021 | A3 Net A3 Net | 71.2 | 72.0 | 51.4 |
80 January 17, 2022 | JCL Tianjin University | 70.7 | 71.6 | 50.9 |
81 November 29, 2022 | TAB-VCR_attribute WalkingDog, Shandong University | 70.5 | 71.6 | 50.8 |
82 September 4, 2019 | MKDN Peking University | 70.7 | 71.5 | 50.5 |
83 October 25, 2019 | TAB-VCR UIUC https://arxiv.org/abs/1910.14671 | 70.4 | 71.7 | 50.5 |
84 June 3, 2020 | RobustCL NeurIPS 2020 submission ID1218 | 70.4 | 71.5 | 50.5 |
85 October 30, 2021 | YXY Peking University | 70.6 | 71.2 | 50.5 |
86 May 22, 2019 | TNet (single model) FlyingPig (Sun Yat-sen University) | 70.9 | 70.6 | 50.4 |
87 March 26, 2021 | UABE WalkingDog from Qilu University of Technology | 70.3 | 71.3 | 50.4 |
88 November 29, 2019 | WWR-Net Tianjin University | 70.8 | 70.7 | 50.2 |
89 October 13, 2019 | transformer-r2c SYSU NEW BANNER (Sun Yat-sen University) | 70.7 | 70.5 | 50.0 |
90 May 13, 2019 | HGL HCP https://arxiv.org/abs/1910.11475 | 70.1 | 70.8 | 49.8 |
91 September 4, 2019 | CAR Sun Yat-sen University | 70.5 | 70.4 | 49.8 |
92 December 6, 2020 | SIA V1 SIA | 69.6 | 71.3 | 49.8 |
93 September 3, 2020 | SFW1 Jiangsu University | 70.2 | 70.7 | 49.7 |
94 January 20, 2022 | CCN-KD Anonymous | 70.0 | 70.6 | 49.7 |
95 September 3, 2020 | SFW2 Jiangsu University | 69.6 | 71.1 | 49.6 |
96 May 3, 2021 | RKB Hefei University of Technology | 69.6 | 70.7 | 49.3 |
97 June 10, 2021 | PUV (Pretrain UNITER by VC feature) Anonymous | 69.8 | 70.2 | 49.3 |
98 February 3, 2021 | vlb (single model) Anonymous | 69.8 | 69.8 | 48.9 |
99 June 10, 2021 | BLU Anonymous | 69.3 | 70.1 | 48.9 |
100 May 16, 2019 | MRCNet MILAB (Seoul National University) | 68.4 | 70.5 | 48.4 |
101 May 18, 2019 | CCD Anonymous | 68.5 | 70.5 | 48.4 |
102 May 17, 2019 | MUGRN NeurIPS 2019 submission ID 21 | 68.2 | 69.4 | 47.5 |
103 May 13, 2019 | SGRE UTS https://github.com/AmingWu/Multi-modal-Circulant-Fusion/ | 67.5 | 69.7 | 46.9 |
104 January 30, 2022 | R2C-KD Anonymous | 66.9 | 69.4 | 46.6 |
105 February 19, 2019 | FAIR Facebook AI Research | 65.7 | 70.1 | 46.3 |
106 January 20, 2022 | CCN-NKD Anonymous | 67.2 | 68.2 | 46.1 |
107 November 28, 2019 | DAF Beijing Institute of Technology | 66.9 | 68.7 | 46.0 |
108 February 25, 2019 | CKRE Peking University | 66.9 | 68.2 | 45.9 |
109 October 30, 2021 | MIE flying melon | 66.2 | 68.8 | 45.5 |
110 August 8, 2022 | ATGAN MMT, Tianjin University | 63.4 | 71.9 | 45.5 |
111 May 14, 2019 | emnet DCP | 66.6 | 68.0 | 45.4 |
112 November 28, 2018 | Recognition to Cognition Networks University of Washington https://github.com/rowanz/r2c | 65.1 | 67.3 | 44.0 |
113 May 20, 2019 | DVD SL | 66.3 | 65.0 | 43.3 |
114 March 27, 2019 | GS Reasoning UC San Diego | 65.7 | 61.0 | 41.1 |
115 May 17, 2019 | R2R (text only) Anonymous | 58.4 | 69.1 | 40.5 |
116 May 28, 2021 | R2CC Arjun Singh | 60.7 | 61.7 | 37.6 |
117 November 28, 2018 | BERT-Base Google AI Language (experiment by Rowan) https://github.com/google-research/bert | 53.9 | 64.5 | 35.0 |
118 June 18, 2021 | MUR Anonymous | 46.5 | 46.0 | 25.6 |
119 September 15, 2021 | SG-QA-model Anonymous | 70.9 | 24.9 | 17.7 |
120 November 28, 2018 | MLB Seoul National University (experiment by Rowan) https://github.com/jnhwkim/MulLowBiVQA | 46.2 | 36.8 | 17.2 |
121 October 23, 2021 | BERT-base-vc-ft LUKA-Axe | 58.6 | 24.9 | 14.5 |
122 November 15, 2019 | Visual-Lang-base (ensemble) Tiny-Group | 17.7 | 74.1 | 13.3 |
Random Performance | 25.0 | 25.0 | 6.2 |