VCR: Visual Commonsense Reasoning

Visual Commonsense Reasoning

On VCR, a model must not only answer commonsense visual questions, but also provide a rationale that explains why the answer is true.

Submitting to the leaderboard

Submission is easy! You just need to email Rowan (rowanz at cs.washington.edu) with your predictions. Formatting instructions are below:

Please include in your email 1) a name for your model, 2) your team name (including your affiliation), and optionally, 3) a github repo or paper link.

If you submit an ensemble, please tell me how many models you used in your ensemble.

I'll try to get back to you within a few days, usually sooner. Teams can only submit results from a model once every 7 days.

I reserve the right to not score any of your submissions if you cheat -- for instance, please don't make up a bunch of fake names / email addresses and send me multiple submissions under those names.

What kinds of submissions are allowed?

The only constraint is that your system must predict the answer first, then the rationale. (The rationales were selected to be highly relevant to the correct Q,A pair, so they leak information about the correct answer.)

To deter this, the submission format involves submitting predictions for each possible rationale, conditioned on each possible answer.
A simple way of setting up the experiments (used in the paper) is to consider a task with query and four response choices. For Q->A the query is the question, and the response choices are the answers. For QA->R, the query is the question and answer, concatenated together, and the response choices are the rationales.

Questions?

If it's not about something private, check out the google group below:

VCR Leaderboard

There are two different subtasks to VCR:

Question Answering (Q->A): In this setup, a model is provided a question, and has to pick the best answer out of four choices. Only one of the four is correct.
Answer Justification (QA->R): In this setup, a model is provided a question, along with the correct answer, and it has to justify it by picking the best rationale out of four choices.

We combine the two parts with the Q->AR metric, in which a model only gets a question right if it answers correctly and picks the right rationale. Models are evaluated in terms of accuracy (%). How well will your model do?

Rank	Model	Q->A	QA->R	Q->AR
	Human Performance University of Washington (Zellers et al. '18)	91.0	93.0	85.0
📼 July 10, 2024	OV-Grounding Anonymous	91.4	92.8	85.0
2 September 11, 2023	GPT4RoI Anonymous	89.4	91.0	81.6
3 January 19, 2024	ViP-LLaVa Anonymous	89.2	90.9	81.3
4 March 14, 2024	LLME-VCR Advanced Data Engineering and Real-time Computing Laboratory (ADE), School of Computor Science and Technology, Huazhong University of Science and Technology (HUST).	87.0	86.6	75.7
5 November 28, 2022	HunYuan_vcr Tencent Data Platform	85.8	88.0	75.6
6 January 28, 2023	SP-VCR (ensemble of 4 models) Shopee MMU	83.6	88.6	74.4
7 April 8, 2022	KS-MGSR KDDI Research and SNAP	85.3	86.9	74.3
8 January 17, 2022	VLUA+ Kuaishou MMU	84.8	87.0	74.0
9 February 21, 2022	VQA-GNN + MerlotReserve-Large (ensemble of 2 models) Anonymous	85.2	86.6	74.0
10 April 8, 2022	VLMT-VCR Anonymous	84.7	85.9	72.9
11 December 14, 2021	VL-RoBERTa Joint Laboratory of HIT and iFLYTEK Research (HFL)	84.2	86.4	72.8
12 August 25, 2022	SP-VCR (single model) Shopee MMU	82.3	87.4	72.2
13 June 28, 2021	VLUA (single model) Kuaishou MMU	82.3	87.0	72.0
14 November 1, 2021	🍷MerlotReserve-Large University of Washington / AI2 https://rowanzellers.com/merlotreserve	84.0	84.9	71.5
15 May 10, 2021	UNIMO+ERNIE(ensemble of 7 models) Baidu NLP https://arxiv.org/abs/2012.15409	82.3	86.5	71.4
16 November 19, 2020	BLENDER (single model) WeSee AI team, Tencent	81.6	86.4	70.8
17 June 24, 2020	ERNIE-ViL-large(ensemble of 15 models) ERNIE-team - Baidu https://arxiv.org/abs/2006.16934	81.6	86.1	70.5
18 January 12, 2024	EventLens-large Advanced Data Engineering and Real-time Computing Laboratory (ADE), School of Computor Science and Technology, Huazhong University of Science and Technology (HUST).	82.7	82.7	68.5
19 October 28, 2020	MMCNet (ensemble of 4 models) UC Berkeley	80.0	83.1	66.9
20 September 30, 2019	UNITER-large (ensemble of 10 models) MS D365 AI https://arxiv.org/abs/1909.11740	79.8	83.4	66.8
21 April 7, 2022	RobustNet Anonymous	78.8	84.0	66.5
22 June 24, 2020	ERNIE-ViL-large(single model) ERNIE-team - Baidu https://arxiv.org/abs/2006.16934	79.2	83.5	66.3
23 November 15, 2021	ADVL (single model, formerly CLIP-TD) Anonymous	79.6	82.9	66.2
24 June 8, 2021	DELV Intel Labs Cognitive AI & MSRA NLC	79.4	82.5	65.8
25 May 22, 2020	VILLA-large (single model) MS D365 AI https://arxiv.org/pdf/2006.06195.pdf)	78.9	82.8	65.7
26 January 12, 2024	EventLens-large Advanced Data Engineering and Real-time Computing Laboratory (ADE), School of Computor Science and Technology, Huazhong University of Science and Technology (HUST).	81.0	80.3	65.5
27 May 28, 2021	MERLOT (single model) University of Washington / AI2 https://rowanzellers.com/merlot	80.6	80.4	65.1
28 October 30, 2020	VitsNet Carnegie Mellon University	78.8	81.8	64.6
29 July 16, 2020	gnimix anonymous	78.9	81.0	64.1
30 June 5, 2021	SEITU Anonymous https://github.com/MyLittleChange/SEITU	77.9	80.7	63.0
31 September 23, 2019	UNITER-large (single model) MS D365 AI https://arxiv.org/abs/1909.11740	77.3	80.8	62.8
32 February 15, 2022	VQA-GNN Anonymous	77.9	80.0	62.8
33 November 1, 2021	🍷MerlotReserve-Base University of Washington / AI2 https://rowanzellers.com/merlotreserve	79.3	78.7	62.6
34 March 15, 2022	VCR-test Anonymous	77.3	80.2	62.4
35 June 24, 2020	ERNIE-ViL-base(single model) ERNIE-team - Baidu https://arxiv.org/abs/2006.16934	77.0	80.3	62.1
36 September 10, 2021	Kam-net Anonymous	77.4	79.2	61.8
37 April 9, 2024	ICAR : Image Compression and Attentional Redundancy for Visual Commonsense Reasoning Prof. Xie Yiyuan and student Shang Chaofan, School of Electronic Information Engineering, Southwest University, Beibei District, Chongqing, China	77.1	79.1	61.3
38 May 22, 2020	VILLA-base (single model) MS D365 AI https://arxiv.org/pdf/2006.06195.pdf)	76.4	79.1	60.6
39 October 22, 2020	MMCNet UC Berkeley	76.0	79.3	60.6
40 February 17, 2021	Test_VILLA CCR	76.6	79.0	60.6
41 December 3, 2022	VILLA_TEST Alignment, Shandong University	76.6	79.0	60.6
42 December 3, 2022	VILLA_GIST Alignment, Shandong University	76.4	79.0	60.4
43 April 23, 2020	KVL-BERT Beijing Institute of Technology	76.4	78.6	60.3
44 August 9,2019	ViLBERT (ensemble of 10 models) Georgia Tech & Facebook AI Research https://arxiv.org/abs/1908.02265	76.4	78.0	59.8
45 March 23, 2019	VVT runningcat	75.3	78.9	59.7
46 September 23,2019	VL-BERT (single model) MSRA & USTC https://arxiv.org/abs/1908.08530	75.8	78.4	59.7
47 August 2, 2021	PVL (single model) PVL-team (UCLA)	76.9	77.1	59.7
48 September 10, 2021	SGEITL Anonomous	76.0	78.0	59.6
49 April 1, 2023	UNITER_kd WalkingDog, Shandong University	75.7	77.7	59.0
50 January 8, 2021	vlt (single model) Anonymous	75.3	77.8	58.9
51 August 9,2019	ViLBERT (ensemble of 5 models) Georgia Tech & Facebook AI Research https://arxiv.org/abs/1908.02265	75.7	77.5	58.8
52 December 14, 2022	UNITER_joint WalkingDog, Shandong University	75.6	77.5	58.8
53 June 16, 2021	GITRL Anonomous	75.5	77.5	58.7
54 September 4,2019	Unicoder-VL (ensemble of 2 models) MSRA & PKU https://arxiv.org/abs/1908.06066	76.0	77.1	58.6
55 March 5, 2022	PEVL Anonymous	76.0	76.7	58.6
56 December 14, 2022	UNITER_independent WalkingDog, Shandong University	75.5	77.3	58.6
57 September 23, 2019	UNITER-base (single model) MS D365 AI https://arxiv.org/abs/1909.11740	75.0	77.2	58.2
58 November 5, 2020	TDN VARMS (Sun Yat-sen University)	75.7	76.4	58.0
59 May 13, 2019	B2T2 (ensemble of 5 models) Google Research http://arxiv.org/abs/1908.05054	74.0	77.1	57.1
60 April 5, 2024	BLIP-VCR Anonymous	75.8	74.3	56.8
61 March 13, 2023	GPT4 4-shot OpenAI (experiment by Rowan)	73.5	75.4	56.2
62 September 11, 2021	YTX Harbin Institute of Technology	73.8	74.8	55.4
63 April 1, 2023	VL-BERT_kd WalkingDog, Shandong University	73.2	75.7	55.4
64 May 13, 2019	B2T2 (single model) Google Research http://arxiv.org/abs/1908.05054	72.6	75.7	55.0
65 August 27,2019	Unicoder-VL (single model) MSRA & PKU https://arxiv.org/abs/1908.06066	73.4	74.5	54.9
66 July 30,2019	ViLBERT (single model) Georgia Tech & Facebook AI Research https://arxiv.org/abs/1908.02265	73.3	74.6	54.8
67 November 29, 2022	VL-BERT_prec WalkingDog, Shandong University	73.4	74.5	54.8
68 March 5, 2022	ALBEF Salesforce Research (experiment by Anonymous) https://arxiv.org/abs/2107.07651	72.9	74.5	54.7
69 May 26, 2021	ViLBERT_GCN Beijing Institute of Technology	73.3	74.4	54.6
70 April 7, 2021	DCGR Harbin Institute of Technology	72.8	74.2	54.3
71 March 11, 2021	CARC Harbin Institute of Technology	72.9	73.8	54.1
72 November 14, 2019	HGL Sun Yat-sen University	72.2	73.4	53.2
73 May 22, 2019	TNet (ensemble of 5) FlyingPig (Sun Yat-sen University)	72.7	72.6	53.0
74 March 31, 2021	CMR flying melon	72.3	72.8	52.8
75 July 5,2019	VisualBERT UCLA & AI2 & PKU https://arxiv.org/abs/1908.03557	71.6	73.2	52.4
76 February 22, 2022	TAB-KD Anonymous	71.6	72.8	52.4
77 March 8,2021	SAC SAC	71.7	72.8	52.2
78 January 5, 2022	GTEHG MIC-Tongji University	71.2	72.4	51.7
79 April 13, 2021	A3 Net A3 Net	71.2	72.0	51.4
80 January 17, 2022	JCL Tinjin University	70.7	71.6	50.9
81 November 29, 2022	TAB-VCR_attribute WalkingDog, Shandong University	70.5	71.6	50.8
82 September 4, 2019	MKDN Peking University	70.7	71.5	50.5
83 October 25, 2019	TAB-VCR UIUC https://arxiv.org/abs/1910.14671	70.4	71.7	50.5
84 June 3, 2020	RobustCL NeurIPS 2020 submission ID1218	70.4	71.5	50.5
85 October 30, 2021	YXY Peking University	70.6	71.2	50.5
86 May 22, 2019	TNet (single model) FlyingPig (Sun Yat-sen University)	70.9	70.6	50.4
87 March 26, 2021	UABE WalkingDog from Qilu University of Technology	70.3	71.3	50.4
88 November 29, 2019	WWR-Net Tianjin University	70.8	70.7	50.2
89 October 13, 2019	transformer-r2c SYSU NEW BANNER (Sun Yat-sen University)	70.7	70.5	50.0
90 May 13, 2019	HGL HCP https://arxiv.org/abs/1910.11475	70.1	70.8	49.8
91 September 4,2019	CAR Sun Yat-sen University	70.5	70.4	49.8
92 December 6,2020	SIA V1 SIA	69.6	71.3	49.8
93 September 3, 2020	SFW1 Jiangsu University	70.2	70.7	49.7
94 January 20, 2022	CCN-KD Anonymous	70.0	70.6	49.7
95 September 3, 2020	SFW2 Jiangsu University	69.6	71.1	49.6
96 May 3, 2021	RKB Hefei University of Technology	69.6	70.7	49.3
97 June 10, 2021	PUV (Pretrain UNITER by VC feature) Anonymous	69.8	70.2	49.3
98 February 3, 2021	vlb (single model) Anonymous	69.8	69.8	48.9
99 June 10, 2021	BLU Anonomous	69.3	70.1	48.9
100 May 16, 2019	MRCNet MILAB (Seoul National University)	68.4	70.5	48.4
101 May 18, 2019	CCD Anonymous	68.5	70.5	48.4
102 May 17, 2019	MUGRN NeurIPS 2019 submission ID 21	68.2	69.4	47.5
103 May 13, 2019	SGRE UTS https://github.com/AmingWu/Multi-modal-Circulant-Fusion/	67.5	69.7	46.9
104 January 30, 2022	R2C-KD Anonymous	66.9	69.4	46.6
105 February 19, 2019	FAIR Facebook AI Research	65.7	70.1	46.3
106 January 20, 2022	CCN-NKD Anonymous	67.2	68.2	46.1
107 November 28, 2019	DAF Beijing Institute of Technology	66.9	68.7	46.0
108 February 25, 2019	CKRE Peking University	66.9	68.2	45.9
109 October 30, 2021	MIE flying melon	66.2	68.8	45.5
110 August 8, 2022	ATGAN MMT, Tianjin University	63.4	71.9	45.5
111 May 14, 2019	emnet DCP	66.6	68.0	45.4
112 Nov 28, 2018	Recognition to Cognition Networks University of Washington https://github.com/rowanz/r2c	65.1	67.3	44.0
113 May 20, 2019	DVD SL	66.3	65.0	43.3
114 March 27, 2019	GS Reasoning UC San Diego	65.7	61.0	41.1
115 May 17, 2019	R2R (text only) Anonymous	58.4	69.1	40.5
116 May 28, 2021	R2CC Arjun Singh	60.7	61.7	37.6
117 Nov 28, 2018	BERT-Base Google AI Language (experiment by Rowan) https://github.com/google-research/bert	53.9	64.5	35.0
118 June 18, 2021	MUR Anonymous	46.5	46.0	25.6
119 September 15, 2021	SG-QA-model Anonymous	70.9	24.9	17.7
120 Nov 28, 2018	MLB Seoul National University (experiment by Rowan) https://github.com/jnhwkim/MulLowBiVQA	46.2	36.8	17.2
121 October 23, 2021	BERT-base-vc-ft LUKA-Axe	58.6	24.9	14.5
122 November 15, 2019	Visual-Lang-base (ensemble) Tiny-Group	17.7	74.1	13.3
	Random Performance	25.0	25.0	6.2