Cosmos QA : Machine Reading Comprehension with Contextual Commonsense Reasoning (EMNLP'2019)

Paper » Code » Dataset »

What is Cosmos QA?


Cosmos QA is a large-scale dataset of 35.6K problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. It focuses on reading between the lines over a diverse collection of people's everyday narratives, asking questions about the likely causes or effects of events, which require reasoning beyond the exact text spans in the context.
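As a concrete sketch, each Cosmos QA instance pairs a context paragraph and a question with four candidate answers and a gold label index. The field names below (context, question, answer0..answer3, label) follow the commonly distributed release of the dataset but should be treated as assumptions, and the answer texts here are illustrative placeholders, not actual dataset choices:

```python
import random

# Hypothetical Cosmos QA instance; answer texts are placeholders.
example = {
    "context": "It's a very humbling experience when you need someone "
               "to dress you every morning ...",
    "question": "What's a possible reason the writer needed someone to "
                "dress him every morning?",
    "answer0": "The writer has a temporary injury.",   # placeholder
    "answer1": "The writer is an infant.",             # placeholder
    "answer2": "The writer is on vacation.",           # placeholder
    "answer3": "None of the above choices.",           # placeholder
    "label": 0,  # index of the correct answer (0-3)
}

# Collect the four choices and look up the gold answer by label.
choices = [example[f"answer{i}"] for i in range(4)]
gold_answer = choices[example["label"]]

def accuracy(predictions, gold):
    """Fraction of multiple-choice questions answered correctly."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# A 4-way random-guess baseline lands near 25% accuracy, which is why
# the weakest leaderboard entries sit close to that mark.
random.seed(0)
preds = [random.randrange(4) for _ in range(1000)]
gold = [0] * 1000
print(f"random baseline accuracy: {accuracy(preds, gold):.1%}")
```

The evaluation metric throughout (including the leaderboard below) is plain accuracy over the four-way choices.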

Reading comprehension requires not only understanding what is stated explicitly in text, but also reading between the lines, i.e., understanding what is not stated yet obviously true (Norvig, 1987).

Cosmos QA Examples


Example 1

Paragraph: It's a very humbling experience when you need someone to dress you every morning, tie your shoes, and put your hair up. Every menial task takes an unprecedented amount of effort. It made me appreciate Dan even more. But anyway I shan't dwell on this (I'm not dying after all) and not let it detact from my lovely 5 days with my friends visiting from Jersey.

Question: What's a possible reason the writer needed someone to dress him every morning?

Options: (the answer choices are presented interactively on the dataset website)


Example 2

Paragraph: A woman had topped herself by jumping off the roof of the hospital she had just recently been admitted to. She was there because the first or perhaps latest suicide attempt was unsuccessful. She put her clothes on, folded the hospital gown and made the bed. She walked through the unit unimpeded and took the elevator to the top floor.

Question: What would have happened to the woman if the staff at the hospital were doing their job properly?

Options: (the answer choices are presented interactively on the dataset website)


Cosmos QA Leaderboard


Submitting to the Leaderboard:

To benchmark approaches to Cosmos QA, we have a leaderboard for the test set. If you have a model for solving Cosmos QA and would like to make a submission, you should follow the rules and policies, and create your submission here.


Rank | Submitted | Model | Reference | Test accuracy (%)
-- | -- | Human Performance | Huang et al., 2019 | 94.0
1 | August 30 | BERT-FT Multiway | Huang et al., 2019 | 68.4
2 | August 30 | DMCN | Shanghai Jiao Tong University (Zhang et al., 2019; https://arxiv.org/pdf/1901.09381.pdf) | 67.6
3 | August 30 | BERT-FT | Google AI Language (Devlin et al., 2018; https://arxiv.org/pdf/1810.04805.pdf) | 67.1
4 | August 30 | GPT-FT | OpenAI (Radford et al., 2018; https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) | 54.4
5 | August 30 | Commonsense-RC | Yuanfudao Research (Wang et al., 2018; https://arxiv.org/pdf/1803.00191.pdf) | 48.2
6 | August 30 | Gated-Attention Reader | Carnegie Mellon University (Dhingra et al., 2017; https://arxiv.org/pdf/1606.01549.pdf) | 46.2
7 | August 30 | Co-Matching | Singapore Management University & IBM Research (Wang et al., 2018; https://arxiv.org/pdf/1806.04068.pdf) | 44.7
8 | August 30 | Stanford Attentive Reader | Stanford University (Chen et al., 2016; https://arxiv.org/pdf/1606.02858v2.pdf) | 44.4
9 | August 30 | Sliding Window | Richardson et al., 2013 (https://www.aclweb.org/anthology/D13-1020) | 24.9

Authors

Contact

Questions about the dataset, or want to get in touch? Please contact Lifu Huang on Twitter, open a pull request on GitHub, or email me via Gmail.