puzhang1993/ScienceQA

ScienceQA: Science Question Answering

VQA Science Problems ScienceQA Chain-of-Thought GPT-3 LLMs

Data and code for NeurIPS 2022 Paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering".

For more details, please refer to the project page with dataset exploration and visualization tools: https://scienceqa.github.io.

🔔 If you have any questions or suggestions, please let us know. You can email Pan Lu (UCLA) at lupantech@gmail.com, comment on Twitter, or post an issue on this repository.

💥 News 💥

🔥 Leaderboard 🔥

🔔 The leaderboard is continuously updated. If you have new results to contribute, please reach out to us.

| # | Method | Sources | Date | #Size | #Param | Avg | NAT | SOC | LAN | TXT | IMG | NO | G1-6 | G7-12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Random Chance | NeurIPS 2022 | 09/2022 | - | - | 39.83 | 40.28 | 46.13 | 29.25 | 47.45 | 40.08 | 33.66 | 39.35 | 40.67 |
| 2 | Human Average | NeurIPS 2022 | 09/2022 | - | - | 88.40 | 90.23 | 84.97 | 87.48 | 89.60 | 87.50 | 88.10 | 91.59 | 82.42 |
| 3 | MCAN | CVPR 2019 | 09/2022 | 95M | 95M | 54.54 | 56.08 | 46.23 | 58.09 | 59.43 | 51.17 | 55.40 | 51.65 | 59.72 |
| 4 | Top-Down | CVPR 2018 | 09/2022 | 70M | 70M | 59.02 | 59.50 | 54.33 | 61.82 | 62.90 | 54.88 | 59.79 | 57.27 | 62.16 |
| 5 | BAN | NeurIPS 2018 | 09/2022 | 112M | 112M | 59.37 | 60.88 | 46.57 | 66.64 | 62.61 | 52.60 | 65.51 | 56.83 | 63.94 |
| 6 | DFAF | CVPR 2019 | 09/2022 | 74M | 74M | 60.72 | 64.03 | 48.82 | 63.55 | 65.88 | 54.49 | 64.11 | 57.12 | 67.17 |
| 7 | ViLT | ICML 2021 | 09/2022 | 113M | 113M | 61.14 | 60.48 | 63.89 | 60.27 | 63.20 | 61.38 | 57.00 | 60.72 | 61.90 |
| 8 | Patch-TRM | NeurIPS 2021 | 09/2022 | 90M | 90M | 61.42 | 65.19 | 46.79 | 65.55 | 66.96 | 55.28 | 64.95 | 58.04 | 67.50 |
| 9 | VisualBERT | ACL 2020 | 09/2022 | 111M | 111M | 61.87 | 59.33 | 69.18 | 61.18 | 62.71 | 62.17 | 58.54 | 62.96 | 59.92 |
| 10 | UnifiedQA | EMNLP 2020 | 09/2022 | 223M | 223M | 70.12 | 68.16 | 69.18 | 74.91 | 63.78 | 61.38 | 77.84 | 72.98 | 65.00 |
| 11 | UnifiedQA (CoT) | NeurIPS 2022 | 09/2022 | 223M | 223M | 74.11 | 71.00 | 76.04 | 78.91 | 66.42 | 66.53 | 81.81 | 77.06 | 68.82 |
| 12 | GPT-3 (2-shot) | NeurIPS 2020 | 09/2022 | 175B | 0M | 73.97 | 74.64 | 69.74 | 76.00 | 74.44 | 67.28 | 77.42 | 76.80 | 68.89 |
| 13 | GPT-3 (zero-shot) | NeurIPS 2020 | 09/2022 | 175B | 0M | 74.04 | 75.04 | 66.59 | 78.00 | 74.24 | 65.74 | 79.58 | 76.36 | 69.87 |
| 14 | GPT-3 (CoT) (AE) | NeurIPS 2022 | 09/2022 | 175B | 0M | 74.61 | 76.60 | 65.92 | 77.55 | 75.51 | 66.09 | 79.58 | 78.49 | 67.63 |
| 15 | GPT-3 (CoT) (ALE) | NeurIPS 2022 | 09/2022 | 175B | 0M | 75.17 | 75.44 | 70.87 | 78.09 | 74.68 | 67.43 | 79.93 | 78.23 | 69.68 |
| 16 | Multimodal-CoT (T) | arXiv 2302.00923 | 02/2023 | 223M | 223M | 70.53 | 71.09 | 70.75 | 69.18 | 71.16 | 65.84 | 71.57 | 71.00 | 69.68 |
| 17 | Multimodal-CoT | arXiv 2302.00923 | 02/2023 | 223M | 223M | 84.91 | 87.52 | 77.17 | 85.82 | 87.88 | 82.90 | 86.83 | 84.65 | 85.37 |
| 18 | LLaMA-Adapter (T) | arXiv 2303.16199 | 03/2023 | 6B | 1.2M | 78.31 | 79.00 | 73.79 | 80.55 | 78.30 | 70.35 | 83.14 | 79.77 | 75.68 |
| 19 | LLaMA-Adapter | arXiv 2303.16199 | 03/2023 | 6B | 1.2M | 85.19 | 84.37 | 88.30 | 84.36 | 83.72 | 80.32 | 86.90 | 85.83 | 84.05 |
| 20 | GPT-4 (CoT) (ALE) | - | 03/2023 | ? | 0M | ~82.63 | - | - | - | - | - | - | - | - |

Some notations in the table

  • #Size: the model size (number of model parameters)
  • #Param: number of tuned model parameters
  • GPT-3: the text-davinci-002 engine
  • GPT-4: the gpt-4 engine
  • AE: the output is the answer followed by the explanation
  • ALE: the output is the answer followed by the lecture and explanation
  • T: the input only involves textual features

🗺️ About ScienceQA

We present Science Question Answering (ScienceQA), a new benchmark that consists of 21,208 multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. The lecture and explanation provide general external knowledge and specific reasons, respectively, for arriving at the correct answer.

ScienceQA, in contrast to previous datasets, has richer domain diversity from three subjects: natural science, language science, and social science. ScienceQA features 26 topics, 127 categories, and 379 skills that cover a wide range of domains.

We further design language models that learn to generate lectures and explanations as a chain of thought (CoT), mimicking the multi-hop reasoning process used to answer ScienceQA questions. ScienceQA demonstrates the utility of CoT in language models: CoT improves question answering accuracy by 1.20 percentage points in few-shot GPT-3 and by 3.99 percentage points in fine-tuned UnifiedQA.
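The two CoT gains quoted above can be reproduced from the average accuracies in the leaderboard. The snippet below is only a worked check of that arithmetic; the dictionary and the function name `cot_gain` are illustrative and not part of this repository.

```python
# Average test accuracies (%) copied from the leaderboard in this README.
ACCURACY = {
    "GPT-3 (2-shot)": 73.97,
    "GPT-3 (CoT) (ALE)": 75.17,
    "UnifiedQA": 70.12,
    "UnifiedQA (CoT)": 74.11,
}


def cot_gain(baseline: str, with_cot: str) -> float:
    """Absolute accuracy gain, in percentage points, from adding CoT."""
    return round(ACCURACY[with_cot] - ACCURACY[baseline], 2)


print(cot_gain("GPT-3 (2-shot)", "GPT-3 (CoT) (ALE)"))  # 1.2
print(cot_gain("UnifiedQA", "UnifiedQA (CoT)"))          # 3.99
```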

For more details, see the project page at https://scienceqa.github.io and our paper.

👻 Download the Dataset

The text part of the ScienceQA dataset is provided in data/scienceqa/problems.json. You can download the image data of ScienceQA by running:

. tools/download.sh

Alternatively, you can download ScienceQA from Google Drive and unzip the images under root_dir/data.
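Once the data is in place, the text part can be inspected with a few lines of Python. This is a minimal sketch under two assumptions that this README does not state: that problems.json is a JSON object keyed by problem ID, and that each problem carries a `split` field. The helper names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_problems(path):
    """Load the problems file (assumed: a JSON object keyed by problem ID)."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def count_by_field(problems, field):
    """Count problems per value of `field`; a missing field counts as 'unknown'."""
    return Counter(p.get(field, "unknown") for p in problems.values())


if __name__ == "__main__":
    path = Path("data/scienceqa/problems.json")
    if path.exists():
        problems = load_problems(path)
        print(len(problems), "problems")
        # "split" is an assumed field name; adjust it to the actual schema.
        print(count_by_field(problems, "split"))
```

If the real schema differs, only the field name passed to `count_by_field` needs to change.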

💥 The ScienceQA dataset is now available at HuggingFace Datasets!

😈 Explore ScienceQA

For more details, you can explore the dataset and check the visualizations here: Explore and Visualizations.

🐙 Requirements

python==3.8.10
huggingface-hub
nltk==3.5
numpy==1.23.2
openai==0.23.0
pandas==1.4.3
rouge==1.0.1
sentence-transformers==2.2.2
torch==1.12.1+cu113
transformers==4.21.1

Install all required Python dependencies:

pip install -r requirements.txt

🤖 Run the GPT-3 (CoT) Model for ScienceQA

Generate the image captions

We use an image captioning model to generate text content for the images in ScienceQA. The pre-generated image captions are provided in data/captions.json.

Optionally, you can regenerate the image captions with your own arguments using the following command, which saves the caption data in data/captions_user.json.

cd tools
python generate_caption.py

Run the model

We build a few-shot GPT-3 model via chain-of-thought (CoT) prompting to generate the answer followed by the lecture and the explanation (QCM→ALE). The prompt instruction encoding for the test example in GPT-3 (CoT) is shown below:

[Figure: prompt instruction encoding for a test example in GPT-3 (CoT)]
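As a rough illustration of the QCM→ALE encoding, the sketch below serializes solved in-context examples followed by an unsolved test example. The template strings ("Question:", "Context:", "Options:", "The answer is ... BECAUSE:"), the dictionary keys, and all function names are assumptions made for this example; the actual prompt code lives in the models directory and may differ.

```python
LETTERS = "ABCDEFGHIJ"


def format_options(choices):
    """Render choices as '(A) first (B) second ...'."""
    return " ".join(f"({LETTERS[i]}) {c}" for i, c in enumerate(choices))


def build_input(question, context, choices):
    """QCM input: question, context, and multiple options."""
    return (
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Options: {format_options(choices)}\n"
    )


def build_output(answer_index, lecture, explanation):
    """ALE output: the answer first, then the lecture and the explanation."""
    return f"Answer: The answer is {LETTERS[answer_index]}. BECAUSE: {lecture} {explanation}"


def build_prompt(shots, test_example):
    """Concatenate solved in-context examples, then the unsolved test example."""
    blocks = [
        build_input(s["question"], s["context"], s["choices"])
        + build_output(s["answer"], s["lecture"], s["explanation"])
        for s in shots
    ]
    # The test example ends with a bare "Answer:" for the model to complete.
    blocks.append(
        build_input(test_example["question"], test_example["context"], test_example["choices"])
        + "Answer:"
    )
    return "\n\n".join(blocks)
```

With two entries in `shots`, this corresponds to the two-shot setting used by the command in this section.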

In our final model, we develop GPT-3 (CoT) prompted with two in-context examples and evaluate it on the ScienceQA test split:

cd models
python run_gpt3.py \
--label exp1 \
--test_split test \
--test_number -1 \
--shot_number 2 \
--prompt_format QCM-ALE \
--seed 3

Evaluate the results

Our final GPT-3 (CoT) model achieves an accuracy of 75.17% on the test split, which was state of the art at the time of publication. One prediction example is visualized below: GPT-3 (CoT) predicts the correct answer and generates a reasonable lecture and explanation that mimic the human thought process.

[Figure: example GPT-3 (CoT) prediction with the generated lecture and explanation]

We can get the accuracy metrics on average and across different question classes by running:

cd tools
python evaluate_acc.py
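For readers who want to see what "accuracy on average and across question classes" amounts to, here is a minimal sketch. It is not the logic of evaluate_acc.py; the record keys (`prediction`, `answer`, `subject`) and the function names are hypothetical.

```python
from collections import defaultdict


def average_accuracy(records):
    """Overall accuracy in percent over all records."""
    if not records:
        return 0.0
    hits = sum(r["prediction"] == r["answer"] for r in records)
    return round(100.0 * hits / len(records), 2)


def accuracy_by_group(records, key):
    """Return {group: accuracy in percent} with records grouped by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["prediction"] == r["answer"])
    return {group: round(100.0 * correct[group] / total[group], 2) for group in total}


# Toy records; the subject labels mirror the NAT / SOC / LAN columns.
records = [
    {"subject": "NAT", "prediction": 0, "answer": 0},
    {"subject": "NAT", "prediction": 1, "answer": 2},
    {"subject": "LAN", "prediction": 1, "answer": 1},
]
print(average_accuracy(records))
print(accuracy_by_group(records, "subject"))
```

The same grouping function would cover the context columns (TXT, IMG, NO) and the grade columns (G1-6, G7-12) given a suitable key on each record.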

We can run the following command to evaluate the generated lectures and explanations automatically:

cd tools
python evaluate_explaination.py
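This README does not say which metrics the script reports. Since the requirements list includes rouge, a ROUGE-style overlap score is a plausible candidate, but that is an inference. The sketch below is a dependency-free ROUGE-L F1 on whitespace tokens, written only to show the idea; it is not the script's implementation, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between a generated explanation and a reference explanation."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Published ROUGE-L variants often weight recall more heavily than precision; the balanced F1 here is a simplification.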

Try different prompt templates

You can try other prompt templates. For example, if you want the model to take the question, the context, and the multiple options as input, and output the answer after the lecture and explanation (QCM→LEA), you can run the following script:

cd models
python run_gpt3.py \
--label exp1 \
--test_split test \
--test_number -1 \
--shot_number 2 \
--prompt_format QCM-LEA \
--seed 3
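Whichever template is used, the predicted option has to be recovered from the generated text: it appears at the start in QCM→ALE and at the end in QCM→LEA. A minimal sketch of that step follows, assuming the model writes the answer as "The answer is X." with X between A and E; the pattern and the function name are assumptions, not the repository's code.

```python
import re

# Assumed phrasing of the answer sentence in the generated text.
ANSWER_PATTERN = re.compile(r"The answer is \(?([A-E])\)?\.")


def extract_answer(generated):
    """Return the first predicted option letter, or None if no match is found."""
    match = ANSWER_PATTERN.search(generated)
    return match.group(1) if match else None


# Answer first (ALE) and answer last (LEA) are handled by the same search.
print(extract_answer("The answer is B. BECAUSE: A lecture. An explanation."))
print(extract_answer("A lecture. An explanation. The answer is C."))
```

Returning None for unparseable outputs lets the caller decide how to score them, for example as incorrect or as a fallback guess.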

⚠️ Licenses

MIT license

This work is licensed under an MIT License.

License: CC BY-NC-SA 4.0

The ScienceQA dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

✅ Cite

If the paper, code, or dataset inspires your work, please cite us:

@inproceedings{lu2022learn,
    title={Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
    author={Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin},
    booktitle={The 36th Conference on Neural Information Processing Systems (NeurIPS)},
    year={2022}
}
