To replicate the Standard-LM (Direct) and the Chain-of-Thought (CoT) baselines, please run the following commands:
cd ./baselines
python gpt3_baseline.py \
--api_key "Your OpenAI API Key" \
--model_name "Model Name [text-davinci-003 | gpt-4]" \
--dataset_name "Dataset Name [ProntoQA | ProofWriter | FOLIO | LogicalDeduction | AR-LSAT]" \
--split dev \
--mode "Baseline [Direct | CoT]" \
--max_new_tokens "16 for Direct; 1024 for CoT"
The results will be saved in ./baselines/results. To evaluate the results, please run the following commands:
python evaluate.py \
--dataset_name "Dataset Name [ProntoQA | ProofWriter | FOLIO | LogicalDeduction | AR-LSAT]" \
--model_name "Model Name [text-davinci-003 | gpt-4]" \
--split dev \
--mode "Baseline [Direct | CoT]"
To generate logic programs for logical reasoning problems in each dataset, at the root directory, run the following commands:
python models/logic_program.py \
--api_key "Your OpenAI API Key" \
--dataset_name "Dataset Name [ProntoQA | ProofWriter | FOLIO | LogicalDeduction | AR-LSAT]" \
--split dev \
--model_name "Model Name [text-davinci-003 | gpt-4]" \
--max_new_tokens 1024
The generated logic programs will be saved in outputs/logic_programs. You can also reuse the logic programs we generated in ./outputs/logic_programs.
After generating logic programs, we can perform inference with symbolic solvers. At the root directory, run the following commands:
DATASET="Dataset Name [ProntoQA | ProofWriter | FOLIO | LogicalDeduction | AR-LSAT]"
SPLIT="Dataset Split [dev | test]"
MODEL="The logic programs are generated by which model? [text-davinci-003 | gpt-4]"
BACKUP="The random backup answer (random) or CoT-Logic collaboration mode (LLM)"
python models/logic_inference.py \
--model_name ${MODEL} \
--dataset_name ${DATASET} \
--split ${SPLIT} \
--backup_strategy ${BACKUP} \
--backup_LLM_result_path ./baselines/results/CoT_${DATASET}_${SPLIT}_${MODEL}.json
The logic reasoning results will be saved in outputs/logic_inferences.
Backup Strategies:
random: If the generated logic program cannot be executed by the symbolic solver, we will use a random guess as the prediction.
LLM: If the generated logic program cannot be executed by the symbolic solver, we will fall back to CoT to generate the prediction. To run this mode, you need to have the corresponding baseline LLM results stored in ./baselines/results. To make inference more efficient, the model simply loads the baseline LLM results and uses them as the prediction when the symbolic solver fails.
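The two backup strategies can be pictured with a small sketch. It is an illustration under stated assumptions, not the repository's implementation: the function name `resolve_prediction`, the convention that a failed solver run yields `None`, and the `{example id: answer}` shape of the stored CoT results are all hypothetical.

```python
import random


def resolve_prediction(example_id, solver_answer, options, strategy,
                       llm_results=None, rng=random):
    """Return the solver's answer, or a backup prediction if the solver failed.

    Assumptions (not taken from the repository): a failed execution is
    signalled by solver_answer being None, and llm_results maps example
    ids to answers already produced by the CoT baseline.
    """
    if solver_answer is not None:
        return solver_answer  # the logic program executed successfully
    if strategy == "random":
        return rng.choice(options)  # random guess among the answer options
    if strategy == "LLM":
        if llm_results is None or example_id not in llm_results:
            raise KeyError(f"no baseline CoT result for {example_id!r}")
        return llm_results[example_id]  # reuse the stored CoT prediction
    raise ValueError(f"unknown backup strategy: {strategy!r}")
```

In this sketch the `LLM` branch never calls a model; it only looks up a stored answer, which mirrors the note that the baseline results are loaded rather than regenerated.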
To evaluate the logic reasoning results, please run the following commands:
python models/evaluation.py \
--dataset_name "Dataset Name [ProntoQA | ProofWriter | FOLIO | LogicalDeduction]" \
--model_name "The logic programs are generated by which model? [text-davinci-003 | gpt-4]" \
--split dev \
--backup "The basic mode (random) or CoT-Logic collaboration mode (LLM)"
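If you run inference and evaluation for several datasets, it may help to build both commands from the same four values so they stay in sync. The helper below is a minimal sketch: `build_commands` is a hypothetical name, the flags are copied from the commands above, and it assumes you run from the root directory.

```python
import shlex


def build_commands(dataset, split, model, backup):
    """Build the inference and evaluation commands as argument lists."""
    inference = [
        "python", "models/logic_inference.py",
        "--model_name", model,
        "--dataset_name", dataset,
        "--split", split,
        "--backup_strategy", backup,
        "--backup_LLM_result_path",
        f"./baselines/results/CoT_{dataset}_{split}_{model}.json",
    ]
    evaluation = [
        "python", "models/evaluation.py",
        "--dataset_name", dataset,
        "--model_name", model,
        "--split", split,
        "--backup", backup,
    ]
    return [inference, evaluation]


if __name__ == "__main__":
    # Print the commands for inspection; to execute them instead,
    # pass each list to subprocess.run(cmd, check=True).
    for cmd in build_commands("ProntoQA", "dev", "gpt-4", "random"):
        print(shlex.join(cmd))
```

Keeping the commands as lists avoids shell quoting issues when they are passed to `subprocess.run`.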